The Report

Full Report (PDF) | R Analysis (HTML) — submitted for STA305 Experimental Design, Winter 2026.


Why This Experiment

Everyone who has used ChatGPT, Claude, or Gemini for structured tasks has experienced the same frustration: you ask for exactly five bullet points and get seven. You say “do not use the word X” and it appears in the second sentence. You ask it to end with a specific phrase and it invents its own sign-off.

These aren’t knowledge failures. The model knows what supervised learning is. The model knows what a bullet point is. The failure is in compliance — the ability to hold multiple formatting constraints in working memory while generating text, and to verify each one is satisfied before committing to the output.

This raised a question that felt natural for an experimental design course: can we quantify what actually helps an LLM follow instructions? Not through vibes or anecdotal prompt engineering tips, but through a controlled factorial experiment with objective scoring.


The Setup

Three factors, two levels each, three replicates — a $2^3$ full factorial design with 24 total runs.

| Factor | What It Tests | Low (−1) | High (+1) |
|---|---|---|---|
| A: Prompt Strategy | How you present the rules | Flat numbered list | Step-by-step CoT preamble |
| B: Context | Whether you show an example | No example | One compliant example included |
| C: Model | Reasoning capability | Gemini 2.5 Flash-Lite (no thinking) | Gemini 2.5 Pro (with thinking) |
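The design is small enough to enumerate directly. As a sanity check, a few lines of Python (illustrative only, independent of the run script below) reproduce the 24-run layout:

```python
from itertools import product

# A 2^3 full factorial in coded units, replicated 3 times: 8 cells x 3 = 24 runs.
factors = {"A": (-1, 1), "B": (-1, 1), "C": (-1, 1)}
cells = list(product(*factors.values()))  # the 8 treatment combinations
runs = [(a, b, c, rep) for (a, b, c) in cells for rep in (1, 2, 3)]

print(len(cells), len(runs))  # -> 8 24
```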

Every run asked the same question — “Explain the difference between supervised and unsupervised learning” — while obeying six formatting constraints: a banned word, a sentence length cap, exactly five bullet points, a required opening word, no question marks, and a specific sign-off phrase. Each constraint was scored pass/fail, giving a response variable $y \in \{0, 1, \dots, 6\}$.

All six constraints are objectively verifiable by string search or counting. No subjective judgment. A Python script handled both the API calls and the scoring.


What Happened

The Pro model — the one with thinking tokens — scored 6/6 on all twelve runs. Every single one. Perfect compliance, every time.

Flash-Lite told a different story. It nailed five of six constraints consistently (word ban, sentence cap, opening word, no questions, sign-off), but stumbled on one: counting to exactly five bullet points. Sometimes it gave four. Sometimes six. The only constraint that requires holding a running count while generating — and the only one that broke.

Here’s the ANOVA:

| Source | F | p |
|---|---|---|
| C (Model) | 32.0 | < 0.001 |
| B (Context) | 8.0 | 0.012 |
| B × C | 8.0 | 0.012 |
| A (Prompt) | 0.0 | 1.000 |
| Everything else | 0.0 | 1.000 |

Factor A — whether you phrase the prompt as a flat list or a careful step-by-step chain-of-thought — had literally zero effect. The estimate is 0.0000. Not small. Zero.
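That zero can be checked directly with the standard factorial contrast formula. The sketch below uses cell means reconstructed from the numbers reported in this writeup (not the raw data), so treat it as illustrative:

```python
from itertools import product

# Cell means reconstructed from the reported results (illustrative):
# Pro (C=+1) scored 6/6 everywhere; Flash-Lite (C=-1) averaged 5.0 without an
# example (B=-1) and 5.67 (17/3) with one (B=+1); prompt strategy (A) changed nothing.
def cell_mean(A, B, C):
    if C == 1:
        return 6.0
    return 5.0 if B == -1 else 17 / 3

cells = list(product([-1, 1], repeat=3))  # (A, B, C) combinations

def effect(contrast):
    """Factorial effect: mean response where contrast=+1 minus where contrast=-1."""
    hi = [cell_mean(*c) for c in cells if contrast(c) == 1]
    lo = [cell_mean(*c) for c in cells if contrast(c) == -1]
    return sum(hi) / len(hi) - sum(lo) / len(lo)

print(round(effect(lambda c: c[0]), 3))         # A main effect   -> 0.0
print(round(effect(lambda c: c[2]), 3))         # C main effect   -> 0.667
print(round(effect(lambda c: c[1] * c[2]), 3))  # B x C interaction -> -0.333
```

Because the A=+1 and A=−1 halves of the design contain identical cell means, the contrast cancels exactly.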


The Interaction That Tells the Story

The significant B × C interaction is where it gets interesting.

Providing a compliant example helped Flash-Lite go from a mean of 5.0 to 5.67 — specifically by improving its bullet-counting accuracy from 0% to 67%. But for Pro? The example did nothing. It was already perfect.

This is a textbook ceiling × treatment interaction. The weaker model benefits from external scaffolding; the stronger model has internalized that scaffolding through its reasoning process. The example is redundant when the model can think.


What’s Really Going On

The thinking model generates between 860 and 3,312 reasoning tokens before producing its response. It’s not just more capable — it operates through a fundamentally different mechanism. It can:

  1. Parse each constraint individually
  2. Draft a candidate response
  3. Check the candidate against each constraint
  4. Revise before committing

Flash-Lite doesn’t do any of this. It produces the response in a single forward pass. The constraints are encoded in the prompt, and the model either picks up on them during generation or it doesn’t. There’s no internal verification loop.
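That missing loop is easy to sketch externally. The helper below is hypothetical — the experiment never did this — but it approximates what internal thinking tokens buy you: draft, check each constraint, revise on failure.

```python
def generate_with_verification(model, prompt, checks, max_revisions=3):
    """Hypothetical external draft-check-revise loop (a sketch, not part of
    the experiment). `model` is any callable mapping a prompt to text;
    `checks` maps constraint names to predicates over the draft."""
    draft = model(prompt)
    for _ in range(max_revisions):
        failed = [name for name, check in checks.items() if not check(draft)]
        if not failed:
            return draft
        draft = model(prompt + f"\nYour previous draft violated: {', '.join(failed)}. Revise it.")
    return draft
```

A single-pass model is this function with `max_revisions=0`: whatever comes out of the first call is the answer.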

This maps cleanly onto the constraint that actually broke. Five of the six constraints are local — you can satisfy them token-by-token (don’t generate “important”, don’t generate “?”, start with “Interestingly”). But “exactly five bullet points” is global — you need to maintain a count across the entire response and know when to stop. That’s exactly the kind of constraint that benefits from a verify-and-revise loop, and exactly the kind that a single-pass model fumbles.

The zero effect of prompt strategy (Factor A) is equally telling. CoT prompting — “think about each rule carefully before writing” — is widely recommended in prompt engineering guides. But it made no difference here. Why? Because external CoT in the prompt is a suggestion to the model. Internal thinking tokens are an architectural capability. You can tell a model to think carefully, but if it doesn’t have the mechanism to actually do so, the instruction is just more text in the context window.


The Ceiling Effect Problem (and Why It’s Fine)

From a pure statistics standpoint, this experiment has issues. The Pro group has zero variance. Six of eight treatment combinations have zero within-group variance. The residuals are wildly non-normal — the response only takes values 5 and 6. The homogeneity of variance assumption is violated in the most extreme way possible.

But this isn’t a flaw in the experiment. It’s the finding. The task was calibrated to be challenging enough to expose differences, and it succeeded — the difference is just so large that it manifests as a ceiling effect rather than a smooth gradient. A harder task (say, 12 constraints, or constraints requiring semantic judgment) would produce more variance in the Pro group and enable subtler analyses. That’s a direction for future work, not a reason to discount the current results.

The significant effects (C and B×C) are robust to distributional assumptions because the effect sizes are enormous relative to any plausible error structure. You don’t need normality to conclude that 12/12 perfect vs. 4/12 perfect is a real difference.
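A distribution-free check makes the point concrete. Treating each run as perfect/imperfect, a one-sided Fisher's exact test on 12/12 vs. 4/12 can be computed by hand from the hypergeometric tail (illustrative recomputation, not from the report's R analysis):

```python
from math import comb

# 2x2 table of perfect runs: Pro 12/12 vs Flash-Lite 4/12.
pro_perfect, pro_n = 12, 12
flash_perfect, flash_n = 4, 12
total_perfect = pro_perfect + flash_perfect  # 16 perfect runs overall
total_n = pro_n + flash_n                    # 24 runs overall

# One-sided p: probability that at least 12 of the 16 perfect runs land in the
# Pro group if perfection were independent of model.
p = sum(
    comb(total_perfect, k) * comb(total_n - total_perfect, pro_n - k)
    for k in range(pro_perfect, min(total_perfect, pro_n) + 1)
) / comb(total_n, pro_n)
print(f"{p:.2e}")  # -> 6.73e-04
```

No normality, no equal variances — the split is simply too lopsided to be chance.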


Takeaways

If you need reliable instruction-following, use a thinking model. No amount of prompt engineering — not CoT preambles, not few-shot examples — substitutes for the model’s ability to self-verify. The architectural capability dominates the prompting technique by an order of magnitude.

Few-shot examples help, but only for weaker models. If you’re stuck with a non-reasoning model (for cost or latency reasons), providing a compliant example is the single most effective intervention. It gives the model a concrete template to pattern-match against, partially compensating for the lack of internal verification.

CoT prompting may be overrated for constraint satisfaction. At least for explicit, enumerable constraints, telling the model to “think carefully” adds nothing measurable. The constraints are already clear; the bottleneck is the model’s ability to track them during generation, not its ability to understand them.

This was a fun experiment to run — 24 API calls, fully automated scoring, and results that were almost comically clean. Sometimes the most interesting finding is the one that’s too clean for comfortable statistics.


Implementation

The entire experiment — API calls, scoring, and data collection — was automated with a single Python script calling the Google Vertex AI REST API. Below is the full implementation (API key redacted).

run_experiment.py
#!/usr/bin/env python3
"""
STA305 Assignment 2 — Automated Experiment Execution
Design: 2^3 replicated factorial (24 runs)
Models: gemini-2.5-pro (C=+1, thinking) vs gemini-2.5-flash-lite (C=-1, no thinking)
"""

import json, re, time, csv, os, urllib.request, urllib.error

API_KEY = os.environ.get("GEMINI_API_KEY", "YOUR_API_KEY_HERE")
BASE_URL = "https://aiplatform.googleapis.com/v1/publishers/google/models"

MODEL_MAP = {
    -1: "gemini-2.5-flash-lite",  # no thinking
     1: "gemini-2.5-pro",         # with thinking
}

# ---------- Prompt Templates ----------

CONTENT_QUESTION = "Explain the difference between supervised learning and unsupervised learning in machine learning."

EXAMPLE_BLOCK = '''
Here is an example of a response to a DIFFERENT question that correctly follows all 6 rules:

"""
Interestingly, sorting algorithms vary greatly in efficiency and use cases.
- Bubble sort repeatedly swaps adjacent elements until the list is ordered.
- Merge sort divides the list, sorts halves, then merges them back.
- Quick sort picks a pivot and partitions elements around it efficiently.
- Heap sort uses a binary heap structure to extract elements in order.
- Insertion sort builds the sorted list one element at a time gradually.
— End of response.
"""'''

RULES_DIRECT = """You must follow ALL of these rules in your response:
1. Do not use the word "important" anywhere in your response.
2. Every sentence must contain no more than 20 words.
3. Present your answer as exactly 5 bullet points.
4. Begin your response with the word "Interestingly".
5. Do not ask any questions or use question marks in your response.
6. End your response with exactly the phrase "— End of response.\""""

COT_PREAMBLE = """Before writing your response, carefully review each of the following rules one by one. Think about how to satisfy every rule simultaneously, then write your response.

Rule 1: Do not use the word "important" anywhere in your response.
Rule 2: Every sentence must contain no more than 20 words.
Rule 3: Present your answer as exactly 5 bullet points.
Rule 4: Begin your response with the word "Interestingly".
Rule 5: Do not ask any questions or use question marks in your response.
Rule 6: End your response with exactly the phrase "— End of response.\""""

COT_CLOSING = "Now, write your response following all six rules above."

RULES_COT = COT_PREAMBLE + "\n\n" + COT_CLOSING


def build_prompt(A, B):
    """Build the prompt for factor levels A (strategy) and B (example)."""
    parts = [CONTENT_QUESTION, ""]
    if A == -1:  # Direct: flat numbered rule list
        parts.append(RULES_DIRECT)
        if B == 1:
            parts.append(EXAMPLE_BLOCK)
    else:  # CoT: step-by-step preamble
        if B == 1:
            # Same preamble; the example sits between the rules and the closing line
            parts.append(COT_PREAMBLE)
            parts.append(EXAMPLE_BLOCK)
            parts.append("\n" + COT_CLOSING)
        else:
            parts.append(RULES_COT)
    return "\n".join(parts)


def call_gemini(model_id, prompt):
    """Call Gemini API and return (response_text, thinking_text, usage_metadata)."""
    url = f"{BASE_URL}/{model_id}:generateContent?key={API_KEY}"
    payload = json.dumps({
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        "generationConfig": {"temperature": 1.0}
    }).encode("utf-8")

    req = urllib.request.Request(url, data=payload, method="POST",
                                 headers={"Content-Type": "application/json"})
    resp = urllib.request.urlopen(req, timeout=120)
    data = json.loads(resp.read().decode("utf-8"))

    response_text = ""
    thinking_text = ""
    if "candidates" in data and data["candidates"]:
        parts = data["candidates"][0].get("content", {}).get("parts", [])
        for part in parts:
            if part.get("thought", False):
                thinking_text += part.get("text", "")
            else:
                response_text += part.get("text", "")

    usage = data.get("usageMetadata", {})
    return response_text.strip(), thinking_text.strip(), usage


# ---------- Scoring Functions ----------

def score_c1(text):
    """Word ban: 'important' must not appear."""
    return 1 if "important" not in text.lower() else 0

def score_c2(text):
    """Sentence cap: every sentence <= 20 words."""
    # Split on '?' too, so this score stays independent of whether c5 passed.
    sentences = re.split(r'[.!?;]\s*', text)
    sentences = [s.strip() for s in sentences if s.strip()]
    for s in sentences:
        if len(s.split()) > 20:
            return 0
    return 1

def score_c3(text):
    """Bullet format: exactly 5 bullet points."""
    bullet_lines = re.findall(r'^\s*[-•*]\s+', text, re.MULTILINE)
    return 1 if len(bullet_lines) == 5 else 0

def score_c4(text):
    """Opening word: first word is 'Interestingly'."""
    stripped = text.lstrip()
    first_word = re.split(r'[\s,.:;]', stripped)[0] if stripped else ""
    return 1 if first_word == "Interestingly" else 0

def score_c5(text):
    """No questions: zero question marks."""
    return 1 if "?" not in text else 0

def score_c6(text):
    """Sign-off: ends with '— End of response.' (em dash U+2014 or double hyphen)."""
    stripped = text.rstrip()
    return 1 if (stripped.endswith("\u2014 End of response.") or
                 stripped.endswith("-- End of response.")) else 0


def score_response(text):
    """Score all 6 constraints."""
    scores = {
        "c1": score_c1(text), "c2": score_c2(text), "c3": score_c3(text),
        "c4": score_c4(text), "c5": score_c5(text), "c6": score_c6(text),
    }
    scores["y"] = sum(scores.values())
    return scores


# ---------- Run Order (from R with set.seed(305)) ----------

def generate_run_order():
    """Hardcoded from R output with set.seed(305)."""
    runs = [
        {"run_order": 1,  "A": -1, "B":  1, "C": -1, "replicate": 3},
        {"run_order": 2,  "A":  1, "B": -1, "C": -1, "replicate": 2},
        {"run_order": 3,  "A": -1, "B":  1, "C":  1, "replicate": 3},
        {"run_order": 4,  "A":  1, "B":  1, "C": -1, "replicate": 1},
        {"run_order": 5,  "A":  1, "B":  1, "C": -1, "replicate": 3},
        {"run_order": 6,  "A": -1, "B":  1, "C": -1, "replicate": 2},
        {"run_order": 7,  "A":  1, "B": -1, "C":  1, "replicate": 2},
        {"run_order": 8,  "A": -1, "B": -1, "C": -1, "replicate": 1},
        {"run_order": 9,  "A":  1, "B":  1, "C": -1, "replicate": 2},
        {"run_order": 10, "A": -1, "B": -1, "C":  1, "replicate": 2},
        {"run_order": 11, "A": -1, "B": -1, "C": -1, "replicate": 2},
        {"run_order": 12, "A": -1, "B":  1, "C":  1, "replicate": 2},
        {"run_order": 13, "A":  1, "B":  1, "C":  1, "replicate": 2},
        {"run_order": 14, "A":  1, "B": -1, "C":  1, "replicate": 1},
        {"run_order": 15, "A": -1, "B": -1, "C":  1, "replicate": 3},
        {"run_order": 16, "A":  1, "B":  1, "C":  1, "replicate": 1},
        {"run_order": 17, "A": -1, "B":  1, "C": -1, "replicate": 1},
        {"run_order": 18, "A": -1, "B":  1, "C":  1, "replicate": 1},
        {"run_order": 19, "A": -1, "B": -1, "C": -1, "replicate": 3},
        {"run_order": 20, "A": -1, "B": -1, "C":  1, "replicate": 1},
        {"run_order": 21, "A":  1, "B":  1, "C":  1, "replicate": 3},
        {"run_order": 22, "A":  1, "B": -1, "C": -1, "replicate": 3},
        {"run_order": 23, "A":  1, "B": -1, "C": -1, "replicate": 1},
        {"run_order": 24, "A":  1, "B": -1, "C":  1, "replicate": 3},
    ]
    return runs


# ---------- Main ----------

def main():
    runs = generate_run_order()
    output_dir = os.path.dirname(os.path.abspath(__file__))
    responses_dir = os.path.join(output_dir, "responses")
    os.makedirs(responses_dir, exist_ok=True)

    results = []

    for run in runs:
        ro = run["run_order"]
        A, B, C = run["A"], run["B"], run["C"]
        rep = run["replicate"]
        model_id = MODEL_MAP[C]
        prompt = build_prompt(A, B)

        print(f"\nRun {ro:2d}/24 | {'CoT' if A==1 else 'Direct'} | "
              f"{'Example' if B==1 else 'No Example'} | {model_id} | Rep {rep}")

        try:
            response_text, thinking_text, usage = call_gemini(model_id, prompt)
        except Exception as e:
            print(f"  ERROR: {e}, retrying...")
            time.sleep(10)
            response_text, thinking_text, usage = call_gemini(model_id, prompt)

        scores = score_response(response_text)
        print(f"  y={scores['y']} [{scores['c1']}{scores['c2']}{scores['c3']}"
              f"{scores['c4']}{scores['c5']}{scores['c6']}] "
              f"think={usage.get('thoughtsTokenCount', 'N/A')}")

        # Save individual response
        with open(os.path.join(responses_dir, f"response_run{ro:02d}.txt"), "w") as f:
            f.write(f"Run {ro} | A={A} B={B} C={C} Rep={rep}\n")
            f.write(f"Scores: {scores}\n\n--- RESPONSE ---\n{response_text}\n")
            if thinking_text:
                f.write(f"\n--- THINKING ---\n{thinking_text}\n")

        results.append({
            "run_order": ro, "A": A, "B": B, "C": C, "replicate": rep,
            **{k: scores[k] for k in ["c1","c2","c3","c4","c5","c6","y"]},
            "model": model_id,
            "thinking_tokens": usage.get("thoughtsTokenCount", 0),
        })
        time.sleep(2)

    # Save data.csv
    csv_path = os.path.join(output_dir, "data.csv")
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=[
            "run_order","A","B","C","replicate",
            "c1","c2","c3","c4","c5","c6","y","model","thinking_tokens"
        ])
        writer.writeheader()
        for r in sorted(results, key=lambda x: x["run_order"]):
            writer.writerow(r)

    print(f"\nDone. {csv_path} saved.")


if __name__ == "__main__":
    main()