A/B test AI models on the same task before you commit
Same prompt, same inputs, two models. Surface the real differences on a task that matters to you, in 20 minutes, instead of picking based on someone else's benchmark.
- Time: 20 minutes
- Cost: ~$0.10–$0.50
- Stack: Claude Code, Codex CLI, any two LLMs, Markdown
You’re stuck with
You're about to standardize on one LLM for a workflow. Public benchmarks don't match your real task. Picking the wrong model costs you weeks of quietly-worse output before you notice.
You end up with
A comparison file showing each model's output on the same real task, per-dimension winners, wall time, and token cost. A defensible pick grounded in your work, not in MMLU.
The recipe
1. Pick one real task, not a toy
The test is worthless if the prompt is a puzzle or a trivia question. Pick something you actually do weekly: a code review, a doc audit, a planning decomposition, customer email tone-matching, a research synthesis.
Write the prompt exactly as you'd run it in production. Include the real system message, the real inputs, the real constraints. The point is to see which model holds up under your workload.
2. Freeze every variable except the model
List everything that could change the output, then lock each one:
- Prompt: identical, byte-for-byte
- System message: identical
- Temperature / reasoning effort: matched to each model's equivalent setting
- Context: same files, same tool access, same working directory
- Output format: same constraints (word limit, markdown, JSON, etc.)
If one model gets extra tools or a longer context window and the other doesn't, you're not comparing models. You're comparing two different tests.
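One low-tech way to enforce the freeze is to put every knob in a single file that both runs source, so the model name is the only thing that can differ. The file name `settings.env` and the variable names here are illustrative, not a standard:

```shell
# settings.env is a hypothetical convention: freeze every setting once,
# then source the same file for both runs.
cat > settings.env <<'EOF'
PROMPT_FILE=prompt.md
TEMPERATURE=0.2
MAX_TOKENS=2000
EOF

# Each run reads identical values; only the model flag changes.
. ./settings.env
echo "prompt=$PROMPT_FILE temp=$TEMPERATURE max=$MAX_TOKENS"
```

If a setting isn't in the file, it isn't frozen, and you'll find out the hard way when the outputs diverge for reasons that have nothing to do with the models.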
3. Run both, capture everything
For each model, save a file under comparison/<date>/<model-id>/output.md. Capture:
- The full output, verbatim
- Wall time from start to final token
- Token counts (input + output)
- Cost at current API pricing
- Any tool calls, file reads, or web fetches it made
If you can't automate the capture, paste outputs manually into structured files. Consistency matters more than automation here.
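A minimal capture wrapper can at least make wall time and file layout consistent. This is a sketch under assumptions: the `meta.md` format is made up, and `echo` stands in for a real model invocation; token counts and cost still come from your provider's usage report.

```shell
# Minimal capture sketch. The meta.md format is an assumption, not a standard.
run_and_capture() {
  out_dir="$1"; shift
  mkdir -p "$out_dir"
  start=$(date +%s)
  "$@" > "$out_dir/output.md"   # the model's verbatim output
  echo "wall_seconds: $(($(date +%s) - start))" > "$out_dir/meta.md"
}

# Stand-in command; replace with your actual model CLI call.
run_and_capture comparison/demo/model-a echo "hello from model A"
cat comparison/demo/model-a/meta.md
```

Whatever you use, the invariant is one directory per model per run, with the output and its metadata side by side.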
4. Score against dimensions that matter to you
Skip generic "helpfulness" scales. Write a 4-6 row rubric keyed to your workflow. Examples:
| Dimension | Why it matters |
|---|---|
| Cross-file contradiction detection | Doc audits care about this; code reviews often don't |
| Structured output fidelity | Matters for tool-use pipelines |
| Speed to first token | Matters for interactive UX, not for background jobs |
| Scope-gap finding | "What isn't this doing that it claims to?" |
| Token cost per completion | Only matters if you're running this thousands of times |
Each model wins different rows. That's the point. The goal is not a single-number winner, it's knowing which model to reach for in which situation.
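If you record one winner per rubric row, tallying the split takes one line of awk. The `scores.tsv` format here (dimension, tab, winner) is a hypothetical convention:

```shell
# scores.tsv is a made-up format: one row per rubric dimension, tab-separated.
printf 'contradiction-detection\tmodel-a\noutput-fidelity\tmodel-b\nscope-gaps\tmodel-a\ntoken-cost\tmodel-b\n' > scores.tsv

# Count wins per model. The per-row detail, not this total, is what you act on.
awk -F'\t' '{wins[$2]++} END {for (m in wins) print m, wins[m]}' scores.tsv | sort
# → model-a 2
# → model-b 2
```

A 2–2 split like this one isn't a tie to break; it's a map of which model to use where.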
5. Write the comparison, not the verdict
Don't title the file "Model X wins." Title it "Where each model is decisively better." Then for each dimension, write one line: "Model A, because [specific observation from the outputs]." Cite the actual outputs, quote them.
This forces honesty. If you can't write a specific observation, you don't have evidence, you have vibes.
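A cheap way to keep yourself honest is to pull the exact line you're citing, with its line number, straight from the saved output. The file and its contents below are invented for illustration:

```shell
# Hypothetical saved output; in practice this is the file from step 3.
mkdir -p comparison/demo
printf 'Summary looks fine.\nSection 3 contradicts section 5 on retry limits.\n' \
  > comparison/demo/model-a-output.md

# grep -n gives you a line number to cite next to the quote.
grep -n 'contradicts' comparison/demo/model-a-output.md
# → 2:Section 3 contradicts section 5 on retry limits.
```

If the grep comes back empty, the observation you were about to write down didn't happen.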
6. Make it re-runnable
Save the prompt, inputs, and scoring rubric in the comparison folder. When a new model ships in two months, you rerun the same test and append. Your comparison library compounds. Your team stops re-litigating model picks every quarter.
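The compounding library is just dated folders, each holding a frozen prompt you can replay. The dates and layout below are illustrative:

```shell
# Sketch of the library: one dated folder per past comparison,
# each with its frozen prompt (dates here are placeholders).
mkdir -p comparison/2025-01-15 comparison/2025-03-02
: > comparison/2025-01-15/prompt.md
: > comparison/2025-03-02/prompt.md

# When a new model ships, replay every frozen prompt against it.
for dir in comparison/*/; do
  echo "rerun against new model: ${dir}prompt.md"
done
```

Each rerun appends a new model directory next to the old ones, so the history of every pick stays in one place.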
Why this workflow beats public benchmarks
Benchmarks optimize for the benchmark. Once a leaderboard matters to the labs, every model gets trained to beat it. Your actual workflow is not on any leaderboard.
One task tells you more than ten scores. A single real task, run honestly, surfaces exactly the failure modes that would bite you in production. Ten synthetic scores tell you nothing about your specific edge cases.
Model strengths are directional, not absolute. One model is better at structured output and worse at cross-file reasoning. Trying to pick a single best model is the wrong question. The right question is: which model for which situation?
Steal this starter
Use this comparison.md skeleton for every test. Fill in, run both models, paste outputs.
comparison.md:
```markdown
# Model comparison, <date>, <task name>

## Task
<one-paragraph description of the real task you're testing>

## Prompt (frozen)
<exact prompt, byte-for-byte>

## Rubric
1. <dimension>, <why it matters>
2. <dimension>, <why it matters>
...

## Model A, <model-id>
- Wall time:
- Input / output tokens:
- Cost:
- Output: (see model-a-output.md)

## Model B, <model-id>
- Wall time:
- Input / output tokens:
- Cost:
- Output: (see model-b-output.md)

## Where each model is decisively better

### Model A wins
- <dimension>: <one specific observation citing the output>

### Model B wins
- <dimension>: <one specific observation citing the output>

## Tie / no meaningful difference
- <dimension>: <why it didn't discriminate>

## Takeaway
When to reach for Model A: <situation>
When to reach for Model B: <situation>
```
Run-both shell helper:
```bash
#!/usr/bin/env bash
# ab-test.sh <prompt-file> <model-a> <model-b>
set -euo pipefail

prompt_file="$1"
model_a="$2"
model_b="$3"
date_dir="comparison/$(date +%Y-%m-%d)"
mkdir -p "$date_dir/$model_a" "$date_dir/$model_b"

echo "Running $model_a..."
time_start=$(date +%s)
<your-cli> --model "$model_a" --file "$prompt_file" \
  > "$date_dir/$model_a/output.md"
echo "  wall: $(($(date +%s) - time_start))s"

echo "Running $model_b..."
time_start=$(date +%s)
<your-cli> --model "$model_b" --file "$prompt_file" \
  > "$date_dir/$model_b/output.md"
echo "  wall: $(($(date +%s) - time_start))s"

echo "Outputs in $date_dir"
```
Swap <your-cli> for whatever runs your models: Claude Code, codex exec, gh copilot, or a provider SDK. The script is a dozen lines; the rubric is the hard part.
That's the whole workflow. The first time you run it takes twenty minutes. Every subsequent comparison takes five, because the template and the rubric are done.