A/B test AI models on the same task before you commit
Same prompt, same inputs, two models. Surface the real differences on a task that matters to you, in 20 minutes, instead of picking based on someone else's benchmark.
- Time: 20 minutes
- Cost: ~$0.10–$0.50
- Stack: Claude Code, Codex CLI, any two LLMs, Markdown
You’re stuck with
You're about to standardize on one LLM for a workflow. Public benchmarks don't match your real task. Picking the wrong model costs you weeks of quietly-worse output before you notice.
You end up with
A comparison file showing each model's output on the same real task, per-dimension winners, wall time, and token cost. A defensible pick grounded in your work, not in MMLU.
The recipe
1. Pick one real task, not a toy
The test is worthless if the prompt is a puzzle or a trivia question. Pick something you actually do weekly: a code review, a doc audit, a planning decomposition, customer email tone-matching, a research synthesis.
Write the prompt exactly as you'd run it in production. Include the real system message, the real inputs, the real constraints. The point is to see which model holds up under your workload.
2. Freeze every variable except the model
List everything that could change the output, then lock each one:
- Prompt: identical, byte-for-byte
- System message: identical
- Temperature / reasoning effort: matched to each model's equivalent setting
- Context: same files, same tool access, same working directory
- Output format: same constraints (word limit, markdown, JSON, etc.)
If one model gets extra tools or a longer context window and the other doesn't, you're not comparing models. You're comparing two different tests.
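One low-tech way to enforce the freeze is to put every knob in a single file that both runs source, so the model name is the only thing that can differ. The file name `settings.env` and the variable names here are illustrative, not a standard:

```shell
# settings.env is a hypothetical convention: freeze every setting once,
# then source the same file for both runs.
cat > settings.env <<'EOF'
PROMPT_FILE=prompt.md
TEMPERATURE=0.2
MAX_TOKENS=2000
EOF

# Each run reads identical values; only the model flag changes.
. ./settings.env
echo "prompt=$PROMPT_FILE temp=$TEMPERATURE max=$MAX_TOKENS"
```

If a setting isn't in the file, it isn't frozen, and you'll find out the hard way when the outputs diverge for reasons that have nothing to do with the models.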
3. Run both, capture everything
For each model, save a file under comparison/<date>/<model-id>/output.md. Capture:
- The full output, verbatim
- Wall time from start to final token
- Token counts (input + output)
- Cost at current API pricing
- Any tool calls, file reads, or web fetches it made
If you can't automate the capture, paste outputs manually into structured files. Consistency matters more than automation here.
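A minimal capture wrapper can at least make wall time and file layout consistent. This is a sketch under assumptions: the `meta.md` format is made up, and `echo` stands in for a real model invocation; token counts and cost still come from your provider's usage report.

```shell
# Minimal capture sketch. The meta.md format is an assumption, not a standard.
run_and_capture() {
  out_dir="$1"; shift
  mkdir -p "$out_dir"
  start=$(date +%s)
  "$@" > "$out_dir/output.md"   # the model's verbatim output
  echo "wall_seconds: $(($(date +%s) - start))" > "$out_dir/meta.md"
}

# Stand-in command; replace with your actual model CLI call.
run_and_capture comparison/demo/model-a echo "hello from model A"
cat comparison/demo/model-a/meta.md
```

Whatever you use, the invariant is one directory per model per run, with the output and its metadata side by side.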
4. Score against dimensions that matter to you
Skip generic "helpfulness" scales. Write a 4-6 row rubric keyed to your workflow. Examples:
| Dimension | Why it matters |
|---|---|
| Cross-file contradiction detection | Doc audits care about this; code reviews often don't |
| Structured output fidelity | Matters for tool-use pipelines |
| Speed to first token | Matters for interactive UX, not for background jobs |
| Scope-gap finding | "What isn't this doing that it claims to?" |
| Token cost per completion | Only matters if you're running this thousands of times |
Each model wins different rows. That's the point. The goal is not a single-number winner, it's knowing which model to reach for in which situation.
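If you record one winner per rubric row, tallying the split takes one line of awk. The `scores.tsv` format here (dimension, tab, winner) is a hypothetical convention:

```shell
# scores.tsv is a made-up format: one row per rubric dimension, tab-separated.
printf 'contradiction-detection\tmodel-a\noutput-fidelity\tmodel-b\nscope-gaps\tmodel-a\ntoken-cost\tmodel-b\n' > scores.tsv

# Count wins per model. The per-row detail, not this total, is what you act on.
awk -F'\t' '{wins[$2]++} END {for (m in wins) print m, wins[m]}' scores.tsv | sort
# → model-a 2
# → model-b 2
```

A 2–2 split like this one isn't a tie to break; it's a map of which model to use where.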
5. Write the comparison, not the verdict
Don't title the file "Model X wins." Title it "Where each model is decisively better." Then for each dimension, write one line: "Model A, because [specific observation from the outputs]." Cite the actual outputs, quote them.
This forces honesty. If you can't write a specific observation, you don't have evidence, you have vibes.
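A cheap way to keep yourself honest is to pull the exact line you're citing, with its line number, straight from the saved output. The file and its contents below are invented for illustration:

```shell
# Hypothetical saved output; in practice this is the file from step 3.
mkdir -p comparison/demo
printf 'Summary looks fine.\nSection 3 contradicts section 5 on retry limits.\n' \
  > comparison/demo/model-a-output.md

# grep -n gives you a line number to cite next to the quote.
grep -n 'contradicts' comparison/demo/model-a-output.md
# → 2:Section 3 contradicts section 5 on retry limits.
```

If the grep comes back empty, the observation you were about to write down didn't happen.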
6. Make it re-runnable
Save the prompt, inputs, and scoring rubric in the comparison folder. When a new model ships in two months, you rerun the same test and append. Your comparison library compounds. Your team stops re-litigating model picks every quarter.
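The compounding library is just dated folders, each holding a frozen prompt you can replay. The dates and layout below are illustrative:

```shell
# Sketch of the library: one dated folder per past comparison,
# each with its frozen prompt (dates here are placeholders).
mkdir -p comparison/2025-01-15 comparison/2025-03-02
: > comparison/2025-01-15/prompt.md
: > comparison/2025-03-02/prompt.md

# When a new model ships, replay every frozen prompt against it.
for dir in comparison/*/; do
  echo "rerun against new model: ${dir}prompt.md"
done
```

Each rerun appends a new model directory next to the old ones, so the history of every pick stays in one place.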
Why this workflow beats public benchmarks
Benchmarks optimize for the benchmark. Once a leaderboard matters to the labs, every model gets trained to beat it. Your actual workflow is not on any leaderboard.
One task tells you more than ten scores. A single real task, run honestly, surfaces exactly the failure modes that would bite you in production. Ten synthetic scores tell you nothing about your specific edge cases.
Model strengths are directional, not absolute. One model is better at structured output and worse at cross-file reasoning. Trying to pick a single best model is the wrong question. The right question is: which model for which situation?
Steal this starter
Use this comparison.md skeleton for every test. Fill in, run both models, paste outputs.
comparison.md:
```markdown
# Model comparison, <date>, <task name>

## Task
<one-paragraph description of the real task you're testing>

## Prompt (frozen)
<exact prompt, byte-for-byte>

## Rubric
1. <dimension>, <why it matters>
2. <dimension>, <why it matters>
...

## Model A, <model-id>
- Wall time:
- Input / output tokens:
- Cost:
- Output: (see model-a-output.md)

## Model B, <model-id>
- Wall time:
- Input / output tokens:
- Cost:
- Output: (see model-b-output.md)

## Where each model is decisively better

### Model A wins
- <dimension>: <one specific observation citing the output>

### Model B wins
- <dimension>: <one specific observation citing the output>

## Tie / no meaningful difference
- <dimension>: <why it didn't discriminate>

## Takeaway
When to reach for Model A: <situation>
When to reach for Model B: <situation>
```
Run-both shell helper:
```bash
#!/usr/bin/env bash
# ab-test.sh <prompt-file> <model-a> <model-b>
set -euo pipefail

prompt_file="$1"
model_a="$2"
model_b="$3"
date_dir="comparison/$(date +%Y-%m-%d)"
mkdir -p "$date_dir/$model_a" "$date_dir/$model_b"

echo "Running $model_a..."
time_start=$(date +%s)
<your-cli> --model "$model_a" --file "$prompt_file" \
  > "$date_dir/$model_a/output.md"
echo "  wall: $(($(date +%s) - time_start))s"

echo "Running $model_b..."
time_start=$(date +%s)
<your-cli> --model "$model_b" --file "$prompt_file" \
  > "$date_dir/$model_b/output.md"
echo "  wall: $(($(date +%s) - time_start))s"

echo "Outputs in $date_dir"
```
Swap <your-cli> for whatever runs your models: Claude Code, codex exec, gh copilot, or a provider SDK. The script is a dozen lines; the rubric is the hard part.
That's the whole workflow. The first time you run it takes twenty minutes. Every subsequent comparison takes five, because the template and the rubric are done.