The Hot-Loop Benchmark
Head-to-head test of Gemma 4 E4B vs Qwen 3 VL 8B on an M4 Pro 24GB. Same correctness, 5x faster wall-clock, and 3-18x shorter answers. One of them became the default.
What this proves
In this benchmark, when both models were equally correct, the faster and more concise one kept the build loop moving. That compounds.
How it works
I ran seven identical prompts through two local models on the same machine. Both got every factual answer right. One took 2.6 minutes. The other took 13.2 minutes.
That gap is not about intelligence. It is about how fast you get back to the next decision.
Results
2.6 minutes vs 13.2 minutes. Same hardware, same prompts, same correctness on every factual test.
| Test | Gemma 4 E4B | Qwen 3 VL 8B | Why Gemma wins |
|---|---|---|---|
| Coding | 776 tok, 25.7s (30 tok/s) | 7,967 tok, 3m43s (36 tok/s) | 10x more concise, same quality |
| Reasoning | 253 tok, 6.2s (43 tok/s) | 1,553 tok, 39s (40 tok/s) | 6x faster, both correct (9) |
| Math | 699 tok, 16.8s (43 tok/s) | 5,100 tok, 2m15s (38 tok/s) | 8x faster, both correct (328) |
| Structured JSON | 588 tok, 27.7s (27 tok/s) | 4,372 tok, 3m26s (38 tok/s) | 7x faster, both valid |
| Debug | 1,286 tok, 35.7s (43 tok/s) | 4,650 tok, 2m2s (38 tok/s) | 3.5x faster, both found all 3 bugs |
| Creative | 336 tok, 21s (23 tok/s) | 6,106 tok, 3m1s (38 tok/s) | Gemma rhymes clean, Qwen breaks |
| System Design | 604 tok, 25.3s (24 tok/s) | 1,985 tok, 50s (40 tok/s) | Tie |
Note the tok/s column: Qwen actually generates tokens faster on most tasks. Gemma wins on wall-clock time because it produces 3-18x fewer tokens to reach the same answer. The speed advantage is about conciseness, not raw generation speed.
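The arithmetic behind that claim is simple: wall-clock time is roughly tokens generated divided by tokens per second, so a modest decode-speed edge cannot overcome a large token-count gap. Worked through with the Math row of the table:

```python
# Wall-clock ≈ tokens / tokens-per-second.
# Figures taken from the Math row of the results table.
gemma_tokens, gemma_tps = 699, 43    # concise answer, slightly faster decode
qwen_tokens, qwen_tps = 5100, 38     # correct answer buried in a long chain

gemma_seconds = gemma_tokens / gemma_tps   # ≈ 16.3 s
qwen_seconds = qwen_tokens / qwen_tps      # ≈ 134.2 s

# Even if Qwen decoded at Gemma's speed, the token count still dominates:
qwen_at_gemma_speed = qwen_tokens / gemma_tps  # ≈ 118.6 s
```

At equal correctness, the model that says less finishes first, regardless of who decodes faster.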
Why this matters in the build loop
Google released Gemma 4 on April 3, 2026. Four model sizes, Apache 2.0 license, extended reasoning built in. The published benchmarks are dramatic: AIME jumped from 20.8% to 89.2%, coding benchmarks nearly tripled.
But published benchmarks tell you what a model can do on a leaderboard. They do not tell you what it feels like to use one in a real workflow on real hardware.
The question I actually cared about: on a 24GB Mac, which model should be the default in my local AI loop?
What I tested
Hardware: M4 Pro Mac Mini, 24GB unified memory, macOS Sequoia.
Models:
- Gemma 4 E4B (9.6 GB, Google, released April 3 2026)
- Qwen 3 VL 8B (6.1 GB, Alibaba, current local favorite)
Runtime: Ollama, both models quantized to fit in memory.
Seven tasks, identical prompts for both models:
| # | Task | What it tests |
|---|---|---|
| 1 | Write a longest-palindrome function in Python with type hints | Coding quality + instruction following |
| 2 | "All but 9 sheep die" reasoning puzzle | Basic logic |
| 3 | Sum of primes between 1 and 50 | Step-by-step math |
| 4 | Top 5 languages as JSON array | Structured output compliance |
| 5 | Find 3 bugs in a broken binary search | Code debugging |
| 6 | 4-line rhyming poem about debugging at 3am | Creative constraint following |
| 7 | Design a URL shortener (under 300 words) | System design conciseness |
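A minimal timing harness for this kind of run might look like the sketch below. It times any completion callable and totals wall-clock across the suite; the `run_fn` indirection and the whitespace token estimate are my assumptions, not the original setup. In practice `run_fn` would shell out to `ollama run <model> <prompt>` via `subprocess`, and a real harness would read the exact token counts Ollama prints with `--verbose`.

```python
import time
from typing import Callable

def time_completion(run_fn: Callable[[str], str], prompt: str) -> dict:
    """Time one prompt through a model runner and record the output size.

    run_fn is any callable that takes a prompt and returns the model's text,
    e.g. a thin wrapper around `subprocess.run(["ollama", "run", model, prompt], ...)`.
    """
    start = time.perf_counter()
    output = run_fn(prompt)
    elapsed = time.perf_counter() - start
    return {
        "seconds": elapsed,
        # Crude whitespace-word estimate; Ollama's --verbose output reports
        # the real eval token count and tok/s if you want exact figures.
        "approx_tokens": len(output.split()),
        "output": output,
    }

def run_suite(run_fn: Callable[[str], str], prompts: list[str]) -> dict:
    """Run every prompt in order and total wall-clock and token counts."""
    results = [time_completion(run_fn, p) for p in prompts]
    return {
        "total_seconds": sum(r["seconds"] for r in results),
        "total_tokens": sum(r["approx_tokens"] for r in results),
        "per_task": results,
    }
```

Running the same `prompts` list through two `run_fn` wrappers, one per model, reproduces the comparison above: same tasks, same machine, wall-clock and output length side by side.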
The verbosity gap
The most surprising finding was not the speed. It was the output length.
Qwen generated 6,106 tokens for a 4-line poem. It produced three drafts, critiqued each one, offered alternative rhyme schemes, explained why its choices worked, and still delivered a poem that did not fully rhyme. Gemma produced 336 tokens: a thinking trace, a clean poem, done.
This pattern repeated on every test. Qwen's extended thinking mode generates massive internal chains even for simple prompts. The answers are not wrong, they are just buried in 3-18x more text than you need to read.
For a coding assistant embedded in your workflow, that is not a style difference. It is friction. Every extra paragraph you skim is a context switch away from the thing you were building.
Why this approach worked
This was not a synthetic benchmark. The prompts are tasks I actually care about: writing functions, debugging code, generating structured output, quick reasoning checks. The metrics that matter are wall-clock time and whether you have to re-read the output to find the answer.
Both models use extended thinking internally. The difference is that Gemma's thinking stays internal and the final output is concise. Qwen exposes its full reasoning chain in the response, which is useful for education but expensive for flow.
What to steal
- Optimize for loop throughput, not theoretical peak capability. If you are using a local model for repeated build/test/decide cycles, throughput and conciseness matter more than marginal capability once correctness is good enough.
- Verbosity is not thoroughness. A 776-token correct answer is strictly better than a 7,967-token correct answer for every use case except teaching.
- Memory constraints are the real filter. The 26B MoE and 31B dense Gemma 4 variants would win on harder benchmarks, but they do not fit in 24GB. On consumer hardware, E4B is the ceiling. It matters that the ceiling is good enough.
- Test with your own tasks, not leaderboard prompts. AIME scores do not tell you if the model follows "output ONLY valid JSON" correctly. That takes 30 seconds to test.
- The practical rule: Use Gemma 4 E4B as your default local text model on 24GB Macs. Reach for Qwen when you need vision, or for bigger models when the task genuinely needs deeper reasoning.
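The 30-second JSON-compliance check mentioned above is easy to automate. A sketch, assuming the common failure modes: models wrapping the payload in markdown fences or padding it with prose (the fence-stripping and the five-string shape matching task #4 are my assumptions):

```python
import json
import re

def check_json_only(raw: str, expected_len: int = 5) -> bool:
    """Return True only if the output is a valid JSON array of strings.

    A bare ```json fence around the payload is tolerated and stripped,
    but any other surrounding prose fails the check.
    """
    text = raw.strip()
    fenced = re.fullmatch(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return (isinstance(data, list)
            and len(data) == expected_len
            and all(isinstance(item, str) for item in data))
```

Point it at each model's raw response for task #4 and you know in one line whether "output ONLY valid JSON" was actually followed.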
Limits or caveats
- Seven tasks is small. This is a flow-state benchmark, not a statistical study. For long-form reasoning, multi-step agents, or RAG pipelines, the ranking could flip.
- Qwen 3 VL has vision. If your workflow involves image understanding, Qwen has a capability Gemma 4 E4B does not (though the E2B and E4B edge models do support multimodal input with different tradeoffs).
- Quantization matters. Both models are quantized to fit in 24GB. Different quantization levels would shift speed and quality.
- No tool-use test. Gemma 4's published tool-use scores are dramatic (86.4% vs 6.6% on one benchmark), but I did not test structured tool calling here.
The real bottleneck in local AI is not cost or correctness. It is loop time. When two models are equally right, the one that gets out of the way faster is the one that earns the default slot.
Get the next build and workflow breakdown.