5x Faster, Same Answers: Why I Switched My Default Local Model
I asked a local model to return the top 5 programming languages as a JSON array. Three keys per object: rank, language, primary_use. Output only valid JSON, nothing else.
Gemma 4 E4B returned the array in 27 seconds. 588 tokens. Valid JSON. Done.
Qwen 3 VL 8B returned the same array in 3 minutes and 26 seconds. 4,372 tokens. It reasoned through language popularity trends, considered alternative rankings, debated whether TypeScript should replace C#, self-corrected twice, and then produced the same five-item JSON array that Gemma had delivered two minutes earlier.
Both answers were correct. One of them let me pipe the output into my script and move on. The other made me wait, scroll past 3,800 tokens of internal deliberation, and then do the same thing.
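The "pipe it into my script" happy path is worth spelling out: when the model emits only the JSON, the consuming script just parses and validates it, with no trimming step. A minimal sketch of that check, assuming the field names from the prompt above (the sample response string is illustrative, not actual model output):

```python
import json

REQUIRED_KEYS = {"rank", "language", "primary_use"}

def validate_languages_json(raw: str) -> list[dict]:
    """Parse a response expected to be a bare JSON array of
    five objects with exactly rank/language/primary_use keys."""
    data = json.loads(raw)  # fails immediately if anything but JSON is present
    if not isinstance(data, list) or len(data) != 5:
        raise ValueError("expected a five-item array")
    for item in data:
        if set(item) != REQUIRED_KEYS:
            raise ValueError(f"unexpected keys: {sorted(item)}")
    return data

# Illustrative response in the shape the prompt asks for
sample = (
    '[{"rank": 1, "language": "Python", "primary_use": "data science"},'
    ' {"rank": 2, "language": "JavaScript", "primary_use": "web"},'
    ' {"rank": 3, "language": "Java", "primary_use": "enterprise"},'
    ' {"rank": 4, "language": "C++", "primary_use": "systems"},'
    ' {"rank": 5, "language": "Go", "primary_use": "cloud services"}]'
)
rows = validate_languages_json(sample)
```

With a disciplined model, this is the entire integration: one `json.loads` and a schema check.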
What changed
Google shipped Gemma 4 on April 3, 2026. The E4B variant fits in 9.6GB and runs on any Mac with 24GB of unified memory. What surprised me was not the benchmark uplift (AIME went from 20.8% to 89.2%). It was the output discipline.
Both models use extended thinking internally. Gemma keeps the thinking internal and delivers a concise final answer. Qwen surfaces the full reasoning chain in the response: drafts, critiques, alternatives, self-corrections. That is genuinely useful if you want to learn how the model thinks. It is expensive if you just want the answer.
The JSON task made this concrete, but the pattern held across all seven tasks I tested. On a 4-line poem prompt, Qwen generated 6,106 tokens and still broke the rhyme. Gemma produced 336 tokens and nailed it.
Why it matters
If you are calling a local model inside a build loop, the cost of verbosity is not tokens. It is attention.
Every extra paragraph you skim is a context switch. Every 3-minute wait is a tab switch. At 27 seconds, you stay in your workflow. At 3 minutes and 26 seconds, you check your phone, open a browser tab, and lose the thread.
For developers piping model output into scripts, this is even sharper. If a model wraps valid JSON in 3,800 tokens of commentary, you have to trim the response before you can parse it. If it returns just the JSON, your pipeline works with no post-processing at all.
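For the verbose case, that trim step looks something like this: scan for the first balanced, parseable JSON array in the response and discard everything around it. This is a sketch of the post-processing described above, not code from the benchmark, and the bracket matching is deliberately naive (it would be confused by brackets inside string values):

```python
import json

def extract_json_array(text: str):
    """Pull the first parseable JSON array out of a response that
    may wrap it in reasoning, markdown fences, or commentary."""
    start = text.find("[")
    while start != -1:
        depth = 0
        for i, ch in enumerate(text[start:], start):
            if ch == "[":
                depth += 1
            elif ch == "]":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start : i + 1])
                    except json.JSONDecodeError:
                        break  # not valid JSON; try the next '[' in the text
        start = text.find("[", start + 1)
    raise ValueError("no JSON array found in response")

# A concise response parses directly; a verbose one needs the scan.
verbose = ('Let me think through the rankings first... '
           'Here is the final answer:\n'
           '[{"rank": 1, "language": "Python"}]\n'
           'Hope that helps!')
```

It works, but every model that forces you to carry this helper is charging you a tax the concise model never levies.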
How it showed up in practice
I ran the full comparison across seven tasks: coding, reasoning, math, structured output, debugging, creative writing, and system design. Gemma completed the suite in 2.6 minutes. Qwen took 13.2 minutes. Both got every factual answer right.
The detail that makes this interesting: Qwen's raw generation rate (36-40 tok/s) matches or beats Gemma's (23-43 tok/s) on most tasks. The wall-clock gap is not about generation speed. It is about how many tokens the model decides to produce before it reaches the answer. Gemma outputs 5-18x fewer tokens per response.
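The arithmetic makes the point: pure generation time is roughly token count divided by tok/s, so a 7x token gap swamps a modest per-token speed advantage. A back-of-the-envelope check using the JSON task's token counts (the rates are assumed mid-range values from the spans above, and measured wall-clock times run longer because of prompt processing and other overhead):

```python
# Generation time ≈ output tokens / generation rate (tok/s).
# Rates are illustrative mid-range picks, not measured values.
qwen_tokens, qwen_rate = 4372, 38    # faster per token...
gemma_tokens, gemma_rate = 588, 30   # ...slower per token

qwen_seconds = qwen_tokens / qwen_rate     # ~115 s of pure generation
gemma_seconds = gemma_tokens / gemma_rate  # ~20 s

# Even granting Qwen the higher rate, the token count dominates:
speedup = qwen_seconds / gemma_seconds
```

Under these assumptions the concise model finishes roughly 6x sooner despite generating each token more slowly, which is the whole argument in two divisions.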
The full benchmark, with all seven tasks, exact prompts, tok/s rates, and timing data, is in The Hot-Loop Benchmark.
The key lesson
When you are choosing a default local model, the question is not "which one is smartest?" It is "which one keeps me in the loop?"
Gemma 4 E4B is now the default on my M4 Pro. Qwen 3 VL stays installed for image understanding, but for text generation, coding, and structured output, the conciseness gap is too large to ignore.
What to watch next
- Gemma 4 26B MoE activates only 3.8B parameters at inference. If it fits in 24GB quantized, it could beat E4B without the speed penalty.
- Tool-use performance. Gemma 4's published tool-use scores are extraordinary (86.4% vs Gemma 3's 6.6%). If that holds locally, it changes the calculus for AI agents on consumer hardware.
- Qwen's next move. If Qwen 4 tightens the verbosity gap, this comparison needs a rerun.
The model you keep loaded is not the smartest one you can find. It is the one that lets you ship.
Want to see more projects like this? Browse all builds for interactive tools, dashboards, and case studies with source and build times. Or learn more about how I work.