Google Gemma 4: Open-Weight AI That Punches Way Up

Four model sizes, Apache 2.0, natively multimodal from 2B to 31B — and the 26B MoE variant scores 88.3% on AIME 2026 with only 3.8B active parameters.

The 26B MoE model is the one that should bother every closed-model vendor. It activates just 3.8 billion of its 26 billion parameters per token, runs on a single RTX 4090, and scores 88.3% on AIME 2026, a mathematics benchmark where the full 31B dense variant reaches only 89.2%. You're giving up less than a point of math performance for roughly 8x less compute.
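The "roughly 8x" figure falls out of the active parameter counts. As a back-of-envelope sketch (assuming per-token forward-pass compute scales roughly with active parameters, which ignores architectural details):

```python
# Back-of-envelope compute comparison between the dense and MoE variants.
# Per-token forward-pass FLOPs scale roughly with the number of active
# parameters, so the ratio of active parameters approximates the compute ratio.
dense_active = 31e9   # 31B Dense: all parameters are active for every token
moe_active = 3.8e9    # 26B MoE: 3.8B active parameters per token

compute_ratio = dense_active / moe_active
print(f"Dense variant needs ~{compute_ratio:.1f}x the per-token compute")  # ~8.2x

accuracy_gap = 89.2 - 88.3   # AIME 2026 scores quoted above
print(f"AIME 2026 gap: {accuracy_gap:.1f} points")  # 0.9 points
```

31 / 3.8 ≈ 8.2, which is where the "less than a point of accuracy for roughly 8x less compute" trade comes from.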

Google DeepMind released Gemma 4 on April 2, 2026: four model sizes, all Apache 2.0 licensed, all natively multimodal.

The Lineup

Gemma 4 ships in four variants. The E2B (2 billion parameters) targets phones and on-device inference — it handles text, images, video, and audio natively. The E4B (4 billion) is the edge model, same modality support, more headroom. The 26B MoE uses a mixture-of-experts architecture that routes to 3.8B active parameters per token. And the 31B Dense is the straightforward large model for workstations and cloud inference.

All four share a 256K token context window and support over 140 languages.
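The lineup above is easy to reason about programmatically. Here is the same information as a small lookup table, with an illustrative helper for choosing a variant under a total-parameter budget; the specs come from the release notes above, but the selection rule itself is just a sketch:

```python
# The four Gemma 4 variants, per the release notes. Total and active
# parameter counts are in billions; native audio input is E2B/E4B only.
VARIANTS = {
    "E2B":       {"params_b": 2,  "active_b": 2,   "audio": True},
    "E4B":       {"params_b": 4,  "active_b": 4,   "audio": True},
    "26B MoE":   {"params_b": 26, "active_b": 3.8, "audio": False},
    "31B Dense": {"params_b": 31, "active_b": 31,  "audio": False},
}

def pick_variant(max_total_params_b: float, need_audio: bool = False) -> str:
    """Largest variant that fits a total-parameter budget (a rough proxy
    for memory footprint) and satisfies the audio requirement."""
    candidates = [
        (spec["params_b"], name)
        for name, spec in VARIANTS.items()
        if spec["params_b"] <= max_total_params_b
        and (spec["audio"] or not need_audio)
    ]
    return max(candidates)[1] if candidates else "none"

print(pick_variant(8))                    # E4B
print(pick_variant(30, need_audio=True))  # E4B (audio is E2B/E4B only)
print(pick_variant(30))                   # 26B MoE
```

Total parameters are only a proxy for memory here; actual VRAM requirements depend on quantization and context length.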

What makes this lineup unusual is the bottom end. The E4B hits 42.5% on AIME 2026 — more than double what the previous generation’s full-size model managed. That’s not a rounding error. It means the smallest Gemma 4 models are capable enough for real reasoning tasks, not just text completion.

Benchmarks Against the Field

The 31B Dense scores 85.2% on MMLU Pro, 89.2% on AIME 2026, and ranks third on Arena AI. On LiveCodeBench v6, it posts 80.0% — ahead of Llama 4’s 77.1%. For visual tasks, Gemma 4 excels at OCR and chart understanding across variable resolutions.

The comparison that matters most: Gemma 4’s 26B MoE versus everything else at similar compute budgets. There is no open model in this weight class delivering these scores. Nvidia’s Nemotron 3 Super operates at a much larger scale with 120B total parameters but only 12B active — Gemma 4’s 26B MoE gets competitive results with a third of the active compute.
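The "third of the active compute" claim checks out from the quoted parameter counts, again approximating compute by active parameters:

```python
# Active-parameter fractions for the two MoE models compared above.
# Numbers come from the article; "active compute" is approximated by
# active parameter count, which ignores architectural differences.
models = {
    "Gemma 4 26B MoE": (3.8, 26),    # (active B, total B)
    "Nemotron 3 Super": (12, 120),
}
for name, (active, total) in models.items():
    print(f"{name}: {active}B active of {total}B total "
          f"({active / total:.0%} of parameters per token)")

ratio = 3.8 / 12
print(f"Gemma 4 MoE activates {ratio:.2f}x the parameters of Nemotron 3 Super")
# 3.8 / 12 ≈ 0.32 — roughly a third, matching the claim above.
```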

What “Natively Multimodal” Actually Means

Previous Gemma releases bolted vision onto a text model. Gemma 4 was trained multimodal from the start — text, images, and video go through the same architecture. The E2B and E4B models add native audio input, meaning speech recognition and audio understanding without a separate transcription pipeline.

For developers, the practical difference is that you can feed a Gemma 4 model a screenshot, a video clip, or a voice recording and get structured output without chaining multiple models together. A single ollama run gemma4 command pulls and runs any variant locally.

Who Should Care

If you’re building on Gemini 3.1 Pro and paying per token, Gemma 4’s 26B MoE is the self-hosted escape hatch. Apache 2.0 means no revenue thresholds, no usage restrictions, no licensing surprises. The 31B Dense is for teams that need the absolute best open-weight quality and have the GPU budget. The E2B and E4B are for mobile and edge — and the fact that a 4B model can score 42.5% on a serious math benchmark means “edge AI” is no longer synonymous with “toy AI.”

Tags: AI