Meta Llama 4: Two MoE Models, 10M Token Context
Llama 4 Scout runs 109B total parameters across 16 experts with a 10 million token context window. Maverick scales to 400B across 128 experts. Both are open-weight and natively multimodal, and together they mark Meta's first MoE release.
Ten million tokens of context in an open-weight model. That’s the number that separates Llama 4 Scout from everything else in the open-source ecosystem — and from most closed models too.
Meta released Llama 4 Scout and Llama 4 Maverick on April 5, 2025. They're the first mixture-of-experts models in the Llama family and the first natively multimodal Llama models; Scout alone was pre-trained on roughly 40 trillion tokens. Two models, one architecture bet, and a context window that nobody else is matching at this price point.
Scout: Small, Fast, Impossibly Long Context
Llama 4 Scout has 109 billion total parameters split across 16 experts, with only 17 billion active per token at inference. That means it can run on hardware that would choke on a comparably capable dense model: Meta positions Scout to fit on a single H100 GPU with Int4 quantization.
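This is the core MoE trick: a small router picks a handful of experts per token and the rest stay idle, so per-token compute tracks active parameters, not the total. A minimal sketch of top-k routing (the expert count matches Scout; the router configuration here is an illustrative assumption, not Meta's implementation):

```python
import random

def topk_route(logits, k):
    """Return indices of the k highest router logits (the experts
    that run for this token); all other experts stay idle."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

# Toy setup: 16 experts (Scout-like), 1 routed expert per token.
random.seed(0)
num_experts, k = 16, 1
logits = [random.gauss(0, 1) for _ in range(num_experts)]
active = topk_route(logits, k)
print(f"experts run this token: {active} ({k}/{num_experts})")
```

Because only the selected experts' weights participate in the forward pass, a 109B-parameter model pays roughly 17B parameters' worth of FLOPs per token.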
The context window is the story. Ten million tokens means you can feed Scout an entire codebase, a full book, or months of conversation history and get coherent responses. For comparison, GPT-4.1 offers 1 million tokens, and Gemini 2.5 Pro also sits at 1 million. Scout gives you 10x that in an open-weight model you can self-host.
Performance at this parameter count is strong. Meta claims Scout rivals or exceeds models with far larger active-parameter counts; the MoE architecture lets it punch above its weight class while keeping inference costs low.
Maverick: The Conversational Heavyweight
Maverick takes the same 17 billion active parameters but distributes them across 128 experts, totaling 400 billion parameters. Where Scout optimizes for context length and efficiency, Maverick optimizes for output quality — particularly in conversational and creative tasks.
It supports a 1 million token context window with the instruct-tuned variant. The MoE routing across 128 experts gives Maverick more specialization per token — different experts activate for different types of reasoning, which shows up in the texture and coherence of longer outputs.
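The arithmetic behind the two configurations makes the trade-off concrete: per-token cost tracks active parameters, so Maverick is nearly 4x Scout's total size at roughly the same per-token compute. Figures are the ones stated above:

```python
# Active vs. total parameters for the two release configs.
models = {
    "Scout":    {"total_b": 109, "active_b": 17, "experts": 16},
    "Maverick": {"total_b": 400, "active_b": 17, "experts": 128},
}
for name, m in models.items():
    frac = m["active_b"] / m["total_b"]
    print(f"{name}: {m['active_b']}B/{m['total_b']}B active "
          f"({frac:.1%}) across {m['experts']} experts")
```

Maverick activates only about 4% of its weights per token; the extra capacity buys a wider pool of specialists for the router to choose from, not a higher per-token bill.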
Natively Multimodal, Finally
Both models process text, images, and video from a unified architecture. Previous Llama releases were text-first, with vision added after the fact through adapter-based variants. Llama 4 is multimodal from the ground up: an early-fusion design in which vision tokens and text tokens are pre-trained jointly and flow through the same MoE backbone, rather than a separate vision model stitched on through adapter layers.
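Early fusion is easiest to see as a data-layout decision: image patches are embedded into the same token stream as text, and one backbone attends over both. A minimal structural sketch (the names `text_embed` and `patch_embed` are illustrative, not Llama 4 APIs):

```python
# Sketch of early fusion: one interleaved sequence, one backbone,
# no adapter stage bridging two separate models.
def text_embed(words):
    return [("txt", w) for w in words]

def patch_embed(image_id, n_patches):
    return [("img", f"{image_id}/p{i}") for i in range(n_patches)]

sequence = (text_embed(["describe", "this", ":"])
            + patch_embed("screenshot", 4)
            + text_embed(["in", "one", "line"]))
print(len(sequence), sequence[:5])
```

In a real model each tuple would be an embedding vector, but the point survives the simplification: attention sees one sequence, so image and text tokens can condition on each other at every layer.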
This matters for developers building agents. If your application needs to reason over screenshots, parse documents, or understand video input, you no longer need to chain Llama with a separate vision model. Google's Gemma 3 made the same architectural choice; natively multimodal is becoming the baseline for any serious open model.
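In practice this collapses a two-model pipeline into one request. A hedged sketch: many self-hosting stacks (vLLM, for example) expose an OpenAI-compatible chat API, where a mixed image-and-text turn looks like the payload below. The model name and URL are placeholders:

```python
# One multimodal turn instead of a vision-model -> text-model chain.
# "llama-4-scout" and the image URL are placeholder values.
request = {
    "model": "llama-4-scout",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/screenshot.png"}},
            {"type": "text",
             "text": "What error is shown in this screenshot?"},
        ],
    }],
}
print(request["messages"][0]["content"][1]["text"])
```

The exact schema depends on your serving layer; the structural point is that image and text arrive as parts of a single message rather than outputs passed between models.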
The Behemoth Question
Meta has confirmed a third model, Llama 4 Behemoth, is still in training, and has previewed it at 288 billion active parameters across 16 experts and roughly 2 trillion total. Scout and Maverick are the advance guard; Behemoth is the frontier play.
For now, the practical question is whether 10 million tokens of context actually works as well as the number suggests. Long-context benchmarks are notoriously forgiving — the real test is whether Scout maintains coherence and retrieval accuracy across the full window in production workloads. If it does, every RAG pipeline using chunking and retrieval just got a serious competitor from brute-force context length.
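One way to test that claim yourself is a needle-in-a-haystack probe: bury a fact at a known depth in filler text, ask for it back, and sweep depth and length. A minimal harness sketch; the model call itself is left out, since it depends on your serving setup:

```python
# Needle-in-a-haystack probe: plant a fact at a chosen depth in filler,
# then check whether the model's answer recovers it.
def build_prompt(needle, depth_frac, filler_line, n_lines=1000):
    lines = [filler_line] * n_lines
    lines[int(depth_frac * (n_lines - 1))] = needle
    return "\n".join(lines) + "\n\nWhat is the secret code?"

def score(answer, secret):
    """Crude pass/fail: does the answer contain the planted secret?"""
    return secret in answer

prompt = build_prompt("The secret code is 7381.", 0.5,
                      "The sky was a uniform grey that day.")
print(len(prompt.splitlines()), "lines; needle at line",
      prompt.splitlines().index("The secret code is 7381."))
```

Scaled up to the full 10M-token window, with needles at many depths, this is exactly the kind of harness that would show whether Scout's retrieval holds up outside forgiving benchmarks.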