Google Launches Gemini 3.1 Flash Lite: Fast, Cheap, and Surprisingly Capable

Google releases Gemini 3.1 Flash Lite — a model built for speed and cost efficiency. It's 2.5x faster than Gemini 2.5 Flash, costs a fraction of Pro, and still handles 1M tokens of context.

Google has released Gemini 3.1 Flash Lite, its most cost-efficient AI model yet. Launched on March 3, 2026, this is the lightweight sibling of Gemini 3.1 Pro — designed for teams that need to run AI at massive scale without burning through their budgets. It's faster, cheaper, and built for the kind of high-volume workloads where every millisecond and every token matters.

Speed and Cost

The headline numbers: Gemini 3.1 Flash Lite delivers 2.5x faster time-to-first-token and 45% faster output generation compared to Gemini 2.5 Flash, according to benchmarks from Artificial Analysis. In real terms, that means responses start arriving almost immediately and complete noticeably faster than the previous generation.

Pricing is set at $0.25 per million input tokens and $1.50 per million output tokens. That's roughly one-eighth the cost of Gemini 3.1 Pro. For applications processing millions of requests per day — think content moderation, real-time translation, or automated customer support — the savings add up fast.
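At those per-token rates, rough budgeting is simple arithmetic. The sketch below estimates a daily bill for a hypothetical high-volume workload; the request count and token sizes are illustrative assumptions, and only the $0.25/$1.50 per-million rates come from the announcement.

```python
# Flash Lite's published rates: $0.25 per million input tokens,
# $1.50 per million output tokens.
INPUT_RATE = 0.25 / 1_000_000   # dollars per input token
OUTPUT_RATE = 1.50 / 1_000_000  # dollars per output token

def daily_cost(requests: int, in_tokens: int, out_tokens: int) -> float:
    """Dollars per day for `requests` calls of the given average size."""
    return requests * (in_tokens * INPUT_RATE + out_tokens * OUTPUT_RATE)

# Hypothetical moderation workload: 1M requests/day,
# averaging ~2,000 input tokens and ~500 output tokens each.
print(f"${daily_cost(1_000_000, 2_000, 500):,.2f}/day")  # → $1,250.00/day
```

Input tokens dominate most classification and moderation workloads, so the 6x gap between input and output pricing matters less there than it would for long-form generation.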

What It Can Handle

Despite being the "lite" model, Flash Lite isn't stripped down. It supports a 1 million token input context window and can generate responses up to 64,000 tokens. It processes text, images, audio, and video — the full multimodal stack that Gemini is known for.

Audio processing got a specific upgrade. Automatic speech recognition (ASR) accuracy is improved over previous Flash models, making Flash Lite a better fit for voice-based applications, meeting transcription, and audio content analysis.


The model also inherits the three-tier thinking system introduced with Gemini 3.1 Pro. Developers can set Low, Medium, or High thinking levels per request, trading off response time against reasoning depth. For most Flash Lite use cases, Low or Medium thinking will be the sweet spot — fast enough for real-time applications while still producing coherent, useful output.
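One practical pattern for per-request thinking levels is routing by task type: default to Low for latency-sensitive calls and reserve higher levels for the few requests that need deeper reasoning. The routing table below is an illustrative sketch, not the documented API: the task categories and function are hypothetical, and only the Low/Medium/High tiers come from the announcement.

```python
# Map task categories to thinking levels. The categories here are
# hypothetical examples; only the "low"/"medium"/"high" tiers come
# from the model's three-tier thinking system.
THINKING_BY_TASK = {
    "translation": "low",      # real-time, latency-sensitive
    "moderation": "low",
    "classification": "low",
    "extraction": "medium",    # some structure to reason through
    "summarization": "medium",
}

def thinking_level(task: str) -> str:
    """Pick a thinking level per request, defaulting to 'low' for speed."""
    return THINKING_BY_TASK.get(task, "low")

print(thinking_level("extraction"))  # → medium
print(thinking_level("chitchat"))    # → low
```

Defaulting unknown tasks to Low keeps the common path fast; anything that genuinely needs High-level reasoning is usually a better fit for Pro anyway.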

Where It Fits

Google positions Flash Lite for high-volume, latency-sensitive tasks. The obvious use cases include translation, content moderation, UI generation, classification, and data extraction. But the 1M token context window opens up more interesting applications — like summarizing long documents, processing entire conversation histories, or analyzing large datasets in a single pass.
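Before pushing an entire document through in one pass, it is worth checking that it actually fits the 1M-token window. The sketch below uses a rough 4-characters-per-token heuristic, which is an assumption rather than the real Gemini tokenizer; in production the count should come from the API's token-counting endpoint.

```python
CONTEXT_WINDOW = 1_000_000  # Flash Lite's input token limit
CHARS_PER_TOKEN = 4         # rough heuristic, NOT the real tokenizer

def fits_in_context(text: str, reserve: int = 10_000) -> bool:
    """Rough check that `text` fits the input window, keeping
    `reserve` tokens free for instructions and the prompt itself."""
    estimated_tokens = len(text) / CHARS_PER_TOKEN
    return estimated_tokens <= CONTEXT_WINDOW - reserve

doc = "x" * 3_900_000                     # ~975k estimated tokens
print(fits_in_context(doc))               # → True
print(fits_in_context("x" * 4_100_000))   # → False
```

A pre-flight check like this decides between a single-pass call and a chunked fallback without wasting a request on an oversized payload.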

The key question with any "lite" model is always: what did you give up? In Flash Lite's case, the trade-off is mainly in complex multi-step reasoning. For tasks that require deep logical chains or nuanced analysis, Gemini 3.1 Pro is still the better choice. Flash Lite excels at tasks where speed and throughput matter more than deep thinking — and in enterprise environments, that's a surprisingly large percentage of all AI workloads.

Availability

Gemini 3.1 Flash Lite is available now in preview through the Gemini API in Google AI Studio and Vertex AI. Enterprise customers get the usual Vertex AI guarantees around data privacy, compliance, and SLA commitments.

For developers already using the Gemini API, switching to Flash Lite amounts to a one-line model name change: the API interface is identical to the other Gemini 3.1 models, so existing code works without modification.
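Because the request shape stays the same across Gemini 3.1 models, a migration can be as small as swapping the model identifier in a shared request builder. The sketch below is illustrative only: the model ID strings and the payload shape are assumptions here, not quoted from the API reference.

```python
def build_request(model: str, prompt: str) -> dict:
    """Build a generate-content style payload. Only `model` varies
    between Gemini 3.1 variants (the model ID strings are assumed)."""
    return {
        "model": model,
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
    }

pro = build_request("gemini-3.1-pro", "Classify this ticket.")
lite = build_request("gemini-3.1-flash-lite", "Classify this ticket.")

# Everything except the model name is identical.
assert {k: v for k, v in pro.items() if k != "model"} == \
       {k: v for k, v in lite.items() if k != "model"}
print(lite["model"])  # → gemini-3.1-flash-lite
```

Keeping the model ID in configuration rather than scattered through call sites makes it trivial to A/B Flash Lite against Pro on the same traffic.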

Full documentation is available on the Google DeepMind model card.