Gemini Embedding 2: One Model for Text, Images, Video, Audio, and PDFs

Google launches Gemini Embedding 2 — the first multimodal embedding model in the Gemini API, mapping text, images, video, audio, and documents into a single vector space. Available now in public preview.

If you're building RAG, search, or classification and you're still running separate embedding models for text, images, and audio — Google just made that architecture obsolete. Gemini Embedding 2 maps all five modalities into one vector space. One API call. One model.

Announced March 10 via the Google AI Studio account on X and available immediately in public preview through the Gemini API and Vertex AI, this is Google's first natively multimodal embedding model. Not a text embedder with image support bolted on — a model built from scratch on the Gemini architecture to understand the relationships between text, images, video, audio, and PDF documents simultaneously.

What It Actually Handles

The input support is broad. Text goes up to 8,192 tokens. Images accept up to six per request in PNG or JPEG. Video handles up to 120 seconds in MP4 or MOV. Audio runs up to 80 seconds in MP3 or WAV and is processed natively — no intermediate transcription step, which matters for music, ambient sound, or non-speech audio that would lose meaning in text conversion. PDFs embed directly, up to six pages.

The key capability is interleaved input. You can pass an image and a text description in a single request and get one aggregated embedding that captures both. That's a meaningful difference from pipelines that embed each modality separately and try to stitch the vectors together downstream.

Modality     Limit            Formats
Text         8,192 tokens     —
Images       6 per request    PNG, JPEG
Video        120 seconds      MP4, MOV
Audio        80 seconds       MP3, WAV
Documents    6 pages          PDF
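Those per-request limits are worth enforcing client-side before a request goes out. A minimal sketch, assuming you track token counts and media durations yourself — the helper and field names here are illustrative, not part of the Gemini API:

```python
# Client-side pre-check against the published per-request limits.
# The limits come from the announcement; the request dict shape is a
# hypothetical stand-in, not the actual API payload.

LIMITS = {
    "text_tokens": 8192,
    "images": 6,           # PNG or JPEG
    "video_seconds": 120,  # MP4 or MOV
    "audio_seconds": 80,   # MP3 or WAV
    "pdf_pages": 6,
}

def check_request(request: dict) -> list[str]:
    """Return a list of limit violations (an empty list means OK)."""
    errors = []
    for field, limit in LIMITS.items():
        value = request.get(field, 0)
        if value > limit:
            errors.append(f"{field}={value} exceeds limit of {limit}")
    return errors

# A mixed text + image + audio request that fits within the limits:
print(check_request({"text_tokens": 1200, "images": 2, "audio_seconds": 45}))
# → []

# An over-long video clip is flagged before wasting an API call:
print(check_request({"video_seconds": 300}))
# → ['video_seconds=300 exceeds limit of 120']
```

Catching an over-limit clip locally is cheaper than a rejected round trip, especially in batch pipelines.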

Flexible Dimensions via Matryoshka Learning

Default output is 3,072 dimensions. But the model uses Matryoshka Representation Learning (MRL) — a technique that nests information so you can truncate the vector to smaller sizes without retraining. Google recommends 3,072, 1,536, or 768 dimensions. Their MTEB benchmark numbers show the 768-dimension output scores 67.99 versus 68.16 for the full 3,072 — a negligible drop for a 75% reduction in storage and compute.
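Truncating an MRL embedding is just slicing and re-normalizing — no API round trip needed. A sketch in plain Python, with a dummy vector standing in for a real API response:

```python
import math

def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and re-normalize to unit length.
    MRL packs the most important information into the leading dimensions,
    so the truncated prefix remains a usable embedding on its own."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Dummy 3,072-dim unit vector standing in for a real embedding:
full = [1.0 / math.sqrt(3072)] * 3072

small = truncate_embedding(full, 768)
print(len(small))                           # → 768
print(round(sum(x * x for x in small), 6))  # → 1.0 (unit length again)
```

Re-normalizing after the slice matters: cosine similarity assumes unit-length vectors, and a raw prefix of a unit vector is shorter than 1.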

That flexibility matters at scale. If you're indexing millions of documents, cutting vector size by 75% while losing less than 0.3% on benchmarks is an easy trade.
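The storage math is easy to check. At 4 bytes per float32 component, for a hypothetical 10-million-document index:

```python
def index_size_gb(num_vectors: int, dims: int, bytes_per_dim: int = 4) -> float:
    """Raw float32 vector storage in gigabytes (no index overhead)."""
    return num_vectors * dims * bytes_per_dim / 1e9

docs = 10_000_000
print(index_size_gb(docs, 3072))  # → 122.88 (GB at full dimensionality)
print(index_size_gb(docs, 768))   # → 30.72 (GB truncated — a 75% saving)
```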

How It Compares

Google claims state-of-the-art performance across text, image, video, and speech embedding tasks. The benchmark chart from the announcement shows Gemini Embedding 2 outperforming leading models in multimodal depth — and introducing speech capabilities that most competing embedding models lack entirely.

OpenAI's text-embedding-3 handles text only. Cohere's Embed v3 added image support but doesn't touch audio or video. Google's own Nano Banana 2 generates images but doesn't embed them. Gemini Embedding 2 is the first from a major provider to cover all five modalities in a single model with a single vector space — meaning you can search across a mixed corpus of PDFs, images, audio clips, and text documents with one query embedding.
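Once everything lives in one vector space, retrieval is a single nearest-neighbor pass regardless of source modality. A sketch with dummy low-dimensional vectors standing in for real embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Dummy 4-dim embeddings standing in for real 3,072-dim output.
# In one shared space, a PDF, an image, and an audio clip are all
# candidates for the same text query — no per-modality index needed.
corpus = {
    "contract.pdf":  [0.9, 0.1, 0.0, 0.1],
    "diagram.png":   [0.1, 0.9, 0.1, 0.0],
    "interview.mp3": [0.0, 0.1, 0.9, 0.1],
}

query = [0.85, 0.15, 0.05, 0.1]  # embedding of a text query

best = max(corpus, key=lambda name: cosine(query, corpus[name]))
print(best)  # → contract.pdf
```

With separate per-modality models, the same search would need three indexes and some ad hoc score calibration to merge the result lists.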

Early Partners Are Already Seeing Results

Everlaw, a legal technology company, is using the model for litigation discovery — searching across millions of records including images and video. Sparkonomy, a creator economy platform, reports 70% latency reduction by removing the need for separate LLM inference steps, and semantic similarity scores jumping from 0.4 to 0.8 for text-image and text-video pairs. MindLid, a personal wellness app, sees a 20% lift in top-1 recall when embedding conversational memories alongside audio and visual data.

What Developers Need to Know

The model ID is gemini-embedding-2-preview. It's available through the Gemini API and Vertex AI right now. The embedding spaces between the new model and the previous gemini-embedding-001 are incompatible — you'll need to re-embed all existing data if you migrate. That's expected for a fundamentally different architecture, but it means this isn't a drop-in upgrade.
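Because the two spaces are incompatible, a common guard during a gradual migration is to tag every stored vector with the model that produced it and never compare across tags. A minimal sketch of that pattern — the record fields are illustrative, not a vector-database API:

```python
# Tag vectors with the producing model so queries never mix spaces.
# Model ids are from the article; the record shape is hypothetical.
store = [
    {"id": "doc-1", "model": "gemini-embedding-001",       "vector": [0.1, 0.9]},
    {"id": "doc-2", "model": "gemini-embedding-2-preview", "vector": [0.8, 0.2]},
]

def candidates(store: list[dict], query_model: str) -> list[str]:
    """Only vectors from the same model as the query are comparable."""
    return [row["id"] for row in store if row["model"] == query_model]

print(candidates(store, "gemini-embedding-2-preview"))  # → ['doc-2']
```

In practice most vector databases support metadata filters, so the same guard can run server-side during the re-embedding window.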

Integration support is already wide: LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, ChromaDB, and Vector Search all work out of the box. The 8,192-token context window is generous for an embedding model. And the batch API offers 50% lower pricing for non-latency-sensitive workloads — worth noting if you're re-embedding a large corpus.

For anyone building multimodal search or RAG systems, the trend is clear: the modality walls are coming down. Gemini Embedding 2 is the infrastructure layer that makes that practical.

Tags: AI