Google DeepMind announces the launch of the all-new "Gemini Embedding 2"

This is Google's first "natively multimodal" embedding model built on the Gemini architecture. Unlike previous approaches, where developers had to rely on text-only models or convert other media into text before retrieval, Gemini Embedding 2 maps text, images, video, audio, and documents directly into the same vector space. The model is currently available in public preview through the Gemini API and Vertex AI, and is expected to revolutionize the development experience for foundational patterns such as RAG (Retrieval-Augmented Generation), semantic search, and data aggregation.
Five data types in one model, with support for "interleaved input"
In the past, when building RAG systems over a database that contained both images and text, developers typically had to use another AI to "describe" the images as text before vectorizing them. That conversion step was not only time-consuming but also lost much of the original semantic detail.
Gemini Embedding 2, leveraging Gemini's powerful multimodal understanding capabilities, directly supports embedding and conversion of the following five data types:
• Text: long contexts of up to 8,192 input tokens.
• Images: up to 6 images per request (PNG and JPEG formats).
• Videos: video input up to 120 seconds long (MP4 and MOV formats).
• Audio: most groundbreaking of all, the model captures and embeds audio "natively," without any intermediate text-transcription step, so tone of voice and ambient sound can also be captured accurately.
• Documents: direct embedding of PDF files up to 6 pages long.
Even more powerful is Gemini Embedding 2's support for "interleaved input." Developers can submit "image + text" or "video + audio" in a single API request, and the model natively understands the complex, subtle relationships between the different media formats, producing more accurate vector representations (see the sketch below).
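As an illustration, the request below follows the embed_content pattern of the current google-genai Python SDK. The model name "gemini-embedding-2" and the ability to pass multimodal Parts to embed_content are assumptions based on the preview description, not confirmed API details.

```python
# Hedged sketch: interleaved "image + text" embedding with the google-genai SDK.
# "gemini-embedding-2" is an assumed preview model id, and multimodal Part
# support in embed_content is inferred from the announcement, not documented.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Read the image that should be embedded together with its caption.
with open("schematic.png", "rb") as f:
    image_bytes = f.read()

result = client.models.embed_content(
    model="gemini-embedding-2",  # assumed model id
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        types.Part.from_text(text="Wiring schematic for the rev-B controller board"),
    ],
    config=types.EmbedContentConfig(output_dimensionality=3072),
)

vector = result.embeddings[0].values
print(len(vector))  # expected: 3072
```

Because the caption and the image travel in the same request, the returned vector reflects both modalities at once instead of two separately embedded fragments.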
Introducing MRL technology: Balancing performance and storage costs
Alongside accuracy, Google has also considered the storage costs enterprises face when deploying vector databases.
Continuing the tradition of its predecessors, Gemini Embedding 2 also employs Matryoshka Representation Learning (MRL), sometimes described as "Russian doll" representation learning. The technique nests the most important information at the beginning of the vector, so developers can dynamically reduce the output dimensionality.
Although the system defaults to and recommends the highest-quality dimensions of 3072, 1536, or 768, developers can scale the dimensionality down based on a project's tolerance for storage space and search latency, striking a balance between performance and cost, as in the sketch below.
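Because MRL front-loads the most informative components, a stored full-size embedding can simply be truncated and re-normalized before similarity search. The following sketch uses random numpy vectors as stand-ins for real model output, purely to show the mechanics.

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Keep the first `dim` dimensions of an MRL-style embedding and
    re-normalize so cosine similarity stays meaningful."""
    head = np.asarray(vec[:dim], dtype=np.float32)
    return head / np.linalg.norm(head)

# Random stand-ins for a 3072-dimension query and document embedding.
rng = np.random.default_rng(0)
query_full = rng.normal(size=3072)
doc_full = rng.normal(size=3072)

# Compare similarity at full and reduced dimensionality.
for dim in (3072, 1536, 768):
    q = truncate_embedding(query_full, dim)
    d = truncate_embedding(doc_full, dim)
    print(dim, float(q @ d))  # cosine similarity at each dimension
```

Storing the 768-dimension prefix instead of the full vector cuts index size by 4x; the trade-off is whatever retrieval quality the tail dimensions would have contributed.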
Seamless integration with the mainstream AI developer ecosystem
To enable developers to integrate this powerful technology into existing projects as soon as possible, Gemini Embedding 2 is ready to interface with the most popular open-source frameworks and vector libraries.
The official statement indicates that the model can be integrated directly into development frameworks such as LangChain, LlamaIndex, and Haystack, and works with mainstream vector databases such as Weaviate, Qdrant, ChromaDB, and Google's own Vector Search; a minimal wiring sketch follows.
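For example, a LangChain + ChromaDB pipeline could look roughly like this. The langchain-google-genai and langchain-chroma packages exist today, but the model identifier "models/gemini-embedding-2" is an assumed placeholder for whatever id the preview exposes.

```python
# Hedged sketch: plugging a Gemini embedding model into LangChain + Chroma.
# Requires GOOGLE_API_KEY in the environment; the model id is an assumption.
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_chroma import Chroma

embeddings = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-2")

# In-memory Chroma collection using the Gemini embeddings for indexing.
store = Chroma(collection_name="docs", embedding_function=embeddings)
store.add_texts([
    "Quarterly maintenance checklist for the HVAC system",
    "Incident report: cooling loop pressure drop on line 3",
])

# Semantic search over the indexed texts.
hits = store.similarity_search("How do I service the HVAC unit?", k=1)
print(hits[0].page_content)
```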
Analysis
Over the past two years, the industry's attention has been focused almost entirely on "eloquent" large language models (LLMs), but what actually determines whether enterprise AI applications (such as internal knowledge-base customer service and intelligent search) feel smart is the embedding model responsible for converting massive amounts of data into a machine-understandable format.
Google's biggest weapon this time is the word "natively." In particular, audio can be vectorized directly without first being converted into a verbatim transcript, which means AI is beginning to truly "understand" the emotion and tonal variation in sound rather than just reading cold, impersonal text. Once text, images, and audio-visual content can all be compared accurately within the same coordinate system, we are on the verge of a next-generation "multimodal RAG" explosion: systems that can genuinely understand design drawings, comprehend recordings of legal arguments, and even search directly for specific video clips.