Google unveils Gemini Embedding 2 for multimodal retrieval across various types of media

Google has released Gemini Embedding 2, its first fully multimodal embedding model trained on the Gemini architecture. This new model maps text, images, video, audio, and documents into a single, unified embedding space. It is designed for broad international use, supporting over 100 languages, and it enhances a wide array of multimodal tasks including Retrieval-Augmented Generation, semantic search, sentiment analysis, and data clustering.
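Because every modality lands in the same embedding space, cross-modal retrieval reduces to nearest-neighbor search over vectors. Below is a minimal sketch of that idea using cosine similarity; the three-dimensional vectors and file names are toy placeholders standing in for real model output, not actual Gemini embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query, corpus, k=2):
    """Rank corpus items by similarity to the query embedding."""
    scored = sorted(corpus.items(),
                    key=lambda kv: cosine_similarity(query, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]

# Toy vectors standing in for embeddings of mixed-media documents.
corpus = {
    "photo_of_cat.png": [0.9, 0.1, 0.0],
    "podcast_clip.mp3": [0.1, 0.9, 0.1],
    "quarterly_report.pdf": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # e.g. the embedding of the text "a cat"
print(top_k(query, corpus, k=1))  # -> ['photo_of_cat.png']
```

A unified space means the same query vector can rank images, audio clips, and documents against each other in one pass, which is what makes single-index multimodal search possible.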

Gemini Embedding 2 can process multiple modalities in a single request, including interleaved inputs such as an image paired with text. This lets the model capture relationships between different types of data, which is critical for real-world applications. Building on this multimodal core, the model accepts a wide range of inputs: up to 8192 text tokens, up to 6 images per request (PNG or JPEG), up to 120 seconds of video (MP4 or MOV), native audio, and PDF documents of up to 6 pages embedded directly, all without format conversion.
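The per-request limits quoted above can be checked client-side before a request is sent. The sketch below hard-codes the figures from the article; the request dictionary shape and the `validate_request` helper are illustrative assumptions, not part of any real SDK.

```python
# Per-request limits as reported in the article; the request shape
# and helper name are assumptions for illustration only.
LIMITS = {
    "text_tokens": 8192,
    "images": 6,            # PNG or JPEG
    "video_seconds": 120,   # MP4 or MOV
    "pdf_pages": 6,
}

def validate_request(request: dict) -> list[str]:
    """Return a list of limit violations; an empty list means it fits."""
    errors = []
    for field, limit in LIMITS.items():
        if request.get(field, 0) > limit:
            errors.append(f"{field}={request[field]} exceeds limit {limit}")
    return errors

print(validate_request({"text_tokens": 4000, "images": 2}))   # -> []
print(validate_request({"images": 8, "video_seconds": 300}))  # two violations
```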

The model uses Matryoshka Representation Learning, a technique for dynamically scaling embedding dimensions, and introduces advanced speech understanding. Gemini Embedding 2 outperforms established models across text, image, and video tasks and is currently available in public preview via the Gemini API and Vertex AI.
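Matryoshka Representation Learning trains embeddings so that leading prefixes of the vector remain useful on their own, letting clients trade accuracy for storage by truncating to a smaller dimension and re-normalizing. A minimal sketch of that client-side step, using a toy 8-dimensional vector in place of real model output:

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components of an MRL-trained embedding
    and re-normalize to unit length, since MRL prefixes stay meaningful."""
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]

# Toy vector standing in for a full-size embedding.
full = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
small = truncate_embedding(full, 4)
print(len(small))  # -> 4, a quarter of the storage per vector
```

This is why MRL is described as "dynamically scaling" dimensions: one stored full-size vector can serve indexes of several different sizes without re-embedding the source data.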

by Paul
