Google recently launched Gemma 3n, a new open-source on-device multimodal AI model. Google claims it brings high-performance models to the device itself, giving phones, tablets, and laptops multimodal capabilities that were previously available only from cloud-hosted models.
The Gemma 3n model is now available on Hugging Face, accompanied by complete technical documentation and development guides.
Multimodal architecture with full support for text, image, audio, and video
Gemma 3n's biggest highlight is its native support for image, audio, video, and text input, producing natural-language text output. This release comes in two versions, E2B (roughly 2 billion effective parameters) and E4B (roughly 4 billion), which are highly compute-efficient yet deliver performance comparable to traditional 5-billion- and 8-billion-parameter models.
Furthermore, Gemma 3n adopts the new MatFormer (Matryoshka Transformer) architecture, which enables elastic inference: developers can freely switch between model scales through a Mix-n-Match approach, assembling a model version suited to the device's resources. The model can run smoothly in as little as 2GB or 3GB of device memory.
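The core Matryoshka idea behind Mix-n-Match can be sketched as a smaller feed-forward network nested inside the larger one's weights, so a sub-model can be sliced out at load time. The dimensions and function names below are purely illustrative, not Gemma's actual implementation:

```python
import numpy as np

# Illustrative MatFormer-style nesting: the small model's FFN is a prefix
# slice of the full model's weight matrices (hypothetical sizes).
rng = np.random.default_rng(0)
d_model, d_ff_full = 8, 32      # full-capacity FFN width
d_ff_small = 16                 # sliced, memory-saving FFN width

W_in = rng.standard_normal((d_model, d_ff_full))
W_out = rng.standard_normal((d_ff_full, d_model))

def ffn(x, width):
    """Run the FFN using only the first `width` hidden units."""
    h = np.maximum(x @ W_in[:, :width], 0.0)   # ReLU on the sliced hidden layer
    return h @ W_out[:width, :]

x = rng.standard_normal(d_model)
y_full = ffn(x, d_ff_full)    # full-capacity path
y_small = ffn(x, d_ff_small)  # sub-model path, same weights, less compute
print(y_full.shape, y_small.shape)
```

Because both paths share one set of weights, only one model needs to be stored; the device simply picks how much of it to use.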
Memory architecture redesigned for on-device use: PLE (Per-Layer Embeddings)
Gemma 3n uses a technique called PLE (Per-Layer Embeddings), which offloads a portion of the model's parameters to the CPU and system memory, keeping only the most critical Transformer weights in the AI accelerator's memory. This significantly improves memory efficiency and lets entry-level devices run inference at a quality approaching cloud-hosted models.
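The split can be sketched as follows: large per-layer embedding tables live in ordinary CPU/system memory, and each layer fetches only the tiny slice it needs. All names and shapes here are hypothetical; this is a sketch of the idea, not Gemma's implementation:

```python
import numpy as np

# Sketch of the PLE idea (illustrative only): bulky per-layer embedding
# tables stay CPU-side; the accelerator holds just the core layer weights.
rng = np.random.default_rng(1)
n_layers, vocab, ple_dim, d_model = 4, 100, 6, 8

# Large tables: conceptually resident in CPU/system memory.
ple_tables = rng.standard_normal((n_layers, vocab, ple_dim))
# Core transformer weights: conceptually resident on the accelerator.
layer_weights = rng.standard_normal((n_layers, d_model + ple_dim, d_model))

def run_layer(i, hidden, token_id):
    ple_slice = ple_tables[i, token_id]       # tiny per-token fetch from CPU memory
    x = np.concatenate([hidden, ple_slice])   # inject the per-layer embedding
    return np.tanh(x @ layer_weights[i])

hidden = rng.standard_normal(d_model)
for i in range(n_layers):
    hidden = run_layer(i, hidden, token_id=42)
print(hidden.shape)
```

The accelerator's footprint stays small because only a `ple_dim`-sized vector crosses over per layer, not the whole table.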
Faster long-text processing and speech translation: KV Cache and speech encoder upgrades
For long text and multimedia sequence inputs, Gemma 3n introduces a new KV Cache Sharing mechanism that shortens time-to-first-token and makes processing of video or audio streams more responsive. The speech module incorporates an encoder derived from Google's USM (Universal Speech Model), supporting both automatic speech recognition (ASR) and automatic speech translation (AST), with initial translation support between English and Spanish, French, Italian, Portuguese, and other languages.
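A minimal single-head attention decode loop shows the baseline mechanism that KV Cache Sharing builds on: keys and values for past positions are computed once and reused, so each new token only attends rather than recomputing. Gemma 3n's actual sharing scheme is more involved; this sketch uses illustrative names and sizes:

```python
import numpy as np

# Baseline KV-cache decode loop (illustrative): past keys/values are
# appended once and reused at every subsequent step.
rng = np.random.default_rng(2)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []

def decode_step(x):
    q = x @ Wq
    k_cache.append(x @ Wk)                # cache this position's key...
    v_cache.append(x @ Wv)                # ...and value, computed only once
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)           # attend over all cached positions
    w = np.exp(scores - scores.max())
    w /= w.sum()                          # softmax weights
    return w @ V

for _ in range(5):
    out = decode_step(rng.standard_normal(d))
print(out.shape, len(k_cache))
```

Without the cache, every step would recompute keys and values for the entire prefix, which is exactly the cost that dominates long-sequence and streaming workloads.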
New MobileNet-V5: Real-time image analysis on-device
For visual processing, Gemma 3n ships with a newly designed MobileNet-V5 vision encoder that supports multiple input resolutions (256x256, 512x512, and 768x768 pixels). Built on the MobileNet-V4 foundation with a multi-scale fusion architecture, it achieves up to a 13x speedup and a 4x reduction in memory usage on the Google Pixel Edge TPU, while also exceeding the accuracy of the SoViT baseline without distillation.
Gemma 3n represents a significant step in Google's AI-on-device strategy, strengthening its position in multimodal models and paving the way for future on-device computing. Going forward, the Gemma series is expected to keep pursuing smaller models with greater performance, enabling more native AI experiences on mobile devices.