The Google Research team has announced VaultGemma, which it describes as the most capable large language model trained from scratch with differential privacy (DP) protection. The model weights have been released simultaneously on Hugging Face and Kaggle, allowing developers and academics to freely use, verify, and build on them.

As generative AI becomes more commonplace, privacy protection has become a crucial issue in AI development. Differential privacy reduces the risk of a model memorizing individual training examples by injecting calibrated noise during training. However, it also brings challenges: reduced training stability, the need for much larger batch sizes, and higher computational cost.
Google said that the research, conducted in collaboration with DeepMind, establishes for the first time "scaling laws for differentially private models," which accurately predict the optimal training configuration under given compute, privacy, and data budgets, providing an important guide for training high-performance differentially private models.

VaultGemma is a new one-billion-parameter model built on the Gemma architecture. Through systematic experiments, the Google research team quantified the relationship between model size, number of training iterations, and the noise-batch ratio, concluding that the optimal strategy for differentially private training is to train smaller models with much larger batch sizes. This strategy lets VaultGemma maintain strong privacy while achieving utility comparable to non-differentially-private models from roughly five years ago.
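One way to see why larger batches help under DP: the privacy noise is calibrated to the clipping norm of a single example, so when the noisy gradient sum is averaged over the batch, the noise shrinks with batch size. The sketch below is a minimal numeric illustration of that intuition, not Google's actual scaling-law formula; the function name and the simple σ·C/B expression are illustrative assumptions.

```python
def effective_noise_std(sigma: float, clip_norm: float, batch_size: int) -> float:
    """Std of the DP noise on the *averaged* gradient, assuming Gaussian
    noise with std sigma * clip_norm is added to the summed gradient
    before dividing by the batch size (illustrative model only)."""
    return sigma * clip_norm / batch_size

# Doubling the batch size halves the effective noise on the update,
# which is one driver of the "smaller models, larger batches" strategy.
print(effective_noise_std(1.0, 1.0, 1024))
print(effective_noise_std(1.0, 1.0, 2048))
```

Under this simplified model, compute spent on a bigger batch directly buys a cleaner gradient signal, which can be a better trade than spending it on more parameters.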
Technically, VaultGemma uses a scalable DP-SGD algorithm and an improved Poisson sampling method that keeps batch sizes consistent while preserving strong privacy guarantees. The resulting model provides sequence-level differential privacy (ε ≤ 2.0, δ ≤ 1.1e-10), ensuring that even targeted queries cannot make the model reproduce any single training sequence. Google also ran memorization tests, which found no detectable memorization of training data.
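The core of DP-SGD is easy to state: clip each per-example gradient to a fixed norm, sum, and add Gaussian noise scaled to that clipping norm. The sketch below illustrates one such step with NumPy; it is a simplified teaching version (the function name and defaults are my own), not VaultGemma's production implementation, and it omits the Poisson subsampling and privacy accounting that the real training pipeline requires.

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """One simplified DP-SGD update (illustrative, not Google's code):
    clip each per-example gradient to clip_norm, sum them, add Gaussian
    noise with std noise_multiplier * clip_norm, then average."""
    rng = np.random.default_rng(0) if rng is None else rng
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale down any gradient whose L2 norm exceeds the clip bound.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)

# Example: one gradient well over the clip bound, one under it.
grads = [np.array([3.0, 4.0]), np.array([0.1, 0.0])]
update = dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=1.0)
```

Clipping bounds any single example's influence on the update, which is what makes the added noise sufficient to yield a formal (ε, δ) guarantee.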

Google noted that while differentially private models still perform slightly below their fully non-private counterparts, the gap has narrowed, and clear research paths exist for further improvement. VaultGemma not only demonstrates Google's long-term commitment to privacy-preserving AI but also provides a reproducible, verifiable benchmark for industry and academia, driving the development of the next generation of privacy-focused AI.

For developers, the VaultGemma release includes not only pretrained weights but also a comprehensive technical report and optimization recommendations, allowing businesses and research teams to tailor models to their compute and privacy needs. This suggests that more organizations will be able to adopt AI with lower privacy risk, meeting regulatory requirements and protecting user data while still benefiting from high-performing models.
Finally, Google emphasized that VaultGemma is only a first step. Going forward, it will continue to improve the differential-privacy training mechanism, further close the performance gap, and lower the compute barrier, so that AI that is "both safe and smart" becomes the norm in the market.
Comparison of parameters and performance of VaultGemma, non-differentially-private Gemma, and the early GPT-2 model:

| | VaultGemma 1B | Gemma 3 1B | GPT-2 1.5B |
| --- | --- | --- | --- |
| Parameter count | 1 billion | 1 billion | 1.5 billion |
| Privacy protection | Differential privacy (ε ≤ 2.0, δ ≤ 1.1e-10) | No differential privacy | No differential privacy |
| Training method | DP-SGD with optimized Poisson sampling | Standard non-DP training | Conventional large-batch non-DP training |
| Performance (relative to non-DP models) | Close to non-DP models of ~5 years ago (roughly GPT-2 level) | Higher than VaultGemma | Below modern non-DP models, but close to VaultGemma |
| Memorization risk | No detectable memorization | Some risk of memorization | High risk of memorization (repeatedly demonstrated) |
| Release status | Released, open weights (Hugging Face & Kaggle) | Released, open weights | Historical model, publicly downloadable |


