Data shows that Asia currently has over 2,300 local languages, whose speakers account for approximately 32% of the global population. However, most of these languages lack digital resources and face marginalization or extinction. Google is working to bring more local languages into the digital world through a series of AI projects.
Project Vaani: 21,500 hours of voice data from across India
Three years ago, Google and the Indian Institute of Science launched "Project Vaani," which aims to capture language variants from 773 districts across India. To date, 21,500 hours of audio and 835 hours of transcription data have been collected, covering 86 languages from 112,000 speakers.
This data is not restricted to specific projects: it is made freely available to the public through Bhashini, India's national language mission, and the Hugging Face platform, promoting the development and application of more AI models.
The project leader explained that languages in India vary even within a single state. Bihar, for example, one of India's most populous states, has numerous local dialects and their variants. Population mobility further complicates these differences, so capturing subtle variations is crucial to ensuring that services are usable across India.
Project Vaani has completed its first two phases of data collection, covering 160 districts, and is collaborating with Megdap, Karya and other partners to continue expanding the scale of the corpus.
Project SEALD and Aquarium: a database for Southeast Asia's 1,200 languages
Southeast Asia comprises 11 countries, a population of over 650 million, and some 1,200 languages; Indonesia alone has over 700 local languages. To cope with such a complex linguistic environment, Google and AI Singapore are jointly promoting Project SEALD, whose core tool is the Aquarium platform.
The goal of the Aquarium platform is to build a complete catalog of Southeast Asian language data, allowing anyone to contribute and use the data, and to promote AI tools and applications that meet local needs.
The project team has also developed strategies for low-resource and endangered languages, including collaborating with local institutions to digitize paper or oral sources and verify them with native speakers. For languages nearing extinction, audio recordings and transcriptions are elicited from native speakers through image or text prompts and stored in a corpus.
CHAD 2: Breaking the language barrier in Japanese comedy with AI
Language AI not only preserves content but also promotes cultural export. Yoshimoto Kogyo, Japan's largest entertainment agency, partnered with Google to develop the CHAD 2 system, based on Gemini 2.0 Flash and designed specifically for translating "お笑い" (owarai, Japanese comedy).
Once a video is uploaded, CHAD 2 automatically generates Chinese, English, and Korean subtitles. Its transcription and translation accuracy reaches 90%, well above the 60%-75% of general-purpose models, while shortening the translation process from months to minutes.
The system includes over 200 comedy-specific dictionaries, enabling it to handle cultural allusions and punchlines. Simply adding more dictionaries will allow it to expand into anime, drama, or sports translation in the future. Yoshimoto Kogyo is also working to commercialize the system so that global audiences can instantly understand the punchlines of Japanese comedy.
A future that bridges the digital divide through AI
Whether it is Project Vaani's work on Indian dialects, SEALD's coverage of Southeast Asian languages, or CHAD 2's cross-cultural applications, AI is becoming a crucial tool for language preservation and cultural dissemination. As data scales expand and models evolve, the language digitization drive led by Google will enable more Asian languages to emerge from the brink of silence and gain a place in the global digital world.
Mozilla has a similar plan
Similar efforts include Common Voice, the open-source speech-recognition data project Mozilla has promoted since July 2017. The project has accumulated 7,226 hours of voice content, and the addition of 14 more niche languages brought the number of languages included to 54. In late February of this year, it announced the addition of 8 Taiwanese Indigenous languages, including Atayal, Bunun, Paiwan, Rukai, Wanshan, Maolin, Seediq and Sakilaya, with a cumulative data length of more than 60 hours. The platform now covers more than 200 languages worldwide, including Taiwanese Mandarin and Taiwanese Hokkien.