To keep AI models developed in Taiwan from speaking with a strong Beijing accent or lacking local cultural awareness, the Ministry of Digital Affairs (MODA) recently announced the beta launch of the "Taiwan Sovereign AI Corpus."
The first wave of releases draws on more than 200 government agencies, including the Ministry of Culture, the Ministry of Education, the Hakka Affairs Council, the Council of Indigenous Peoples, and the Ministry of Transportation and Communications. It comprises over 2,000 datasets totaling approximately 6 million tokens of high-quality Traditional Chinese data, covering fields such as culture and the arts, geography, language, medicine, and transportation. Applications from industry, academia, and research institutions are open as of today.
Why do we need "sovereign AI"?
Hou Yi-hsiu, Deputy Minister of Digital Affairs, stated that every country is developing AI, and the real competitive advantage lies not in computing power (anyone with money can buy GPUs) but in "data" and "talent." Taiwan's culture, language, and values are unique: if we do not build this ourselves, no other country or tech giant will do it for us.
Chuang Ming-fen, Director of the Data Innovation Division, cited a classic example: the word "土豆" (tǔdòu). In China it means "potato" (馬鈴薯), but in Taiwan it refers to the "peanut" (落花生). If an AI is fed the wrong data, the trained model will give wrong answers and can even muddle its cultural understanding. Raising the proportion of Traditional Chinese in the training data is crucial if large language models (LLMs) are to truly understand Taiwan's political, economic, cultural, and value systems.
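To make the point concrete, here is a minimal sketch of the kind of corpus audit the "土豆" example implies: a reviewer-facing check that flags terms whose meaning differs across the strait, so a human can confirm the Taiwanese sense is intended. The term list and sample lines are illustrative, not taken from the actual corpus.

```python
# Terms whose everyday meaning differs between China and Taiwan usage.
# (Illustrative entries; a real audit list would be far larger.)
CROSS_STRAIT_TERMS = {
    "土豆": {"China": "potato (馬鈴薯)", "Taiwan": "peanut (落花生)"},
    "窩心": {"China": "upset, aggrieved", "Taiwan": "heartwarming"},
}

def flag_ambiguous_terms(lines: list[str]) -> list[tuple[int, str]]:
    """Return (line_number, term) pairs for lines containing ambiguous
    terms, so a reviewer can verify the intended (Taiwanese) sense."""
    hits = []
    for i, line in enumerate(lines, start=1):
        for term in CROSS_STRAIT_TERMS:
            if term in line:
                hits.append((i, term))
    return hits

sample = ["阿嬤在院子裡種土豆", "這句話讓人覺得很窩心"]
for line_no, term in flag_ambiguous_terms(sample):
    print(f"line {line_no}: '{term}' -> {CROSS_STRAIT_TERMS[term]}")
```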
Two main categories of data; an "ID" is required to apply
The currently available corpus is divided into two parts:
• Open data: open to the public and freely downloadable.
• Licensed data (restricted): for AI training purposes only; application and approval are required.
To ensure the data is not misused, external parties wishing to use the licensed data must verify their identity with a natural-person certificate or business certificate and state their intended use. MODA takes approximately 7 business days to review an application before issuing an authorized account for download. Files are currently provided in the common PDF and JSON formats, in line with the international FAIR data-sharing principles (Findable, Accessible, Interoperable, Reusable).
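For developers who get an approved download, working with the JSON side of the corpus would look roughly like the sketch below. The article confirms only that datasets come as PDF and JSON; the file name and the "text" field are assumptions for illustration.

```python
import json
from pathlib import Path

def load_corpus_records(path: str) -> list:
    """Load one downloaded JSON dataset; assumes a top-level list of records."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)

def rough_token_count(records: list, text_key: str = "text") -> int:
    """Crude size estimate: for Chinese text, character count is a common
    first approximation; a real tokenizer will give different numbers."""
    return sum(len(r.get(text_key, "")) for r in records)

# Usage (hypothetical file name from an approved download):
# records = load_corpus_records("moda_hakka_place_names.json")
# print(len(records), "records, ~", rough_token_count(records), "tokens")
```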
Solving the most troublesome "copyright" problem: One-time licensing
For developers, the biggest fear in training AI is stepping on copyright landmines. In response, MODA and the Intellectual Property Office of the Ministry of Economic Affairs have jointly drafted a dedicated set of license terms.
The "one-time license" model allows the provided corpus to be legally used for AI training (including reproduction, modification, and editing) with the licensor's consent. In return, the licensee (developer) is obligated to indicate the source of the data, and the produced content must be marked as AI-generated. Furthermore, the licensee must ensure that the training results are not "substantially similar" to the original corpus in order to protect the market value of the original creator.
Analysis: Data is the oil of the AI era, but "quantity" and "quality" remain challenges
In my view, MODA's launch of the sovereign AI corpus puts in place a crucial piece of the infrastructure puzzle for Taiwan's AI development.
Over the past year we have seen many Traditional Chinese models fine-tuned from Llama or GPT. These models converse fluently, but they often falter on Taiwanese law, history, indigenous culture, or local terminology. Official intervention to inject high-quality, manually reviewed government data can indeed markedly improve the "purity" of domestically developed models.
However, 6 million tokens is still a drop in the ocean next to the training volume of a modern LLM (often measured in trillions of tokens); the quick scale check below makes the gap concrete. The next challenge is expanding from the "central government" to "local governments" and even "private enterprises." Only when more private-sector data holders (news media, publishers, academic institutions) are willing to join under reasonable licensing and profit-sharing mechanisms can this corpus truly become the brain of Taiwan's AI, and not just a database of government regulations.
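A back-of-the-envelope calculation, taking one trillion tokens as a conservative stand-in for a modern pretraining run:

```python
# Scale check: the corpus's ~6 million tokens against a trillion-token run.
corpus_tokens = 6_000_000
pretrain_tokens = 1_000_000_000_000  # "often measured in trillions"
print(f"{corpus_tokens / pretrain_tokens:.6%}")  # 0.000600%
```

That is six ten-thousandths of one percent, which is why the corpus's value lies in targeted fine-tuning and cultural grounding rather than in pretraining bulk.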
