mashdigi-Technology, new products, interesting news, trends

Ministry of Digital Affairs launches the Beta version of the "Taiwan Sovereign AI Corpus," releasing 6 million tokens of Traditional Chinese data in the first wave
It aggregates data from more than 200 government agencies, including those responsible for culture and transportation, and uses a real-name application system; the goal is to correct cultural bias such as reading 土豆 (peanut in Taiwan) as "potato."

Author: Mash Yang
December 2025, 12 - Updated on December 24, 2025
in Market dynamics, App, Life, network, software

To keep AI models developed in Taiwan from speaking with a heavy Beijing accent or lacking local cultural awareness, the Ministry of Digital Affairs (MODA) recently announced the Beta launch of the "Taiwan Sovereign AI Corpus."

The first wave draws on more than 200 government agencies, including the Ministry of Culture, the Ministry of Education, the Hakka Affairs Council, the Council of Indigenous Peoples, and the Ministry of Transportation and Communications. It comprises more than 2,000 datasets totaling approximately 6 million tokens of high-quality Traditional Chinese data, covering culture and the arts, geography, language, medicine, and transportation. Industry, academic, and research institutions can apply for access starting today.

Why do we need "sovereign AI"?

Hou Yi-hsiu, Deputy Minister of Digital Affairs, said that every country is developing AI, and that the real competitive advantage is not computing power (GPUs can be bought by anyone with the money) but "data" and "talent." Because Taiwan has its own culture, language, and values, no other country or tech giant will build this for us if we do not do it ourselves.

Chuang Ming-fen, Director of the Data Innovation Division, cited a classic example: the word 土豆. In mainland Chinese usage, 土豆 means "potato" (马铃薯); in Taiwan, it means "peanut" (落花生). If an AI is fed the wrong data, the trained model will give wrong answers and can even confuse cultural meanings. Raising the share of Traditional Chinese and Classical Chinese text is crucial if Large Language Models (LLMs) are to truly understand Taiwan's political, economic, and cultural systems and its values.

Two categories of data; applications require "ID" verification

The currently available corpus is divided into two parts:

• Open Data: Open and freely downloadable.

• Licensed data (Restricted): For AI training purposes only; application and approval are required.

To ensure the data is not misused, outside parties who want the restricted data must verify their identity with a natural person certificate or a business certificate and state their intended use. MODA takes approximately 7 business days to review an application before issuing an authorized account for downloads. Files are currently provided in the common PDF and JSON formats, in line with the FAIR principles for international data sharing (findable, accessible, interoperable, and reusable).
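The access rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Tier`, `Application`, `add_business_days`, and `earliest_access` are hypothetical and do not correspond to any MODA system or API, and the business-day count ignores public holidays, which a real review calendar would not.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class Tier(Enum):
    OPEN = "open"              # open data: freely downloadable
    RESTRICTED = "restricted"  # AI training only; application and approval required


@dataclass
class Application:
    """Hypothetical record of a request for corpus access."""
    tier: Tier
    identity_verified: bool  # natural person or business certificate checked
    purpose: str             # stated purpose of use
    submitted: date


def add_business_days(start: date, days: int) -> date:
    """Advance `start` by `days` weekdays (Mon-Fri). Public holidays are ignored."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


def earliest_access(app: Application, review_days: int = 7) -> Optional[date]:
    """Return the earliest date access could begin, or None if the request is incomplete."""
    if app.tier is Tier.OPEN:
        return app.submitted  # no review step for open data
    if not app.identity_verified or not app.purpose.strip():
        return None  # restricted data needs verified identity and a stated purpose
    return add_business_days(app.submitted, review_days)


if __name__ == "__main__":
    request = Application(Tier.RESTRICTED, True, "LLM training", date(2025, 12, 24))
    print(earliest_access(request))
```

The default of 7 review days reflects the approximate figure quoted above, so the result is best read as an estimate and not as a guaranteed turnaround.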

Solving the most troublesome "copyright" problem: One-time licensing

For developers, the biggest fear when training AI is stepping on a copyright landmine. In response, the Ministry of Digital Affairs and the Intellectual Property Office of the Ministry of Economic Affairs have jointly developed dedicated licensing terms.

The "one-time license" model allows the provided corpus to be legally used for AI training (including reproduction, modification, and editing) with the licensor's consent. In return, the licensee (developer) is obligated to indicate the source of the data, and the produced content must be marked as AI-generated. Furthermore, the licensee must ensure that the training results are not "substantially similar" to the original corpus in order to protect the market value of the original creator.

Analysis: Data is the oil of the AI era, but "quantity" and "quality" remain challenges.

In my opinion, the Ministry of Digital Affairs' launch of the Sovereign AI Corpus is a crucial piece of the infrastructure puzzle for Taiwan's AI development.

Over the past year, we've seen many Traditional Chinese models fine-tuned from Llama or GPT. While these models converse fluently, they often falter on Taiwanese law, history, indigenous culture, or local terminology. Official intervention to bring in high-quality, manually reviewed government data can indeed significantly improve the "purity" of domestically developed models.

However, 6 million tokens is still a drop in the ocean compared to the training volume of modern LLMs (which often starts at trillions of tokens). The future challenge lies in expanding from the "central government" to "local governments" and even "private enterprises." Only when more private-sector data holders (such as news media, publishers, and academic institutions) are willing to join under reasonable licensing and profit-sharing mechanisms can this corpus truly become the brain of Taiwan's AI, and not just a database of government regulations.
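To put the "drop in the ocean" comparison in numbers, here is a rough calculation. The one-trillion-token training budget is an assumed round figure chosen for illustration, not a number from MODA or any specific model, and `corpus_share` is a hypothetical helper name.

```python
def corpus_share(corpus_tokens: int, training_tokens: int) -> float:
    """Fraction of a training budget that a corpus of the given size would cover."""
    return corpus_tokens / training_tokens


CORPUS_TOKENS = 6_000_000             # first-wave size reported in the article
TRAINING_TOKENS = 1_000_000_000_000   # assumption: a one-trillion-token training run

if __name__ == "__main__":
    share = corpus_share(CORPUS_TOKENS, TRAINING_TOKENS)
    print(f"Share of training budget: {share:.4%}")                    # 0.0006%
    print(f"Budget is {TRAINING_TOKENS / CORPUS_TOKENS:,.0f}x larger")  # 166,667x
```

Under that assumption the first wave covers about 0.0006% of a single training run, which is why the corpus looks better suited to fine-tuning or evaluation than to pretraining on its own.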

Tags: AI, Open Data, Artificial intelligence, Taiwan Sovereign AI Corpus, Ministry of Digital Affairs, Traditional Chinese
Mash Yang

Founder and editor of mashdigi.com, and student of technology journalism.

Copyright © 2017 mashdigi.com
