As generative AI applications grow in popularity, the quality and openness of knowledge sources are becoming key to driving innovation. Wikimedia Germany has announced the Wikidata Embedding Project, which will make Wikidata's vast knowledge base more suitable for use with generative AI models, lower the barrier for small and medium-sized developers to adopt it, and reduce the risk that generative AI technology is monopolized by a few technology giants.
Wikipedia's data has previously been structured through Wikidata, which encompasses approximately 120 million entries and is, in theory, easier for machines to read. However, because generative AI models work better with natural-language content than with raw structured data, Wikidata is difficult for them to use directly. The newly launched embedding project aims to convert Wikidata into a "vector" format that AI models can work with.
Vectorization maps words and concepts into a coordinate space, where the distance between two points reflects how closely they are related in meaning. For example, "dog" and "puppy" will sit close together, while "dog" and "bank account" will sit far apart because they are largely unrelated. This conversion allows AI to better capture the meaning and context of the data, thereby improving the accuracy of natural language processing.
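To make the idea of "closeness" concrete, here is a minimal sketch in Python that compares vectors with cosine similarity, a common way to measure how similar two embeddings are. The three-dimensional vectors are hand-picked for illustration only; they are assumptions, not output from the project or from any real embedding model, which would produce vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, chosen by hand so that related terms point
# in similar directions. They do not come from a real model.
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "puppy": [0.85, 0.9, 0.15],
    "bank account": [0.05, 0.1, 0.95],
}

dog_puppy = cosine_similarity(toy_vectors["dog"], toy_vectors["puppy"])
dog_bank = cosine_similarity(toy_vectors["dog"], toy_vectors["bank account"])

print(f"dog vs puppy:        {dog_puppy:.3f}")
print(f"dog vs bank account: {dog_bank:.3f}")
```

With these toy values, the score for "dog" and "puppy" comes out much higher than the score for "dog" and "bank account", which is the behavior the paragraph above describes.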
More importantly, AI models have typically been trained on static snapshots of data, which makes it hard for them to reflect later updates to Wikipedia's content in a timely way. Through this project, Wikidata also supports retrieval-augmented generation (RAG), which lets AI models retrieve the latest data at query time, improving the timeliness and reliability of their answers.
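The general shape of retrieval-augmented generation can be sketched in a few lines of Python. This is an illustrative toy, not the project's implementation: `ToyVectorStore`, `embed`, and `build_prompt` are hypothetical names, the bag-of-words "embedding" stands in for a real embedding model, and an in-memory list stands in for a real vector database. The point is only that retrieval happens at query time, so an entry added after the model was trained can still reach the answer.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self):
        self.entries = []  # list of (text, vector) pairs

    def add(self, text):
        self.entries.append((text, embed(text)))

    def search(self, query, k=1):
        """Return the k stored texts most similar to the query."""
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


def build_prompt(question, store, k=1):
    """Retrieve context at query time and place it in front of the question."""
    context = "\n".join(store.search(question, k))
    return f"Context:\n{context}\n\nQuestion: {question}"


store = ToyVectorStore()
store.add("Berlin is the capital of Germany.")
store.add("Douglas Adams wrote The Hitchhiker's Guide to the Galaxy.")

prompt = build_prompt("What is the capital of Germany?", store)
print(prompt)

# An entry added later is retrievable immediately, with no retraining.
store.add("DataStax is a subsidiary of IBM.")
fresh = store.search("Which company is DataStax a subsidiary of?")
print(fresh[0])
```

In a full pipeline, the prompt built here would be passed to a language model, which answers using the retrieved context rather than only what it memorized during training.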
Wikimedia Germany emphasized in a press release that the project's core goal is to "enable AI models to access high-quality information to enhance the credibility of their outputs." The organization also noted that most AI systems currently rely on opaque, proprietary data that lacks transparency and verifiability. Opening up vectorized Wikidata will not only promote fairness in AI development but also reduce the development burden on smaller teams, helping to keep generative AI technology from being monopolized by a few tech giants.
In practice, vectorizing massive amounts of data requires extremely high computing and storage resources, which puts it out of reach for many small and medium-sized enterprises and independent developers. To address this, the Wikidata Embedding Project is a collaboration with German artificial intelligence startup Jina AI and IBM subsidiary DataStax. Jina AI will develop the vectorization system, while DataStax will store the data in its Astra DB vector database. This means developers can use Wikidata's knowledge base directly in their applications without having to build complex infrastructure themselves.
As Wikimedia Germany stated, "Powerful AI shouldn't be monopolized by a few companies." This project isn't just a technological upgrade; it's a declaration in favor of open, collaborative AI development. As generative AI becomes more widespread, this open, shared approach may become a key step toward a more diverse AI ecosystem.
