17.09.2024

Wikimedia Deutschland develops semantic search for non-profit AI applications

Berlin, 17th September 2024 – Wikimedia Deutschland has launched a semantic search concept in collaboration with search experts from DataStax and Jina AI. The concept makes Wikidata’s freely licensed data available in an easier-to-use format for developers of AI applications. The aim is to simplify the development of open-source, non-profit AI applications and to contribute to a more reliable information ecosystem.

 

As an open knowledge graph with over 112 million human- and machine-readable entries, Wikidata is a treasure trove of data for developers and society. Thanks to the ongoing contributions of over 12,000 active editors, Wikidata’s data is diverse, well-maintained and constantly growing.

The need for access to large amounts of high-quality data has increased significantly in the last decade. Generative AI, in particular, requires vast amounts of training data, which is often scraped from the internet. This scraping requires staff and time that are primarily available to large commercial organizations. The result is a closed ecosystem for data utilization, which runs contrary to the ideals of open source. Wikidata wants to help open up this closed system by transforming its crowd-sourced, validated entries into an easy-to-access data source for open-source AI application development.

Furthermore, once Wikidata is integrated into more open-source machine learning workflows, the quality of the information ecosystem can improve: generative AI mistakes can be reduced, and the output of large language models could become more reliable. In the long run, the wider public could benefit from more reliable, Wikidata-based alternatives to commercial generative AI providers.

“We’re focused on helping developers who share our values. However, many developers find accessing Wikidata challenging, and our current methods don’t support the data volume required for some of the most recent generative AI development needs”, states Dr. Jonathan Fraine, Head of Software Development at Wikimedia Deutschland. He initiated the project together with Lydia Pintscher, Portfolio Lead Product Manager of Wikidata, who is convinced that better access to the data volume of Wikidata can be a game-changer for open-source generative AI communities. Pintscher says: “By providing high-quality data, we support the communities with their work and realization of new ideas that are not profit-driven but have the intention to serve humanity with valid information.”

Now, with the support of DataStax and Jina AI, Wikidata’s data will be transformed and made more convenient for AI developers as semantic vectors in a vector database. DataStax provides the vector database while Jina AI provides the open-source embedding model for vectorizing the text data.
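For readers who want a concrete picture of the principle, the following minimal Python sketch shows how a search over vectors works: every entry and every query is turned into a vector, and the entries whose vectors lie closest to the query vector are returned. It is an illustration only and not the project's implementation. The functions `embed` and `search` and the short entry texts are hypothetical, and the word-count "embedding" is a toy stand-in that only matches shared words, whereas a real embedding model also captures meaning. It uses neither the Jina AI model nor the DataStax database.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a word-count vector.

    Assumption for illustration only; a real embedding model returns
    dense vectors that capture meaning, not just shared words.
    """
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical short texts standing in for Wikidata entries.
ENTRIES = {
    "Q64": "Berlin capital city of Germany",
    "Q42": "Douglas Adams English writer and humorist",
    "Q2013": "Wikidata free knowledge graph hosted by Wikimedia",
}

# The "vector database" here is just a dictionary of precomputed vectors.
INDEX = {entry_id: embed(text) for entry_id, text in ENTRIES.items()}


def search(query: str, index: dict, top_k: int = 2) -> list:
    """Return the ids of the entries most similar to the query."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, vec), entry_id) for entry_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [entry_id for score, entry_id in scored[:top_k] if score > 0]


if __name__ == "__main__":
    print(search("capital of Germany", INDEX, top_k=1))
```

In a production setting, the dictionary would be replaced by a vector database and the word counts by a trained embedding model; the ranking by vector similarity is the part that carries over.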

The vectorisation will enable direct semantic analysis and could help facilitate the detection of vandalism in the knowledge graph. It will also make it simpler to use Wikidata in RAG (retrieval-augmented generation) applications, which can reduce AI mistakes by including current, verified facts in the results. Wikimedia Deutschland started creating the concept in December 2023. The first beta tests of a prototype are planned for 2025.

About Wikimedia Deutschland

Wikimedia Deutschland is a non-profit organisation with over 111,000 members and 180 employees that is committed to promoting freely available knowledge in the digital space. As the largest country representative of the international Wikimedia Movement, the organisation promotes the volunteer communities of Wikipedia and other Wiki projects in Germany. Wikimedia Deutschland also develops free software and the free Wikidata database, and is involved in political and educational activities to promote free access to knowledge and data.

 

Press Contact

Franziska Kelch

Communications Manager Political Framework

Franziska.kelch@wikimedia.de

00491577-135 49 52