About Wikidata Quality Toolkit

Wikidata is a significant data asset launched in 2012 by the Wikimedia Foundation, containing machine-readable factual information about over 100 million topics. It is widely used in applications such as web search engines, virtual assistants, and fact checkers, and in over 800 projects in the Wikimedia ecosystem, including Wikipedia.

Because Wikidata is so widely reused, the quality of its data is crucial, especially given its extensive use in Wikipedia articles, which receive 24 billion daily visits. Poor-quality data can have detrimental effects, particularly when it is used to train AI systems, where it can reinforce biases and stereotypes.

To address these challenges, we are developing the Wikidata Quality Toolkit (WQT). This toolkit aims to support a diverse set of editors in curating and validating Wikidata records at scale. It draws on research findings and conceptual prototypes from AI, data management, and social computing, responding to the data assurance needs of the Wikidata community.

The WQT project focuses on reassessing data assurance requirements in the age of large language models (LLMs), improving and integrating existing code, evaluating the toolkit extensively with the Wikidata community, and developing a sustainable research software strategy.

The toolkit will be open source, providing data, software, and guidance to the community, researchers, and AI developers. Besides directly benefiting the community of 24,000 editors, it has significant economic and societal implications for downstream AI applications that use Wikidata.