CIS-A2K/Research/Understanding the data gaps in Wikidata concerning West Bengal

Background/Context

Wikidata is a collaborative online data repository built by volunteers from around the world and released into the public domain for anyone to use. It contains more than 67 million data items on different topics and is growing every second. These items, available in RDF format, are linked with each other, with other Wikimedia sister projects including Wikipedia, and with external websites and databases on different subjects. Despite the diverse topics it already covers, there are still huge data gaps, mostly concerning countries and regions of the Global South.

We chose West Bengal from among the Global South regions because volunteers here have been ingesting large amounts of data concerning this area into Wikidata over the last few years. In this research, we will find out whether any major data gaps remain in Wikidata regarding the state of West Bengal and, if so, what they are and why they exist.

Research Problem/Objectives

The objectives of the research are to understand:

Problems caused by the existing data gaps in Wikidata concerning West Bengal;
Strategies for filling the data gaps.

Proposed Methodology

Although Wikidata has more than 67 million items, searching this huge database is not difficult. Even answering extremely specific questions is not hard, because Wikidata is a structured database. This can be done using SPARQL, a query language designed to retrieve answers from triples, which is how Wikidata items are structured. Wikidata has a very powerful tool called the Wikidata Query Service (WDQS), where users can write SPARQL queries for their questions and get answers from Wikidata in a structured way. WDQS can also be used for data visualizations in the form of maps, graphs, bubble charts, image grids, treemaps, etc. In this research, we will identify data gaps concerning West Bengal using SPARQL queries in WDQS. The queries will cover datasets on 5 topics, which will be selected based on criteria such as the size and diversity of the dataset.
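
To illustrate the kind of query this methodology relies on, here is a minimal sketch in Python. It is not the query used in this research. It assumes that West Bengal is item Q1356 and that the relevant properties are P131 (located in the administrative territorial entity), P1435 (heritage designation) and P625 (coordinate location); these identifiers should be checked on Wikidata before use. The function names `build_gap_query` and `run_query` and the User-Agent string are hypothetical, chosen only for this example.

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumption: Q1356 is the Wikidata item for West Bengal.
WEST_BENGAL = "Q1356"


def build_gap_query(region_qid, missing_property, limit=100):
    """Build a SPARQL query listing items with a heritage designation,
    located in the given region, that lack a statement for
    `missing_property` (for example P625, coordinate location)."""
    return f"""
SELECT ?item ?itemLabel WHERE {{
  ?item wdt:P131* wd:{region_qid} ;
        wdt:P1435 ?designation .
  FILTER NOT EXISTS {{ ?item wdt:{missing_property} ?value . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "bn,en" . }}
}}
LIMIT {limit}
""".strip()


def run_query(query):
    """Send the query to WDQS and return the list of result bindings.
    Requires network access; large queries may time out."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    request = urllib.request.Request(
        url,
        # Hypothetical User-Agent; replace with one identifying your tool.
        headers={"User-Agent": "data-gap-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]
```

Calling `run_query(build_gap_query(WEST_BENGAL, "P625"))` would return items that match the pattern but have no coordinates, which is one possible way to make a data gap visible. The same query text can also be pasted directly into the WDQS editor.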

Research Outputs

A short essay analysing Wikidata gaps, based on the visualizations created from SPARQL queries.

Timeline

December: Concept note finalised
January: Literature review
February - March: Queries and Data visualisations completed
April: Final essay completed, along with a short blog-post in Bengali

Updates

June 2020: For this project, data gaps related to heritage structures in West Bengal were analysed through SPARQL queries. Interviews were also conducted with Wikidata editors in India to gain a better understanding of some of the limitations and challenges of the work on this topic. A final draft in the form of a blog post has been submitted to the team and to the persons interviewed for their feedback. Once the feedback has been incorporated, the blog post is expected to be published.
November 2020: After meticulous internal review and feedback, the report has been published on Meta and on the CIS blog.