Research talk:Automated classification of article importance/Work log/2017-04-17
Monday, April 17, 2017
Today I'll follow up with WPMED on candidates for reassessment, finish looking into candidate WikiProjects, update our code so that it handles inlinks with redirects properly, and start gathering global data.
Candidate WikiProjects
I altered my code slightly to handle capitalization better, and there are now only 207 projects for which a project page is not found. Those I can handle manually. I have also decided to use the category structure to find candidate projects, mainly because it makes it very easy to identify exactly which articles are within a project's scope, but also because it restricts our data gathering to projects that we know use importance ratings. If we instead went through Wikidata, we would get all projects and then have to skip those that lack the necessary categories. Lastly, the current approach also picks up task forces, and it might be easier to go from a task force to its parent WikiProject (e.g. through redirects and page title parsing) than the other way around.
Some observations after manually inspecting categories that did not have a parent WikiProject in the dataset:
- Singular/plural is a common reason for not identifying the associated WikiProject. For example, there's Category:Top-importance bridge articles but WikiProject Bridges.
- Some WikiProjects add "-related" to their main topic in the category name, but that naming convention is also used by several GLAM projects. For example, there's Category:Top-importance Israel-related articles and Wikipedia:WikiProject Israel, but also Category:Top-importance Johns Hopkins University-related articles and Wikipedia:GLAM/Johns Hopkins University.
- Several WikiProjects have categories that are intersections of importance and article quality categories. WikiProject Russia has several of these, for example Category:Top-importance FA-Class Russia articles and Category:Top-importance GA-Class Russia articles.
- Including unassessed articles in the dataset is a problem. It means we pick up many task forces and similar groups that do not have importance-rated articles, and many of these in turn required manual cleanup.
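The naming mismatches observed above suggest how candidate project pages could be generated from a category name. The following is only a minimal sketch and not the code used in this project: the function name `candidate_project_titles`, the regular expression, and the naive "append an s" pluralization are all assumptions made for illustration.

```python
import re

# Assumed category naming pattern: "Category:<Rating>-importance <topic> articles".
CATEGORY_RE = re.compile(
    r"^(?:Category:)?(?P<rating>[A-Za-z]+)-importance (?P<topic>.+) articles$"
)


def candidate_project_titles(category):
    """Return page titles that might be the project behind an importance category.

    Hypothetical helper: it only generates candidate titles. Whether a
    candidate page actually exists still has to be checked separately.
    """
    match = CATEGORY_RE.match(category)
    if match is None:
        return []

    topic = match.group("topic")
    related = topic.endswith("-related")
    if related:
        # "Israel-related" -> "Israel"
        topic = topic[: -len("-related")]

    # Try the topic as written and with its first letter capitalized.
    variants = []
    for variant in (topic, topic[:1].upper() + topic[1:]):
        if variant not in variants:
            variants.append(variant)

    titles = []
    for variant in variants:
        # Naive singular/plural handling: also try the topic with a trailing "s".
        for form in (variant, variant + "s"):
            titles.append("Wikipedia:WikiProject " + form)

    if related:
        # The "-related" convention is also used by GLAM projects.
        titles.append("Wikipedia:GLAM/" + topic)

    return titles
```

For example, `candidate_project_titles("Category:Top-importance bridge articles")` includes `Wikipedia:WikiProject Bridges` among its candidates. Intersection categories such as the FA-Class ones would still produce nonsensical topics, so they need to be filtered out before this step.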
I've cleaned the dataset by removing all categories with "-class" or "unassessed" in the name. Using R, I also removed all categories that contained only unassessed pages.
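The cleanup was done partly in R; the sketch below shows the same two filters in Python purely as an illustration. The data layout (a dict mapping each category name to a list of per-page ratings), the rating string "unassessed", and the function names are assumptions, not the structure of the actual dataset.

```python
def has_excluded_name(category):
    """True if the category name contains "-class" or "unassessed" (case-insensitive)."""
    lowered = category.lower()
    return "-class" in lowered or "unassessed" in lowered


def clean_categories(categories):
    """Drop categories by name, then drop those holding only unassessed pages.

    `categories` is assumed to map a category name to the list of ratings
    of the pages it contains. Empty categories are kept here, which is a
    choice made for this sketch only.
    """
    cleaned = {}
    for name, ratings in categories.items():
        if has_excluded_name(name):
            continue
        if ratings and all(r.lower() == "unassessed" for r in ratings):
            continue
        cleaned[name] = ratings
    return cleaned
```

Matching on the lowercased name means both "-class" and "-Class" spellings are caught, which covers intersection categories like the WikiProject Russia ones.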