Research talk:Automated classification of article importance/Work log/2017-03-20
Monday, March 20, 2017
Today I will work on the WPMED classifier, looking at the articles that the new and improved classifier predicts incorrectly. Once that is done, I want to sketch out a post to WPMED so we can start chatting about what importance means. Lastly, I'll go through our list of sources of signal and start making some decisions about where to move next.
WPMED prediction errors
As we did last week, I'll generate lists of what are perhaps the most interesting articles.
WPMED disambiguation pages
How many of the pages in my dataset are actually disambiguation pages? I need to go figure that out!
I ran this SQL query on Quarry to get a TSV of all disambiguation pages in WikiProject Medicine. There are 108 of them in total. Out of these, 18 have a different prediction from their actual WPMED importance rating. All of them are predicted to be Low-importance, 17 are rated Mid-importance, and one (Drug use) is rated High-importance. Only the last article shows up in our lists.
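The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the code actually used: the column names (`page_title`, `rating`, `predicted`) and the layout of both TSV files are assumptions, and the toy rows are placeholders apart from Drug use, whose rating and prediction are taken from the text.

```python
import csv
import io
from collections import Counter


def mismatched_disambig(disambig_tsv, predictions_tsv):
    """Return prediction rows for disambiguation pages whose predicted
    importance differs from the WPMED rating.

    Both arguments are file-like objects holding TSV data with a header.
    Column names are hypothetical, chosen for this sketch.
    """
    disambig_titles = {
        row["page_title"] for row in csv.DictReader(disambig_tsv, delimiter="\t")
    }
    mismatches = []
    for row in csv.DictReader(predictions_tsv, delimiter="\t"):
        is_disambig = row["page_title"] in disambig_titles
        if is_disambig and row["predicted"] != row["rating"]:
            mismatches.append(row)
    return mismatches


# Toy input standing in for the Quarry TSV and the classifier output.
disambig = io.StringIO("page_title\nDrug_use\nPlaceholder_page\n")
preds = io.StringIO(
    "page_title\trating\tpredicted\n"
    "Drug_use\tHigh\tLow\n"
    "Placeholder_page\tLow\tLow\n"
    "Placeholder_article\tTop\tHigh\n"
)

rows = mismatched_disambig(disambig, preds)
# Tally the mismatches by their actual WPMED rating.
by_rating = Counter(r["rating"] for r in rows)
```

Tallying `by_rating` on the real data is what would give the breakdown into Mid- and High-importance ratings reported above.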
Inspecting the data, I find that all of them have a low number of inlinks and a reasonably low number of views, and often none of the inlinks come from WPMED. I suspect the latter is because WPMED generally cleans up their articles and makes sure they do not link to disambiguation pages. In other words, a rating of Low-importance seems reasonable (although we might discuss why 18 of these appear to have importance ratings).
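A quick way to do this kind of inspection is to summarise the relevant features for the mismatched pages. The sketch below assumes each page is a dict with hypothetical keys `n_inlinks`, `n_views`, and `n_wpmed_inlinks`; the values are made-up placeholders, not numbers from the actual dataset.

```python
from statistics import median


def summarize(pages):
    """Summarise inlink and view counts for a list of page records.

    The keys used here are assumptions made for illustration only.
    """
    return {
        "median_inlinks": median(p["n_inlinks"] for p in pages),
        "median_views": median(p["n_views"] for p in pages),
        # Number of pages with no inlinks at all from WPMED articles.
        "no_wpmed_inlinks": sum(1 for p in pages if p["n_wpmed_inlinks"] == 0),
    }


# Placeholder records, not real data.
toy_pages = [
    {"title": "Placeholder A", "n_inlinks": 3, "n_views": 12.5, "n_wpmed_inlinks": 0},
    {"title": "Placeholder B", "n_inlinks": 7, "n_views": 40.0, "n_wpmed_inlinks": 1},
    {"title": "Placeholder C", "n_inlinks": 5, "n_views": 8.0, "n_wpmed_inlinks": 0},
]

summary = summarize(toy_pages)
```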
Comparing these importance-rated disambiguation pages with those that did not have a rating suggests that all of them should have been marked as disambiguation pages and not given a rating. I went ahead and changed them, partly because that makes for a better dataset of WPMED importance ratings.
WPMED communication
I posted to WPMED's assessment talk page with an introduction and some examples of articles we might want to talk about. Hopefully they'll have some comments.