Research talk:Automated classification of article importance/Work log/2017-04-20
Thursday, April 20, 2017
Today I plan to train a new classifier for WPMED, study WikiProject activity in more detail, follow up on some project communication, and start working on global data.
WPMED classifier
I've gathered new data on the number of inlinks to articles, taking (single) redirects into account. The new dataset also reflects changes made since my first round of data gathering, as many articles have been reassessed either by me or by members of WPMED. My data flow also accounts for the Wikidata instances that are to be declared Low-importance (hereafter referred to as "Low-importance by default").
I built a 3,600-article training set consisting of a random sample of 90 of the 91 Top-importance articles, 810 synthetic Top-importance samples, and 900 random articles from each of the other three importance categories, with the "Low-importance by default" articles removed. I then trained a GBM classifier on this dataset; tuning indicated a minimum node size of 128 and a forest size of 3,701 trees. Using this model to predict the entire new WPMED dataset gives the following confusion matrix:
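As a rough illustration of what counting inlinks "taking (single) redirects into account" could look like, here is a minimal sketch. It is not the code used for this dataset: the input format (a list of `(source, target)` link pairs plus a redirect map), the function name `count_inlinks`, and the choices to count distinct source pages and to ignore links that originate from redirect pages are all assumptions made for the example.

```python
from collections import defaultdict


def count_inlinks(links, redirects):
    """Count distinct pages linking to each article, resolving single redirects.

    links: iterable of (source_title, target_title) pairs (hypothetical format).
    redirects: dict mapping a redirect title to the article it points to.
    """
    sources_by_target = defaultdict(set)
    for source, target in links:
        # Assumption: links that originate from a redirect page are not inlinks.
        if source in redirects:
            continue
        # Follow one level of redirect only; double redirects are not chased.
        resolved = redirects.get(target, target)
        # Ignore self-links that only appear once the redirect is resolved.
        if source != resolved:
            sources_by_target[resolved].add(source)
    return {title: len(sources) for title, sources in sources_by_target.items()}


# Toy data, invented purely for illustration.
example_links = [
    ("Aspirin", "Heart attack"),           # link via a redirect
    ("Stroke", "Myocardial infarction"),   # direct link
    ("Aspirin", "Myocardial infarction"),  # same source again, counted once
    ("Heart attack", "Myocardial infarction"),  # the redirect page itself
    ("Stroke", "Aspirin"),
]
example_redirects = {"Heart attack": "Myocardial infarction"}

inlink_counts = count_inlinks(example_links, example_redirects)
print(inlink_counts)
```

Without the redirect map, the link from "Aspirin" through "Heart attack" would be credited to the redirect title rather than to the article, which is the kind of lost signal the new dataset is meant to recover.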
Actual / Predicted | Top | High | Mid | Low | Accuracy |
---|---|---|---|---|---|
Top | 77 | 13 | 1 | 0 | 84.62% |
High | 179 | 584 | 136 | 77 | 59.84% |
Mid | 188 | 1,897 | 4,002 | 2,755 | 45.26% |
Low | 50 | 1,043 | 3,674 | 14,735 | 75.56% |
Average | | | | | 65.95% |
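The accuracy column can be reproduced from the counts in the matrix. The sketch below assumes rows are the actual ratings and columns the predictions, which is consistent with the Top row summing to the 91 Top-importance articles. The variable and function names are chosen for the example. Note that the 65.95% figure matches the overall accuracy (all correct predictions divided by all articles) rather than the unweighted mean of the four per-class accuracies, which would come out at roughly 66.3%.

```python
labels = ["Top", "High", "Mid", "Low"]

# Counts copied from the confusion matrix above (rows assumed to be actual ratings).
confusion = [
    [77, 13, 1, 0],
    [179, 584, 136, 77],
    [188, 1897, 4002, 2755],
    [50, 1043, 3674, 14735],
]


def per_class_accuracy(matrix):
    """Fraction of each row (actual class) that lands on the diagonal."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Correct predictions across all classes divided by the total count."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


class_acc = per_class_accuracy(confusion)
overall = overall_accuracy(confusion)

for label, acc in zip(labels, class_acc):
    print(f"{label}: {acc:.2%}")
print(f"Overall: {overall:.2%}")
```

Because the Low- and Mid-importance classes hold most of the articles, the overall figure is dominated by how well those two classes are predicted.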
The performance of this classifier is much stronger than what we saw a week ago, suggesting that taking redirects into account in these calculations adds a significant amount of signal. Performance takes a solid step forward for Top- and High-importance articles and is roughly unchanged for Low-importance articles. The largest gain is for Mid-importance articles, where accuracy doubles.
I manually inspected some of the misclassified articles, and the classifier appears to make reasonably sane predictions. I'll create a table of misclassified examples to discuss with WPMED.
Project Communication
We recently announced the project on wiki-research-l, so I've been following up on responses there. Particularly useful was Pine's reply with some pointers to various practices around importance on the English Wikipedia.
I've also responded to a thread on our project's talk page, and gotten in touch with WPMED about the predictions we've made.