Research talk:Automated classification of article importance/Work log/2017-06-05
Monday, June 5, 2017
Today I'll continue working on a global prediction model, start planning some kind of API, and follow up on some project communication.
Global model
The idea behind our global model, the one that can make importance predictions in the context of the entire English Wikipedia, is that we can take WikiProject-specific models and project them onto Wikipedia as a whole. Because we only use relative measures in our model (e.g. rank percentile of views, or proportion of active inlinks), the features should work in multiple contexts. However, the model will be affected by the projects that we train it on. This can mean a trade-off between high performance and diversity, because we have seen that some WikiProjects have less variation in their ratings, which leads to better performance. We have also seen that some projects (e.g. WP:MED) have a model where article views largely determine importance, while in other projects (e.g. WP:NFL), inlinks are strongly related to importance.
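To illustrate why a relative measure carries over between contexts, here is a minimal sketch of a rank percentile computation. The function name and the view counts are made up for illustration, and the tie handling (ties share the highest rank) is an assumption, not necessarily how the project's features were computed.

```python
from bisect import bisect_right


def rank_percentiles(values):
    """Return each value's rank percentile in (0, 1].

    Assumption: tied values share the highest rank among them.
    """
    ordered = sorted(values)
    n = len(ordered)
    # bisect_right counts how many items are <= the value.
    return [bisect_right(ordered, v) / n for v in values]


# Hypothetical view counts for a small and a large project.
small_project_views = [10, 200, 50, 1000]
large_project_views = [1_000, 20_000, 5_000, 100_000]

print(rank_percentiles(small_project_views))  # [0.25, 0.75, 0.5, 1.0]
print(rank_percentiles(large_project_views))  # [0.25, 0.75, 0.5, 1.0]
```

Although the raw view counts differ by two orders of magnitude, both projects yield the same percentiles, which is the property that lets a feature like this be used across contexts.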
We test three models in this scenario:
- Using all data from all six projects (Africa, China, Judaism, Medicine, NFL, and Politics)
- Using all data from projects where classifier performance is high
- Using all data from projects where classifier performance is high and the models are similar
An additional strategy could be to sample equally from all projects, at which point the sample size per class is limited by the smallest category, which is the 92 articles in WP:MED's Top-importance class. If the other strategies do not appear to make improvements, we will consider going down that route.
All data from all six WikiProjects
We first grab data from all projects, giving us the following distribution of data across importance ratings:
Rating | N articles |
---|---|
Top | 3,479 |
High | 6,033 |
Mid | 31,191 |
Low | 85,949 |
We hold out a random sample of 345 articles per class for a test set, and sample 3,100 articles from each class for the training set. The size of the dataset and the variation in features lead to memory constraints when training, so we cap the model at 2,000 trees in order to not run out of memory. This cap was determined by checking model accuracy at various model sizes, finding that there is little to gain in accuracy by building a larger model.
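The balanced hold-out described above can be sketched as follows. This is only an illustration: the function name, the seed, and the synthetic integer article IDs are assumptions, and only the class counts and sample sizes come from this log.

```python
import random
from collections import defaultdict


def balanced_split(ratings, n_test, n_train, seed=42):
    """Draw n_test and n_train articles per class, without overlap.

    ratings maps an article ID to its importance rating.
    Returns two dicts (test, train) mapping rating -> list of article IDs.
    """
    by_class = defaultdict(list)
    for article_id, rating in ratings.items():
        by_class[rating].append(article_id)

    rng = random.Random(seed)
    test, train = {}, {}
    for rating, ids in by_class.items():
        if len(ids) < n_test + n_train:
            raise ValueError(f"not enough {rating} articles: {len(ids)}")
        shuffled = sorted(ids)  # fixed starting order for reproducibility
        rng.shuffle(shuffled)
        test[rating] = shuffled[:n_test]
        train[rating] = shuffled[n_test:n_test + n_train]
    return test, train


# Class sizes from the table above; the article IDs are placeholders.
class_counts = {"Top": 3479, "High": 6033, "Mid": 31191, "Low": 85949}
ratings = {}
for rating, count in class_counts.items():
    for _ in range(count):
        ratings[len(ratings)] = rating

test_ids, train_ids = balanced_split(ratings, n_test=345, n_train=3100)
```

With 3,479 Top-importance articles, the 345 + 3,100 = 3,445 articles drawn per class use almost the whole smallest class, which is why the per-class sample cannot be much larger.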
Using this model to predict the articles in the test set gives the following confusion matrix (rows are true rating, columns are predicted rating):
True rating | Top | High | Mid | Low | Accuracy |
---|---|---|---|---|---|
Top | 249 | 49 | 34 | 13 | 72.17% |
High | 108 | 108 | 73 | 56 | 31.30% |
Mid | 58 | 71 | 132 | 84 | 38.26% |
Low | 20 | 29 | 65 | 231 | 66.96% |
Average | | | | | 52.17% |
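As a worked check, the accuracy column can be recomputed from the counts in the matrix above. The function names in this sketch are illustrative; the numbers are taken directly from the table.

```python
LABELS = ["Top", "High", "Mid", "Low"]

# Rows are true ratings, columns are predicted ratings.
CONFUSION = [
    [249, 49, 34, 13],
    [108, 108, 73, 56],
    [58, 71, 132, 84],
    [20, 29, 65, 231],
]


def per_class_accuracy(matrix):
    """Share of each true class (row) that was predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


for label, accuracy in zip(LABELS, per_class_accuracy(CONFUSION)):
    print(f"{label}: {accuracy:.2%}")
print(f"Average: {overall_accuracy(CONFUSION):.2%}")
```

Because every class has the same number of test articles (345), the overall accuracy equals the mean of the per-class accuracies.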
Performance is fairly good for Top- and Low-importance articles, and we see that a low number of articles in those categories are predicted to belong to distant classes. High-importance appears to be a difficult class to predict: accuracy is only slightly above random chance (25% with four balanced classes), and we see a lot of articles being predicted to belong to neighboring classes. Mid-importance is also not easy to predict, as more articles are predicted to be in neighboring classes than in the class itself. Based on these results and the lower performance compared to what we've seen for some of the WikiProjects, we would definitely like to test other approaches.
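The split between neighboring-class and distant-class errors can be tallied from the confusion matrix for the six-project model. This is a small sketch with an illustrative function name, treating the ratings as ordered (Top, High, Mid, Low) so that the distance between two ratings is the difference of their positions.

```python
LABELS = ["Top", "High", "Mid", "Low"]

# Confusion matrix for the six-project model (rows true, columns predicted).
CONFUSION = [
    [249, 49, 34, 13],
    [108, 108, 73, 56],
    [58, 71, 132, 84],
    [20, 29, 65, 231],
]


def errors_by_distance(matrix):
    """For each true class, count misclassifications by rating distance."""
    result = []
    for i, row in enumerate(matrix):
        counts = {}
        for j, n in enumerate(row):
            if i != j:
                distance = abs(i - j)
                counts[distance] = counts.get(distance, 0) + n
        result.append(counts)
    return result


for label, counts in zip(LABELS, errors_by_distance(CONFUSION)):
    print(label, counts)
# Top {1: 49, 2: 34, 3: 13}
# High {1: 181, 2: 56}
# Mid {1: 155, 2: 58}
# Low {1: 65, 2: 29, 3: 20}
```

For High and Mid, the distance-1 counts (181 and 155) exceed the number of correct predictions (108 and 132), which is the neighboring-class confusion described above.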
High-performing WikiProjects
We go through the models for each WikiProject and check the accuracy for each importance class. If it's low, e.g. around 30%, we consider that project a candidate for removal. In the end, this results in only a single project being removed, WP:Africa, where performance on Mid-importance articles is significantly lower than in the other projects.
We split the dataset into training and test sets (90%/10%) using random selection as before, followed by training a GBM using 10-fold cross-validation to select the minimum node size and number of trees. Using a minimum node size of 4 and 2,500 trees, we train a model on the training set and get the following performance on the test set:
True rating | Top | High | Mid | Low | Accuracy |
---|---|---|---|---|---|
Top | 86 | 27 | 6 | 1 | 71.67% |
High | 36 | 48 | 20 | 16 | 40.00% |
Mid | 10 | 32 | 46 | 32 | 38.33% |
Low | 2 | 13 | 20 | 85 | 70.83% |
Average | | | | | 55.21% |
We have slightly lower performance on Top-importance articles, but see an increase in accuracy for the other classes. Long-distance errors on Top- and Low-importance articles are still few, but we still struggle with neighboring classes for High- and Mid-importance articles. The latter has been a recurring challenge in many of the WikiProjects, so I suspect it won't be a challenge we can easily solve. Another thing to keep in mind is that the training and test sets differ from those in the previous case, so the results are not directly comparable.
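The parameter search used for this model (10-fold cross-validation over minimum node size and number of trees) can be sketched generically as below. This is not the code used in the project: `fit_and_score` is a hypothetical callback that would train a GBM on the training indices and return its accuracy on the held-out fold, the candidate values in the grid are invented, and `toy_scorer` is a stand-in with no relation to real results.

```python
import random
from itertools import product


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and deal them into k near-equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def grid_search_cv(n_items, param_grid, fit_and_score, k=10, seed=0):
    """Return the parameter combination with the best mean fold score."""
    folds = k_fold_indices(n_items, k, seed)
    best_params, best_score = None, float("-inf")
    for values in product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), values))
        scores = []
        for held_out in folds:
            held = set(held_out)
            train_idx = [i for i in range(n_items) if i not in held]
            # Hypothetical callback: train on train_idx, score on held_out.
            scores.append(fit_and_score(train_idx, held_out, params))
        mean_score = sum(scores) / k
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_scorer(train_idx, test_idx, params):
    """Placeholder score so the sketch runs; NOT a real model."""
    return -abs(params["min_node_size"] - 4) - abs(params["n_trees"] - 2500) / 1000


# Invented candidate values, for illustration only.
grid = {"min_node_size": [4, 16, 64], "n_trees": [500, 1000, 2500]}
best, score = grid_search_cv(50, grid, toy_scorer)
print(best)
```

In practice the callback would wrap whichever GBM implementation is in use; the structure of the search stays the same.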
Focusing on views
Lastly, we look into whether using projects that have similar models leads to better performance (while keeping the performance requirement from before). Examining the features' impact on accuracy, we find that WP:Judaism and WP:NFL have models that are mainly affected by number of inlinks, so we drop those two and keep the others. As before, we split the dataset into a test set (with 50 articles per class) and a training set (with 550 articles per class). While we might want to experiment with SMOTE for increasing the training set size, I decided to stay away from that until I know more about how the model behaves.
As before, we use 10-fold cross-validation to find the right minimum node size (64) and number of trees (1,091). Testing the model on the test set results in the following confusion matrix:
True rating | Top | High | Mid | Low | Accuracy |
---|---|---|---|---|---|
Top | 36 | 9 | 3 | 2 | 72.00% |
High | 10 | 18 | 14 | 8 | 36.00% |
Mid | 2 | 14 | 19 | 15 | 38.00% |
Low | 9 | 3 | 15 | 32 | 64.00% |
Average | | | | | 52.50% |
Overall performance is comparable to what we've seen before, although somewhat lower than for the previous model. The decrease in performance comes in the High- and Low-importance classes, where accuracy is quite a bit lower than before. The other two classes are largely the same. We can also see that a fair number of Low-importance articles (9, or almost 20%) are predicted to be Top-importance, which is not beneficial. We also examined how the various features relate to predicting importance, and find that the influence of views and inlinks is largely comparable to what it was for the previous model with data from five of the WikiProjects. In other words, there does not appear to be any gain in performance from excluding the projects where inlinks play a larger part.