Research talk:Automated classification of article importance/Work log/2017-05-08
Monday, May 8, 2017
Today my goal is to finish building models for the candidate WikiProjects by building models for WikiProject Africa and WikiProject Politics. Once that is done, I'll write introductions to the various work lists we've generated, and adapt the introduction to the rerating candidates from WikiProject Medicine. Then I'll put those somewhere on enwiki and compose messages to the various projects' assessment pages.
WikiProject Politics
The distribution of articles per importance rating for WikiProject Politics is as follows:
Rating | N articles |
---|---|
Top | 111 |
High | 1,107 |
Mid | 4,068 |
Low | 18,469 |
The number of Top-importance articles will limit the size of our test set, so I decided to set aside 30 articles for the test set and use 80 for the training set. We have had reasonable success with creating a fairly large proportion of synthetic samples, so adding 800 synthetic samples should be okay, giving us 880 articles per class.
Performance on the 120-article test set was about the same as for other projects, so I created a larger training set with 100 Top-importance articles, 1,000 synthetic samples, and 1,100 articles from each of the three other classes, for a total of 4,400 articles. Using 10-fold cross-validation, I found a minimum node size of 64 to have the lowest error, and that I should use 5,088 trees for predictions. With this setup, we get the following performance across the complete dataset:
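The sizing arithmetic above can be sketched as follows. This is a minimal illustration, not the code used in the project: the function name `plan_balanced_split` and its parameters are hypothetical, and it assumes that only the smallest class is topped up with synthetic samples while the other classes are sampled from real articles.

```python
# Article counts per importance rating for WikiProject Politics (from the table above).
POLITICS_COUNTS = {"Top": 111, "High": 1107, "Mid": 4068, "Low": 18469}


def plan_balanced_split(counts, minority, minority_real, n_synthetic, test_per_class):
    """Return (training plan, test-set size) for a class-balanced split.

    The plan maps each rating to (real articles, synthetic samples).
    Assumption: only the minority class receives synthetic samples.
    """
    target = minority_real + n_synthetic  # training articles per class
    plan = {}
    for rating, n_articles in counts.items():
        available = n_articles - test_per_class  # left after the test holdout
        if rating == minority:
            if minority_real > available:
                raise ValueError(f"not enough {rating} articles for training")
            plan[rating] = (minority_real, n_synthetic)
        else:
            if target > available:
                raise ValueError(f"not enough {rating} articles to reach {target}")
            plan[rating] = (target, 0)
    return plan, test_per_class * len(counts)


plan, test_size = plan_balanced_split(
    POLITICS_COUNTS, minority="Top", minority_real=80, n_synthetic=800, test_per_class=30
)
print(plan["Top"])  # (80, 800)
print(test_size)    # 120
```

With these inputs every class ends up with 880 training articles, and the test set holds 30 articles from each of the four classes.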
Actual / Predicted | Top | High | Mid | Low | Accuracy |
---|---|---|---|---|---|
Top | 83 | 8 | 16 | 3 | 75.45% |
High | 221 | 439 | 271 | 175 | 39.69% |
Mid | 586 | 936 | 1,733 | 812 | 42.61% |
Low | 476 | 1,250 | 2,880 | 13,859 | 75.06% |
Average | | | | | 67.85% |
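As a check on the table, the accuracy figures can be recomputed from the counts. The sketch below assumes that rows hold the project's ratings and columns the predicted ratings (the row sums match the rating distribution, which supports this reading). The function names are illustrative. The "Average" figure appears to match overall accuracy, meaning correct predictions divided by all articles, rather than the mean of the four per-class accuracies.

```python
LABELS = ["Top", "High", "Mid", "Low"]

# Confusion matrix for WikiProject Politics: rows = project rating, columns = predicted.
POLITICS = [
    [83, 8, 16, 3],
    [221, 439, 271, 175],
    [586, 936, 1733, 812],
    [476, 1250, 2880, 13859],
]


def per_class_accuracy(matrix):
    """Share of each row that falls on the diagonal (correctly predicted)."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Correct predictions divided by the total number of articles."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


for label, acc in zip(LABELS, per_class_accuracy(POLITICS)):
    print(f"{label}: {acc:.2%}")
print(f"Overall: {overall_accuracy(POLITICS):.2%}")
```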
Overall performance is on par with what we have seen in other projects. We also see that Top- and Low-importance are the classes that appear easier to predict, while performance on the other two classes is quite a lot lower. About 16% of the High-importance articles are predicted to be Low-importance (175 of 1,106), and about 23% of the Mid-importance articles are predicted to be High-importance (936 of 4,067). I wonder whether that will be picked up later during discussions. Inspecting some of the predicted reratings also indicates that they are not as clearly related to number of views and inlinks as we have seen in other projects, and I am curious to see how that affects things too.
WikiProject Africa
The distribution of articles per importance rating for WikiProject Africa is as follows:
Rating | N articles |
---|---|
Top | 2,264 |
High | 1,249 |
Mid | 4,054 |
Low | 25,835 |
Whereas we previously had a low number of Top-importance articles, we now have a large project with fewer limitations on dataset size. Here it is the number of High-importance articles that limits dataset size, but with almost 1,250 articles it is not much of a limitation compared to what we have seen previously. We first split the dataset into separate training and test sets, and found classifier performance to be on par with what we have seen previously. While doing this, we found several articles that were tagged multiple times by the project and given two different importance ratings, and I created a work list table for those as well.
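Finding articles with conflicting ratings can be sketched as a simple grouping step. This is an illustration under assumptions, not the project's actual code: the function name `find_conflicting_ratings` and the (title, rating) input format are hypothetical, and the example data is made up.

```python
from collections import defaultdict


def find_conflicting_ratings(tags):
    """Return {title: sorted ratings} for articles with more than one distinct rating.

    `tags` is an iterable of (article title, importance rating) pairs,
    one pair per project tag on the article.
    """
    ratings_by_title = defaultdict(set)
    for title, rating in tags:
        ratings_by_title[title].add(rating)
    return {
        title: sorted(ratings)
        for title, ratings in ratings_by_title.items()
        if len(ratings) > 1
    }


# Hypothetical example data, not taken from the project.
example_tags = [
    ("Article A", "High"),
    ("Article A", "Low"),
    ("Article B", "Mid"),
    ("Article B", "Mid"),  # tagged twice, but with the same rating
    ("Article C", "Top"),
]
print(find_conflicting_ratings(example_tags))  # {'Article A': ['High', 'Low']}
```

Articles tagged twice with the same rating are left out, since they do not need a decision from the project.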
I chose to sample 1,240 articles from each category for the final training set. Using 10-fold cross-validation as before, I found that a minimum node size of 4 had the best performance, using 4,543 trees for the predictions. This resulted in the following confusion matrix for the full dataset:
Actual / Predicted | Top | High | Mid | Low | Accuracy |
---|---|---|---|---|---|
Top | 1,672 | 360 | 160 | 72 | 73.9% |
High | 440 | 422 | 155 | 232 | 33.8% |
Mid | 921 | 882 | 958 | 1,293 | 23.6% |
Low | 1,807 | 3,104 | 3,802 | 17,122 | 66.3% |
Average | | | | | 60.4% |
Overall performance is comparable to other projects we have modelled. We see fairly strong performance for Top-importance articles, and good performance on Low-importance articles as well. High- and Mid-importance articles are not predicted as well. There is some indication that the High-importance articles look like Top-importance, which we've encountered in other projects too. It is somewhat worrying that predictions of Mid-importance articles are spread out across the board, but this is also a result of the input data, as the project's importance ratings do not seem to map closely to either number of views or number of inlinks.
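To see how predictions are spread within each class, the confusion matrix can be normalised per row. The sketch below is illustrative only. It assumes rows are the project's ratings and columns the predicted ratings, and the function name `row_shares` is hypothetical.

```python
LABELS = ["Top", "High", "Mid", "Low"]

# Confusion matrix for WikiProject Africa: rows = project rating, columns = predicted.
AFRICA = [
    [1672, 360, 160, 72],
    [440, 422, 155, 232],
    [921, 882, 958, 1293],
    [1807, 3104, 3802, 17122],
]


def row_shares(matrix):
    """Normalise each row so it sums to 1: the share of each predicted rating."""
    return [[cell / sum(row) for cell in row] for row in matrix]


shares = row_shares(AFRICA)
for label, row in zip(LABELS, shares):
    print(label, " ".join(f"{share:.1%}" for share in row))
```

For the Mid row this gives roughly 23%, 22%, 24% and 32% across Top, High, Mid and Low, which is the spread described above. About 35% of the High-importance articles are predicted as Top-importance, slightly more than the share predicted correctly.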