Research talk:Automated classification of article importance/Work log/2017-05-08

Monday, May 8, 2017

Today my goal is to finish building models for the candidate WikiProjects, which means building models for WikiProject Africa and WikiProject Politics. Once that is done, I'll write introductions to the various work lists we've generated, and adapt the introduction to the rerating candidates from WikiProject Medicine. Then I'll put those somewhere on enwiki and compose messages to the various projects' assessment pages.

WikiProject Politics

The distribution of articles per importance rating for WikiProject Politics is as follows:

Rating	N articles
Top	111
High	1,107
Mid	4,068
Low	18,469

The number of Top-importance articles will limit the size of our test set, so I decided to set aside 30 articles for the test set, and use 80 for the training set. We have had reasonable success with creating a fairly large proportion of synthetic samples, so having an additional 800 synthetic samples should be okay, giving us 880 articles per class.
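The sizing above is plain arithmetic, and a minimal sketch of it follows. It assumes the smallest class (Top-importance) caps both splits and that every class ends up at the same training size. The function name, parameter names, and derived totals are illustrative and not part of the original analysis.

```python
def plan_split(n_available, n_test, n_train_real, n_synthetic, n_classes=4):
    """Return the dataset sizes implied by a per-class split.

    Assumption: the smallest class caps the split, and synthetic samples
    are only used to bring that class up to the per-class training size.
    """
    if n_test + n_train_real > n_available:
        raise ValueError("not enough real articles in the smallest class")
    per_class = n_train_real + n_synthetic
    return {
        "train_per_class": per_class,
        "train_total": per_class * n_classes,
        "test_total": n_test * n_classes,
        "unused_real": n_available - n_test - n_train_real,
    }


# WikiProject Politics: 111 Top-importance articles, 30 held out, 80 for training.
plan = plan_split(n_available=111, n_test=30, n_train_real=80, n_synthetic=800)
print(plan)
```

With these inputs the sketch reproduces the 880 articles per class and the 120-article test set mentioned in this log.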

Performance on the 120-article test set was about the same as for other projects, so I created a larger training set with 100 Top-importance articles, 1,000 synthetic samples, and 1,100 articles from each of the three other classes, for a total of 4,400 articles. Using 10-fold cross-validation I found a minimum node size of 64 to have the lowest error, and that I should use 5,088 trees for predictions. With this setup, we get the following performance across the complete dataset:

	Top	High	Mid	Low	Accuracy
Top	83	8	16	3	75.45%
High	221	439	271	175	39.69%
Mid	586	936	1,733	812	42.61%
Low	476	1,250	2,880	13,859	75.06%
Average					67.85%

Overall performance is on par with what we have seen in other projects. Top- and Low-importance appear to be the easier classes to predict, while performance on the other two classes is considerably lower. About 16% of the High-importance articles (175 of 1,106) are predicted to be Low-importance, and about 23% of the Mid-importance articles (936 of 4,067) are predicted to be High-importance; I wonder whether that will be picked up later during discussions. Inspecting some of the predicted reratings also indicates that they are not as clearly related to number of views and inlinks as we have seen in other projects, and I am curious to see how that affects things too.
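The accuracy column in the WikiProject Politics table can be recomputed from the counts. The sketch below assumes rows are the project's own ratings and columns are the predicted ratings, which is consistent with the row totals nearly matching the rating distribution above. The function names are illustrative.

```python
# Rows: rating given by the project. Columns: predicted rating.
# Order in both directions: Top, High, Mid, Low.
LABELS = ["Top", "High", "Mid", "Low"]
POLITICS = [
    [83, 8, 16, 3],
    [221, 439, 271, 175],
    [586, 936, 1733, 812],
    [476, 1250, 2880, 13859],
]


def per_class_accuracy(matrix):
    """Share of each row that falls on the diagonal."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Share of all articles that fall on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


for label, acc in zip(LABELS, per_class_accuracy(POLITICS)):
    print(f"{label}: {acc:.2%}")
print(f"Overall: {overall_accuracy(POLITICS):.2%}")
```

Computed this way, the "Average" row corresponds to the overall share of correctly predicted articles rather than the unweighted mean of the four per-class figures, so the large Low-importance class dominates it.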

WikiProject Africa

The distribution of articles per importance rating for WikiProject Africa is as follows:

Rating	N articles
Top	2,264
High	1,249
Mid	4,054
Low	25,835

Whereas we previously had a low number of Top-importance articles, we now have a large project with fewer limitations on dataset size. Here it is the number of High-importance articles that limits dataset size, but with almost 1,250 articles, it's not really a limitation compared to what we have seen previously. We first split the dataset into separate training and test sets, and found classifier performance to be on par with what we have seen previously. While doing this, we found several articles that were tagged multiple times by the project and given two different importance ratings, and I created a work list table for those as well.
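Finding articles that carry more than one importance rating amounts to grouping ratings by title. The following is a minimal sketch assuming the assessments are available as (title, rating) pairs; the input shape, function name, and sample rows are hypothetical and are not the project's actual data or the method used for the work list.

```python
from collections import defaultdict


def find_conflicting_ratings(assessments):
    """Return {title: sorted ratings} for titles with more than one distinct rating."""
    ratings_by_title = defaultdict(set)
    for title, rating in assessments:
        ratings_by_title[title].add(rating)
    return {
        title: sorted(ratings)
        for title, ratings in ratings_by_title.items()
        if len(ratings) > 1
    }


# Hypothetical rows for illustration only.
sample = [
    ("Example article A", "High"),
    ("Example article A", "Mid"),
    ("Example article B", "Low"),
    ("Example article B", "Low"),  # tagged twice, same rating: not a conflict
    ("Example article C", "Top"),
]
conflicts = find_conflicting_ratings(sample)
print(conflicts)
```

Articles tagged twice with the same rating are deliberately left out, since only disagreeing ratings need a reviewer's attention.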

I chose to sample 1,240 articles from each category for the final training set. Using 10-fold cross-validation as before, I found that a minimum node size of 4 had the best performance, using 4,543 trees for the predictions. This resulted in the following confusion matrix for the full dataset:

	Top	High	Mid	Low	Accuracy
Top	1,672	360	160	72	73.9%
High	440	422	155	232	33.8%
Mid	921	882	958	1,293	23.6%
Low	1,807	3,104	3,802	17,122	66.3%
Average					60.4%

Overall performance is comparable to other projects we have modelled. We see fairly strong performance for Top-importance articles, and good performance on Low-importance articles as well. High- and Mid-importance articles are not predicted as well. There is some indication that the High-importance articles look like Top-importance articles, which we've encountered in other projects too. It is somewhat worrying that predictions of Mid-importance articles are spread out across the board, but this is also a result of the input data, as the project's importance ratings do not seem to map closely to either number of views or inlinks.
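The spread described above is easier to see when each row of the WikiProject Africa table is expressed as shares of that row. The sketch below assumes, as before, that rows are the project's ratings and columns are predictions; the function name is illustrative.

```python
# Rows: rating given by the project. Columns: predicted rating.
# Order in both directions: Top, High, Mid, Low.
LABELS = ["Top", "High", "Mid", "Low"]
AFRICA = [
    [1672, 360, 160, 72],
    [440, 422, 155, 232],
    [921, 882, 958, 1293],
    [1807, 3104, 3802, 17122],
]


def row_shares(matrix):
    """Normalise each row so that its cells sum to 1."""
    return [[cell / sum(row) for cell in row] for row in matrix]


shares = row_shares(AFRICA)
for label, row in zip(LABELS, shares):
    cells = "  ".join(f"{name}={share:.1%}" for name, share in zip(LABELS, row))
    print(f"{label:>4}: {cells}")
```

In the High row, the Top column holds a slightly larger share than the High column itself, and in the Mid row no single column holds more than about a third of the articles.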
