Research talk:Automated classification of article importance/Work log/2017-03-07
Tuesday, March 7, 2017
Today I will continue the data analysis that I was unable to complete yesterday. For future reference, doing string manipulation in R is not a great idea. I'll write some Python to munge my data and aim to tackle the remaining research questions:
- How is the number of ratings distributed?
- How many ratings are unanimous?
- How many are rated by more than one project and unanimously rated?
- What is the overlap between ratings?
- How many have more than two ratings?
RQ4: How is the number of ratings distributed?
I decided to insert a new RQ4 as I was interested in understanding what the rating distribution looks like. Unsurprisingly, most articles have only a few ratings, as we can see in the histogram below. The quantiles show the same pattern: the median is 2, the 85th percentile is 3, and the 95th percentile is 5.
The article with the highest number of ratings is African, Caribbean and Pacific Group of States with 86. Second is Women in Europe with 55.
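As a rough illustration of how such quantiles can be read off a list of per-article rating counts, here is a minimal Python sketch. The toy data and the `percentile` helper are hypothetical, and the nearest-rank method used here is an assumption that may differ from the method used for the figures above (R's default `quantile` interpolates).

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the observations are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical toy data: number of ratings per article (not the real dataset).
ratings_per_article = [1, 1, 1, 2, 2, 2, 2, 3, 3, 5]

summary = {p: percentile(ratings_per_article, p) for p in (50, 85, 95)}
print(summary)
```

On a long-tailed distribution like this one, the upper percentiles stay small even when a few articles have very many ratings.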
We also want to know how many articles have a given rating (Top, High, Mid, Low). Counting the total occurrences of those gives the following table:
Rating | N ratings |
---|---|
Top | 53,104 |
High | 210,665 |
Mid | 913,394 |
Low | 4,929,624 |
Because we count each occurrence of a rating here, the total is much larger than the number of articles in our dataset: as we have already seen, articles frequently have multiple ratings.
RQ5: How many ratings are unanimous?
This research question concerns articles that are rated by at least one WikiProject and whose ratings are unanimous. Note that some articles are categorized as having unknown importance, or their importance is "not available" (NA). We remove these articles from our dataset, so we only consider articles where the categorization shows the WikiProjects to be in full agreement about the rating.
    n_unanimous_1 = data.table(
      rating=c('Top', 'High', 'Mid', 'Low'),
      n_unanimous=c(
        length(only_articles[n_top > 0 & n_high == 0 & n_mid == 0 & n_low == 0 & n_unknown == 0 & n_na == 0]$talk_page_id),
        length(only_articles[n_top == 0 & n_high > 0 & n_mid == 0 & n_low == 0 & n_unknown == 0 & n_na == 0]$talk_page_id),
        length(only_articles[n_top == 0 & n_high == 0 & n_mid > 0 & n_low == 0 & n_unknown == 0 & n_na == 0]$talk_page_id),
        length(only_articles[n_top == 0 & n_high == 0 & n_mid == 0 & n_low > 0 & n_unknown == 0 & n_na == 0]$talk_page_id)
      )
    );
    n_unanimous_1$rating = ordered(n_unanimous_1$rating, c('Top', 'High', 'Mid', 'Low'));
    > n_unanimous_1
       rating n_unanimous
    1:    Top        7991
    2:   High       42572
    3:    Mid      231329
    4:    Low     1795436
Rating | N articles |
---|---|
Top | 7,991 |
High | 42,572 |
Mid | 231,329 |
Low | 1,795,436 |
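
The four near-identical filters used for this count can also be expressed as a single rule. The following Python sketch is only an illustration: the per-article dictionaries, the `unanimous_rating` helper, and its `min_projects` parameter are hypothetical and are not taken from the scripts used for this analysis.

```python
from collections import Counter

LEVELS = ("Top", "High", "Mid", "Low")


def unanimous_rating(counts, min_projects=1):
    """Return the one importance level all rating projects agree on,
    or None if the article is not unanimously rated.

    `counts` maps a level name to the number of projects giving that
    rating (hypothetical structure). Articles with any "Unknown" or
    "NA" rating are discarded, as in the analysis above.
    """
    if counts.get("Unknown", 0) or counts.get("NA", 0):
        return None
    rated = [level for level in LEVELS if counts.get(level, 0) > 0]
    if len(rated) != 1:
        return None
    level = rated[0]
    return level if counts[level] >= min_projects else None


# Hypothetical toy articles, not real data.
articles = [
    {"Top": 2},
    {"Low": 1},
    {"Mid": 1, "Low": 3},   # two different levels: not unanimous
    {"High": 1, "NA": 1},   # has an NA rating: discarded
    {"Low": 4},
]

tally = Counter(r for r in (unanimous_rating(a) for a in articles) if r)
print(tally)
```

Raising `min_projects` restricts the count to articles rated by at least that many projects.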
RQ6: How many are rated by more than one project and unanimously rated?
This RQ only concerns articles that are rated by at least two WikiProjects, all of which agree on the rating. As for RQ5, we remove articles with "unknown" or "NA" importance.
    n_unanimous = data.table(
      rating=c('Top', 'High', 'Mid', 'Low'),
      n_unanimous=c(
        length(only_articles[n_top > 1 & n_high == 0 & n_mid == 0 & n_low == 0 & n_unknown == 0 & n_na == 0]$talk_page_id),
        length(only_articles[n_top == 0 & n_high > 1 & n_mid == 0 & n_low == 0 & n_unknown == 0 & n_na == 0]$talk_page_id),
        length(only_articles[n_top == 0 & n_high == 0 & n_mid > 1 & n_low == 0 & n_unknown == 0 & n_na == 0]$talk_page_id),
        length(only_articles[n_top == 0 & n_high == 0 & n_mid == 0 & n_low > 1 & n_unknown == 0 & n_na == 0]$talk_page_id)
      )
    );
    n_unanimous$rating = ordered(n_unanimous$rating, c('Top', 'High', 'Mid', 'Low'));
    > n_unanimous
       rating n_unanimous
    1:    Top        1900
    2:   High        9905
    3:    Mid       65365
    4:    Low      855760
Rating | N articles |
---|---|
Top | 1,900 |
High | 9,905 |
Mid | 65,365 |
Low | 855,760 |
From the table it is evident that removing articles rated by a single project drastically lowers the number of articles. For example, over 6,000 articles (RQ5: 7,991; RQ6: 1,900) have a Top-importance rating, but only from a single project. Similarly, we remove nearly 33,000 High-importance articles (RQ5: 42,572; RQ6: 9,905). We see this as a clear indication that WikiProject assessments are local to a given project, which in turn suggests that using them as global indicators of importance without any kind of filtering is problematic. Counting only unanimous ratings from multiple projects reduces the number of articles, which suggests that this might be a useful way of identifying importance at a larger scale (e.g. the edition as a whole).
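The single-project counts follow from subtracting the RQ6 figures from the RQ5 figures, since the RQ5 filter (`> 0`) includes everything the RQ6 filter (`> 1`) does, plus the articles with exactly one rating. A small Python sketch of that arithmetic, using only the numbers reported in the two tables (the variable names are illustrative):

```python
# Counts reported for RQ5 (rated by at least one project, unanimous)
rq5 = {"Top": 7991, "High": 42572, "Mid": 231329, "Low": 1795436}
# Counts reported for RQ6 (rated by at least two projects, unanimous)
rq6 = {"Top": 1900, "High": 9905, "Mid": 65365, "Low": 855760}

# Articles rated by exactly one project at each level.
single_project = {level: rq5[level] - rq6[level] for level in rq5}

for level, n in single_project.items():
    share = 100 * n / rq5[level]
    print(f"{level}: {n:,} single-project articles ({share:.1f}% of RQ5)")
```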
RQ7: What is the overlap between ratings?
We investigate this by first creating a confusion matrix of the counts of pairs of ratings, and then creating a confusion matrix with triplets (the question of how many articles span all ratings will be answered in RQ8).
Rating | High | Mid | Low |
---|---|---|---|
Top | 5,347 | 4,224 | 2,402 |
High | | 23,904 | 21,626 |
Mid | | | 177,397 |

Rating pair | Mid | Low |
---|---|---|
Top + High | 2,698 | 1,083 |
High + Mid | | 12,334 |
We find that pairs of ratings are not uncommon among articles that have High- or Mid-importance as their highest rating. 45,530 articles (1.37% of our entire dataset) are rated High-importance as well as one of the two lower ratings, while 177,397 articles (5.3%) are rated both Mid- and Low-importance. Top-importance articles are paired with one of the other ratings more rarely: 11,973 articles in total (0.36%).
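The per-level totals quoted here are row sums of the pair counts. As a quick check, the following Python sketch groups the six reported pair counts by their highest rating (the dictionary layout is just one convenient, illustrative representation):

```python
# Pair counts as reported in the pairs table, keyed by (higher, lower) rating.
pairs = {
    ("Top", "High"): 5347,
    ("Top", "Mid"): 4224,
    ("Top", "Low"): 2402,
    ("High", "Mid"): 23904,
    ("High", "Low"): 21626,
    ("Mid", "Low"): 177397,
}

# Sum the counts per highest rating in the pair.
by_highest = {}
for (highest, _lower), n in pairs.items():
    by_highest[highest] = by_highest.get(highest, 0) + n

total_pairs = sum(pairs.values())
print(by_highest, total_pairs)
```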
RQ8: How many articles have two or more ratings?
I rephrased this question slightly so we measure the number of articles with two, three, and four ratings. Again we discard articles with "unknown" or "NA" ratings. To find the number of pairs and triplets, we sum the numbers from the RQ7 tables. Then we grab the articles with ratings across the board from the dataset:
    > n_pairs = 5347+4224+2402+23904+21626+177397;
    > n_triplets = 2698+1083+12334;
    > n_quads = length(only_articles[n_top > 0 & n_high > 0 & n_mid > 0 & n_low > 0 & n_unknown == 0 & n_na == 0]$talk_page_id);
    > n_pairs
    [1] 234900
    > n_triplets
    [1] 16115
    > n_quads
    [1] 1127
So, 234,900 articles (7.07%) have two (and only two) ratings, 16,115 (0.48%) have three, and 1,127 articles (0.03%) span all ratings. Some examples of articles spanning all four ratings are First Persian invasion of Greece and Second Persian invasion of Greece, which are both in Category:Low-importance Featured topics articles due to their promotion to being part of a featured portal. Perhaps more interesting is the fact that several US presidents show up on this list, for example Franklin Pierce, Ulysses S. Grant, Grover Cleveland, Woodrow Wilson, Dwight D. Eisenhower, Herbert Hoover, and Jimmy Carter. Taking Carter as an example, we find that his article is rated Top-importance because he was a US president, but at the same time rated Low-importance by WikiProject US governors and WikiProject US State Legislatures. As we saw for RQ6, this indicates some of the difficulty of using these importance ratings without further analysis.
Sample of articles with unanimous ratings
Based on the results from RQ6, we're interested in understanding more about the articles that have unanimous ratings from at least two WikiProjects. We therefore randomly sample a dozen articles from each of the four rating categories. Here is the sample we used:
Note that Rostec, Isilkulsky District, and Andrei Ivanovich Gorchakov reveal an issue with the category structure. Those articles are rated by only a single WikiProject (Russia), but because that project's category system also puts them in quality-based categories that are picked up by our schema, we counted each of them as being rated twice. I went back and updated the Python script we use for counting the various ratings, adding a check that detects and removes these quality-based categories. All statistics up to this point have been updated to reflect the new data, and these three Russia-related articles are no longer incorrectly counted (each has only a single project rating it).
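The kind of check described here could look roughly like the sketch below. It is not the actual script: the regular expression, the `importance_categories` helper, and the example category names are all hypothetical, and it assumes that quality grades appear in category names as a `<grade>-Class` token.

```python
import re

# Assumption: a quality-based category contains a "<grade>-Class" token.
QUALITY_RE = re.compile(
    r"\b(FA|FL|A|GA|B|C|Start|Stub|List)-Class\b", re.IGNORECASE
)


def importance_categories(categories):
    """Keep importance categories, dropping any that also encode a
    quality grade (hypothetical rule, for illustration only)."""
    return [
        name
        for name in categories
        if "-importance" in name and not QUALITY_RE.search(name)
    ]


# Hypothetical category names, not taken from the dataset.
cats = [
    "Low-importance Russia articles",
    "Low-importance Stub-Class Russia articles",
    "Stub-Class Russia articles",
]

kept = importance_categories(cats)
print(kept)
```

With a filter of this kind, an article that appears in both a plain importance category and a combined quality-based one is counted once per project rather than twice.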