Research talk:Automated classification of article importance/Work log/2017-03-07

Tuesday, March 7, 2017

Today I will continue the data analysis that I was unable to complete yesterday. For future reference, doing string manipulation in R is not a great idea. I'll write some Python to munge my data and aim to tackle the remaining research questions:

How is the number of ratings distributed?
How many ratings are unanimous?
How many are rated by more than one project and unanimously rated?
What is the overlap between ratings?
How many have more than two ratings?

RQ4: How is the number of ratings distributed?

I decided to insert a new RQ4 as I was interested in understanding what the rating distribution looks like. Unsurprisingly, most articles have only a few ratings, as we can see in the histogram below. The quantiles show the same thing: the median is 2, the 85th percentile is 3, and the 95th percentile is 5.

Histogram of importance ratings

The article with the highest number of ratings is African, Caribbean and Pacific Group of States with 86. Second is Women in Europe with 55.
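As a rough illustration of how the quantiles quoted above can be read off a list of per-article rating counts, here is a minimal Python sketch. The `quantile` helper and the sample list are made up for illustration and are not the real data; the helper uses the nearest-rank definition, whereas R's `quantile()` interpolates by default, so the two can differ slightly on small samples.

```python
import math

def quantile(values, q):
    """Nearest-rank quantile: the smallest value such that at least
    a fraction q of the observations are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

# Made-up number of ratings per article, for illustration only.
ratings_per_article = [1, 1, 1, 2, 2, 2, 2, 3, 3, 5]

# Median, 85th percentile, and 95th percentile of the toy sample.
summary = {q: quantile(ratings_per_article, q) for q in (0.5, 0.85, 0.95)}
print(summary)
```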

We also want to know how many articles have a given rating (Top, High, Mid, Low). Counting the total occurrences of those gives the following table:

Rating	N ratings
Top	53,104
High	210,665
Mid	913,394
Low	4,929,624

Because we count each occurrence of a rating here, the total is much larger than the number of articles in our dataset: as we have already seen, articles frequently have multiple ratings.

RQ5: How many ratings are unanimous?

For this research question, we regard an article as unanimously rated if it is rated by at least one WikiProject and all of its ratings are the same. Note that some articles are categorized as having unknown importance or the importance is "not available" (NA). We remove these articles from our dataset, meaning that we only regard articles where the categorization points to the WikiProjects being in full agreement about the rating.

n_unanimous_1 = data.table(
  rating=c('Top', 'High', 'Mid', 'Low'),
  n_unanimous=c(
    length(only_articles[n_top > 0 & n_high == 0 & n_mid == 0 & n_low == 0 & n_unknown == 0 & n_na == 0]$talk_page_id),
    length(only_articles[n_top == 0 & n_high > 0 & n_mid == 0 & n_low == 0 & n_unknown == 0 & n_na == 0]$talk_page_id),
    length(only_articles[n_top == 0 & n_high == 0 & n_mid > 0 & n_low == 0 & n_unknown == 0 & n_na == 0]$talk_page_id),
    length(only_articles[n_top == 0 & n_high == 0 & n_mid == 0 & n_low > 0 & n_unknown == 0 & n_na == 0]$talk_page_id)
  )
);
n_unanimous_1$rating = ordered(n_unanimous_1$rating,
                               c('Top', 'High', 'Mid', 'Low'));
> n_unanimous_1
   rating n_unanimous
1:    Top        7991
2:   High       42572
3:    Mid      231329
4:    Low     1795436

Rating	N articles
Top	7,991
High	42,572
Mid	231,329
Low	1,795,436
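The filter used in the R snippet above can also be written as a small Python function, which may be easier to reuse from the counting script. This is only a sketch: the function name, the `min_ratings` parameter, and the dictionary layout are assumptions, with keys mirroring the `n_top`, `n_high`, `n_mid`, `n_low`, `n_unknown`, and `n_na` columns.

```python
RATINGS = ("top", "high", "mid", "low")

def unanimous_rating(counts, min_ratings=1):
    """Return the one rating all projects agree on, or None.

    counts maps 'top', 'high', 'mid', 'low', 'unknown', and 'na' to the
    number of WikiProjects that gave the article that rating; missing
    keys count as zero.
    """
    # Articles with unknown or NA importance are excluded entirely.
    if counts.get("unknown", 0) or counts.get("na", 0):
        return None
    used = [r for r in RATINGS if counts.get(r, 0) > 0]
    if len(used) != 1:
        return None  # either no rating at all, or the projects disagree
    rating = used[0]
    return rating if counts[rating] >= min_ratings else None

# Toy examples, not taken from the dataset.
print(unanimous_rating({"top": 1}))              # top
print(unanimous_rating({"top": 1, "low": 1}))    # None
print(unanimous_rating({"low": 3}, min_ratings=2))  # low
```

With `min_ratings=1` this corresponds to the `> 0` conditions used here; `min_ratings=2` corresponds to requiring at least two agreeing projects.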

RQ6: How many are rated by more than one project and unanimously rated?

This RQ only concerns itself with articles that are rated by at least two WikiProjects and where they all agree on the rating. As in RQ5, we remove articles with "unknown" or "NA" importance.

n_unanimous = data.table(
  rating=c('Top', 'High', 'Mid', 'Low'),
  n_unanimous=c(
    length(only_articles[n_top > 1 & n_high == 0 & n_mid == 0 & n_low == 0 & n_unknown == 0 & n_na == 0]$talk_page_id),
    length(only_articles[n_top == 0 & n_high > 1 & n_mid == 0 & n_low == 0 & n_unknown == 0 & n_na == 0]$talk_page_id),
    length(only_articles[n_top == 0 & n_high == 0 & n_mid > 1 & n_low == 0 & n_unknown == 0 & n_na == 0]$talk_page_id),
    length(only_articles[n_top == 0 & n_high == 0 & n_mid == 0 & n_low > 1 & n_unknown == 0 & n_na == 0]$talk_page_id)
  )
);
n_unanimous$rating = ordered(n_unanimous$rating,
                             c('Top', 'High', 'Mid', 'Low'));
> n_unanimous
   rating n_unanimous
1:    Top        1900
2:   High        9905
3:    Mid       65365
4:    Low      855760

Barchart of unanimous importance ratings

Rating	N articles
Top	1,900
High	9,905
Mid	65,365
Low	855,760

From the table it is evident that removing articles rated by a single project drastically lowers the number of articles. For example, over 6,000 articles (RQ5: 7,991, RQ6: 1,900) have a Top-importance rating, but only from a single project. Similarly, we remove about 32,700 High-importance articles (RQ5: 42,572, RQ6: 9,905). We see this as a clear indication that WikiProject assessments are local to a given project, which in turn suggests that using them as global indicators of importance without any kind of filtering is problematic. Counting only ratings that multiple projects agree on reduces the number of articles, which suggests that this might be a useful way of identifying importance at a larger scale (e.g. the edition as a whole).
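The differences discussed above follow directly from the RQ5 and RQ6 tables. A short Python sketch of the arithmetic, using only the counts reported in those two tables (the variable names are mine):

```python
# Unanimously rated articles per rating, copied from the RQ5 and RQ6 tables.
rq5 = {"Top": 7991, "High": 42572, "Mid": 231329, "Low": 1795436}
rq6 = {"Top": 1900, "High": 9905, "Mid": 65365, "Low": 855760}

# Articles that drop out in RQ6, i.e. those rated by a single project only.
single_project = {rating: rq5[rating] - rq6[rating] for rating in rq5}

for rating, n in single_project.items():
    share = n / rq5[rating]
    print(f"{rating}: {n:,} of {rq5[rating]:,} ({share:.1%}) rated by one project")
```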

RQ7: What is the overlap between ratings?

We investigate this by first creating a confusion matrix of the counts of pairs of ratings, and then creating a confusion matrix with triplets (the question of how many articles span all ratings will be answered in RQ8).

	High	Mid	Low
Top	5,347	4,224	2,402
High		23,904	21,626
Mid			177,397

	Mid	Low
Top + High	2,698	1,083
High + Mid		12,334

We find that pairs of ratings are not uncommon among articles that have High- or Mid-importance as their highest rating. 45,530 articles (1.37% of our entire dataset) are rated High-importance as well as one of the other lower ones, while 177,397 articles (5.3%) are rated both Mid- and Low-importance. Top-importance articles are more rarely rated with one of the other ratings, 11,973 articles in total (0.36%).
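One way to tabulate these combinations from per-article counts is sketched below in Python. It assumes that each cell counts articles whose ratings cover exactly the listed levels and nothing else, and that articles with "unknown" or "NA" importance are dropped as in RQ5; the function names and the toy articles are hypothetical.

```python
from collections import Counter

RATINGS = ("top", "high", "mid", "low")

def rating_combination(counts):
    """Return the distinct ratings an article has, ordered Top to Low,
    or None if the article has an unknown or NA importance rating."""
    if counts.get("unknown", 0) or counts.get("na", 0):
        return None
    return tuple(r for r in RATINGS if counts.get(r, 0) > 0)

def tabulate_combinations(articles, size):
    """Count articles whose ratings span exactly `size` distinct levels."""
    table = Counter()
    for counts in articles:
        combo = rating_combination(counts)
        if combo is not None and len(combo) == size:
            table[combo] += 1
    return table

# Toy per-article rating counts, not taken from the dataset.
articles = [
    {"top": 1, "high": 2},
    {"high": 1, "low": 1},
    {"top": 1, "high": 1},
    {"top": 1, "high": 1, "mid": 1},
    {"mid": 1, "low": 1, "na": 1},  # dropped because of the NA rating
    {"low": 2},                     # unanimous, so neither pair nor triplet
]

pairs = tabulate_combinations(articles, 2)
triplets = tabulate_combinations(articles, 3)
print(pairs)
print(triplets)
```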

RQ8: How many articles have two or more ratings?

I rephrased this question slightly so we measure the number of articles with two, three, and four ratings. Again we discard articles with "unknown" or "NA" ratings. To find the number of pairs and triplets, we sum the numbers from the RQ7 tables. Then we grab the articles with ratings across the board from the dataset:

> n_pairs = 5347+4224+2402+23904+21626+177397;
> n_triplets = 2698+1083+12334;
> n_quads = length(only_articles[n_top > 0 & n_high > 0 & n_mid > 0 & n_low > 0 & n_unknown == 0 & n_na == 0]$talk_page_id);
> n_pairs
[1] 234900
> n_triplets
[1] 16115
> n_quads
[1] 1127

So, 234,900 articles (7.07%) have two (and only two) ratings, 16,115 (0.48%) have three, and 1,127 articles (0.03%) span all ratings. Some examples of articles spanning all four ratings are First Persian invasion of Greece and Second Persian invasion of Greece, which are both in Category:Low-importance Featured topics articles due to their promotion to being part of a featured portal. Perhaps more interesting is the fact that several US presidents show up on this list, for example Franklin Pierce, Ulysses S. Grant, Grover Cleveland, Woodrow Wilson, Dwight D. Eisenhower, Herbert Hoover, and Jimmy Carter. Taking Carter as an example, we find that the article is rated Top-importance because he was a US president, but at the same time rated Low-importance by WikiProject US governors and WikiProject US State Legislatures. As we saw for RQ6, this indicates some of the difficulty of using these importance ratings without further analysis.

Sample of articles with unanimous ratings

Based on the results from RQ6, we're interested in understanding more about the articles that have unanimous ratings from at least two WikiProjects. We therefore randomly sample a dozen articles from each of the four rating categories. Here is the sample we used:

Top	High	Mid	Low
Albania–Kosovo relations	Graphyne	Désiré Munyaneza	Tha Hall of Game
Small business	Utamaro	Disseminated superficial actinic porokeratosis	Stephanie Sheh
Massina Empire	Gopinath	Poland at the 1996 Summer Olympics	Department of State Development
Women in the Middle Ages	FedEx	Sun Jianguo	Ridhima Ghosh
Prophecy	Competitor analysis	Orleans Canal	Australian Natives' Association
Cinema of Algeria	The Mind Is a Terrible Thing to Taste	Elliptic curve point multiplication	Badminton railway station
United Kingdom	Rostec	Isilkulsky District	Indefinite detention without trial
Sejm	Distinction	Parque de la Costa	Pyeonyuk
Jean Metzinger	Protein family	Ataxia	Andrei Ivanovich Gorchakov
Political status of Kosovo	Indigenous peoples of the Philippines	AH82	Uroballus henicurus
Gaborone	Origin of replication	Port of Jiaxing	Sabine
Fiduciary	Resident Evil	Owensboro Community and Technical College	Singikat

Note that Rostec, Isilkulsky District, and Andrei Ivanovich Gorchakov reveal an issue with the category structure. These articles are rated by only a single WikiProject (Russia), but because that project's category system also puts them in quality-based categories that are picked up by our schema, we counted each of them as being rated twice. I went back and updated the Python script we use for counting the various ratings, adding a check that detects and removes these quality-based categories. All statistics up to this point have been updated to reflect the new data, and these three Russia-related articles are no longer incorrectly counted (each has only a single project rating it).
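A check of this kind could look roughly like the Python sketch below. It is not the actual script: the regular expressions, the function name, and the example category names are all hypothetical, and they assume that a quality-based category can be recognised by a "-Class" marker in its name.

```python
import re

# Hypothetical patterns: an importance marker such as "Low-importance",
# and a quality marker such as "Start-Class" or "FA-Class".
IMPORTANCE_RE = re.compile(r"\b(Top|High|Mid|Low|Unknown|NA)-importance\b")
QUALITY_RE = re.compile(r"\b(FA|FL|A|GA|B|C|Start|Stub|List)-Class\b")

def importance_rating(category):
    """Return the importance level named by a category, or None if the
    category is quality-based or carries no importance marker."""
    if QUALITY_RE.search(category):
        return None  # skip quality-based categories so they are not double counted
    match = IMPORTANCE_RE.search(category)
    return match.group(1) if match else None

# Hypothetical category names for one article rated by a single project.
categories = [
    "Low-importance Example articles",
    "Start-Class Low-importance Example articles",
    "Start-Class Example articles",
]
ratings = [r for r in map(importance_rating, categories) if r is not None]
print(ratings)  # ['Low']
```

Without the quality check, the second category would also match the importance pattern and the article would appear to have two Low-importance ratings.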
