Research talk:Automated classification of article importance/Work log/2017-04-04

Tuesday, April 4, 2017

Today I will continue my work from yesterday, analyzing the Wikidata-based graph of WPMED articles. The goal for today is to identify key non-default properties used by WPMED to hook articles into the network, and to identify key overarching classes that can be used to categorize Low-importance articles.

Key classes

Key properties

Gephi makes it easy to get a list of properties sorted by how often they are used. In this network we follow every property from an item in our initial dataset, but from there on only follow "instance of", "subclass of", and "part of" relationships. The properties that account for more than 0.1% of the edges (roughly more than 100 edges each) are:

PID Name Proportion
P31 instance of 28.62%
P279 subclass of 22.27%
P106 occupation 4.18%
P1995 medical specialty 3.52%
P21 sex or gender 3.40%
P735 given name 2.83%
P2176 drug used for treatment 2.59%
P361 part of 2.28%
P69 educated at 2.14%
P27 country of citizenship 2.07%
P2293 genetic association 1.92%
P910 topic's main category 1.64%
P19 place of birth 1.49%
P17 country 1.40%
P2175 medical condition treated 1.37%
P1343 described by source 1.36%
P166 award received 1.20%
P463 member of 0.96%
P20 place of death 0.94%
P108 employer 0.91%
P129 physically interacts with 0.67%
P769 significant drug interaction 0.65%
P3489 pregnancy category 0.58%
P780 symptoms 0.51%
P1412 languages spoken, written or signed 0.47%
P828 has cause 0.43%
P364 original language of work 0.39%
P159 headquarters location 0.38%
P123 publisher 0.35%
P138 named after 0.30%
P684 ortholog 0.29%
P171 parent taxon 0.29%
P101 field of work 0.27%
P105 taxon rank 0.27%
P703 found in taxon 0.25%
P734 family name 0.24%
P1542 cause of 0.23%
P131 located in the administrative territorial entity 0.23%
P3780 active ingredient in 0.20%
P688 encodes 0.19%
P689 afflicts 0.17%
P495 country of origin 0.16%
P1196 manner of death 0.16%
P1057 chromosome 0.15%
P2548 strand orientation 0.15%
P360 is a list of 0.13%
P1346 winner 0.13%
P937 work location 0.13%
P40 child 0.12%
P509 cause of death 0.12%
P927 anatomical location 0.12%
P636 route of administration 0.11%
P39 position held 0.11%
P607 conflict 0.11%
P119 place of burial 0.11%
P2643 Carnegie Classification of Institutions of Higher Education 0.11%
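
The traversal rule and the 0.1% cut-off described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used for the table (those counts came from Gephi). The `claims` mapping, the function names, and the `threshold` parameter are all assumptions made for the example; the only details taken from this log are the three property IDs that are followed beyond the first hop and the 0.1% threshold.

```python
from collections import Counter, deque

# Properties followed beyond the first hop: instance of, subclass of, part of.
HIERARCHY_PROPS = {"P31", "P279", "P361"}


def collect_edges(seed_items, claims):
    """Breadth-first walk returning (source, property, target) edges.

    `claims` is a hypothetical mapping: item ID -> list of (property, target).
    Items in the initial dataset follow every property; every other item
    follows only the properties in HIERARCHY_PROPS.
    """
    seeds = list(dict.fromkeys(seed_items))  # drop duplicates, keep order
    seed_set = set(seeds)
    seen = set(seeds)
    queue = deque(seeds)
    edges = []
    while queue:
        item = queue.popleft()
        for prop, target in claims.get(item, []):
            if item not in seed_set and prop not in HIERARCHY_PROPS:
                continue  # non-seed items only extend the class hierarchy
            edges.append((item, prop, target))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return edges


def common_properties(edges, threshold=0.001):
    """Return (property, share of edges) pairs above `threshold`, largest first."""
    counts = Counter(prop for _, prop, _ in edges)
    total = sum(counts.values())
    return [(p, n / total) for p, n in counts.most_common() if n / total > threshold]
```

Calling `common_properties(collect_edges(seed_items, claims))` on the WPMED items would give a list in the same shape as the table, assuming `claims` had been filled from a Wikidata dump or query beforehand.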

I continued pruning the list, then generated a new network and analyzed it. I plan to complete the pruning tomorrow and then analyze the network again. So far, the network seems to cluster into three distinct parts with regard to Low-importance articles: humans, organizations (companies, etc.), and publications (books, scientific journals).
