Research talk:Automated classification of article importance/Work log/2017-04-04
Tuesday, April 4, 2017
Today I will continue my work from yesterday, analyzing the Wikidata-based graph of WPMED articles. The goal for today is to identify key non-default properties used by WPMED to hook articles into the network, and to identify key overarching classes that can be used to categorize Low-importance articles.
Key classes
Key properties
Gephi makes it easy to get a list of properties sorted by how often they are used. In this network we follow every property from the items in our initial dataset, but from there on follow only "instance of", "subclass of", and "part of" relationships. The properties that account for more than 0.1% of the edges (roughly 100 edges or more) are:
Property | Name | Proportion of edges |
---|---|---|
P31 | instance of | 28.62% |
P279 | subclass of | 22.27% |
P106 | occupation | 4.18% |
P1995 | medical specialty | 3.52% |
P21 | sex or gender | 3.40% |
P735 | given name | 2.83% |
P2176 | drug used for treatment | 2.59% |
P361 | part of | 2.28% |
P69 | educated at | 2.14% |
P27 | country of citizenship | 2.07% |
P2293 | genetic association | 1.92% |
P910 | topic's main category | 1.64% |
P19 | place of birth | 1.49% |
P17 | country | 1.40% |
P2175 | medical condition treated | 1.37% |
P1343 | described by source | 1.36% |
P166 | award received | 1.20% |
P463 | member of | 0.96% |
P20 | place of death | 0.94% |
P108 | employer | 0.91% |
P129 | physically interacts with | 0.67% |
P769 | significant drug interaction | 0.65% |
P3489 | pregnancy category | 0.58% |
P780 | symptoms | 0.51% |
P1412 | languages spoken, written or signed | 0.47% |
P828 | has cause | 0.43% |
P364 | original language of work | 0.39% |
P159 | headquarters location | 0.38% |
P123 | publisher | 0.35% |
P138 | named after | 0.30% |
P684 | ortholog | 0.29% |
P171 | parent taxon | 0.29% |
P101 | field of work | 0.27% |
P105 | taxon rank | 0.27% |
P703 | found in taxon | 0.25% |
P734 | family name | 0.24% |
P1542 | cause of | 0.23% |
P131 | located in the administrative territorial entity | 0.23% |
P3780 | active ingredient in | 0.20% |
P688 | encodes | 0.19% |
P689 | afflicts | 0.17% |
P495 | country of origin | 0.16% |
P1196 | manner of death | 0.16% |
P1057 | chromosome | 0.15% |
P2548 | strand orientation | 0.15% |
P360 | is a list of | 0.13% |
P1346 | winner | 0.13% |
P937 | work location | 0.13% |
P40 | child | 0.12% |
P509 | cause of death | 0.12% |
P927 | anatomical location | 0.12% |
P636 | route of administration | 0.11% |
P39 | position held | 0.11% |
P607 | conflict | 0.11% |
P119 | place of burial | 0.11% |
P2643 | Carnegie Classification of Institutions of Higher Education | 0.11% |
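The traversal rule and the 0.1% cut-off behind this table can be sketched in a few lines of Python. This is only an illustration and not the Gephi workflow used here: the `claims` mapping, the function names, and the placeholder item IDs are all hypothetical, and real data would have to come from a Wikidata dump or query.

```python
from collections import Counter, deque

# Properties followed beyond the first hop, as described above:
# instance of, subclass of, part of.
HIERARCHY_PROPS = {"P31", "P279", "P361"}


def build_edges(seeds, claims):
    """Collect (source, property, target) edges.

    `claims` is a hypothetical mapping: item ID -> list of (property, target).
    Every property is followed from a seed item; from any item reached
    afterwards, only HIERARCHY_PROPS are followed.
    """
    seeds = set(seeds)
    edges = []
    visited = set(seeds)
    queue = deque(seeds)
    while queue:
        item = queue.popleft()
        for prop, target in claims.get(item, []):
            if item not in seeds and prop not in HIERARCHY_PROPS:
                continue  # non-hierarchy property on a non-seed item
            edges.append((item, prop, target))
            if target not in visited:
                visited.add(target)
                queue.append(target)
    return edges


def property_shares(edges, min_share=0.001):
    """Return (property, share of edges) pairs above min_share, largest first."""
    counts = Counter(prop for _, prop, _ in edges)
    total = sum(counts.values())
    if total == 0:
        return []
    shares = [(p, n / total) for p, n in counts.items() if n / total > min_share]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


# Toy data with placeholder IDs (not real Wikidata items).
toy_claims = {
    "Q_seed": [("P31", "Q_class"), ("P106", "Q_occupation")],
    "Q_occupation": [("P279", "Q_profession"), ("P17", "Q_country")],
    "Q_class": [("P279", "Q_top")],
}
toy_edges = build_edges(["Q_seed"], toy_claims)
toy_shares = property_shares(toy_edges)
```

In the toy data, the "country" (P17) edge on `Q_occupation` is dropped because that item is not part of the initial dataset, which mirrors how the graph described above stays focused on the class hierarchy after the first hop.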
I continued pruning the list, then generated a new network and analyzed it. I plan to complete the pruning tomorrow and then analyze the network again. So far, the Low-importance articles appear to cluster into three distinct groups: humans, organizations (companies, etc.), and publications (books, scientific journals).