Research talk:Automated classification of article importance/Work log/2017-04-05
Wednesday, April 5, 2017
Today I plan to wrap up the pruning of the properties we use for building the relationship network for WPMED, then analyze the resulting graph and see what we come up with.
Key properties
After going through the properties used in the graph, I identified the most used ones that were related to medicine and specifically not about people or locations. Since I had already been through a couple of iterations of this, not many were left, and there was a reasonable cutoff at 100 items using a given property (just above the cutoff is P2643, Carnegie Classification of Institutions of Higher Education, applied to 120 items; just below it is P2597, Gram staining, with 89 applications).
I think a key way to look at how this applies to the WPMED graph is that we seek properties that describe the items the project focuses on, in order to create a tightly connected network of those items, alongside a more sparsely populated network of less related items. This is particularly the case for WPMED, where we end up with a tight cluster of medicine topics and some smaller clusters of other types of topics (e.g. "humans", "scientific journals").
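The usage cutoff described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `claims` list of (item, property) pairs, the function name `frequent_properties`, and the made-up IDs are all hypothetical, and I assume the cutoff is inclusive. The manual judgement about whether a property is related to medicine is not captured here.

```python
from collections import Counter


def frequent_properties(claims, min_items=100):
    """Return {property: item count} for properties used on at least min_items items."""
    # Deduplicate first so a property is counted once per item,
    # even if the item has several values for it.
    usage = Counter(prop for _item, prop in set(claims))
    return {prop: n for prop, n in usage.items() if n >= min_items}


# Toy (item, property) pairs; the IDs are made up for illustration only.
claims = [
    ("Q1", "P_a"), ("Q2", "P_a"), ("Q3", "P_a"),
    ("Q1", "P_b"), ("Q1", "P_b"),  # duplicate: same item, same property
    ("Q2", "P_b"),
]
kept = frequent_properties(claims, min_items=3)
print(kept)
```

With the real data, `min_items=100` would keep a property applied to 120 items and drop one applied to 89.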
Determining clusters
Gephi has a built-in method for "community detection", which we'll use to identify key clusters associated with Low-importance items. The algorithm has a "resolution" parameter, so we run the algorithm at several resolution settings to see how the parameter affects performance as measured by the "Modularity" and "Modularity with resolution" measures of communities. We're aiming to maximize these, but that might still result in a suboptimal set of communities. Based on my reading of pages related to the issue, there are some improved methods available, but these are not implemented in Gephi.
Resolution | Modularity | Mod. w/res. | N communities |
---|---|---|---|
1.5 | 0.736 | 1.184 | 104 |
1.4 | 0.732 | 1.087 | 105 |
1.3 | 0.735 | 0.994 | 106 |
1.2 | 0.739 | 0.914 | 104 |
1.1 | 0.745 | 0.832 | 110 |
1.0 | 0.740 | 0.740 | 110 |
0.9 | 0.742 | 0.656 | 112 |
0.8 | 0.731 | 0.567 | 115 |
0.7 | 0.732 | 0.491 | 123 |
0.6 | 0.731 | 0.413 | 123 |
0.5 | 0.717 | 0.332 | 136 |
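To make the two measures a bit more concrete, here is a minimal sketch of modularity for an undirected, unweighted graph. With `resolution=1.0` it is the standard Newman modularity. How Gephi defines "Modularity with resolution" is not stated in this log, so the way `resolution` enters the formula below (scaling the within-community term) is an assumption for illustration only. Any such definition reproduces the one thing the table does show, namely that the two measures coincide at resolution 1.0. The function name and the toy graph are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of, resolution=1.0):
    """Modularity of a partition of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs
    community_of: dict mapping node -> community id
    ASSUMPTION: `resolution` scales the within-community term; with
    resolution=1.0 this reduces to standard modularity.
    """
    edges = list(edges)
    m = len(edges)
    internal = defaultdict(int)    # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # summed degree of the community's nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(
        resolution * internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Toy graph: two triangles joined by a single edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
community_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
q_plain = modularity(edges, community_of)
q_scaled = modularity(edges, community_of, resolution=1.5)
print(round(q_plain, 3), round(q_scaled, 3))
```

Putting every node in a single community gives a modularity of zero, which is why a higher value suggests a partition that is better than no partition at all.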
Based on the table, it looks like we get good results with a resolution of 1.1, which has the highest modularity (0.745) and yields 110 communities. Let's inspect those communities in more detail.
Communities
Comm. number | % of nodes | Description |
---|---|---|
39 | 19.27 | Diseases |
90 | 14.13 | Humans |
0 | 10.93 | Pharmaceutical drugs |
25 | 9.97 | Miscellaneous (but also books) |
At this point, I decided to investigate if a slightly lower resolution setting would do better, because I would prefer if "books" were in a different community. I changed the resolution to 0.8 and investigated the new communities:
Comm. number | % of nodes | Description |
---|---|---|
114 | 15.69 | Diseases |
40 | 14.10 | Humans |
77 | 9.28 | Pharmaceutical drugs |
55 | 8.86 | Genes |
2 | 6.7 | Miscellaneous (but also statutes, legislation) |
105 | 4.52 | Miscellaneous medicine |
45 | 4.25 | Miscellaneous (but also software, website) |
75 | 3.83 | Proteins |
64 | 3.7 | Taxa |
I'm not sure I'm happy with this either. Is this community detection something we want to use? Is there perhaps a clustering algorithm that's better?
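For reference, the "% of nodes" column in the community tables above is each community's node count as a share of all nodes in the graph. A minimal sketch of that calculation follows, assuming a hypothetical `community_of` mapping from node to community ID (for example, built from an export of the node table); the function name and toy data are mine.

```python
from collections import Counter


def community_shares(community_of):
    """Return [(community id, % of nodes)], largest community first."""
    sizes = Counter(community_of.values())
    total = sum(sizes.values())
    return [(c, round(100 * n / total, 2)) for c, n in sizes.most_common()]


# Toy membership: 8 nodes in 3 communities.
community_of = {"n1": 0, "n2": 0, "n3": 0, "n4": 1,
                "n5": 2, "n6": 0, "n7": 1, "n8": 0}
shares = community_shares(community_of)
for comm, pct in shares:
    print(comm, pct)
```

Sorting by share makes it easy to compare the largest communities across resolution settings, as done for 1.1 and 0.8 above.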
Majority Low-importance parents
As a first step, I wrote a Python script that iterates through the graph, finds all nodes with at least three neighbours, checks whether a majority of each node's rated articles are Low-importance, and writes out a sorted list of those nodes. If we keep the more obviously non-core categories, we get the sorted table below. "Low prop" is the proportion of Low-importance articles amongst the rated articles, rounded to two decimal places; "N articles" counts only the rated ones.
QID | Label | N articles | Low prop |
---|---|---|---|
Q5 | human | 3,809 | 1.00 |
Q43229 | organization | 696 | 0.97 |
Q5633421 | scientific journal | 420 | 0.98 |
Q4830453 | business enterprise | 264 | 0.97 |
Q494230 | medical school | 135 | 0.97 |
Q571 | book | 99 | 0.92 |
Q3918 | university | 87 | 1.00 |
Q163740 | nonprofit organization | 83 | 0.98 |
Q327333 | government agency | 81 | 0.89 |
Q31855 | research institute | 69 | 0.81 |
Q16917 | hospital | 68 | 0.75 |
Q1002697 | periodical literature | 51 | 0.98 |
Q17524420 | aspect of history | 49 | 0.67 |
Q10729872 | health association | 36 | 0.97 |
Q708676 | charitable organization | 32 | 0.97 |
Q618779 | award | 30 | 0.97 |
Q7397 | software | 30 | 0.97 |
Q35127 | website | 27 | 0.89 |
Q6954197 | NHS trust | 22 | 0.95 |
Q23002054 | private not-for-profit educational institution | 19 | 1.00 |
Q17362920 | Wikimedia duplicated page | 19 | 0.53 |
Q476068 | Act of Congress | 19 | 1.00 |
Q2334719 | legal case | 19 | 1.00 |
Q157031 | foundation | 16 | 1.00 |
Q6498663 | fire department | 16 | 1.00 |
Q19869268 | medical society | 14 | 1.00 |
Q11424 | film | 14 | 0.86 |
Q11000047 | health system | 14 | 0.64 |
Q1110684 | professional association | 14 | 0.93 |
Q6954187 | NHS foundation trust | 12 | 1.00 |
Q189004 | college | 12 | 1.00 |
Q4677783 | Act of Parliament of the United Kingdom | 11 | 0.82 |
Q4260475 | medical facility | 11 | 0.64 |
Q176799 | military unit | 11 | 1.00 |
Q341 | free software | 10 | 1.00 |
Q5398426 | television series | 10 | 0.90 |
Q3914 | school | 10 | 1.00 |
Q5691113 | health organization | 10 | 0.90 |
Q33506 | museum | 9 | 1.00 |
Q2385804 | educational institution | 9 | 0.67 |
Q484652 | international organization | 9 | 0.89 |
Q23002039 | public educational institution of the United States | 9 | 0.89 |
Q2558684 | world day | 9 | 1.00 |
Q41298 | magazine | 9 | 1.00 |
Q7094076 | online database | 9 | 1.00 |
Q502074 | heliport | 8 | 1.00 |
Q1664720 | institute | 7 | 1.00 |
Q483242 | laboratory | 7 | 0.86 |
Q21538537 | medical database | 7 | 1.00 |
Q1774587 | hospital network | 7 | 1.00 |
Q7075 | library | 7 | 1.00 |
Q17072837 | medical college in India | 6 | 1.00 |
Q41176 | building | 6 | 0.84 |
Q820655 | statute | 6 | 0.84 |
Q618123 | geographical object | 5 | 1.00 |
Q811979 | architectural structure | 5 | 0.80 |
Q79913 | non-governmental organization | 5 | 1.00 |
Q7725634 | literary work | 5 | 1.00 |
Q46970 | airline | 5 | 1.00 |
Q48204 | voluntary association | 5 | 0.80 |
Q737498 | academic journal | 5 | 1.00 |
Q8513 | database | 4 | 0.75 |
Q1519799 | Ministry of Health | 4 | 1.00 |
Q38723 | higher education institution | 4 | 1.00 |
Q180958 | faculty | 4 | 0.75 |
Q16334295 | group of humans | 4 | 0.75 |
Q11266439 | Wikimedia template | 4 | 0.75 |
Q183816 | master's degree | 4 | 0.75 |
Q87167 | manuscript | 4 | 1.00 |
Q431603 | advocacy group | 4 | 1.00 |
Q11448906 | science award | 4 | 1.00 |
Q748019 | scientific society | 4 | 1.00 |
Q47574 | unit of measurement | 4 | 0.75 |
Q4287745 | medical organization | 4 | 1.00 |
Q16026109 | technologist | 3 | 1.00 |
Q15416 | television program | 3 | 1.00 |
Q8016240 | trial | 3 | 1.00 |
Q3305213 | painting | 3 | 0.67 |
Q1194970 | dot-com company | 3 | 1.00 |
Q2772772 | military museum | 3 | 1.00 |
Q178790 | trade union | 3 | 1.00 |
Q18574946 | annual event | 3 | 1.00 |
Q694554 | emergency telephone number | 3 | 0.67 |
Q7653906 | social insurance | 3 | 1.00 |
Q811430 | construction | 3 | 1.00 |
Q506240 | television film | 3 | 0.67 |
Q1774898 | clinic | 3 | 0.67 |
Q9078534 | honor society | 3 | 0.67 |
Q18534571 | medical research centre | 3 | 0.67 |
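The script itself is not reproduced in this log. The following is a minimal sketch of the steps described at the top of this section, not the actual script. The adjacency dictionary `neighbours`, the `rating` dictionary, the function name, and the toy data are all hypothetical, and I assume "majority" means strictly more than half of the rated neighbours.

```python
def low_importance_parents(neighbours, rating, min_neighbours=3):
    """Find nodes whose rated neighbours are mostly Low-importance.

    neighbours: dict mapping node -> set of neighbouring nodes (hypothetical layout)
    rating: dict mapping article node -> importance rating; unrated nodes are absent
    Returns (node, n_rated, low_prop) tuples sorted by n_rated, descending.
    """
    rows = []
    for node, adjacent in neighbours.items():
        if len(adjacent) < min_neighbours:
            continue  # too few neighbours to say anything useful
        rated = [rating[n] for n in adjacent if n in rating]
        if not rated:
            continue  # no rated articles attached to this node
        low_prop = sum(r == "Low" for r in rated) / len(rated)
        if low_prop > 0.5:  # ASSUMPTION: strict majority
            rows.append((node, len(rated), round(low_prop, 2)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Toy data for illustration only.
neighbours = {
    "journal": {"a1", "a2", "a3", "a4"},
    "disease": {"a5", "a6", "a7"},
    "tiny": {"a1", "a2"},
}
rating = {"a1": "Low", "a2": "Low", "a3": "Low", "a4": "Mid",
          "a5": "High", "a6": "Low", "a7": "Top"}
parents = low_importance_parents(neighbours, rating)
print(parents)
```

In the toy data, "tiny" is skipped for having fewer than three neighbours and "disease" is dropped because only one of its three rated articles is Low-importance.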