
Research talk:Automated classification of article importance/Work log/2017-04-05

From Meta, a Wikimedia project coordination wiki

Wednesday, April 5, 2017


Today I plan to wrap up the pruning of the properties we use for building the relationship network for WPMED, then analyze the resulting graph and see what we come up with.

Key properties


After having gone through the properties used in the graph, I identified the most used ones that were related to medicine, and specifically unrelated to people or locations. Since I had been through a couple of iterations on this, there were not that many left, and there was a reasonable cutoff at 100 items using a given property (above it is P2643, Carnegie Classification of Institutions of Higher Education, applied to 120 items, and below is P2597, Gram staining, with 89 applications).
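The usage cutoff can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used: the function name `frequent_properties` and the `(item, property)` pair input are assumptions, and the manual step of judging whether a property is related to medicine is not covered.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def frequent_properties(
    statements: Iterable[Tuple[str, str]], min_items: int = 100
) -> List[Tuple[str, int]]:
    """Return (property, item count) pairs used on at least `min_items` items.

    `statements` is a hypothetical input format: one (item QID, property PID)
    pair per statement in the graph.
    """
    items_by_property: Dict[str, Set[str]] = defaultdict(set)
    for item, prop in statements:
        # Use a set so an item with several values for one property counts once.
        items_by_property[prop].add(item)

    counts = [(prop, len(items)) for prop, items in items_by_property.items()]
    kept = [(prop, n) for prop, n in counts if n >= min_items]
    # Most used first, ties broken by property ID for a stable order.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept
```

With a cutoff of 100, a property applied to 120 items is kept and one applied to 89 items is dropped, which matches the two properties on either side of the cutoff.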

I think a key way to look at how this applies to the WPMED graph is that we seek properties that describe the items the project focuses on, in order to create a tightly connected network of those items alongside a more sparsely populated network of less related items. This is particularly the case for WPMED, where we end up with a tight cluster of medicine topics and some smaller clusters of other types of topics (e.g. "humans", "scientific journals").

Determining clusters


Gephi has a built-in method for "community detection", which we'll use to identify key clusters associated with Low-importance items. The algorithm has a "resolution" parameter, so we run it at several resolution settings to see how each affects the quality of the partition, as measured by the "Modularity" and "Modularity with resolution" measures of communities. We're aiming to maximize these, but that might result in a suboptimal set of communities. Based on my reading of pages related to the issue, there are some improved methods available, but these are not implemented in Gephi.

Resolution  Modularity  Modularity with resolution  N communities
1.5         0.736       1.184                       104
1.4         0.732       1.087                       105
1.3         0.735       0.994                       106
1.2         0.739       0.914                       104
1.1         0.745       0.832                       110
1.0         0.740       0.740                       110
0.9         0.742       0.656                       112
0.8         0.731       0.567                       115
0.7         0.732       0.491                       123
0.6         0.731       0.413                       123
0.5         0.717       0.332                       136

Based on this, it looks like we get the best results with a resolution of 1.1, which has the highest modularity in the table (0.745) and yields 110 communities. Let's inspect those communities in more detail.
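To make the "Modularity" column concrete, here is a minimal sketch of the standard Newman modularity for an undirected, unweighted graph, given a partition that has already been computed. It is an illustration under assumptions rather than Gephi's implementation: the function name and input format are hypothetical, edge weights are ignored, and Gephi's "Modularity with resolution" variant is not reproduced.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def modularity(
    edges: Iterable[Tuple[Hashable, Hashable]],
    community_of: Dict[Hashable, Hashable],
) -> float:
    """Newman modularity Q = sum over communities of L_c/m - (d_c/(2m))^2.

    L_c is the number of edges inside community c, d_c is the total degree
    of its nodes, and m is the number of edges in the graph.
    """
    m = 0
    internal = defaultdict(int)  # edges with both ends in the community
    degree = defaultdict(int)    # summed node degree per community
    for u, v in edges:
        m += 1
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    if m == 0:
        return 0.0
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph: two triangles joined by a single bridge edge.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}
```

On the toy graph, putting each triangle in its own community scores higher than putting every node into one community, which is the behaviour the resolution sweep is trying to exploit.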

Communities

Community number  % of nodes  Description
39                19.27       Diseases
90                14.13       Humans
0                 10.93       Pharmaceutical drugs
25                9.97        Miscellaneous (but also books)

At this point, I decided to investigate if a slightly lower resolution setting would do better, because I would prefer if "books" were in a different community. I changed the resolution to 0.8 and investigated the new communities:

Community number  % of nodes  Description
114               15.69       Diseases
40                14.10       Humans
77                9.28        Pharmaceutical drugs
55                8.86        Genes
2                 6.7         Miscellaneous (but also statutes, legislation)
105               4.52        Miscellaneous medicine
45                4.25        Miscellaneous (but also software, website)
75                3.83        Proteins
64                3.7         Taxons

I'm not sure I'm happy with this either. Is this community detection something we want to use? Is there perhaps a clustering algorithm that's better?

Majority Low-importance parents


As a first step, I wrote a Python script that iterates through the graph, finds all nodes with at least three neighbours, checks whether a majority of each node's rated articles are Low-importance, and then writes out a sorted list of the nodes that qualify. If we keep the more obviously non-core categories, we get the sorted table below. "Low prop" is the proportion of Low-importance articles amongst the rated articles, rounded to two decimal places. "N articles" only counts the rated ones.

QID Label N articles Low prop
Q5 human 3,809 1.00
Q43229 organization 696 0.97
Q5633421 scientific journal 420 0.98
Q4830453 business enterprise 264 0.97
Q494230 medical school 135 0.97
Q571 book 99 0.92
Q3918 university 87 1.00
Q163740 nonprofit organization 83 0.98
Q327333 government agency 81 0.89
Q31855 research institute 69 0.81
Q16917 hospital 68 0.75
Q1002697 periodical literature 51 0.98
Q17524420 aspect of history 49 0.67
Q10729872 health association 36 0.97
Q708676 charitable organization 32 0.97
Q618779 award 30 0.97
Q7397 software 30 0.97
Q35127 website 27 0.89
Q6954197 NHS trust 22 0.95
Q23002054 private not-for-profit educational institution 19 1.00
Q17362920 Wikimedia duplicated page 19 0.53
Q476068 Act of Congress 19 1.00
Q2334719 legal case 19 1.00
Q157031 foundation 16 1.00
Q6498663 fire department 16 1.00
Q19869268 medical society 14 1.00
Q11424 film 14 0.86
Q11000047 health system 14 0.64
Q1110684 professional association 14 0.93
Q6954187 NHS foundation trust 12 1.00
Q189004 college 12 1.00
Q4677783 Act of Parliament of the United Kingdom 11 0.82
Q4260475 medical facility 11 0.64
Q176799 military unit 11 1.00
Q341 free software 10 1.00
Q5398426 television series 10 0.90
Q3914 school 10 1.00
Q5691113 health organization 10 0.90
Q33506 museum 9 1.00
Q2385804 educational institution 9 0.67
Q484652 international organization 9 0.89
Q23002039 public educational institution of the United States 9 0.89
Q2558684 world day 9 1.00
Q41298 magazine 9 1.00
Q7094076 online database 9 1.00
Q502074 heliport 8 1.00
Q1664720 institute 7 1.00
Q483242 laboratory 7 0.86
Q21538537 medical database 7 1.00
Q1774587 hospital network 7 1.00
Q7075 library 7 1.00
Q17072837 medical college in India 6 1.00
Q41176 building 6 0.84
Q820655 statute 6 0.84
Q618123 geographical object 5 1.00
Q811979 architectural structure 5 0.80
Q79913 non-governmental organization 5 1.00
Q7725634 literary work 5 1.00
Q46970 airline 5 1.00
Q48204 voluntary association 5 0.80
Q737498 academic journal 5 1.00
Q8513 database 4 0.75
Q1519799 Ministry of Health 4 1.00
Q38723 higher education institution 4 1.00
Q180958 faculty 4 0.75
Q16334295 group of humans 4 0.75
Q11266439 Wikimedia template 4 0.75
Q183816 master's degree 4 0.75
Q87167 manuscript 4 1.00
Q431603 advocacy group 4 1.00
Q11448906 science award 4 1.00
Q748019 scientific society 4 1.00
Q47574 unit of measurement 4 0.75
Q4287745 medical organization 4 1.00
Q16026109 technologist 3 1.00
Q15416 television program 3 1.00
Q8016240 trial 3 1.00
Q3305213 painting 3 0.67
Q1194970 dot-com company 3 1.00
Q2772772 military museum 3 1.00
Q178790 trade union 3 1.00
Q18574946 annual event 3 1.00
Q694554 emergency telephone number 3 0.67
Q7653906 social insurance 3 1.00
Q811430 construction 3 1.00
Q506240 television film 3 0.67
Q1774898 clinic 3 0.67
Q9078534 honor society 3 0.67
Q18534571 medical research centre 3 0.67
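The filtering step described at the start of this section could look roughly like the sketch below. It is not the original script: the function name, the adjacency-dictionary input, and the rating labels are assumptions, "majority" is taken to mean a proportion strictly above 0.5, and sorting by the number of rated articles is inferred from the order of the table.

```python
from typing import Dict, List, Optional, Set, Tuple


def majority_low_parents(
    neighbours: Dict[str, Set[str]],
    importance: Dict[str, Optional[str]],
    min_neighbours: int = 3,
) -> List[Tuple[str, int, float]]:
    """Return (node, N rated articles, Low proportion) for majority-Low nodes.

    `neighbours` maps a node to its adjacent nodes; `importance` maps an
    article to its rating ("Low", "Mid", ...) or None when it is unrated.
    Both structures are hypothetical stand-ins for the real graph data.
    """
    rows = []
    for node, adjacent in neighbours.items():
        if len(adjacent) < min_neighbours:
            continue
        # Only rated articles count towards N and the proportion.
        ratings = [importance[n] for n in adjacent
                   if importance.get(n) is not None]
        if not ratings:
            continue
        low_prop = sum(1 for r in ratings if r == "Low") / len(ratings)
        if low_prop > 0.5:
            rows.append((node, len(ratings), round(low_prop, 2)))
    # Largest groups first, ties broken by node ID for a stable order.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows
```

Requiring a strict majority means a node whose rated articles are split evenly between Low and other ratings is left out. Whether the original script treated ties the same way is not stated.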