Research talk:Automated classification of article importance/Work log/2017-03-30
Thursday, March 30, 2017
Today I'll continue working on the categorization of WPMED articles. First I'll continue my analysis of the most common instances, then I will start looking into whether we can somehow categorize those that are not instances of anything but have a Wikidata page.
WPMED categorization
Yesterday I gathered data on the "instance of" property of WPMED articles, and found that 54.4% of the WPMED dataset did not have this property. Digging around a bit, I found Wikidata's help on basic membership properties, which explains the three key ones: instance of (P31), subclass of (P279), and part of (P361). I therefore rewrote my Python script so that it could gather data on all three of these.
I find that 12,368 articles (42.1%) have none of these three properties set. While that is still a large number, it is about 3,000 fewer than if we just look at "instance of".
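The membership check described above can be sketched roughly as follows. This is not the script used for the numbers in this log; it is a minimal illustration that assumes each item has already been fetched as a dict in Wikidata's JSON entity format, where `claims` maps property IDs to lists of statements. The function names and the sample data are made up for the example.

```python
# Wikidata property IDs for the three basic membership properties.
MEMBERSHIP_PROPERTIES = {
    "P31": "instance of",
    "P279": "subclass of",
    "P361": "part of",
}


def membership_properties(entity):
    """Return the membership property IDs that have at least one statement."""
    # Some serializations give an empty list instead of an empty dict.
    claims = entity.get("claims") or {}
    return {pid for pid in MEMBERSHIP_PROPERTIES if claims.get(pid)}


def split_by_membership(entities):
    """Split a {QID: entity} mapping into QIDs with and without membership data."""
    with_membership, without_membership = [], []
    for qid, entity in entities.items():
        if membership_properties(entity):
            with_membership.append(qid)
        else:
            without_membership.append(qid)
    return with_membership, without_membership


# Hand-made sample (assumption): QIDs and statement bodies are placeholders.
sample_entities = {
    "Q1": {"claims": {"P31": [{"id": "s1"}], "P646": [{"id": "s2"}]}},
    "Q2": {"claims": {"P279": [{"id": "s3"}]}},
    "Q3": {"claims": {"P646": [{"id": "s4"}]}},
    "Q4": {"claims": []},
}

have, lack = split_by_membership(sample_entities)
share = 100 * len(lack) / len(sample_entities)
print(f"{len(lack)} of {len(sample_entities)} items ({share:.1f}%) "
      "have none of the three properties")
```

An item counts as categorized as soon as any one of the three properties has a statement, which is why looking at all three leaves fewer uncategorized items than "instance of" alone.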
The next question is whether the remaining articles have other properties on Wikidata that can help us categorize them. I decided to sample 250 of them to see if I could find some patterns. I stopped after checking about 30, because the vast majority didn't contain any information apart from links to the Wikipedia articles.
I therefore wrote a Python script that would go through all the 12,368 articles and store only those that have at least one claim or property in Wikidata. There are 4,854 articles (39.2%) that do, leaving us with 7,514 articles for which we cannot learn anything from Wikidata. It might be that we can use Wikipedia's category structure for those, something I'll look into later.
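A filter along these lines might look like the sketch below. It is an illustration rather than the script referred to above, and it again assumes the items are available as dicts in Wikidata's JSON entity format; `has_any_claim`, `keep_entities_with_claims`, and the sample items are hypothetical.

```python
def has_any_claim(entity):
    """True if the item has at least one property with at least one statement."""
    claims = entity.get("claims") or {}
    return any(statements for statements in claims.values())


def keep_entities_with_claims(entities):
    """Keep only the items from a {QID: entity} mapping that carry any claim."""
    return {qid: e for qid, e in entities.items() if has_any_claim(e)}


# Hand-made sample (assumption): three items with only sitelinks, one with a claim.
uncategorized = {
    "Q10": {"sitelinks": {"enwiki": {"title": "Example A"}}, "claims": {}},
    "Q11": {"sitelinks": {"enwiki": {"title": "Example B"}},
            "claims": {"P494": [{"id": "s1"}]}},
    "Q12": {"sitelinks": {"enwiki": {"title": "Example C"}}},
    "Q13": {"sitelinks": {"enwiki": {"title": "Example D"}}, "claims": []},
}

kept = keep_entities_with_claims(uncategorized)
kept_share = 100 * len(kept) / len(uncategorized)
print(f"{len(kept)} of {len(uncategorized)} items ({kept_share:.1f}%) have at least one claim")
```

Items that carry only sitelinks, or whose `claims` field is missing or empty, are dropped; these correspond to the articles for which Wikidata offers nothing beyond the link to Wikipedia.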
Across these 4,854 articles, 207 distinct claims/properties are used. I wrote a one-liner to get their Wikidata IDs, then used my previously written labelling script to get their labels, and finally wrote a short Python script to count and sort the claim/property usage and write it out with the labels. There are 25 properties that are used more than 100 times:
ID | Label | Number of uses |
---|---|---|
P646 | Freebase ID | 2,489 |
P373 | Commons category | 1,067 |
P3827 | JSTOR topic ID | 928 |
P494 | ICD-10 | 722 |
P3417 | Quora topic ID | 670 |
P1995 | medical specialty | 645 |
P493 | ICD-9 | 631 |
P910 | topic's main category | 488 |
P557 | DiseasesDB | 428 |
P17 | country | 366 |
P673 | eMedicine | 259 |
P492 | OMIM ID | 227 |
P856 | official website | 225 |
P1343 | described by source | 224 |
P508 | BNCF Thesaurus | 208 |
P486 | MeSH ID | 179 |
P1705 | native label | 160 |
P604 | MedlinePlus ID | 149 |
P571 | inception | 147 |
P18 | image | 140 |
P1402 | Foundational Model of Anatomy ID | 139 |
P227 | GND ID | 128 |
P349 | NDLAuth ID | 122 |
P1323 | Terminologia Anatomica 98 | 116 |
P625 | coordinate location | 115 |
While some of these are not helpful (e.g. the Freebase ID property), some point to various indexes of medical information (e.g. P494, ICD-10, or P1995, medical specialty), suggesting that we can use some of these to identify things that are in scope of WPMED.
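For reference, the counting and sorting step behind the table can be sketched as below. This is a minimal illustration, not the script that produced the table. It assumes items in Wikidata's JSON entity format and a ready-made `labels` mapping from property ID to label, and it counts a property once per item that uses it; whether the figures in the table count items or individual statements is not stated in this log, so that choice is an assumption.

```python
from collections import Counter


def count_property_usage(entities):
    """Count, for each property ID, how many items have a statement for it."""
    usage = Counter()
    for entity in entities.values():
        claims = entity.get("claims") or {}
        usage.update(pid for pid, statements in claims.items() if statements)
    return usage


def format_usage_table(usage, labels, more_than=0):
    """Render properties used more than `more_than` times, most used first."""
    lines = ["ID | Label | Number of uses |", "---|---|---|"]
    # Sort by descending count, then by property ID to make ties deterministic.
    for pid, n in sorted(usage.items(), key=lambda item: (-item[1], item[0])):
        if n > more_than:
            lines.append(f"{pid} | {labels.get(pid, '(no label)')} | {n:,} |")
    return "\n".join(lines)


# Hand-made sample (assumption): placeholder items, real property IDs and labels.
items = {
    "Q1": {"claims": {"P646": [{"id": "a"}], "P494": [{"id": "b"}]}},
    "Q2": {"claims": {"P646": [{"id": "c"}], "P1995": [{"id": "d"}]}},
    "Q3": {"claims": {"P646": [{"id": "e"}], "P494": [{"id": "f"}]}},
}
labels = {"P646": "Freebase ID", "P494": "ICD-10", "P1995": "medical specialty"}

usage = count_property_usage(items)
table = format_usage_table(usage, labels, more_than=1)
print(table)
```

With the real data, `more_than=100` would correspond to the cut-off used for the table above.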