Research talk:Automated classification of article importance/Work log/2017-04-03
Monday, April 3, 2017
Today I'll continue my work on categorization and WPMED articles, looking into whether a clustering algorithm based on Wikidata relationships is the way to go. I'll also start investigating other WikiProjects to see whether this type of categorical labelling exists elsewhere.
Wikidata clustering
We're interested in identifying key overarching properties of articles in WPMED. So far we've mostly approached this by looking at specific outgoing connections from these articles, using the standard types of relationships between Wikidata items ("instance of", "subclass of" and "part of"). However, we also know from studying our WPMED articles that some of them have none of those relationships, but might have others. The existing examples of using Wikidata to visualize or identify networks all follow a single type of relationship, typically defaulting to "subclass of". While that allows for fast network traversal using, for example, the GAS library supported in the Wikidata Query Service, it will not solve our problem.
The approach we'll be using is instead to do a discovery traversal of the network starting from the WPMED article items. The traversal will terminate when we are no longer discovering any new items. In order to properly terminate, we will probably have to identify key inverse relationships (e.g. "has part" is the inverse of "part of"), or restrict ourselves to specific relationships once we move beyond the WPMED article items. For example, we will use any relationship from a WPMED article item, but restrict ourselves to "instance of", "subclass of", and "part of" for anything beyond those.
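The traversal described above could be sketched roughly as follows. This is a minimal illustration, not the implementation used in this project: `discover`, `get_claims` and the toy data are hypothetical names introduced here, and the claims are held in an in-memory dictionary rather than fetched from Wikidata. The property IDs P31, P279, P361 and P527 are the Wikidata identifiers for "instance of", "subclass of", "part of" and "has part"; `P_other` is a placeholder for any other relationship.

```python
from collections import deque

# Wikidata property IDs: P31 = instance of, P279 = subclass of, P361 = part of
HIERARCHY_PROPS = {"P31", "P279", "P361"}


def discover(seeds, get_claims, restricted=HIERARCHY_PROPS):
    """Breadth-first discovery traversal from a set of seed items.

    seeds: item IDs of the starting items (here: the WPMED article items).
    get_claims: hypothetical callable mapping an item ID to an iterable
        of (property, target item) pairs.
    Seed items may follow any relationship; all other items may only
    follow the relationships in `restricted`.
    """
    seeds = set(seeds)
    visited = set(seeds)
    edges = set()
    queue = deque(seeds)
    while queue:  # stops once no new items are being discovered
        item = queue.popleft()
        for prop, target in get_claims(item):
            if item not in seeds and prop not in restricted:
                continue  # skip e.g. inverse relationships such as "has part"
            edges.add((item, prop, target))
            if target not in visited:
                visited.add(target)
                queue.append(target)
    return visited, edges


# Toy data standing in for Wikidata claims (all item IDs are made up).
TOY_CLAIMS = {
    "A": [("P31", "X"), ("P_other", "Y")],
    "X": [("P279", "Z"), ("P527", "A"), ("P_other", "W")],
    "Y": [("P361", "Z")],
    "Z": [],
}

visited, edges = discover(["A"], lambda item: TOY_CLAIMS.get(item, []))
```

In this toy run, item "W" is never reached because it is only linked through a non-hierarchy relationship from a non-seed item. The `visited` set by itself guarantees termination on a finite graph; the restriction on relationships is what keeps the traversal from spreading across a large part of Wikidata.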
The plan is then to follow these relationships, visualize the network, and see if we can use network algorithms to identify key parent properties (e.g. specific superclasses) for WPMED articles, particularly the Low-importance ones.
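One simple way to look for key parents, given here only as a sketch of one possible approach, is to count how many article items can reach each ancestor through the discovered relationships. The network algorithm has not been chosen yet, so this reachability count is an assumption made for illustration. `ancestor_coverage` and the toy edges are hypothetical, with edges given as (source, property, target) triples.

```python
from collections import Counter, defaultdict


def ancestor_coverage(seeds, edges):
    """Count, for each ancestor item, how many seed items can reach it.

    seeds: item IDs of the article items.
    edges: iterable of (source, property, target) triples.
    """
    parents = defaultdict(set)
    for source, _prop, target in edges:
        parents[source].add(target)

    coverage = Counter()
    for seed in seeds:
        seen = set()
        stack = [seed]
        while stack:  # depth-first walk up the hierarchy
            node = stack.pop()
            for parent in parents.get(node, ()):
                if parent != seed and parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        coverage.update(seen)  # each seed counts once per ancestor
    return coverage


# Toy edges (all item IDs are made up).
TOY_EDGES = [
    ("A", "P31", "X"),
    ("B", "P31", "X"),
    ("X", "P279", "Z"),
    ("C", "P361", "Y"),
]

coverage = ancestor_coverage(["A", "B", "C"], TOY_EDGES)
```

Ancestors with high coverage among the Low-importance articles, and low coverage elsewhere, would be candidates for the kind of key parent properties we are looking for.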