Research talk:Automated classification of article importance/Work log/2017-05-16

From Meta, a Wikimedia project coordination wiki

Tuesday, May 16, 2017

Today I'll write up a summary of the reading I did yesterday, learn more about TensorFlow, and start programming the article view rate pipeline.

Reading report

Leila was kind enough to direct me to User:The Land/Thinking about the impact of the Wikimedia movement. The essay discusses impact and importance of contributions, and posits that these are topics that the community has so far dodged. In addition to discussing various ways of measuring impact (e.g. amount of content, reach, importance), it also discusses secondary aspects such as the community, partnerships and technology. There are six proposed measures for impact:

  1. Amount of content
  2. Reach
  3. Importance
  4. Diversity and content gaps
  5. Number of other resources accessible on a subject
  6. Quality of content

Some of these relate closely to the current project. The essay proposes and discusses these measures, but it does not go into detail about how to define them. For example, it mainly discusses the extremes of importance (e.g. the article about the current US president is more important than the article about MN state highway 371). Either way, the essay is useful for the current project because it shows that some of the ideas guiding the project are also found in the community.

The essay has a section for further reading, where it links to A new metric for Wikimedia, a Signpost op-ed by Denny Vrandečić. In this op-ed, Denny proposes an alternative metric the community can use, because the number of articles, unique visitors, pageviews, and active editors only go so far. One of the key points in the op-ed is that what one has access to is limited, for example by what languages one can read, or by the fact that one doesn't have Internet access at home. Because impact is measured as the area under the curve of shared knowledge, there are multiple ways of having an impact. We can for example reach a large group of people by providing them access to Wikipedia, even though they might not get access to a huge amount of knowledge. In other words: moving the bar a little bit upwards but widely. Other contributions might move the bar a lot but in a narrower band (e.g. provide a large amount of information to few people). In the end, is this something we can measure? A research project was started here on Meta, but it has since seen little activity. Also related is Dario's presentation from Wikimania 2014 about metrics.

From the discussions around these two writings, I also came across an interesting discussion about the Foundation's mission versus that of other organizations. Lastly, I read Meet the world’s most powerful doctor: Bill Gates (linked from the talk page of The Land's essay). While much of the article on Gates is about his influence (or lack thereof) on the WHO, it also discusses, albeit briefly, the question of whether the priority should be measurable projects (e.g. the fight against polio) or infrastructure building (e.g. medical facilities). The Land's essay brings up both of these in some way: we can measure impact, for example, through whether significant quality changes have been made to articles with a large audience, but we can also focus on infrastructure by building technology and supporting the communities.

I also read two research papers from WWW 2017:

  • Singer, P., Lemmerich, F., West, R., Zia, L., Wulczyn, E., Strohmaier, M., and Leskovec, J. "Why We Read Wikipedia"
  • Dimitrov, D., Singer, P., Lemmerich, F., and Strohmaier, M. "What Makes a Link Successful on Wikipedia?"

The Singer et al. paper is in some ways similar to Lehmann et al., although the more recent paper is a mixed-methods paper with access to arguably better data. Using surveys, they identify and validate a taxonomy of Wikipedia readership with three main dimensions: motivation, information need, and prior knowledge. They then use log data to describe how different log patterns correspond to different categories along the three dimensions of the taxonomy.

The Dimitrov et al. paper is perhaps more closely related to the current project. Using the 2015 clickstream dataset, a corresponding dump, and renderings of pages, they identify which links are more likely to be followed. They apply mixed-effects hurdle models to find what predicts link following, and their results suggest that links from the top and left side of the screen, those that move from the core to the periphery of the network, and those that lead to semantically similar articles are more successful. They also modify PageRank using this type of information and compare its correlation with article view data, finding that incorporating their findings leads to a statistically significantly higher correlation. Their work raises many questions, but it is nonetheless a solid reference we can use in the current work with regard to how we use clickstream data to improve our models.
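
To make the idea of a click-weighted PageRank concrete for my own notes, here is a minimal sketch in Python. It is not the method from the Dimitrov et al. paper. It only assumes that each link carries a non-negative weight (for instance a click count from clickstream data) and that a random surfer follows links in proportion to those weights. The function names, the toy graph, and the view counts are all hypothetical and exist only for illustration. The rank correlation is a plain Spearman coefficient computed on average ranks.

```python
def weighted_pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank where links[source][target] is a
    non-negative weight (e.g. a hypothetical click count)."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the teleportation share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for source in nodes:
            targets = links.get(source, {})
            total = sum(targets.values())
            if total == 0:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[source] / n
                for node in nodes:
                    new_rank[node] += share
            else:
                # Follow each link in proportion to its weight.
                for target, weight in targets.items():
                    new_rank[target] += damping * rank[source] * weight / total
        rank = new_rank
    return rank


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (hypothetical): the same link graph, once with uniform weights
# and once with made-up click counts, plus made-up article view counts.
uniform_links = {"A": {"B": 1, "C": 1}, "B": {"C": 1}, "C": {"A": 1}}
click_links = {"A": {"B": 9, "C": 1}, "B": {"C": 1}, "C": {"A": 1}}
views = {"A": 500, "B": 300, "C": 450}

for label, graph in (("uniform", uniform_links), ("clicks", click_links)):
    ranks = weighted_pagerank(graph)
    articles = sorted(ranks)
    rho = spearman([ranks[a] for a in articles], [views[a] for a in articles])
    print(label, {a: round(ranks[a], 3) for a in articles}, "rho =", round(rho, 3))
```

With real data, the weights would come from the clickstream dataset and the view counts from the article view rate pipeline. A library implementation would be preferable to this hand-rolled loop for graphs of realistic size.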