Research talk:Reading time/Work log/2018-10-02
Monday, October 1, 2018 / Tuesday, October 2, 2018
Investigate the 40 second period error DONE
It turns out that the 40 second intervals were due to a bug in computing the deltas, so they are not a problem.
Made cleaner plots of discrepancies
This chart shows the distribution of discrepancies between event timestamps and measured dwell times on Wikipedia. The 40 second intervals are positive (except around 0) and appear to decay exponentially. The discrepancies compare the time between server log events and the total length of time recorded by the browser. The discrepancies between visible length and the timestamps are similar.
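As a rough illustration of the quantity being plotted, here is a minimal sketch of one way to compute a single discrepancy. The function name, the argument names, the millisecond unit for the client timer, and the sign convention (server elapsed time minus client-measured time) are all assumptions made for the example, not the actual schema or analysis code.

```python
from datetime import datetime


def discrepancy_seconds(first_event, last_event, client_total_ms):
    """Server-side elapsed time minus client-measured time, in seconds.

    Assumed sign convention: a positive value means the gap between
    server log events is longer than the time the browser recorded.
    """
    server_elapsed = (last_event - first_event).total_seconds()
    return server_elapsed - client_total_ms / 1000.0


# Hypothetical page view: events logged 45 s apart, browser recorded 42.5 s.
example = discrepancy_seconds(
    datetime(2018, 10, 1, 12, 0, 0),
    datetime(2018, 10, 1, 12, 0, 45),
    42_500,
)
```

Under this convention, a negative value would correspond to a client timer that measured more time than the log events allow.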
Look by IP block
When grouping by IPv4 block, there are not any obvious discrepancies. When comparing IPv4 to IPv6 it becomes clear that most of the errors are coming from IPv4.
TODO: Do these as a proportion of all events in the group.
- Look by Geolocation (Mountain View, Redmond, Country, region, city)

Variation by country in the proportion of client timers that measure more time than the log events suggest is possible. The y axis is the average proportion of views where times on the server logs are shorter than the client side timers. The x axis shows country codes. There doesn't appear to be much variation. The countries at the high and low end have smaller sample sizes.
- Inform engineering of findings
- Maybe fallback to Webrequest table if we need more information
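The per-group proportion used in the geolocation chart (and asked for in the IP block TODO) can be sketched as below. This is only an illustration: the row layout `(group, server_seconds, client_seconds)` and the function name are hypothetical, and the sample rows are made up. The group size is returned alongside the proportion because groups with few views give noisy estimates.

```python
from collections import defaultdict


def proportion_client_exceeds(rows):
    """Map each group to (proportion, n).

    rows: iterable of (group, server_seconds, client_seconds) tuples.
    proportion is the share of views in the group where the client
    timer is longer than the time between server log events.
    """
    totals = defaultdict(int)
    exceeds = defaultdict(int)
    for group, server_s, client_s in rows:
        totals[group] += 1
        if client_s > server_s:
            exceeds[group] += 1
    return {g: (exceeds[g] / totals[g], totals[g]) for g in totals}


# Made-up sample rows keyed by country code.
sample = [
    ("US", 30.0, 28.0),
    ("US", 10.0, 12.0),
    ("DE", 5.0, 4.0),
    ("DE", 8.0, 7.5),
]
proportions = proportion_client_exceeds(sample)
```

The same function works for IP blocks by passing the block as the group key instead of the country code.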
Improve Workflow
- Check Jupyter notebooks.
- Integrate queries into notebook.
- Use https://yarn.wikimedia.org/cluster/scheduler to monitor Hive queries.
- Try out R in Jupyter notebooks.
Filtering data for analysis
- Exclude bots and spiders.