Research talk:Reading time/Work log/2018-11-01
Add topicThursday, November 1, 2018
[edit]Joining readingdepth data with mediawiki history
[edit]I had some trouble merging reading depth with the history of revisions. The initial problem is that the reading depth schema does not have revision_id. It only has page_title. This meant that to find the version of the page that a reader viewed we have to join on page_title and then filter to find the nearest previous revision. My initial attempt was a hive query that used a window function and a join on page. This approach proved intractable. I got some help from jaol who started an approach using a window function with a range join in SparkSQL. His query wasn't perfect and I re-wrote it pyspark and tweaked it until it was correct.
My version of the query
[edit]dt = r.table("nathante.readingDataModel_Stage1")
.select(['*', f.expr("translate(event.pagetitle,' ','_') AS page_title")])
.filter(f.col("year") > 0)
dt = df.filter(dt.event['namespaceid'] == 0)
hist = r.table("wmf.mediawiki_history")
.select([f.expr('wiki_db AS hist_wiki'),
f.expr('page_title_historical AS hist_page_title'),
f.expr("unix_timestamp(event_timestamp, 'yyyy-MM-dd HH:mm:ss.S') AS hist_current_ts"),
'revision_text_bytes',
'revision_text_bytes_diff',
'page_id'])
.filter((f.col("snapshot") == "2018-09") &
(f.col("event_entity")=="revision") &
(f.col("event_type")=="create") &
(f.col("page_id") > 0) &
(f.length(f.col("page_title_historical")) > 0))
title_win = Window.partitionBy(["hist_page_title",'hist_wiki']).orderBy("hist_current_ts")
# 1520881200 is the beginning of the reading depth data
# 4102444800 is the year 2100
hist = hist.withColumn("hist_next_ts",f.lead(f.col("hist_current_ts"),1,4102444800).over(title_win))
.filter(f.col("hist_next_ts") > 1520881200)
join_cond = [dt.wiki == hist.hist_wiki,
dt.page_title == hist.hist_page_title,
dt.unix_timestamp_2 >= hist.hist_current_ts,
dt.unix_timestamp_2 < hist.hist_next_ts]
dt2 = dt.join(hist,on=join_cond,how='left_outer')
dt2.write.partitionBy(['year','month']).mode("overwrite").saveAsTable("nathante.tableReadingDataModel_Stage2")
Views where we can't find a page
[edit]There are still about 0.5% of reading events where we can't find a corresponding page in wmf.history
. From spot checks these appear to be red links. However, I'm not totally sure that some are not red links. If we have page_id
or if we add a bit indicating if the view is on a red link to the reading depth schema then we could verify this more easily. I created a [https://phabricator.wikimedia.org/T208478 bug for this.