Research talk:Reading time/Work log/2018-10-28

Sunday, October 28, 2018

Checking off some minor tasks

Modeling stuff

  • Gave up on using Spark for the very complex queries; it's just too brittle. I'm rewriting the parts that don't work in Hive.


Troubleshooting Slow Hive Query

I made nathante.readingDataModel_Stage1 using Spark. This was a pretty simple process without any joins.

I wrote the following complex query based on Spark code that ran pretty fast but failed when I ran it on the full dataset (I think due to memory issues). In Hive, however, it seems pretty slow: reducers have been running for > 1.5 hours.

@Tbayer (WMF) Any ideas?

I know the LEFT OUTER JOIN isn't necessary (an INNER JOIN would be fine), but I want it for now so I can inspect any nulls (see the check sketched after the query).

CREATE TABLE nathante.readingDataModel_Stage2 AS
SELECT * FROM (
    SELECT *, MIN(timedelta) OVER (PARTITION BY pagetoken) AS min_timedelta FROM (
        SELECT *, dt.unix_timestamp_2 - unix_timestamp(hist.event_timestamp, 'yyyy-MM-dd HH:mm:ss.S') AS timedelta
        FROM
            (SELECT *, translate(event.pagetitle, ' ', '_') AS page_title
             FROM nathante.readingDataModel_Stage1
             WHERE year > 0) AS dt
        LEFT OUTER JOIN
            (SELECT wiki_db, event_entity, event_timestamp, page_title AS hist_page_title, revision_text_bytes
             FROM wmf.mediawiki_history
             WHERE snapshot = '2018-09' AND event_entity = 'revision' AND event_type = 'create') AS hist
        ON dt.page_title = hist.hist_page_title AND dt.wiki = hist.wiki_db
    ) AS merged
) merged_2
WHERE timedelta = min_timedelta
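
For the null inspection, a quick follow-up sketch (assuming the stage-2 table builds and keeps the columns from the mediawiki_history side of the join):

-- Hypothetical check: reading events with no matching revision surface as
-- NULLs in the hist columns of the LEFT OUTER JOIN
SELECT wiki, COUNT(*) AS n_unmatched
FROM nathante.readingDataModel_Stage2
WHERE hist_page_title IS NULL
GROUP BY wiki;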

— The preceding unsigned comment was added by groceryheist (talk) 06:38, 29 October 2018 (UTC)

Not sure how much it will improve performance, but one probably needs to filter by event entity and event type, cf. https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history#Important_Fields . Regards, Tbayer (WMF) (talk) 12:43, 29 October 2018 (UTC)
Modified the query to do the filtering. I'll try again in nice mode. Groceryheist (talk) 16:24, 29 October 2018 (UTC)
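
For context, "nice mode" means submitting the job to the cluster's low-priority YARN queue. A minimal sketch of one way to do that from a Hive session, with the queue name assumed rather than confirmed:

-- Hypothetical: route this session's jobs to the low-priority queue
-- before re-running the CREATE TABLE statement
SET mapreduce.job.queuename=nice;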

User:Joalpe came up with this Spark query:

CREATE TABLE nathante.readingDataModel_Stage2 STORED AS PARQUET AS
WITH

rdm1 AS (
    SELECT
        *,
        translate(event.pagetitle,' ','_') AS page_title
    FROM nathante.readingDataModel_Stage1
    WHERE year > 0
),

hist AS (
    SELECT
        wiki_db AS hist_wiki,
        page_title_historical AS hist_page_title,
        unix_timestamp(event_timestamp, 'yyyy-MM-dd HH:mm:ss.S') AS hist_current_ts,
        -- Get the next event timestamp (or a fake big one if none)
        LEAD(event_timestamp, 1, '2100-01-01 00:00:00')
          OVER (PARTITION BY page_id ORDER BY event_timestamp) AS hist_next_ts_str,
        revision_text_bytes,
        revision_text_bytes_diff
    FROM wmf.mediawiki_history
    WHERE snapshot = '2018-09'
        -- Consider revision events
        AND event_entity = 'revision'
        AND event_type = 'create'
        -- Don't consider pages that have no id or no page_title_historical
        AND page_id > 0
        AND LENGTH(page_title_historical) > 0
)

SELECT
  *
FROM rdm1
LEFT OUTER JOIN (
    -- Only consider rows whose next event timestamp is after the
    -- beginning of the experiment
    SELECT
        *,
        unix_timestamp(hist_next_ts_str, 'yyyy-MM-dd HH:mm:ss.S') AS hist_next_ts
    FROM hist
    WHERE unix_timestamp(hist_next_ts_str, 'yyyy-MM-dd HH:mm:ss.S') > 1520881200
) h
    ON rdm1.wiki = h.hist_wiki
      AND rdm1.page_title = h.hist_page_title
      AND rdm1.unix_timestamp_2 >= h.hist_current_ts
      AND rdm1.unix_timestamp_2 < h.hist_next_ts


Spark supports range joins, which make it possible to create a 1:1 mapping between reading events and revisions: each reading event joins only to the revision whose [hist_current_ts, hist_next_ts) interval contains its timestamp.
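
One way to sanity-check that the mapping really is 1:1 (a sketch, assuming pagetoken uniquely identifies a reading event in Stage 1):

-- Hypothetical check: a pagetoken matched by more than one revision would
-- indicate overlapping [hist_current_ts, hist_next_ts) intervals
SELECT pagetoken, COUNT(*) AS n_matches
FROM nathante.readingDataModel_Stage2
GROUP BY pagetoken
HAVING COUNT(*) > 1;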