User:MPopov (WMF)/Notes/Querying JSON-containing data
2.7.50309-r-2020-01-08 | 2.7.50322-r-2020-06-08 |
---|---|
{
"add-caption" : {
"cancels" : 0,
"clicks" : 0,
"failures" : 0,
"impressions" : 0,
"successes" : 0,
"suggestions_clicked" : 0
},
"add-description" : {
"cancels" : 0,
"clicks" : 1,
"failures" : 0,
"impressions" : 0,
"successes" : 1,
"suggestions_clicked" : 1
},
"translate-caption" : {
"cancels" : 0,
"clicks" : 0,
"failures" : 0,
"impressions" : 0,
"successes" : 0,
"suggestions_clicked" : 0
}
}
|
{
"a-c" : {
"cxl" : 0,
"clk" : 1,
"fl" : 0,
"imp" : 7,
"suc" : 1,
"sg" : 1
},
"a-d" : {
"cxl" : 2,
"clk" : 2,
"fl" : 0,
"imp" : 7,
"suc" : 0,
"sg" : 1
},
"t-c" : {
"cxl" : 0,
"clk" : 0,
"fl" : 0,
"imp" : 9,
"suc" : 1,
"sg" : 0
}
}
|
These are my notes for querying JSON-containing event data with Presto, which can be easily used in Superset's SQL Lab. HiveQL's library of UDFs has very limited support[1] for JSON data while Presto has a lot of functions.[2]
Data
[edit]For this example we will be working with the schema MobileWikiAppSuggestedEdits which has a field edit_tasks
that the Suggested Edits instrument in the Wikipedia Android app populates with a JSON string containing counts of suggestions seen ("impressions"), clicked ("unique clicks"), and edited ("successes") split by edit task – the type of suggestions (image captions to add/translate, article descriptions to add/translate). Two examples of this data can be found in the table on the right.
Note: because this field is specified as a free-form string in the legacy EventLogging schema, the instrument can put whatever in there – including changing the names of the fields between versions without changing the schema's revision. In a modern Event Platform schema, that's certainly possible but it would be better to define the edit_tasks
field as an object with sub-fields, so that the event data can be automatically made into a map in the table.[3]
Query in Hive
[edit]
First, the JSON string is processed via a user-defined table generating function (UDTF) JSON_TUPLE
and then it needs to be used in conjunction with Lateral View, which turns the fields into tables and makes those fields' sub-fields available as columns:
SELECT
SUBSTR(dt, 1, 10) AS `date`, event.source,
SUM(COALESCE(caption_addition.successes, 0L)) AS captions_added,
SUM(COALESCE(description_addition.successes, 0L)) AS descriptions_added
FROM mobilewikiappsuggestededits
LATERAL VIEW
JSON_TUPLE(event.edit_tasks, 'a-d', 't-d', 'a-c', 't-c', 'i-t') edit_tasks
AS add_desc, translate_desc, add_cap, translate_cap, image_tags
LATERAL VIEW
JSON_TUPLE(edit_tasks.add_desc, 'imp', 'clk', 'sg', 'cxl', 'suc', 'fl') description_addition
AS impressions, total_clicks, suggestions_clicked, cancels, successes, failures
LATERAL VIEW
JSON_TUPLE(edit_tasks.add_cap, 'imp', 'clk', 'sg', 'cxl', 'suc', 'fl') caption_addition
AS impressions, total_clicks, suggestions_clicked, cancels, successes, failures
WHERE year = 2020 AND month = 6 AND day > 14 AND day < 24
AND useragent.wmf_app_version = '2.7.50322-r-2020-06-08'
GROUP BY SUBSTR(dt, 1, 10), event.source
ORDER BY `date`
LIMIT 1000;
Note: While there is a GET_JSON_OBJECT
to retrieve a single key's value, this solution is more efficient than calling GET_JSON_OBJECT
multiple times.
date | source | captions_added | descriptions_added |
---|---|---|---|
2020-06-15 | suggestedEdits | 9 | 87 |
2020-06-15 | feed | 1 | 22 |
2020-06-16 | feed | 5 | 37 |
2020-06-16 | suggestedEdits | 16 | 100 |
2020-06-17 | feed | 3 | 47 |
2020-06-17 | suggestedEdits | 22 | 119 |
2020-06-18 | feed | 1 | 26 |
2020-06-18 | suggestedEdits | 17 | 166 |
2020-06-19 | feed | 0 | 39 |
2020-06-19 | suggestedEdits | 14 | 148 |
2020-06-20 | feed | 3 | 30 |
2020-06-20 | suggestedEdits | 20 | 74 |
2020-06-21 | feed | 2 | 24 |
2020-06-21 | suggestedEdits | 26 | 86 |
2020-06-22 | feed | 7 | 28 |
2020-06-22 | suggestedEdits | 16 | 69 |
2020-06-23 | feed | 4 | 30 |
2020-06-23 | suggestedEdits | 19 | 88 |
Query in Presto
[edit]
The process is vaguely similar to Hive. First, we need to process the JSON string and only then can we operate on it. But unlike Hive's UDTF which generates a table and requires LATERAL VIEW
, we can cast the JSON data into a map of maps. Presto has extensive set of functions for working with maps.[4]
WITH parsed_json AS (
SELECT
SUBSTR(dt, 1, 10) AS "date", event.source,
CAST(JSON_PARSE(event.edit_tasks) AS MAP(VARCHAR, MAP(VARCHAR, INTEGER))) AS edit_tasks
FROM mobilewikiappsuggestededits
WHERE year = 2020 AND month = 6 AND day > 14 AND day < 24
AND event.edit_tasks != '{}'
AND useragent.wmf_app_version = '2.7.50322-r-2020-06-08' -- note: new ver replaces 'add-description' with 'a-d' and 'successes' with 'suc'
)
SELECT
"date", source,
SUM(COALESCE(ELEMENT_AT(ELEMENT_AT(edit_tasks, 'a-c'), 'suc'), 0)) AS captions_added,
SUM(COALESCE(ELEMENT_AT(ELEMENT_AT(edit_tasks, 'a-d'), 'suc'), 0)) AS descriptions_added
FROM parsed_json
GROUP BY "date", source
ORDER BY "date", source;
Note: had the instrument not omitted the fields and we could rely on the data in edit_tasks
to be complete (even if the values were all 0s) and use underscores as separators, we would be able to use the JSON_EXTRACT
function in Presto with a JSONPath like '$.a_c.suc'
to retrieve the exact values from the parsed JSON without converting it to a map.
date | source | captions_added | descriptions_added |
---|---|---|---|
2020-06-15 | feed | 1 | 22 |
2020-06-15 | suggestedEdits | 9 | 87 |
2020-06-16 | feed | 5 | 37 |
2020-06-16 | suggestedEdits | 16 | 100 |
2020-06-17 | feed | 3 | 47 |
2020-06-17 | suggestedEdits | 22 | 119 |
2020-06-18 | feed | 1 | 26 |
2020-06-18 | suggestedEdits | 17 | 166 |
2020-06-19 | feed | 0 | 39 |
2020-06-19 | suggestedEdits | 14 | 148 |
2020-06-20 | feed | 3 | 30 |
2020-06-20 | suggestedEdits | 20 | 74 |
2020-06-21 | feed | 2 | 24 |
2020-06-21 | suggestedEdits | 26 | 86 |
2020-06-22 | feed | 7 | 28 |
2020-06-22 | suggestedEdits | 16 | 69 |
2020-06-23 | feed | 4 | 30 |
2020-06-23 | suggestedEdits | 19 | 88 |
Reducing maps and arrays
[edit]Suppose we're interested in counting the total number of interactions. The edit_tasks
map looks like
{"i-t": {"suc": 0, "clk": 0, "sg": 0, "cxl": 0, "fl": 0, "imp": 5}, "a-c": {"suc": 0, "clk": 0, "sg": 0, "cxl": 0, "fl": 0, "imp": 2}, "a-d": {"suc": 0, "clk": 0, "sg": 0, "cxl": 0, "fl": 0, "imp": 3}, "t-d": {"suc": 0, "clk": 0, "sg": 0, "cxl": 0, "fl": 0, "imp": 2}, "t-c": {"suc": 0, "clk": 0, "sg": 0, "cxl": 0, "fl": 0, "imp": 6}}
And we want to sum across all of those to yield a single number: 18. Here's how we can accomplish this in Presto:
WITH parsed_json AS (
SELECT
SUBSTR(dt, 1, 10) AS "date",
CAST(JSON_PARSE(event.edit_tasks) AS MAP(VARCHAR, MAP(VARCHAR, INTEGER))) AS edit_tasks
FROM mobilewikiappsuggestededits
WHERE year = 2020 AND month = 9 AND day = 24 AND hour = 12
AND event.edit_tasks != '{}'
AND useragent.wmf_app_version >= '2.7.50322-r-2020-06-08' -- note: new ver replaces 'add-description' with 'a-d' and 'successes' with 'suc'
LIMIT 100
)
SELECT
date, edit_tasks,
-- REDUCE(MAP_VALUES(TRANSFORM_VALUES(edit_tasks, (k, v) -> REDUCE(MAP_VALUES(v), 0, (s, x) -> s + x, s -> s))), 0, (s, x) -> s + x,s -> s) AS total_actions
-- ^ Deconstructed:
REDUCE(
MAP_VALUES( -- Second: Now that we have a map like { "i-t": 4, ..., "a-c": 0 }, let's extract those counts into an array [4, ..., 0]
TRANSFORM_VALUES( -- First: We're going to aggregate within each type of task (i-t, a-c, etc.)
edit_tasks,
(k, v) -> REDUCE(MAP_VALUES(v), 0, (s, x) -> s + x, s -> s)) -- this sums across suc, clk, sg, cxl, fl, imp
),
0, (s, x) -> s + x, s -> s -- Third: Reduce/sum that array (e.g. [4, ..., 0]) to give us the total number of actions performed
) AS total_actions
FROM parsed_json;
The above query uses map functions map_values()
and transform_values()
to operate on the map and extract values to arrays so that the array function reduce()
can be used.