User:MPopov (WMF)/Notes/Querying JSON-containing data
2.7.50309-r-2020-01-08 | 2.7.50322-r-2020-06-08 |
---|---|
{
"add-caption" : {
"cancels" : 0,
"clicks" : 0,
"failures" : 0,
"impressions" : 0,
"successes" : 0,
"suggestions_clicked" : 0
},
"add-description" : {
"cancels" : 0,
"clicks" : 1,
"failures" : 0,
"impressions" : 0,
"successes" : 1,
"suggestions_clicked" : 1
},
"translate-caption" : {
"cancels" : 0,
"clicks" : 0,
"failures" : 0,
"impressions" : 0,
"successes" : 0,
"suggestions_clicked" : 0
}
}
|
{
"a-c" : {
"cxl" : 0,
"clk" : 1,
"fl" : 0,
"imp" : 7,
"suc" : 1,
"sg" : 1
},
"a-d" : {
"cxl" : 2,
"clk" : 2,
"fl" : 0,
"imp" : 7,
"suc" : 0,
"sg" : 1
},
"t-c" : {
"cxl" : 0,
"clk" : 0,
"fl" : 0,
"imp" : 9,
"suc" : 1,
"sg" : 0
}
}
|
These are my notes for querying JSON-containing event data with Presto, which can be easily used in Superset's SQL Lab. HiveQL's library of UDFs has very limited support[1] for JSON data while Presto has a lot of functions.[2]
Data
editFor this example we will be working with the schema MobileWikiAppSuggestedEdits which has a field edit_tasks
that the Suggested Edits instrument in the Wikipedia Android app populates with a JSON string containing counts of suggestions seen ("impressions"), clicked ("unique clicks"), and edited ("successes") split by edit task – the type of suggestions (image captions to add/translate, article descriptions to add/translate). Two examples of this data can be found in the table on the right.
Note: because this field is specified as a free-form string in the legacy EventLogging schema, the instrument can put whatever in there – including changing the names of the fields between versions without changing the schema's revision. In a modern Event Platform schema, that's certainly possible but it would be better to define the edit_tasks
field as an object with sub-fields, so that the event data can be automatically made into a map in the table.[3]
Query in Hive
edit
First, the JSON string is processed via a user-defined table generating function (UDTF) JSON_TUPLE
and then it needs to be used in conjunction with Lateral View, which turns the fields into tables and makes those fields' sub-fields available as columns:
SELECT
SUBSTR(dt, 1, 10) AS `date`, event.source,
SUM(COALESCE(caption_addition.successes, 0L)) AS captions_added,
SUM(COALESCE(description_addition.successes, 0L)) AS descriptions_added
FROM mobilewikiappsuggestededits
LATERAL VIEW
JSON_TUPLE(event.edit_tasks, 'a-d', 't-d', 'a-c', 't-c', 'i-t') edit_tasks
AS add_desc, translate_desc, add_cap, translate_cap, image_tags
LATERAL VIEW
JSON_TUPLE(edit_tasks.add_desc, 'imp', 'clk', 'sg', 'cxl', 'suc', 'fl') description_addition
AS impressions, total_clicks, suggestions_clicked, cancels, successes, failures
LATERAL VIEW
JSON_TUPLE(edit_tasks.add_cap, 'imp', 'clk', 'sg', 'cxl', 'suc', 'fl') caption_addition
AS impressions, total_clicks, suggestions_clicked, cancels, successes, failures
WHERE year = 2020 AND month = 6 AND day > 14 AND day < 24
AND useragent.wmf_app_version = '2.7.50322-r-2020-06-08'
GROUP BY SUBSTR(dt, 1, 10), event.source
ORDER BY `date`
LIMIT 1000;
Note: While there is a GET_JSON_OBJECT
to retrieve a single key's value, this solution is more efficient than calling GET_JSON_OBJECT
multiple times.
date | source | captions_added | descriptions_added |
---|---|---|---|
2020-06-15 | suggestedEdits | 9 | 87 |
2020-06-15 | feed | 1 | 22 |
2020-06-16 | feed | 5 | 37 |
2020-06-16 | suggestedEdits | 16 | 100 |
2020-06-17 | feed | 3 | 47 |
2020-06-17 | suggestedEdits | 22 | 119 |
2020-06-18 | feed | 1 | 26 |
2020-06-18 | suggestedEdits | 17 | 166 |
2020-06-19 | feed | 0 | 39 |
2020-06-19 | suggestedEdits | 14 | 148 |
2020-06-20 | feed | 3 | 30 |
2020-06-20 | suggestedEdits | 20 | 74 |
2020-06-21 | feed | 2 | 24 |
2020-06-21 | suggestedEdits | 26 | 86 |
2020-06-22 | feed | 7 | 28 |
2020-06-22 | suggestedEdits | 16 | 69 |
2020-06-23 | feed | 4 | 30 |
2020-06-23 | suggestedEdits | 19 | 88 |
Query in Presto
edit
The process is vaguely similar to Hive. First, we need to process the JSON string and only then can we operate on it. But unlike Hive's UDTF which generates a table and requires LATERAL VIEW
, we can cast the JSON data into a map of maps. Presto has extensive set of functions for working with maps.[4]
WITH parsed_json AS (
SELECT
SUBSTR(dt, 1, 10) AS "date", event.source,
CAST(JSON_PARSE(event.edit_tasks) AS MAP(VARCHAR, MAP(VARCHAR, INTEGER))) AS edit_tasks
FROM mobilewikiappsuggestededits
WHERE year = 2020 AND month = 6 AND day > 14 AND day < 24
AND event.edit_tasks != '{}'
AND useragent.wmf_app_version = '2.7.50322-r-2020-06-08' -- note: new ver replaces 'add-description' with 'a-d' and 'successes' with 'suc'
)
SELECT
"date", source,
SUM(COALESCE(ELEMENT_AT(ELEMENT_AT(edit_tasks, 'a-c'), 'suc'), 0)) AS captions_added,
SUM(COALESCE(ELEMENT_AT(ELEMENT_AT(edit_tasks, 'a-d'), 'suc'), 0)) AS descriptions_added
FROM parsed_json
GROUP BY "date", source
ORDER BY "date", source;
Note: had the instrument not omitted the fields and we could rely on the data in edit_tasks
to be complete (even if the values were all 0s) and use underscores as separators, we would be able to use the JSON_EXTRACT
function in Presto with a JSONPath like '$.a_c.suc'
to retrieve the exact values from the parsed JSON without converting it to a map.
date | source | captions_added | descriptions_added |
---|---|---|---|
2020-06-15 | feed | 1 | 22 |
2020-06-15 | suggestedEdits | 9 | 87 |
2020-06-16 | feed | 5 | 37 |
2020-06-16 | suggestedEdits | 16 | 100 |
2020-06-17 | feed | 3 | 47 |
2020-06-17 | suggestedEdits | 22 | 119 |
2020-06-18 | feed | 1 | 26 |
2020-06-18 | suggestedEdits | 17 | 166 |
2020-06-19 | feed | 0 | 39 |
2020-06-19 | suggestedEdits | 14 | 148 |
2020-06-20 | feed | 3 | 30 |
2020-06-20 | suggestedEdits | 20 | 74 |
2020-06-21 | feed | 2 | 24 |
2020-06-21 | suggestedEdits | 26 | 86 |
2020-06-22 | feed | 7 | 28 |
2020-06-22 | suggestedEdits | 16 | 69 |
2020-06-23 | feed | 4 | 30 |
2020-06-23 | suggestedEdits | 19 | 88 |
Reducing maps and arrays
editSuppose we're interested in counting the total number of interactions. The edit_tasks
map looks like
{"i-t": {"suc": 0, "clk": 0, "sg": 0, "cxl": 0, "fl": 0, "imp": 5}, "a-c": {"suc": 0, "clk": 0, "sg": 0, "cxl": 0, "fl": 0, "imp": 2}, "a-d": {"suc": 0, "clk": 0, "sg": 0, "cxl": 0, "fl": 0, "imp": 3}, "t-d": {"suc": 0, "clk": 0, "sg": 0, "cxl": 0, "fl": 0, "imp": 2}, "t-c": {"suc": 0, "clk": 0, "sg": 0, "cxl": 0, "fl": 0, "imp": 6}}
And we want to sum across all of those to yield a single number: 18. Here's how we can accomplish this in Presto:
WITH parsed_json AS (
SELECT
SUBSTR(dt, 1, 10) AS "date",
CAST(JSON_PARSE(event.edit_tasks) AS MAP(VARCHAR, MAP(VARCHAR, INTEGER))) AS edit_tasks
FROM mobilewikiappsuggestededits
WHERE year = 2020 AND month = 9 AND day = 24 AND hour = 12
AND event.edit_tasks != '{}'
AND useragent.wmf_app_version >= '2.7.50322-r-2020-06-08' -- note: new ver replaces 'add-description' with 'a-d' and 'successes' with 'suc'
LIMIT 100
)
SELECT
date, edit_tasks,
-- REDUCE(MAP_VALUES(TRANSFORM_VALUES(edit_tasks, (k, v) -> REDUCE(MAP_VALUES(v), 0, (s, x) -> s + x, s -> s))), 0, (s, x) -> s + x,s -> s) AS total_actions
-- ^ Deconstructed:
REDUCE(
MAP_VALUES( -- Second: Now that we have a map like { "i-t": 4, ..., "a-c": 0 }, let's extract those counts into an array [4, ..., 0]
TRANSFORM_VALUES( -- First: We're going to aggregate within each type of task (i-t, a-c, etc.)
edit_tasks,
(k, v) -> REDUCE(MAP_VALUES(v), 0, (s, x) -> s + x, s -> s)) -- this sums across suc, clk, sg, cxl, fl, imp
),
0, (s, x) -> s + x, s -> s -- Third: Reduce/sum that array (e.g. [4, ..., 0]) to give us the total number of actions performed
) AS total_actions
FROM parsed_json;
The above query uses map functions map_values()
and transform_values()
to operate on the map and extract values to arrays so that the array function reduce()
can be used.