User:MPopov (WMF)/Notes/Querying JSON-containing data

2.7.50309-r-2020-01-08 2.7.50322-r-2020-06-08

{
	"add-caption" : {
		"cancels" : 0,
		"clicks" : 0,
		"failures" : 0,
		"impressions" : 0,
		"successes" : 0,
		"suggestions_clicked" : 0
	},
	"add-description" : {
		"cancels" : 0,
		"clicks" : 1,
		"failures" : 0,
		"impressions" : 0,
		"successes" : 1,
		"suggestions_clicked" : 1
	},
	"translate-caption" : {
		"cancels" : 0,
		"clicks" : 0,
		"failures" : 0,
		"impressions" : 0,
		"successes" : 0,
		"suggestions_clicked" : 0
	}
}

{
	"a-c" : {
		"cxl" : 0,
		"clk" : 1,
		"fl" : 0,
		"imp" : 7,
		"suc" : 1,
		"sg" : 1
	},
	"a-d" : {
		"cxl" : 2,
		"clk" : 2,
		"fl" : 0,
		"imp" : 7,
		"suc" : 0,
		"sg" : 1
	},
	"t-c" : {
		"cxl" : 0,
		"clk" : 0,
		"fl" : 0,
		"imp" : 9,
		"suc" : 1,
		"sg" : 0
	}
}

These are my notes for querying JSON-containing event data with Presto, which can be easily used in Superset's SQL Lab. HiveQL's library of UDFs has very limited support^[1] for JSON data while Presto has a lot of functions.^[2]

Data

For this example we will be working with the schema MobileWikiAppSuggestedEdits which has a field edit_tasks that the Suggested Edits instrument in the Wikipedia Android app populates with a JSON string containing counts of suggestions seen ("impressions"), clicked ("unique clicks"), and edited ("successes") split by edit task – the type of suggestions (image captions to add/translate, article descriptions to add/translate). Two examples of this data can be found in the table on the right.

Note: because this field is specified as a free-form string in the legacy EventLogging schema, the instrument can put whatever in there – including changing the names of the fields between versions without changing the schema's revision. In a modern Event Platform schema, that's certainly possible but it would be better to define the edit_tasks field as an object with sub-fields, so that the event data can be automatically made into a map in the table.^[3]

Query in Hive

First, the JSON string is processed via a user-defined table generating function (UDTF) JSON_TUPLE and then it needs to be used in conjunction with Lateral View, which turns the fields into tables and makes those fields' sub-fields available as columns:

SELECT
  SUBSTR(dt, 1, 10) AS `date`, event.source,
  SUM(COALESCE(caption_addition.successes, 0L)) AS captions_added,
  SUM(COALESCE(description_addition.successes, 0L)) AS descriptions_added
FROM mobilewikiappsuggestededits
LATERAL VIEW
  JSON_TUPLE(event.edit_tasks, 'a-d', 't-d', 'a-c', 't-c', 'i-t') edit_tasks
  AS add_desc, translate_desc, add_cap, translate_cap, image_tags
LATERAL VIEW
  JSON_TUPLE(edit_tasks.add_desc, 'imp', 'clk', 'sg', 'cxl', 'suc', 'fl') description_addition
  AS impressions, total_clicks, suggestions_clicked, cancels, successes, failures
LATERAL VIEW
  JSON_TUPLE(edit_tasks.add_cap, 'imp', 'clk', 'sg', 'cxl', 'suc', 'fl') caption_addition
  AS impressions, total_clicks, suggestions_clicked, cancels, successes, failures
WHERE year = 2020 AND month = 6 AND day > 14 AND day < 24
  AND useragent.wmf_app_version = '2.7.50322-r-2020-06-08'
GROUP BY SUBSTR(dt, 1, 10), event.source
ORDER BY `date`
LIMIT 1000;

Note: While there is a GET_JSON_OBJECT to retrieve a single key's value, this solution is more efficient than calling GET_JSON_OBJECT multiple times.

The results in 112.011 seconds
date	source	captions_added	descriptions_added
2020-06-15	suggestedEdits	9	87
2020-06-15	feed	1	22
2020-06-16	feed	5	37
2020-06-16	suggestedEdits	16	100
2020-06-17	feed	3	47
2020-06-17	suggestedEdits	22	119
2020-06-18	feed	1	26
2020-06-18	suggestedEdits	17	166
2020-06-19	feed	0	39
2020-06-19	suggestedEdits	14	148
2020-06-20	feed	3	30
2020-06-20	suggestedEdits	20	74
2020-06-21	feed	2	24
2020-06-21	suggestedEdits	26	86
2020-06-22	feed	7	28
2020-06-22	suggestedEdits	16	69
2020-06-23	feed	4	30
2020-06-23	suggestedEdits	19	88

Query in Presto

The process is vaguely similar to Hive. First, we need to process the JSON string and only then can we operate on it. But unlike Hive's UDTF which generates a table and requires LATERAL VIEW, we can cast the JSON data into a map of maps. Presto has extensive set of functions for working with maps.^[4]

WITH parsed_json AS (
    SELECT
        SUBSTR(dt, 1, 10) AS "date", event.source,
        CAST(JSON_PARSE(event.edit_tasks) AS MAP(VARCHAR, MAP(VARCHAR, INTEGER))) AS edit_tasks
    FROM mobilewikiappsuggestededits 
    WHERE year = 2020 AND month = 6 AND day > 14 AND day < 24
      AND event.edit_tasks != '{}'
      AND useragent.wmf_app_version = '2.7.50322-r-2020-06-08' -- note: new ver replaces 'add-description' with 'a-d' and 'successes' with 'suc'
)
SELECT
    "date", source,
    SUM(COALESCE(ELEMENT_AT(ELEMENT_AT(edit_tasks, 'a-c'), 'suc'), 0)) AS captions_added,
    SUM(COALESCE(ELEMENT_AT(ELEMENT_AT(edit_tasks, 'a-d'), 'suc'), 0)) AS descriptions_added
FROM parsed_json
GROUP BY "date", source
ORDER BY "date", source;

Note: had the instrument not omitted the fields and we could rely on the data in edit_tasks to be complete (even if the values were all 0s) and use underscores as separators, we would be able to use the JSON_EXTRACT function in Presto with a JSONPath like '$.a_c.suc' to retrieve the exact values from the parsed JSON without converting it to a map.

The results in 3.47 seconds
date	source	captions_added	descriptions_added
2020-06-15	feed	1	22
2020-06-15	suggestedEdits	9	87
2020-06-16	feed	5	37
2020-06-16	suggestedEdits	16	100
2020-06-17	feed	3	47
2020-06-17	suggestedEdits	22	119
2020-06-18	feed	1	26
2020-06-18	suggestedEdits	17	166
2020-06-19	feed	0	39
2020-06-19	suggestedEdits	14	148
2020-06-20	feed	3	30
2020-06-20	suggestedEdits	20	74
2020-06-21	feed	2	24
2020-06-21	suggestedEdits	26	86
2020-06-22	feed	7	28
2020-06-22	suggestedEdits	16	69
2020-06-23	feed	4	30
2020-06-23	suggestedEdits	19	88

Reducing maps and arrays

Suppose we're interested in counting the total number of interactions. The edit_tasks map looks like

{"i-t": {"suc": 0, "clk": 0, "sg": 0, "cxl": 0, "fl": 0, "imp": 5}, "a-c": {"suc": 0, "clk": 0, "sg": 0, "cxl": 0, "fl": 0, "imp": 2}, "a-d": {"suc": 0, "clk": 0, "sg": 0, "cxl": 0, "fl": 0, "imp": 3}, "t-d": {"suc": 0, "clk": 0, "sg": 0, "cxl": 0, "fl": 0, "imp": 2}, "t-c": {"suc": 0, "clk": 0, "sg": 0, "cxl": 0, "fl": 0, "imp": 6}}

And we want to sum across all of those to yield a single number: 18. Here's how we can accomplish this in Presto:

WITH parsed_json AS (
    SELECT
        SUBSTR(dt, 1, 10) AS "date",
        CAST(JSON_PARSE(event.edit_tasks) AS MAP(VARCHAR, MAP(VARCHAR, INTEGER))) AS edit_tasks
    FROM mobilewikiappsuggestededits 
    WHERE year = 2020 AND month = 9 AND day = 24 AND hour = 12
      AND event.edit_tasks != '{}'
      AND useragent.wmf_app_version >= '2.7.50322-r-2020-06-08' -- note: new ver replaces 'add-description' with 'a-d' and 'successes' with 'suc'
    LIMIT 100
)
SELECT
  date, edit_tasks,
  -- REDUCE(MAP_VALUES(TRANSFORM_VALUES(edit_tasks, (k, v) -> REDUCE(MAP_VALUES(v), 0, (s, x) -> s + x, s -> s))), 0, (s, x) -> s + x,s -> s) AS total_actions
  -- ^ Deconstructed:
  REDUCE(
    MAP_VALUES( -- Second: Now that we have a map like { "i-t": 4, ..., "a-c": 0 }, let's extract those counts into an array [4, ..., 0]
      TRANSFORM_VALUES( -- First: We're going to aggregate within each type of task (i-t, a-c, etc.)
        edit_tasks,
        (k, v) -> REDUCE(MAP_VALUES(v), 0, (s, x) -> s + x, s -> s)) -- this sums across suc, clk, sg, cxl, fl, imp
      ),
    0, (s, x) -> s + x, s -> s -- Third: Reduce/sum that array (e.g. [4, ..., 0]) to give us the total number of actions performed
  ) AS total_actions
FROM parsed_json;

The above query uses map functions map_values() and transform_values() to operate on the map and extract values to arrays so that the array function reduce() can be used.

References

[1] ttps://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-json_tuple

[2] ttps://prestodb.io/docs/current/functions/json.html

[3] ttps://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#map_types

[4] ttps://prestodb.io/docs/current/functions/map.html

[1]

[2]

[3]

[4]