Research talk:Reading time/Work log/2018-10-28
Sunday, October 28, 2018
editChecking off some minor tasks
edit- Regenerated cleanReadingDepth tables with updated data and filters
- Updated Schema_talk:ReadingDepth#Events_where_unloaded_event_is_logged_before_the_loaded_event with a plot showing periodic spikes.
Modeling stuff
edit- Gave up on using spark for doing very complex queries. It's just too brittle. I'm rewriting the parts that don't work in Hive.
Troubleshooting Slow Hive Query
editI made nathante.readingDataModel_Stage1
using spark. This was a pretty simple process without any joins.
I wrote the following complex query based on spark code that was running pretty fast, but was also failing when I ran it on the full dataset (I think due memory issues). In Hive however, it seems pretty slow! Reducers are running for > 1.5 hours.
@Tbayer (WMF)— Any ideas?
I know the LEFT OUTER JOIN
isn't necessary (INNER
would be fine), but I want it for now so I can inspect any null
s.
CREATE TABLE nathante.readingDataModel_Stage2 AS
SELECT * FROM (
SELECT *, MIN(timedelta) OVER (PARTITION BY pagetoken) AS min_timedelta FROM (
SELECT *, dt.unix_timestamp_2 - unix_timestamp(hist.event_timestamp, "yyyy-MM-dd HH:mm:ss.S") AS timedelta
FROM
(SELECT *, translate(event.pagetitle," ","_") AS page_title FROM nathante.readingDataModel_Stage1 dt WHERE year > 0) AS dt
LEFT OUTER JOIN
(SELECT wiki_db, event_entity, event_timestamp, page_title AS hist_page_title, revision_text_bytes FROM wmf.mediawiki_history WHERE snapshot == "2018-09" AND event_entity == "revision" AND event_type=="create") AS hist
ON dt.page_title == hist.hist_page_title AND dt.wiki==hist.wiki_db) AS merged
) merged_2
WHERE timedelta == min_timedelta
— The preceding unsigned comment was added by groceryheist (talk) 06:38, 29 October 2018 (UTC)
- Not sure how much it will improve performance, but one probably needs to filter by event entity and event type, cf. https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history#Important_Fields . Regards, Tbayer (WMF) (talk) 12:43, 29 October 2018 (UTC)
- Modified the query to do the filtering. I'll try again in nice mode. Groceryheist (talk) 16:24, 29 October 2018 (UTC)
User:Joalpe came up with this Spark query:
CREATE TABLE nathante.readingDataModel_Stage2 STORED AS PARQUET AS
WITH
rdm1 AS (
SELECT
*,
translate(event.pagetitle,' ','_') AS page_title
FROM nathante.readingDataModel_Stage1
WHERE year > 0
),
hist AS (
SELECT
wiki_db AS hist_wiki,
page_title_historical AS hist_page_title,
unix_timestamp(event_timestamp, 'yyyy-MM-dd HH:mm:ss.S') AS hist_current_ts,
-- Get the next event timestamp (or a fake big one if none)
LEAD(event_timestamp, 1, '2100-01-01 00:00:00')
OVER (PARTITION BY page_id ORDER BY event_timestamp) AS hist_next_ts_str,
revision_text_bytes,
revision_text_bytes_diff
FROM wmf.mediawiki_history
WHERE snapshot == "2018-09"
-- Consider revision events
AND event_entity == "revision"
AND event_type=="create"
-- Don't consider page having no id nor page_title_historical
AND page_id > 0
AND LENGTH(page_title_historical) > 0
)
SELECT
*
FROM rdm1
LEFT OUTER JOIN (
-- Only consider events whose next event is after the experiment min ts
SELECT
*,
unix_timestamp(hist_next_ts_str, 'yyyy-MM-dd HH:mm:ss.S') AS hist_next_ts
FROM hist
-- Consider only rows whose next timestamp is after the beginning of the experiment
WHERE unix_timestamp(hist_next_ts_str, 'yyyy-MM-dd HH:mm:ss.S') > 1520881200
) h
ON rdm1.wiki = h.hist_wiki
AND rdm1.page_title = h.hist_page_title
AND rdm1.unix_timestamp_2 >= h.hist_current_ts
AND rdm1.unix_timestamp_2 < h.hist_next_ts
Spark supports range joins which make it possible to create a 1:1 mapping between reading events and revisions.