User:MPopov (WMF)/Notes/ETL User Guide
This page is currently a draft.
This page instructs how to schedule ETL jobs using the Product Analytics jobs repo and Puppet systemd timer. There are other ways to schedule ETL jobs, but this approach exists because the others are currently impractical.
Prerequisites
You need to have production data access and be in the analytics-product-users user group; you can verify membership by reviewing this file. Membership allows you to run commands on the stat boxes as the analytics-product system user.
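If you already have shell access, a quick sanity check is to look for that group in your group list on a stat box (a minimal sketch; the group name comes from the prerequisite above):
id | tr ',' '\n' | grep analytics-product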
You also need a Gerrit account with SSH configured (see the page on advanced usage), and git-review installed.
You need to have clones of the following repos:
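For example, cloning the jobs repo over SSH might look like this (a sketch: the project path is the one referenced in the Puppet snippet in Step 4, and USERNAME is a placeholder for your Gerrit username):
git clone ssh://USERNAME@gerrit.wikimedia.org:29418/analytics/wmf-product/jobs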
Step 1: Create Job
These instructions assume you are using wmfdata and a Jupyter notebook.
Structure
You want the following structure:
jobs/
|- movement_metrics/
|- ...
|- your_job_name/
   |- main.sh
   |- hive/
   |- notebooks/
It is strongly recommended that you also create a CHANGES.md and a README.md and document everything thoroughly.
The notebooks/ directory is for, well, notebook files – they must have the ".ipynb" file extension. If you want your notebooks to be executed in a specific order, name them so that they sort into that order.
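For example, since main.sh (below) executes whatever the shell glob finds in alphabetical order, numeric prefixes work well (hypothetical names):
notebooks/
|- 01_extract.ipynb
|- 02_transform.ipynb
|- 03_load.ipynb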
Create table queries
The hive/ directory is for CREATE TABLE queries. Do not specify the database inside the query; the user will instead specify it when executing the query via the Hive CLI (as in Step 2, where you test against your own database). As a best practice, name the file create_table_name.hql.
For example, to create a table called my_table:
hive -f hive/create_my_table.hql --database your_database
Example query
As you review the query below, pay attention to the style (e.g. spacing, capitalization, every non-obvious column documented).
-- Usage
-- hive -f create_my_table.hql --database wmf_product
CREATE TABLE IF NOT EXISTS `my_table` (
`action` STRING COMMENT 'Type of interaction',
`sessions` BIGINT COMMENT 'Total number of sessions with a specific interaction',
`events` BIGINT COMMENT 'Total number of interactions'
)
PARTITIONED BY (
`year` INT,
`month` INT
)
STORED AS PARQUET;
main.sh
The contents of main.sh will be:
#!/bin/bash
notebooks_dir="/srv/product_analytics/jobs/your_job_name/notebooks"
# Clean up nbconvert notebooks from *previous* run:
rm -f "$notebooks_dir"/*.nbconvert.ipynb
# Activate Anaconda-WMF base environment:
source /usr/lib/anaconda-wmf/bin/activate
for notebook in "$notebooks_dir"/*.ipynb
do
    python -m jupyter nbconvert --ExecutePreprocessor.timeout=None --to notebook --execute "$notebook" || exit 1
done
The only thing you should change is your_job_name in the notebooks_dir path on the second line.
Run the following command to make that shell script executable:
chmod +x main.sh
Permissions
Most jobs will load data into our team's centralized database, wmf_product.
In the last cell of the notebook, update the permissions on your table in HDFS so that it can be read by other users (e.g. in Superset).
import os
os.system("hdfs dfs -chmod -R o+r /user/hive/warehouse/wmf_product.db/my_table")
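To confirm the change took effect, you can list the table's files from a terminal on a stat box (a sketch; adjust the path to your own table):
hdfs dfs -ls -R /user/hive/warehouse/wmf_product.db/my_table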
We can now upload the patch to Gerrit for review.
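A minimal sketch of that upload using git-review, run from the root of your clone of the jobs repo (the branch name and commit message are just examples):
git checkout -b add-your-job-name
git add your_job_name/
git commit -m "Add your_job_name ETL job"
git review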
Step 2: Test Notebook
Before merging you'll need to verify that the ETL actually works. To do that, run the `CREATE TABLE` query using your own database. For example:
hive -f hive/create_my_table.hql --database bearloga
Change all the references to wmf_product in the queries in the notebook(s) to bearloga (but use YOUR database, not mine). Don't forget to adjust that last permissions-fixing cell as well!
Then run ./main.sh to see if all the notebooks can be executed successfully. Afterwards, query all the affected tables to check that the data in them is what you expect.
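For example, a quick spot check from the command line (a sketch; substitute your own database and table):
hive -e "SELECT * FROM my_table LIMIT 10;" --database bearloga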
Step 3: Prepare Database
Now that you've verified the ETL works, it's time to prepare the centralized database. Following the Kerberos user guide, run the following on stat1007.eqiad.wmnet:
sudo -u analytics-product kerberos-run-command analytics-product hive -f hive/create_my_table.hql --database wmf_product
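You can confirm the table was created by describing it, again as the analytics-product user (a sketch):
sudo -u analytics-product kerberos-run-command analytics-product hive -e "DESCRIBE my_table;" --database wmf_product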
Step 4: Scheduling
To schedule the job, you will need to add it to the product_analytics Puppet manifest.
kerberos::systemd_timer { 'product-analytics-your-job-name':
    ensure                    => 'present',
    description               => 'Product Analytics monthly ???? run',
    command                   => "${jobs_dir}/your_job_name/main.sh",
    interval                  => '*-*-1 00:00:00',
    user                      => $user,
    logfile_basedir           => $log_dir,
    logfile_name              => 'your-job-name.log',
    logfile_owner             => $user,
    logfile_group             => $group,
    monitoring_enabled        => true,
    monitoring_contact_groups => 'team-product-analytics',
    syslog_force_stop         => true,
    syslog_identifier         => 'product-analytics-your-job-name',
    slice                     => 'user.slice',
    require                   => [
        Class['::statistics::compute'],
        Git::Clone['analytics/wmf-product/jobs']
    ],
}
The lines to change:
- Line 1: product-analytics-your-job-name
  - Note the use of hyphens instead of underscores
- Line 3: Product Analytics monthly ???? run
  - Change 'monthly' to 'daily' if appropriate
- Line 4: ${jobs_dir}/your_job_name/main.sh
- Line 5: *-*-1 00:00:00
  - The format is DayOfWeek Year-Month-Day Hour:Minute:Second. See https://wiki.archlinux.org/title/Systemd/Timers for examples.
  - Don't forget to have line 3 reflect the frequency of the timer (e.g. monthly, daily)
- Line 8: your-job-name.log
- Line 14: product-analytics-your-job-name
  - This should be the same as line 1
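Before uploading the Puppet patch, you can at least syntax-check the file you edited (a sketch; PATH_TO_MANIFEST is a placeholder for wherever the product_analytics manifest lives in your clone of operations/puppet):
puppet parser validate PATH_TO_MANIFEST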
After jenkins-bot verifies the patch, ping an SRE from Data Engineering for code review. For now, that is only Ben Tullis.
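Once the patch is merged and Puppet has run on the stat host, you can check that the timer is scheduled and look at the job's output (a sketch; the grep pattern assumes the unit is named after the timer's title, and the -t value comes from syslog_identifier):
systemctl list-timers | grep product-analytics-your-job-name
sudo journalctl -t product-analytics-your-job-name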