User:MPopov (WMF)/Notes/Refinery
This page provides a brief introduction to working with Analytics Engineering's Refinery source code for the purpose of a Product Analytics skillshare on UDF development.
Setup
- Required
- Analytics Refinery source repository:
git clone ssh://gerrit.wikimedia.org:29418/analytics/refinery/source
- Java Development Kit (JDK) 8
- macOS & Linux binaries are available from Oracle
- Linux users have the option of OpenJDK:
sudo apt-get install openjdk-8-jdk
- Once installed, I recommend setting the JAVA_HOME environment variable in your ~/.bash_profile or ~/.bashrc (the path below is a macOS example; adjust it to wherever your JDK is installed):
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home
export PATH="$JAVA_HOME/bin:$PATH"
- Apache Maven (Mac with Homebrew: brew install maven; Linux: sudo apt-get install maven)
- Recommended IDE for Java: IntelliJ IDEA (Community Edition)
- IDEA currently fails to recognize some of the sources in the repo. Installing the Apache Avro plugin wouldn't hurt, but it doesn't seem to help either.
- Python users might recognize its maker, JetBrains, as the company behind PyCharm
Basics
First, run mvn package while in the directory where you cloned the repo. This should download all the necessary dependencies into ~/.m2/ and build the refinery source code into binary JARs.
The generated refinery-hive/target/refinery-hive-X.Y.ZZ-SNAPSHOT.jar is what you would import in your Hive query via statements like:
ADD JAR /srv/deployment/analytics/refinery/artifacts/refinery-hive.jar;
ADD JAR hdfs:///wmf/refinery/current/artifacts/refinery-hive.jar;
Refer to Analytics/Systems/Cluster/Hive/QueryUsingUDF for more details.
Development
Importing project into IntelliJ IDEA
Choose Import Project and select the cloned repo directory. Pick Maven under "Import project from external model" and proceed with all the default choices until the project has been imported.
If at any point you're at an SDK selection screen, pick the JDK 8 that you installed earlier.
- If you don't see the JDK 8 that you installed earlier, click + to add it to the list:
- On a Mac, add:
/Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home
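- On a Linux PC, the directory is something like /usr/lib/jvm/java-8-openjdk/ but I suggest running which javac to confirm (if that only shows a symlink such as /usr/bin/javac, readlink -f "$(which javac)" resolves it to the real location)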
- Refer to Working with SDKs and Configuring IntelliJ Platform Plugin SDK for help
IDEA will then index the files and give you an error about the org.wikimedia.analytics.schema symbol not resolving. Ignore it: Nuria and I have no idea how to fix this, as our best bet of installing the Apache Avro plugin didn't work. You can still write code and get all kinds of helpful hints & code completion suggestions in IDEA, and then just test/build in the CLI with mvn package.
Unit Tests
They're good and you should write them.
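To make that a bit more concrete, here is a minimal sketch of the kind of logic a simple UDF wraps, together with the checks you would want in a test. Everything in it is hypothetical: NormalizeHostSketch, its evaluate method, and the hostname-normalizing behavior are made up for illustration and are not part of the refinery repo. To keep it runnable without any dependencies, the class does not extend Hive's UDF base class and uses a small hand-rolled check instead of JUnit.

```java
import java.util.Locale;

// Hypothetical example: this class is NOT part of the refinery repo.
// A real Hive UDF would extend org.apache.hadoop.hive.ql.exec.UDF
// (from the hive-exec dependency) and expose its logic via evaluate().
final class NormalizeHostSketch {

    private NormalizeHostSketch() {}

    // Trims, lower-cases, and strips a leading "www." from a hostname.
    // Returns null for null input, since Hive passes SQL NULL as Java null.
    static String evaluate(String hostname) {
        if (hostname == null) {
            return null;
        }
        String normalized = hostname.trim().toLowerCase(Locale.ROOT);
        if (normalized.startsWith("www.")) {
            normalized = normalized.substring("www.".length());
        }
        return normalized;
    }

    // Tiny stand-in for a JUnit assertion so the sketch runs without dependencies.
    static void expectEquals(String expected, String actual) {
        if (expected == null ? actual != null : !expected.equals(actual)) {
            throw new AssertionError("expected <" + expected + "> but got <" + actual + ">");
        }
    }

    // Each call below corresponds to what would be one test case.
    public static void main(String[] args) {
        expectEquals("en.wikipedia.org", evaluate("  EN.Wikipedia.org "));
        expectEquals("mediawiki.org", evaluate("www.mediawiki.org"));
        expectEquals("", evaluate("   "));
        expectEquals(null, evaluate(null));
        System.out.println("all checks passed");
    }
}
```

In the actual repo you would presumably move these checks into a test class under the module's src/test/java directory (the standard Maven layout), so that mvn package runs them as part of the build. The cases worth covering are the same either way: a typical input, an edge case or two, and NULL.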