Job Environment Setup Cookbook¶
Many jobs have significant external dependencies, both libraries and other source code.
Combining shell syntax with Hadoop’s DistributedCache notation, mrjob’s setup option provides a powerful, dynamic alternative to pre-installing your Hadoop dependencies on every node.
All our mrjob.conf
examples below are for the hadoop
runner,
but these work equally well with the emr
runner. Also, if you are using
EMR, take a look at the EMR Bootstrapping Cookbook.
Uploading your source tree¶
Note
If you’re using mrjob 0.6.4 or later, check out Using other python modules and packages first.
mrjob can automatically tarball your source directory and include it
in your job’s working directory. We can use setup scripts to upload the
directory and then add it to PYTHONPATH
.
Run your job with:
--setup 'export PYTHONPATH=$PYTHONPATH:your-src-code/#'
The /
before the #
tells mrjob that your-src-code
is a
directory. You may optionally include a /
after the #
as well
(e.g. export PYTHONPATH=$PYTHONPATH:your-source-code/#/your-lib
).
If every job you run is going to want to use your-src-code
, you can do
this in your mrjob.conf
:
runners:
hadoop:
setup:
- export PYTHONPATH=$PYTHONPATH:your-src-code/#
Uploading your source tree as an archive¶
Prior to mrjob 0.5.8, you had to archive directories yourself before uploading them.
tar -C your-src-code -f your-src-code.tar.gz -z -c .
Then, run your job with:
--setup 'export PYTHONPATH=$PYTHONPATH:your-src-code.tar.gz#/'
The /
after the #
(without one before it) is what tells mrjob that
your-src-code.tar.gz
is an archive that Hadoop should unpack.
To do the same thing in mrjob.conf
:
runners:
hadoop:
setup:
- export PYTHONPATH=$PYTHONPATH:your-src-code.tar.gz#/
Running a makefile inside your source dir¶
--setup 'cd your-src-dir.tar.gz#/' --setup 'make'
or, in mrjob.conf:
runners:
hadoop:
setup:
- cd your-src-dir.tar.gz#/
- make
If Hadoop runs multiple tasks on the same node, your source dir will be shared between them. This is not a problem; mrjob automatically adds locking around setup commands to ensure that multiple copies of your setup script don’t run simultaneously.
Making data files available to your job¶
Best practice for one or a few files is to use passthrough options; see
add_passthru_arg()
.
You can also use upload_files to upload file(s) into a task’s working directory (or upload_archives for tarballs and other archives).
If you’re a setup purist, you can also do something like this:
--setup 'true your-file#desired-name'
since true has no effect and ignores its arguments.
Using a virtualenv¶
What if you can’t install the libraries you need on your Hadoop cluster?
You could do something like this in your mrjob.conf
:
runners:
hadoop:
setup:
- virtualenv venv
- . venv/bin/activate
- pip install mr3po
However, now the locking feature that protects make becomes a liability; each task on the same node has its own virtualenv, but one task has to finish setting up before the next can start.
The solution is to share the virtualenv between all tasks on the same machine, something like this:
runners:
hadoop:
setup:
- VENV=/tmp/$mapreduce_job_id
- if [ ! -e $VENV ]; then virtualenv $VENV; fi
- . $VENV/bin/activate
- pip install mr3po
With Hadoop 1, you’d want to use $mapred_job_id
instead of
$mapreduce_job_id
.
Other ways to use pip to install Python packages¶
If you have a lot of dependencies, best practice is to make a
pip requirements file
and use the -r
switch:
--setup 'pip install -r path/to/requirements.txt#'
Note that pip can also install from tarballs (which is useful for custom-built packages):
--setup 'pip install $MY_PYTHON_PKGS/*.tar.gz#'