Config file format and location

We look for mrjob.conf in these locations, in order, and use the first one that exists:

  • The location specified by MRJOB_CONF
  • ~/.mrjob.conf
  • /etc/mrjob.conf
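
The search order above can be sketched in a few lines of Python. This is an illustration rather than mrjob's own code: the name find_conf and its exists parameter are hypothetical, and the sketch assumes the first candidate that exists is the one used.

```python
import os


def find_conf(environ=None, exists=os.path.exists):
    """Return the first candidate config path that exists, or None.

    Hypothetical helper that mirrors the search order listed above.
    """
    environ = os.environ if environ is None else environ
    candidates = []
    if 'MRJOB_CONF' in environ:
        candidates.append(os.path.expanduser(environ['MRJOB_CONF']))
    candidates.append(os.path.expanduser('~/.mrjob.conf'))
    candidates.append('/etc/mrjob.conf')
    for path in candidates:
        if exists(path):
            return path
    return None
```

Passing exists as a parameter is only there so the function can be tried out without touching the real filesystem.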

You can specify one or more configuration files with the --conf-path flag. See Options available to all runners for more information.

The point of mrjob.conf is to let you set up things you want every job to have access to so that you don’t have to think about it. For example:

  • libraries and source code you want to be available for your jobs
  • where temp directories and logs should go
  • security credentials

mrjob.conf is just a YAML- or JSON-encoded dictionary containing default values to pass in to the constructors of the various runner classes. Here’s a minimal mrjob.conf:

runners:
  emr:
    cmdenv:
      TZ: America/Los_Angeles

Now whenever you run mr_your_script.py -r emr, EMRJobRunner will automatically set TZ to America/Los_Angeles in your job’s environment when it runs on EMR.

If you don’t have the yaml module installed, you can use JSON in your mrjob.conf instead (JSON is a subset of YAML, so it’ll still work once you install yaml). Here’s how you’d render the above example in JSON:

{
  "runners": {
    "emr": {
      "cmdenv": {
        "TZ": "America/Los_Angeles"
      }
    }
  }
}
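
Either encoding describes the same nested dictionary. As a quick check, the JSON form can be parsed with Python's standard json module (the variable names here are only illustrative):

```python
import json

CONF_JSON = """
{
  "runners": {
    "emr": {
      "cmdenv": {
        "TZ": "America/Los_Angeles"
      }
    }
  }
}
"""

# The top-level key is "runners"; each runner alias maps to its options.
conf = json.loads(CONF_JSON)
emr_opts = conf['runners']['emr']
print(emr_opts['cmdenv']['TZ'])  # prints America/Los_Angeles
```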

Precedence and combining options

Options specified on the command line take precedence over mrjob.conf. Usually this means simply overriding the option in mrjob.conf. However, because cmdenv contains environment variables, mrjob merges the two sets of variables rather than replacing one with the other. For example, if your mrjob.conf contained:

runners:
  emr:
    cmdenv:
      PATH: /usr/local/bin
      TZ: America/Los_Angeles

and you ran your job as:

mr_your_script.py -r emr --cmdenv TZ=Europe/Paris --cmdenv PATH=/usr/sbin

mrjob would combine the two PATH values, putting the command-line value first, and your job’s environment would be:

{'TZ': 'Europe/Paris', 'PATH': '/usr/sbin:/usr/local/bin'}

What’s going on here is that cmdenv is associated with combine_envs(). Each option is associated with a combiner function that combines its values in a way appropriate to that option.
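
To make the merge concrete, here is a simplified stand-in for that behavior. It is a sketch written for this page, not mrjob's own combine_envs(), and the name combine_envs_sketch is hypothetical:

```python
def combine_envs_sketch(*envs):
    """Merge environment dicts; later dicts have higher priority.

    Simplified stand-in for the behavior described above, not
    mrjob's own combine_envs().
    """
    result = {}
    for env in envs:
        for key, value in (env or {}).items():
            if key.endswith('PATH') and key in result:
                # The later value goes first, joined with a colon.
                result[key] = value + ':' + result[key]
            else:
                result[key] = value
    return result


conf_env = {'PATH': '/usr/local/bin', 'TZ': 'America/Los_Angeles'}
cli_env = {'TZ': 'Europe/Paris', 'PATH': '/usr/sbin'}
merged = combine_envs_sketch(conf_env, cli_env)
print(merged)
```

Here TZ is simply overridden, while the two PATH values are joined with the command-line value in front.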

Combiner functions can also do useful things like expanding environment variables and globs in paths. For example, you could set:

runners:
  local:
    upload_files: &upload_files
    - $DATA_DIR/*.db
  hadoop:
    upload_files: *upload_files
  emr:
    upload_files: *upload_files

and every time you ran a job, every .db file in $DATA_DIR would automatically be uploaded into your job’s current working directory.

Also, if you specified additional files to upload with --file, those files would be uploaded in addition to the .db files, rather than instead of them.
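
The expansion step can be approximated with the standard library. In this sketch, expand_path_list is a hypothetical helper, and keeping a pattern unchanged when it matches nothing is an assumption made for the example:

```python
import glob
import os
import tempfile


def expand_path_list(paths):
    """Expand ~, environment variables, and globs in each path.

    Hypothetical helper; a pattern that matches nothing is kept as-is
    (an assumption made for this sketch).
    """
    expanded = []
    for path in paths:
        path = os.path.expanduser(os.path.expandvars(path))
        matches = sorted(glob.glob(path))
        expanded.extend(matches or [path])
    return expanded


# Usage: build a throwaway directory standing in for $DATA_DIR.
data_dir = tempfile.mkdtemp()
for name in ('a.db', 'b.db', 'notes.txt'):
    open(os.path.join(data_dir, name), 'w').close()
os.environ['DATA_DIR'] = data_dir

uploads = expand_path_list(['$DATA_DIR/*.db'])
print([os.path.basename(p) for p in uploads])  # prints ['a.db', 'b.db']
```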

See Configuration quick reference for the entire dizzying array of configurable options.

Option data types

The same option may be specified multiple times and be one of several data types. For example, the AWS region may be specified in mrjob.conf, in the arguments to EMRJobRunner, and on the command line. These are the rules used to determine what value to use at runtime.

Values specified “later” refer to an option being specified at a higher priority. For example, a value in mrjob.conf is specified “earlier” than a value passed on the command line.

When there are multiple values, they are “combined with” a combiner function. The combiner function for each data type is listed in its description.

Simple data types

When these are specified more than once, the last non-None value is used.

String
Simple, unchanged string. Combined with combine_values().
Command
All-ASCII string to be parsed with shlex.split(), or a list containing the command followed by its arguments. Combined with combine_cmds().
Path
Local path with ~ and environment variables (e.g. $TMPDIR) resolved. Combined with combine_paths().
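
As a rough sketch of what these three types mean in practice, the helpers below use only the standard library. The names last_non_none, as_command, and as_path are hypothetical and are not mrjob's combiner functions:

```python
import os
import shlex


def last_non_none(*values):
    """Return the last value that is not None, or None if there is none."""
    for value in reversed(values):
        if value is not None:
            return value
    return None


def as_command(value):
    """Turn a command string into an argument list; pass lists through."""
    if value is None:
        return None
    if isinstance(value, str):
        return shlex.split(value)
    return list(value)


def as_path(value):
    """Resolve ~ and environment variables in a local path."""
    if value is None:
        return None
    return os.path.expanduser(os.path.expandvars(value))


print(last_non_none('us-west-1', None))     # prints us-west-1
print(as_command("python -m 'my module'"))  # prints ['python', '-m', 'my module']
```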

List data types

The values of these options are specified as lists. When specified more than once, the lists are concatenated together.

String list
List of strings. Combined with combine_lists().
Path list
List of paths. Combined with combine_path_lists().

Strings and non-sequence data types (e.g. numbers) are treated as single-item lists.

For example,

runners:
  emr:
    setup: /run/some/command with args

is equivalent to:

runners:
  emr:
    setup:
    - /run/some/command with args
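
The concatenation rule, including the single-item-list shorthand, might look like this in Python. combine_lists_sketch is a hypothetical stand-in, not the library function:

```python
def combine_lists_sketch(*values):
    """Concatenate list-type option values, earliest first.

    Strings and non-sequence values count as single-item lists;
    None is skipped.
    """
    result = []
    for value in values:
        if value is None:
            continue
        if isinstance(value, (list, tuple)):
            result.extend(value)
        else:
            result.append(value)
    return result


combined = combine_lists_sketch('/run/some/command with args',
                                ['/run/another/command'])
print(combined)
# prints ['/run/some/command with args', '/run/another/command']
```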

Dict data types

The values of these options are specified as dictionaries. When specified more than once, each has custom behavior described below.

Plain dict
Values specified later override values specified earlier. Combined with combine_dicts().

JobConf Dicts

New in version 0.6.6: Like plain dicts except that non-string values are converted into a format that Java understands. For example, the boolean value true here:

jobconf:
  mapreduce.output.fileoutputformat.compress: true

gets passed through to Hadoop in Java format (true), not Python format (True).

Keys whose values are None are not passed to Hadoop at all.
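
A minimal sketch of that conversion follows. The function name to_java_jobconf is hypothetical, and rendering non-string values with json.dumps is an assumption chosen because JSON booleans happen to match Java's spelling:

```python
import json


def to_java_jobconf(jobconf):
    """Convert jobconf values to strings Hadoop can read; drop None."""
    result = {}
    for key, value in jobconf.items():
        if value is None:
            continue  # keys set to None are not passed to Hadoop at all
        if isinstance(value, str):
            result[key] = value
        else:
            # json.dumps renders True as 'true', False as 'false', 3 as '3'
            result[key] = json.dumps(value)
    return result


converted = to_java_jobconf({
    'mapreduce.output.fileoutputformat.compress': True,
    'mapreduce.job.reduces': 3,
    'mapreduce.job.name': None,
})
print(converted)
```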

Warning

Prior to version 0.6.6, you should use "true" and "false" for boolean jobconf values in config files, not true and false.

Environment variable dict

Values specified later override values specified earlier, except for keys ending in PATH, whose values are concatenated and separated by a colon (:) rather than overwritten. The later value comes first.

For example, this config:

runners:
  emr:
    cmdenv:
      PATH: /usr/bin

when run with this command:

python my_job.py -r emr --cmdenv PATH=/usr/local/bin

will result in the following value of PATH in cmdenv:

/usr/local/bin:/usr/bin

The function that handles this is combine_envs().

The one exception to this behavior is in the local runner, which uses the local system separator (on Windows ;, on everything else still :) instead of always using :. In local mode, the function that combines config values is combine_local_envs().
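
The only difference between the two cases is the separator. The sketch below makes it a parameter; combine_envs_with_sep is a hypothetical name, and using os.pathsep for the local case is an assumption about how "local system separator" would be obtained in Python:

```python
import os


def combine_envs_with_sep(sep, *envs):
    """Merge env dicts, joining *PATH values with sep (later value first)."""
    result = {}
    for env in envs:
        for key, value in (env or {}).items():
            if key.endswith('PATH') and key in result:
                result[key] = value + sep + result[key]
            else:
                result[key] = value
    return result


conf_env = {'PATH': '/usr/bin'}
cli_env = {'PATH': '/usr/local/bin'}

# Non-local runners: always a colon.
remote = combine_envs_with_sep(':', conf_env, cli_env)
print(remote['PATH'])  # prints /usr/local/bin:/usr/bin

# Local runner: whatever separator this machine uses.
local = combine_envs_with_sep(os.pathsep, conf_env, cli_env)
```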

Using multiple config files

If you have several standard configurations, you may want to have several config files “inherit” from a base config file. For example, you may have one set of AWS credentials, but two code bases and default instance sizes. To accomplish this, use the include option:

~/.mrjob.very-large.conf:

include: ~/.mrjob.base.conf
runners:
  emr:
    num_core_instances: 20
    core_instance_type: m1.xlarge

~/.mrjob.very-small.conf:

include: $HOME/.mrjob.base.conf
runners:
  emr:
    num_core_instances: 2
    core_instance_type: m1.small

~/.mrjob.base.conf:

runners:
  emr:
    aws_access_key_id: HADOOPHADOOPBOBADOOP
    aws_secret_access_key: MEMIMOMADOOPBANANAFANAFOFADOOPHADOOP
    region: us-west-1

Options that are lists, commands, dictionaries, etc. combine the same way they do between the config files and the command line (with combiner functions).

You can use $ENVIRONMENT_VARIABLES and ~/file_in_your_home_dir inside include.

You can inherit from multiple config files by passing include a list instead of a string. Files later in the list take precedence over files earlier in the list. To continue the above examples, this config:

~/.mrjob.everything.conf:

include:
- ~/.mrjob.very-small.conf
- ~/.mrjob.very-large.conf

will be equivalent to this one:

~/.mrjob.everything-2.conf:

runners:
  emr:
    aws_access_key_id: HADOOPHADOOPBOBADOOP
    aws_secret_access_key: MEMIMOMADOOPBANANAFANAFOFADOOPHADOOP
    core_instance_type: m1.xlarge
    num_core_instances: 20
    region: us-west-1

In this case, ~/.mrjob.very-large.conf has taken precedence over ~/.mrjob.very-small.conf.
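
The load order can be modeled with plain dictionaries. This sketch is not mrjob's loader: the FILES mapping stands in for files on disk, the helper names are hypothetical, credentials are left out for brevity, only simple override-style options are merged, and skipping a file that has already been loaded is an assumption.

```python
FILES = {
    'base': {'runners': {'emr': {'region': 'us-west-1'}}},
    'very-small': {
        'include': 'base',
        'runners': {'emr': {'num_core_instances': 2,
                            'core_instance_type': 'm1.small'}},
    },
    'very-large': {
        'include': 'base',
        'runners': {'emr': {'num_core_instances': 20,
                            'core_instance_type': 'm1.xlarge'}},
    },
    'everything': {'include': ['very-small', 'very-large']},
}


def load_chain(name, files, seen=None):
    """Return configs in load order: includes first, then the file itself."""
    seen = set() if seen is None else seen
    if name in seen:
        return []  # skip files already loaded (also guards against loops)
    seen.add(name)
    conf = files[name]
    includes = conf.get('include', [])
    if isinstance(includes, str):
        includes = [includes]
    chain = []
    for included in includes:
        chain.extend(load_chain(included, files, seen))
    chain.append(conf)
    return chain


def emr_opts(name, files):
    """Merge the emr runner's simple options; later files win."""
    opts = {}
    for conf in load_chain(name, files):
        opts.update(conf.get('runners', {}).get('emr', {}))
    return opts


print(emr_opts('everything', FILES))
```

Because very-large comes after very-small in the include list, its instance settings are the ones that survive.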

Relative includes

Relative include: paths are relative to the real (after resolving symlinks) path of the including conf file.

For example, you could do this:

~/.mrjob/base.conf:

runners:
  ...

~/.mrjob/default.conf:

include: base.conf

You could then load your configs via a symlink ~/.mrjob.conf to ~/.mrjob/default.conf and ~/.mrjob/base.conf would still be included (even though it’s not in the same directory as the symlink).
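
One way to picture the resolution rule is the sketch below. resolve_include is a hypothetical helper, and the temporary directory merely stands in for a home directory:

```python
import os
import tempfile


def resolve_include(conf_path, include):
    """Resolve an include path against the real location of conf_path."""
    include = os.path.expanduser(os.path.expandvars(include))
    real_dir = os.path.dirname(os.path.realpath(conf_path))
    # os.path.join discards real_dir when include is already absolute
    return os.path.join(real_dir, include)


# Usage: a symlink ~/.mrjob.conf pointing at ~/.mrjob/default.conf.
home = os.path.realpath(tempfile.mkdtemp())
os.mkdir(os.path.join(home, '.mrjob'))
default_conf = os.path.join(home, '.mrjob', 'default.conf')
open(default_conf, 'w').close()
link = os.path.join(home, '.mrjob.conf')
os.symlink(default_conf, link)  # may need extra privileges on Windows

resolved = resolve_include(link, 'base.conf')
print(os.path.relpath(resolved, home))  # prints .mrjob/base.conf on POSIX
```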

Clearing configs

Sometimes, you just want to override a list-type config (e.g. setup) or a *PATH environment variable, rather than having mrjob cleverly concatenate it with previous configs.

You can do this in YAML config files by tagging the values you want to take precedence with the !clear tag.

For example:

~/.mrjob.base.conf:

runners:
  emr:
    aws_access_key_id: HADOOPHADOOPBOBADOOP
    aws_secret_access_key: MEMIMOMADOOPBANANAFANAFOFADOOPHADOOP
    cmdenv:
      PATH: /this/nice/path
      PYTHONPATH: /here/be/serpents
      USER: dave
    setup:
    - /run/this/command

~/.mrjob.conf:

include: ~/.mrjob.base.conf
runners:
  emr:
    cmdenv:
      PATH: !clear /this/even/better/path/yay
      PYTHONPATH: !clear
    setup: !clear
    - /run/this/other/command

is equivalent to:

runners:
  emr:
    aws_access_key_id: HADOOPHADOOPBOBADOOP
    aws_secret_access_key: MEMIMOMADOOPBANANAFANAFOFADOOPHADOOP
    cmdenv:
      PATH: /this/even/better/path/yay
      USER: dave
    setup:
    - /run/this/other/command

If you specify multiple config files (e.g. -c ~/.mrjob.base.conf -c ~/.mrjob.conf), a !clear in a later file will override earlier files. include: is really just another way to prepend to the list of config files to load.
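
The effect of !clear on combining can be modeled with a small wrapper class. Everything in this sketch is hypothetical (the Cleared class and both functions); it only illustrates the semantics described above and is not how mrjob parses the tag:

```python
class Cleared:
    """Marks a value that was tagged with !clear (hypothetical wrapper)."""

    def __init__(self, value=None):
        self.value = value


def combine_lists_with_clear(*values):
    """Concatenate lists, discarding everything before a Cleared value."""
    result = []
    for value in values:
        if isinstance(value, Cleared):
            result = []  # forget earlier configs
            value = value.value
        if value is None:
            continue
        result.extend(value if isinstance(value, (list, tuple)) else [value])
    return result


def combine_envs_with_clear(*envs):
    """Merge env dicts; a Cleared value replaces instead of concatenating."""
    result = {}
    for env in envs:
        for key, value in env.items():
            if isinstance(value, Cleared):
                result.pop(key, None)
                if value.value is not None:
                    result[key] = value.value
            elif key.endswith('PATH') and key in result:
                result[key] = value + ':' + result[key]
            else:
                result[key] = value
    return result


base_env = {'PATH': '/this/nice/path',
            'PYTHONPATH': '/here/be/serpents',
            'USER': 'dave'}
override_env = {'PATH': Cleared('/this/even/better/path/yay'),
                'PYTHONPATH': Cleared()}

cmdenv = combine_envs_with_clear(base_env, override_env)
setup = combine_lists_with_clear(['/run/this/command'],
                                 Cleared(['/run/this/other/command']))
print(cmdenv)
print(setup)  # prints ['/run/this/other/command']
```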

If you find it more readable, you may put the !clear tag before the key you want to clear. For example,

runners:
  emr:
    !clear setup:
    - /run/this/other/command

is equivalent to:

runners:
  emr:
    setup: !clear
    - /run/this/other/command

!clear tags in lists are ignored. You cannot currently clear an entire set of configs (e.g. runners: emr: !clear ... does not work).