What’s New

For a complete list of changes, see CHANGES.txt

0.7.4

Docker on EMR

This release adds support for Docker on EMR, which was released with AMI version 6.0.0. This is enabled by setting docker_image to point at your image.

There is also a docker_mounts option, and, if you want to host your image on a private ECR repo instead of Docker Hub, a docker_client_config option (though with AMIs 6.1.0 and later, you can also auto-authenticate to ECR; see this page).
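
For example, a minimal mrjob.conf sketch (the image name and mount shown here are placeholders, not real values):

runners:
  emr:
    docker_image: your-docker-id/your-image:latest  # placeholder image name
    docker_mounts:
      - /mnt/data:/mnt/data:ro  # placeholder mount in host:container:mode form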

As a result of adding Docker support, the default image_version on EMR is 6.0.0. Also, on EMR and Dataproc we used to literally bootstrap mrjob by copying it to Python’s root package directory, but as this won’t put mrjob into a Docker image, mrjob is now bootstrapped via py_files, like on every other runner.

Concurrent Steps on EMR clusters

This release also supports concurrent steps on EMR clusters, a feature introduced in AMI 5.28.0. The max_concurrent_steps option controls both the concurrency level of a newly launched cluster, and how much concurrency we will accept when joining a pooled cluster.

To prevent steps from the same job attempting to run simultaneously, mrjob will now submit the steps of a multi-step job one at a time (each after the previous step completes) on clusters running AMI 5.28.0 or later. This can be changed with the add_steps_in_batch option.
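
For example, a config sketch using both options (the values are only illustrative):

runners:
  emr:
    max_concurrent_steps: 2   # illustrative concurrency level
    add_steps_in_batch: true  # submit all steps up front instead of one at a time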

get_job_steps() is now deprecated, as it can’t fetch steps before they’re submitted.

Cluster Pooling

Cluster pooling can now join clusters based on the available CPU and memory reported by the YARN resource manager, rather than by looking at the number and type of instances in the cluster. You can enable this by setting min_available_mb and/or min_available_virtual_cores. For this feature to work, you must enable SSH (the ec2_key_pair and ec2_key_pair_file options).

You can now control the size of your cluster pool with the max_clusters_in_pool option. If a job wants to launch a new cluster in the pool but the pool is already “full,” it will wait and try again until the pool is no longer full or it can join a cluster.

Once a job determines that it is okay to add another cluster to the pool, it will wait a random number of seconds and then double-check before launching one. This way, if several pooled jobs launch simultaneously, they are likely to stay within the maximum number of clusters rather than all launching their own. The random wait time can be controlled with pool_jitter_seconds.

By default, a job will wait forever to either join an existing cluster or create a new one. You can make jobs give up and raise an exception with the pool_timeout_minutes option.

mrjob will now bypass the pool_wait_minutes option if there is not a matching, active cluster to join. Basically, it won’t wait if there is not a cluster to wait for. As with max_clusters_in_pool, if a job determines there are no clusters to wait for, it will wait a random number of seconds and double-check before launching a new cluster.
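
Putting these pooling options together, here is a rough mrjob.conf sketch (all of the numbers are illustrative, not recommendations):

runners:
  emr:
    pool_clusters: true
    min_available_mb: 12288          # illustrative: require 12 GB free in YARN
    min_available_virtual_cores: 4   # illustrative
    max_clusters_in_pool: 5          # illustrative
    pool_jitter_seconds: 60          # illustrative random wait before launching
    pool_timeout_minutes: 30         # illustrative: give up after half an hour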

Library requirements

To support concurrent steps, boto3 must be at least version 1.10.0 and botocore must be at least version 1.13.16. The google-cloud-dataproc library must be no greater than 1.1.0, to maintain compatibility with our code.

0.7.3

Made many long-overdue changes to Cluster Pooling, to reduce the potential for throttling by the EMR API. Pooling now puts most information a job needs to tell if it can join a cluster into the cluster name, meaning most non-matching clusters can be filtered out when we call ListClusters. Pooling also no longer needs to list cluster steps. Finally, if pool_wait_minutes is set, and there are multiple clusters we can join, we try them all, rather than just trying the “best” one and then requesting more information from the API.

This update resulted in a few minor changes to pooling. When a job has the choice of multiple clusters, it chooses solely based on CPU capacity, using NormalizedInstanceHours in the cluster summary returned by the ListClusters API call. mrjob version and applications must now match exactly in all cases.

We also re-worked the “locking” mechanism that keeps multiple jobs from joining the same cluster. Formerly, this used S3 (which may only be eventually consistent), and locks had no fixed expiration time. Now, EMR tags are used for locking, locks always expire after one minute, and every job uses the same timing when locking clusters, reducing the potential for race conditions.

mrjob terminate-idle-clusters no longer attempts to lock clusters before terminating them, so its --max-mins-locked option is deprecated and does nothing.

The Spark harness now emulates counters correctly in local mode.

If you use mapper_raw(), and your setup script has an error, it will be correctly reported, even if your underlying shell is dash and not bash.

0.7.2

Spark normally only supports archives if you’re running on YARN. However, mrjob now seamlessly emulates archives on all Spark masters (other than local). This means you can now use --archives or --dirs with mrjob spark-submit, as well as using archives in your --setup script.

As a result of this change, mrjob is somewhat better at recognizing file extensions; it ignores a trailing . at the end of filenames, and can now recognize that a file with a name like mrjob-0.7.0.tar.gz is a .tar.gz file, not a .7.0.tar.gz file.

Also, if you don’t specify a name for an archive (e.g. --setup 'cd foo.tar.gz#/') mrjob no longer includes the file extension in the resulting directory name (foo/, not foo.tar.gz/).

Patched a long-standing security issue on EMR where we were copying the SSH key to the master node when reading logs from other nodes, which are only accessible via the master node. mrjob now correctly uses ssh-add and the SSH agent instead of copying the key. As a result, mrjob now has an ssh_add_bin option.

The extra_cluster_params option now recursively merges dict params into existing ones. For example, you can now do this:

runners:
  emr:
    extra_cluster_params:
      Instances:
        EmrManagedMasterSecurityGroup: sg-foo

without obliterating the rest of the Instances API parameter.

Python 2 has reached end-of-life, so if you’re using Python 2, the default python_bin is now python2.7 rather than python, which on some systems (for example, 6.x EMR AMIs) now means Python 3.

Finally, we ensure that if you’re installing mrjob on Python 3.4, we’ll install a Python 3.4-compatible version of PyYAML.

0.7.1

EMR

Fixed a bug so that the default value of VisibleToAllUsers is correctly set to True.

If you want it to be False, you can set it as a sub-parameter with extra_cluster_params. For example, you can now do:

--extra-cluster-param VisibleToAllUsers=false

mrjob now logs which runner was invoked, along with its keyword arguments. The contents of archives are now taken into account during bootstrapping, to ensure that clusters have the same setup.

0.7.0

AWS and Google are now optional dependencies

Amazon Web Services (EMR/S3) and Google Cloud are now optional dependencies, aws and google respectively. For example, to install mrjob with AWS support, run:

pip install mrjob[aws]

non-Python mrjobs are no longer supported

Fully removed support for writing MRJob scripts in other languages and then running them with the mrjob library. (This capability was so little used that chances are you never knew it existed.)

As a result, the interpreter and steps_interpreter options are gone, the mrjob run subcommand is gone, and the MRJobLauncher class has been merged back into MRJob. Also removed mr_wc.rb from mrjob/examples/.

MRSomeJob() means read from sys.argv

In prior versions, if you initialized a MRJob subclass with no arguments (MRSomeJob()), that meant the same thing as passing in an empty argument list (MRSomeJob(args=[])). It now means to read args directly from sys.argv[1:].

In practice, it’s rare to see an MRJob subclass initialized this way outside of test cases. Running an MRJob script directly, or initializing it with an argument list, works the same as in previous versions.

mrjob/examples/ love

The mrjob.examples package has been updated. Some examples that were difficult to test or maintain were removed, and the remainder were tested and fixed if necessary.

mrjob.examples.mr_text_classifier no longer needs you to encode documents in JSON format, and instead operates directly on text files with names like doc_id-cat_id_1-not_cat_id_2-etc.txt. Try it out:

python -m mrjob.examples.mr_text_classifier docs-to-classify/*.txt

miscellaneous tweaks

The mrjob audit-emr-usage subcommand no longer attempts to read cluster pool names from clusters launched by mrjob v0.5.x.

Method arguments in filesystem classes (in mrjob.fs) are now consistently named. This probably won’t matter in practice, as runner.fs is always a CompositeFilesystem anyhow.

removed deprecated code

Check your deprecation warnings! Everything marked deprecated in mrjob v0.6.x has been removed.

The following runner config options no longer exist: emr_api_params, interpreter, max_hours_idle, mins_to_end_of_hour, steps_interpreter, steps_python_bin, visible_to_all_users.

The following singular switches have been removed in favor of their plural alternative (e.g. --archives): --archive, --dir, --file, --hadoop-arg, --libjar, --py-file, --spark-arg.

The --steps switch is gone. This means --help --steps no longer works; use --help -v to see help for --mapper, etc.

Support for simulating optparse has been removed from MRJob. This includes add_file_option(), add_passthrough_option(), configure_options(), load_options(), pass_through_option(), self.args, self.OPTION_CLASS.

mrjob.runner.MRJobRunner.stream_output() and mrjob.job.MRJob.parse_output_line() have been removed.

The constructor for MRJobRunner no longer has a file_upload_args keyword argument.

parse_and_save_options(), read_file(), and read_input() have all been removed from mrjob.util.

CompositeFilesystem no longer takes filesystems as arguments to its constructor; use add_fs(). The useless local_tmp_dir option to the GCSFilesystem constructor and the chunk_size arg to its put() method have been removed.

0.6.12

Updated the Dataproc runner’s default image_version to 1.3, as the old default, 1.0, no longer works.

The local and inline runners can now handle file:// URIs as input paths and as files/archives uploaded to the working directory. The local filesystem (available as runner.fs from all runners) can now handle file:// URIs as well.

0.6.11

Adds support for parsing Spark logs and driver output to determine why a job failed. This works with the local, Hadoop, EMR, and Spark runners.

The Spark runner no longer needs pyspark in the $PYTHONPATH to launch scripts with spark-submit (it still needs pyspark to use the Spark harness).

On Python 3.7, you can now intermix positional arguments to MRJob with switches, similar to how you could back when mrjob used optparse. For example: mr_your_script.py file1 -v file2.

On EMR, the default image_version (AMI) is now 5.27.0.

Restored m4.large as the default instance type for pre-5.13.0 AMIs, as they do not support m5.xlarge. (m5.xlarge is still the default for AMI 5.13.0 and later.)

mrjob can now retry on transient AWS API errors (e.g. throttling) or network errors when making API calls that use pagination (e.g. listing clusters).

The emr_configurations opt now supports the !clear tag rather than crashing. You may also override individual configs by setting a config with the same Classification.
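
For example, a sketch that overrides an inherited config by reusing its Classification (the classification and property shown are just placeholders):

runners:
  emr:
    emr_configurations:
      - Classification: spark-defaults
        Properties:
          spark.driver.memory: 2g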

This version restores official support for Python 3.4, as it’s the version of Python 3 installed on EMR AMIs prior to 5.20.0. In order to make this work, mrjob drops support for Google Cloud services in Python 3.4, as the recent Google libraries appear to need a later Python version.

0.6.10

Adds official support for PyPy (that is, any version of PyPy compatible with Python 2.7/3.5+). If you launch a job in PyPy, python_bin will automatically default to pypy or pypy3 as appropriate.

Note that mrjob does not auto-install PyPy for you on EMR (Amazon Linux does not provide a PyPy package). Installing PyPy yourself at bootstrap time is fairly straightforward; see Installing PyPy.

The Spark harness can now be used on EMR, allowing you to run “classic” MRJobs in Spark, which is often faster. Essentially, you launch jobs in the Spark runner with --spark-submit-bin 'mrjob spark-submit -r emr'; see Running classic MRJobs on Spark on EMR for details.

The Spark runner can now optionally disable internal protocols when running “classic” MRJobs, eliminating the (usually) unnecessary effort of encoding data structures into JSON or other string representations and then decoding them. See skip_internal_protocol for details.
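
A minimal config sketch for turning this on:

runners:
  spark:
    skip_internal_protocol: true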

The EMR runner’s default instance type is now m5.xlarge, which works with newer AMIs and should make it easier to run Spark jobs. The EMR runner also now logs the DNS of the master node as soon as it is available, to make it easier to SSH in.

Finally, mrjob gives a much clearer error message if you attempt to read a YAML mrjob.conf file without PyYAML installed.

0.6.9

Drops support for Python 3.4.

Fixes a bug introduced in 0.6.8 that could break archives or directories uploaded into Hadoop or Spark if the name of the unpacked archive didn’t have an archive extension (e.g. .tar.gz).

The Spark runner can now optionally emulate Hadoop’s mapreduce.map.input.file configuration property when running the mapper of the first step of a streaming job if you enable emulate_map_input_file. This means that jobs that depend on jobconf_from_env('mapreduce.map.input.file') will still work.
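
A minimal config sketch for enabling the emulation:

runners:
  spark:
    emulate_map_input_file: true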

The Spark runner also now uses the correct argument names when emulating increment_counter(), and logs a warning if spark_tmp_dir doesn’t match spark_master.

mrjob spark-submit can now pass switches to the Spark script/JAR without explicitly separating them out with --.

The local and inline runners now more correctly emulate the mapreduce.map.input.file config property by making it a file:// URL.

Deprecated methods add_file_option() and add_passthrough_option() can now take a type (e.g. int) as their type argument, to better emulate optparse.

0.6.8

Nearly full support for Spark

This release adds nearly full support for Spark, including mrjob-specific features like setup scripts and passthrough options. See Why use mrjob with Spark? for everything mrjob can do with Spark.

This release adds a SparkMRJobRunner (-r spark), which works with any Spark installation, does not require Hadoop, and can access any filesystem supported by both mrjob and Spark (HDFS, S3, GCS). The Spark runner is now the default for mrjob spark-submit.

What’s not supported? mrjob does not yet support Spark on Google Cloud Dataproc. The Spark runner does not yet parse logs to determine probable cause of failure when your job fails (though it does give you the Spark driver output).

Spark Hadoop Streaming emulation

Not only does the Spark runner not need Hadoop to run Spark jobs, it doesn’t need Hadoop to run most Hadoop Streaming jobs, as it knows how to run them directly on Spark. This means if you want to migrate to a non-Hadoop Spark cluster, you can take all your old MRJobs with you. See Running “classic” MRJobs on Spark for details.

The “experimental harness script” mentioned in 0.6.7 is now fully integrated into the Spark runner and is no longer supported as a separate feature.

Local runner support for Spark

The local and inline runners can now run Spark scripts locally for testing, analogous to the way they’ve supported Hadoop streaming scripts (except that they do require a local Spark installation). See Other ways to run on Spark.

Other Spark improvements

MRJobs are now Spark-serializable without calling sandbox() (there used to be a problematic reference to sys.stdin). This means you can always pass job methods to rdd.flatMap() etc.

setup scripts are no longer a YARN-specific feature, working on all Spark masters (except local[*], which doesn’t give executors a separate working directory).

Likewise, you can now specify a different name for files in the job’s working directory (e.g. --file foo#bar) on all Spark masters.

Note

Uploading archives and directories still only works on YARN for now; Spark considers --archives a YARN-specific feature.

When running on a local Spark cluster, mrjob uses file://... URLs rather than just the path of the file when necessary (e.g. with --py-files).

cat_output() now ignores files and subdirectories starting with "." (used to only be "_"). This allows mrjob to ignore Spark’s checksum files (e.g. .part-00000.crc), and also brings mrjob in closer compliance to the way Hadoop input formats read directories.

spark.yarn.appMasterEnv.* config properties are only set if you’re actually running on YARN.

The values of spark_master and spark_deploy_mode can no longer be overridden with configuration properties (-D spark.master=...). While not exactly a “feature,” this means that mrjob always knows what Spark platform it’s running on.

Filesystems

Every runner has an fs attribute that gives access to all the filesystems that runner supports.

Added a put() method to all filesystems, which allows uploading a single file (it used to be that each runner had custom logic for uploads).

It also used to be that if you wanted to create a bucket on S3 or GCS, you had to call create_bucket(...) explicitly. Now mkdir() will automatically create buckets as needed.

If you still need to access methods specific to a filesystem, you should do so through fs.<name>, where <name> is the (lowercase) name of the storage service. For example the Spark runner’s filesystem offers both runner.fs.s3.create_bucket() and runner.fs.gcs.create_bucket(). The old style of implicitly passing through FS-specific methods (runner.fs.create_bucket(...)) is deprecated and going away in v0.7.0.

GCSFilesystem‘s constructor had a useless local_tmp_dir argument, which is now deprecated and going away in v0.7.0.

EMR

Fixed a bad bug introduced in 0.6.7 that could prevent mrjob from running on EMR with a non-default temp bucket.

You can now set sub-parameters with extra_cluster_params. For example, you can now do:

--extra-cluster-param Instances.EmrManagedMasterSecurityGroup=...

without clobbering the zone or instance group/fleet configs specified in Instances.

Running your job with --subnet '' now un-sets a subnet specified in your config file (used to be ignored).

If you are using cluster pooling with retries (pool_wait_minutes), mrjob now retains information about clusters that is immutable (e.g. AMI version), saving API calls.

Dependency upgrades

Bumped the required versions of several Google Cloud Python libraries to be more compatible with current versions of their sub-dependencies (Google libraries pin a fairly narrow range of dependencies). mrjob now requires:

  • google-cloud-dataproc at least 0.3.0,
  • google-cloud-logging at least 1.9.0, and
  • google-cloud-storage at least 1.13.1.

Also dropped support for PyYAML 3.08; now we require at least PyYAML 3.10 (which came out in 2011).

Note

We are aware that the Google libraries’ extensive dependencies can be a nuisance for mrjob users who don’t use Google Cloud. Our tentative plan is to make dependencies specific to a third-party service (including google-cloud-* and boto3) optional starting in v0.7.0.

Other bugfixes

Fixed a long-standing bug that would cause the Hadoop runner to hang or raise cryptic errors if hadoop_bin or spark_submit_bin is not executable.

Support files for mrjob.examples (e.g. stop_words.txt for MRMostUsedWord) are now installed along with mrjob.

Setting a *_bin option to an empty value (e.g. --hadoop-bin) now always instructs mrjob to use the default, rather than disabling core features or creating cryptic errors. This affects gcloud_bin, hadoop_bin, sh_bin, and ssh_bin; the various *python_bin options already worked this way.

0.6.7

setup commands now work on Spark (at least on YARN).

Added the mrjob spark-submit subcommand, which works as a drop-in replacement for spark-submit but with mrjob runners (e.g. EMR) and mrjob features (e.g. setup, cmdenv).

Fixed a bug that was causing idle timeout scripts to silently fail on 2.x EMR AMIs.

Fixed a bug that broke create_bucket() on us-east-1, preventing new mrjob installations from launching on EMR in that region.

Fixed an ImportError from attempting to import os.SIGKILL on Windows.

The default instance type on EMR is now m4.large.

EMR’s cluster pooling now knows the CPU and memory capacity of c5 and m5 instances, allowing it to join “better” clusters.

Added the plural form of several switches (separate multiple values with commas):

  • --applications
  • --archives
  • --dirs
  • --files
  • --libjars
  • --py-files

Except for --application, the singular versions of these switches (--archive, --dir, --file, --libjar, --py-file) are deprecated, for consistency with Hadoop and Spark.

sh_bin is now fully qualified by default (/bin/sh -ex, not sh -ex). sh_bin may no longer be empty, and a warning is issued if it has more than one argument, to properly support shell script shebangs (e.g. #!/bin/sh -ex) on Linux.

Runners no longer call MRJobs with --steps; instead the job passes its step description to the runner on instantiation. --steps and steps_python_bin are now deprecated.

The Hadoop and EMR runners can now set $PYSPARK_PYTHON and $PYSPARK_DRIVER_PYTHON to different values if need be (e.g. to match task_python_bin, or to support setup scripts in client mode).

The inline runner no longer attempts to run command substeps.

The inline and local runner no longer silently pretend to run non-streaming steps.

The Hadoop runner no longer has the bootstrap_spark option, which did nothing.

interpreter and steps_interpreter are deprecated, in anticipation of removing support for writing MRJobs in other programming languages.

Runners now issue a warning if they receive options that belong to other runners (e.g. passing image_version to the Hadoop runner).

mrjob create-cluster now supports --emr-action-on-failure.

Updated deprecated escape sequences in mrjob code that would break on Python 3.8.

--help message for mrjob subcommands now correctly includes the subcommand in usage.

mrjob no longer raises AssertionError, instead raising ValueError.

Added an experimental harness script (in mrjob/spark) to run basic MRJobs on Spark, potentially without Hadoop:

spark-submit mrjob_spark_harness.py module.of.YourMRJob input_path output_dir

Added map_pairs(), reduce_pairs(), and combine_pairs() methods to MRJob, to enable the Spark harness script.

0.6.6

Fixes a longstanding bug where boolean jobconf values were passed to Hadoop in Python format (True instead of true). You can now safely do something like this:

runners:
  emr:
    jobconf:
      mapreduce.output.fileoutputformat.compress: true

whereas in prior versions of mrjob, you had to use "true" in quotes.

Added -D as a synonym for --jobconf, to match Hadoop.

On EMR, if you have SSH set up (see Configuring SSH credentials) mrjob can fetch your history log directly from HDFS, allowing it to more quickly diagnose why your job failed.

Added a --local-tmp-dir switch. If you set local_tmp_dir to empty string, mrjob will use the system default.

You can now pass multiple arguments to Hadoop with --hadoop-args (for example, --hadoop-args='-fs hdfs://namenode:port'), rather than having to use --hadoop-arg one argument at a time. --hadoop-arg is now deprecated.

Similarly, you can use --spark-args to pass arguments to spark-submit in place of the now-deprecated --spark-arg.

mrjob no longer automatically passes generic arguments (-D and -libjars) to JarSteps, because this confuses some JARs. If you want mrjob to pass generic arguments to a JAR, add GENERIC_ARGS to your JarStep‘s args keyword argument, like you would with INPUT and OUTPUT.

The Hadoop runner now has a spark_deploy_mode option.

Fixed the usage: usage: typo in --help messages.

mrjob.job.MRJob.add_file_arg() can now take an explicit type=str (used to cause an error).

The deprecated optparse emulation methods add_file_option() and add_passthrough_option() now support type='str' (used to only accept type='string').

Fixed a permissions error that was breaking inline and local mode on some versions of Windows.

0.6.5

This release fixes an issue with self-termination of idle clusters on EMR (see max_mins_idle) where the master node sometimes simply ignored sudo shutdown -h now. The idle self termination script now logs to bootstrap-actions/mrjob-idle-termination.log.

Note

If you are using Cluster Pooling, it’s highly recommended you upgrade to this version to fix the self-termination issue.

You can now turn off log parsing (on all runners) by setting read_logs to false. This can speed up cases where you don’t care why a job failed (e.g. integration tests) or where you’d rather use the diagnose tool after the fact.
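
For example, a sketch of a config you might use for integration tests:

runners:
  emr:
    read_logs: false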

You may specify custom AMIs with the image_id option. To find Amazon Linux AMIs compatible with EMR that you can use as a base for your custom image, use describe_base_emr_images().

The default AMI on EMR is now 5.16.0.

New EMR clusters launched by mrjob will be automatically tagged with __mrjob_label (filename of your mrjob script) and __mrjob_owner (your username), to make it easier to understand your mrjob usage in CloudWatch etc. You can change the value of these tags with the label and owner options.

You may now set the root EBS volume size for EMR clusters directly with ebs_root_volume_gb (you used to have to use instance_groups or instance_fleets).
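
A rough sketch combining a custom AMI with a larger root volume (both values are placeholders):

runners:
  emr:
    image_id: ami-0123456789abcdef0  # placeholder custom AMI ID
    ebs_root_volume_gb: 100          # placeholder size in GB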

API clients returned by EMRJobRunner now retry on SSL timeouts. EMR clients returned by mrjob.emr.EMRJobRunner.make_emr_client() won’t retry faster than check_cluster_every, to prevent throttling.

Cluster pooling recovery (relaunching a job when your pooled cluster self-terminates) now works correctly on single-node clusters.

0.6.4

This release makes it easy to attach static files to your MRJob with the FILES, DIRS, and ARCHIVES attributes.

In most cases, you no longer need setup scripts to access other python modules or packages from your job because you can use DIRS instead. For more details, see Using other python modules and packages.

For completeness, also added files(), dirs(), and archives() methods.

terminate-idle-clusters now skips termination-protected idle clusters, rather than crashing (this was fixed in 0.5.12, but not in previous 0.6.x versions).

Python 3.3 is no longer supported.

mrjob now requires google-cloud-dataproc 0.2.0+ (this library used to be vendored).

0.6.3

Read arbitrary file formats

You can now pass entire files in any format to your mapper by defining mapper_raw(). See Passing entire files to the mapper for an example.

Google Cloud Dataproc parity

mrjob now brings Google Cloud Dataproc close to feature parity with Amazon Elastic MapReduce; support for Spark and libjars will be added in a future release. (There is no plan to introduce Cluster Pooling with Dataproc.)

Specifically, DataprocJobRunner now supports:

Improvements to existing Dataproc features:

  • bootstrap scripts run in a temp dir, rather than /
  • uses Dataproc’s built-in auto-termination feature, rather than a script
  • GCS filesystem:
    • cat() streams data rather than dumping to a temp file
    • exists() no longer swallows all exceptions

To get started, read Getting started with Google Cloud.

Other changes

mrjob no longer streams your job output to the command line if you specify output_dir. You can control this with the --cat-output and --no-cat-output switches (--no-output is deprecated).

cloud_upload_part_size has been renamed to cloud_part_size_mb (the old name will work until v0.7.0).

mrjob can now recognize “not a valid JAR” errors from Hadoop and suggest them as probable cause of job failure.

mrjob no longer depends on google-cloud (which implies several other Google libraries). Its current Google library dependencies are google-cloud-logging 1.5.0+ and google-cloud-storage 1.9.0+. Future versions of mrjob will depend on google-cloud-dataproc 0.11.0+ (currently included with mrjob because it hasn’t yet been released).

RetryWrapper now sets __name__ when wrapping methods, making for easier debugging.

0.6.2

mrjob is now orders of magnitude quicker at parsing logs, making it practical to diagnose rare errors from very large jobs. However, on some AMIs, it can no longer parse errors without waiting for logs to transfer to S3 (this may be fixed in a future version).

To run jobs on Google Cloud Dataproc, mrjob no longer requires you to install the gcloud util (though if you do have it installed, mrjob can read credentials from its configs). For details, see Dataproc Quickstart.

mrjob no longer requires you to select a Dataproc zone prior to running jobs. Auto zone placement (just set region and let Dataproc pick a zone) is now enabled, with the default being auto zone placement in us-west1. mrjob no longer reads zone and region from gcloud‘s compute engine configs.

mrjob’s Dataproc code has been ported from the google-python-api-client library (which is in maintenance mode) to google-cloud-sdk, resulting in some small changes to the GCS filesystem API. See CHANGES.txt for details.

Local mode now has a num_cores option that allows you to control how many tasks it handles simultaneously.
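
For example, to cap local mode at four simultaneous tasks (the number is only illustrative):

runners:
  local:
    num_cores: 4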

0.6.1

Added the diagnose tool (run mrjob diagnose j-CLUSTERID), which determines why a previously run job failed.

Fixed a serious bug that made mrjob unable to properly parse error logs in some cases.

Added the get_job_steps() method to EMRJobRunner.

0.6.0

Dropped Python 2.6

mrjob now supports Python 2.7 and Python 3.3+. (Some versions of PyPy also work but are not officially supported.)

boto3, not boto

mrjob now uses boto3 rather than boto to talk to AWS. This makes it much simpler to pass user-defined data structures directly to the API, enabling a number of features.

At least version 1.4.6 of boto3 is required to run jobs on EMR.

It is now possible to fully configure instances (including EBS volumes). See instance_groups for an example.

mrjob also now supports Instance Fleets, which may be fully configured (including EBS volumes) through the instance_fleets option.

Methods that took or returned boto objects (for example, make_emr_conn()) have been completely removed, as there was no way to make a deprecated shim for them without keeping boto as a dependency. See EMRJobRunner and S3Filesystem for new method names.

Note that boto3 reads temporary credentials from $AWS_SESSION_TOKEN, not $AWS_SECURITY_TOKEN as in boto (see aws_session_token for details).

argparse, not optparse

mrjob now uses argparse to parse options, rather than optparse, which has been deprecated since Python 2.7.

argparse has slightly different option-parsing logic. A couple of things you should be aware of:

  • everything that starts with - is assumed to be a switch. --hadoop-arg=-verbose works, but --hadoop-arg -verbose does not.
  • positional arguments may not be split. mr_wc.py CHANGES.txt LICENSE.txt -r local will work, but mr_wc.py CHANGES.txt -r local LICENSE.txt will not.

Passthrough options, file options, etc. are now handled with add_file_arg(), add_passthru_arg(), configure_args(), load_args(), and pass_arg_through(). The old methods with “option” in their name are deprecated but still work.

As part of this refactor, OptionStore and its subclasses have been removed; options are now handled by runners directly.

Chunks, not lines

mrjob no longer assumes that job output will be line-based. If you run your job programmatically, you should read your job output with cat_output(), which yields bytestrings which don’t necessarily correspond to lines, and run these through parse_output(), which will convert them into key/value pairs.

runner.fs.cat() also now yields arbitrary bytestrings, not lines. When it yields from multiple files, it will yield an empty bytestring (b'') between the chunks from each file.

read_file() and read_input() are now deprecated because they are line-based. Try decompress(), to_chunks(), and to_lines().

Better local/inline mode

The sim runners (inline and local mode) have been completely rewritten, making it possible to fix a number of outstanding issues.

Local mode now runs one mapper/reducer per CPU, using multiprocessing, for faster results.

We only sort by reducer key (not the full line) unless SORT_VALUES is set, exposing bad assumptions sooner.

The step_output_dir option is now supported, making it easier to debug issues in intermediate steps.

Files in tasks’ (e.g. mappers’) working directories are marked user-executable, to better imitate Hadoop Distributed Cache. When possible, we also symlink to a copy of each file/archive in the “cache,” rather than copying them.

If os.symlink() raises an exception, we fall back to copying (this can be an issue in Python 3 on Windows).

Tasks are run more like they are in Hadoop; input is passed through stdin, rather than as script arguments. mrjob.cat is no longer executable because local mode no longer needs it.

Cloud runner improvements

Much of the common code for the “cloud” runners (Dataproc and EMR) has been merged, so that new features can be rolled out in parallel.

The bootstrap option (for both Dataproc and EMR) can now take archives and directories as well as files, like the setup option has since version 0.5.8.

The extra_cluster_params option allows you to pass arbitrary JSON to the API at cluster create time (in Dataproc and EMR). The old emr_api_params option is deprecated and disabled.

max_hours_idle has been replaced with max_mins_idle (the old option is deprecated but still works). The default is 10 minutes. Due to a bug, smaller numbers of minutes might cause the cluster to terminate before the job runs.

It is no longer possible for mrjob to launch a cluster that sits idle indefinitely (except by setting max_mins_idle to an unreasonably high value). It is still a good idea to run report-long-jobs because mrjob can’t tell if a running job is doing useful work or has stalled.

EMR now bills by the second, not the hour

Elastic MapReduce recently stopped billing by the full hour, and now bills by the second. This means that Cluster Pooling is no longer a cost-saving strategy, though developers might find it handy to reduce wait times when testing.

The mins_to_end_of_hour option no longer makes sense, and has been deprecated and disabled.

audit-emr-usage has been updated to use billing by the second when approximating time billed and waste.

Note

Pooling was enabled by default for some development versions of v0.6.0, prior to the billing change. This did not make it into the release; you must still explicitly turn on cluster pooling.

Other EMR changes

The default AMI is now 5.8.0. Note that this means you get Spark 2 by default.

Regions are now case-sensitive, and the EU alias for eu-west-1 no longer works.

Pooling no longer adds dummy arguments to the master bootstrap script, instead setting the __mrjob_pool_hash and __mrjob_pool_name tags on the cluster.

mrjob automatically adds the __mrjob_version tag to clusters it creates.

Jobs will only add tags to clusters they create, not to clusters they join.

enable_emr_debugging now works on AMI 4.x and later.

AMI 2.4.2 and earlier are no longer supported (no Python 2.7). There is no longer any special logic for the “latest” AMI alias (which the API no longer supports).

The SSH filesystem no longer dumps file contents to memory.

Pooling will only join a cluster with enough running instances to meet its specifications; requested instances no longer count.

Pooling is now aware of EBS (disk) setup.

Pooling won’t join a cluster that has extra instance types that don’t have enough memory or disk space to run your job.

Errors in bootstrapping scripts are no longer dumped as JSON.

visible_to_all_users is deprecated.

Massive purge of deprecated code

About a hundred functions, methods, options, and more that were deprecated in v0.5.x have been removed. See CHANGES.txt for details.

0.5.12

This release came out after v0.6.3. It was mostly a backport from v0.6.x.

Python 2.6 and 3.3 are no longer supported.

mrjob.parse.parse_s3_uri() handles s3a:// URIs.

terminate-idle-clusters now skips termination-protected idle clusters, rather than crashing.

Since Amazon no longer bills by the full hour, the mins_to_end_of_hour option now defaults to 60, effectively disabling it.

When mrjob passes an environment dictionary to subprocesses, it ensures that the keys and values are always strs (this mostly affects Python 2 on Windows).

0.5.11

The report-long-jobs utility can now ignore certain clusters based on EMR tags.

This version deals more gracefully with clusters that use instance fleets, preventing crashes that may occur in some rare edge cases.

0.5.10

Fixed an issue where bootstrapping mrjob on Dataproc or EMR could stall if mrjob was already installed.

The aws_security_token option has been renamed to aws_session_token. If you want to set it via environment variable, you still have to use $AWS_SECURITY_TOKEN because that’s what boto uses.

Added protocol support for rapidjson; see RapidJSONProtocol and RapidJSONValueProtocol. If available, rapidjson will be used as the default JSON implementation if ujson is not installed.

The master bootstrap script on EMR and Dataproc now has the correct file extension (.sh, not .py).

0.5.9

Fixed a bug that prevented setup scripts from working on EMR AMIs 5.2.0 and later. Our workaround should be completely transparent unless you use a custom shell binary; see sh_bin for details.

The EMR runner now correctly re-starts the SSH tunnel to the job tracker/resource manager when a cluster it tries to run a job on auto-terminates. It also no longer requires a working SSH tunnel to fetch job progress (you still need working SSH; see ec2_key_pair_file).

The emr_applications option has been renamed to applications.

The terminate-idle-clusters utility is now slightly more robust in cases where your S3 temp directory is in a different region from your clusters.

Finally, there are a couple of changes that probably only matter if you’re trying to wrap your Hadoop tasks (mappers, reducers, etc.) in Docker:

  • You can set just the python binary for tasks with task_python_bin. This allows you to use a wrapper script in place of Python without perturbing setup scripts.
  • Local mode now no longer relies on an absolute path to access the mrjob.cat utility it uses to handle compressed input files; copying the job’s working directory into Docker is enough.

0.5.8

You can now pass directories to jobs, either directly with the upload_dirs option, or through setup commands. For example:

--setup 'export PYTHONPATH=$PYTHONPATH:your-src-code/#'

mrjob will automatically tarball these directories and pass them to Hadoop as archives.

For multi-step jobs, you can now specify where inter-step output goes with step_output_dir (--step-output-dir), which can be useful for debugging.
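
For example, a sketch keeping intermediate output around for inspection (the path is a placeholder):

runners:
  hadoop:
    step_output_dir: hdfs:///tmp/your-job-steps  # placeholder HDFS path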

All job step types now take the jobconf keyword argument to set Hadoop properties for that step.

Jobs’ --help printout is now better-organized and less verbose.

Made several fixes to pre-filters (commands that pipe into streaming steps):

mrjob now respects sh_bin when it needs to wrap a command in sh before passing it to Hadoop (e.g. to support pipes).

On EMR, mrjob now fetches logs from task nodes when determining probable cause of error, not just core nodes (the ones that run tasks and host HDFS).

Several unused functions in mrjob.util are now deprecated:

  • args_for_opt_dest_subset()
  • bash_wrap()
  • populate_option_groups_with_options()
  • scrape_options_and_index_by_dest()
  • tar_and_gzip()

bunzip2_stream() and gunzip_stream() have been moved from mrjob.util to mrjob.cat.

SSHFilesystem.ssh_slave_hosts() has been deprecated.

Option group attributes in MRJobs have been deprecated, as has the get_all_option_groups() method.

0.5.7

Cluster pooling

mrjob can now add up to 1,000 steps on pooled clusters on EMR (except on very old AMIs). mrjob now prints debug messages explaining why your job matched a particular pooled cluster when running in verbose mode (the -v option). Fixed a bug that caused pooling to fail when there was no need for a master bootstrap script (e.g. when running with --no-bootstrap-mrjob).

Other improvements

Log interpretation is much more efficient at determining a job’s probable cause of failure (this works with Spark as well).

When running custom JARs (see JarStep), mrjob now respects libjars and jobconf.

The hadoop_streaming_jar option now supports environment variables and ~.

The terminate-idle-clusters tool now works with all step types, including Spark. (It’s still recommended that you rely on the max_hours_idle option rather than this tool.)

mrjob now works in Anaconda3 Jupyter Notebook.

Bugfixes

Added several missing command-line switches, including --no-bootstrap-python on Dataproc. Made a major refactor that should prevent these kinds of issues in the future.

Fixed a bug that caused mrjob to crash when the ssh binary (see ssh_bin) was missing or not executable.

Fixed a bug that erroneously reported failed or just-started jobs as 100% complete.

Fixed a bug where timestamps were erroneously recognized as URIs. mrjob now only recognizes strings containing :// as URIs (see is_uri()).

Deprecation

The following are deprecated and will be removed in v0.6.0:

  • JarStep.INPUT; use mrjob.step.INPUT instead
  • JarStep.OUTPUT; use mrjob.step.OUTPUT instead
  • non-strict protocols (see strict_protocols)
  • the python_archives option (try this instead)
  • is_windows_path()
  • parse_key_value_list()
  • parse_port_range_list()
  • scrape_options_into_new_groups()

0.5.6

Fixed a critical bug that caused the Dataproc runner to always crash when determining the Hadoop version.

Log interpretation now prioritizes task errors (e.g. a traceback from your Python script) as probable cause of failure, even if they aren’t the most recent error. Log interpretation will now continue to download and parse task logs until it finds a non-empty stderr log.

Log interpretation also strips the “subprocess failed” Java stack trace that appears in task stderr logs from Hadoop 1.

0.5.5

Functionally equivalent to 0.5.4, except that it restores the deprecated ami_version option as an alias for image_version, making it easier to upgrade from earlier versions of mrjob.

Also slightly improves Cluster Pooling on EMR with updated information on memory and CPU power of various EC2 instance types, and by treating application names (e.g. “Spark”) as case-insensitive.

0.5.4

Pooling and idle cluster self-termination

Warning

This release accidentally removed the ami_version option instead of merely deprecating it. If you are upgrading from an earlier version of mrjob, use version 0.5.5 or later.

This release resolves a long-standing EMR API race condition that made it difficult to use Cluster Pooling and idle cluster self-termination (see max_hours_idle) together. Now if your pooled job unknowingly runs on a cluster that was in the process of shutting down, it will detect that and re-launch the job on a different cluster.

This means pretty much everyone running jobs on EMR should now enable pooling, with a configuration like this:

runners:
  emr:
    max_hours_idle: 1
    pool_clusters: true

You may also run the terminate-idle-clusters script periodically, but (barring any bugs) this shouldn’t be necessary.

Generic EMR option names

Many options to the EMR runner have been made more generic, to make it easier to share code with the Dataproc runner (in most cases, the new names are also shorter and easier to remember):

old option name                new option name
ami_version                    image_version
aws_availability_zone          zone
aws_region                     region
check_emr_status_every         check_cluster_every
ec2_core_instance_bid_price    core_instance_bid_price
ec2_core_instance_type         core_instance_type
ec2_instance_type              instance_type
ec2_master_instance_bid_price  master_instance_bid_price
ec2_master_instance_type       master_instance_type
ec2_slave_instance_type        core_instance_type
ec2_task_instance_bid_price    task_instance_bid_price
ec2_task_instance_type         task_instance_type
emr_tags                       tags
num_ec2_core_instances         num_core_instances
num_ec2_task_instances         num_task_instances
s3_log_uri                     cloud_log_dir
s3_sync_wait_time              cloud_fs_sync_secs
s3_tmp_dir                     cloud_tmp_dir
s3_upload_part_size            cloud_upload_part_size

The old option names and command-line switches are now deprecated but will continue to work until v0.6.0. (Exception: ami_version was accidentally removed; if you need it, use 0.5.5 or later.)

num_ec2_instances has simply been deprecated (it’s just num_core_instances plus one).

hadoop_streaming_jar_on_emr has also been deprecated; in its place, you can now pass a file:// URI to hadoop_streaming_jar to reference a path on the master node.

Log interpretation

Log interpretation (counters and probable cause of job failure) on Hadoop is more robust, handling a wider variety of log4j formats and recovering more gracefully from permissions errors. This includes fixing a crash that could happen on Python 3 when attempting to read data from HDFS.

Log interpretation used to be partially broken on EMR AMI 4.3.0 and later due to a permissions issue; this is now fixed.

pass_through_option()

You can now pass through existing command-line switches to your job; for example, you can tell a job which runner launched it. See pass_through_option() for details.

If you don’t do this, self.options.runner will now always be None in your job (it used to confusingly default to 'inline').

Stop logging credentials

When mrjob is run in verbose mode (the -v option), the values of all runner options are debug-logged to stderr. This has been the case since the very early days of mrjob.

Unfortunately, this means that if you set your AWS credentials in mrjob.conf, they get logged as well, creating a surprising potential security vulnerability. (This doesn’t happen for AWS credentials set through environment variables.)

Starting in this version, the values of aws_secret_access_key and aws_security_token are shown as '...' if they are set, and all but the last four characters of aws_access_key_id are blanked out as well (e.g. '...YNDR').

Other improvements and bugfixes

The ssh tunnel to the resource manager on EMR (see ssh_tunnel) now connects to its correct internal IP; this resolves a firewall issue that existed on some VPC setups.

Uploaded files will no longer be given names starting with _ or ., since Hadoop’s input processing treats these files as “hidden”.

The EMR idle cluster self-termination script (see max_hours_idle) now only runs on the master node.

The audit-emr-usage command-line tool should no longer constantly trigger throttling warnings.

bootstrap_python no longer bothers trying to install Python 3 on EMR AMI 4.6.0 and later, since it is already installed.

The --ssh-bind-ports command-line switch was broken (starting in 0.4.5!), and is now fixed.

0.5.3

This release adds support for custom libjars (such as nicknack), allowing easy access to custom input and output formats. This works on Hadoop and EMR (including on a cluster that’s already running).

In addition, jobs can specify needed libjars by setting the LIBJARS attribute or overriding the libjars() method. For examples, see Input and output formats.
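
For example, a config sketch pointing libjars at a local copy of nicknack (the path and version are placeholders):

runners:
  emr:
    libjars:
      - /path/to/nicknack-1.0.0.jar  # placeholder path to the JAR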

The Hadoop runner now tries even harder to find your log files without needing additional configuration (see hadoop_log_dirs).

The EMR runner now supports Amazon VPC subnets (see subnet), and, on 4.x AMIs, Application Configurations (see emr_configurations).
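
For example, a minimal sketch placing clusters in a VPC subnet (the subnet ID is a placeholder):

runners:
  emr:
    subnet: subnet-0123abcd  # placeholder subnet ID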

If your EMR cluster fails during bootstrapping, mrjob can now determine the probable cause of failure.

There are also some minor improvements to SSH tunneling and a handful of small bugfixes; see CHANGES.txt for details.

0.5.2

This release adds basic support for Google Cloud Dataproc, which is Google’s Hadoop service, roughly analogous to EMR. See Dataproc Quickstart. Some features are not yet implemented:

  • fetching counters
  • finding probable cause of errors
  • running Java JARs as steps

Added the emr_applications option, which helps you configure 4.x AMIs.

Fixed an EMR bug (introduced in v0.5.0) where we were waiting for steps to complete in the wrong order (in a multi-step job, we wouldn’t register that the first step had finished until the last one had).

Fixed a bug in SSH tunneling (introduced in v0.5.0) that made connections to the job tracker/resource manager on EMR time out when running on a 2.x AMI inside a VPC (Virtual Private Cloud).

Fixed a bug (introduced in v0.4.6) that kept mrjob from interpreting ~ (home directory) in includes in mrjob.conf.

It is now again possible to run tool modules deprecated in v0.5.0 directly (e.g. python -m mrjob.tools.emr.create_job_flow). This is still a deprecated feature; it’s recommended that you use the appropriate mrjob subcommand instead (e.g. mrjob create-cluster).

0.5.1

Fixes a bug in the previous release that broke SORT_VALUES and any other attempt by the job to set the partitioner. The --partitioner switch is now deprecated (the choice of partitioner is part of your job semantics).

Fixes a bug in the previous release that caused strict_protocols and check_input_paths to be ignored in mrjob.conf. (We would much prefer you fixed jobs that are using “loose protocols” rather than setting strict_protocols: false in your config file, but we didn’t break this on purpose, we promise!)

mrjob terminate-idle-clusters now correctly handles EMR debugging steps (see enable_emr_debugging) set up by boto 2.40.0.

Fixed a bug that could result in showing a blank probable cause of error for pre-YARN (Hadoop 1) jobs.

ssh_bind_ports now defaults to a range object (xrange on Python 2), so that when you run on emr in verbose mode (-r emr -v), debug logging devotes one line to the value of ssh_bind_ports rather than 840.

0.5.0

Python versions

mrjob now fully supports Python 3.3+ in a way that should be transparent to existing Python 2 users (you don’t have to suddenly start handling unicode instead of str). For more information, see Python 2 vs. Python 3.

If you run a job with Python 3, mrjob will automatically install Python 3 on Elastic MapReduce AMIs (see bootstrap_python).

When you run jobs on EMR in Python 2, mrjob attempts to match your minor version of Python as well (either python2.6 or python2.7); see python_bin for details.

Note

If you’re currently running Python 2.7, and using yum to install python libraries, you’ll want to use the Python 2.7 version of the package (e.g. python27-numpy rather than python-numpy).

The mrjob command is now installed with Python-version-specific aliases (e.g. mrjob-3, mrjob-3.4), in case you install mrjob for multiple versions of Python.

Hadoop

mrjob should now work out-of-the box on almost any Hadoop setup. If hadoop is in your path, or you set any commonly-used $HADOOP_* environment variable, mrjob will find the Hadoop binary, the streaming jar, and your logs, without any help on your part (see hadoop_bin, hadoop_log_dirs, hadoop_streaming_jar).

mrjob has been updated to fully support Hadoop 2 (YARN), including many updates to HadoopFilesystem. Hadoop 1 is still supported, though anything prior to Hadoop 0.20.203 is not (mrjob is actually a few months older than Hadoop 0.20.203, so this used to matter).

3.x and 4.x AMIs

mrjob now fully supports the 3.x and 4.x Elastic MapReduce AMIs, including SSH tunneling to the resource manager, fetching counters, and finding probable cause of job failure.

The default ami_version (see image_version) is now 3.11.0. Our plan is to continue updating this to the latest (non-broken) 3.x AMI for each 0.5.x release of mrjob.

The default instance_type is now m1.medium (m1.small is too small for the 3.x and 4.x AMIs).

You can specify 4.x AMIs with either the new release_label option, or continue using ami_version; both work.

mrjob continues to support 2.x AMIs. However:

Warning

2.x AMIs are deprecated by AWS, and based on a very old version of Debian (squeeze), which breaks apt-get and exposes you to security holes.

Please, please switch if you haven’t already.

AWS Regions

The new default aws_region (see region) is us-west-2 (Oregon). This both matches the default in the EMR console and, according to Amazon, is carbon neutral.

An edge case that might affect you: EC2 key pairs (i.e. SSH credentials) are region-specific, so if you’ve set up SSH but not explicitly specified a region, you may get an error saying your key pair is invalid. The fix is simply to create new SSH keys for the us-west-2 (Oregon) region.

S3

mrjob is much smarter about the way it interacts with S3:
  • automatically creates temp bucket in the same region as jobs
  • connects to S3 buckets on the endpoint matching their region (no more 307 errors)
    • EMRJobRunner and S3Filesystem methods no longer take s3_conn args (passing around a single S3 connection no longer makes sense)
  • no longer uses the temp bucket’s location to choose where you run your job
  • rm() no longer has special logic for *_$folder$ keys
  • ls() recurses “subdirectories” even if you pass it a URI without a trailing slash

Log interpretation

The part of mrjob that fetches counters and tells you what probably caused your job to fail was basically unmaintainable and has been totally rewritten. Not only do we now have solid support across Hadoop and EMR AMI versions, but if we missed anything, it should be straightforward to add it.

One casualty of this change was the mrjob fetch-logs command, which means mrjob no longer offers a way to fetch or interpret logs from a past job. We do plan to re-introduce this functionality.

Protocols

Protocols are now strict by default (they simply raise an exception on unencodable data). “Loose” protocols can be re-enabled with the --no-strict-protocols switch; see strict_protocols for why this is a bad idea.

Protocols will now use the much faster ujson library, if installed, to encode and decode JSON. This is especially recommended for simple jobs that spend a significant fraction of their time encoding and decoding data.

Note

If you’re using EMR, try out this bootstrap recipe to install ujson.

mrjob will fall back to the simplejson library if ujson is not installed, and use the built-in json module if neither is installed.

You can now explicitly specify which JSON implementation you wish to use (e.g. StandardJSONProtocol, SimpleJSONProtocol, UltraJSONProtocol).

Status messages

We’ve tried to cut the logging messages that your job prints as it runs down to the basics (either useful info, like where a temp directory is, or something that tells you why you’re waiting). If there are any messages you miss, try running your job with -v.

When a step in your job fails, mrjob no longer prints a useless stacktrace telling you where in the code the runner raised an exception about your step failing. This is thanks to StepFailedException, which you can also catch and interpret if you’re running jobs programmatically.

Deprecation

Many things that were deprecated in 0.4.6 have been removed:

mrjob.compat functions supports_combiners_in_hadoop_streaming(), supports_new_distributed_cache_options(), and uses_generic_jobconf(), which only existed to support very old versions of Hadoop, were removed without deprecation warnings (sorry!).

To avoid a similar wave of deprecation warnings in the future, the name of every part of mrjob that isn’t meant to be a stable interface provided by the library now starts with an underscore. You can still use these things (or copy them; it’s Open Source), but there’s no guarantee they’ll exist in the next release.

If you want to get ahead of the game, here is a list of things that are deprecated starting in mrjob 0.5.0 (do these after upgrading mrjob):

  • mrjob subcommands
    • mrjob create-job-flow is now mrjob create-cluster
    • mrjob terminate-idle-job-flows is now mrjob terminate-idle-clusters
    • mrjob terminate-job-flow is now mrjob terminate-cluster

Other changes

  • mrjob now requires boto 2.35.0 or newer (chances are you’re already doing this). Later 0.5.x releases of mrjob may require newer versions of boto.
  • visible_to_all_users now defaults to True
  • HadoopFilesystem.rm() uses -skipTrash
  • new iam_endpoint option
  • custom hadoop_streaming_jars are properly uploaded
  • JOB cleanup on EMR is temporarily disabled
  • mrjob now follows symlinks when ls()ing the local filesystem (beware recursive symlinks!)
  • The interpreter option disables bootstrap_mrjob by default (interpreter is meant for non-Python jobs)
  • Cluster Pooling now respects ec2_key_pair
  • cluster self-termination (see max_hours_idle) now respects non-streaming jobs
  • LocalFilesystem now rejects URIs rather than interpreting them as local paths
  • local and inline runners no longer have a default hadoop_version, instead handling jobconf in a version-agnostic way
  • steps_python_bin now defaults to the current Python interpreter.
  • minor changes to mrjob.util:
    • file_ext() takes filename, not path
    • gunzip_stream() now yields chunks of bytes, not lines
    • moved random_identifier() method here from mrjob.aws
    • buffer_iterator_to_line_iterator() is now named to_lines(), and no longer appends a trailing newline to data.

0.4.6

include: in conf files can now use relative paths in a meaningful way. See Relative includes.

List and environment variable options loaded from included config files can be totally overridden using the !clear tag. See Clearing configs.

Options that take lists (e.g. setup) now treat scalar values as single-item lists. See this example.

Fixed a bug that kept the pool_wait_minutes option from being loaded from config files.

0.4.5

This release moves mrjob off the deprecated DescribeJobFlows EMR API call.

Warning

AWS again broke older versions of mrjob for at least some new accounts, by returning 400s for the deprecated DescribeJobFlows API call. If you have a newer AWS account (circa July 2015), you must use at least this version of mrjob.

The new API does not provide a way to tell when a job flow (now called a “cluster”) stopped provisioning instances and started bootstrapping, so the clock for our estimate of when we are close to the end of a billing hour now starts at cluster creation time, and is thus more conservative.

Related to this change, terminate_idle_job_flows no longer considers job flows in the STARTING state idle; use report_long_jobs to catch jobs stuck in this state.

terminate_idle_job_flows performs much better on large numbers of job flows. Formerly, it collected all job flow information first, but now it terminates idle job flows as soon as it identifies them.

collect_emr_stats and job_flow_pool have not been ported to the new API and will be removed in v0.5.0.

Added an aws_security_token option to allow you to run mrjob on EMR using temporary AWS credentials.

Added an emr_tags (see tags) option to allow you to tag EMR job flows at creation time.

EMRJobRunner now has a get_ami_version() method.
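
A minimal sketch of using it from a programmatic run (assumes the EMR runner; the job class and module names are hypothetical):

from my_jobs import MRWordFreqCount  # hypothetical module/job

job = MRWordFreqCount(args=['-r', 'emr', 'input.txt'])
with job.make_runner() as runner:
    runner.run()
    # only EMRJobRunner has this method
    print(runner.get_ami_version())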

The hadoop_version option no longer has any effect in EMR. This option only ever did anything on the 1.x AMIs, which mrjob no longer supports.

Added many missing switches to the EMR tools (accessible from the mrjob command). Formerly, you had to use a config file to get at these options.

You can now access the mrboss tool from the command line: mrjob boss <args>.

Previous 0.4.x releases have worked with boto as old as 2.2.0, but this one requires at least boto 2.6.0 (which is still more than two years old). In any case, it’s recommended that you just use the latest version of boto.

This branch has a number of additional deprecation warnings, to help prepare you for mrjob v0.5.0. Please heed them; a lot of deprecated things really are going to be completely removed.

0.4.4

mrjob now automatically creates and uses IAM objects as necessary to comply with new requirements from Amazon Web Services.

(You do not need to install the AWS CLI or run aws emr create-default-roles as the link above describes; mrjob takes care of this for you.)

Warning

The change that AWS made essentially broke all older versions of mrjob for all new accounts. If the first time your AWS account created an Elastic MapReduce cluster was on or after April 6, 2015, you should use at least this version of mrjob.

If you must use an old version of mrjob with a new AWS account, see this thread for a possible workaround.

--iam-job-flow-role has been renamed to --iam-instance-profile.

New --iam-service-role option.

0.4.3

This release contains many, many bugfixes, one of which probably affects you! See CHANGES.txt for details.

Added a new subcommand, mrjob collect-emr-active-stats, to collect stats about active jobflows and instance counts.

--iam-job-flow-role option allows setting of a specific IAM role to run this job flow.

You can now use --check-input-paths and --no-check-input-paths on EMR as well as Hadoop.

Files larger than 100MB will be uploaded to S3 using multipart upload if you have the filechunkio module installed. You can change the limit/part size with the --s3-upload-part-size option, or disable multipart upload by setting this option to 0.

You can now require protocols to be strict from mrjob.conf; this means unencodable input/output will result in an exception rather than the job quietly incrementing a counter. It is recommended you set this for all runners:

runners:
  emr:
    strict_protocols: true
  hadoop:
    strict_protocols: true
  inline:
    strict_protocols: true
  local:
    strict_protocols: true

You can use --no-strict-protocols to turn off strict protocols for a particular job.

Tests now support pytest and tox.

Support for Python 2.5 has been dropped.

0.4.2

JarSteps, previously experimental, are now fully integrated into multi-step jobs, and work with both the Hadoop and EMR runners. You can now use powerful Java libraries such as Mahout in your MRJobs. For more information, see Jar steps.
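
As a rough sketch, a jar step followed by a streaming step might look like this (written against the current mrjob.step names, which may differ slightly in this release; the jar path and job name are hypothetical):

from mrjob.job import MRJob
from mrjob.step import INPUT, OUTPUT, JarStep, MRStep

class MRJarThenCount(MRJob):
    def steps(self):
        return [
            # run a jar as the first step; INPUT and OUTPUT are
            # placeholders that mrjob interpolates at run time
            JarStep(jar='my-preprocess.jar',
                    args=[INPUT, OUTPUT]),
            # then a normal Hadoop Streaming step
            MRStep(mapper=self.mapper, reducer=self.reducer),
        ]

    def mapper(self, _, line):
        yield 'lines', 1

    def reducer(self, key, counts):
        yield key, sum(counts)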

Many options for setting up your task’s environment (--python-archive, --setup-cmd and --setup-script) have been replaced by a powerful --setup option. See the Job Environment Setup Cookbook for examples.

Similarly, many options for bootstrapping nodes on EMR (--bootstrap-cmd, --bootstrap-file, --bootstrap-python-package and --bootstrap-script) have been replaced by a single --bootstrap option. See the EMR Bootstrapping Cookbook.

This release also contains many bugfixes, including problems with boto 2.10.0+, bz2 decompression, and Python 2.5.

0.4.1

The SORT_VALUES option enables secondary sort, ensuring that your reducer(s) receive values in sorted order. This allows you to do things with reducers that would otherwise involve storing all the values in memory, such as:

  • Receiving a grand total before any subtotals, so you can calculate percentages on the fly. See mr_next_word_stats.py for an example.
  • Running a window of fixed length over an arbitrary amount of sorted values (e.g. a 24-hour window over timestamped log data).
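
A minimal sketch of the windowing use case (the job and field names are hypothetical):

from mrjob.job import MRJob

class MREventsInOrder(MRJob):
    SORT_VALUES = True

    def mapper(self, _, line):
        user_id, timestamp, event = line.split('\t')
        # with SORT_VALUES, each reducer call sees this user's events
        # ordered by (timestamp, event) rather than in arrival order
        yield user_id, (timestamp, event)

    def reducer(self, user_id, timestamped_events):
        # values arrive pre-sorted, so a sliding window or running
        # total needs no in-memory sort
        for timestamp, event in timestamped_events:
            yield user_id, (timestamp, event)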

The max_hours_idle option allows you to spin up EMR job flows that will terminate themselves after being idle for a certain amount of time, in a way that optimizes EMR/EC2’s full-hour billing model.

For development (not production), we now recommend always using job flow pooling, with max_hours_idle enabled. Update your mrjob.conf like this:

runners:
  emr:
    max_hours_idle: 0.25
    pool_emr_job_flows: true

Warning

If you enable pooling without max_hours_idle (or cronning terminate_idle_job_flows), pooled job flows will stay active forever, costing you money!

You can now use --no-check-input-paths with the Hadoop runner to allow jobs to run even if hadoop fs -ls can’t see their input files (see check_input_paths).

Two bits of straggling deprecated functionality were removed:

  • Built-in protocols must be instantiated to be used (formerly they had class methods).
  • Old locations for mrjob.conf are no longer supported.

This version also contains numerous bugfixes and natural extensions of existing functionality; many more things will now Just Work (see CHANGES.txt).

0.4.0

The default runner is now inline instead of local. This change will speed up debugging for many users. Use local if you need to simulate more features of Hadoop.

The EMR tools can now be accessed more easily via the mrjob command. Learn more here.

Job steps are much richer now:

  • You can now use mrjob to run jar steps other than Hadoop Streaming. More info
  • You can filter step input with UNIX commands (see the sketch after this list). More info
  • In fact, you can use arbitrary UNIX commands as your whole step (mapper/reducer/combiner). More info
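
A rough sketch of a step that pre-filters its mapper input with grep (using the current MRStep name, which may differ in this release; the pattern is hypothetical):

from mrjob.job import MRJob
from mrjob.step import MRStep

class MRErrorCount(MRJob):
    def steps(self):
        return [MRStep(
            # run each input line through grep before the mapper sees it
            mapper_pre_filter="grep 'ERROR'",
            mapper=self.mapper,
            reducer=self.reducer)]

    def mapper(self, _, line):
        yield 'errors', 1

    def reducer(self, key, counts):
        yield key, sum(counts)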

If you Ctrl+C from the command line, your job will be terminated if you give it time. If you’re running on EMR, that should prevent most accidental runaway jobs. More info

mrjob v0.4 requires boto 2.2.

We removed all deprecated functionality from v0.2:

  • --hadoop-*-format
  • --*-protocol switches
  • MRJob.DEFAULT_*_PROTOCOL
  • MRJob.get_default_opts()
  • MRJob.protocols()
  • PROTOCOL_DICT
  • IF_SUCCESSFUL
  • DEFAULT_CLEANUP
  • S3Filesystem.get_s3_folder_keys()

We love contributions, so we wrote some guidelines to help you help us. See you on Github!

0.3.5

The pool_wait_minutes (--pool-wait-minutes) option lets your job delay itself in case a job flow becomes available. Reference: Configuration quick reference

The JOB and JOB_FLOW cleanup options tell mrjob to clean up the job and/or the job flow on failure (including Ctrl+C). See CLEANUP_CHOICES for more information.

0.3.2

The EMR instance type/number options have changed to support spot instances:

  • core_instance_bid_price
  • core_instance_type
  • master_instance_bid_price
  • master_instance_type
  • slave_instance_type (alias for core_instance_type)
  • task_instance_bid_price
  • task_instance_type

There is also a new ami_version option to change the AMI your job flow uses for its nodes.

For more information, see mrjob.emr.EMRJobRunner.__init__().

The new report_long_jobs tool alerts on jobs that have run for more than X hours.

0.3

Features

Support for Combiners

You can now use combiners in your job. Like mapper() and reducer(), you can redefine combiner() in your subclass to add a single combiner step to run after your mapper but before your reducer. (MRWordFreqCount does this to improve performance.) combiner_init() and combiner_final() are similar to their mapper and reducer equivalents.
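
For example, MRWordFreqCount uses a combiner to pre-sum counts on the map side (a simplified sketch):

from mrjob.job import MRJob

class MRWordFreqCount(MRJob):
    def mapper(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def combiner(self, word, counts):
        # pre-sum on the map side to cut shuffle volume
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)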

You can also add combiners to custom steps by adding keyword arguments to your call to steps().

More info: One-step jobs, Multi-step jobs

*_init(), *_final() for mappers, reducers, combiners

Mappers, reducers, and combiners have *_init() and *_final() methods that are run before and after the input is run through the main function (e.g. mapper_init() and mapper_final()).
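
A minimal sketch (the job name is hypothetical):

from mrjob.job import MRJob

class MRLineCount(MRJob):
    def mapper_init(self):
        # runs once per task, before any input
        self.lines = 0

    def mapper(self, _, line):
        self.lines += 1

    def mapper_final(self):
        # runs once per task, after all input; it may still yield
        yield 'lines', self.lines

    def reducer(self, key, counts):
        yield key, sum(counts)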

More info: One-step jobs, Multi-step jobs

Custom Option Parsers

It is now possible to define your own option types and actions using a custom OptionParser subclass.

Job Flow Pooling

EMR jobs can pull job flows out of a “pool” of similarly configured job flows. This can make it easier to use a small set of job flows across multiple automated jobs, save time and money while debugging, and generally make your life simpler.

More info: Cluster Pooling

SSH Log Fetching

mrjob attempts to fetch counters and error logs for EMR jobs via SSH before trying to use S3. This method is faster, more reliable, and works with persistent job flows.

More info: Configuring SSH credentials

New EMR Tool: fetch_logs

If you want to fetch the counters or error logs for a job after the fact, you can use the new fetch_logs tool.

More info: mrjob.tools.emr.fetch_logs

New EMR Tool: mrboss

If you want to run a command on all nodes and inspect the output, perhaps to see what processes are running, you can use the new mrboss tool.

More info: mrjob.tools.emr.mrboss

Changes and Deprecations

Configuration

The search path order for mrjob.conf has changed. The new order is:

  • The location specified by MRJOB_CONF
  • ~/.mrjob.conf
  • ~/.mrjob (deprecated)
  • mrjob.conf in any directory in PYTHONPATH (deprecated)
  • /etc/mrjob.conf

If your mrjob.conf path is deprecated, use this table to fix it:

Old Location               New Location
~/.mrjob                   ~/.mrjob.conf
somewhere in PYTHONPATH    specify in MRJOB_CONF

More info: mrjob.conf

Defining Jobs (MRJob)

Mapper, combiner, and reducer methods no longer need to contain a yield statement if they emit no data.

The --hadoop-*-format switches are deprecated. Instead, set your job’s Hadoop formats with HADOOP_INPUT_FORMAT/HADOOP_OUTPUT_FORMAT or hadoop_input_format()/hadoop_output_format(). Hadoop formats can no longer be set from mrjob.conf.

In addition to --jobconf, you can now set jobconf values with the JOBCONF attribute or the jobconf() method. To read jobconf values back, use mrjob.compat.jobconf_from_env(), which ensures that the correct name is used depending on which version of Hadoop is active.

You can now set the Hadoop partitioner class with --partitioner, the PARTITIONER attribute, or the partitioner() method.
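
A short sketch combining these (the jobconf value and partitioner class are just illustrative):

from mrjob.compat import jobconf_from_env
from mrjob.job import MRJob

class MRByInputFile(MRJob):  # hypothetical job name
    JOBCONF = {'mapreduce.job.reduces': 1}
    PARTITIONER = 'org.apache.hadoop.mapred.lib.HashPartitioner'

    def mapper(self, _, line):
        # resolves to the right variable name for the active Hadoop version
        yield jobconf_from_env('mapreduce.map.input.file'), 1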

More info: Hadoop configuration

Protocols

Protocols can now be anything with a read() and write() method. Unlike previous versions of mrjob, they can be instance methods rather than class methods. You should use instance methods when defining your own protocols.
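
For example, a bare-bones custom protocol is just an object with those two methods (a sketch, using the usual tab-separated JSON convention):

import json

class MyJSONProtocol(object):
    # read() turns one line of input into a (key, value) pair
    def read(self, line):
        raw_key, raw_value = line.split('\t', 1)
        return json.loads(raw_key), json.loads(raw_value)

    # write() turns a (key, value) pair back into one line of output
    def write(self, key, value):
        return '%s\t%s' % (json.dumps(key), json.dumps(value))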

The --*protocol switches and DEFAULT_*PROTOCOL are deprecated. Instead, use the *_PROTOCOL attributes or redefine the *_protocol() methods.

Protocols now cache the decoded values of keys. Informal testing shows up to 30% speed improvements.

More info: Protocols

Running Jobs

All Modes

All runners are Hadoop-version aware and use the correct jobconf and combiner invocation styles. This change should decrease the number of warnings in Hadoop 0.20 environments.

All *_bin configuration options (hadoop_bin, python_bin, and ssh_bin) take lists instead of strings so you can add arguments (like ['python', '-v']). More info: Configuration quick reference

Cleanup options have been split into cleanup and cleanup_on_failure. There are more granular values for both of these options.

Most limitations have been lifted from passthrough options, including the former inability to use custom types and actions.

The job_name_prefix option is gone (was deprecated).

All URIs are passed through to Hadoop where possible. This should relax some requirements about what URIs you can use.

Steps with no mapper use cat instead of going through a no-op mapper.

Compressed files can be streamed with the cat() method.

EMR Mode

The default Hadoop version on EMR is now 0.20 (was 0.18).

The instance_type option only sets the instance type for slave nodes when there are multiple EC2 instances. This is because the master node can usually remain small without affecting the performance of the job.

Inline Mode

Inline mode now supports the cmdenv option.

Local Mode

Local mode now runs 2 mappers and 2 reducers in parallel by default.

There is preliminary support for simulating some jobconf variables. The current list of supported variables is:

  • mapreduce.job.cache.archives
  • mapreduce.job.cache.files
  • mapreduce.job.cache.local.archives
  • mapreduce.job.cache.local.files
  • mapreduce.job.id
  • mapreduce.job.local.dir
  • mapreduce.map.input.file
  • mapreduce.map.input.length
  • mapreduce.map.input.start
  • mapreduce.task.attempt.id
  • mapreduce.task.id
  • mapreduce.task.ismap
  • mapreduce.task.output.dir
  • mapreduce.task.partition

Other Stuff

boto 2.0+ is now required.

The Debian packaging has been removed from the repository.