Hadoop Cookbook

Increasing the task timeout

Warning

Some EMR AMIs appear not to support setting parameters like the task timeout with jobconf at run time. On these AMIs, you must instead use bootstrap-time configuration.

If your mappers or reducers take a long time to process a single step, you may want to increase the amount of time Hadoop allows a task to run before failing it as timed out.

You can do this with jobconf. For example, to set the timeout to one hour:

runners:
  hadoop: # also works for emr runner
    jobconf:
      mapreduce.task.timeout: 3600000  # one hour, in milliseconds
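
You can also pass jobconf options at run time with mrjob's --jobconf command-line switch, which may be handier for a one-off run. A sketch, assuming your job lives in a hypothetical script named mr_your_job.py:

python mr_your_job.py -r hadoop \
    --jobconf mapreduce.task.timeout=3600000 \
    input.txt

(Per the warning above, this run-time approach may not work on some EMR AMIs.)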

Note

If you’re using Hadoop 1, which uses mapred.task.timeout, don’t worry: this example still works because mrjob auto-converts your jobconf options between Hadoop versions.

Writing compressed output

To save space, you can have Hadoop automatically save your job’s output as compressed files. Here’s how to tell it to compress them with bzip2:

runners:
  hadoop: # also works for emr runner
    jobconf:
      # "true" must be a string argument, not a boolean! (Issue #323)
      mapreduce.output.fileoutputformat.compress: "true"
      mapreduce.output.fileoutputformat.compress.codec: org.apache.hadoop.io.compress.BZip2Codec

Note

You could also gzip your files with org.apache.hadoop.io.compress.GzipCodec. Usually bzip2 is the better option, as .bz2 files are splittable, and .gz files are not. For example, if you use .gz files as input, Hadoop has no choice but to assign one mapper per .gz file, no matter how large the files are.
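
If you do want gzip output anyway, the configuration is the same as above with the gzip codec swapped in; a sketch, assuming the same mrjob.conf layout as the bzip2 example:

runners:
  hadoop: # also works for emr runner
    jobconf:
      # "true" must be a string argument, not a boolean! (Issue #323)
      mapreduce.output.fileoutputformat.compress: "true"
      mapreduce.output.fileoutputformat.compress.codec: org.apache.hadoop.io.compress.GzipCodec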