
Don't open too many output files #69

@melodell

When the input is very large, Madoop can try to open far more output files at once than the OS allows and crash when it hits the open file descriptor limit.

In mapreduce.py, prepare_input_files() defines a magic number MAX_INPUT_SPLIT_SIZE that determines how many output files we create, with no upper bound. We then open all of the output files at once in a loop:

outfiles = [stack.enter_context(i.open('w')) for i in outpaths]

(This problem showed up once with the larger dataset for P5. Depending on a student's OS and choice of intermediate output, the input to a given MapReduce job could be massive. The quick fix was to bump MAX_INPUT_SPLIT_SIZE in their local environment, but we should fix it for real.)
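
For context, here is a minimal reconstruction of the failure (the ExitStack setup is implied by the snippet above; the outpaths construction and split count are illustrative assumptions, not the exact madoop code):

import contextlib
import pathlib

# Hypothetical split paths. In madoop, their count comes from the input
# size divided by MAX_INPUT_SPLIT_SIZE, with nothing bounding it.
outpaths = [pathlib.Path(f"part-{i:05d}") for i in range(100_000)]

with contextlib.ExitStack() as stack:
    # Every split file is opened up front: one descriptor per split.
    # A large enough input exceeds the per-process open-file limit here,
    # typically dying with OSError: [Errno 24] Too many open files.
    outfiles = [stack.enter_context(p.open('w')) for p in outpaths]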

Potential solutions:

  1. Open the output files one at a time instead (see the first sketch below)
  2. Run the split as a subprocess so the files are never opened in Python (see the second sketch below)

(CC + credit for solutions @MattyMay)
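
A minimal sketch of option 1, keeping at most one output file open at a time. The function name, the chunk naming, and the split-on-line-boundaries logic are illustrative, not the current prepare_input_files implementation:

import pathlib

MAX_INPUT_SPLIT_SIZE = 1024 * 1024  # illustrative value, not madoop's

def split_input(inpath: pathlib.Path, outdir: pathlib.Path):
    """Split inpath on line boundaries into parts of roughly
    MAX_INPUT_SPLIT_SIZE bytes, one output file open at a time."""
    outdir.mkdir(parents=True, exist_ok=True)
    part, size, outfile = 0, MAX_INPUT_SPLIT_SIZE, None
    with inpath.open() as infile:
        for line in infile:
            if size >= MAX_INPUT_SPLIT_SIZE:
                # Close the previous part before opening the next one,
                # so the descriptor count stays constant no matter how
                # many splits the input produces.
                if outfile is not None:
                    outfile.close()
                outfile = (outdir / f"part-{part:05d}").open('w')
                part += 1
                size = 0
            outfile.write(line)
            size += len(line)
    if outfile is not None:
        outfile.close()

And a sketch of option 2, shelling out to split(1) so all the file handling happens outside Python (reusing the names from the sketch above; the flags shown are one possible shape, not a chosen design):

import subprocess

# split(1) writes the chunks sequentially, one open file at a time.
# "-C" caps each part at a byte size without breaking lines, and "-d"
# requests numeric suffixes.
subprocess.run(
    ["split", "-C", str(MAX_INPUT_SPLIT_SIZE), "-d",
     str(inpath), str(outdir / "part-")],
    check=True,
)

One caveat worth flagging for option 2: -C and -d are GNU split options, so this would need a portability check on macOS, which matters given that the bug already depends on a student's OS.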
