
Don't open too many output files #69

@melodell

When the input is very large, Madoop can try to open far more output files at once than the OS allows and crash when it hits the open file descriptor limit.

In mapreduce.py, prepare_input_files() defines a magic number MAX_INPUT_SPLIT_SIZE that determines how many output files we create, with no upper bound. We then open all of the output files at once in a loop:

outfiles = [stack.enter_context(i.open('w')) for i in outpaths]

(This problem showed up once with the larger dataset for P5. Depending on a student's OS and choice of intermediate output, the input to a given MapReduce job could be massive. The quick fix was to bump MAX_INPUT_SPLIT_SIZE in their local environment, but we should fix it for real.)
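
For context, here is a minimal reconstruction of the failure (the ExitStack setup is implied by the snippet above; the outpaths construction and split count are illustrative assumptions, not the exact madoop code):

import contextlib
import pathlib

# Hypothetical split paths. In madoop, their count comes from the input
# size divided by MAX_INPUT_SPLIT_SIZE, with nothing bounding it.
outpaths = [pathlib.Path(f"part-{i:05d}") for i in range(100_000)]

with contextlib.ExitStack() as stack:
    # Every split file is opened up front: one descriptor per split.
    # A large enough input exceeds the per-process open-file limit here,
    # typically dying with OSError: [Errno 24] Too many open files.
    outfiles = [stack.enter_context(p.open('w')) for p in outpaths]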

Potential solutions:

  1. Open the output files one at a time instead (see the first sketch below)
  2. Run the split as a subprocess so the files are never opened in Python (see the second sketch below)

(CC + credit for solutions @MattyMay)
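
A minimal sketch of option 1, keeping at most one output file open at a time. The function name, the chunk naming, and the split-on-line-boundaries logic are illustrative, not the current prepare_input_files implementation:

import pathlib

MAX_INPUT_SPLIT_SIZE = 1024 * 1024  # illustrative value, not madoop's

def split_input(inpath: pathlib.Path, outdir: pathlib.Path):
    """Split inpath on line boundaries into parts of roughly
    MAX_INPUT_SPLIT_SIZE bytes, one output file open at a time."""
    outdir.mkdir(parents=True, exist_ok=True)
    part, size, outfile = 0, MAX_INPUT_SPLIT_SIZE, None
    with inpath.open() as infile:
        for line in infile:
            if size >= MAX_INPUT_SPLIT_SIZE:
                # Close the previous part before opening the next one,
                # so the descriptor count stays constant no matter how
                # many splits the input produces.
                if outfile is not None:
                    outfile.close()
                outfile = (outdir / f"part-{part:05d}").open('w')
                part += 1
                size = 0
            outfile.write(line)
            size += len(line)
    if outfile is not None:
        outfile.close()

And a sketch of option 2, shelling out to split(1) so all the file handling happens outside Python (reusing the names from the sketch above; the flags shown are one possible shape, not a chosen design):

import subprocess

# split(1) writes the chunks sequentially, one open file at a time.
# "-C" caps each part at a byte size without breaking lines, and "-d"
# requests numeric suffixes.
subprocess.run(
    ["split", "-C", str(MAX_INPUT_SPLIT_SIZE), "-d",
     str(inpath), str(outdir / "part-")],
    check=True,
)

One caveat worth flagging for option 2: -C and -d are GNU split options, so this would need a portability check on macOS, which matters given that the bug already depends on a student's OS.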
