When the input is very large, Madoop can try to open far too many output files at once and crash.
In `mapreduce.py`, `prepare_input_files()` defines a magic number `MAX_INPUT_SPLIT_SIZE` that determines the number of output files we try to create, unbounded by anything.
Then we open all the output files at once in a loop:

```python
outfiles = [stack.enter_context(i.open('w')) for i in outpaths]
```

(This problem showed up once with the bigger dataset for P5. Depending on a student's OS and choice of intermediate output, the input to a given MR job could be massive. The quick fix was to bump the value of `MAX_INPUT_SPLIT_SIZE` in their local environment, but we should fix it for real.)
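For concreteness, here is a minimal sketch of the current pattern. Only `MAX_INPUT_SPLIT_SIZE` and the `enter_context` list comprehension come from the actual code; the function signature, the split-count computation, and the round-robin write loop are assumptions for illustration:

```python
import contextlib
import math
from pathlib import Path

MAX_INPUT_SPLIT_SIZE = 2**20  # placeholder value; the real constant lives in mapreduce.py


def prepare_input_files(input_path: Path, output_dir: Path) -> None:
    """Sketch of the current behavior: one output file per split, all open at once."""
    # Number of splits grows linearly with input size -- nothing caps it.
    num_splits = math.ceil(input_path.stat().st_size / MAX_INPUT_SPLIT_SIZE)
    outpaths = [output_dir / f"part-{i:05d}" for i in range(num_splits)]
    with contextlib.ExitStack() as stack, input_path.open() as infile:
        # A huge input means thousands of simultaneously open descriptors,
        # which can exceed the OS limit (ulimit -n) and crash with
        # "OSError: [Errno 24] Too many open files".
        outfiles = [stack.enter_context(p.open('w')) for p in outpaths]
        for i, line in enumerate(infile):  # assumed round-robin line distribution
            outfiles[i % num_splits].write(line)
```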
Potential solutions (sketched below):
- Open output files one at a time instead
- Run the `split` utility as a subprocess and avoid opening the files in Python at all
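A minimal sketch of the first option, under the same assumed structure as above; only one output file is ever open, regardless of input size:

```python
from pathlib import Path

MAX_INPUT_SPLIT_SIZE = 2**20  # same placeholder as above


def prepare_input_files(input_path: Path, output_dir: Path) -> None:
    """Sketch of a fix: write splits sequentially so only one file is open."""
    part_num = 0
    bytes_written = MAX_INPUT_SPLIT_SIZE  # forces opening the first part
    outfile = None
    with input_path.open() as infile:
        for line in infile:
            # Roll over to the next part once the current one is full.
            if bytes_written >= MAX_INPUT_SPLIT_SIZE:
                if outfile is not None:
                    outfile.close()
                outfile = (output_dir / f"part-{part_num:05d}").open('w')
                part_num += 1
                bytes_written = 0
            outfile.write(line)
            bytes_written += len(line)  # approximates bytes by character count
    if outfile is not None:
        outfile.close()
```

Note this produces contiguous splits rather than round-robin ones, which should be fine for map inputs but is worth double-checking against any partition-balance assumptions elsewhere. For the second option, GNU coreutils `split -C` writes line-bounded chunks of at most a given byte size and manages its own file handles, e.g. `subprocess.run(["split", "-C", str(MAX_INPUT_SPLIT_SIZE), "-d", str(input_path), str(output_dir) + "/part-"], check=True)` (the `-C` flag is GNU-specific; BSD/macOS `split` would need `-l` or `-b` instead).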
(CC + credit for solutions @MattyMay)