
Conversation

@melodell melodell (Member) commented Mar 12, 2024

To support EECS 485 P5 -- the HTML Dataset changes use full Wikipedia pages. We want each file to be read in its entirety by a single map task so that students can use the full document content in the pipeline. Some Wikipedia pages are larger than 1 MB, so let's bump the limit so they don't get split.
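To illustrate the problem being fixed: when a framework caps input splits at a fixed size, any file larger than the cap is divided across multiple map tasks, so no single mapper sees the whole document. The sketch below is illustrative only; the names (`partition_input`, `MAX_SPLIT_SIZE`) are hypothetical, not this project's actual API.

```python
# Hypothetical sketch of max-input-split-size behavior (not this repo's code).
MAX_SPLIT_SIZE = 2**20  # 1 MB cap: files larger than this get split


def partition_input(files):
    """Group input into splits of at most MAX_SPLIT_SIZE bytes each.

    Each split is (filename, offset, length); each split becomes one map task.
    """
    splits = []
    for name, size in files:
        offset = 0
        while offset < size:
            chunk = min(MAX_SPLIT_SIZE, size - offset)
            splits.append((name, offset, chunk))
            offset += chunk
    return splits


# A 1.5 MB wiki page lands in two splits, so two map tasks each see only
# part of the HTML document -- hence the need to raise the limit.
splits = partition_input([("wiki_page.html", int(1.5 * 2**20))])
```

Raising `MAX_SPLIT_SIZE` above the largest input file keeps each page in one split.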

@melodell melodell requested a review from awdeorio March 12, 2024 16:57
@codecov codecov bot commented Mar 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.31%. Comparing base (5408a7d) to head (e02df80).
Report is 6 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop      #74   +/-   ##
========================================
  Coverage    96.31%   96.31%           
========================================
  Files            4        4           
  Lines          217      217           
========================================
  Hits           209      209           
  Misses           8        8           


@awdeorio awdeorio (Contributor) commented

I'd be happy to make the change, but it would be helpful to memorialize a short explanation here.

@noah-weingarden noah-weingarden (Contributor) commented

I think this is redundant because of #70 too?

@melodell melodell (Member, Author) commented

> I'd be happy to make the change, but it would be helpful to memorialize a short explanation here.

Was short on time earlier. Updated now!

@awdeorio awdeorio (Contributor) commented

> I think this is redundant because of #70 too?

There are some changes coming down the pipeline (pun intended) to EECS 485 Project 5. The first stage of the MapReduce pipeline will process pages in parallel map tasks. Currently, the first mapper is a single task which processes all input files (each file is a wiki page).

@awdeorio awdeorio (Contributor) left a comment


LGTM

@awdeorio awdeorio merged commit 44326de into develop Mar 13, 2024
@awdeorio awdeorio deleted the increase-max-input-split-size branch March 13, 2024 01:36
@noah-weingarden noah-weingarden (Contributor) commented

> There are some changes coming down the pipeline (pun intended) to EECS 485 Project 5. The first stage of the MapReduce pipeline will process pages in parallel map tasks. Currently, the first mapper is a single task which processes all input files (each file is a wiki page).

That sounds promising; is it related to this comment of mine?

@awdeorio awdeorio (Contributor) commented

It's different. HTML parsing will be part of the first pipeline stage (stage 0). There will be one mapper per file, which means one execution per HTML document.

@noah-weingarden noah-weingarden (Contributor) commented Mar 13, 2024

Mappers will run per file instead of per line? That sounds like we're not going to be using a streaming interface anymore.

@noah-weingarden noah-weingarden mentioned this pull request Mar 31, 2024

4 participants