
Conversation

@melodell melodell (Member) commented Mar 12, 2024

To support EECS 485 P5 -- the HTML Dataset changes use full Wikipedia pages. We want each file to be read in its entirety by a single map task so that students can use the full document content in the pipeline. Some Wikipedia pages are larger than 1 MB, so let's bump the limit so they don't get split.
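To illustrate the problem being fixed: when a framework caps input splits at a fixed size, any file larger than the cap is divided across multiple map tasks, so no single mapper sees the whole document. The sketch below is illustrative only; the names (`partition_input`, `MAX_SPLIT_SIZE`) are hypothetical, not this project's actual API.

```python
# Hypothetical sketch of max-input-split-size behavior (not this repo's code).
MAX_SPLIT_SIZE = 2**20  # 1 MB cap: files larger than this get split


def partition_input(files):
    """Group input into splits of at most MAX_SPLIT_SIZE bytes each.

    Each split is (filename, offset, length); each split becomes one map task.
    """
    splits = []
    for name, size in files:
        offset = 0
        while offset < size:
            chunk = min(MAX_SPLIT_SIZE, size - offset)
            splits.append((name, offset, chunk))
            offset += chunk
    return splits


# A 1.5 MB wiki page lands in two splits, so two map tasks each see only
# part of the HTML document -- hence the need to raise the limit.
splits = partition_input([("wiki_page.html", int(1.5 * 2**20))])
```

Raising `MAX_SPLIT_SIZE` above the largest input file keeps each page in one split.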

@melodell melodell requested a review from awdeorio March 12, 2024 16:57
@codecov codecov bot commented Mar 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.31%. Comparing base (5408a7d) to head (e02df80).
Report is 6 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop      #74   +/-   ##
========================================
  Coverage    96.31%   96.31%           
========================================
  Files            4        4           
  Lines          217      217           
========================================
  Hits           209      209           
  Misses           8        8           


@awdeorio awdeorio (Contributor) commented

I'd be happy to make the change, but it would be helpful to memorialize a short explanation here.

@noah-weingarden noah-weingarden (Contributor) commented

I think this is redundant because of #70 too?

@melodell melodell (Member, Author) commented

> I'd be happy to make the change, but it would be helpful to memorialize a short explanation here.

Was short on time earlier. Updated now!

@awdeorio awdeorio (Contributor) commented

> I think this is redundant because of #70 too?

There are some changes coming down the pipeline (pun intended) to EECS 485 Project 5. The first stage of the MapReduce pipeline will process pages in parallel map tasks. Currently, the first mapper is a single task which processes all input files (each file is a wiki page).

@awdeorio awdeorio (Contributor) left a comment


LGTM

@awdeorio awdeorio merged commit 44326de into develop Mar 13, 2024
@awdeorio awdeorio deleted the increase-max-input-split-size branch March 13, 2024 01:36
@noah-weingarden noah-weingarden (Contributor) commented

> There are some changes coming down the pipeline (pun intended) to EECS 485 Project 5. The first stage of the MapReduce pipeline will process pages in parallel map tasks. Currently, the first mapper is a single task which processes all input files (each file is a wiki page).

That sounds promising; is it related to this comment of mine?

@awdeorio awdeorio (Contributor) commented

It's different. HTML parsing will be part of the first pipeline stage (stage 0). There will be one mapper per file, which means one execution per HTML document.

@noah-weingarden noah-weingarden (Contributor) commented Mar 13, 2024

Mappers will run per file instead of per line? That sounds like we're not going to be using a streaming interface anymore.

@noah-weingarden noah-weingarden mentioned this pull request Mar 31, 2024

4 participants