The template for this repo is forked from git://g.csail.mit.edu/6.5840-golabs-2025, but the implementation is mine.
This repo will contain all the lab assignments for MIT 6.824: Distributed Systems.
A Go implementation of MapReduce, a programming model for processing large datasets across distributed systems. MapReduce simplifies distributed data processing by abstracting the complexities of parallelization, fault-tolerance, data distribution, and load balancing behind a clean programming interface.
This implementation was built to explore the fundamentals of distributed computing, fault tolerance, and parallel processing.
This project was made using the original MapReduce paper as a reference.
My implementation passes all the test cases except the early-exit test; so far I haven't been able to pinpoint exactly why it fails.
The MapReduce system consists of two main components: the coordinator (master) and the workers.

The coordinator:
- Coordinates the entire MapReduce operation
- Distributes map and reduce tasks to workers
- Monitors worker health and handles failures
- Manages intermediate files and task scheduling
- Maintains state information about task progress
The worker:
- Repeatedly requests tasks from the coordinator until the coordinator exits
- Executes map and reduce tasks as assigned by the master
- Handles input/output operations for their assigned tasks
- Communicates with the master about task completion and status
- Can be dynamically added or removed from the cluster
src/
├── main/   # Main applications and test programs (provided in the template)
├── mrapps/ # MapReduce applications (map and reduce functions) (provided in the template)
└── mr/     # Core MapReduce library implementation (implemented by me)
- The pre-partitioned input is distributed to map workers
- Each map worker applies the map function to its input chunk
- Map function outputs key-value pairs
- Output is partitioned into nReduce buckets using a hash function
- Intermediate files are written to disk
`inter_{map-task-id}_{bucket-id}`
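The bucketing step can be sketched with the FNV-1a hash the lab handout suggests; `bucketFor` is a hypothetical helper name, not part of the actual code:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ihash maps a key to a non-negative integer using FNV-1a,
// as suggested by the lab handout.
func ihash(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() & 0x7fffffff)
}

// bucketFor picks the reduce bucket in [0, nReduce) for a key,
// which determines the inter_{map-task-id}_{bucket-id} file it lands in.
func bucketFor(key string, nReduce int) int {
	return ihash(key) % nReduce
}

func main() {
	for _, k := range []string{"apple", "banana", "cherry"} {
		fmt.Printf("key %q -> bucket %d\n", k, bucketFor(k, 10))
	}
}
```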
- Reduce workers read intermediate files from all map tasks
- Data is sorted by key to group related values
- Reduce function is applied to each unique key and its values
- Final output is written to result files
`mr-out-{bucket-id}`
- Master assigns tasks to available workers
- Workers report task completion back to master
- Master tracks task states and handles failures
- Master detects failed workers through timeouts and automatically reassigns their tasks to healthy workers
- System proceeds to reduce phase after all map tasks complete
Map and Reduce functions are provided dynamically at runtime via Go plugins.
| Concept | Why It Matters | Lab Context |
|---|---|---|
| Dynamic Loading | Enables swapping Map/Reduce logic without recompiling workers | Lets you test different apps (wc, indexer) with the same worker code |
| Type Assertion | Ensures loaded functions match the expected signatures | Prevents runtime errors |
| Plugin System | Go-specific way to share compiled code | Alternative to RPC-based distribution |
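Loading Map and Reduce from a compiled plugin (`go build -buildmode=plugin`) might look like this. Note that in the real lab, `KeyValue` must come from the shared `mr` package for the type assertions to succeed against a plugin; the local definition here is only to keep the sketch self-contained:

```go
package main

import (
	"fmt"
	"os"
	"plugin"
)

// KeyValue is defined locally here for illustration; in practice it
// must be the shared mr.KeyValue type, or the assertions below fail.
type KeyValue struct {
	Key   string
	Value string
}

// loadPlugin opens a compiled .so and type-asserts the exported
// Map and Reduce symbols to the expected signatures.
func loadPlugin(filename string) (func(string, string) []KeyValue, func(string, []string) string, error) {
	p, err := plugin.Open(filename)
	if err != nil {
		return nil, nil, err
	}
	xmapf, err := p.Lookup("Map")
	if err != nil {
		return nil, nil, err
	}
	mapf, ok := xmapf.(func(string, string) []KeyValue)
	if !ok {
		return nil, nil, fmt.Errorf("Map has the wrong signature")
	}
	xreducef, err := p.Lookup("Reduce")
	if err != nil {
		return nil, nil, err
	}
	reducef, ok := xreducef.(func(string, []string) string)
	if !ok {
		return nil, nil, fmt.Errorf("Reduce has the wrong signature")
	}
	return mapf, reducef, nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: worker xxx.so")
		return
	}
	mapf, reducef, err := loadPlugin(os.Args[1])
	if err != nil {
		fmt.Println("cannot load plugin:", err)
		return
	}
	_, _ = mapf, reducef
}
```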
This specific implementation is meant for a single machine with threads cosplaying as worker nodes, so Unix sockets seem like a better fit: they are faster and lighter for communication between processes on the same machine because they avoid the full network stack.
| Unix Domain Socket | TCP Socket |
|---|---|
| Communicates via a file on disk | Communicates via an IP address and port |
| Only works for processes on the same host | Can work across machines |
| Lower overhead, faster for local IPC | Slightly more overhead; full network stack involved |
| Access controlled by filesystem permissions | Access controlled by firewall, IP, etc. |
- MapReduce: Simplified Data Processing on Large Clusters - Original Google paper
- MIT 6.824 Distributed Systems - Course materials and assignments
- Go RPC Package Documentation - RPC implementation details