Author: Wenlue Jeff Zhong
Project Date: 05/23/2024
This project demonstrates the implementation of a parallel web scraping system designed to scrape product data from an e-commerce demo site. The system includes three versions:
- Sequential Implementation
- Parallel Implementation using Channels
- Work-Stealing Implementation
To run the benchmark script and reproduce the results, use the following command:
python benchmark/benchmark.pyThis script will execute each implementation (sequential, parallel channel-based, and work-stealing) 10 times, calculate the average execution time, and generate performance analysis.
For examining specific implementations:
Sequential Implementation:
./scraper --seqParallel Implementation using Channels:
./scraper --chan --threads <numThreads>Work-Stealing Implementation:
./scraper --workstealing --threads <numThreads>The final report summarizing the results and analysis can be found at:
benchmark/final_report.pdfmain.go: The main entry point for the program.sequential/sequential.go: Contains the sequential implementation.parallel/parallel.go: Contains the parallel implementation using channels.workstealing/workstealing.go: Contains the work-stealing implementation.util/util.go: Contains utility functions for downloading images and saving to CSV.benchmark/benchmark.py: The benchmark script for running and analyzing the implementations.benchmark/speedup_graph.png: The generated speedup graph.benchmark/benchmark_results.txt: The raw results of the benchmark runs.benchmark/final_report.pdf: The final report in PDF format.
- Go: Ensure you have Go installed on your system.
- Python: For running the benchmark script and generating graphs.