A lightweight Python project for scraping product details from any API and saving the results in JSON format.
This repo contains two implementations:
- `api_call_async.py`: asynchronous scraping using `aiohttp` (recommended for high concurrency)
- `api_call_multi_thread.py`: multi-threaded scraping using `requests` and `ThreadPoolExecutor`
- Reads a list of product IDs from an Excel file
- Fetches product details from a configurable API endpoint in batches
- Writes:
  - successful records to a `fetched_data` folder
  - failed/errored records to a `faulty_data` folder
- Uses a rotating `User-Agent` pool to reduce blocking
- Implements retries + exponential backoff for transient failures (e.g., `429 Too Many Requests`)
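The rotating `User-Agent` pool could look roughly like the sketch below. It assumes `user_agents.json` holds a flat JSON array of strings (the file's real layout is not shown in this README), and the helper names `load_user_agents` and `random_headers` are hypothetical.

```python
import json
import random


def load_user_agents(raw_json: str) -> list[str]:
    """Parse a JSON array of User-Agent strings (assumed file layout)."""
    agents = json.loads(raw_json)
    if not isinstance(agents, list) or not agents:
        raise ValueError("expected a non-empty JSON array of User-Agent strings")
    return [str(agent) for agent in agents]


def random_headers(agents: list[str]) -> dict[str, str]:
    """Build request headers with a randomly chosen User-Agent (hypothetical helper)."""
    return {"User-Agent": random.choice(agents)}


# Example pool; in the project this text would be read from user_agents.json.
SAMPLE = '["Mozilla/5.0 (X11; Linux x86_64)", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"]'
agents = load_user_agents(SAMPLE)
headers = random_headers(agents)
```

Picking a fresh header per request (rather than per batch) spreads requests across the pool more evenly, at the cost of one extra `random.choice` call each time.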
- Batch-driven scraping with configurable batch size
- Rotating `User-Agent` pool to reduce blocking
- Retry + exponential backoff for transient failures (e.g., `429 Too Many Requests`)
- Output results to JSON files (fetched / faulty)
- Optional async mode for better throughput
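As an illustration of the retry behaviour, here is a minimal sketch of an exponential backoff schedule. The base delay, cap, jitter, and the rule for which status codes count as transient are all assumptions; the scripts may use different values.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0, jitter: float = 0.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based): base * 2**attempt, capped."""
    delay = min(cap, base * (2 ** attempt))
    # Optional random jitter keeps many workers from retrying at the same instant.
    return delay + random.uniform(0, jitter)


def should_retry(status: int) -> bool:
    """Treat 429 and 5xx as transient (an assumption, not necessarily the scripts' rule)."""
    return status == 429 or 500 <= status < 600


# Delays for the first five retries with the default settings.
delays = [backoff_delay(n) for n in range(5)]
```

With the defaults, the wait doubles on each attempt until it reaches the cap, so a persistent `429` slows the client down quickly instead of hammering the endpoint.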
api_call_async.py
api_call_multi_thread.py
user_agents.json
async_fetched_data/
async_faulty_data/
fetched_data/
faulty_data/
.env.example
README.md
- Python 3.10+
- `pip` available
- Create a Python environment (recommended):

      python -m venv .venv
      .venv\Scripts\activate
- Install dependencies:

      pip install -r requirements.txt

  If there is no `requirements.txt`, install the packages manually (`requests` is needed by the multi-threaded script):

      pip install aiohttp requests python-dotenv pandas python-calamine
- Copy `.env.example` to `.env` and fill in your settings:

      DATA_ID_SOURCE="path/to/excel.xlsx"
      FETCHED_DATA_SINK="async_fetched_data"
      FAULTY_DATA_SINK="async_faulty_data"
      API_URL="https://api.tiki.vn/product-detail/api/v1/products"
      BATCH_SIZE=100
      ASYNC_CONCURRENT_LIMIT=50
      USER_AGENT_LIST="user_agents.json"
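Since every `.env` value arrives as a string, the numeric settings need converting before use. The sketch below shows one way to do that with early validation. The `Settings` class and `load_settings` function are hypothetical names, and the code reads from a plain mapping so it runs without a `.env` file; in the project, `load_dotenv()` from `python-dotenv` would typically populate `os.environ` first.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    data_id_source: str
    api_url: str
    batch_size: int
    async_concurrent_limit: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping of environment variables, failing early on bad values."""

    def positive_int(name: str) -> int:
        raw = env.get(name, "")
        try:
            value = int(raw)
        except ValueError:
            raise ValueError(f"{name} must be an integer, got {raw!r}") from None
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")
        return value

    return Settings(
        data_id_source=env["DATA_ID_SOURCE"],
        api_url=env["API_URL"],
        batch_size=positive_int("BATCH_SIZE"),
        async_concurrent_limit=positive_int("ASYNC_CONCURRENT_LIMIT"),
    )


# Values copied from the example .env above.
example = load_settings({
    "DATA_ID_SOURCE": "path/to/excel.xlsx",
    "API_URL": "https://api.tiki.vn/product-detail/api/v1/products",
    "BATCH_SIZE": "100",
    "ASYNC_CONCURRENT_LIMIT": "50",
})
```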
    python api_call_async.py

This will:
- Read product IDs from the Excel file configured by `DATA_ID_SOURCE`
- Fetch product details in batches concurrently
- Write successful results to the `FETCHED_DATA_SINK` folder
- Write failed/errored results to the `FAULTY_DATA_SINK` folder

This mode is ideal for high throughput using `aiohttp` and asynchronous I/O.
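The concurrency cap in async mode can be sketched with an `asyncio.Semaphore`. This is an assumption about how `ASYNC_CONCURRENT_LIMIT` might be applied, not a copy of `api_call_async.py`; `fetch_product` is a stand-in that sleeps instead of issuing an `aiohttp` request, so the example runs offline.

```python
import asyncio


async def fetch_product(product_id: int) -> dict:
    """Stand-in for an aiohttp request; sleeps briefly instead of calling the API."""
    await asyncio.sleep(0.01)
    return {"id": product_id}


async def fetch_batch(product_ids: list[int], limit: int) -> list[dict]:
    """Fetch all IDs concurrently, with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(pid: int) -> dict:
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await fetch_product(pid)

    # gather returns results in the same order as the input IDs.
    return await asyncio.gather(*(bounded(pid) for pid in product_ids))


results = asyncio.run(fetch_batch([101, 102, 103, 104], limit=2))
```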
    python api_call_multi_thread.py

This mode uses a thread pool for synchronous HTTP requests and is useful for debugging or when asyncio is not desired.
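A rough sketch of the thread-pool approach follows. It assumes each product ID is submitted as its own task and that failures are collected separately, mirroring the fetched/faulty split; `fetch_product` is a stand-in for a `requests.get` call and the record shapes are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_product(product_id: int) -> dict:
    """Stand-in for a requests.get call; raises for negative IDs to show the faulty path."""
    if product_id < 0:
        raise ValueError(f"invalid product id: {product_id}")
    return {"id": product_id}


def fetch_batch(product_ids: list[int], max_workers: int) -> tuple[list[dict], list[dict]]:
    """Return (fetched, faulty) records for one batch."""
    fetched, faulty = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its ID so failures can be reported by ID.
        futures = {pool.submit(fetch_product, pid): pid for pid in product_ids}
        for future in as_completed(futures):
            pid = futures[future]
            try:
                fetched.append(future.result())
            except Exception as exc:
                faulty.append({"id": pid, "error": str(exc)})
    return fetched, faulty


fetched, faulty = fetch_batch([1, 2, -3, 4], max_workers=3)
```

Because `as_completed` yields futures in completion order, the fetched list is not guaranteed to match the input order.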
- Batch size: change `BATCH_SIZE` in `.env`.
- Async concurrency: adjust `ASYNC_CONCURRENT_LIMIT` in `.env`.
- Thread pool size (multi-threaded mode): adjust `MAX_WORKERS` in `.env`.
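To show what the batch size controls, here is a minimal sketch of splitting a list of product IDs into numbered batches. The `iter_batches` name and the 1-based numbering are assumptions for illustration.

```python
from typing import Iterator, Sequence


def iter_batches(ids: Sequence[int], batch_size: int) -> Iterator[tuple[int, Sequence[int]]]:
    """Yield (batch_no, batch) pairs, numbering batches from 1."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for batch_no, start in enumerate(range(0, len(ids), batch_size), start=1):
        # The final batch may be shorter than batch_size.
        yield batch_no, ids[start:start + batch_size]


batches = list(iter_batches(list(range(1, 8)), batch_size=3))
```

Smaller batches give more frequent output files and less lost work on a crash; larger batches reduce per-batch overhead.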
Output is written with the `write_to_json()` helper, which accepts optional keyword arguments such as `pre_fix` and `batch_no` to build descriptive filenames.
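One possible shape for such a helper is sketched below. Only the function name and the `pre_fix` / `batch_no` keyword arguments come from this README; the positional parameters, the default prefix, and the `<pre_fix>_batch_<batch_no>.json` naming scheme are assumptions.

```python
import json
import tempfile
from pathlib import Path


def write_to_json(records: list[dict], folder: str, **kwargs) -> Path:
    """Write records to <folder>/<pre_fix>_batch_<batch_no>.json (assumed naming scheme)."""
    pre_fix = kwargs.get("pre_fix", "data")
    batch_no = kwargs.get("batch_no")
    name = pre_fix if batch_no is None else f"{pre_fix}_batch_{batch_no}"
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{name}.json"
    with path.open("w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps non-ASCII product names readable in the file.
        json.dump(records, fh, ensure_ascii=False, indent=2)
    return path


# Write one small batch to a temporary folder and read it back.
with tempfile.TemporaryDirectory() as tmp:
    out = write_to_json([{"id": 1, "name": "Sách"}], tmp, pre_fix="fetched", batch_no=3)
    saved_name = out.name
    saved = json.loads(out.read_text(encoding="utf-8"))
```

Accepting `**kwargs` means callers can pass extra naming hints without the helper raising a `TypeError` for arguments it does not use.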
- `TypeError: write_to_json() got an unexpected keyword argument ...`
  - Ensure `write_to_json()` is defined with `**kwargs`.
- Too many `429 Too Many Requests` errors
  - Reduce concurrency (async: `ASYNC_CONCURRENT_LIMIT`, thread: `MAX_WORKERS`)
  - Increase backoff delay between retries
- Missing/invalid `.env` values
  - Confirm `.env` exists and is formatted correctly (numeric values for `BATCH_SIZE`, `ASYNC_CONCURRENT_LIMIT`, `MAX_WORKERS`).
Changes made:
- Multi-threading method for API calls:
  - Implemented a new data loading library for reading large Excel files efficiently:
    - openpyxl with optimized read-only mode
    - pandas with calamine for quick Excel file parsing, giving better performance and lower memory usage
  - Adjusted naming conventions for better clarity and consistency across the codebase
  - Refactored data batching logic to allow for configurable batch sizes, improving flexibility and scalability of the data processing pipeline
- Async method for API calls:
  - Implemented asynchronous API calls using aiohttp to improve performance and reduce latency when fetching data from the API
  - Added error handling for HTTP status codes, including retries with exponential backoff for transient errors like 429 Too Many Requests
  - Refactored code to separate concerns, improving readability and maintainability of the codebase
Future improvements:
- Implement logging instead of print statements for better monitoring and debugging
- Add unit tests for critical functions to ensure reliability and facilitate future refactoring
For questions or contributions, please open an issue or submit a pull request. Email: duongghuy96@gmail.com
MIT-style (adapt as needed)