Data Scraping API

A lightweight Python project that fetches product details from a configurable API endpoint and saves the results as JSON files.

This repo contains two implementations:

api_call_async.py — asynchronous scraping using aiohttp (recommended for high concurrency)
api_call_multi_thread.py — multi-threaded scraping using requests and ThreadPoolExecutor

✅ What this project does

Reads a list of product IDs from an Excel file
Fetches product details from a configurable API endpoint in batches
Writes:
- successful records to a fetched_data folder
- failed/errored records to a faulty_data folder
Uses a rotating User-Agent pool to reduce blocking
Implements retries + exponential backoff for transient failures (e.g., 429 Too Many Requests)
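The retry and User-Agent rotation described above can be sketched roughly as follows. This is a minimal illustration, not the code in this repo: the `send` callable, the `fetch_with_retry` and `backoff_delay` names, the sample User-Agent strings, and the set of retryable status codes are all assumptions made for the example.

```python
import random
import time

# Hypothetical pool; the project loads its own list from user_agents.json.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

# Assumed set of status codes treated as transient.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Seconds to wait after failed attempt `attempt` (0-based): base * 2**attempt, capped."""
    return min(cap, base * (2 ** attempt))


def fetch_with_retry(send, product_id, max_retries=4, sleep=time.sleep):
    """Call send(product_id, headers) until it returns a non-retryable status.

    `send` is a stand-in for the real HTTP call and must return (status, body).
    """
    for attempt in range(max_retries + 1):
        # Pick a fresh User-Agent for every attempt.
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        status, body = send(product_id, headers)
        if status not in RETRYABLE_STATUSES:
            return status, body
        if attempt < max_retries:
            sleep(backoff_delay(attempt))
    # Retries exhausted; the caller can record this ID as faulty.
    return status, body
```

With the defaults, the waits between attempts grow as 1, 2, 4 and 8 seconds. Raising `base` is one way to slow down when the API keeps answering 429.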

✅ Features

Batch-driven scraping with configurable batch size
Rotating User-Agent pool to reduce blocking
Retry + exponential backoff for transient failures (e.g., 429 Too Many Requests)
Output results to JSON files (fetched / faulty)
Optional async mode for better throughput
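As an illustration of batch-driven processing, the sketch below splits a list of product IDs into chunks of a configurable size. The `make_batches` name is hypothetical and the repo's own batching helper may look different.

```python
def make_batches(ids, batch_size):
    """Yield consecutive slices of `ids`, each holding at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


# Example: five IDs with a batch size of two give three batches,
# the last one shorter than the others.
for batch_no, batch in enumerate(make_batches([101, 102, 103, 104, 105], 2), start=1):
    print(batch_no, batch)
```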

🧩 Project Structure

api_call_async.py
api_call_multi_thread.py
user_agents.json
async_fetched_data/
async_faulty_data/
fetched_data/
faulty_data/
.env.example
README.md

⚙️ Prerequisites

Python 3.10+
pip available

⚙️ Setup

Create a Python virtual environment (recommended):

python -m venv .venv
.venv\Scripts\activate

On macOS or Linux, activate it with source .venv/bin/activate instead.

Install dependencies:
```
pip install -r requirements.txt
```
If there is no requirements.txt, install the required packages manually:
```
pip install aiohttp requests python-dotenv pandas python-calamine
```

Copy .env.example to .env and fill in your settings:

DATA_ID_SOURCE="path/to/excel.xlsx"
FETCHED_DATA_SINK="async_fetched_data"
FAULTY_DATA_SINK="async_faulty_data"
API_URL="https://api.tiki.vn/product-detail/api/v1/products"
BATCH_SIZE=100
ASYNC_CONCURRENT_LIMIT=50
USER_AGENT_LIST="user_agents.json"
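A possible way to read and validate these settings is sketched below. It assumes the values are already present in the process environment (for example after python-dotenv's `load_dotenv()` has run); the `load_settings` function and its error messages are illustrative and not taken from the repo.

```python
import os

TEXT_KEYS = (
    "DATA_ID_SOURCE",
    "FETCHED_DATA_SINK",
    "FAULTY_DATA_SINK",
    "API_URL",
    "USER_AGENT_LIST",
)
INT_KEYS = ("BATCH_SIZE", "ASYNC_CONCURRENT_LIMIT")


def load_settings(env=None):
    """Return a dict of settings, converting numeric keys to positive ints."""
    env = os.environ if env is None else env
    missing = [key for key in TEXT_KEYS + INT_KEYS if not env.get(key)]
    if missing:
        raise ValueError(f"Missing .env values: {', '.join(missing)}")

    settings = {key: env[key] for key in TEXT_KEYS}
    for key in INT_KEYS:
        try:
            value = int(env[key])
        except ValueError:
            raise ValueError(f"{key} must be an integer, got {env[key]!r}") from None
        if value <= 0:
            raise ValueError(f"{key} must be positive, got {value}")
        settings[key] = value
    return settings
```

Failing early on a missing or non-numeric value makes a broken .env easier to spot than an error raised in the middle of a run.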

▶️ Running the async API function (recommended)

python api_call_async.py

This will:

Read product IDs from the Excel file configured by DATA_ID_SOURCE
Fetch product details in batches concurrently
Write successful results to the FETCHED_DATA_SINK folder
Write failed/errored results to the FAULTY_DATA_SINK folder
This mode is ideal for high throughput using aiohttp and asynchronous I/O.
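The way a concurrency limit such as ASYNC_CONCURRENT_LIMIT can cap in-flight requests is sketched below with `asyncio.Semaphore`. To keep the example self-contained, `fetch_one` is a stand-in that returns a fake record instead of making a real aiohttp request, and `fetch_batch` is a hypothetical name.

```python
import asyncio


async def fetch_one(product_id):
    """Stand-in for the real aiohttp request; returns a fake record."""
    await asyncio.sleep(0)
    return {"id": product_id}


async def fetch_batch(product_ids, limit, fetch=fetch_one):
    """Fetch all IDs concurrently, with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(pid):
        async with semaphore:
            return await fetch(pid)

    # return_exceptions=True keeps one failure from cancelling the whole batch.
    results = await asyncio.gather(
        *(guarded(pid) for pid in product_ids), return_exceptions=True
    )
    fetched = [r for r in results if not isinstance(r, Exception)]
    faulty = [
        {"id": pid, "error": repr(r)}
        for pid, r in zip(product_ids, results)
        if isinstance(r, Exception)
    ]
    return fetched, faulty


if __name__ == "__main__":
    print(asyncio.run(fetch_batch([1, 2, 3], limit=2)))
```

The two returned lists correspond to what would be written to the fetched and faulty folders.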

▶️ Running the multi-threaded scraper

python api_call_multi_thread.py

This mode uses a thread pool for synchronous HTTP requests and is useful for debugging or when asyncio is not desired.
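A thread-pool version of the same idea might look like the sketch below. Here `fetch_one` is a stand-in for a real `requests` call, and `fetch_batch_threaded` is a hypothetical name rather than a function from api_call_multi_thread.py.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_one(product_id):
    """Stand-in for the real requests-based HTTP call; returns a fake record."""
    return {"id": product_id}


def fetch_batch_threaded(product_ids, max_workers, fetch=fetch_one):
    """Run `fetch` for every ID on a pool of `max_workers` threads."""
    fetched, faulty = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its product ID so failures can be attributed.
        futures = {pool.submit(fetch, pid): pid for pid in product_ids}
        for future in as_completed(futures):
            pid = futures[future]
            try:
                fetched.append(future.result())
            except Exception as exc:
                faulty.append({"id": pid, "error": repr(exc)})
    return fetched, faulty
```

Results arrive in completion order, not input order, so sort them afterwards if order matters.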

💡 Customization

Batch size

Change BATCH_SIZE in .env.

Concurrent limit (async)

Adjust ASYNC_CONCURRENT_LIMIT in .env.

Worker limit (multi-threaded)

Adjust MAX_WORKERS in .env. This key does not appear in the sample settings under Setup, so add it if it is missing.

Output file naming

Uses the write_to_json() helper and supports optional keyword args like pre_fix and batch_no for descriptive filenames.
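One plausible shape for such a helper is sketched below. Only the function name and the pre_fix and batch_no keyword arguments come from this README; the positional parameters, the default prefix, and the filename pattern are assumptions made for illustration.

```python
import json
from pathlib import Path


def write_to_json(records, folder, **kwargs):
    """Write `records` to a JSON file in `folder` and return the file path.

    Optional keyword args (filename pattern is an assumption):
      pre_fix  -- filename prefix, "data" if omitted
      batch_no -- batch number appended to the filename when given
    """
    pre_fix = kwargs.get("pre_fix", "data")
    batch_no = kwargs.get("batch_no")
    name = pre_fix if batch_no is None else f"{pre_fix}_batch_{batch_no}"

    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the sink folder if needed
    path = out_dir / f"{name}.json"
    with path.open("w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)
    return path
```

Accepting `**kwargs` means callers can pass extra naming hints without triggering the unexpected-keyword TypeError.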

🧪 Notes / Troubleshooting

TypeError: write_to_json() got an unexpected keyword argument ...
- Ensure write_to_json() is defined with **kwargs.
Too many 429 Too Many Requests errors
- Reduce concurrency (async: ASYNC_CONCURRENT_LIMIT, thread: MAX_WORKERS)
- Increase backoff delay between retries
Missing/invalid .env values
- Confirm .env exists and is formatted correctly (numeric values for BATCH_SIZE, ASYNC_CONCURRENT_LIMIT, MAX_WORKERS).

📓 Development Notes

06 Mar 2026

Changes made:

Multi-threaded method for API calls:
- Added new data-loading options for reading large Excel files efficiently:
  - openpyxl in read-only mode
  - pandas with calamine for fast Excel parsing, giving better performance and lower memory usage
- Adjusted naming conventions for clarity and consistency across the codebase
- Refactored the data batching logic to support configurable batch sizes, making the pipeline more flexible and scalable
Async method for API calls:
- Implemented asynchronous API calls using aiohttp to improve throughput and reduce total fetch time
- Added error handling for HTTP status codes, including retries with exponential backoff for transient errors like 429 Too Many Requests
- Refactored code to separate concerns, improving readability and maintainability

Future improvements:

Implement logging instead of print statements for better monitoring and debugging
Add unit tests for critical functions to ensure reliability and facilitate future refactoring

📨 Contact

For questions or contributions, please open an issue or submit a pull request. Email: duongghuy96@gmail.com

📌 License

MIT-style (adapt as needed)
