huy-dg/data_scraping_API

Data Scraping API

A lightweight Python project that scrapes product details from a configurable API endpoint and saves the results as JSON.

This repo contains two implementations:

  • api_call_async.py — asynchronous scraping using aiohttp (recommended for high concurrency)
  • api_call_multi_thread.py — multi-threaded scraping using requests and ThreadPoolExecutor

✅ What this project does

  • Reads a list of product IDs from an Excel file
  • Fetches product details from a configurable API endpoint in batches
  • Writes:
    • successful records to a fetched_data folder
    • failed/errored records to a faulty_data folder
  • Uses a rotating User-Agent pool to reduce blocking
  • Implements retries + exponential backoff for transient failures (e.g., 429 Too Many Requests)
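The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: `TransientError`, `fetch_with_backoff`, and the `max_retries` / `base_delay` defaults are hypothetical names and values chosen for the example, and the `fetch` callable stands in for whatever performs the HTTP request.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error raised by `fetch` for retryable failures such as HTTP 429."""


def fetch_with_backoff(fetch, product_id, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(product_id), retrying transient failures with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(product_id)
        except TransientError:
            if attempt == max_retries:
                raise  # out of retries: the caller can record the ID as faulty
            # The delay doubles on each attempt (1s, 2s, 4s, ...) plus a little jitter
            # so that many workers do not retry at exactly the same moment.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting; raising `base_delay` is one way to increase the backoff when 429 responses are frequent.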

✅ Features

  • Batch-driven scraping with configurable batch size
  • Retry + exponential backoff for transient failures (e.g., 429 Too Many Requests)
  • Output results to JSON files (fetched / faulty)
  • Optional async mode for better throughput
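As a rough illustration of the User-Agent rotation, the sketch below loads a pool from a JSON file and picks one entry per request. It assumes `user_agents.json` holds a flat JSON list of strings, which the README does not confirm; `load_user_agents` and `build_headers` are hypothetical helper names.

```python
import json
import random


def load_user_agents(path):
    """Load the pool from a JSON file assumed to hold a flat list of strings."""
    with open(path, encoding="utf-8") as f:
        agents = json.load(f)
    if not isinstance(agents, list) or not agents:
        raise ValueError(f"{path} must contain a non-empty JSON list")
    return agents


def build_headers(agents):
    """Pick a random User-Agent from the pool for a single request."""
    return {"User-Agent": random.choice(agents), "Accept": "application/json"}
```

Calling `build_headers(agents)` once per request, rather than once per run, is what spreads traffic across the pool.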

🧩 Project Structure

api_call_async.py
api_call_multi_thread.py
user_agents.json
async_fetched_data/
async_faulty_data/
fetched_data/
faulty_data/
.env.example
README.md

⚙️ Prerequisites

  • Python 3.10+
  • pip available

⚙️ Setup

  1. Create a Python environment (recommended):

    python -m venv .venv
    .venv\Scripts\activate        # Windows
    source .venv/bin/activate     # macOS/Linux
  2. Install dependencies:

    pip install -r requirements.txt

    If there is no requirements.txt, install the used packages manually:

    pip install aiohttp requests python-dotenv pandas python-calamine
  3. Copy .env.example to .env and fill in your settings:

    DATA_ID_SOURCE="path/to/excel.xlsx"
    FETCHED_DATA_SINK="async_fetched_data"
    FAULTY_DATA_SINK="async_faulty_data"
    API_URL="https://api.tiki.vn/product-detail/api/v1/products"
    BATCH_SIZE=100
    ASYNC_CONCURRENT_LIMIT=50
    USER_AGENT_LIST="user_agents.json"
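One possible way to read and validate these settings is sketched below. It assumes the `.env` file has already been loaded into the process environment (for example with python-dotenv's `load_dotenv()`); the `Settings` class and the `read_positive_int` / `load_settings` helpers are hypothetical and only cover a subset of the variables.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    data_id_source: str
    api_url: str
    batch_size: int
    async_concurrent_limit: int


def read_positive_int(env, key):
    """Return env[key] as a positive int, or raise with a readable message."""
    raw = env.get(key, "")
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{key} must be an integer, got {raw!r}") from None
    if value <= 0:
        raise ValueError(f"{key} must be positive, got {value}")
    return value


def load_settings(env=os.environ):
    """Build Settings from a mapping; pass os.environ once the .env file is loaded."""
    return Settings(
        data_id_source=env["DATA_ID_SOURCE"],
        api_url=env["API_URL"],
        batch_size=read_positive_int(env, "BATCH_SIZE"),
        async_concurrent_limit=read_positive_int(env, "ASYNC_CONCURRENT_LIMIT"),
    )
```

Failing early with the variable name in the message makes a missing or non-numeric value easier to spot than an error deep inside the scraping loop.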

▶️ Running the async API function (recommended)

python api_call_async.py

This will:

  • Read product IDs from the Excel file configured by DATA_ID_SOURCE
  • Fetch product details in batches concurrently
  • Write successful results to the FETCHED_DATA_SINK folder
  • Write failed/errored results to the FAULTY_DATA_SINK folder

This mode is ideal for high throughput because it relies on aiohttp and asynchronous I/O.
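The concurrency cap can be pictured with the sketch below, which bounds in-flight requests with an `asyncio.Semaphore`. It is an assumption about how a limit like `ASYNC_CONCURRENT_LIMIT` might be enforced, not the script's confirmed design; `fake_fetch` is a stand-in for an aiohttp request so the example runs without network access, and `fetch_batch` is a hypothetical name.

```python
import asyncio


async def fake_fetch(product_id):
    """Stand-in for an aiohttp GET; sleeps briefly instead of using the network."""
    await asyncio.sleep(0.01)
    return {"id": product_id}


async def fetch_batch(product_ids, limit, fetch=fake_fetch):
    """Fetch one batch (a list of IDs) with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(product_id):
        async with semaphore:
            return await fetch(product_id)

    # return_exceptions=True collects failures as values instead of raising the first one.
    results = await asyncio.gather(
        *(guarded(p) for p in product_ids), return_exceptions=True
    )
    fetched = [r for r in results if not isinstance(r, Exception)]
    faulty = [p for p, r in zip(product_ids, results) if isinstance(r, Exception)]
    return fetched, faulty
```

A call such as `asyncio.run(fetch_batch(ids, limit=50))` returns the successful records and the IDs that failed, which maps onto the fetched / faulty split described above.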

▶️ Running the multi-threaded scraper

python api_call_multi_thread.py

This mode uses a thread pool for synchronous HTTP requests and is useful for debugging or when asyncio is not desired.
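A thread-pool version of the same idea might look like the sketch below. `fetch_batch_threaded` is a hypothetical name, `fetch` stands in for a function that issues one `requests` call, and the default of 8 workers is an arbitrary placeholder rather than the project's setting.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_batch_threaded(product_ids, fetch, max_workers=8):
    """Run fetch(product_id) across a thread pool and split results by outcome."""
    fetched, faulty = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its product ID so failures can be attributed.
        futures = {pool.submit(fetch, pid): pid for pid in product_ids}
        for future in as_completed(futures):
            pid = futures[future]
            try:
                fetched.append(future.result())
            except Exception as exc:  # any failure marks the ID as faulty
                faulty.append({"id": pid, "error": str(exc)})
    return fetched, faulty
```

Because `as_completed` yields futures in completion order, the output order is not guaranteed to match the input order.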


💡 Customization

Batch size

Change BATCH_SIZE in .env.
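To show what the batch size controls, here is a minimal sketch of splitting a list of IDs into fixed-size batches. `iter_batches` is a hypothetical helper, not necessarily how the scripts slice their input.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

For example, `list(iter_batches([1, 2, 3, 4, 5], 2))` gives `[[1, 2], [3, 4], [5]]`; the last batch is simply shorter when the total is not a multiple of the batch size.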

Concurrent limit (async)

Adjust ASYNC_CONCURRENT_LIMIT in .env.

Worker limit (multi-threaded)

Adjust MAX_WORKERS in .env (add the variable if your .env does not define it yet).

Output file naming

Uses the write_to_json() helper and supports optional keyword args like pre_fix and batch_no for descriptive filenames.
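A helper along these lines could be shaped as in the sketch below. Only the name `write_to_json` and the `pre_fix` / `batch_no` keyword arguments come from this README; the positional parameters, the default prefix, and the `<pre_fix>_batch_<batch_no>.json` naming pattern are assumptions made for illustration.

```python
import json
from pathlib import Path


def write_to_json(records, folder, **kwargs):
    """Write records to a JSON file in `folder` and return the file path."""
    pre_fix = kwargs.get("pre_fix", "data")  # assumed default prefix
    batch_no = kwargs.get("batch_no")
    # Assumed naming pattern: "<pre_fix>_batch_<batch_no>.json"
    name = pre_fix if batch_no is None else f"{pre_fix}_batch_{batch_no}"
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{name}.json"
    with path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII product names readable in the file.
        json.dump(records, f, ensure_ascii=False, indent=2)
    return path
```

Accepting `**kwargs` means callers can pass extra naming hints without triggering an "unexpected keyword argument" error.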


🧪 Notes / Troubleshooting

  • TypeError: write_to_json() got an unexpected keyword argument ...

    • Ensure write_to_json() is defined with **kwargs.
  • Too many 429 Too Many Requests errors

    • Reduce concurrency (async: ASYNC_CONCURRENT_LIMIT, thread: MAX_WORKERS)
    • Increase backoff delay between retries
  • Missing/invalid .env values

    • Confirm .env exists and is formatted correctly (numeric values for BATCH_SIZE, ASYNC_CONCURRENT_LIMIT, MAX_WORKERS).

📓 Development Notes

06 Mar 2026

Changes made:

  • Multi-threaded method for API calls:

    • Implemented new data-loading options for reading large Excel files efficiently:
      • openpyxl with optimized read-only mode
      • pandas with calamine for quick Excel file parsing (better performance and lower memory usage)
    • Adjusted naming conventions for clarity and consistency across the codebase
    • Refactored the data batching logic to support configurable batch sizes, making the processing pipeline more flexible and scalable
  • Async method for API calls:

    • Implemented asynchronous API calls using aiohttp to improve throughput and reduce latency when fetching data from the API
    • Added error handling for HTTP status codes, including retries with exponential backoff for transient errors like 429 Too Many Requests
    • Refactored the code to separate concerns, improving readability and maintainability

Future improvements:

  • Implement logging instead of print statements for better monitoring and debugging
  • Add unit tests for critical functions to ensure reliability and facilitate future refactoring

📨 Contact

For questions or contributions, please open an issue or submit a pull request. Email: duongghuy96@gmail.com


📌 License

MIT-style (adapt as needed)

About

Crawls 200,000 product records from the Tiki website via its API.
