shonenxnaifu/jsi-scraper
JSI Scraper

A web scraping project specifically designed to collect data from jogjasonicindex.com by parsing HTML elements. Of course, this is ✨vibe coded✨ using QWEN CODE.

Setup

  1. Make sure you have Python 3.11+ installed (Python 3.10 is also supported when using Docker)
  2. Install the required dependencies:
pip install -r requirements.txt

Available Libraries

This project includes the following libraries specifically for scraping jogjasonicindex.com:

  • requests - For making HTTP requests
  • beautifulsoup4 - For parsing HTML/XML
  • fastapi - For API framework
  • uvicorn - For ASGI server
  • pydantic - For data validation

Getting Started

The scrape.py file contains a basic template to start your scraping project for jogjasonicindex.com.
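The contents of scrape.py are not reproduced here, but the requests + BeautifulSoup pattern it is built on can be sketched as below. The CSS selectors and field names are hypothetical placeholders; the real ones depend on jogjasonicindex.com's actual markup.

```python
# Minimal sketch of the requests + BeautifulSoup scraping pattern.
# "div.project", "h2", and the dict keys are hypothetical -- adapt
# them to the site's real HTML structure.
import requests
from bs4 import BeautifulSoup


def parse_projects(html: str) -> list[dict]:
    """Extract project entries from one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    projects = []
    for item in soup.select("div.project"):  # hypothetical selector
        title = item.select_one("h2")
        link = item.select_one("a")
        projects.append({
            "title": title.get_text(strip=True) if title else None,
            "url": link["href"] if link else None,
        })
    return projects


def scrape_page(url: str) -> list[dict]:
    """Fetch a page and parse it, with a per-request timeout."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return parse_projects(resp.text)
```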

For API usage, see the main FastAPI application in app/main.py.

API Usage

The project now includes a FastAPI application with four endpoints:

Starting the API Server

cd app && uvicorn main:app --host 0.0.0.0 --port 8000

Running with Docker

You can also run this application using Docker:

  1. Build the Docker image:
docker build -t jsi-scraper .
  2. Run the container:
docker run -p 8000:8000 jsi-scraper

The application will be available at http://localhost:8000.

Endpoints

  1. JSON Endpoint: /scrape/json

    • Method: GET
    • Parameters:
      • max_pages (optional): Limit the number of project pages to scrape
    • Returns: JSON formatted scraped data from jogjasonicindex.com
  2. CSV Endpoint: /scrape/csv

    • Method: GET
    • Parameters:
      • max_pages (optional): Limit the number of project pages to scrape
    • Returns: CSV formatted scraped data from jogjasonicindex.com as a downloadable file
  3. JSON Range Endpoint: /scrape/json/range

    • Method: GET
    • Parameters:
      • page_from (required): Starting page number (≥ 1)
      • page_to (required): Ending page number (≥ page_from)
    • Returns: JSON formatted scraped data from jogjasonicindex.com for the specified page range
  4. CSV Range Endpoint: /scrape/csv/range

    • Method: GET
    • Parameters:
      • page_from (required): Starting page number (≥ 1)
      • page_to (required): Ending page number (≥ page_from)
    • Returns: CSV formatted scraped data from jogjasonicindex.com as a downloadable file for the specified page range
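To illustrate how the documented query parameters compose into request URLs, here is a small standard-library helper. It assumes the server is running on localhost:8000 as shown above.

```python
# Build request URLs for the documented endpoints using only the
# standard library. Assumes the server runs on localhost:8000.
from urllib.parse import urlencode

BASE = "http://localhost:8000"


def scrape_url(endpoint: str, **params) -> str:
    """Compose an endpoint path and query parameters into a full URL."""
    query = urlencode(params)
    return f"{BASE}{endpoint}?{query}" if query else f"{BASE}{endpoint}"


print(scrape_url("/scrape/json", max_pages=3))
# http://localhost:8000/scrape/json?max_pages=3
print(scrape_url("/scrape/csv/range", page_from=1, page_to=5))
# http://localhost:8000/scrape/csv/range?page_from=1&page_to=5
```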

API Documentation

After starting the server, visit http://localhost:8000/docs to access the interactive API documentation and test the endpoints.

Important Notes

  • Always respect jogjasonicindex.com's robots.txt file
  • Be mindful of rate limiting to avoid being blocked
  • Check jogjasonicindex.com's terms of service before scraping
  • This scraper is specifically designed for jogjasonicindex.com and may not work on other websites

Improvements for Production Deployment

The application has been enhanced to handle production deployment challenges, specifically addressing 504 Gateway Timeout errors:

Timeout Handling Improvements

  1. Request Timeout Protection: Each HTTP request has a 30-second timeout with retry mechanism and exponential backoff
  2. Overall Scraping Timeout: Total scraping operation limited to 60 seconds with automatic termination
  3. Improved Headers: Better browser mimicry to reduce blocking chances
  4. Efficient Processing: Reduced the inter-request delay from 0.5s to 0.2s while remaining respectful of the target server
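The retry-with-exponential-backoff pattern from point 1 can be sketched as follows. Here `fetch` is a stand-in for the HTTP call; the real code uses requests with a 30-second per-attempt timeout.

```python
# Sketch of retry with exponential backoff. fetch() stands in for the
# HTTP call; delays double after each failed attempt (1s, 2s, 4s, ...).
import time


def fetch_with_retries(fetch, url, retries=3, base_delay=1.0):
    """Call fetch(url), retrying on failure with exponential backoff."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts, propagate the error
            time.sleep(base_delay * (2 ** attempt))
```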

Async Processing Implementation

  1. Thread Pool Executor: Long-running scraping operations run in separate threads to prevent blocking
  2. Non-blocking Endpoints: API endpoints use async/await patterns with thread pooling
  3. Graceful Shutdown: Proper cleanup of thread pools on server shutdown
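The thread-pool pattern above can be sketched with only the standard library: a blocking function is handed to a pool via `run_in_executor`, so the event loop stays free to serve other requests. In the real app the coroutine would be a FastAPI endpoint and the blocking function the scraper.

```python
# Stdlib sketch of the async-endpoint-over-thread-pool pattern.
# blocking_scrape stands in for the slow scraping work.
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)


def blocking_scrape(pages: int) -> list[int]:
    time.sleep(0.01)  # stands in for slow network I/O
    return list(range(pages))


async def scrape_endpoint(pages: int) -> list[int]:
    loop = asyncio.get_running_loop()
    # Run the blocking call in the pool so the event loop is not blocked.
    return await loop.run_in_executor(executor, blocking_scrape, pages)


result = asyncio.run(scrape_endpoint(3))
executor.shutdown()  # graceful cleanup, as the app does on shutdown
```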

Production Configuration

  1. Gunicorn Server: Replaced simple uvicorn with Gunicorn for better production handling
  2. Configurable Workers: Multiple worker processes for concurrent request handling
  3. Extended Timeouts: Server timeout configuration increased to 300 seconds
  4. Enhanced Logging: Better access and error logging for production monitoring
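A hypothetical `gunicorn.conf.py` matching the settings described above might look like this; the project's actual configuration file may differ.

```python
# Hypothetical gunicorn.conf.py reflecting the settings described above.
bind = "0.0.0.0:8000"
workers = 4                                     # configurable worker processes
worker_class = "uvicorn.workers.UvicornWorker"  # ASGI worker for FastAPI
timeout = 300                                   # extended server timeout (seconds)
accesslog = "-"                                 # access log to stdout
errorlog = "-"                                  # error log to stderr
loglevel = "info"
```

Gunicorn would then be started with something like `gunicorn -c gunicorn.conf.py main:app` from inside the app/ directory (module path assumed from the uvicorn command shown earlier).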

Usage Recommendations

  1. Limit Scraping Scope: Always specify max_pages parameter to prevent long-running operations
  2. Monitor Progress: Check logs during operation to track scraping progress
  3. Implement Caching: For repeated requests, consider implementing caching to reduce load
  4. Consider Background Jobs: For extensive scraping, consider implementing a background job queue
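The caching recommendation in point 3 could be implemented as a simple in-process TTL cache, sketched below: results are memoized for a short window so repeated identical requests don't re-hit the site.

```python
# Minimal TTL-cache sketch: memoize a function's results, keyed by its
# arguments, for ttl_seconds before recomputing.
import time


def ttl_cache(ttl_seconds: float):
    def decorator(func):
        store = {}  # args -> (value, timestamp)
        def wrapper(*args):
            now = time.monotonic()
            if args in store:
                value, stamp = store[args]
                if now - stamp < ttl_seconds:
                    return value  # fresh enough, serve from cache
            value = func(*args)
            store[args] = (value, now)
            return value
        return wrapper
    return decorator


@ttl_cache(ttl_seconds=60.0)
def scrape_pages(max_pages: int) -> str:
    # stand-in for the real scraping call
    return f"scraped {max_pages} pages"
```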

Releases

Current Version

  • Version: 1.0.0

Release Process

This project follows semantic versioning. For information about releases, see the Docker Images and GitHub Releases sections below.

Docker Images

Docker images are published to DockerHub with version tags. To use a specific version:

docker pull pawitrawarda/jsi-scraper:latest

GitHub Releases

Check the Releases page for downloadable assets and release notes.
