A web scraping project specifically designed to collect data from jogjasonicindex.com by parsing HTML elements. Of course, this is ✨vibe coded✨ using QWEN CODE.
- Make sure you have Python 3.11+ installed (Python 3.10 is also supported when using Docker)
- Install the required dependencies:
    pip install -r requirements.txt
This project includes the following libraries specifically for scraping jogjasonicindex.com:
- requests - For making HTTP requests
- beautifulsoup4 - For parsing HTML/XML
- fastapi - For API framework
- uvicorn - For ASGI server
- pydantic - For data validation
The scrape.py file contains a basic template to start your scraping project for jogjasonicindex.com.
For API usage, see the main FastAPI application in app/main.py.
The project includes a FastAPI application with four endpoints: JSON and CSV, each with a page-range variant. To start the server locally:
    cd app && uvicorn main:app --host 0.0.0.0 --port 8000
You can also run this application using Docker:
- Build the Docker image:
    docker build -t jsi-scraper .
- Run the container:
    docker run -p 8000:8000 jsi-scraper
The application will be available at http://localhost:8000.
- JSON Endpoint: /scrape/json
  - Method: GET
  - Parameters:
    - max_pages (optional): Limit the number of project pages to scrape
  - Returns: JSON formatted scraped data from jogjasonicindex.com
- CSV Endpoint: /scrape/csv
  - Method: GET
  - Parameters:
    - max_pages (optional): Limit the number of project pages to scrape
  - Returns: CSV formatted scraped data from jogjasonicindex.com as a downloadable file
- JSON Range Endpoint: /scrape/json/range
  - Method: GET
  - Parameters:
    - page_from (required): Starting page number (≥ 1)
    - page_to (required): Ending page number (≥ page_from)
  - Returns: JSON formatted scraped data from jogjasonicindex.com for the specified page range
- CSV Range Endpoint: /scrape/csv/range
  - Method: GET
  - Parameters:
    - page_from (required): Starting page number (≥ 1)
    - page_to (required): Ending page number (≥ page_from)
  - Returns: CSV formatted scraped data from jogjasonicindex.com as a downloadable file for the specified page range
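As a rough illustration of how a client might address these endpoints, the sketch below builds the request URL and checks the documented parameter constraints before anything is sent. The helper name `build_scrape_url` and the `BASE_URL` constant are hypothetical and not part of this project. Rejecting `max_pages` on the range endpoints is a client-side choice made here, not documented server behavior.

```python
from urllib.parse import urlencode

# Assumption: the server runs locally on port 8000, as in the run commands.
BASE_URL = "http://localhost:8000"


def build_scrape_url(fmt, max_pages=None, page_from=None, page_to=None,
                     base_url=BASE_URL):
    """Return the URL for one of the four scrape endpoints (hypothetical helper).

    fmt is "json" or "csv". Supplying page_from/page_to selects the /range variant.
    """
    if fmt not in ("json", "csv"):
        raise ValueError("fmt must be 'json' or 'csv'")

    if page_from is not None or page_to is not None:
        # Range endpoints require both bounds.
        if page_from is None or page_to is None:
            raise ValueError("page_from and page_to are both required")
        if max_pages is not None:
            # Client-side choice in this sketch, not documented server behavior.
            raise ValueError("max_pages is not listed for range endpoints")
        if page_from < 1:
            raise ValueError("page_from must be >= 1")
        if page_to < page_from:
            raise ValueError("page_to must be >= page_from")
        query = urlencode({"page_from": page_from, "page_to": page_to})
        return f"{base_url}/scrape/{fmt}/range?{query}"

    if max_pages is None:
        return f"{base_url}/scrape/{fmt}"
    return f"{base_url}/scrape/{fmt}?{urlencode({'max_pages': max_pages})}"
```

The resulting URL can be passed to any HTTP client, for example `requests.get(build_scrape_url("json", max_pages=3))`.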
After starting the server, visit http://localhost:8000/docs to access the interactive API documentation and test the endpoints.
- Always respect jogjasonicindex.com's robots.txt file
- Be mindful of rate limiting to avoid being blocked
- Check jogjasonicindex.com's terms of service before scraping
- This scraper is specifically designed for jogjasonicindex.com and may not work on other websites
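One way to act on the first two points is sketched below, using only the standard library. `make_robots_checker`, `Throttle`, the `jsi-scraper` user agent string, and `SAMPLE_RULES` are hypothetical names for illustration. The sample rules are invented and do not reflect the site's real robots.txt. The 0.2 second default interval is an assumption borrowed from the inter-request delay this README mentions.

```python
import time
from urllib.robotparser import RobotFileParser


def make_robots_checker(robots_lines, user_agent="jsi-scraper"):
    """Return a function that tells whether a URL may be fetched.

    robots_lines is the robots.txt content split into lines.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed


class Throttle:
    """Keep successive calls to wait() at least min_interval seconds apart."""

    def __init__(self, min_interval=0.2, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Invented rules for illustration only; fetch the real file before scraping.
SAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /admin/",
]
allowed = make_robots_checker(SAMPLE_RULES)
```

A scraper loop would then call `throttle.wait()` before each request and skip any URL for which `allowed(url)` is false.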
The application has been enhanced to handle production deployment challenges, specifically addressing 504 Gateway Timeout errors:
- Request Timeout Protection: Each HTTP request has a 30-second timeout with retry mechanism and exponential backoff
- Overall Scraping Timeout: Total scraping operation limited to 60 seconds with automatic termination
- Improved Headers: Better browser mimicry to reduce blocking chances
- Efficient Processing: Reduced inter-request delay from 0.5s to 0.2s while maintaining server respectfulness
- Thread Pool Executor: Long-running scraping operations run in separate threads to prevent blocking
- Non-blocking Endpoints: API endpoints use async/await patterns with thread pooling
- Graceful Shutdown: Proper cleanup of thread pools on server shutdown
- Gunicorn Server: Replaced simple uvicorn with Gunicorn for better production handling
- Configurable Workers: Multiple worker processes for concurrent request handling
- Extended Timeouts: Server timeout configuration increased to 300 seconds
- Enhanced Logging: Better access and error logging for production monitoring
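The retry and timeout behavior listed above could look roughly like the following. This is a minimal sketch, not the project's actual code: `fetch_with_retry` and `ScrapeTimeout` are hypothetical names, the retry count and base delay are assumed values, and `fetch` stands in for whatever performs the request (for example `lambda url: requests.get(url, timeout=30)`).

```python
import time


class ScrapeTimeout(Exception):
    """Raised when the overall scraping budget has been used up."""


def fetch_with_retry(fetch, url, *, retries=3, base_delay=1.0, deadline=None,
                     sleep=time.sleep, clock=time.monotonic):
    """Call fetch(url), retrying failures with exponential backoff.

    deadline is an absolute clock() value (e.g. time.monotonic() + 60) after
    which no further attempt is started.
    """
    last_error = None
    for attempt in range(retries):
        if deadline is not None and clock() >= deadline:
            raise ScrapeTimeout(f"overall budget exhausted before fetching {url}")
        try:
            return fetch(url)
        except Exception as exc:  # a real scraper would catch narrower errors
            last_error = exc
            if attempt < retries - 1:
                # Delays grow as base_delay * 1, 2, 4, ...
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

Passing one shared `deadline` to every call keeps the whole run inside a single budget, while the per-request timeout remains the responsibility of `fetch` itself.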
- Limit Scraping Scope: Always specify the max_pages parameter to prevent long-running operations
- Monitor Progress: Check logs during operation to track scraping progress
- Implement Caching: For repeated requests, consider implementing caching to reduce load
- Consider Background Jobs: For extensive scraping, consider implementing a background job queue
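For the caching suggestion, a small in-memory cache with a time-to-live may be enough as a starting point. The sketch below is an assumption about how that could be done, not something the project ships: `TTLCache` is a hypothetical class and the 300 second default is an arbitrary value.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (stored_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for key, calling compute() on a miss or expiry."""
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = compute()
        self._store[key] = (now, value)
        return value
```

An endpoint could key the cache on its parameters, for example `cache.get_or_compute(("json", max_pages), run_scrape)`, where `run_scrape` is a placeholder for the scraping call. With several worker processes, each one would hold its own copy of such a cache.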
- Version: 1.0.0
This project follows semantic versioning. For information about releases and how to create them, see:
- CHANGELOG.md - Detailed changes for each release
- RELEASE_NOTES.md - Detailed release notes
Docker images are published to DockerHub with version tags. To use a specific version:
    docker pull pawitrawarda/jsi-scraper:latest
Check the Releases page for downloadable assets and release notes.