A Python-based web scraping and data processing workflow to extract and organize public court decision data from the Indonesian Supreme Court (Mahkamah Agung, MA) website. This project uses requests and Beautiful Soup for extraction and Pandas for data manipulation. It is designed to run within a Docker container, with Apache Airflow orchestrating the tasks.
The processed data is ultimately loaded into Google BigQuery for analytical use.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
- Docker and Docker Compose
- Python (specifically for running Airflow tasks or local testing)
- Access to a Google Cloud Platform (GCP) project with BigQuery enabled (for the final data load).
- Clone the repository:

      git clone https://github.com/ramadiansyah/ma-web-scrapping-to-bigquery.git
      cd ma-web-scrapping-to-bigquery

- Configure Environment Variables:
  Create a `.env` file in the root directory to store configuration for Docker, Airflow, and GCP credentials. Example variables to include (adapt as necessary):

      AIRFLOW_UID=50000
      # Update this with the path to your GCP service account key file
      GOOGLE_APPLICATION_CREDENTIALS=/opt/airflow/dags/google-credentials.json
      GCP_PROJECT_ID=your-gcp-project-id
      BIGQUERY_DATASET=ma_dataset

- Place Credentials:
  Ensure your GCP service account key file (`google-credentials.json`) is placed in a location accessible by the Airflow container, as referenced in your `.env` file.
- Build and Run with Docker Compose:
  This command builds the necessary Docker images (including the custom Python environment) and starts the Airflow services.

      docker-compose up -d --build
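
Once the containers are running, task code needs to read the variables defined in `.env`. The following is a minimal sketch of how that could look, not the project's actual code: the `PipelineConfig` class and the `load_config` and `table_id` helpers are hypothetical names introduced for illustration, and only the variable names come from the example above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical container for the settings listed in the example .env file."""
    credentials_path: str
    project_id: str
    dataset: str


def load_config(env=None):
    """Build a PipelineConfig from environment variables, failing fast on gaps."""
    env = os.environ if env is None else env
    required = ("GOOGLE_APPLICATION_CREDENTIALS", "GCP_PROJECT_ID", "BIGQUERY_DATASET")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return PipelineConfig(
        credentials_path=env["GOOGLE_APPLICATION_CREDENTIALS"],
        project_id=env["GCP_PROJECT_ID"],
        dataset=env["BIGQUERY_DATASET"],
    )


def table_id(config, table_name):
    """Return a fully qualified BigQuery table ID: project.dataset.table."""
    return f"{config.project_id}.{config.dataset}.{table_name}"
```

Calling `load_config()` with no arguments reads from the process environment. Passing a plain dictionary instead makes the function easy to exercise in local tests without touching real credentials.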
The entire process is visualized in the following Data Flow Diagram:
- Website MA (Mahkamah Agung): The primary source of public court decisions.
| Component | Technology | Role |
|---|---|---|
| Orchestration | Docker & Apache Airflow | Manages the sequence and scheduling of the scraping and processing tasks. |
| HTML Extraction | requests & Beautiful Soup | Handles HTTP requests and parses the main listing pages to extract metadata and pagination links. |
| PDF Extraction | fitz (PyMuPDF) | Extracts text and data from the PDF files of the detailed court decisions. |
| Data Processing | Pandas | Cleans, transforms, and merges the extracted data into structured DataFrames. |
| Storage (Staging) | On-Premises File Storage (e.g., local volume mount) | Temporarily holds HTML snapshots, raw/staged/final data in pickle format before loading to BigQuery. |
| Final Destination | Google Cloud Platform (BigQuery) | The final, persistent data warehouse for structured and queryable court decision data. |
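
To make the HTML Extraction row concrete, here is a small sketch of parsing pagination links with Beautiful Soup. It assumes the pagination anchors sit inside an element with the class `pagination`. That selector, the sample markup, and the `extract_pagination_links` helper are illustrative assumptions, not taken from the MA website or this repository. In a real run the HTML string would come from an HTTP response fetched with `requests`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_pagination_links(html, base_url):
    """Return de-duplicated absolute URLs of pagination links, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # Assumed markup: pagination anchors live under an element with class "pagination".
    for anchor in soup.select(".pagination a[href]"):
        url = urljoin(base_url, anchor["href"])
        if url not in links:
            links.append(url)
    return links


# Hypothetical listing-page fragment used only to exercise the parser.
sample_html = """
<ul class="pagination">
  <li><a href="/list?page=1">1</a></li>
  <li><a href="/list?page=2">2</a></li>
  <li><a href="/list?page=2">Next</a></li>
</ul>
"""
print(extract_pagination_links(sample_html, "https://example.org"))
```

Keeping the parsing step separate from the HTTP request means it can be run against saved HTML snapshots, which matches the staging approach described in the table.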
The workflow is broken down into three main pipelines, which run partly in parallel and partly in sequence:
- Pagination Metadata Extraction & Transformation
  - Extract: Fetches the main website pages (using `requests`) and parses the HTML snapshot (using `Beautiful Soup`) to get all pagination links and basic metadata.
  - Transform: Data is loaded into a Pandas DataFrame, de-duplicated, and filtered by required year/month.
  - Load: The final metadata is loaded into BigQuery as `pagination_metadata_final`.
- Decision Detail Data Extraction & Transformation (from HTML)
  - Extract: Uses the metadata from the first stage to scrape the detailed HTML page for each decision.
  - Load: The raw data is saved into a pickle file.
  - Transform: The raw data is loaded back into Pandas, where it's de-duplicated by key.
  - Load: The structured detail data is loaded into BigQuery as `putusan_detail_raw`.
- PDF Document Extraction (from PDF)
  - Extract: Uses `requests` to download the decision PDF files.
  - Extract/Transform: Uses PyMuPDF (fitz) and Pandas to extract the text content from the PDF.
  - Load: The extracted PDF text content is loaded into BigQuery as `putusan_pdf_raw`.
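
The Transform and pickle staging steps described above can be sketched with Pandas as follows. This is an illustration under stated assumptions, not the repository's code: the column names `url` and `decision_date`, the sample records, and the `transform_metadata` helper are hypothetical, and a temporary directory stands in for the staging volume.

```python
import os
import tempfile

import pandas as pd


def transform_metadata(records, year, month):
    """De-duplicate records by URL, then keep only the requested year and month."""
    df = pd.DataFrame(records)
    # Assumed key column: "url" identifies a decision uniquely.
    df = df.drop_duplicates(subset="url", keep="first")
    dates = pd.to_datetime(df["decision_date"])
    mask = (dates.dt.year == year) & (dates.dt.month == month)
    return df[mask].reset_index(drop=True)


# Hypothetical scraped metadata, including one duplicate row.
records = [
    {"url": "https://example.org/putusan/1", "decision_date": "2024-03-05"},
    {"url": "https://example.org/putusan/1", "decision_date": "2024-03-05"},
    {"url": "https://example.org/putusan/2", "decision_date": "2024-04-11"},
]
final = transform_metadata(records, year=2024, month=3)

# Stage the result as a pickle file and read it back, as a later task would.
with tempfile.TemporaryDirectory() as staging_dir:
    path = os.path.join(staging_dir, "pagination_metadata_final.pkl")
    final.to_pickle(path)
    restored = pd.read_pickle(path)

print(len(restored))
```

Pickle preserves DataFrame dtypes exactly between tasks, which is convenient for staging, but pickle files should only be read from storage you trust.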
Feel free to open issues or submit pull requests for improvements, bug fixes, or new features.
This project is licensed under the MIT License - see the LICENSE file for details.
Ramadiansyah
- GitHub: ramadiansyah
