🐍 ma-web-scrapping

A Python-based web scraping and data processing workflow that extracts and organizes public court decision data from the website of the Indonesian Supreme Court (Mahkamah Agung, MA). It uses requests and Beautiful Soup for HTML extraction and Pandas for data manipulation. The workflow runs in Docker containers, with Apache Airflow orchestrating the tasks.

The processed data is ultimately loaded into Google BigQuery for analytical use.

🚀 Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

  • Docker and Docker Compose
  • Python (specifically for running Airflow tasks or local testing)
  • Access to a Google Cloud Platform (GCP) project with BigQuery enabled (for the final data load).

Installation

  1. Clone the repository:
    git clone https://github.com/ramadiansyah/ma-web-scrapping-to-bigquery.git
    cd ma-web-scrapping-to-bigquery
  2. Configure Environment Variables: Create a .env file in the root directory to store configuration for Docker, Airflow, and GCP credentials. Example variables to include (adapt as necessary):
    AIRFLOW_UID=50000 
    # Update this with the path to your GCP service account key file
    GOOGLE_APPLICATION_CREDENTIALS=/opt/airflow/dags/google-credentials.json 
    GCP_PROJECT_ID=your-gcp-project-id
    BIGQUERY_DATASET=ma_dataset 
  3. Place Credentials: Ensure your GCP service account key file (google-credentials.json) is placed in a location accessible by the Airflow container, as referenced in your .env file.
  4. Build and Run with Docker Compose:
    docker-compose up -d --build
    This command builds the necessary Docker images (including the custom Python environment) and starts the Airflow services.
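
Before starting the containers, it can help to confirm that the .env file defines the variables listed in step 2. The following is a minimal sketch and not part of the repository: the helper names `parse_env_text` and `missing_vars` are hypothetical, and the parser only handles simple KEY=VALUE lines and `#` comments.

```python
# Hypothetical helpers for illustration; they are not part of the repository.
REQUIRED_VARS = (
    "AIRFLOW_UID",
    "GOOGLE_APPLICATION_CREDENTIALS",
    "GCP_PROJECT_ID",
    "BIGQUERY_DATASET",
)


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_vars(values, required=REQUIRED_VARS):
    """Return the required names that are absent or empty, in order."""
    return [name for name in required if not values.get(name)]
```

To use it, read the .env file as text, pass the contents to `parse_env_text`, and check that `missing_vars` returns an empty list before running Docker Compose.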

⚙️ Architecture and Workflow

The entire process is visualized in the following Data Flow Diagram:


🌐 Data Sources

  • Mahkamah Agung (MA) website: The primary source of public court decisions.

📦 Tools and Technologies

| Component | Technology | Role |
|---|---|---|
| Orchestration | Docker & Apache Airflow | Manages the sequence and scheduling of the scraping and processing tasks. |
| HTML Extraction | requests & Beautiful Soup | Handles HTTP requests and parses the main listing pages to extract metadata and pagination links. |
| PDF Extraction | fitz (PyMuPDF) | Extracts text and data from the PDF files of the detailed court decisions. |
| Data Processing | Pandas | Cleans, transforms, and merges the extracted data into structured DataFrames. |
| Storage (Staging) | On-Premises File Storage (e.g., local volume mount) | Temporarily holds HTML snapshots and raw/staged/final data in pickle format before loading to BigQuery. |
| Final Destination | Google Cloud Platform (BigQuery) | The final, persistent data warehouse for structured and queryable court decision data. |
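
The Storage (Staging) entry above mentions pickle files on a local volume. As a rough illustration of that round trip, the sketch below writes a list of records to a pickle file and reads it back using only the standard library. The function names, the file path, and the record fields are assumptions made for this example; the project itself may well use Pandas' own pickle helpers instead.

```python
import pickle
import tempfile
from pathlib import Path


def save_stage(records, path):
    """Write records to a pickle file, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(records, fh)
    return path


def load_stage(path):
    """Read records back from a pickle file written by save_stage."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


# Round trip in a temporary folder (a stand-in for the mounted volume).
with tempfile.TemporaryDirectory() as tmp:
    staged = save_stage(
        [{"decision_id": "example-1", "year": 2024}],  # made-up record
        Path(tmp) / "raw" / "putusan_detail_raw.pkl",
    )
    restored = load_stage(staged)
```

Pickle files can execute arbitrary code when loaded, so staged files should only be read back from storage that the pipeline itself controls.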

📋 Workflow Stages

The workflow is broken down into three main pipelines, which combine parallel and sequential tasks:

  1. Pagination Metadata Extraction & Transformation

    • Extract: Fetches the main website pages (using requests) and parses the HTML snapshot (using Beautiful Soup) to get all pagination links and basic metadata.
    • Transform: Data is loaded into a Pandas DataFrame, de-duplicated, and filtered by required year/month.
    • Load: The final metadata is loaded into BigQuery as pagination_metadata_final.
  2. Decision Detail Data Extraction & Transformation (from HTML)

    • Extract: Uses the metadata from the first stage to scrape the detailed HTML page for each decision.
    • Load (staging): The raw data is saved to a pickle file.
    • Transform: The raw data is loaded back into Pandas and de-duplicated by key (e.g., with DataFrame.drop_duplicates).
    • Load: The structured detail data is loaded into BigQuery as putusan_detail_raw.
  3. PDF Document Text Extraction

    • Extract: Uses requests to download the decision PDF files.
    • Extract/Transform: Uses PyMuPDF (fitz) and Pandas to extract the text content from the PDF.
    • Load: The extracted PDF text content is loaded into BigQuery as putusan_pdf_raw.
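
The Transform steps above (de-duplicating by key and filtering by year and month) can be sketched with Pandas as follows. The column names `decision_id` and `decision_date`, the function name `dedupe_and_filter`, and the sample rows are hypothetical, since the scraper's real schema is not shown in this README.

```python
import pandas as pd


def dedupe_and_filter(df, key, date_col, year, month):
    """Drop duplicate keys (keeping the first), then keep one year/month."""
    out = df.drop_duplicates(subset=[key], keep="first").copy()
    # Unparseable dates become NaT and are excluded by the mask below.
    out[date_col] = pd.to_datetime(out[date_col], errors="coerce")
    mask = (out[date_col].dt.year == year) & (out[date_col].dt.month == month)
    return out.loc[mask].reset_index(drop=True)


# Made-up rows for illustration only.
sample = pd.DataFrame(
    {
        "decision_id": ["a", "a", "b", "c"],
        "decision_date": ["2024-01-05", "2024-01-05", "2024-02-10", "2024-01-20"],
    }
)
result = dedupe_and_filter(sample, "decision_id", "decision_date", 2024, 1)
```

De-duplicating before filtering keeps the first occurrence of each key regardless of its date; if a different rule is needed (for example, keeping the most recent row), sort the DataFrame before calling `drop_duplicates`.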

🤝 Contribution

Feel free to open issues or submit pull requests for improvements, bug fixes, or new features.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

👤 Author

Ramadiansyah