🐍 ma-web-scrapping

A Python-based web scraping and data processing workflow that extracts and organizes public court decision data from the website of the Indonesian Supreme Court (Mahkamah Agung, MA). The project uses requests and Beautiful Soup for HTML extraction, PyMuPDF (fitz) for PDF text extraction, and Pandas for data manipulation. It is designed to run in Docker containers, with Apache Airflow orchestrating the tasks.

The processed data is ultimately loaded into Google BigQuery for analytical use.

🚀 Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

- Docker and Docker Compose
- Python (for running Airflow tasks or local testing)
- Access to a Google Cloud Platform (GCP) project with BigQuery enabled (for the final data load)

Installation

Clone the repository:

git clone https://github.com/ramadiansyah/ma-web-scrapping-to-bigquery.git
cd ma-web-scrapping-to-bigquery

Configure Environment Variables: Create a .env file in the root directory to store configuration for Docker, Airflow, and GCP credentials. Example variables to include (adapt as necessary):

AIRFLOW_UID=50000 
# Update this with the path to your GCP service account key file
GOOGLE_APPLICATION_CREDENTIALS=/opt/airflow/dags/google-credentials.json 
GCP_PROJECT_ID=your-gcp-project-id
BIGQUERY_DATASET=ma_dataset

Place Credentials: Ensure your GCP service account key file (google-credentials.json) is placed in a location accessible by the Airflow container, as referenced in your .env file.
Build and Run with Docker Compose:
```
docker-compose up -d --build
```
This command builds the necessary Docker images (including the custom Python environment) and starts the Airflow services.
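
As a rough illustration of how a task might consume the variables from the `.env` file, the sketch below reads them from the environment and builds a fully qualified BigQuery table ID. `Settings`, `load_settings`, and `table_id` are hypothetical names used only for this example and are not taken from the repository; only the variable names come from the example `.env` above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Configuration values taken from the example .env file."""
    credentials_path: str
    project_id: str
    dataset: str


def load_settings(env=None):
    """Read the required variables (hypothetical helper, not from the repo)."""
    env = os.environ if env is None else env
    required = (
        "GOOGLE_APPLICATION_CREDENTIALS",
        "GCP_PROJECT_ID",
        "BIGQUERY_DATASET",
    )
    # Fail early with a clear message instead of failing later at load time.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        credentials_path=env["GOOGLE_APPLICATION_CREDENTIALS"],
        project_id=env["GCP_PROJECT_ID"],
        dataset=env["BIGQUERY_DATASET"],
    )


def table_id(settings, table):
    """Build a `project.dataset.table` identifier."""
    return f"{settings.project_id}.{settings.dataset}.{table}"
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise in local tests without touching the real environment.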

⚙️ Architecture and Workflow

The entire process is visualized in the following Data Flow Diagram:

🌐 Data Sources

MA (Mahkamah Agung) website: The primary source of public court decisions.

📦 Tools and Technologies

| Component | Technology | Role |
| --- | --- | --- |
| Orchestration | Docker & Apache Airflow | Manages the sequence and scheduling of the scraping and processing tasks. |
| HTML Extraction | requests & Beautiful Soup | Handles HTTP requests and parses the main listing pages to extract metadata and pagination links. |
| PDF Extraction | fitz (PyMuPDF) | Extracts text and data from the PDF files of the detailed court decisions. |
| Data Processing | Pandas | Cleans, transforms, and merges the extracted data into structured DataFrames. |
| Storage (Staging) | On-premises file storage (e.g., local volume mount) | Temporarily holds HTML snapshots and raw, staged, and final data in pickle format before loading to BigQuery. |
| Final Destination | Google Cloud Platform (BigQuery) | The final, persistent data warehouse for structured and queryable court decision data. |

📋 Workflow Stages

The workflow is broken down into three main pipelines, some steps of which run in sequence and some in parallel:

Pagination Metadata Extraction & Transformation
- Extract: Fetches the main website pages (using requests) and parses the HTML snapshot (using Beautiful Soup) to get all pagination links and basic metadata.
- Transform: Data is loaded into a Pandas DataFrame, de-duplicated, and filtered by required year/month.
- Load: The final metadata is loaded into BigQuery as pagination_metadata_final.
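
The extract and transform steps of this stage could look roughly like the sketch below. It is only an illustration: the sample HTML, the `entry` and `date` class names, the column names, and the `parse_listing` and `transform` helpers are all assumptions, since the real page markup and the repository's own functions are not shown here.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Stand-in for a saved HTML snapshot; the real listing markup will differ.
SAMPLE_HTML = """
<div class="entry"><a href="/putusan/a1.html">Putusan A1</a><span class="date">2024-03-05</span></div>
<div class="entry"><a href="/putusan/a1.html">Putusan A1</a><span class="date">2024-03-05</span></div>
<div class="entry"><a href="/putusan/b2.html">Putusan B2</a><span class="date">2023-11-20</span></div>
"""


def parse_listing(html):
    """Extract one record per listing entry (class names are assumptions)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for entry in soup.find_all("div", class_="entry"):
        link = entry.find("a")
        date = entry.find("span", class_="date")
        rows.append(
            {
                "url": link["href"],
                "title": link.get_text(strip=True),
                "date": date.get_text(strip=True),
            }
        )
    return rows


def transform(rows, year, month):
    """De-duplicate by URL, then keep rows from the requested year and month."""
    df = pd.DataFrame(rows, columns=["url", "title", "date"])
    df["date"] = pd.to_datetime(df["date"])
    df = df.drop_duplicates(subset="url")
    mask = (df["date"].dt.year == year) & (df["date"].dt.month == month)
    return df[mask].reset_index(drop=True)
```

Calling `transform(parse_listing(SAMPLE_HTML), 2024, 3)` leaves a single row for `/putusan/a1.html`, because the duplicate entry is dropped and the 2023 entry is filtered out.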
Decision Detail Data Extraction & Transformation (from HTML)
- Extract: Uses the metadata from the first stage to scrape the detailed HTML page for each decision.
- Load: The raw data is saved into a pickle file.
- Transform: The raw data is loaded back into Pandas, where it is de-duplicated by key (for example, with DataFrame.drop_duplicates).
- Load: The structured detail data is loaded into BigQuery as putusan_detail_raw.
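
A minimal sketch of the pickle staging and de-duplication steps follows. The `decision_id` key, the file name, and the `save_raw` and `load_and_dedup` helpers are hypothetical; the source does not state which key column is used or which duplicate row is kept, so keeping the last occurrence is an assumption.

```python
import pandas as pd


def save_raw(records, path):
    """Stage scraped records (a list of dicts) as a pickle file."""
    pd.DataFrame(records).to_pickle(path)


def load_and_dedup(path, key):
    """Load the staged pickle and keep one row per key.

    keep="last" is an assumption: it retains the most recently scraped row.
    Only unpickle files produced by the pipeline itself, since loading a
    pickle from an untrusted source can execute arbitrary code.
    """
    df = pd.read_pickle(path)
    return df.drop_duplicates(subset=key, keep="last").reset_index(drop=True)
```

For example, staging two records with the same `decision_id` and one with a different value, then calling `load_and_dedup(path, "decision_id")`, yields a two-row DataFrame.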
PDF Document Extraction (from PDF)
- Extract: Uses requests to download the decision PDF files.
- Extract/Transform: Uses PyMuPDF (fitz) and Pandas to extract the text content from the PDF.
- Load: The extracted PDF text content is loaded into BigQuery as putusan_pdf_raw.

🤝 Contribution

Feel free to open issues or submit pull requests for improvements, bug fixes, or new features.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

👤 Author

Ramadiansyah

GitHub: ramadiansyah
