eco-pulse

The Guardian Environment News Enrichment & Search Pipeline

A cloud-native data pipeline designed to collect, enrich (via AI), and archive news articles. The architecture follows an S3-first approach to minimize costs associated with AWS OpenSearch.

Tech Stack

Java (Quarkus Native): Core logic for Daily and Backfill Lambdas, and the Historical processor.
Python: Utility scripts for data fetching and S3 orchestration.
AWS Lambda: Serverless execution of daily enrichment and data restoration.
AWS S3: The "Single Source of Truth" (SSoT) storing both raw CSVs and enriched JSONs.
AWS OpenSearch: Search engine for front-end queries (managed as an on-demand resource).
OpenAI API: GPT models used for content summarization and sentiment analysis.

Data Flow & Workflows

1. Initial Bootstrapping (Historical Data)

To process past news from scratch:

Python Script 1: Pulls news from The Guardian API for a specific date range and saves them to local .csv files.
Python Script 2: Pushes these CSV files to the S3 bucket (raw-data/ prefix).
Historical Java Class: A local/standalone runner that pulls CSVs from S3, sends content to ChatGPT for enrichment, and generates JSON files.
Python Script 3: Pushes the final JSON files to S3 (enriched-news/ prefix).

2. Backfill Procedure `BackfillIndexingLambdaHandler`

When OpenSearch is re-enabled after a period of being offline:

Deploy/Start the OpenSearch Domain.
Invoke the Backfill Lambda.
The Lambda scans enriched-news/, reads all JSON files, and performs a Bulk Indexing operation to restore the search database.
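
As a rough illustration of the bulk indexing step, the sketch below builds the newline-delimited body that the OpenSearch `_bulk` endpoint expects. The actual handler is written in Java, so this is a language-neutral sketch only; the index name `eco-pulse-news` and the `id` field on each document are assumptions.

```python
import json


def build_bulk_body(documents, index_name="eco-pulse-news"):
    """Build an NDJSON body for POST /_bulk.

    Each document contributes two lines: an action line and the source line.
    `index_name` and the per-document "id" field are hypothetical.
    """
    lines = []
    for doc in documents:
        action = {"index": {"_index": index_name, "_id": doc["id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return ("\n".join(lines) + "\n") if lines else ""
```

Using a stable `_id` means that re-running the backfill overwrites existing documents instead of duplicating them.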

3. Daily Pipeline (Automation) `ScheduledDailyLambdaHandler`

The automated daily cycle:

Trigger: Runs nightly via AWS EventBridge.
Fetch: Downloads yesterday's news from The Guardian.
Enrich: Processes text through the OpenAI API.
S3 Archive: Saves the result to enriched-news/YYYY-MM-DD.json. This step is critical and never skipped.
Indexing: Attempts to push data to OpenSearch. If the domain is missing or offline, this step fails gracefully; the enriched result is already archived in S3, so ChatGPT is not re-triggered (saving API credits).
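
The ordering of these steps is the core of the S3-first design, so a minimal sketch may help. The real handler is Java; here the four steps are passed in as plain callables (`fetch`, `enrich`, `archive`, `index`), all of which are hypothetical names. The sketch also assumes that the date in the file name is the date of the news, not the date of the run.

```python
from datetime import date, timedelta


def archive_key(run_date):
    """S3 key for the news of the day before `run_date` (assumed naming)."""
    news_date = run_date - timedelta(days=1)
    return f"enriched-news/{news_date.isoformat()}.json"


def run_daily(run_date, fetch, enrich, archive, index):
    """Run one daily cycle: fetch, enrich, archive to S3, then try to index."""
    articles = fetch(run_date - timedelta(days=1))
    enriched = [enrich(article) for article in articles]

    key = archive_key(run_date)
    archive(key, enriched)  # critical step: any error here propagates

    try:
        index(enriched)
        indexed = True
    except Exception as exc:  # e.g. the OpenSearch domain is offline
        print(f"Indexing skipped: {exc}")
        indexed = False

    return {"key": key, "count": len(enriched), "indexed": indexed}
```

Because the archive call happens before indexing, a failed indexing attempt costs nothing: the Backfill Lambda can later restore the index from the same file.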

4. Analytics API Lambda (The "Server") `DashboardController`

The front-end doesn't talk to OpenSearch directly. Instead, there is a dedicated Server Lambda that acts as the analytics engine:

Purpose: Provides REST endpoints for the UI to fetch aggregated data and charts.
Date Range: Supports full historical analysis from January 1st, 2024, to the present day.
Efficiency: Optimized to query OpenSearch and format data specifically for visualization libraries (charts/graphs).
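
To make the idea of "query OpenSearch and format for charts" concrete, here is a hedged sketch of one possible aggregation: articles per month since January 1st, 2024. The field names `published_at` and `sentiment_score`, the aggregation names, and the `x`/`y` output shape are all assumptions; the actual `DashboardController` is Java and may query differently.

```python
from datetime import date


def articles_per_month_query(start=date(2024, 1, 1), end=None):
    """Build a search body that counts articles per calendar month."""
    end = end or date.today()
    return {
        "size": 0,  # only aggregations are needed, not the hits themselves
        "query": {
            "range": {
                "published_at": {"gte": start.isoformat(), "lte": end.isoformat()}
            }
        },
        "aggs": {
            "per_month": {
                "date_histogram": {
                    "field": "published_at",
                    "calendar_interval": "month",
                },
                "aggs": {"avg_sentiment": {"avg": {"field": "sentiment_score"}}},
            }
        },
    }


def to_chart_series(response):
    """Flatten the histogram buckets into points a chart library can plot."""
    buckets = response["aggregations"]["per_month"]["buckets"]
    return [{"x": b["key_as_string"], "y": b["doc_count"]} for b in buckets]
```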

Security & Access Control

The API is not just protected by identity; it’s hardened against unauthorized clients:

Firebase App Check: Integrated to ensure that only requests from my verified web application can access the analytics endpoints. This prevents scraping and unauthorized API usage, even by someone who holds a valid auth token.
Identity: Works alongside Firebase Auth to provide a multi-layered security model.
Logic: Implemented in FirebaseAuthFilter, which validates the X-Firebase-AppCheck token on the server side.
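
The server-side check can be pictured with the sketch below. It is not the Java `FirebaseAuthFilter` itself: the function name `check_app_check`, the injected `verify_token` callable, and the 401 responses are assumptions chosen to show the control flow only.

```python
def check_app_check(headers, verify_token):
    """Return (status, message) for a request based on its App Check header.

    `verify_token` is any callable that raises when the token is invalid.
    """
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    token = normalized.get("x-firebase-appcheck")
    if not token:
        return 401, "missing App Check token"
    try:
        verify_token(token)
    except Exception:
        return 401, "invalid App Check token"
    return 200, "ok"
```

In a Python service, the `verify_token` function from the Firebase Admin SDK's `firebase_admin.app_check` module could be passed in as the verifier. The Firebase Auth identity check would run as a separate step after this one.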

Maintenance & Lessons Learned

Storage Hierarchy:
- raw-data/ -> Original API responses (CSV).
- enriched-news/ -> Final AI-processed data (JSON). One file per day.
OpenSearch Domain Access Rules:
- Domain-level access policy restricted by IP address and IAM role.
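
One common way to express such a domain-level rule is a resource-based access policy with two statements, one for the IAM role and one for an IP allow-list. The sketch below builds that JSON; the ARNs and CIDR range passed in are placeholders, and the project's real policy may be structured differently.

```python
import json


def build_domain_access_policy(domain_arn, role_arn, allowed_cidrs):
    """Build a resource-based policy granting HTTP access to the domain."""
    resource = f"{domain_arn}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # signed requests from the given IAM role (e.g. a Lambda role)
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "es:ESHttp*",
                "Resource": resource,
            },
            {   # unsigned requests, but only from the listed IP ranges
                "Effect": "Allow",
                "Principal": {"AWS": "*"},
                "Action": "es:ESHttp*",
                "Resource": resource,
                "Condition": {"IpAddress": {"aws:SourceIp": list(allowed_cidrs)}},
            },
        ],
    }


# Placeholder values for illustration only.
policy = build_domain_access_policy(
    "arn:aws:es:us-east-1:123456789012:domain/example-domain",
    "arn:aws:iam::123456789012:role/example-lambda-role",
    ["203.0.113.0/24"],
)
policy_json = json.dumps(policy, indent=2)
```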

Python Utility Scripts

guardian_news_scrap.py: API interaction and CSV generation.
upload_to_s3_csv.py: Generic S3 uploader for local raw news (CSV) files.
upload_to_s3_json.py: Generic S3 uploader for local enriched news (JSON) files.
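
Both uploaders follow the same pattern, which might be sketched as below. This is not the code of the scripts themselves: the function name `upload_directory` and its parameters are assumptions. The `client` argument is expected to behave like a boto3 S3 client, whose `upload_file(Filename, Bucket, Key)` method is the only call used.

```python
from pathlib import Path


def upload_directory(client, bucket, local_dir, prefix, suffix):
    """Upload every file in `local_dir` ending in `suffix` under `prefix`.

    Returns the list of S3 keys that were written.
    """
    uploaded = []
    for path in sorted(Path(local_dir).glob(f"*{suffix}")):
        key = f"{prefix.rstrip('/')}/{path.name}"
        client.upload_file(str(path), bucket, key)
        uploaded.append(key)
    return uploaded


# Example usage (bucket name is a placeholder):
#   import boto3
#   upload_directory(boto3.client("s3"), "my-bucket", "./csv", "raw-data", ".csv")
#   upload_directory(boto3.client("s3"), "my-bucket", "./json", "enriched-news", ".json")
```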
