
feat(telemetry): add centralized threat intelligence data pipeline (S3 + Lambda + Athena) #35

Open
TrishaG189 wants to merge 1 commit into c2siorg:main from TrishaG189:feat/threat-intel-pipeline

@TrishaG189

Summary

Implements the complete threat intelligence data pipeline for the Honeynet project —
the layer that transforms raw attacker logs into actionable, queryable intelligence.

Problem

Every existing PR deploys honeypots that store logs locally on the instance.
These logs are lost on termination, siloed by region, and provide zero threat context.
A honeypot without centralized, enriched data is just a trap with no analysis.

What This PR Adds

1. terraform/modules/telemetry/ — Serverless Pipeline Infrastructure

  • S3 Log Sink — encrypted, append-only bucket receiving logs from any honeypot
    node on any cloud provider (AWS, GCP, Azure)
  • AWS Lambda (Python) — auto-triggered on every new log file, parses Cowrie JSON,
    queries AbuseIPDB for abuse score/country/ISP/Tor status, writes enriched record
  • AWS Secrets Manager — AbuseIPDB API key stored securely, never hardcoded
  • AWS Glue Crawler — runs hourly, auto-discovers schema from enriched logs
  • AWS Athena Workgroup — enables direct SQL queries on attack data, zero infrastructure
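
A minimal sketch of how the Lambda entry point might pull the bucket and object key out of the S3 trigger event. The function name `iter_s3_objects` is hypothetical and not taken from this PR; the event shape follows the standard S3 notification format, in which object keys arrive URL-encoded.

```python
from urllib.parse import unquote_plus


def iter_s3_objects(event):
    """Yield (bucket, key) for each object referenced in an S3 notification event.

    Hypothetical helper for illustration; the PR's handler may be structured differently.
    """
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        key = s3.get("object", {}).get("key")
        if bucket and key:
            # S3 URL-encodes keys in notifications (a space arrives as '+'),
            # so decode before using the key to fetch the object.
            yield bucket, unquote_plus(key)
```

Decoding the key first matters for log files whose names contain spaces or colons, since fetching the still-encoded key would fail with a missing-object error.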

2. lambda/enrichment/handler.py — Python Enrichment Function

  • Parses every Cowrie log event (login attempts, commands, sessions)
  • Enriches attacker IP with: abuse score, country, ISP, domain, Tor status
  • Fails gracefully — if AbuseIPDB is unreachable, log is still stored unenriched
  • Private IPs are skipped automatically
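
The behaviour in the bullets above can be sketched roughly as follows. This is an illustration under assumptions, not the code in handler.py: `enrich_events`, `map_abuseipdb`, and `is_public_ip` are hypothetical names, the `lookup` callable stands in for the AbuseIPDB request, the `domain` and `is_tor` column names are guesses, and skipping malformed lines is an assumed policy. The response field names (`abuseConfidenceScore`, `countryCode`, `isp`, `domain`, `isTor`) are those of the AbuseIPDB v2 check endpoint.

```python
import ipaddress
import json


def is_public_ip(value):
    """Return True only for syntactically valid, non-private addresses."""
    try:
        return not ipaddress.ip_address(value).is_private
    except ValueError:
        return False


def map_abuseipdb(data):
    """Flatten the 'data' object of an AbuseIPDB check response into columns."""
    return {
        "abuse_score": data.get("abuseConfidenceScore"),
        "country_code": data.get("countryCode"),
        "isp": data.get("isp"),
        "domain": data.get("domain"),   # assumed column name
        "is_tor": data.get("isTor"),    # assumed column name
    }


def enrich_events(lines, lookup):
    """Parse newline-delimited Cowrie JSON and enrich each event's src_ip.

    `lookup(ip)` is a stand-in for the AbuseIPDB call and may raise.
    """
    cache = {}      # one lookup per distinct IP within this batch
    enriched = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # assumption: malformed lines are dropped
        ip = event.get("src_ip")
        if ip and is_public_ip(ip):
            if ip not in cache:
                try:
                    cache[ip] = map_abuseipdb(lookup(ip))
                except Exception:
                    # Fail gracefully: keep the event, just without enrichment.
                    cache[ip] = {}
            event.update(cache[ip])
        enriched.append(event)
    return enriched
```

Passing the lookup in as a parameter keeps the parsing and fallback logic testable without network access or an API key.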

3. ansible/playbooks/install_log_forwarder.yml — Fluent Bit Agent

Example Athena Queries (usable once the Glue Crawler has catalogued the enriched logs)

-- Top attacking IPs
SELECT src_ip, COUNT(*) AS attempts FROM honeynet_attacks.enriched
GROUP BY src_ip ORDER BY attempts DESC LIMIT 20;

-- High-risk IPs
SELECT src_ip, abuse_score, country_code, isp FROM honeynet_attacks.enriched
WHERE abuse_score > 80 ORDER BY abuse_score DESC;

-- Most tried credentials
SELECT username, password, COUNT(*) as tries FROM honeynet_attacks.enriched
WHERE eventid = 'cowrie.login.failed'
GROUP BY username, password ORDER BY tries DESC LIMIT 20;
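
For reviewers without Athena access, the aggregation logic of the credentials query can be sanity-checked locally. The sketch below runs the same query shape against an in-memory SQLite table; the `top_credentials` helper, the reduced column set, and the unqualified table name `enriched` are assumptions for illustration, and SQLite is only a stand-in for Athena's SQL engine.

```python
import sqlite3


def top_credentials(rows, limit=20):
    """Count failed-login credential pairs, most frequent first.

    `rows` are (eventid, src_ip, username, password) tuples; this column
    subset is assumed for the example and is not the full enriched schema.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE enriched "
            "(eventid TEXT, src_ip TEXT, username TEXT, password TEXT)"
        )
        conn.executemany("INSERT INTO enriched VALUES (?, ?, ?, ?)", rows)
        return conn.execute(
            """
            SELECT username, password, COUNT(*) AS tries
            FROM enriched
            WHERE eventid = 'cowrie.login.failed'
            GROUP BY username, password
            ORDER BY tries DESC
            LIMIT ?
            """,
            (limit,),
        ).fetchall()
    finally:
        conn.close()
```

Every selected column that is not an aggregate appears in the GROUP BY clause, which Athena requires as well.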

Architecture

Honeypot Node (AWS/GCP/Azure)
        ↓  Fluent Bit → TLS
S3 Log Sink (AES-256, versioned, lifecycle to Glacier)
        ↓  S3 trigger
Lambda (Python 3.12) → AbuseIPDB API → enriched JSON
        ↓
S3 Enriched Logs
        ↓  hourly Glue Crawler
Athena SQL — instant attack analysis

Alignment with GSoC Objectives

  • Data Enrichment — correlates attack IPs with global threat intelligence feeds
  • Scalability — serverless, zero servers to manage, handles any log volume
  • Distributed Architecture — works across all cloud providers and regions

Relation to Existing PRs

Proof of Execution (Live Environment Test)

To verify that this pipeline works in a real AWS environment, I deployed the complete module to a sandbox account, simulated Fluent Bit log ingestion, and queried the resulting enriched data via Athena.

1. Successful Terraform Deployment

Infrastructure provisioned cleanly. IAM roles, Lambda triggers, and S3 lifecycle rules were attached successfully.

Screenshot 2026-03-31 102127

2. Lambda IP Enrichment & Caching (CloudWatch Logs)

Uploaded a mock cowrie.json payload directly to the raw S3 sink. The Lambda trigger fired immediately, processed the file, and successfully stored the enriched JSON.

Screenshot 2026-03-31 105050

3. Athena SQL Query Results (The Enriched Data)

Triggered the Glue Crawler to infer the schema. Successfully queried the enriched data using standard SQL in Athena. Note the abuse_score and country_code fields successfully appended to the raw Cowrie data!

Screenshot 2026-03-31 113612

Closes #30

…th S3 sink, Lambda enrichment, Athena, and Fluent Bit log forwarder
@hariram4862

@TrishaG189

Thanks for this, really nice work. It helped make the telemetry direction much clearer.

I’ve built on top of it in PR #37, where I connected the multi-region Cowrie deployment with this pipeline and tested the full flow with real honeypot-generated events.

Looking forward to building more on this together.

@TrishaG189
Author

@hariram4862
Thanks, really appreciate it!

Glad it helped. I’ll check out PR #37 and go through the integration. Nice to see the full pipeline working with real events.

Development

Successfully merging this pull request may close these issues.

feat: Centralized Threat Intelligence Data Pipeline (S3 + Lambda Enrichment + Athena)