This project implements an end-to-end data engineering pipeline designed to process, transform, and visualize financial transaction data for fraud detection. The architecture leverages local computation for Spark-based ETL and Airflow orchestration, targeting Google Cloud Platform (GCS, BigQuery) for storage, and dbt for in-warehouse data transformations.
Financial fraud results in significant monetary losses and operational inefficiencies. Identifying fraudulent transactions within highly imbalanced datasets requires robust feature engineering and reliable data pipelines. A scalable infrastructure is necessary to consistently deliver processed transaction data to analytical dashboards and downstream machine learning models.
- Develop a robust ETL pipeline to process and augment raw transaction data using PySpark.
- Implement workflow orchestration using Apache Airflow to automate data generation, cloud storage uploads, data warehouse loading, and data transformations.
- Execute data modeling using dbt to aggregate daily fraud metrics and create analytical data marts.
- Visualizing fraud patterns and anomalies via an interactive dashboard in Google Looker Studio.
- Source: Kaggle Dataset - Synthetic Financial Datasets For Fraud Detection
- Processing: The raw dataset is augmented randomly via a PySpark generator to scale the volume and simulate continuous time-series data for temporal analysis.
Configure your local environment variables to ensure Airflow and GCP SDK operate correctly without macOS multiprocessing conflicts.
export GOOGLE_APPLICATION_CREDENTIALS="$(pwd)/gcp-pipeline-key.json"
export AIRFLOW_HOME="$(pwd)"
export no_proxy="*"
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YESApply the variables using direnv:
direnv allowConfigure the dbt profile to connect to your BigQuery dataset using the generated service account key.
fraud_detection:
outputs:
dev:
dataset: fraud_detection
job_execution_timeout_seconds: 300
job_retries: 1
keyfile: /path/to/your/service-account-key.json
location: us-east1
method: service-account
priority: interactive
project: project_id
threads: 4
type: bigquery
target: devAuthenticate your local environment with Google Cloud.
gcloud auth application-default login1. Data Preparation
Download the Kaggle dataset and place the CSV file into the local data/ directory.
2. Infrastructure as Code (Terraform)
Provision the necessary GCP resources (GCS Bucket, BigQuery Dataset, Service Account, and IAM roles) and extract the authentication key.
cd infrastructure
terraform init
terraform plan
terraform apply -auto-approve
terraform output -raw service_account_key | base64 --decode > ../gcp-pipeline-key.jsonNote: variables.tf should be updated with your project_id before.
Set the credentials for the current session:
export GOOGLE_APPLICATION_CREDENTIALS="$(pwd)/gcp-pipeline-key.json"3. Airflow Initialization
Initialize the Airflow SQLite database and create an admin user for the Web UI.
# Set current project directory as AIRFLOW_HOME
export AIRFLOW_HOME="$(pwd)"
# Initialize the Airflow SQLite database
airflow db migrate
# Create an admin user for the Web UI
airflow users create \
--username admin \
--firstname Data \
--lastname Engineer \
--role Admin \
--email admin@example.com \
--password admin4. Airflow GCP Connection Setup
Add the GCP connection to Airflow's internal metadata database using the CLI to allow operators to interact with GCS and BigQuery.
airflow connections add 'google_cloud_default' \
--conn-type 'google_cloud_platform' \
--conn-extra '{"key_path": "/path/to/project/fraud-detection/gcp-pipeline-key.json", "project": "your-project-name"}'5. Start Airflow Launch the Airflow scheduler and webserver in standalone mode.
airflow standaloneNote: Log in to http://localhost:8080 using the credentials found in simple_auth_manager_passwords.json or the ones you manually configured.
6. Execute the Pipeline
In the Airflow Web UI, unpause and trigger the fraud_detection_pipeline DAG. Monitor the Graph View to ensure all 5 tasks (Spark Generation → Spark ETL → GCS Upload → BigQuery Load → dbt Build) complete successfully.
7. Dashboard Visualization
Connect Google Looker Studio to the fraud_monitor_daily table in BigQuery.
Dashboard by Google Looker Studio
The dashboard visualizes the correlation between daily fraud attempts and transaction error balances, providing an objective overview of systematic anomalies generated by the data pipeline.
This project is licensed under the MIT License.
