|
1 | | -# 1. Set up the environment |
| 1 | +# Ethereum Blockchain Data Analytics Platform |
2 | 2 |
|
3 | | -## Postgres with Docker |
| 3 | +Capstone project for the [Foundry AI Academy](https://www.foundry.academy/) Data & AI Engineering program: an ELT pipeline that extracts, loads, and transforms Ethereum blockchain data, with a focus on stablecoin analytics.
4 | 4 |
|
| 5 | +Inspired by [Visa on Chain Analytics](https://visaonchainanalytics.com/). |
| 6 | + |
| 7 | +## Quick Start |
| 8 | + |
| 9 | +### Prerequisites |
5 | 10 | ```bash |
| 11 | +# Create Docker network |
6 | 12 | docker network create fa-dae2-capstone_kafka_network |
| 13 | + |
| 14 | +# Start PostgreSQL |
7 | 15 | docker-compose up -d |
8 | | -``` |
9 | | -## Python environment |
10 | | -The project structured as a package in `src/capstone_package` directory. Runnable scripts are in `scripts` directory only. |
11 | 16 |
|
12 | | -Install dependencies using uv: |
13 | | -```bash |
| 17 | +# Install dependencies |
14 | 18 | uv sync |
15 | | -``` |
16 | 19 |
|
17 | | -## Initialize the database |
| 20 | +# Set up environment variables
| 21 | +cp .env.example .env |
| 22 | +export $(cat .env | xargs) |
| 23 | +``` |
18 | 24 |
|
19 | | -### Set environment variables |
| 25 | +### Extract Data |
| 26 | +```bash |
| 27 | +# Extract logs and transactions from Etherscan |
| 28 | +uv run python scripts/el/extract_etherscan.py \ |
| 29 | + -c ethereum \ |
| 30 | + -a 0x02950460e2b9529d0e00284a5fa2d7bdf3fa4d72 \ |
| 31 | + --logs --transactions \ |
| 32 | + --from_block 18.5M --to_block 20M \ |
| 33 | + -v |
| 34 | +``` |
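The `18.5M` / `20M` arguments above are block-number shortcuts for raw block numbers. A minimal sketch of how such shortcuts can be parsed (illustrative only; `parse_block` is a hypothetical name and the actual logic in `extract_etherscan.py` may differ):

```python
def parse_block(value: str) -> int:
    """Parse a block-number shortcut like '18.5M' or '250K' into an integer."""
    multipliers = {"K": 1_000, "M": 1_000_000}
    suffix = value[-1].upper()
    if suffix in multipliers:
        # Strip the suffix, scale the numeric part, truncate to an int
        return int(float(value[:-1]) * multipliers[suffix])
    return int(value)  # plain numbers pass through unchanged

print(parse_block("18.5M"))  # 18500000
```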
20 | 35 |
|
21 | | -- Copy `.env.example` to `.env` |
| 36 | +### Load Data |
22 | 37 | ```bash |
23 | | -cp .env.example .env |
| 38 | +# Load Parquet to PostgreSQL |
| 39 | +uv run python scripts/el/load.py \ |
| 40 | + -f .data/raw/ethereum_0xaddress_logs_18500000_20000000.parquet \ |
| 41 | + -c postgres \ |
| 42 | + -s raw \ |
| 43 | + -t logs \ |
| 44 | + -w append |
24 | 45 | ``` |
25 | | -- Set environment variables |
| 46 | + |
| 47 | +### Transform Data |
26 | 48 | ```bash |
27 | | -export $(cat .env | xargs) |
| 49 | +# Run dbt models |
| 50 | +./scripts/dbt.sh run |
| 51 | + |
| 52 | +# Run specific model |
| 53 | +./scripts/dbt.sh run --select stg_logs_decoded |
28 | 54 | ``` |
29 | 55 |
|
30 | | -### The data |
31 | | -- Log and transaction data of a smart contract [0x02950460e2b9529d0e00284a5fa2d7bdf3fa4d72](https://etherscan.io/address/0x02950460e2b9529d0e00284a5fa2d7bdf3fa4d72) on Ethereum. |
32 | | -- Whole loading data in is parquet format |
33 | | -- Example in json format: |
34 | | - - [logs.json](data/ethereum_0x02950460e2b9529d0e00284a5fa2d7bdf3fa4d72/logs.json) |
35 | | - - [transactions.json](data/ethereum_0x02950460e2b9529d0e00284a5fa2d7bdf3fa4d72/transactions.json) |
| 56 | +## Architecture |
36 | 57 |
|
37 | | -### Load data to Postgres |
| 58 | +**Extract** → **Load** → **Transform** |
38 | 59 |
|
39 | | -There are two ways to load data to Postgres: |
| 60 | +1. **Extract** (`scripts/el/extract_etherscan.py`): Pulls blockchain data from Etherscan API to `.data/raw/*.parquet` |
| 61 | +2. **Load** (`scripts/el/load.py`): Loads Parquet files into PostgreSQL/Snowflake `raw` schema |
| 62 | +3. **Transform** (`dbt_project/`): dbt models transform raw data into analytics-ready tables |
40 | 63 |
|
41 | | -1. Using DLT |
42 | | -dlt will automatically create the table and load data to it. |
43 | | -```bash |
44 | | -python scripts/data_loading/postgres_loader.py |
| 64 | +### Project Structure |
| 65 | +``` |
| 66 | +├── scripts/el/ # Extract & Load scripts |
| 67 | +├── src/onchaindata/ # Reusable Python package |
| 68 | +│ ├── data_extraction/ # Etherscan/GraphQL clients |
| 69 | +│ ├── data_pipeline/ # Loader classes |
| 70 | +│ └── utils/ # Database clients |
| 71 | +├── dbt_project/ # dbt transformation layer |
| 72 | +│ ├── models/01_staging/ # Raw data cleanup (views) |
| 73 | +│   ├── models/intermediate/ # Business logic (ephemeral)
| 74 | +│ └── models/marts/ # Analytics tables (tables) |
| 75 | +└── .data/raw/ # Extracted Parquet files |
45 | 76 | ``` |
46 | | -**Note**: for non-standard data types e.g. json, use [apply_hints](scripts/data_loading/postgres_loader.py#L28) to define the data type. |
47 | 77 |
|
48 | | -1. Without DLT, using psycopg to load data. |
| 78 | +## Key Features |
| 79 | + |
| 80 | +- **Multi-chain support**: Ethereum, Polygon, and BSC via chain ID mapping
| 81 | +- **Automatic retry**: Failed extractions are retried with 10x smaller chunks
| 82 | +- **Flexible loading**: PostgreSQL and Snowflake support |
| 83 | +- **Block number shortcuts**: Use `18.5M` instead of `18500000` |
| 84 | +- **dbt transformations**: Staging → Intermediate → Marts layers |
| 85 | + |
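The automatic-retry feature can be sketched as follows. This is a simplified illustration under an assumed behavior (retry the failed range with a 10x smaller chunk size), not the package's actual code; `fetch` stands in for an Etherscan API call:

```python
def extract_range(fetch, start, end, chunk=100_000, min_chunk=10):
    """Fetch blocks [start, end] in chunks; on failure, retry with 10x smaller chunks."""
    results = []
    pos = start
    while pos <= end:
        hi = min(pos + chunk - 1, end)
        try:
            results.extend(fetch(pos, hi))
            pos = hi + 1  # advance only after a successful fetch
        except Exception:
            if chunk <= min_chunk:
                raise  # give up once chunks cannot shrink further
            chunk //= 10  # retry the same range with a smaller chunk size
    return results
```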
| 86 | +## Environment Variables |
| 87 | + |
| 88 | +Required (see `.env.example`): |
| 89 | +- `POSTGRES_*`: Database connection |
| 90 | +- `ETHERSCAN_API_KEY`: API access |
| 91 | +- `DB_SCHEMA`: Default schema |
| 92 | + |
| 93 | +Optional (for Snowflake): |
| 94 | +- `SNOWFLAKE_*`: Snowflake connection details |
| 95 | + |
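A hypothetical `.env` might look like the fragment below. The exact variable names are assumptions for illustration; `.env.example` in the repository is the authoritative list.

```bash
# Illustrative .env contents -- variable names are assumed, not confirmed.
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=capstone
POSTGRES_PASSWORD=changeme
POSTGRES_DB=onchaindata
DB_SCHEMA=raw
ETHERSCAN_API_KEY=your_api_key_here
```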
| 96 | +## Common Commands |
49 | 97 |
|
50 | | -- Initialize the table manually |
51 | 98 | ```bash |
52 | | -./scripts/sql/run_sql.sh ./scripts/sql/init.sql; |
| 99 | +# SQL operations |
| 100 | +./scripts/sql_pg.sh ./scripts/sql/init.sql |
| 101 | + |
| 102 | +# dbt operations |
| 103 | +./scripts/dbt.sh test # Run tests |
| 104 | +./scripts/dbt.sh docs generate # Generate docs |
| 105 | +./scripts/dbt.sh run --select staging.* # Run staging models |
| 106 | + |
| 107 | +# Extract with time range |
| 108 | +uv run python scripts/el/extract_etherscan.py \ |
| 109 | + -a 0x02950460e2b9529d0e00284a5fa2d7bdf3fa4d72 \ |
| 110 | + --logs --transactions \ |
| 111 | + --last_n_days 7 |
53 | 112 | ``` |
54 | 113 |
|
55 | | -- Use `load_parquet_to_postgres_wo_dlt` function in [postgres_loader.py](scripts/data_loading/postgres_loader.py) |
| 114 | +## Database Schema |
56 | 115 |
|
57 | | -### Load data to Snowflake |
| 116 | +- `raw.logs`: Raw event logs with JSONB topics |
| 117 | +- `raw.transactions`: Transaction data |
| 118 | +- `staging.stg_logs_decoded`: Decoded logs with parsed topics (topic0-topic3) |
| 119 | +- Marts: Analytics tables created by dbt |
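In EVM logs, `topic0` is the event signature hash and `topic1`-`topic3` hold indexed parameters. A minimal Python sketch of the topic-splitting that `stg_logs_decoded` performs (the actual model is dbt SQL; this only illustrates the transformation, and the field names mirror the columns listed above):

```python
def split_topics(topics):
    """Expand a log's variable-length topics array into fixed topic0..topic3 fields."""
    padded = (list(topics) + [None] * 4)[:4]  # pad with None, then truncate to 4
    return dict(zip(["topic0", "topic1", "topic2", "topic3"], padded))

# Example: a Transfer-style log with a signature hash and two indexed params
row = split_topics(["0xddf252ad", "0xsender", "0xreceiver"])
print(row["topic3"])  # None
```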
58 | 120 |
|
59 | | -1. raw data stored in `database/RAW_DATA.JSON_STAGE` |
60 | | -Use `upload_file_to_stage` function in [snowflake_loader.py](scripts/data_loading/snowflake_loader.py) to upload the data to Snowflake. |
61 | | -```bash |
62 | | -python scripts/data_loading/snowflake_loading.py |
63 | | -``` |
| 121 | +## Documentation |
| 122 | + |
| 123 | +For detailed documentation, see [CLAUDE.md](CLAUDE.md) or the [docs/](docs/) directory. |