This is a capstone project for FA DAE2 focused on Ethereum blockchain data extraction.

The project follows an ELT (Extract, Load, Transform) pipeline:

### 1. Extract Layer (`scripts/el/`)
- **Primary script**: `extract_etherscan.py` - Extracts logs and transactions from the Etherscan API
- Uses the `onchaindata.data_extraction.etherscan` module with `EtherscanClient`
- Supports multiple blockchain networks via the chain-id mapping in `src/onchaindata/config/chainid.json`
- Features automatic retry logic for failed block ranges with exponential backoff (reduces chunk size by 10x)
- Data is stored as Parquet files in the `.data/raw/` directory
- Error tracking in `logging/extract_error/` with an automatic retry mechanism that logs failed ranges to CSV
- Supports K/M/B suffixes for block numbers (e.g., '18.5M' = 18,500,000)
- Additional extraction capability: `extract_graphql.py` for GraphQL-based extraction

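The K/M/B suffix handling can be sketched as a small parser. This is an illustration only; `parse_block_number` is a hypothetical name, not necessarily the helper used in `extract_etherscan.py`:

```python
def parse_block_number(value: str) -> int:
    """Parse a block number that may carry a K/M/B suffix (e.g. '18.5M')."""
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    value = value.strip().upper()
    if value and value[-1] in multipliers:
        # Allow fractional prefixes such as '18.5M'
        return int(float(value[:-1]) * multipliers[value[-1]])
    return int(value)
```

So `parse_block_number('18.5M')` yields `18500000`, matching the example above.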
### 2. Load Layer (`scripts/el/`)
- **load.py**: Unified loader script supporting both PostgreSQL and Snowflake
- Uses the `onchaindata.data_pipeline.Loader` class with pluggable database clients
- Takes arguments: `-f` (file path), `-c` (client: postgres/snowflake), `-s` (schema), `-t` (table), `-w` (write disposition: append/replace/merge)
- Database clients live in `src/onchaindata/utils/`: `PostgresClient`, `SnowflakeClient`

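The loader's CLI surface described above could be wired up with `argparse` along these lines. The option letters follow the bullets; defaults and help text here are assumptions, not the script's actual values:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of load.py's argument parser (hypothetical defaults)."""
    parser = argparse.ArgumentParser(description="Load a Parquet file into a warehouse")
    parser.add_argument("-f", "--file", required=True, help="Path to the Parquet file")
    parser.add_argument("-c", "--client", choices=["postgres", "snowflake"],
                        default="postgres", help="Target database client")
    parser.add_argument("-s", "--schema", default="raw", help="Target schema")
    parser.add_argument("-t", "--table", required=True, help="Target table")
    parser.add_argument("-w", "--write-disposition",
                        choices=["append", "replace", "merge"], default="append")
    return parser
```

A `Loader` would then dispatch on `args.client` to pick `PostgresClient` or `SnowflakeClient`.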
### 3. Transform Layer (dbt)
- **Location**: `dbt_project/`
- Standard dbt project structure with models organized by layer:
  - `models/01_staging/`: Raw data cleanup (e.g., `stg_logs_decoded.sql`)
  - `models/intermediate/`: Business logic transformations
  - `models/marts/`: Final analytics tables
- Materialization strategy:
  - staging: `view`
  - intermediate: `ephemeral`
  - marts: `table`
- Shared macros in `dbt_project/macros/ethereum_macros.sql`:
  - `uint256_to_address`: Extracts Ethereum addresses from uint256 hex strings
  - `uint256_to_numeric`: Converts uint256 hex to numeric values
- Sources defined in `models/01_staging/sources.yml` (references the `raw` schema)
- Configuration: [dbt_project.yml](dbt_project/dbt_project.yml), [profiles.yml](dbt_project/profiles.yml)

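The two macros rest on a standard EVM convention: an address occupies the low-order 20 bytes (40 hex characters) of a 32-byte uint256 slot. The equivalent logic, sketched in Python rather than the actual Jinja/SQL macro bodies:

```python
def uint256_to_address(hex_str: str) -> str:
    """Extract the address from the low-order 20 bytes of a uint256 hex string."""
    h = hex_str.lower().removeprefix("0x")
    # Left-pad to a full 32-byte word, then keep the last 40 hex chars
    return "0x" + h.rjust(64, "0")[-40:]


def uint256_to_numeric(hex_str: str) -> int:
    """Convert a uint256 hex string to an integer value."""
    return int(hex_str, 16)
```

The SQL macros presumably do the same with substring and hex-to-number functions in the target dialect.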
### 4. Package Structure (`src/onchaindata/`)
Reusable Python package with modules:
- `data_extraction/`:
  - `etherscan.py`: `EtherscanClient` with rate limiting
  - `graphql.py`: GraphQL-based extraction
  - `rate_limiter.py`: Rate-limiting utilities
  - `base.py`: Base classes for API clients
- `data_pipeline/`:
  - `loaders.py`: `Loader` class for database operations
- `utils/`:
  - `postgres_client.py`: PostgreSQL client with connection pooling
  - `snowflake_client.py`: Snowflake client
  - `chain.py`: Chain-related utilities
  - `base_client.py`: Base database client interface
- `config/`: Configuration files (`chainid.json`)

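A rate limiter for an API client like this can be as simple as enforcing a minimum interval between calls. A minimal sketch, assuming `rate_limiter.py` does something broadly similar (the class name and approach here are illustrative, not the module's actual implementation):

```python
import time


class RateLimiter:
    """Space out calls so at most `calls_per_second` requests go out per second."""

    def __init__(self, calls_per_second: float = 5.0):
        self.min_interval = 1.0 / calls_per_second
        self._last_call = 0.0

    def wait(self) -> None:
        """Block until enough time has passed since the previous call."""
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_call = time.monotonic()
```

The client would call `limiter.wait()` before each HTTP request to stay under the Etherscan rate cap.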
## Development Commands

### Setup
```bash
uv sync
cp .env.example .env
export $(cat .env | xargs)

# Initialize database schema (if needed)
./scripts/sql_pg.sh ./scripts/sql/init.sql
```

### Data Extraction
```bash
# Extract logs and transactions for a specific contract address
# Supports K/M/B suffixes for block numbers (e.g., '18.5M')
uv run python scripts/el/extract_etherscan.py \
  -c ethereum \
  -a 0x02950460e2b9529d0e00284a5fa2d7bdf3fa4d72 \
  --logs --transactions \
  --from_block 18.5M --to_block 20M \
  -v  # verbose logging

# Extract data from last N days
uv run python scripts/el/extract_etherscan.py \
  -a 0x02950460e2b9529d0e00284a5fa2d7bdf3fa4d72 \
  --logs --transactions \
  --last_n_days 7
```

### Data Loading
```bash
# Load Parquet file to PostgreSQL
uv run python scripts/el/load.py \
  -f .data/raw/ethereum_0xaddress_logs_18500000_20000000.parquet \
  -c postgres \
  -s raw \
  -t logs \
  -w append

# Load to Snowflake (requires SNOWFLAKE_* env vars)
uv run python scripts/el/load.py \
  -f .data/raw/ethereum_0xaddress_logs_18500000_20000000.parquet \
  -c snowflake \
  -s raw \
  -t logs \
  -w append
```

### dbt Operations
```bash
./scripts/dbt.sh docs generate          # Generate documentation
./scripts/dbt.sh run --select staging.* # Run all staging models
./scripts/dbt.sh deps                   # Install dbt packages
```

### SQL Operations
```bash
# Run SQL scripts directly against PostgreSQL
./scripts/sql_pg.sh ./scripts/sql/init.sql

# Ad-hoc queries
./scripts/sql_pg.sh ./scripts/sql/ad_hoc.sql
```

## Environment Variables

Required variables (see `.env.example`):
- `POSTGRES_HOST`, `POSTGRES_PORT`, `POSTGRES_DB`, `POSTGRES_USER`, `POSTGRES_PASSWORD`
- `DB_SCHEMA`: Default schema for operations (e.g., `fa02_staging`)
- `KAFKA_NETWORK_NAME`: Docker network name
- `ETHERSCAN_API_KEY`: For Etherscan API access

Optional (for Snowflake):
- `SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_USER`, `SNOWFLAKE_ROLE`, `SNOWFLAKE_WAREHOUSE`
- `SNOWFLAKE_DATABASE`, `SNOWFLAKE_SCHEMA`, `SNOWFLAKE_PRIVATE_KEY_FILE_PATH`

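For code that talks to PostgreSQL directly, the `POSTGRES_*` variables above are typically assembled into a connection string. A minimal sketch (the helper name and URL shape are assumptions, not necessarily how `PostgresClient` builds its connection):

```python
import os


def postgres_dsn() -> str:
    """Assemble a libpq-style URL from the POSTGRES_* environment variables."""
    return (
        f"postgresql://{os.environ['POSTGRES_USER']}:{os.environ['POSTGRES_PASSWORD']}"
        f"@{os.environ['POSTGRES_HOST']}:{os.environ.get('POSTGRES_PORT', '5432')}"
        f"/{os.environ['POSTGRES_DB']}"
    )
```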
## Key Data Flows

1. **Etherscan → Parquet**: `extract_etherscan.py` extracts blockchain data to `.data/raw/*.parquet`
2. **Parquet → PostgreSQL/Snowflake**: `load.py` loads into `raw` schema tables
3. **PostgreSQL → dbt**: dbt models transform `raw.logs` → `staging.stg_logs_decoded`
4. Failed extractions are logged to `logging/extract_error/` and automatically retried with smaller chunk sizes (10x reduction)

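The retry flow in step 4 can be sketched as splitting a failed block range into ten smaller chunks before re-extraction. Illustrative only; `split_range` is a hypothetical helper, and the real scripts combine this with exponential backoff:

```python
def split_range(start: int, end: int, factor: int = 10) -> list[tuple[int, int]]:
    """Split the half-open block range [start, end) into `factor` smaller sub-ranges."""
    step = max((end - start) // factor, 1)
    bounds = list(range(start, end, step)) + [end]
    # Adjacent bounds become (from_block, to_block) pairs for re-extraction
    return list(zip(bounds[:-1], bounds[1:]))
```

For example, a failed 18.5M–20M range becomes ten 150,000-block chunks, each retried independently.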
## dbt Project Structure

```
dbt_project/
├── dbt_project.yml   # Configuration (project: stablecoins)
├── profiles.yml      # Database connections (dev=postgres, test/prod=snowflake)
├── models/
│   ├── 01_staging/   # Raw data cleanup (materialized as views)
│   │   ├── sources.yml           # Source definitions (raw schema)
│   │   ├── models.yml            # Model documentation
│   │   └── stg_logs_decoded.sql
│   ├── intermediate/ # Business logic (ephemeral)
│   └── marts/        # Final analytics (tables)
├── tests/            # Data quality tests
├── macros/           # ethereum_macros.sql (uint256_to_address, uint256_to_numeric)
└── packages.yml      # dbt dependencies
```

### Model Naming Conventions

## Database Schema

- **raw.logs**: Raw log data with columns: address, topics (JSONB), data, block_number, transaction_hash, time_stamp, etc.
- **raw.transactions**: Transaction data (structure similar to logs)
- **staging.stg_logs_decoded**: Decoded logs with parsed topics (topic0-topic3), indexed on (contract_address, transaction_hash, index)
- dbt creates additional staging/intermediate/mart tables based on models in `dbt_project/models/`

## Project Structure Notes

- Runnable scripts live ONLY in the `scripts/` directory (organized as `scripts/el/` for extract/load)
- Reusable code is packaged in `src/onchaindata/` as an installable package
- dbt project located at `dbt_project/` with standard structure (staging → intermediate → marts)
- Data files: `.data/raw/` for extracted data, `sampledata/` for examples
- Always run Python scripts with `uv run python` (not direct python)
- Project uses `uv` for dependency management (see `pyproject.toml`)