
Commit 06cac27

add more docs
1 parent 1a89fa9 commit 06cac27

File tree

13 files changed: +567 −173 lines changed

CLAUDE.md

Lines changed: 69 additions & 50 deletions
@@ -10,38 +10,53 @@ This is a capstone project for FA DAE2 focused on Ethereum blockchain data extra
 
 The project follows an ELT (Extract, Load, Transform) pipeline:
 
-### 1. Extract Layer (`scripts/extract/`)
-- **Primary script**: `runner.py` - Extracts logs and transactions from Etherscan API
-- Uses the `onchaindata.data_extraction.etherscan` module
-- Supports multiple blockchain networks via `EtherscanClient`
-- Features automatic retry logic for failed block ranges with exponential backoff
+### 1. Extract Layer (`scripts/el/`)
+- **Primary script**: `extract_etherscan.py` - Extracts logs and transactions from the Etherscan API
+- Uses the `onchaindata.data_extraction.etherscan` module with `EtherscanClient`
+- Supports multiple blockchain networks via the chain-id mapping in `src/onchaindata/config/chainid.json`
+- Features automatic retry logic for failed block ranges with exponential backoff (reduces chunk size by 10x)
 - Data stored as Parquet files in `.data/raw/` directory
-- Error tracking in `logging/extract_error/` with automatic retry mechanism
+- Error tracking in `logging/extract_error/` with an automatic retry mechanism that logs failed ranges to CSV
+- Supports K/M/B suffixes for block numbers (e.g., '18.5M' = 18,500,000)
+- Additional extraction capabilities: `extract_graphql.py` for GraphQL-based extraction
 
-### 2. Load Layer (`scripts/load/`)
-- **postgres_load.py**: Loads Parquet files into PostgreSQL `raw` schema
-- **snowflake_load.py**: Optional Snowflake loading capabilities
-- Uses `onchaindata.data_pipeline` module for loading operations
-- Supports both dlt-based and direct psycopg-based loading
+### 2. Load Layer (`scripts/el/`)
+- **load.py**: Unified loader script supporting both PostgreSQL and Snowflake
+- Uses the `onchaindata.data_pipeline.Loader` class with pluggable database clients
+- Takes arguments: `-f` (file path), `-c` (client: postgres/snowflake), `-s` (schema), `-t` (table), `-w` (write disposition: append/replace/merge)
+- Database clients in `src/onchaindata/utils/`: `PostgresClient`, `SnowflakeClient`
 
 ### 3. Transform Layer (dbt)
 - **Location**: `dbt_project/`
 - Standard dbt project structure with models organized by layer:
-  - `models/staging/`: Raw data cleanup (e.g., `stg_logs_decoded`)
+  - `models/01_staging/`: Raw data cleanup (e.g., `stg_logs_decoded.sql`)
   - `models/intermediate/`: Business logic transformations
   - `models/marts/`: Final analytics tables
-- Shared macros in `dbt_project/macros/` for Ethereum data type conversions:
+- Materialization strategy:
+  - staging: `view`
+  - intermediate: `ephemeral`
+  - marts: `table`
+- Shared macros in `dbt_project/macros/ethereum_macros.sql`:
   - `uint256_to_address`: Extracts Ethereum addresses from uint256 hex strings
   - `uint256_to_numeric`: Converts uint256 hex to numeric values
-- Models reference source data from `raw` schema
+- Sources defined in `models/01_staging/sources.yml` (references the `raw` schema)
 - Configuration: [dbt_project.yml](dbt_project/dbt_project.yml), [profiles.yml](dbt_project/profiles.yml)
 
 ### 4. Package Structure (`src/onchaindata/`)
 Reusable Python package with modules:
-- `data_extraction/`: Etherscan API client with rate limiting
-- `data_pipeline/`: Postgres and Snowflake loading utilities
-- `utils/`: Database clients (PostgresClient, SnowflakeClient)
-- `config/`: Configuration management
+- `data_extraction/`:
+  - `etherscan.py`: `EtherscanClient` with rate limiting
+  - `graphql.py`: GraphQL-based extraction
+  - `rate_limiter.py`: Rate limiting utilities
+  - `base.py`: Base classes for API clients
+- `data_pipeline/`:
+  - `loaders.py`: `Loader` class for database operations
+- `utils/`:
+  - `postgres_client.py`: PostgreSQL client with connection pooling
+  - `snowflake_client.py`: Snowflake client
+  - `chain.py`: Chain-related utilities
+  - `base_client.py`: Base database client interface
+- `config/`: Configuration files (`chainid.json`)
 
 ## Development Commands
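The chunked-retry behavior described in the extract layer (shrink the chunk size by 10x and retry the failed block range) can be sketched as follows. This is illustrative only; the function name and signature are assumptions, not the project's actual `extract_etherscan.py` code.

```python
# Illustrative sketch (not the project's actual code) of the retry strategy:
# extract a block range in chunks, and on failure re-queue the failed range
# with a chunk size reduced by 10x.

def extract_range(fetch, start, end, chunk_size, min_chunk=1):
    """Return the (from_block, to_block) ranges successfully fetched.

    `fetch(a, b)` is assumed to raise on failure (e.g. an API timeout);
    the failed range is retried with chunk_size // 10.
    """
    done = []
    pending = [(start, end, chunk_size)]
    while pending:
        lo, hi, size = pending.pop()
        for a in range(lo, hi + 1, size):
            b = min(a + size - 1, hi)
            try:
                fetch(a, b)
                done.append((a, b))
            except Exception:
                smaller = max(size // 10, min_chunk)
                if smaller == size:
                    raise  # cannot shrink further; the real pipeline logs this to CSV
                pending.append((a, b, smaller))
    return sorted(done)


def flaky_fetch(a, b):
    # Stand-in for the API call: pretend ranges over 100 blocks are rejected.
    if b - a + 1 > 100:
        raise RuntimeError("range too large")


ranges = extract_range(flaky_fetch, 0, 999, chunk_size=1000)
```

After the first 1000-block attempt fails, the range is re-queued at 100-block granularity and fully covered.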

@@ -60,23 +75,23 @@ uv sync
 cp .env.example .env
 export $(cat .env | xargs)
 
-# Initialize database schema
-./scripts/sql/run_sql.sh ./scripts/sql/init.sql
+# Initialize database schema (if needed)
+./scripts/sql_pg.sh ./scripts/sql/init.sql
 ```
 
 ### Data Extraction
 ```bash
 # Extract logs and transactions for a specific contract address
 # Supports K/M/B suffixes for block numbers (e.g., '18.5M')
-uv run python scripts/extract/runner.py \
+uv run python scripts/el/extract_etherscan.py \
   -c ethereum \
   -a 0x02950460e2b9529d0e00284a5fa2d7bdf3fa4d72 \
   --logs --transactions \
   --from_block 18.5M --to_block 20M \
   -v  # verbose logging
 
 # Extract data from last N days
-uv run python scripts/extract/runner.py \
+uv run python scripts/el/extract_etherscan.py \
   -a 0x02950460e2b9529d0e00284a5fa2d7bdf3fa4d72 \
   --logs --transactions \
   --last_n_days 7
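The K/M/B shorthand accepted by `--from_block`/`--to_block` follows a simple convention ('18.5M' = 18,500,000). This is an illustrative re-implementation of that convention, not the script's actual parser:

```python
# Hypothetical helper mirroring the K/M/B block-number shorthand the
# extraction script accepts; the real parser lives in the project.

_SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_block_number(value: str) -> int:
    """Parse '18.5M' -> 18_500_000, '750K' -> 750_000, '12345' -> 12345."""
    value = value.strip().upper()
    if value and value[-1] in _SUFFIXES:
        return int(float(value[:-1]) * _SUFFIXES[value[-1]])
    return int(value)
```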
@@ -87,12 +102,20 @@ uv run python scripts/extract/runner.py \
 ### Data Loading
 ```bash
 # Load Parquet file to PostgreSQL
-uv run python scripts/load/postgres_load.py \
+uv run python scripts/el/load.py \
   -f .data/raw/ethereum_0xaddress_logs_18500000_20000000.parquet \
-  -t logs
+  -c postgres \
+  -s raw \
+  -t logs \
+  -w append
 
 # Load to Snowflake (requires SNOWFLAKE_* env vars)
-uv run python scripts/load/snowflake_load.py
+uv run python scripts/el/load.py \
+  -f .data/raw/ethereum_0xaddress_logs_18500000_20000000.parquet \
+  -c snowflake \
+  -s raw \
+  -t logs \
+  -w append
 ```
 
 ### dbt Operations
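A rough sketch of what the three `-w` write dispositions mean, using an in-memory list in place of a real table. The actual `Loader` performs these against PostgreSQL or Snowflake, and treating `merge` as an upsert on a key column is an assumption here:

```python
# Illustrative semantics of the append / replace / merge write dispositions.
# `table` is a list of dict rows standing in for a database table.

def write(table, rows, disposition, key=None):
    if disposition == "replace":
        table.clear()          # drop existing rows, keep only the new batch
        table.extend(rows)
    elif disposition == "append":
        table.extend(rows)     # add the new batch unconditionally
    elif disposition == "merge":
        existing = {r[key]: r for r in table}
        for r in rows:
            existing[r[key]] = r   # upsert on the key column (assumed)
        table[:] = list(existing.values())
    else:
        raise ValueError(f"unknown disposition: {disposition}")
    return table
```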
@@ -111,18 +134,15 @@ uv run python scripts/load/snowflake_load.py
 ./scripts/dbt.sh docs generate  # Generate documentation
 ./scripts/dbt.sh run --select staging.*  # Run all staging models
 ./scripts/dbt.sh deps  # Install dbt packages
-
-# Legacy script (still available for backward compatibility)
-./scripts/run_dbt.sh staging run
 ```
 
 ### SQL Operations
 ```bash
-# Run SQL scripts directly
-./scripts/sql/run_sql.sh ./scripts/sql/init.sql
+# Run SQL scripts directly against PostgreSQL
+./scripts/sql_pg.sh ./scripts/sql/init.sql
 
 # Ad-hoc queries
-./scripts/sql/run_sql.sh ./scripts/sql/ad_hoc.sql
+./scripts/sql_pg.sh ./scripts/sql/ad_hoc.sql
 ```
 
 ## Environment Variables
@@ -131,36 +151,35 @@ Required variables (see `.env.example`):
 - `POSTGRES_HOST`, `POSTGRES_PORT`, `POSTGRES_DB`, `POSTGRES_USER`, `POSTGRES_PASSWORD`
 - `DB_SCHEMA`: Default schema for operations (e.g., `fa02_staging`)
 - `KAFKA_NETWORK_NAME`: Docker network name
+- `ETHERSCAN_API_KEY`: For Etherscan API access
 
 Optional (for Snowflake):
 - `SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_USER`, `SNOWFLAKE_ROLE`, `SNOWFLAKE_WAREHOUSE`
 - `SNOWFLAKE_DATABASE`, `SNOWFLAKE_SCHEMA`, `SNOWFLAKE_PRIVATE_KEY_FILE_PATH`
 
 ## Key Data Flows
 
-1. **Etherscan → Parquet**: `runner.py` extracts blockchain data to `.data/raw/*.parquet`
-2. **Parquet → PostgreSQL**: `postgres_load.py` loads into `raw` schema tables
+1. **Etherscan → Parquet**: `extract_etherscan.py` extracts blockchain data to `.data/raw/*.parquet`
+2. **Parquet → PostgreSQL/Snowflake**: `load.py` loads into `raw` schema tables
 3. **PostgreSQL → dbt**: dbt models transform `raw.logs` → `staging.stg_logs_decoded`
-4. Failed extractions are logged to `logging/extract_error/` and automatically retried with smaller chunk sizes
+4. Failed extractions are logged to `logging/extract_error/` and automatically retried with smaller chunk sizes (10x reduction)
 
 ## dbt Project Structure
 
 ```
 dbt_project/
-├── dbt_project.yml    # Configuration
-├── profiles.yml       # Database connections
+├── dbt_project.yml    # Configuration (project: stablecoins)
+├── profiles.yml       # Database connections (dev=postgres, test/prod=snowflake)
 ├── models/
-│   ├── staging/       # Raw data cleanup
-│   │   ├── _staging__sources.yml
-│   │   ├── _staging__models.yml
+│   ├── 01_staging/    # Raw data cleanup (materialized as views)
+│   │   ├── sources.yml    # Source definitions (raw schema)
+│   │   ├── models.yml     # Model documentation
 │   │   └── stg_logs_decoded.sql
-│   ├── intermediate/  # Business logic transformations
-│   └── marts/         # Final analytics tables
+│   ├── intermediate/  # Business logic (ephemeral)
+│   └── marts/         # Final analytics (tables)
 ├── tests/             # Data quality tests
-│   ├── test_valid_address.sql
-│   └── test_block_number_range.sql
-├── macros/            # Reusable SQL (ethereum_macros.sql)
-└── packages.yml       # dbt dependencies (dbt_utils)
+├── macros/            # ethereum_macros.sql (uint256_to_address, uint256_to_numeric)
+└── packages.yml       # dbt dependencies
 ```
 
 ### Model Naming Conventions
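For reference, the two macros listed above perform conversions equivalent to this Python sketch (the macros themselves emit SQL, not Python): an ABI-encoded uint256 word is 32 bytes (64 hex chars), and an address is its low 20 bytes.

```python
# Python equivalents of the uint256_to_address / uint256_to_numeric dbt macros.
# Sketch only -- shows the hex arithmetic, not the actual SQL the macros emit.

def uint256_to_address(word: str) -> str:
    """Take the low 20 bytes (40 hex chars) of a 32-byte word as an address."""
    h = word.removeprefix("0x").rjust(64, "0")
    return "0x" + h[-40:]


def uint256_to_numeric(word: str) -> int:
    """Interpret a uint256 hex string as an integer."""
    return int(word.removeprefix("0x"), 16)
```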
@@ -171,16 +190,16 @@ dbt_project/
 
 ## Database Schema
 
-- **raw.logs**: Raw log data with columns: address, topics (JSONB), data, block_number, transaction_hash, etc.
+- **raw.logs**: Raw log data with columns: address, topics (JSONB), data, block_number, transaction_hash, time_stamp, etc.
 - **raw.transactions**: Transaction data (structure similar to logs)
-- **staging.stg_logs_decoded**: Decoded logs with parsed topics (topic0-topic3)
+- **staging.stg_logs_decoded**: Decoded logs with parsed topics (topic0-topic3), indexed on (contract_address, transaction_hash, index)
 - dbt creates additional staging/intermediate/mart tables based on models in `dbt_project/models/`
 
 ## Project Structure Notes
 
-- Runnable scripts are ONLY in `scripts/` directory
-- Reusable code is packaged in `src/onchaindata/`
+- Runnable scripts are ONLY in `scripts/` directory (organized as `scripts/el/` for extract/load)
+- Reusable code is packaged in `src/onchaindata/` as an installable package
 - dbt project located at `dbt_project/` with standard structure (staging → intermediate → marts)
 - Data files: `.data/raw/` for extracted data, `sampledata/` for examples
 - Always run Python scripts with `uv run python` (not direct python)
-- Legacy `dbt_subprojects/` directory retained for reference (can be removed after migration)
+- Project uses `uv` for dependency management (see `pyproject.toml`)

docs/02_data_pipeline/01_source.md

Lines changed: 35 additions & 4 deletions
@@ -1,9 +1,40 @@
+## Primary Source: HyperIndex (Envio)
+
 ### `Transfer` data
-Raw transaction data is indexed with [HyperIndex](https://docs.envio.dev/docs/HyperIndex/overview), a blockchain indexing framework that transforms on-chain events into structured, queryable databases with GraphQL APIs.
-- To run the indexer
+Raw transaction data is indexed with [HyperIndex](https://docs.envio.dev/docs/HyperIndex/overview), a blockchain indexing framework that transforms on-chain events into structured, queryable databases with GraphQL APIs.
+
+**To run the indexer:**
 ```bash
 git clone https://github.com/newgnart/envio-stablecoins.git
 pnpm dev
 ```
-More details on [envio-stablecoins](https://github.com/newgnart/envio-stablecoins)
-### Wallet labels data
+
+**Benefits:**
+- ✅ Real-time continuous indexing
+- ✅ Structured GraphQL queries
+- ✅ Multiple contracts and events simultaneously
+- ✅ No API rate limits
+
+More details: [envio-stablecoins](https://github.com/newgnart/envio-stablecoins)
+
+---
+
+## Alternative Source: Etherscan API (Optional)
+
+The repository includes Etherscan API extraction tools (`scripts/el/extract_etherscan.py`) as an alternative data source. While not used in the primary pipeline, it's useful for:
+
+- Historical data extraction and validation
+- Supporting additional EVM chains
+- Ad-hoc analysis without running the indexer
+
+**Trade-offs:**
+- ✅ Flexible, no infrastructure needed
+- ✅ 50+ EVM chains supported
+- ❌ Rate limited (5 req/sec on free tier)
+- ❌ Requires API key
+
+For detailed usage, see [Additional Tools](../06_additional_tools.md).
+
+---
+
+## Wallet labels data
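The free-tier rate limit noted above (5 req/sec) implies client-side throttling along these lines. This is a minimal sketch; the package's own `rate_limiter.py` may be implemented differently:

```python
# Minimal client-side rate limiter sketch: enforce a minimum interval
# between successive API calls (e.g. 1/5 s for Etherscan's free tier).

import time


class RateLimiter:
    def __init__(self, max_per_second: float):
        self.min_interval = 1.0 / max_per_second
        self._last = 0.0

    def wait(self):
        """Sleep just long enough to respect the configured rate."""
        now = time.monotonic()
        sleep_for = self._last + self.min_interval - now
        if sleep_for > 0:
            time.sleep(sleep_for)
        self._last = time.monotonic()
```

Calling `wait()` before each request spaces calls at least `min_interval` apart.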

docs/02_data_pipeline/03_transformation.md

Lines changed: 3 additions & 0 deletions
@@ -40,4 +40,7 @@ uv run scripts/el/stream_graphql.py \
 - Currently supported databases:
   - PostgreSQL
   - Snowflake
+- Data Schema Reference
+
+## dbt models

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+## GraphQL
+Utilities for getting data from the Envio GraphQL API endpoint.
+
+### `GraphQLBatch`
+
+Extracts data from a GraphQL endpoint and saves it to a Parquet file.
+
+**Parameters:**
+
+- `endpoint` (str): GraphQL endpoint URL
+- `query` (str): GraphQL query string
+
+**Methods:**
+
+- `extract()`: Execute the GraphQL query and return results as a dictionary
+- `extract_to_dataframe()`: Execute the GraphQL query and return results as a Polars DataFrame
+
+
+### `GraphQLStream`
+
+Streams data from a GraphQL endpoint and pushes it to a database directly.
+
+**Parameters:**
+
+- `endpoint` (str): GraphQL endpoint URL
+- `table_name` (str): Name of the table (GraphQL table) to fetch
+- `fields` (list): List of fields to fetch
+- `poll_interval` (int): Polling interval in seconds
+
+**Methods:**
+
+- `stream()`: Stream data from the GraphQL endpoint and push it to the database directly
+  - Arguments:
+    - `loader` (Loader): Loader instance for database operations
+    - `schema` (str): Target schema name
+    - `table_name` (str): Target table name
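A polling loop of the shape `stream()` describes could look like the following sketch. The `fetch_page`/`load_batch` callables and the cursor handling here are assumptions for illustration, not the class's actual internals:

```python
# Hypothetical polling loop: fetch rows newer than the last seen cursor,
# hand each batch to a loader, then sleep for the poll interval.

import time


def stream(fetch_page, load_batch, poll_interval=5, cursor_field="id", max_polls=None):
    """Repeatedly fetch rows past `cursor` and load them; return the last cursor."""
    cursor = None
    polls = 0
    while max_polls is None or polls < max_polls:
        rows = fetch_page(cursor)
        if rows:
            load_batch(rows)                 # e.g. Loader writing to the target table
            cursor = rows[-1][cursor_field]  # advance past the newest loaded row
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(poll_interval)
    return cursor


# Demo with in-memory fakes in place of the GraphQL endpoint and database.
events = [{"id": i} for i in range(1, 6)]
loaded = []


def fake_fetch(cursor):
    lo = 0 if cursor is None else cursor
    return [e for e in events if e["id"] > lo][:2]  # page size 2


last = stream(fake_fetch, loaded.extend, poll_interval=0, max_polls=3)
```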
