
Commit 06cac27

add more docs
1 parent 1a89fa9 commit 06cac27

File tree

13 files changed: +567 −173 lines changed

CLAUDE.md

Lines changed: 69 additions & 50 deletions
@@ -10,38 +10,53 @@ This is a capstone project for FA DAE2 focused on Ethereum blockchain data extra
 
 The project follows an ELT (Extract, Load, Transform) pipeline:
 
-### 1. Extract Layer (`scripts/extract/`)
-- **Primary script**: `runner.py` - Extracts logs and transactions from Etherscan API
-- Uses the `onchaindata.data_extraction.etherscan` module
-- Supports multiple blockchain networks via `EtherscanClient`
-- Features automatic retry logic for failed block ranges with exponential backoff
+### 1. Extract Layer (`scripts/el/`)
+- **Primary script**: `extract_etherscan.py` - Extracts logs and transactions from the Etherscan API
+- Uses the `onchaindata.data_extraction.etherscan` module with `EtherscanClient`
+- Supports multiple blockchain networks via the chain-id mapping in `src/onchaindata/config/chainid.json`
+- Features automatic retry logic for failed block ranges with exponential backoff (reduces chunk size by 10x)
 - Data stored as Parquet files in `.data/raw/` directory
-- Error tracking in `logging/extract_error/` with automatic retry mechanism
+- Error tracking in `logging/extract_error/` with an automatic retry mechanism that logs failed ranges to CSV
+- Supports K/M/B suffixes for block numbers (e.g., '18.5M' = 18,500,000)
+- Additional extraction capabilities: `extract_graphql.py` for GraphQL-based extraction
 
-### 2. Load Layer (`scripts/load/`)
-- **postgres_load.py**: Loads Parquet files into PostgreSQL `raw` schema
-- **snowflake_load.py**: Optional Snowflake loading capabilities
-- Uses `onchaindata.data_pipeline` module for loading operations
-- Supports both dlt-based and direct psycopg-based loading
+### 2. Load Layer (`scripts/el/`)
+- **load.py**: Unified loader script supporting both PostgreSQL and Snowflake
+- Uses the `onchaindata.data_pipeline.Loader` class with pluggable database clients
+- Takes arguments: `-f` (file path), `-c` (client: postgres/snowflake), `-s` (schema), `-t` (table), `-w` (write disposition: append/replace/merge)
+- Database clients in `src/onchaindata/utils/`: `PostgresClient`, `SnowflakeClient`
 
 ### 3. Transform Layer (dbt)
 - **Location**: `dbt_project/`
 - Standard dbt project structure with models organized by layer:
-  - `models/staging/`: Raw data cleanup (e.g., `stg_logs_decoded`)
+  - `models/01_staging/`: Raw data cleanup (e.g., `stg_logs_decoded.sql`)
   - `models/intermediate/`: Business logic transformations
   - `models/marts/`: Final analytics tables
-- Shared macros in `dbt_project/macros/` for Ethereum data type conversions:
+- Materialization strategy:
+  - staging: `view`
+  - intermediate: `ephemeral`
+  - marts: `table`
+- Shared macros in `dbt_project/macros/ethereum_macros.sql`:
   - `uint256_to_address`: Extracts Ethereum addresses from uint256 hex strings
   - `uint256_to_numeric`: Converts uint256 hex to numeric values
-- Models reference source data from `raw` schema
+- Sources defined in `models/01_staging/sources.yml` (references the `raw` schema)
 - Configuration: [dbt_project.yml](dbt_project/dbt_project.yml), [profiles.yml](dbt_project/profiles.yml)
 
 ### 4. Package Structure (`src/onchaindata/`)
 Reusable Python package with modules:
-- `data_extraction/`: Etherscan API client with rate limiting
-- `data_pipeline/`: Postgres and Snowflake loading utilities
-- `utils/`: Database clients (PostgresClient, SnowflakeClient)
-- `config/`: Configuration management
+- `data_extraction/`:
+  - `etherscan.py`: `EtherscanClient` with rate limiting
+  - `graphql.py`: GraphQL-based extraction
+  - `rate_limiter.py`: Rate limiting utilities
+  - `base.py`: Base classes for API clients
+- `data_pipeline/`:
+  - `loaders.py`: `Loader` class for database operations
+- `utils/`:
+  - `postgres_client.py`: PostgreSQL client with connection pooling
+  - `snowflake_client.py`: Snowflake client
+  - `chain.py`: Chain-related utilities
+  - `base_client.py`: Base database client interface
+- `config/`: Configuration files (`chainid.json`)
 
 ## Development Commands
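The chunked-retry behavior described in the extract layer (shrink the chunk size by 10x and retry the failed block range) can be sketched as follows. This is illustrative only; the function name and signature are assumptions, not the project's actual `extract_etherscan.py` code.

```python
# Illustrative sketch (not the project's actual code) of the retry strategy:
# extract a block range in chunks, and on failure re-queue the failed range
# with a chunk size reduced by 10x.

def extract_range(fetch, start, end, chunk_size, min_chunk=1):
    """Return the (from_block, to_block) ranges successfully fetched.

    `fetch(a, b)` is assumed to raise on failure (e.g. an API timeout);
    the failed range is retried with chunk_size // 10.
    """
    done = []
    pending = [(start, end, chunk_size)]
    while pending:
        lo, hi, size = pending.pop()
        for a in range(lo, hi + 1, size):
            b = min(a + size - 1, hi)
            try:
                fetch(a, b)
                done.append((a, b))
            except Exception:
                smaller = max(size // 10, min_chunk)
                if smaller == size:
                    raise  # cannot shrink further; the real pipeline logs this to CSV
                pending.append((a, b, smaller))
    return sorted(done)


def flaky_fetch(a, b):
    # Stand-in for the API call: pretend ranges over 100 blocks are rejected.
    if b - a + 1 > 100:
        raise RuntimeError("range too large")


ranges = extract_range(flaky_fetch, 0, 999, chunk_size=1000)
```

After the first 1000-block attempt fails, the range is re-queued at 100-block granularity and fully covered.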

@@ -60,23 +75,23 @@ uv sync
 cp .env.example .env
 export $(cat .env | xargs)
 
-# Initialize database schema
-./scripts/sql/run_sql.sh ./scripts/sql/init.sql
+# Initialize database schema (if needed)
+./scripts/sql_pg.sh ./scripts/sql/init.sql
 ```
 
 ### Data Extraction
 ```bash
 # Extract logs and transactions for a specific contract address
 # Supports K/M/B suffixes for block numbers (e.g., '18.5M')
-uv run python scripts/extract/runner.py \
+uv run python scripts/el/extract_etherscan.py \
   -c ethereum \
   -a 0x02950460e2b9529d0e00284a5fa2d7bdf3fa4d72 \
   --logs --transactions \
   --from_block 18.5M --to_block 20M \
   -v  # verbose logging
 
 # Extract data from last N days
-uv run python scripts/extract/runner.py \
+uv run python scripts/el/extract_etherscan.py \
   -a 0x02950460e2b9529d0e00284a5fa2d7bdf3fa4d72 \
   --logs --transactions \
   --last_n_days 7
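The K/M/B shorthand accepted by `--from_block`/`--to_block` follows a simple convention ('18.5M' = 18,500,000). This is an illustrative re-implementation of that convention, not the script's actual parser:

```python
# Hypothetical helper mirroring the K/M/B block-number shorthand the
# extraction script accepts; the real parser lives in the project.

_SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_block_number(value: str) -> int:
    """Parse '18.5M' -> 18_500_000, '750K' -> 750_000, '12345' -> 12345."""
    value = value.strip().upper()
    if value and value[-1] in _SUFFIXES:
        return int(float(value[:-1]) * _SUFFIXES[value[-1]])
    return int(value)
```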
@@ -87,12 +102,20 @@ uv run python scripts/extract/runner.py \
 ### Data Loading
 ```bash
 # Load Parquet file to PostgreSQL
-uv run python scripts/load/postgres_load.py \
+uv run python scripts/el/load.py \
   -f .data/raw/ethereum_0xaddress_logs_18500000_20000000.parquet \
-  -t logs
+  -c postgres \
+  -s raw \
+  -t logs \
+  -w append
 
 # Load to Snowflake (requires SNOWFLAKE_* env vars)
-uv run python scripts/load/snowflake_load.py
+uv run python scripts/el/load.py \
+  -f .data/raw/ethereum_0xaddress_logs_18500000_20000000.parquet \
+  -c snowflake \
+  -s raw \
+  -t logs \
+  -w append
 ```
 
 ### dbt Operations
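A rough sketch of what the three `-w` write dispositions mean, using an in-memory list in place of a real table. The actual `Loader` performs these against PostgreSQL or Snowflake, and treating `merge` as an upsert on a key column is an assumption here:

```python
# Illustrative semantics of the append / replace / merge write dispositions.
# `table` is a list of dict rows standing in for a database table.

def write(table, rows, disposition, key=None):
    if disposition == "replace":
        table.clear()          # drop existing rows, keep only the new batch
        table.extend(rows)
    elif disposition == "append":
        table.extend(rows)     # add the new batch unconditionally
    elif disposition == "merge":
        existing = {r[key]: r for r in table}
        for r in rows:
            existing[r[key]] = r   # upsert on the key column (assumed)
        table[:] = list(existing.values())
    else:
        raise ValueError(f"unknown disposition: {disposition}")
    return table
```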
@@ -111,18 +134,15 @@ uv run python scripts/load/snowflake_load.py
 ./scripts/dbt.sh docs generate  # Generate documentation
 ./scripts/dbt.sh run --select staging.*  # Run all staging models
 ./scripts/dbt.sh deps  # Install dbt packages
-
-# Legacy script (still available for backward compatibility)
-./scripts/run_dbt.sh staging run
 ```
 
 ### SQL Operations
 ```bash
-# Run SQL scripts directly
-./scripts/sql/run_sql.sh ./scripts/sql/init.sql
+# Run SQL scripts directly against PostgreSQL
+./scripts/sql_pg.sh ./scripts/sql/init.sql
 
 # Ad-hoc queries
-./scripts/sql/run_sql.sh ./scripts/sql/ad_hoc.sql
+./scripts/sql_pg.sh ./scripts/sql/ad_hoc.sql
 ```
 
 ## Environment Variables
@@ -131,36 +151,35 @@ Required variables (see `.env.example`):
 - `POSTGRES_HOST`, `POSTGRES_PORT`, `POSTGRES_DB`, `POSTGRES_USER`, `POSTGRES_PASSWORD`
 - `DB_SCHEMA`: Default schema for operations (e.g., `fa02_staging`)
 - `KAFKA_NETWORK_NAME`: Docker network name
+- `ETHERSCAN_API_KEY`: For Etherscan API access
 
 Optional (for Snowflake):
 - `SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_USER`, `SNOWFLAKE_ROLE`, `SNOWFLAKE_WAREHOUSE`
 - `SNOWFLAKE_DATABASE`, `SNOWFLAKE_SCHEMA`, `SNOWFLAKE_PRIVATE_KEY_FILE_PATH`
 
 ## Key Data Flows
 
-1. **Etherscan → Parquet**: `runner.py` extracts blockchain data to `.data/raw/*.parquet`
-2. **Parquet → PostgreSQL**: `postgres_load.py` loads into `raw` schema tables
+1. **Etherscan → Parquet**: `extract_etherscan.py` extracts blockchain data to `.data/raw/*.parquet`
+2. **Parquet → PostgreSQL/Snowflake**: `load.py` loads into `raw` schema tables
 3. **PostgreSQL → dbt**: dbt models transform `raw.logs` → `staging.stg_logs_decoded`
-4. Failed extractions are logged to `logging/extract_error/` and automatically retried with smaller chunk sizes
+4. Failed extractions are logged to `logging/extract_error/` and automatically retried with smaller chunk sizes (10x reduction)
 
 ## dbt Project Structure
 
 ```
 dbt_project/
-├── dbt_project.yml    # Configuration
-├── profiles.yml       # Database connections
+├── dbt_project.yml    # Configuration (project: stablecoins)
+├── profiles.yml       # Database connections (dev=postgres, test/prod=snowflake)
 ├── models/
-│   ├── staging/       # Raw data cleanup
-│   │   ├── _staging__sources.yml
-│   │   ├── _staging__models.yml
+│   ├── 01_staging/    # Raw data cleanup (materialized as views)
+│   │   ├── sources.yml    # Source definitions (raw schema)
+│   │   ├── models.yml     # Model documentation
 │   │   └── stg_logs_decoded.sql
-│   ├── intermediate/  # Business logic transformations
-│   └── marts/         # Final analytics tables
+│   ├── intermediate/  # Business logic (ephemeral)
+│   └── marts/         # Final analytics (tables)
 ├── tests/             # Data quality tests
-│   ├── test_valid_address.sql
-│   └── test_block_number_range.sql
-├── macros/            # Reusable SQL (ethereum_macros.sql)
-└── packages.yml       # dbt dependencies (dbt_utils)
+├── macros/            # ethereum_macros.sql (uint256_to_address, uint256_to_numeric)
+└── packages.yml       # dbt dependencies
 ```
 
 ### Model Naming Conventions
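For reference, the two macros listed above perform conversions equivalent to this Python sketch (the macros themselves emit SQL, not Python): an ABI-encoded uint256 word is 32 bytes (64 hex chars), and an address is its low 20 bytes.

```python
# Python equivalents of the uint256_to_address / uint256_to_numeric dbt macros.
# Sketch only -- shows the hex arithmetic, not the actual SQL the macros emit.

def uint256_to_address(word: str) -> str:
    """Take the low 20 bytes (40 hex chars) of a 32-byte word as an address."""
    h = word.removeprefix("0x").rjust(64, "0")
    return "0x" + h[-40:]


def uint256_to_numeric(word: str) -> int:
    """Interpret a uint256 hex string as an integer."""
    return int(word.removeprefix("0x"), 16)
```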
@@ -171,16 +190,16 @@ dbt_project/
 
 ## Database Schema
 
-- **raw.logs**: Raw log data with columns: address, topics (JSONB), data, block_number, transaction_hash, etc.
+- **raw.logs**: Raw log data with columns: address, topics (JSONB), data, block_number, transaction_hash, time_stamp, etc.
 - **raw.transactions**: Transaction data (structure similar to logs)
-- **staging.stg_logs_decoded**: Decoded logs with parsed topics (topic0-topic3)
+- **staging.stg_logs_decoded**: Decoded logs with parsed topics (topic0-topic3), indexed on (contract_address, transaction_hash, index)
 - dbt creates additional staging/intermediate/mart tables based on models in `dbt_project/models/`
 
 ## Project Structure Notes
 
-- Runnable scripts are ONLY in `scripts/` directory
-- Reusable code is packaged in `src/onchaindata/`
+- Runnable scripts are ONLY in `scripts/` directory (organized as `scripts/el/` for extract/load)
+- Reusable code is packaged in `src/onchaindata/` as an installable package
 - dbt project located at `dbt_project/` with standard structure (staging → intermediate → marts)
 - Data files: `.data/raw/` for extracted data, `sampledata/` for examples
 - Always run Python scripts with `uv run python` (not direct python)
-- Legacy `dbt_subprojects/` directory retained for reference (can be removed after migration)
+- Project uses `uv` for dependency management (see `pyproject.toml`)

docs/02_data_pipeline/01_source.md

Lines changed: 35 additions & 4 deletions
@@ -1,9 +1,40 @@
+## Primary Source: HyperIndex (Envio)
+
 ### `Transfer` data
-Raw transaction data is indexed with [HyperIndex](https://docs.envio.dev/docs/HyperIndex/overview), a blockchain indexing framework that transforms on-chain events into structured, queryable databases with GraphQL APIs.
-- To run the indexer
+Raw transaction data is indexed with [HyperIndex](https://docs.envio.dev/docs/HyperIndex/overview), a blockchain indexing framework that transforms on-chain events into structured, queryable databases with GraphQL APIs.
+
+**To run the indexer:**
 ```bash
 git clone https://github.com/newgnart/envio-stablecoins.git
 pnpm dev
 ```
-More details on [envio-stablecoins](https://github.com/newgnart/envio-stablecoins)
-### Wallet labels data
+
+**Benefits:**
+- ✅ Real-time continuous indexing
+- ✅ Structured GraphQL queries
+- ✅ Multiple contracts and events simultaneously
+- ✅ No API rate limits
+
+More details: [envio-stablecoins](https://github.com/newgnart/envio-stablecoins)
+
+---
+
+## Alternative Source: Etherscan API (Optional)
+
+The repository includes Etherscan API extraction tools (`scripts/el/extract_etherscan.py`) as an alternative data source. While not used in the primary pipeline, it's useful for:
+
+- Historical data extraction and validation
+- Supporting additional EVM chains
+- Ad-hoc analysis without running the indexer
+
+**Trade-offs:**
+- ✅ Flexible, no infrastructure needed
+- ✅ 50+ EVM chains supported
+- ❌ Rate limited (5 req/sec on free tier)
+- ❌ Requires API key
+
+For detailed usage, see [Additional Tools](../06_additional_tools.md).
+
+---
+
+## Wallet labels data
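The free-tier rate limit noted above (5 req/sec) implies client-side throttling along these lines. This is a minimal sketch; the package's own `rate_limiter.py` may be implemented differently:

```python
# Minimal client-side rate limiter sketch: enforce a minimum interval
# between successive API calls (e.g. 1/5 s for Etherscan's free tier).

import time


class RateLimiter:
    def __init__(self, max_per_second: float):
        self.min_interval = 1.0 / max_per_second
        self._last = 0.0

    def wait(self):
        """Sleep just long enough to respect the configured rate."""
        now = time.monotonic()
        sleep_for = self._last + self.min_interval - now
        if sleep_for > 0:
            time.sleep(sleep_for)
        self._last = time.monotonic()
```

Calling `wait()` before each request spaces calls at least `min_interval` apart.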

docs/02_data_pipeline/03_transformation.md

Lines changed: 3 additions & 0 deletions
@@ -40,4 +40,7 @@ uv run scripts/el/stream_graphql.py \
 - Currently supported databases:
   - PostgreSQL
   - Snowflake
+- Data Schema Reference
+
+## dbt models

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+## GraphQL
+Utilities for getting data from the Envio GraphQL API endpoint.
+
+### `GraphQLBatch`
+
+Extracts data from a GraphQL endpoint and saves it to a Parquet file.
+
+**Parameters:**
+
+- `endpoint` (str): GraphQL endpoint URL
+- `query` (str): GraphQL query string
+
+**Methods:**
+
+- `extract()`: Execute the GraphQL query and return results as a dictionary
+- `extract_to_dataframe()`: Execute the GraphQL query and return results as a Polars DataFrame
+
+
+### `GraphQLStream`
+
+Streams data from a GraphQL endpoint and pushes it to a database directly.
+
+**Parameters:**
+
+- `endpoint` (str): GraphQL endpoint URL
+- `table_name` (str): Name of the table (GraphQL table) to fetch
+- `fields` (list): List of fields to fetch
+- `poll_interval` (int): Polling interval in seconds
+
+**Methods:**
+
+- `stream()`: Stream data from the GraphQL endpoint and push it to the database directly
+  - Arguments:
+    - `loader` (Loader): Loader instance for database operations
+    - `schema` (str): Target schema name
+    - `table_name` (str): Target table name
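A polling loop of the shape `stream()` describes could look like the following sketch. The `fetch_page`/`load_batch` callables and the cursor handling here are assumptions for illustration, not the class's actual internals:

```python
# Hypothetical polling loop: fetch rows newer than the last seen cursor,
# hand each batch to a loader, then sleep for the poll interval.

import time


def stream(fetch_page, load_batch, poll_interval=5, cursor_field="id", max_polls=None):
    """Repeatedly fetch rows past `cursor` and load them; return the last cursor."""
    cursor = None
    polls = 0
    while max_polls is None or polls < max_polls:
        rows = fetch_page(cursor)
        if rows:
            load_batch(rows)                 # e.g. Loader writing to the target table
            cursor = rows[-1][cursor_field]  # advance past the newest loaded row
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(poll_interval)
    return cursor


# Demo with in-memory fakes in place of the GraphQL endpoint and database.
events = [{"id": i} for i in range(1, 6)]
loaded = []


def fake_fetch(cursor):
    lo = 0 if cursor is None else cursor
    return [e for e in events if e["id"] > lo][:2]  # page size 2


last = stream(fake_fetch, loaded.extend, poll_interval=0, max_polls=3)
```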
