Skip to content

Latest commit

 

History

History
364 lines (268 loc) · 10.3 KB

File metadata and controls

364 lines (268 loc) · 10.3 KB

SQLite Persistent Data Storage Guide

Overview

The ai-trader framework now includes a powerful SQLite-based persistent data storage layer using SQLModel ORM. This eliminates redundant API calls and dramatically speeds up repeated backtests.

Benefits:

  • 40x faster on repeated fetches (~2-3s → ~50ms)
  • Incremental updates - Only fetch missing dates
  • Multi-market support - US stocks, Taiwan stocks, crypto, forex, VIX
  • Type-safe ORM - SQLModel with proper schema constraints
  • Easy management - CLI commands for list, delete, clean

Quick Start

Using the CLI

# Default: Save to CSV only
ai-trader fetch AAPL --market us_stock --start-date 2024-01-01

# NEW: Save to SQLite only
ai-trader fetch AAPL --market us_stock --start-date 2024-01-01 --storage sqlite

# Save to both CSV and SQLite
ai-trader fetch AAPL --market us_stock --start-date 2024-01-01 --storage both

Data Management Commands

# List all cached tickers
ai-trader data list

# Filter by market
ai-trader data list --market us_stock

# View database statistics
ai-trader data info

# Delete specific ticker
ai-trader data delete --ticker AAPL --market us_stock

# Clean old data (before 2020)
ai-trader data clean --market us_stock --before 2020-01-01

Architecture

Database Schema (SQLModel ORM)

database: data/market_data.db

Per-Market Tables:
├── us_stock_data      [ticker, date, open, high, low, close, volume, adj_close]
├── tw_stock_data      [ticker, date, open, high, low, close, volume]
├── crypto_data        [ticker, date, open, high, low, close, volume, adj_close]
├── forex_data         [ticker, date, open, high, low, close, volume, adj_close]
├── vix_data           [ticker, date, open, high, low, close, volume, adj_close]
└── data_metadata      [ticker, market_type, first_date, last_date, total_rows, last_fetched_at]

Constraints:
├── UNIQUE(ticker, date) on each market table
├── Indexed: ticker, date
└── last_fetched_at tracks when data was last updated

Why Separate Tables?

  1. Schema Flexibility - Taiwan stocks don't have adj_close, no nullable columns needed
  2. Query Performance - Fast indexed lookups, no cross-market JOINs
  3. Data Integrity - Market-specific validation and constraints
  4. Future Extensibility - Easy to add market-specific columns

Python API

Basic Usage

from ai_trader.data.storage import SQLiteDataStorage
import pandas as pd

# Initialize
storage = SQLiteDataStorage(db_path="data/market_data.db")

# Save data
df = pd.DataFrame(...)  # OHLCV with DatetimeIndex
rows = storage.save(df=df, ticker="AAPL", market_type="us_stock")

# Load data
df = storage.load("AAPL", "us_stock", "2024-01-01", "2024-12-31")

# Check coverage
coverage = storage.get_coverage("AAPL", "us_stock")
# Returns: (date(2024,1,1), date(2024,12,31))

# Find missing dates
missing = storage.get_missing_ranges("AAPL", "us_stock",
                                     "2023-01-01", "2024-12-31")
# Returns: [(date(2023,1,1), date(2023,12,31))]

# List tickers
tickers = storage.list_tickers()
tickers = storage.list_tickers(market_type="us_stock")  # Filter by market

# Database info
info = storage.get_database_info()

# Clean up
storage.delete_ticker("AAPL", "us_stock")
storage.delete_before("us_stock", "2020-01-01")

Supported Markets

Market Table Has adj_close Example Tickers
us_stock us_stock_data Yes AAPL, MSFT, TSLA
tw_stock tw_stock_data No 2330, 2454, 3008
crypto crypto_data Yes BTC-USD, ETH-USD
forex forex_data Yes EURUSD, GBPUSD
vix vix_data Yes ^VIX

Incremental Update Logic

The system intelligently handles incremental updates:

Request: Fetch AAPL data for 2023-01-01 to 2024-12-31

Step 1: Check metadata table
        Database has: 2024-01-01 to 2024-06-30

Step 2: Calculate missing ranges
        Need:     2023-01-01 -------- 2024-06-30 -------- 2024-12-31
        Have:                         [████████████]
        Missing:  [████████████████]              [██████████]

Step 3: Fetch only missing data (2 ranges)
        API call 1: 2023-01-01 to 2023-12-31
        API call 2: 2024-07-01 to 2024-12-31

Step 4: Merge with existing data and update metadata

Performance Comparison

Operation CSV SQLite
First fetch (500 rows) ~2-3s (API) ~2-3s (API)
Repeated fetch (same data) ~2-3s (API) ~50ms (cache)
Incremental fetch (partial) ~2-3s (full API) ~500ms (partial API)
Query 1 year ~100ms ~30ms

Savings on typical workflow:

  • 50 backtests × 2.95s saved = 147.5 seconds (~2.5 minutes)
  • ~40x faster for repeated data

Examples

Example 1: Quick Backtest with Caching

# First run: Fetches from API (~3 seconds)
ai-trader fetch AAPL --market us_stock --start-date 2024-01-01 --storage sqlite
ai-trader quick SMAStrategy data/AAPL.csv

# Second run: Uses cached data (~1 second total)
ai-trader fetch AAPL --market us_stock --start-date 2024-01-01 --storage sqlite
ai-trader quick BBandsStrategy data/AAPL.csv

# Third run: Incremental update only (~1 second total)
ai-trader fetch AAPL --market us_stock --start-date 2024-01-01 --storage sqlite
ai-trader quick RSIStrategy data/AAPL.csv

Example 2: Multi-Symbol Backtest

# Fetch multiple symbols and cache them
ai-trader fetch AAPL MSFT GOOGL --market us_stock --start-date 2024-01-01 --storage sqlite

# List what was cached
ai-trader data list --market us_stock
# Output:
#   • AAPL (us_stock): 2024-01-01 to 2024-03-28 (251 rows)
#   • MSFT (us_stock): 2024-01-01 to 2024-03-28 (251 rows)
#   • GOOGL (us_stock): 2024-01-01 to 2024-03-28 (251 rows)

# Later: Add one more year of data (incremental)
ai-trader fetch AAPL MSFT GOOGL --market us_stock --start-date 2024-01-01 --end-date 2025-03-28 --storage sqlite
# Only fetches 2024-03-29 to 2025-03-28 from API

Example 3: Run All Examples

python scripts/examples/sqlite_storage_example.py

Best Practices

Do's

  1. Use SQLite for repeated backtests

    ai-trader fetch AAPL --storage sqlite
    # Later, same command = instant load
  2. Use CSV for sharing data

    ai-trader fetch AAPL --storage csv
    # Share data/us_stock/AAPL_*.csv with team
  3. Use both for production workflows

    ai-trader fetch AAPL --storage both
    # Persistent cache + shareable CSV
  4. Periodically clean old data

    ai-trader data clean --market us_stock --before 2015-01-01
  5. Monitor database size

    ai-trader data info

Don'ts

  1. Don't use SQLite for one-off backtests

    • CSV is fine if data isn't reused
  2. Don't commit database files to git

    • Database regenerates on demand
    • Use .gitignore for data/market_data.db
  3. Don't delete metadata accidentally

    • Database tracks coverage, deletion loses this info
    • Use ai-trader data delete --ticker X to clean properly
  4. Don't mix storage modes without understanding

    # BAD: Creates confusion about data origin
    ai-trader fetch AAPL --storage sqlite
    ai-trader fetch AAPL --storage csv  # Different location!
    
    # GOOD: Use consistent approach
    ai-trader fetch AAPL --storage sqlite
    ai-trader fetch AAPL --storage sqlite  # Same database

Migration Guide

From CSV-only to SQLite

# Old way: CSV files accumulate
ai-trader fetch AAPL --market us_stock --start-date 2020-01-01  # Creates CSV

# New way: Use SQLite
ai-trader fetch AAPL --market us_stock --start-date 2020-01-01 --storage sqlite

# Import existing CSVs (if needed)
python -c """
from ai_trader.data.storage import SQLiteDataStorage
import pandas as pd
from pathlib import Path

storage = SQLiteDataStorage()
for csv_file in Path('data/us_stock').glob('*.csv'):
    df = pd.read_csv(csv_file, index_col=0, parse_dates=True)
    ticker = csv_file.stem.split('_')[0]
    storage.save(df=df, ticker=ticker, market_type='us_stock')
"""

Troubleshooting

Q: Database file is too large

A: Clean old data
ai-trader data clean --market us_stock --before 2020-01-01

Q: Why is first fetch not faster?

A: First fetch always calls API. SQLite speeds up REPEATED fetches.
Speedup happens on 2nd and subsequent calls.

Q: Can I use the same database for multiple projects?

A: Yes, each table is per-market, no conflicts.
Or use different db_path: SQLiteDataStorage(db_path="project2/data.db")

Q: How to reset the database?

A: Delete the database file
rm data/market_data.db
Next fetch will create a fresh one.

Technical Details

SQLModel & ORM

All table definitions use SQLModel for type safety:

from ai_trader.data.storage import USStockData, TWStockData, CryptoData

# These are proper SQLModel ORM models, not raw SQL tables
# Provides:
# - Type hints and IDE completion
# - Automatic schema validation
# - Migration-free updates
# - Seamless integration with SQLAlchemy 2.0

No Raw SQL

All database operations use the SQLModel ORM:

# Good: Using ORM
session.exec(select(USStockData).where(USStockData.ticker == "AAPL")).all()

# Bad: Raw SQL (not used in this project)
session.exec("SELECT * FROM us_stock_data WHERE ticker = 'AAPL'").all()

Thread Safety

SQLite storage is thread-safe:

# Safe for concurrent access
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [
        executor.submit(storage.load, symbol, "us_stock", "2024-01-01", "2024-12-31")
        for symbol in ["AAPL", "MSFT", "GOOGL", "AMZN"]
    ]
    results = [f.result() for f in futures]

References

License

Licensed under GPL-3.0. See LICENSE for details.