Sprint: S10 - The Real Deal
Epic ID: B
Owner: @siyahkare
Status: 🚧 In Progress
- Leak-safe dataset via Feature Store (Parquet/DuckDB)
- ≥200 Optuna trials with early stopping
- LightGBM (joblib) production model
- Model Card ve EnsemblePredictor entegrasyonu
backend/src/data/
  feature_store.py       # Parquet/DuckDB + time-based split
backend/src/ml/
  train_lgbm_prod.py     # Optuna + LightGBM + joblib dump + model_card
  infer_lgbm.py          # Production inference wrapper
backend/data/
  feature_store/         # Parquet files per symbol
    BTCUSDT.parquet
  models/
    YYYY-MM-DD/
      lgbm.pkl
      model_card.json
    best_lgbm.pkl -> YYYY-MM-DD/lgbm.pkl
- Leakage tests pass (no future leak)
- Val accuracy ≥ 65%
- backend/data/models/best_lgbm.pkl and model_card.json created
- EnsemblePredictor loads real model
- CI gate: coverage ≥ 80%, backtest KPI no regression
Features:
- Parquet append (schema evolution ready)
- Time-based train/val split (leak-safe)
- Minute-bar feature engineering
- Leakage guard assertions
Core Functions:
def minute_features(bars: pd.DataFrame, horizon: int = 5) -> pd.DataFrame
def time_based_split(df: pd.DataFrame, val_days: int) -> Tuple[pd.DataFrame, pd.DataFrame]
def guard_no_future_leak(train: pd.DataFrame, val: pd.DataFrame)
def to_parquet(df: pd.DataFrame, out_dir: str, symbol: str) -> str
Features Generated:
- ret1: 1-period return
- sma20_gap: close - SMA(20)
- sma50_gap: close - SMA(50)
- vol_z: Volume z-score (50-period)
Target:
- y: 1 if price up in horizon minutes, else 0
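A minimal sketch of how `minute_features` could compute these columns, following the list above (the actual `feature_store.py` implementation may differ):

```python
# Hedged sketch of minute_features; column names and windows follow the
# "Features Generated" list above, everything else is an assumption.
import pandas as pd

def minute_features(bars: pd.DataFrame, horizon: int = 5) -> pd.DataFrame:
    df = bars.copy()
    df["ret1"] = df["close"].pct_change()                    # 1-period return
    df["sma20_gap"] = df["close"] - df["close"].rolling(20).mean()
    df["sma50_gap"] = df["close"] - df["close"].rolling(50).mean()
    roll = df["volume"].rolling(50)
    df["vol_z"] = (df["volume"] - roll.mean()) / roll.std()  # 50-period z-score
    # The label looks `horizon` minutes forward; features never do (leak-safe).
    fwd = df["close"].shift(-horizon)
    df["y"] = (fwd > df["close"]).astype(int)
    df = df[fwd.notna()]                  # drop rows that have no future label
    return df.dropna().reset_index(drop=True)
```

Note the label is the only forward-looking column; dropping the last `horizon` rows avoids silently labeling them 0.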
Features:
- Optuna hyperparameter search (≥200 trials)
- LightGBM binary classification
- Early stopping (100 rounds)
- Model card generation
- Symlink to best_lgbm.pkl
Hyperparameters Tuned:
- learning_rate: 0.01 - 0.2 (log scale)
- num_leaves: 16 - 256 (log scale)
- min_data_in_leaf: 16 - 512 (log scale)
- feature_fraction: 0.6 - 1.0
- bagging_fraction: 0.6 - 1.0
- lambda_l1, lambda_l2: 1e-9 - 10.0 (log scale)
Output Artifacts:
backend/data/models/YYYY-MM-DD/
  lgbm.pkl          # Trained model
  model_card.json   # Metadata (features, accuracy, timestamp)
backend/data/models/
  best_lgbm.pkl -> YYYY-MM-DD/lgbm.pkl   # Symlink
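Updating that symlink can be done atomically so a failed training run never clobbers the current best model; a sketch (`publish_model` is a hypothetical helper name, not necessarily in `train_lgbm_prod.py`):

```python
# Sketch: atomically repoint best_lgbm.pkl at today's dated artifact dir.
from datetime import date
from pathlib import Path

def publish_model(models_root: str = "backend/data/models") -> Path:
    root = Path(models_root)
    target = Path(date.today().isoformat()) / "lgbm.pkl"  # relative: YYYY-MM-DD/lgbm.pkl
    link = root / "best_lgbm.pkl"
    tmp = root / "best_lgbm.pkl.tmp"
    tmp.unlink(missing_ok=True)   # clear leftovers from a failed run
    tmp.symlink_to(target)        # build the new link beside the old one
    tmp.replace(link)             # rename is atomic: old link survives any failure
    return link
```

The rename-over pattern is what makes "overwriting good models" (see risks below) a non-event: the old symlink is replaced in one step or not at all.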
Features:
- Thread-safe singleton model loader
- Lazy loading (load on first predict)
- Compatible with EnsemblePredictor interface
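A minimal sketch of what such a thread-safe, lazily loaded singleton could look like (assumes a scikit-learn-API classifier, e.g. `lgb.LGBMClassifier`, was dumped with joblib; the real `infer_lgbm.py` may differ):

```python
# Sketch of a lazily loaded, thread-safe module-level singleton.
import threading
from typing import Dict, Optional

import joblib

class LGBMProd:
    _model = None
    _lock = threading.Lock()
    _default_path = "backend/data/models/best_lgbm.pkl"
    FEATURES = ["close", "ret1", "sma20_gap", "sma50_gap", "vol_z"]  # fixed order

    @classmethod
    def load(cls, path: Optional[str] = None) -> None:
        with cls._lock:  # serialize loads across threads
            if cls._model is None or path is not None:
                cls._model = joblib.load(path or cls._default_path)

    @classmethod
    def predict_proba_up(cls, features: Dict[str, float]) -> float:
        if cls._model is None:
            cls.load()                       # lazy: load on first predict
        row = [[features[f] for f in cls.FEATURES]]
        return float(cls._model.predict_proba(row)[0][1])  # P(class == 1)
```

Keeping `FEATURES` as an explicit ordered list guards against silently feeding columns to the model in the wrong order.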
Usage:
from backend.src.ml.infer_lgbm import LGBMProd
# Load once, reuse
proba = LGBMProd.predict_proba_up({
"close": 100.0,
"ret1": 0.01,
"sma20_gap": -0.5,
"sma50_gap": 1.2,
"vol_z": 0.3
})
# Returns: 0.0 - 1.0 (probability of price going up)
Update backend/src/ml/models/ensemble_predictor.py:
from typing import Dict

from backend.src.ml.infer_lgbm import LGBMProd

class LGBMPredictor:
    def __init__(self, model_path: str = "backend/data/models/best_lgbm.pkl"):
        self.model_path = model_path
        self.loaded = False

    def load(self):
        if not self.loaded:
            LGBMProd.load(self.model_path)
            self.loaded = True

    def predict_proba(self, features: Dict[str, float]) -> float:
        return LGBMProd.predict_proba_up(features)
- Verify time-based split has no future leak
- Assert train.ts.max() < val.ts.min()
- Synthetic data → feature store → train
- Verify artifacts created
- Verify symlink exists
- Load model (if exists)
- Predict with dummy features
- Assert output in [0.0, 1.0]
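The leakage check above could be sketched as a pytest-style test; the split/guard helpers are inlined here so the example is self-contained, mirroring the signatures planned for `feature_store.py`:

```python
# Sketch of the no-future-leak test; helper bodies are assumptions.
import numpy as np
import pandas as pd

def time_based_split(df: pd.DataFrame, val_days: int):
    cutoff = df["ts"].max() - pd.Timedelta(days=val_days)
    return df[df["ts"] <= cutoff], df[df["ts"] > cutoff]

def guard_no_future_leak(train: pd.DataFrame, val: pd.DataFrame) -> None:
    assert train["ts"].max() < val["ts"].min(), "future leak: train overlaps val"

def test_time_split_no_leak():
    ts = pd.date_range("2024-01-01", periods=30 * 24 * 60, freq="min")  # 30 days of minutes
    df = pd.DataFrame({"ts": ts, "close": np.random.rand(len(ts))})
    train, val = time_based_split(df, val_days=14)
    assert len(train) > 0 and len(val) > 0
    guard_no_future_leak(train, val)   # train.ts.max() < val.ts.min()
```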
cd /Users/onur/levibot
source backend/venv/bin/activate
pip install lightgbm optuna pyarrow duckdb

python - <<'PY'
import pandas as pd
import json
from pathlib import Path
from backend.src.data.feature_store import minute_features, to_parquet
# Convert existing raw data to feature store
raw_files = list(Path("backend/data/raw").glob("BTCUSDT_*.json"))
if raw_files:
    rows = json.loads(raw_files[0].read_text())
    bars = pd.DataFrame(rows)[["ts","o","h","l","c","v"]]
    bars.columns = ["ts","open","high","low","close","volume"]
    df = minute_features(bars, horizon=5)
    path = to_parquet(df, out_dir="backend/data/feature_store", symbol="BTCUSDT")
    print(f"✅ Feature store created: {path}")
    print(f"   Rows: {len(df)}, Features: {list(df.columns)}")
PY

python - <<'PY'
from backend.src.ml.train_lgbm_prod import train_from_parquet
# Train with Optuna (50 trials for quick test, 200+ for production)
artifacts = train_from_parquet(
"backend/data/feature_store/BTCUSDT.parquet",
val_days=14,
trials=50
)
print("✅ Model artifacts:")
for k, v in artifacts.items():
    print(f"  {k}: {v}")
PY

# Check engine status (should load real LGBM)
curl -s http://localhost:8000/engines/status | jq
# Test inference
python - <<'PY'
from backend.src.ml.infer_lgbm import LGBMProd
proba = LGBMProd.predict_proba_up({
"close": 100.0,
"ret1": 0.01,
"sma20_gap": -0.5,
"sma50_gap": 1.2,
"vol_z": 0.3
})
print(f"📊 Prediction: {proba:.4f} (probability of up move)")
PY

# requirements.txt
lightgbm>=4.0.0
optuna>=3.0.0
pyarrow>=14.0.0
duckdb>=0.9.0

# Optional: CPU-intensive job on nightly schedule
# PR: 5-10 trials for quick validation
# Nightly: 200+ trials for production model
- Target: ≥80%
- Upload model_card.json and best_lgbm.pkl as artifacts
Risk: Future data in training set
Mitigation:
- Time-based split only
- Lagged features (add shift(1) if needed)
- guard_no_future_leak() assertion in tests
Risk: Unbalanced binary target
Mitigation:
- Monitor class distribution
- Use is_unbalance=True or scale_pos_weight in LGBM params
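One way to derive `scale_pos_weight` from the training labels, sketched under assumptions (`imbalance_params` is a hypothetical helper; the skew thresholds are arbitrary):

```python
# Sketch: compute LightGBM's scale_pos_weight from label counts,
# as an alternative to is_unbalance=True.
import pandas as pd

def imbalance_params(y: pd.Series) -> dict:
    pos = int((y == 1).sum())
    neg = int((y == 0).sum())
    ratio = neg / max(pos, 1)   # guard against zero positives
    # Only reweight when classes are meaningfully skewed.
    return {"scale_pos_weight": ratio} if ratio > 1.5 or ratio < 0.67 else {}
```

The returned dict can be merged into the LightGBM params before training.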
Risk: Training takes too long
Mitigation:
- Set timeout parameter in study.optimize()
- Or use fixed n_trials with progress bar
Risk: Overwriting good models
Mitigation:
- Dated directories (YYYY-MM-DD/)
- Symlink for best_*
- Model card tracks metadata
| Metric | Target | Notes |
|---|---|---|
| Val Accuracy | ≥ 65% | Binary classification |
| Training Time | < 30 min | 200 trials on CPU |
| Inference Latency | < 10ms | CPU, single prediction |
| Model Size | < 50 MB | Joblib serialized |
| Feature Count | 5 | close, ret1, sma20_gap, sma50_gap, vol_z |
After Epic-B completion:
- Epic-C: Production TFT - PyTorch Lightning transformer
- Epic-D: Backtesting - 90-day historical simulation
- Model Monitoring - Drift detection, auto-retraining
- Advanced Features - Order flow, funding rate, on-chain data
Status: 🚧 Ready for implementation
Owner: @siyahkare
Next: Create skeleton files → Run training → Validate accuracy ≥65%