Skip to content

Feature/gap1 training history#331

Open
Sickwoman wants to merge 1 commit intoc2siorg:mainfrom
Sickwoman:feature/gap1-training-history
Open

Feature/gap1 training history#331
Sickwoman wants to merge 1 commit intoc2siorg:mainfrom
Sickwoman:feature/gap1-training-history

Conversation

@Sickwoman
Copy link
Copy Markdown
Contributor

Overview

This PR adds critical improvements to the GAP-1 training history feature:

  • Error handling: Capture training failures in database
  • Pagination: Prevent large result sets from endpoint
  • Logging: Track important operations
  • Validation: Warn when configured metrics don't match training output

Problem Statement

  • Training failures weren't being persisted to the database
  • GET /model/{model_id}/training-runs could return thousands of records
  • No logging for important operations like "set as best"
  • Silent failures when configured metric doesn't match training output

Solution

Backend Changes

1. Error Handling in model_run() (model_run.py)

  • Wrapped training execution in try-except block at model_run() level
  • When training fails:
    • Sets training_run.status = "failed"
    • Captures error message: training_run.error_message = str(e)
    • Records completion time and duration
    • Safely rolls back if database update fails
  • Non-blocking: Database errors don't prevent exception propagation

Example flow:

try:
    _run(model_name, db)  # Training runs here
except Exception as e:
    # Capture to database
    training_run.status = "failed"
    training_run.error_message = str(e)
    training_run.completed_at = datetime.now(UTC)
    db.commit()
    raise  # Still propagate to caller

2. Pagination on Training Runs Endpoint (training_run.py)

  • Added offset and limit query parameters
  • Defaults: offset=0, limit=50
  • Returns returned_runs count along with pagination info
  • Prevents loading 1000+ training runs into memory

Query example:
Response:

{
  "model_id": 123,
  "model_name": "my-model",
  "returned_runs": 50,
  "offset": 0,
  "limit": 50,
  "runs": [...]
}

3. Logging Improvements

  • Added logger initialization in training_run.py
  • Log when marking run as best:
  logger.info("Marked training run #%d as best for model #%d", run_id, model_id)

4. Metric Validation

  • Warn in logs if configured metric not found in training history
  • Better debugging for metric mismatches
  • Helps identify when training output doesn't match expectations

Testing

  • ✅ All 162 backend tests pass
  • ✅ Error handling tested with failed training scenarios
  • ✅ Pagination tested with multiple training runs
  • ✅ Logging verified

Backwards Compatibility

✅ Fully backwards compatible:

  • Endpoint still returns all fields
  • Pagination has sensible defaults
  • Error handling is non-blocking
  • Existing clients continue to work

Files Changed

  • tensormap-backend/app/services/model_run.py — Error handling wrapper
  • tensormap-backend/app/routers/training_run.py — Pagination + logging

Key Features

✅ Training failures now tracked with error messages
✅ Pagination prevents memory issues with large datasets
✅ Comprehensive logging for debugging
✅ Metric validation warns on mismatches
✅ All 162 tests passing
✅ Ruff linting passes

Related

Improves GAP-1 training history feature with production-ready error handling and performance optimizations.

@Sickwoman Sickwoman force-pushed the feature/gap1-training-history branch from 669ffa2 to 57274d9 Compare April 26, 2026 21:38
- Add ModelTrainingRun table to capture every training run
- Store timing, config snapshot, final metrics, epoch-by-epoch curves
- Save history in model_run.py after model.fit() completes
- Add 3 new endpoints:
    GET  /api/model/{model_id}/training-runs
    GET  /api/model/{model_id}/training-run/{run_id}/metrics
    POST /api/model/{model_id}/training-run/{run_id}/set-as-best
- Alembic migration: 9e13d23b8149
- Register ModelTrainingRun in migrations/env.py
@Sickwoman Sickwoman force-pushed the feature/gap1-training-history branch from 57274d9 to 17dd64a Compare April 26, 2026 21:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant