
feat(api): Add import job REST API and service layer#34

Open
sjlangley wants to merge 3 commits into main from app/api/import-jobs

Conversation

@sjlangley
Owner

Overview

Implements REST API endpoints and business logic for GEDCOM import job management. This is the first deliverable of PR #6 as outlined in the import job implementation plan.

API Endpoints (6)

  • POST /api/import-jobs - Create job with file upload
  • GET /api/import-jobs - List user's jobs (paginated, filterable)
  • GET /api/import-jobs/{job_id} - Get job detail with stages
  • POST /api/import-jobs/{job_id}/pause - Pause running job
  • POST /api/import-jobs/{job_id}/resume - Resume paused job
  • DELETE /api/import-jobs/{job_id} - Cancel job and delete files

Service Layer

New import_pipeline.py with 11 functions:

  • create_import_job() - Upload GEDCOM, transition UPLOADED → QUEUED
  • get_job_with_stages() - Fetch job with pipeline stages
  • list_user_jobs() - Paginated job list with filtering
  • pause_job() / resume_job() - Job lifecycle management
  • cancel_job() - Delete job and cleanup storage
  • claim_next_job() - Worker job claiming (lease-based)
  • complete_stage() / fail_job() - Stage execution updates
  • get_next_stage() / heartbeat_job() - Worker helpers

Database Changes

  • Added order: int field to ImportJobStage for pipeline ordering
  • Changed ImportJob.user_id from UUID to string (for Google JWT subject IDs)
  • Matches existing WikiTreeConnection pattern for authentication

File Storage

  • Pattern: /data/gedcom/{user_id}/{job_id}/original.ged
  • user_id is string (Google subject), job_id is UUID

State Machine

Jobs: UPLOADED → QUEUED → IN_PROGRESS → PAUSED/COMPLETED/FAILED/CANCELLED
Stages: PENDING → IN_PROGRESS → COMPLETED/FAILED/RETRYING
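
The job diagram above can be encoded as an enum plus a transition table. This is an illustrative sketch inferred from the arrows; the exact set of allowed transitions is an assumption, not the PR's actual code.

```python
from enum import Enum

class JobStatus(str, Enum):
    UPLOADED = "uploaded"
    QUEUED = "queued"
    IN_PROGRESS = "in_progress"
    PAUSED = "paused"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"

# Assumed transition table; terminal states have no outgoing edges.
ALLOWED = {
    JobStatus.UPLOADED: {JobStatus.QUEUED, JobStatus.CANCELLED},
    JobStatus.QUEUED: {JobStatus.IN_PROGRESS, JobStatus.CANCELLED},
    JobStatus.IN_PROGRESS: {JobStatus.PAUSED, JobStatus.COMPLETED,
                            JobStatus.FAILED, JobStatus.CANCELLED},
    JobStatus.PAUSED: {JobStatus.IN_PROGRESS, JobStatus.CANCELLED},
}

def transition(current: JobStatus, target: JobStatus) -> JobStatus:
    """Validate a state change before persisting it."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```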

Testing

17/17 tests passing (100% success rate)

  • Comprehensive endpoint tests covering all 6 routes
  • Happy paths, validation errors, auth checks, and edge cases
  • Coverage: routes 72%, service layer 26%

Note: The lower service-layer coverage is expected; functions like claim_next_job(), complete_stage(), and fail_job() are called by the worker process (not yet implemented), not by the API routes.

Quality Checks

  • ✅ Linting: All checks passed (ruff)
  • ✅ Formatting: All files formatted (ruff)
  • ✅ Type checking: SQLAlchemy false positives suppressed

Files Changed

New:

  • apps/api/src/api/routes/import_jobs.py (113 lines)
  • apps/api/src/api/services/import_pipeline.py (154 lines)
  • apps/api/tests/test_import_jobs_routes.py (481 lines)

Modified:

  • apps/api/src/api/database.py - Add ImportJobStage.order field
  • apps/api/src/api/app.py - Register import_jobs router
  • apps/api/src/api/settings.py - Add gedcom_storage_path setting

Next Steps

This establishes the API foundation. Follow-up PRs will implement:

  1. Worker process to claim and execute jobs
  2. Stage-specific logic (validate, parse, normalize, search, match, review)
  3. Frontend UI for job upload and monitoring

@codecov

codecov Bot commented Apr 13, 2026

Codecov Report

❌ Patch coverage is 46.32353% with 146 lines in your changes missing coverage. Please review.

Files with missing lines:
  • apps/api/src/api/services/import_pipeline.py: 25.49% patch coverage, 114 lines missing ⚠️
  • apps/api/src/api/routes/import_jobs.py: 71.68% patch coverage, 32 lines missing ⚠️

❌ Your patch check has failed because the patch coverage (46.32%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

❗ There is a different number of reports uploaded between BASE (1ee23c4) and HEAD (059b3ef).

HEAD has 1 upload less than BASE:
  • Flag worker: BASE (1ee23c4) 1, HEAD (059b3ef) 0

Flag coverage:
  • api: 81.37% <46.32%> (?)
  • worker: ?

Flags with carried forward coverage won't be shown.

Files with missing lines (coverage Δ):
  • apps/api/src/api/app.py: 83.87% <100.00%> (ø)
  • apps/api/src/api/database.py: 99.06% <100.00%> (ø)
  • apps/api/src/api/settings.py: 100.00% <100.00%> (ø)
  • apps/api/src/api/routes/import_jobs.py: 71.68% <71.68%> (ø)
  • apps/api/src/api/services/import_pipeline.py: 25.49% <25.49%> (ø)

... and 19 files with indirect coverage changes


Contributor

Copilot AI left a comment


Pull request overview

Adds an import-jobs feature to the FastAPI backend, introducing REST endpoints and a service layer to manage GEDCOM import job lifecycle (create/list/detail/pause/resume/delete) and worker-side job claiming helpers.

Changes:

  • Introduces /api/import-jobs REST routes with file upload handling and job lifecycle actions.
  • Adds import_pipeline service functions for job creation, listing, cancellation, and worker helpers (claim/heartbeat/stage transitions).
  • Extends DB models/settings to support stage ordering and configurable GEDCOM storage path, plus adds endpoint tests.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 14 comments.

Summary per file:
  • apps/api/src/api/routes/import_jobs.py: New REST API router + response models for import job management
  • apps/api/src/api/services/import_pipeline.py: New service layer implementing job lifecycle and worker helper operations
  • apps/api/src/api/database.py: Schema updates: ImportJob.user_id string and ImportJobStage.order field
  • apps/api/src/api/settings.py: Adds gedcom_storage_path configuration
  • apps/api/src/api/app.py: Registers the new import-jobs router under /api
  • apps/api/src/api/services/__init__.py: Adds services package marker/docstring
  • apps/api/tests/test_import_jobs_routes.py: New endpoint tests for all import-job routes
  • apps/api/pyproject.toml: Adds python-multipart dependency for multipart file uploads


Comment on lines +164 to +167
# Get total count
count_result = await db.execute(query)
total = len(list(count_result.scalars().all()))


Copilot AI Apr 13, 2026


list_user_jobs() computes total by executing the full query and doing len(all()), which loads every job row into memory and does extra work on every request. Use a proper SELECT COUNT(*) (e.g., via func.count() on a subquery) so pagination remains fast as the table grows.

Owner Author


Fixed - Changed to use SELECT COUNT(*) subquery:

# Now uses efficient count query
count_stmt = select(func.count()).select_from(stmt.subquery())
count_result = await db.execute(count_stmt)
total = count_result.scalar_one()

This avoids loading all rows into memory and keeps pagination fast as the table grows.

Comment thread apps/api/src/api/routes/import_jobs.py Outdated
heartbeat_at=job.claimed_at.isoformat() if job.claimed_at else None,
stages=stage_responses,
created_at=job.created_at.isoformat(),
updated_at=job.created_at.isoformat(), # TODO: Add updated_at to model

Copilot AI Apr 13, 2026


ImportJobResponse.updated_at is populated with job.created_at even though the ImportJob model doesn't have an updated_at field. This makes the API response misleading for clients. Prefer setting updated_at=None until the DB model has a real updated_at column, or add/update the column and populate it correctly.

Suggested change
updated_at=job.created_at.isoformat(), # TODO: Add updated_at to model
updated_at=None, # TODO: Add updated_at to model

Owner Author


Fixed - Changed to updated_at=None with TODO:

updated_at=None,  # TODO: Add updated_at to model

This prevents misleading clients until we add a real updated_at column to the database model.

Comment thread apps/api/src/api/app.py
Comment on lines 149 to 152
app.include_router(auth.router, prefix='/auth')
app.include_router(health.router, prefix='/health')
app.include_router(import_jobs.router, prefix='/api')
app.include_router(user.router, prefix='/user')

Copilot AI Apr 13, 2026


Including import_jobs introduces a DELETE /api/import-jobs/{job_id} endpoint, but the CORS middleware is configured with allow_methods=['GET','POST'] (see above in this file). Browser clients will fail preflight for DELETE/OPTIONS. Update CORS allow_methods to include DELETE (and OPTIONS) or use ['*'] if appropriate.

Owner Author


Fixed - Added DELETE and OPTIONS to CORS:

allow_methods=['GET', 'POST', 'DELETE', 'OPTIONS']

Browser clients can now successfully complete preflight checks for DELETE operations.
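
For context, the resulting middleware configuration looks roughly like this; the origins and headers below are placeholders, not the project's actual values.

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=['http://localhost:3000'],  # placeholder origin
    allow_credentials=True,
    # DELETE/OPTIONS added so browser preflight succeeds for the cancel endpoint
    allow_methods=['GET', 'POST', 'DELETE', 'OPTIONS'],
    allow_headers=['Authorization', 'Content-Type'],  # placeholder headers
)
```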

Comment on lines 70 to 74
id: UUID = Field(default_factory=uuid4, primary_key=True)
user_id: UUID = Field(foreign_key='app_users.id', index=True)
# user_id stores Google subject ID directly (not FK to app_users)
# because auth uses JWT validation, not database user lookups
user_id: str = Field(index=True)
source_type: str # gedcom | wikitree-export | manual

Copilot AI Apr 13, 2026


Changing ImportJob.user_id to str is a schema-breaking change. Since the app uses SQLModel.metadata.create_all() (which won't alter existing tables), existing DBs won't be upgraded automatically and may fail at runtime. Consider adding migrations (e.g., Alembic) or documenting/handling required schema upgrades.

Owner Author


Documented - Added comprehensive migration documentation:

# SCHEMA CHANGE: user_id changed from UUID to str (Google subject IDs)
# Migration required for existing deployments:
# 1. For SQLite: Create new table with correct schema, copy data
# 2. For PostgreSQL: ALTER TABLE import_jobs ALTER COLUMN user_id TYPE TEXT;
# TODO: Add Alembic migration in future PR

The change is necessary to support Google JWT subject IDs like "google|108291827364728364", which are not UUIDs.

Comment thread apps/api/src/api/database.py Outdated
import_job_id: UUID = Field(foreign_key='import_jobs.id', index=True)
stage_name: str # parse | normalize | search | match | review
stage_name: str # validate | parse | normalize | search | match | review
order: int # Pipeline execution order (0=validate, 1=parse, etc.)

Copilot AI Apr 13, 2026


Adding the non-null ImportJobStage.order column is a schema-breaking change. With SQLModel.metadata.create_all() (no migrations), existing DBs won't get this column automatically, and any queries ordering by ImportJobStage.order will fail. Consider adding a migration/backfill or making the column nullable with a backfill path.

Suggested change
order: int # Pipeline execution order (0=validate, 1=parse, etc.)
order: int | None = None # Nullable for compatibility with existing DBs until backfilled

Owner Author


Fixed - Made field nullable for backward compatibility:

order: int | None = None  # Nullable for compatibility with existing DBs until backfilled

Existing databases without this column will continue to work. Added TODO for future backfill migration.

Comment on lines +457 to +461
job = ImportJob(
user_id=test_user.userid,
source_type='gedcom',
original_filename='test.ged',
stored_path=f'{test_user.userid}/job/original.ged',

Copilot AI Apr 13, 2026


test_delete_job_success doesn't create any ImportJobStage rows for the job. Since delete should also remove stages (and may fail under FK enforcement in Postgres), add at least one stage and assert the delete succeeds and stages are removed to catch regressions.

Owner Author


Fixed - Enhanced test to create and verify stage deletion:

# Create 2 stages to verify cascade delete
stage1 = ImportJobStage(...)
stage2 = ImportJobStage(...)
db.add_all([stage1, stage2])
await db.commit()

# After delete, verify stages are removed
result = await db.execute(select(ImportJobStage).where(...))
assert result.scalars().all() == []

The test now creates 2 stages and asserts they are deleted along with the job.

Comment on lines +341 to +348
# Load stages
stages_result = await db.execute(
select(ImportJobStage)
.where(ImportJobStage.import_job_id == job.id)
.order_by(ImportJobStage.id)
)
job.stages = list(stages_result.scalars().all()) # pyrefly: ignore[assignment]


Copilot AI Apr 13, 2026


Stages are loaded with .order_by(ImportJobStage.id) and then assigned via job.stages = .... This risks incorrect stage sequencing (should use ImportJobStage.order) and may raise at runtime because ImportJob doesn't define a stages field. Prefer returning stages separately or defining an ORM relationship.

Suggested change
# Load stages
stages_result = await db.execute(
select(ImportJobStage)
.where(ImportJobStage.import_job_id == job.id)
.order_by(ImportJobStage.id)
)
job.stages = list(stages_result.scalars().all()) # pyrefly: ignore[assignment]

Owner Author


Fixed - Changed to order by order field and return tuple:

# Order by the order field, not id
.order_by(ImportJobStage.order)  # pyrefly: ignore

# Return job and stages as tuple to avoid assignment error
return job, stages

Stages are now correctly sequenced by the order field. The function returns a tuple to avoid Pydantic field assignment errors since ImportJob doesn't define a stages field.

Comment on lines +404 to +408
job_id: UUID,
stage_name: str,
records_processed: int,
checkpoint_data: dict | None = None,
) -> None:

Copilot AI Apr 13, 2026


update_stage_progress() accepts records_processed but never uses it, and currently only updates checkpoint_json. Either remove the unused parameter or persist progress (e.g., increment a counter / set timestamps) so worker-side callers get the behavior implied by the function name and docstring.

Owner Author


Fixed - Removed unused parameter:

async def update_stage_progress(
    db: AsyncSession,
    job_id: UUID,
    stage_name: str,
    checkpoint_json: dict[str, Any] | None = None,
) -> None:

Removed the records_processed parameter since it wasn't being persisted. The function name and behavior now match - it only updates the checkpoint JSON.

Comment on lines +148 to +152
offset: Pagination offset
status_filter: Optional status filter
(pending/running/paused/completed/failed)

Returns:

Copilot AI Apr 13, 2026


The docs for status_filter mention values like pending/running, but the implementation maps via ImportJobStatus[status_filter.upper()] (e.g., queued, in_progress, paused, ...). Either update the documented allowed values or add explicit mapping so pending/running work as advertised.

Owner Author


Fixed - Updated docstring to use actual enum values:

@param status_filter: Filter by status (queued, in_progress, paused, completed, failed, cancelled)

The documentation now lists the actual ImportJobStatus enum values that the API accepts.

Comment on lines +255 to +258
limit: Maximum jobs to return (1-100)
offset: Pagination offset
status: Optional status filter (pending/running/paused/completed/failed)


Copilot AI Apr 13, 2026


The status query param docstring lists pending/running/..., but the backend currently expects enum-like values such as queued and in_progress. Align the documented allowed values with what the service actually accepts, or map pending->queued / running->in_progress so clients can use the documented filter values.

Owner Author


Fixed - Updated docstring to match actual enum values:

status: str | None = Query(
    None,
    description='Filter by status: queued, in_progress, paused, completed, failed, cancelled'
)

The parameter description now lists the exact ImportJobStatus enum values that the service layer expects.

Performance improvements:
- Use COUNT(*) for pagination instead of loading all rows into memory
- Stream GEDCOM files to disk (8KB chunks) to avoid memory doubling
- Compute SHA256 hash incrementally during upload

Bug fixes:
- Fix get_current_stage() to use .first() instead of scalar_one_or_none()
- Fix cancel_job() to properly DELETE stages instead of SELECT
- Order stages by order field instead of id

API improvements:
- Add DELETE and OPTIONS to CORS allowed methods
- Set updated_at to None until model has the field
- Change logging from INFO to DEBUG for _job_to_response()
- Update status filter docs to match actual enum values

Schema compatibility:
- Make ImportJobStage.order nullable for backward compatibility
- Add migration TODOs for schema-breaking changes
- Document that SQLModel.metadata.create_all() doesn't auto-upgrade

Testing:
- Add stage creation/deletion verification to test_delete_job_success
- Verify cascade delete works correctly

Code cleanup:
- Remove unused records_processed parameter from update_stage_progress()
@sjlangley
Owner Author

PR Review Feedback Addressed

All 14 review comments have been fixed in commit 059b3ef. Here's a summary of what was changed:

Performance Improvements

COUNT(*) pagination - Changed list_user_jobs() to use proper SELECT COUNT(*) instead of loading all rows into memory. This keeps pagination fast as the table grows.

Streaming file uploads - Changed create_import_job() to stream GEDCOM files to disk in 8KB chunks instead of reading the entire file into memory. This avoids doubling memory usage under concurrency. Also computes SHA256 hash incrementally during streaming.

Bug Fixes

get_current_stage() MultipleResultsFound fix - Changed from scalar_one_or_none() to scalars().first() and now orders by ImportJobStage.order instead of id. This handles multiple PENDING stages correctly.

cancel_job() stage deletion - Fixed to use delete(ImportJobStage) instead of select(ImportJobStage) so stages are actually deleted. Also added test verification.

API Improvements

CORS methods - Added DELETE and OPTIONS to the allow_methods list so browser clients can call the DELETE endpoint.

updated_at field - Set to None until the ImportJob model has a real updated_at column. Added TODO comment.

Logging level - Changed _job_to_response() from INFO to DEBUG to avoid noisy production logs when listing jobs.

Status filter documentation - Updated docstrings to use actual enum values: uploaded/queued/in_progress/paused/completed/failed/cancelled instead of pending/running.

Schema Compatibility

Nullable order field - Made ImportJobStage.order nullable (int | None = None) for backward compatibility with existing DBs until backfilled. Added TODO comment for migration.

user_id schema change docs - Added comprehensive comment explaining that changing user_id from UUID to str is schema-breaking and production deployments should use Alembic migrations.

Testing

Stage deletion test - Enhanced test_delete_job_success to create 2 stages, verify they exist before delete, and verify they're gone after delete. This catches FK enforcement regressions.

Code Cleanup

Unused parameter removed - Removed records_processed parameter from update_stage_progress() since it wasn't being used. Simplified the function signature.


✅ All 17/17 tests pass with these changes.

All review feedback has been addressed. Ready for re-review!
