Branch: 003-dataset-upload-ui
Feature: Web-based file upload interface for invoice processing
Status: ✅ Implementation Complete (73/73 tasks)
This feature implements a web-based upload interface as an alternative to manually placing files in the data/ folder. Users can now upload invoice files directly through the Streamlit dashboard, with automatic processing, duplicate detection, and comprehensive status feedback.
- Single & Multiple File Upload: Upload one or multiple invoice files simultaneously
- Supported Formats: PDF, Excel (.xlsx, .xls), CSV, Images (.jpg, .jpeg, .png, .webp, .avif)
- File Size Limit: 50MB per file (enforced client-side and server-side)
- File Type Validation: Automatic validation against supported formats
- Hash-based Detection: Uses SHA-256 file hashing to detect duplicate files
- User Warning: Shows warning dialog when duplicate is detected
- Force Reprocess: Option to reprocess duplicate files if needed
- Seamless Integration: Uploaded files are automatically processed using existing
process_invoice_file()function - Metadata Tracking: Upload metadata (subfolder, group, category, source) is stored with each invoice
- Status Tracking: Real-time status updates during processing
- Progress Indicators: Visual progress bars and per-file status indicators
- Upload History: Recent uploads displayed with status (processing, completed, failed, duplicate)
- Status Polling: Manual refresh button to check processing status
- Error Handling: User-friendly error messages with retry guidance
- Success Notifications: Clear feedback on upload success with links to view processed invoices
- Upload Metadata: Tracks subfolder, group/batch, category, upload source, and timestamp
- Display in Detail View: Upload metadata displayed in invoice detail page
- Organization: Files can be organized by group/batch and category
┌─────────────────┐
│ Streamlit UI │ ← Upload Component (interface/dashboard/components/upload.py)
│ (Dashboard) │
└────────┬────────┘
│ HTTP POST (multipart/form-data)
▼
┌─────────────────┐
│ FastAPI │ ← Upload Endpoint (interface/api/routes/uploads.py)
│ /api/v1/uploads│
└────────┬────────┘
│
├─→ Duplicate Detection (hash)
├─→ File Storage (data/uploads/)
├─→ Invoice Record Creation (with `storage_path`)
└─→ Automatic Processing
New Field: upload_metadata (JSONB) in invoices table
{
"subfolder": "uploads",
"group": "batch-2024-01",
"category": "monthly-invoices",
"upload_source": "web-ui",
"uploaded_at": "2024-01-05T21:30:00Z"
}Indexes Created:
- GIN index on
upload_metadatafor efficient JSON queries - Functional indexes on
upload_metadata->>'subfolder'andupload_metadata->>'group'
Upload one or more invoice files.
Request:
files: Multipart file uploads (one or more files)subfolder: Optional subfolder within data/ directory (default: "uploads")group: Optional group/batch identifiercategory: Optional category for organizationforce_reprocess: Boolean to force reprocessing of duplicates
Response (202 Accepted):
{
"status": "success",
"data": {
"uploads": [
{
"file_name": "invoice.pdf",
"invoice_id": "uuid",
"status": "processing",
"file_path": "uploads/invoice.pdf",
"file_size": 12345
}
],
"total": 1,
"successful": 1,
"failed": 0,
"skipped": 0
}
}Get current processing status of an upload.
Response:
{
"status": "success",
"data": {
"invoice_id": "uuid",
"file_name": "invoice.pdf",
"processing_status": "completed",
"upload_metadata": { ... },
"error_message": null
}
}interface/api/routes/uploads.py- Upload API endpointsinterface/dashboard/components/upload.py- Streamlit upload UI componentalembic/versions/003_add_upload_metadata.py- Database migrationtests/integration/test_upload_api.py- Integration teststests/unit/test_upload_component.py- Unit tests
core/models.py- Renamedfile_pathtostorage_pathand addedupload_metadatafieldingestion/orchestrator.py- Updated to usestorage_pathand addedupload_metadataparameter supportinterface/api/schemas.py- Added upload-related schemasinterface/api/main.py- Registered upload routerinterface/api/routes/invoices.py- Added metadata to invoice detail responseinterface/dashboard/app.py- Added upload tab and metadata displayinterface/dashboard/queries.py- Added metadata to invoice queries
- Navigate to Upload Tab: Open the Streamlit dashboard and click on the "Upload Files" tab
- Select Files: Use the file uploader to select one or more invoice files
- Optional Metadata: Expand "Upload Options" to set:
- Subfolder (default: "uploads")
- Group/Batch identifier
- Category
- Force reprocess option
- Upload: Click "Upload Files" button
- Monitor Progress: View progress indicators and status updates
- View Results: Check upload history and click "View" to see processed invoices
Running Tests:
# Unit tests
pytest tests/unit/test_upload_component.py
# Integration tests
pytest tests/integration/test_upload_api.pyAPI Testing:
# Start API server
uvicorn interface.api.main:app --reload
# Upload file via curl
curl -X POST "http://localhost:${API_PORT}/api/v1/uploads" \
-F "files=@invoice.pdf" \
-F "subfolder=uploads" \
-F "group=test-batch"- Filename Sanitization: Prevents path traversal attacks by removing
..,/,\and dangerous characters - Subfolder Validation: Validates subfolder names to prevent directory traversal
- File Type Validation: Strict validation against whitelist of supported file types
- File Size Limits: Enforced both client-side and server-side (50MB)
- Network Errors: User-friendly messages with guidance to check API server
- File Validation Errors: Clear messages about unsupported types or size limits
- Processing Errors: Detailed error messages with invoice ID for troubleshooting
- Duplicate Files: Warning with option to force reprocess
- Async Processing: Files are processed asynchronously after upload
- Batch Upload: Multiple files processed independently in parallel
- Efficient Storage: Files stored in organized subfolders
- Hash-based Deduplication: Fast duplicate detection using SHA-256 hashing
- Unit Tests: File validation, duplicate detection, component logic
- Integration Tests: API endpoints, file upload workflow, error handling
- Test Files:
tests/unit/test_upload_component.pytests/integration/test_upload_api.py
Potential improvements for future iterations:
- Automatic status polling with auto-refresh
- Drag-and-drop file upload (Streamlit supports this natively)
- Upload queue management for large batches
- Progress percentage for individual files
- Cancel upload functionality
- Upload templates/presets for common configurations
If upgrading from the previous version:
-
Run Database Migration:
alembic upgrade head
-
Verify Dependencies:
python-multipart>=0.0.12(for FastAPI file uploads)httpx>=0.27.0(for Streamlit API calls)
-
Create Upload Directory:
mkdir -p data/uploads
-
Restart Services:
- Restart FastAPI server
- Restart Streamlit dashboard
Total Tasks: 73
Completed: 73 (100%)
Remaining: 6 validation checklist items (manual testing/QA)
- ✅ Phase 1 (Setup): 4/4 tasks
- ✅ Phase 2 (Foundational): 7/7 tasks
- ✅ Phase 3 (US1 - Single Upload): 17/17 tasks
- ✅ Phase 4 (US2 - Multiple Upload): 8/8 tasks
- ✅ Phase 5 (US3 - Progress/Status): 10/10 tasks
- ✅ Phase 6 (Polish): 11/11 tasks
- Specification:
specs/003-dataset-upload-ui/spec.md - Technical Plan:
specs/003-dataset-upload-ui/plan.md - Data Model:
specs/003-dataset-upload-ui/data-model.md - API Contract:
specs/003-dataset-upload-ui/contracts/openapi.yaml - Tasks:
specs/003-dataset-upload-ui/tasks.md
Implementation Date: January 2025
Last Updated: January 5, 2025