Skip to content

Latest commit

 

History

History
274 lines (218 loc) · 9.06 KB

File metadata and controls

274 lines (218 loc) · 9.06 KB

Dataset Upload UI Implementation Summary

Branch: 003-dataset-upload-ui
Feature: Web-based file upload interface for invoice processing
Status: ✅ Implementation Complete (73/73 tasks)

Overview

This feature implements a web-based upload interface as an alternative to manually placing files in the data/ folder. Users can now upload invoice files directly through the Streamlit dashboard, with automatic processing, duplicate detection, and comprehensive status feedback.

Key Features

1. File Upload Capabilities

  • Single & Multiple File Upload: Upload one or multiple invoice files simultaneously
  • Supported Formats: PDF, Excel (.xlsx, .xls), CSV, Images (.jpg, .jpeg, .png, .webp, .avif)
  • File Size Limit: 50MB per file (enforced client-side and server-side)
  • File Type Validation: Automatic validation against supported formats

2. Duplicate Detection

  • Hash-based Detection: Uses SHA-256 file hashing to detect duplicate files
  • User Warning: Shows warning dialog when duplicate is detected
  • Force Reprocess: Option to reprocess duplicate files if needed

3. Automatic Processing

  • Seamless Integration: Uploaded files are automatically processed using existing process_invoice_file() function
  • Metadata Tracking: Upload metadata (subfolder, group, category, source) is stored with each invoice
  • Status Tracking: Real-time status updates during processing

4. User Experience

  • Progress Indicators: Visual progress bars and per-file status indicators
  • Upload History: Recent uploads displayed with status (processing, completed, failed, duplicate)
  • Status Polling: Manual refresh button to check processing status
  • Error Handling: User-friendly error messages with retry guidance
  • Success Notifications: Clear feedback on upload success with links to view processed invoices

5. Metadata Management

  • Upload Metadata: Tracks subfolder, group/batch, category, upload source, and timestamp
  • Display in Detail View: Upload metadata displayed in invoice detail page
  • Organization: Files can be organized by group/batch and category

Technical Implementation

Architecture

┌─────────────────┐
│ Streamlit UI    │  ← Upload Component (interface/dashboard/components/upload.py)
│  (Dashboard)    │
└────────┬────────┘
         │ HTTP POST (multipart/form-data)
         ▼
┌─────────────────┐
│ FastAPI         │  ← Upload Endpoint (interface/api/routes/uploads.py)
│  /api/v1/uploads│
└────────┬────────┘
         │
         ├─→ Duplicate Detection (hash)
         ├─→ File Storage (data/uploads/)
         ├─→ Invoice Record Creation (with `storage_path`)
         └─→ Automatic Processing

Database Changes

New Field: upload_metadata (JSONB) in invoices table

{
  "subfolder": "uploads",
  "group": "batch-2024-01",
  "category": "monthly-invoices",
  "upload_source": "web-ui",
  "uploaded_at": "2024-01-05T21:30:00Z"
}

Indexes Created:

  • GIN index on upload_metadata for efficient JSON queries
  • Functional indexes on upload_metadata->>'subfolder' and upload_metadata->>'group'

API Endpoints

POST /api/v1/uploads

Upload one or more invoice files.

Request:

  • files: Multipart file uploads (one or more files)
  • subfolder: Optional subfolder within data/ directory (default: "uploads")
  • group: Optional group/batch identifier
  • category: Optional category for organization
  • force_reprocess: Boolean to force reprocessing of duplicates

Response (202 Accepted):

{
  "status": "success",
  "data": {
    "uploads": [
      {
        "file_name": "invoice.pdf",
        "invoice_id": "uuid",
        "status": "processing",
        "file_path": "uploads/invoice.pdf",
        "file_size": 12345
      }
    ],
    "total": 1,
    "successful": 1,
    "failed": 0,
    "skipped": 0
  }
}

GET /api/v1/uploads/{upload_id}/status

Get current processing status of an upload.

Response:

{
  "status": "success",
  "data": {
    "invoice_id": "uuid",
    "file_name": "invoice.pdf",
    "processing_status": "completed",
    "upload_metadata": { ... },
    "error_message": null
  }
}

Files Created

New Files

  • interface/api/routes/uploads.py - Upload API endpoints
  • interface/dashboard/components/upload.py - Streamlit upload UI component
  • alembic/versions/003_add_upload_metadata.py - Database migration
  • tests/integration/test_upload_api.py - Integration tests
  • tests/unit/test_upload_component.py - Unit tests

Modified Files

  • core/models.py - Renamed file_path to storage_path and added upload_metadata field
  • ingestion/orchestrator.py - Updated to use storage_path and added upload_metadata parameter support
  • interface/api/schemas.py - Added upload-related schemas
  • interface/api/main.py - Registered upload router
  • interface/api/routes/invoices.py - Added metadata to invoice detail response
  • interface/dashboard/app.py - Added upload tab and metadata display
  • interface/dashboard/queries.py - Added metadata to invoice queries

Usage

For Users

  1. Navigate to Upload Tab: Open the Streamlit dashboard and click on the "Upload Files" tab
  2. Select Files: Use the file uploader to select one or more invoice files
  3. Optional Metadata: Expand "Upload Options" to set:
    • Subfolder (default: "uploads")
    • Group/Batch identifier
    • Category
    • Force reprocess option
  4. Upload: Click "Upload Files" button
  5. Monitor Progress: View progress indicators and status updates
  6. View Results: Check upload history and click "View" to see processed invoices

For Developers

Running Tests:

# Unit tests
pytest tests/unit/test_upload_component.py

# Integration tests
pytest tests/integration/test_upload_api.py

API Testing:

# Start API server
uvicorn interface.api.main:app --reload

# Upload file via curl
curl -X POST "http://localhost:${API_PORT}/api/v1/uploads" \
  -F "files=@invoice.pdf" \
  -F "subfolder=uploads" \
  -F "group=test-batch"

Security Features

  1. Filename Sanitization: Prevents path traversal attacks by removing .., /, \ and dangerous characters
  2. Subfolder Validation: Validates subfolder names to prevent directory traversal
  3. File Type Validation: Strict validation against whitelist of supported file types
  4. File Size Limits: Enforced both client-side and server-side (50MB)

Error Handling

  • Network Errors: User-friendly messages with guidance to check API server
  • File Validation Errors: Clear messages about unsupported types or size limits
  • Processing Errors: Detailed error messages with invoice ID for troubleshooting
  • Duplicate Files: Warning with option to force reprocess

Performance Considerations

  • Async Processing: Files are processed asynchronously after upload
  • Batch Upload: Multiple files processed independently in parallel
  • Efficient Storage: Files stored in organized subfolders
  • Hash-based Deduplication: Fast duplicate detection using SHA-256 hashing

Testing Coverage

  • Unit Tests: File validation, duplicate detection, component logic
  • Integration Tests: API endpoints, file upload workflow, error handling
  • Test Files:
    • tests/unit/test_upload_component.py
    • tests/integration/test_upload_api.py

Future Enhancements (Not Implemented)

Potential improvements for future iterations:

  • Automatic status polling with auto-refresh
  • Drag-and-drop file upload (Streamlit supports this natively)
  • Upload queue management for large batches
  • Progress percentage for individual files
  • Cancel upload functionality
  • Upload templates/presets for common configurations

Migration Guide

If upgrading from the previous version:

  1. Run Database Migration:

    alembic upgrade head
  2. Verify Dependencies:

    • python-multipart>=0.0.12 (for FastAPI file uploads)
    • httpx>=0.27.0 (for Streamlit API calls)
  3. Create Upload Directory:

    mkdir -p data/uploads
  4. Restart Services:

    • Restart FastAPI server
    • Restart Streamlit dashboard

Task Completion Summary

Total Tasks: 73
Completed: 73 (100%)
Remaining: 6 validation checklist items (manual testing/QA)

Phase Breakdown

  • ✅ Phase 1 (Setup): 4/4 tasks
  • ✅ Phase 2 (Foundational): 7/7 tasks
  • ✅ Phase 3 (US1 - Single Upload): 17/17 tasks
  • ✅ Phase 4 (US2 - Multiple Upload): 8/8 tasks
  • ✅ Phase 5 (US3 - Progress/Status): 10/10 tasks
  • ✅ Phase 6 (Polish): 11/11 tasks

Related Documentation

  • Specification: specs/003-dataset-upload-ui/spec.md
  • Technical Plan: specs/003-dataset-upload-ui/plan.md
  • Data Model: specs/003-dataset-upload-ui/data-model.md
  • API Contract: specs/003-dataset-upload-ui/contracts/openapi.yaml
  • Tasks: specs/003-dataset-upload-ui/tasks.md

Implementation Date: January 2025
Last Updated: January 5, 2025