Taiga 2 Architecture

Taiga 2 is a web application designed to store and retrieve immutable data in a folder-organized way. This document provides an overview of the system architecture.

System Overview

Taiga is built as a modern web application with a clear separation between frontend and backend components:

Frontend: React-based single-page application written in TypeScript
Backend: Python Flask API with Swagger/OpenAPI specification (the swagger document may be wrong. It was manually generated and then the api was implemented referring to that doc as the guidance.)
Storage: PostgreSQL database for metadata and Amazon S3 Storage for file storage
Background Processing: Celery with Redis for asynchronous tasks

Component Architecture

Database Layer

Taiga uses a hybrid storage approach:

PostgreSQL: Stores all metadata, user information, folder structures, and references to data files
S3: Stores the actual data files (raw files, converted files, etc.)

The database schema is managed through SQLAlchemy ORM with Alembic for migrations. Key models include:

User: User accounts and authentication
Entry: Base class for all content (folders, datasets, etc.)
Folder: Directory-like containers for organizing content
Dataset: Collections of data files with versioning
DatasetVersion: Specific versions of datasets
DataFile: References to actual files stored in S3/GCS

In other words the schema is like:

Dataset has many
- DataVersions has many
  - DataFile

Those all have a "ownership model" (meaning, if you delete a dataset, all the versions and files should also be deleted).

Independent of the Dataset model, we have Folders which can contain references to other Folders and Datasets. The relation between a Folder and its contents is modeled by Entry records.

Backend API

The backend is built with Flask and Connexion (for OpenAPI/Swagger integration):

API Endpoints: Defined in Swagger YAML and implemented in Python
Controllers: Handle business logic and database operations
Authentication: Token-based authentication system
File Processing: Conversion between different file formats (CSV, HDF5, etc.)

Key components:

taiga2/controllers/endpoint.py: API endpoint implementations
taiga2/controllers/models_controller.py: Database operations
taiga2/conv/: File conversion utilities
taiga2/third_party_clients/: Integrations with S3, GCS, Figshare, etc.

Frontend Application

The frontend is a React application written in TypeScript:

Models: TypeScript interfaces matching backend data structures
API Client: Typed API client for backend communication
UI Components: React components for user interface
State Management: React state management for application state

Background Processing

Taiga uses Celery with Redis for handling long-running tasks:

File Conversion: Converting between file formats
Data Import/Export: Handling large file uploads/downloads
Integration Tasks: Communication with third-party services

Data Flow

User Authentication: Users authenticate via token-based auth
Data Organization: Users create folders and datasets to organize their data
Data Upload: Files are uploaded to S3/GCS via presigned URLs
Processing: Background tasks process and convert files as needed
Versioning: Dataset versions are created to track changes
Access Control: Permissions control who can view/edit content

Integration Points

Taiga integrates with several external services:

AWS S3: Primary file storage
Google Cloud Storage: Alternative file storage (but in practice, not used)
Figshare: Publishing datasets to Figshare (Doesn't work. We've externalized this into a script. All figshare UI code would be best off deleted.)
Email Notifications: Subscription-based notifications for dataset changes (I think this must have stopped working)

Deployment Architecture

Taiga is deployed as a containerized application:

Docker: Application packaged as Docker containers
CI/CD: Travis CI for testing and building images (this hasn't happened in years)
GCP: Container images stored in Google Container Registry (Note: GCR was shutdown last year, so if this still works, it's actually an alias for Google Artificate Registry)
Systemd: Service management on deployment servers

Security Model

Authentication: Token-based authentication
Authorization: Group-based access control (This is not really fully implemented. If there is code about group-based access, I am willing to bet it is half-implement if at all)
Secure Storage: Taiga will sign URLs for fetching data from S3 and give them to the client, so that the transfer of large data happens directly between client and S3.

Deployment Diagram

graph TD
    subgraph "Client"
        Browser[Web Browser]
    end

    subgraph "Application Server"
        Flask[Flask API]
        Celery[Celery Workers]
        Redis[Redis Queue]
    end

    subgraph "Storage"
        Postgres[PostgreSQL]
        S3[AWS S3]
        GCS[Google Cloud Storage]
    end

    subgraph "External Services"
        Figshare[Figshare]
        Email[Email Service]
    end

    Browser -->|HTTP/HTTPS| Flask
    Flask -->|Query/Update| Postgres
    Flask -->|Enqueue Tasks| Redis
    Redis -->|Process Tasks| Celery
    Celery -->|File Operations| S3
    Celery -->|File Operations| GCS
    Celery -->|Publish| Figshare
    Celery -->|Send Notifications| Email
    Flask -->|Generate Signed URLs| S3
    Flask -->|Generate Signed URLs| GCS

Service Architecture

flowchart LR
    subgraph "Frontend"
        React[React App]
        TypeScript[TypeScript Models]
    end

    subgraph "Backend API"
        Endpoints[API Endpoints]
        Controllers[Controllers]
        Models[Data Models]
        Auth[Authentication]
    end

    subgraph "Background Processing"
        TaskQueue[Celery Queue]
        Workers[Celery Workers]
        Conversion[File Conversion]
        Import[Data Import]
        Export[Data Export]
    end

    subgraph "Storage"
        MetadataDB[PostgreSQL]
        FileStorage[S3/GCS]
    end

    React -->|API Calls| Endpoints
    TypeScript -->|Data Types| React
    Endpoints -->|Process Requests| Controllers
    Controllers -->|CRUD Operations| Models
    Models -->|Persist| MetadataDB
    Controllers -->|Enqueue Tasks| TaskQueue
    TaskQueue -->|Process| Workers
    Workers -->|Execute| Conversion
    Workers -->|Execute| Import
    Workers -->|Execute| Export
    Conversion -->|Read/Write| FileStorage
    Import -->|Write| FileStorage
    Export -->|Read| FileStorage
    Auth -->|Secure| Endpoints

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Taiga 2 Architecture

System Overview

Component Architecture

Database Layer

Backend API

Frontend Application

Background Processing

Data Flow

Integration Points

Deployment Architecture

Security Model

Deployment Diagram

Service Architecture

FilesExpand file tree

arch.md

Latest commit

History

arch.md

File metadata and controls

Taiga 2 Architecture

System Overview

Component Architecture

Database Layer

Backend API

Frontend Application

Background Processing

Data Flow

Integration Points

Deployment Architecture

Security Model

Deployment Diagram

Service Architecture