CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

StatGPT is an AI-driven Talk-To-Your-Data platform that enables users to interact with official statistics data using natural language. It leverages LLMs to provide relevant data from statistical databases through conversational interfaces.

Key Capabilities

Natural language querying of SDMX datasets
Wide indicator search (semantic + keyword + LLM reasoning)
Data grounding with hallucination prevention
Multi-language support for queries and responses
Automated data visualization with Plotly
Glossary of terms for consistent terminology

Version Control

GitHub is used for version control.

Current main branch: development

Commands

Code Quality

make format              # Format code (autoflake, black, isort)
make lint                # Run all linters (flake8, black, isort, autoflake, mypy)
make mypy                # Run only mypy type checking

Testing

make test                # Run all tests (unit + integration)
make test_unit           # Run unit tests only
make test_integration    # Run integration tests (requires test DB containers)
make test_db_migrate     # Run migrations on test database

Database Operations

make db_migrate          # Apply alembic migrations
make db_downgrade        # Rollback last migration
make db_autogenerate MESSAGE="Your migration message"  # Generate new migration

Localization

make extract_messages    # Extract translatable strings from formatters
make update_messages     # Update .po files from template
make compile_messages    # Compile .po files to .mo files
make locales             # Shorthand for compile_messages

Virtual Environment

make install_dev         # Install dev dependencies
make install_all         # Install dev + experiments dependencies
make remove_venv         # Remove and recreate venv

CLI (Interactive Command-Line Interface)

make statgpt_cli        # Start the StatGPT CLI

CLI uses STATGPT_CLI_* prefixed environment variables. See statgpt/cli/README.md for full documentation.

Available Commands:

Command	Description
`auth login`	Authenticate with the admin API
`auth logout`	Clear cached authentication token
`auth status`	Show current authentication status
`channel list`	List all available channels
`channel import`	Import channel from zip archive
`channel status`	Show dataset preprocessing status
`channel deduplicate`	Deduplicate embeddings for a channel
`channel reindex`	Reindex dataset embeddings for a channel
`content init`	Initialize channels, data sources, datasets, glossaries
`settings`	Show current CLI settings

Architecture Overview

Project Structure

statgpt/
├── app/                 # Chat Backend (DIAL Application)
│   ├── application/     # App factory and DIAL app setup
│   ├── chains/          # LangChain orchestration and agent tools
│   │   ├── data_query/  # SDMX data query tool
│   │   ├── file_rags/   # Publications RAG tool
│   │   ├── web_search/  # Web search tool
│   │   ├── datasets_meta/       # Available datasets tool
│   │   ├── glossary_tools.py    # Glossary tools
│   │   └── supreme_agent.py     # Main agent orchestrator
│   ├── schemas/         # Pydantic models
│   ├── services/        # Business logic
│   ├── settings/        # Pydantic Settings configuration
│   └── utils/           # Utilities (formatters with i18n)
├── admin/               # Admin Backend (FastAPI standalone)
│   ├── routers/         # API routes (channels, datasets, data sources, glossary)
│   ├── services/        # Admin business logic
│   ├── auth/            # OIDC authentication
│   ├── alembic/         # Database migrations
│   └── settings/        # Admin configuration
├── common/              # Shared code
│   ├── models/          # SQLAlchemy database models
│   ├── data/            # Data access layer and SDMX handling
│   │   ├── sdmx/        # SDMX protocol implementation (v2.1)
│   │   └── quanthub/    # QuantHub SDMX provider
│   ├── vectorstore/     # PGVector storage implementation
│   ├── hybrid_indexer/  # Vector indexing for semantic search
│   ├── services/        # Shared services
│   └── schemas/         # Shared Pydantic models
└── cli/                 # Interactive command-line interface
    ├── commands/        # Command implementations (auth, channel, content, settings)
    └── shared/          # Shared utilities (admin client, console, prompts, settings)
        └── auth/        # Pluggable auth providers (azure, keycloak)

tests/
├── unit/                # Unit tests (no external dependencies)
└── integration/         # Integration tests (requires test DB containers)

Agent Architecture

StatGPT uses a tool-calling agent approach:

Main agent (supreme_agent.py) orchestrates all tools
History consists of static (system prompt, predefined calls) and dynamic (user queries, tool calls, responses) blocks
All data responses are grounded in actual query results to prevent hallucinations

Available Agent Tools

Category	Tool	Purpose
Data	Available Datasets	List datasets with metadata
Data	Data Query	Build and execute SDMX queries from natural language
Publications	Available Publications	List publication types
Publications	Publications RAG	Query publications using RAG
Glossary	Glossary Terms	List available terms
Glossary	Glossary Definitions	Retrieve term definitions
Web	Web Search	Search and retrieve web content

Data Query Pipeline

The data query tool (statgpt/app/chains/data_query/) implements:

Query Normalization - Process user input
Named Entity Recognition - Extract countries, time periods
Indicator Selection - Semantic + keyword search for indicators
Dataset Selection - Identify appropriate datasets
Availability Queries - Verify data availability
Query Execution - Fetch and format SDMX data

Key Design Patterns

DIAL SDK Integration: Built on aidial-sdk for platform integration
Tool-Calling Architecture: Structured tool calls instead of code generation
Multi-Layer Search: Keyword + semantic + LLM reasoning
Grounded Responses: All data responses cite sources and use exact values
Async Architecture: Async/await throughout (asyncpg, aiohttp)
Pydantic v2: All validation uses Pydantic models

Code Style Requirements

Python Style

snake_case for functions/methods
PascalCase for classes
UPPER_CASE for constants
Protected visibility (_) by default
Modern type hints: list[str], dict[str, int], str | None
Import from collections.abc for abstract types
Use typing.Self for factory methods

Pydantic Best Practices

Use Pydantic models for validation (not dicts)
Use Field(default_factory=list) for mutable defaults
Validate complex function arguments with Pydantic models

Testing Guidelines

Test database uses Docker containers (vectordb-test, elasticsearch-test)
Integration tests require TEST_DATABASE_* environment variables
Use pytest fixtures for common test setup

Environment Setup

Required Services

PostgreSQL with pgvector extension
ElasticSearch (optional, for hybrid search)
AI DIAL Core deployment

Key Environment Variables

DIAL_URL - DIAL Core URL
DIAL_API_KEY - API key for DIAL Core
PGVECTOR_* - Database connection settings
ELASTIC_* - ElasticSearch settings (optional)
STATGPT_CLI_* - CLI-specific settings (see statgpt/cli/README.md)
See statgpt/common/README.md, statgpt/app/README.md, statgpt/admin/README.md

Development Workflow

Run make format before committing
Ensure make lint passes
Run relevant tests based on changes
Update alembic migrations when modifying models
Follow existing patterns in neighboring files
See CODE_STYLE.md for detailed style guidelines

Related Repositories

StatGPT Admin Frontend - Admin UI (React/Next.js)
StatGPT Portal Frontend - User portal UI library
StatGPT Helm - Kubernetes deployment charts

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CLAUDE.md

Project Overview

Key Capabilities

Version Control

Commands

Code Quality

Testing

Database Operations

Localization

Virtual Environment

CLI (Interactive Command-Line Interface)

Architecture Overview

Project Structure

Agent Architecture

Available Agent Tools

Data Query Pipeline

Key Design Patterns

Code Style Requirements

Python Style

Pydantic Best Practices

Testing Guidelines

Environment Setup

Required Services

Key Environment Variables

Development Workflow

Related Repositories

FilesExpand file tree

CLAUDE.md

Latest commit

History

CLAUDE.md

File metadata and controls

CLAUDE.md

Project Overview

Key Capabilities

Version Control

Commands

Code Quality

Testing

Database Operations

Localization

Virtual Environment

CLI (Interactive Command-Line Interface)

Architecture Overview

Project Structure

Agent Architecture

Available Agent Tools

Data Query Pipeline

Key Design Patterns

Code Style Requirements

Python Style

Pydantic Best Practices

Testing Guidelines

Environment Setup

Required Services

Key Environment Variables

Development Workflow

Related Repositories