This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
StatGPT is an AI-driven Talk-To-Your-Data platform that enables users to interact with official statistics data using natural language. It leverages LLMs to provide relevant data from statistical databases through conversational interfaces.
- Natural language querying of SDMX datasets
- Wide indicator search (semantic + keyword + LLM reasoning)
- Data grounding with hallucination prevention
- Multi-language support for queries and responses
- Automated data visualization with Plotly
- Glossary of terms for consistent terminology
GitHub is used for version control.
Current main branch: development
make format # Format code (autoflake, black, isort)
make lint # Run all linters (flake8, black, isort, autoflake, mypy)
make mypy # Run only mypy type checking

make test # Run all tests (unit + integration)
make test_unit # Run unit tests only
make test_integration # Run integration tests (requires test DB containers)
make test_db_migrate # Run migrations on test database

make db_migrate # Apply alembic migrations
make db_downgrade # Rollback last migration
make db_autogenerate MESSAGE="Your migration message" # Generate new migration

make extract_messages # Extract translatable strings from formatters
make update_messages # Update .po files from template
make compile_messages # Compile .po files to .mo files
make locales # Shorthand for compile_messages

make install_dev # Install dev dependencies
make install_all # Install dev + experiments dependencies
make remove_venv # Remove and recreate venv

make statgpt_cli # Start the StatGPT CLI

CLI uses `STATGPT_CLI_*` prefixed environment variables.
See statgpt/cli/README.md for full documentation.
Available Commands:
| Command | Description |
|---|---|
| `auth login` | Authenticate with the admin API |
| `auth logout` | Clear cached authentication token |
| `auth status` | Show current authentication status |
| `channel list` | List all available channels |
| `channel import` | Import channel from zip archive |
| `channel status` | Show dataset preprocessing status |
| `channel deduplicate` | Deduplicate embeddings for a channel |
| `channel reindex` | Reindex dataset embeddings for a channel |
| `content init` | Initialize channels, data sources, datasets, glossaries |
| `settings` | Show current CLI settings |
statgpt/
├── app/ # Chat Backend (DIAL Application)
│ ├── application/ # App factory and DIAL app setup
│ ├── chains/ # LangChain orchestration and agent tools
│ │ ├── data_query/ # SDMX data query tool
│ │ ├── file_rags/ # Publications RAG tool
│ │ ├── web_search/ # Web search tool
│ │ ├── datasets_meta/ # Available datasets tool
│ │ ├── glossary_tools.py # Glossary tools
│ │ └── supreme_agent.py # Main agent orchestrator
│ ├── schemas/ # Pydantic models
│ ├── services/ # Business logic
│ ├── settings/ # Pydantic Settings configuration
│ └── utils/ # Utilities (formatters with i18n)
├── admin/ # Admin Backend (FastAPI standalone)
│ ├── routers/ # API routes (channels, datasets, data sources, glossary)
│ ├── services/ # Admin business logic
│ ├── auth/ # OIDC authentication
│ ├── alembic/ # Database migrations
│ └── settings/ # Admin configuration
├── common/ # Shared code
│ ├── models/ # SQLAlchemy database models
│ ├── data/ # Data access layer and SDMX handling
│ │ ├── sdmx/ # SDMX protocol implementation (v2.1)
│ │ └── quanthub/ # QuantHub SDMX provider
│ ├── vectorstore/ # PGVector storage implementation
│ ├── hybrid_indexer/ # Vector indexing for semantic search
│ ├── services/ # Shared services
│ └── schemas/ # Shared Pydantic models
└── cli/ # Interactive command-line interface
    ├── commands/ # Command implementations (auth, channel, content, settings)
    └── shared/ # Shared utilities (admin client, console, prompts, settings)
        └── auth/ # Pluggable auth providers (azure, keycloak)
tests/
├── unit/ # Unit tests (no external dependencies)
└── integration/ # Integration tests (requires test DB containers)
StatGPT uses a tool-calling agent approach:
- Main agent (`supreme_agent.py`) orchestrates all tools
- History consists of static (system prompt, predefined calls) and dynamic (user queries, tool calls, responses) blocks
- All data responses are grounded in actual query results to prevent hallucinations
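The split between static and dynamic history blocks can be pictured with a minimal sketch. The `Message` and `History` classes, their fields, and the sample message contents are hypothetical illustrations, not the types used in `supreme_agent.py`.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    role: str  # "system", "user", "assistant", or "tool"
    content: str


@dataclass
class History:
    # Static block: system prompt and predefined calls, fixed per conversation.
    static: list[Message] = field(default_factory=list)
    # Dynamic block: user queries, tool calls, and responses, growing each turn.
    dynamic: list[Message] = field(default_factory=list)

    def append(self, role: str, content: str) -> None:
        """Record a new turn in the dynamic block only."""
        self.dynamic.append(Message(role, content))

    def as_messages(self) -> list[Message]:
        """Return the full history with the static block always first."""
        return [*self.static, *self.dynamic]


# Placeholder contents, for illustration only.
history = History(static=[Message("system", "Answer using official statistics only.")])
history.append("user", "What was GDP growth in France in 2020?")
history.append("tool", '{"tool": "data_query", "status": "ok"}')
messages = history.as_messages()
```

Keeping the static block separate means new turns never reorder or overwrite the system prompt.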
| Category | Tool | Purpose |
|---|---|---|
| Data | Available Datasets | List datasets with metadata |
| Data | Data Query | Build and execute SDMX queries from natural language |
| Publications | Available Publications | List publication types |
| Publications | Publications RAG | Query publications using RAG |
| Glossary | Glossary Terms | List available terms |
| Glossary | Glossary Definitions | Retrieve term definitions |
| Web | Web Search | Search and retrieve web content |
The data query tool (statgpt/app/chains/data_query/) implements:
- Query Normalization - Process user input
- Named Entity Recognition - Extract countries, time periods
- Indicator Selection - Semantic + keyword search for indicators
- Dataset Selection - Identify appropriate datasets
- Availability Queries - Verify data availability
- Query Execution - Fetch and format SDMX data
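As a rough sketch of how the stages listed above might be chained, the example below runs six placeholder async stages in order. The stage functions, `QueryState`, and `run_pipeline` are assumptions made for illustration; the actual code in `statgpt/app/chains/data_query/` may be organized differently.

```python
import asyncio
from collections.abc import Awaitable, Callable
from dataclasses import dataclass, field


@dataclass
class QueryState:
    user_query: str
    trace: list[str] = field(default_factory=list)  # names of completed stages


Stage = Callable[[QueryState], Awaitable[QueryState]]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records its own name."""

    async def stage(state: QueryState) -> QueryState:
        state.trace.append(name)
        return state

    return stage


# Stage order mirrors the list above; the bodies are stand-ins.
STAGES: list[Stage] = [
    make_stage(name)
    for name in (
        "query_normalization",
        "named_entity_recognition",
        "indicator_selection",
        "dataset_selection",
        "availability_queries",
        "query_execution",
    )
]


async def run_pipeline(user_query: str) -> QueryState:
    """Pass one state object through every stage sequentially."""
    state = QueryState(user_query)
    for stage in STAGES:
        state = await stage(state)
    return state


state = asyncio.run(run_pipeline("GDP growth in France since 2015"))
```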
- DIAL SDK Integration: Built on `aidial-sdk` for platform integration
- Tool-Calling Architecture: Structured tool calls instead of code generation
- Multi-Layer Search: Keyword + semantic + LLM reasoning
- Grounded Responses: All data responses cite sources and use exact values
- Async Architecture: Async/await throughout (asyncpg, aiohttp)
- Pydantic v2: All validation uses Pydantic models
- `snake_case` for functions/methods
- `PascalCase` for classes
- `UPPER_CASE` for constants
- Protected visibility (`_`) by default
- Modern type hints: `list[str]`, `dict[str, int]`, `str | None`
- Import from `collections.abc` for abstract types
- Use `typing.Self` for factory methods
- Use Pydantic models for validation (not dicts)
- Use `Field(default_factory=list)` for mutable defaults
- Validate complex function arguments with Pydantic models
- Test database uses Docker containers (`vectordb-test`, `elasticsearch-test`)
- Integration tests require `TEST_DATABASE_*` environment variables
- Use pytest fixtures for common test setup
- PostgreSQL with pgvector extension
- ElasticSearch (optional, for hybrid search)
- AI DIAL Core deployment
- `DIAL_URL` - DIAL Core URL
- `DIAL_API_KEY` - API key for DIAL Core
- `PGVECTOR_*` - Database connection settings
- `ELASTIC_*` - ElasticSearch settings (optional)
- `STATGPT_CLI_*` - CLI-specific settings (see `statgpt/cli/README.md`)
- See `statgpt/common/README.md`, `statgpt/app/README.md`, `statgpt/admin/README.md`
- Run `make format` before committing
- Ensure `make lint` passes
- Run relevant tests based on changes
- Update alembic migrations when modifying models
- Follow existing patterns in neighboring files
- See `CODE_STYLE.md` for detailed style guidelines
- StatGPT Admin Frontend - Admin UI (React/Next.js)
- StatGPT Portal Frontend - User portal UI library
- StatGPT Helm - Kubernetes deployment charts