A Python API server using the FastAPI framework that:
- Monitors a local filesystem tree of Parquet-backed datasets.
- Exposes REST endpoints to discover datasets/versions/files and read metadata.
- Runs lightweight analyses (row counts, basic stats, column sums/means, distincts).
- Produces simple plots (histograms, boxplots, timeseries line charts) as images or Vega-Lite JSON.
- Orchestrates processing with TextBlaster by publishing jobs to a message queue.
- No arbitrary SQL over datasets (limited curated analyses only).
- No multi-tenant auth/authorization (can be added later).
- No distributed compute (single-node, but parallelized IO/CPU).
datasets/
├── <dataset_slug>/
│   └── original/
│       └── v<semver>/
│           └── <dataset_slug>.parquet
- `dataset_slug`: kebab-case ASCII (e.g., `ai-aktindsigt`).
- Collections (future): allow multiple Parquet files under a version.
- The server must be resilient to missing or partial versions.
- Web framework: `FastAPI` running on an ASGI server such as `uvicorn`.
- Parquet/frames: `polars` (via the lazy API) for high-performance data manipulation.
- Plotting: `altair` for Vega-Lite JSON emission.
- FS watching: `watchdog`.
- Background jobs: FastAPI `BackgroundTasks` for simple, in-process tasks.
- Config: `pydantic-settings` for loading settings from environment variables or `.env` files.
- Logging/metrics: `structlog` for structured logging, `opentelemetry-python` for tracing, and `prometheus-fastapi-instrumentator` for Prometheus metrics.
- Message queue (TextBlaster): `aio-pika` for asynchronous communication with RabbitMQ.
- Caching: in-memory `dict` with `asyncio.Lock`, or a library such as `aiocache`.
# .env file
ROOT_DIR="/path/to/datasets"
BIND_ADDR="0.0.0.0"
BIND_PORT=8080
MAX_PARALLEL_SCANS=4
PLOT_BACKEND="vega"
CACHE_TTL_SECONDS=600
WATCH=True
RABBITMQ_URL="amqp://guest:guest@localhost/"
JOB_QUEUE_NAME="textblaster_jobs"

Environment variables or a `.env` file are used for configuration, loaded via `pydantic-settings`.
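A minimal `pydantic-settings` sketch matching the variables above (the class name and defaults are illustrative):

```python
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    """Application settings, loaded from environment variables or a .env file."""

    model_config = SettingsConfigDict(env_file=".env")

    root_dir: str = "/path/to/datasets"
    bind_addr: str = "0.0.0.0"
    bind_port: int = 8080
    max_parallel_scans: int = 4
    plot_backend: str = "vega"
    cache_ttl_seconds: int = 600
    watch: bool = True
    rabbitmq_url: str = "amqp://guest:guest@localhost/"
    job_queue_name: str = "textblaster_jobs"


settings = Settings()
```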
Pydantic models are used to define the data structures.
Dataset {
slug: str,
path: str,
variants: List[str] = ["original"],
versions: List[str] = [],
files: List[Dict],
inferred_schema: List[Dict]
}
AnalysisRequest {
dataset: str,
variant: str = "original",
version: str,
operations: List[Dict],
filters: Optional[List[Dict]] = None
}
AnalysisResult {
stats: Dict,
plots: Optional[List[Dict]] = None
}
TextBlasterJobRequest {
input_file: str,
output_file: str,
excluded_file: str,
text_column: str,
id_column: str
}
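As Pydantic classes, the first two models might look as follows (fields taken from the sketches above; the remaining models follow the same pattern):

```python
from typing import Dict, List, Optional

from pydantic import BaseModel, Field


class Dataset(BaseModel):
    slug: str
    path: str
    variants: List[str] = ["original"]
    versions: List[str] = Field(default_factory=list)
    files: List[Dict]
    inferred_schema: List[Dict]


class AnalysisRequest(BaseModel):
    dataset: str
    variant: str = "original"
    version: str
    operations: List[Dict]
    filters: Optional[List[Dict]] = None
```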
- Indexing
  - On boot: scan `root_dir` for dataset structure; build index (see the sketch after this list).
  - Extract light metadata (file size, mtime) and attempt schema inference from Parquet metadata.
- Watching
  - If `watch=true`, subscribe to FS changes (create/remove/modify) and update the index using `watchdog`.
- Analysis
  - Ad-hoc analysis pipeline over Polars LazyFrames.
  - Column existence/type validation with clear HTTP 400 errors.
  - Results cached in-memory with a TTL.
- Plotting
  - For the `vega` backend: return a Vega-Lite JSON spec generated by `altair`.
- TextBlaster Integration
  - Publish job requests to a RabbitMQ queue. The server acts as a producer and does not track job status directly.
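A sketch of the boot-time scan, assuming the `datasets/<slug>/<variant>/<version>/<slug>.parquet` layout and using `polars.read_parquet_schema` to infer schemas from Parquet metadata (the index shape and function name are illustrative):

```python
from pathlib import Path
from typing import Dict

import polars as pl


def scan_root(root_dir: str) -> Dict[str, dict]:
    """Walk <root_dir>/<slug>/<variant>/<version>/ and build an in-memory index."""
    index: Dict[str, dict] = {}
    root = Path(root_dir)
    for slug_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        variants: Dict[str, dict] = {}
        for variant_dir in sorted(p for p in slug_dir.iterdir() if p.is_dir()):
            versions: Dict[str, dict] = {}
            for version_dir in sorted(p for p in variant_dir.iterdir() if p.is_dir()):
                files = sorted(version_dir.glob("*.parquet"))
                versions[version_dir.name] = {
                    # Light metadata only: size and modification time.
                    "files": [
                        {"name": f.name, "size": f.stat().st_size, "mtime": f.stat().st_mtime}
                        for f in files
                    ],
                    # Schema inference from Parquet metadata, without reading data pages.
                    "inferred_schema": [
                        {"column": name, "dtype": str(dtype)}
                        for f in files
                        for name, dtype in pl.read_parquet_schema(f).items()
                    ],
                }
            variants[variant_dir.name] = versions
        index[slug_dir.name] = {
            "slug": slug_dir.name,
            "path": str(slug_dir),
            "variants": variants,
        }
    return index
```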
Base URL: /api/v1
FastAPI automatically generates OpenAPI (/openapi.json) and interactive documentation (/docs, /redoc).
- `GET /datasets` → list all datasets (see the router sketch after this list).
- `GET /datasets/{slug}` → details for one dataset.
- `GET /datasets/{slug}/{variant}` → versions.
- `GET /datasets/{slug}/{variant}/{version}` → files + inferred schema.
- `POST /analysis/preview` → lightweight: row_count + column list.
- `POST /analysis/run` → accepts `AnalysisRequest`; returns `AnalysisResult`.
- `GET /analysis/cache/{hash}` → fetch cached result by key.
- `POST /plots` → accepts a subset of `AnalysisRequest` with a single plot op; returns Vega-Lite JSON.
- `POST /textblaster/jobs` → publish job to RabbitMQ; body: `TextBlasterJobRequest`. This is a "fire-and-forget" operation.
- `GET /healthz` → `{ "status": "ok" }`.
- `GET /readyz` → verifies `root_dir` exists and RabbitMQ is reachable.
- `GET /metrics` → Prometheus metrics.
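A minimal sketch of the discovery router, assuming the index built at startup is kept on `app.state.index` (route bodies and names are illustrative):

```python
from fastapi import APIRouter, HTTPException, Request

router = APIRouter(prefix="/datasets", tags=["datasets"])


@router.get("")
async def list_datasets(request: Request) -> list[str]:
    """Return the slugs of all indexed datasets."""
    return sorted(request.app.state.index.keys())


@router.get("/{slug}")
async def get_dataset(slug: str, request: Request) -> dict:
    """Return the indexed details for a single dataset."""
    dataset = request.app.state.index.get(slug)
    if dataset is None:
        raise HTTPException(status_code=404, detail=f"Dataset '{slug}' not found")
    return dataset
```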
- Use standard FastAPI exception handling to return descriptive JSON error responses.
{
"detail": "Column 'amount' not found"
}

- Common codes: 400 (validation), 404 (dataset/missing version), 500 (internal), 503 (downstream service unavailable).
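Raising `fastapi.HTTPException` yields the `{"detail": ...}` shape shown above. A sketch of the column-validation check, assuming a recent Polars where `LazyFrame.collect_schema()` is available:

```python
from fastapi import HTTPException
import polars as pl


def require_columns(lf: pl.LazyFrame, columns: list[str]) -> None:
    """Raise a 400 error if any requested column is missing from the frame."""
    available = set(lf.collect_schema().names())
    for column in columns:
        if column not in available:
            raise HTTPException(status_code=400, detail=f"Column '{column}' not found")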
- v1: optional API key via the `X-API-Key` header, implemented using FastAPI's dependency injection system.
- CORS: configurable via FastAPI's `CORSMiddleware`.
- Path traversal prevention: ensure all file access is securely contained within `root_dir` (see the sketch after this list).
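A sketch of the path-containment check (the helper name is illustrative):

```python
from pathlib import Path

from fastapi import HTTPException


def resolve_within_root(root_dir: Path, relative_path: str) -> Path:
    """Resolve a user-supplied path and reject anything escaping root_dir."""
    candidate = (root_dir / relative_path).resolve()
    try:
        candidate.relative_to(root_dir.resolve())
    except ValueError:
        raise HTTPException(status_code=400, detail="Path escapes root_dir")
    return candidate
```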
- Polars lazy execution + predicate pushdown.
- Run `uvicorn` with multiple worker processes for concurrency.
- Parallel file reads up to `max_parallel_scans`.
- Structured logging with `structlog` (see the sketch after this list).
- Tracing with `opentelemetry-python`.
- Metrics for request durations, scan times, and job queue publishing.
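A minimal `structlog` configuration emitting one JSON object per log line (the processor choice is illustrative):

```python
import structlog

structlog.configure(
    processors=[
        structlog.processors.add_log_level,           # include the level in each event
        structlog.processors.TimeStamper(fmt="iso"),  # ISO-8601 timestamps
        structlog.processors.JSONRenderer(),          # one JSON object per line
    ]
)

log = structlog.get_logger()
log.info("dataset_indexed", slug="ai-aktindsigt", versions=1)
```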
- Python application run via an ASGI server such as `uvicorn` (optionally managed by `gunicorn` with `uvicorn` worker processes).
- Systemd unit example + Dockerfile provided.
- Read-only FS mode supported (cache is in-memory).
- Unit tests: `pytest` for business logic (analysis, plotting, etc.).
- Integration: `pytest` with `httpx.AsyncClient` to make requests to the test application (see the sketch after this list).
- Use temporary directories with sample datasets for testing file operations.
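An integration-test sketch exercising `/healthz` with `httpx.AsyncClient` against the ASGI app in-process (assumes `pytest-asyncio` and that the application is importable as `app.main:app`):

```python
import httpx
import pytest

from app.main import app  # assumed application entry point


@pytest.mark.asyncio
async def test_healthz() -> None:
    transport = httpx.ASGITransport(app=app)
    async with httpx.AsyncClient(transport=transport, base_url="http://test") as client:
        response = await client.get("/healthz")
    assert response.status_code == 200
    assert response.json() == {"status": "ok"}
```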
A) Row count + sum
- Client calls `POST /analysis/run` with `{dataset: "cellar", variant: "original", version: "v1.0.0", operations: [{"op": "row_count"}, {"op": "sum", "column": "amount"}]}`.
- Server executes the Polars plan (see the sketch below); returns `{ stats: { row_count: 123456, sum_amount: 987654.32 } }`.
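A sketch of the corresponding Polars plan, assuming a recent Polars (`pl.len()`) and a single Parquet file per version:

```python
import polars as pl


def run_row_count_and_sum(parquet_path: str, column: str) -> dict:
    """Execute the requested aggregations in a single lazy scan."""
    lf = pl.scan_parquet(parquet_path)
    result = lf.select(
        pl.len().alias("row_count"),                   # row count
        pl.col(column).sum().alias(f"sum_{column}"),   # column sum
    ).collect()
    return {"stats": result.to_dicts()[0]}
```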
B) Launch TextBlaster
- Client calls `POST /textblaster/jobs` with job details.
- Server validates the request and publishes a JSON message to the configured RabbitMQ queue (see the sketch below).
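A sketch of the publisher using `aio-pika` (connection handling is simplified; a production version would reuse a robust connection held in application state):

```python
import json

import aio_pika


async def publish_job(rabbitmq_url: str, queue_name: str, job: dict) -> None:
    """Publish a TextBlaster job request as a persistent JSON message."""
    connection = await aio_pika.connect_robust(rabbitmq_url)
    async with connection:
        channel = await connection.channel()
        await channel.declare_queue(queue_name, durable=True)
        await channel.default_exchange.publish(
            aio_pika.Message(
                body=json.dumps(job).encode(),
                content_type="application/json",
                delivery_mode=aio_pika.DeliveryMode.PERSISTENT,
            ),
            routing_key=queue_name,
        )
```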
A typical FastAPI project structure:
app/
├── api/
│   ├── __init__.py
│   ├── datasets.py
│   ├── analysis.py
│   └── textblaster.py
├── services/
│   ├── __init__.py
│   ├── indexing.py
│   ├── analysis.py
│   └── queue.py
├── models/
│   ├── __init__.py
│   └── domain.py
├── core/
│   ├── __init__.py
│   └── config.py
├── __init__.py
└── main.py
tests/
pyproject.toml
- `main.py` – bootstrap, configuration, and API router setup.
- `core/config.py` – load/validate settings using Pydantic.
- `services/indexing.py` – scanning, schema inference, and filesystem watching.
- `services/analysis.py` – request parsing, Polars plans, result formatting, caching, plotting.
- `services/queue.py` – RabbitMQ connection and message publishing.
- `api/` – FastAPI routers, request/response models.
- `models/` – Pydantic models for internal data representation.
- Requires datasets with consistent schemas per version.
- Parquet compression supported by Polars (Snappy, Zstd, Gzip).
- Time-series ops assume a parseable datetime column.
- Arrow Flight or DataFusion SQL for richer queries.
- Materialized views of common analyses.
- User-specified derived columns (safe expressions).
- gRPC API surface alongside REST.
- Role-based access control.
- Multiple files per version: always a single Parquet file per version. The server will assume exactly one file at `datasets/<slug>/<variant>/<version>/<slug>.parquet` and error if it is missing or if multiple files are found.
- Plot backend: Vega-Lite JSON preferred; the server will emit Vega-Lite specs (via `altair`) and the client will render them.
- TextBlaster interface: message queue integration. The server will publish a job request to a RabbitMQ queue.
- Filter language: Not required for v1 — analyses will be global over the dataset.
- Auth: API key header (`X-API-Key`) support is desired and will remain configurable in v1.
- Result limits: no enforced limits in v1.
- Schema drift: The server must validate that the single file in a version conforms to the expected schema and report schema drift as a validation error.
- TextBlaster output location: The output location is specified in the message sent to TextBlaster.
- Number formatting: Server returns raw numeric values; client is responsible for formatting.
- Deployment target: Bare-metal (systemd) — Docker is optional but not required.
All requests assume the API server is running on http://localhost:8080.
curl http://localhost:8080/healthz
curl http://localhost:8080/readyz
curl http://localhost:8080/api/v1/datasets
curl http://localhost:8080/api/v1/datasets/ai-aktindsigt
curl http://localhost:8080/api/v1/datasets/ai-aktindsigt/original/v1.0.0
curl -X POST http://localhost:8080/api/v1/analysis/run \
  -H "Content-Type: application/json" \
  -d '{
    "dataset": "ai-aktindsigt",
    "variant": "original",
    "version": "v1.0.0",
    "operations": [
      { "op": "row_count" },
      { "op": "sum", "column": "amount" }
    ]
  }'
curl -X POST http://localhost:8080/api/v1/plots \
  -H "Content-Type: application/json" \
  -d '{
    "dataset": "ai-aktindsigt",
    "variant": "original",
    "version": "v1.0.0",
    "operations": [
      { "op": "histogram", "column": "token_count", "bins": 20 }
    ]
  }'
curl -X POST http://localhost:8080/api/v1/textblaster/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "input_file": "/path/to/datasets/ai-aktindsigt/original/v1.0.0/ai-aktindsigt.parquet",
    "output_file": "/path/to/datasets/ai-aktindsigt/processed/v1.0.0/ai-aktindsigt.parquet",
    "excluded_file": "/path/to/datasets/ai-aktindsigt/excluded/v1.0.0/ai-aktindsigt.parquet",
    "text_column": "text",
    "id_column": "id"
  }'

- Initialize a Python project (e.g., with `poetry` or a `venv` and `pip`).
- Add core dependencies:
  - `fastapi` for the web framework.
  - `uvicorn` as the ASGI server.
  - `polars` for data processing.
  - `pydantic` and `pydantic-settings` for data validation and configuration.
  - `aio-pika` for RabbitMQ integration.
  - `watchdog` for filesystem monitoring.
  - `altair` for plotting.
  - `structlog`, `opentelemetry-python`, `prometheus-fastapi-instrumentator` for observability.
- Set up the project structure as outlined in section 15.
- Implement configuration loading (`app/core/config.py`) using `pydantic-settings`.
- (Optional) Add a simple CLI using `typer` for starting the server or running management tasks.
- Implement dataset indexing (`app/services/indexing.py`) to scan the `ROOT_DIR`.
- Add the filesystem watcher using `watchdog` to update the index on changes (see the sketch after this list).
- Create the discovery endpoints (`app/api/datasets.py`) to serve indexed data.
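A sketch of the `watchdog` wiring, assuming a `reindex` callback provided by the indexing service (names are illustrative):

```python
from typing import Callable

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer


class IndexUpdateHandler(FileSystemEventHandler):
    """Invoke the reindex callback whenever files change under root_dir."""

    def __init__(self, reindex: Callable[[str], None]) -> None:
        self.reindex = reindex

    def on_any_event(self, event) -> None:
        if not event.is_directory:
            self.reindex(event.src_path)


def start_watcher(root_dir: str, reindex: Callable[[str], None]) -> Observer:
    """Start a recursive observer on root_dir and return it for later shutdown."""
    observer = Observer()
    observer.schedule(IndexUpdateHandler(reindex), root_dir, recursive=True)
    observer.start()
    return observer
```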
- Build the analysis engine (`app/services/analysis.py`) using Polars LazyFrames.
- Implement in-memory caching with a TTL.
- Add Vega-Lite plotting with `altair` (see the sketch after this list).
- Create the analysis and plotting endpoints (`app/api/analysis.py`).
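A sketch of a histogram op emitting Vega-Lite JSON via `altair` (assumes `pandas` is installed for the Altair hand-off; names are illustrative):

```python
import altair as alt
import polars as pl


def histogram_spec(df: pl.DataFrame, column: str, bins: int = 20) -> dict:
    """Build a Vega-Lite histogram spec for one column of a Polars frame."""
    chart = (
        alt.Chart(df.to_pandas())
        .mark_bar()
        .encode(
            x=alt.X(column, bin=alt.Bin(maxbins=bins)),
            y="count()",
        )
    )
    return chart.to_dict()  # Vega-Lite JSON, ready to return from the API
```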
- Implement a RabbitMQ publisher service (`app/services/queue.py`) using `aio-pika`.
- Ensure robust connection management (e.g., retries).
- Create the TextBlaster endpoint (`app/api/textblaster.py`) to accept job requests and publish them to the queue.
- Wire everything together in `app/main.py`, including routers, middleware, and application state (like the RabbitMQ connection pool); see the sketch after this list.
- Implement health and readiness endpoints (`/healthz`, `/readyz`).
- Configure the Prometheus middleware.
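A sketch of the `app/main.py` wiring, assuming the module and function names used in the sketches above (all illustrative):

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

from app.api import analysis, datasets, textblaster
from app.core.config import settings
from app.services.indexing import scan_root


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Build the dataset index on startup and keep it on application state.
    app.state.index = scan_root(settings.root_dir)
    yield


app = FastAPI(title="Dataset API", lifespan=lifespan)

app.include_router(datasets.router, prefix="/api/v1")
app.include_router(analysis.router, prefix="/api/v1")
app.include_router(textblaster.router, prefix="/api/v1")


@app.get("/healthz")
async def healthz() -> dict:
    return {"status": "ok"}


Instrumentator().instrument(app).expose(app)  # exposes GET /metrics
```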
- Add unit tests with `pytest`.
- Add integration tests using `httpx.AsyncClient`.
- Ensure the auto-generated OpenAPI documentation at `/docs` is clean and usable.
- Create deployment artifacts (Dockerfile, example systemd service file).