This documentation explains how to build and run data processing pipelines on YTsaurus (YT) with the YT Framework Python package.
YT Framework is a Python library for defining pipelines as ordered stages and running them either against a YT cluster in production or against the local filesystem in development.
You get:
- Pipelines built from stages under `stages/`, with YAML configuration per stage and for the pipeline.
- A dev mode that mimics table and job behavior locally (no cluster required for basic work).
- Operations such as map, vanilla, YQL (via the YT client), S3 helpers, and related utilities.
- Packaging and upload of job code when you run on the cluster.
- Less wiring for stage discovery and config than rolling everything by hand.
- One codebase: flip `pipeline.mode` between dev and prod instead of maintaining two runners.
- YQL and table helpers exposed on the same client you use for reads and writes.
- Python 3.11 or newer
- For prod mode: network access to YT, valid credentials, and a cluster whose images match your job dependencies (see Cluster requirements)
**Cluster Docker image**
In prod mode, code from `ytjobs` runs inside jobs on the cluster. The default or custom Docker image for those jobs must include the Python packages your mappers, reducers, and vanilla scripts import.
Details: Cluster requirements.
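```bash
# Install the released package from PyPI
pip install yt-framework

# Or, from a local checkout of the repository, install in editable mode
pip install -e .

# Check that the package imports and print its version
python -c "import yt_framework; print(yt_framework.__version__)"
```

The PyPI distribution is named `yt-framework`. Import paths are `yt_framework` (driver) and `ytjobs` (job-side helpers).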
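Before installing into a shared environment, it can help to confirm that the interpreter meets the Python 3.11 requirement and to see whether the distribution is already present. The following is a minimal sketch using only the standard library; the helper names `python_ok` and `installed_version` are illustrative and are not part of YT Framework.

```python
import sys
from importlib import metadata
from typing import Optional

MIN_PYTHON = (3, 11)  # minimum version stated in the requirements


def python_ok(version_info=sys.version_info) -> bool:
    """Return True when the (major, minor) version is at least MIN_PYTHON."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(dist_name: str = "yt-framework") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python OK:", python_ok())
    print("yt-framework version:", installed_version() or "not installed")
```

Looking the version up through `importlib.metadata` avoids importing the package itself, so the check also works in an environment where the import would fail for unrelated reasons.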
**Secrets for production**
Prod mode expects YT (and optionally S3) credentials in `configs/secrets.env`. Without them, the client cannot talk to the cluster.
Create `configs/secrets.env` in your pipeline repo:

```bash
# configs/secrets.env
YT_PROXY=your-yt-proxy-url
YT_TOKEN=your-yt-token
```

For S3-backed operations, add the keys your stage uses (names vary by operation; see Secrets):

```bash
S3_ENDPOINT=https://your-s3-endpoint.com
S3_DOWNLOAD_ACCESS_KEY=your-download-access-key
S3_DOWNLOAD_SECRET_KEY=your-download-secret-key
S3_UPLOAD_ACCESS_KEY=your-upload-access-key
S3_UPLOAD_SECRET_KEY=your-upload-secret-key
```

More detail: Secrets management.
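A quick pre-flight check can catch a missing `YT_PROXY` or `YT_TOKEN` before a prod run. The sketch below is a standalone helper, not the framework's own loader (how YT Framework reads the file is not shown here). It assumes simple `KEY=value` lines with `#` comments and does not handle quoting or `export` prefixes; `parse_env_text` and `missing_keys` are hypothetical names.

```python
from pathlib import Path

REQUIRED_KEYS = ("YT_PROXY", "YT_TOKEN")  # the two YT keys named in this section


def parse_env_text(text: str) -> dict:
    """Parse KEY=value lines, skipping blank lines and # comments (assumed format)."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values: dict, required=REQUIRED_KEYS) -> list:
    """Return required keys that are absent or set to an empty string."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    path = Path("configs/secrets.env")
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    print("Missing keys:", missing_keys(parse_env_text(text)))
```

For S3-backed stages, pass the keys your operation needs as `required`, since those names vary by operation.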
Minimal pipeline: one stage that writes a small table.
```bash
mkdir my_first_pipeline
cd my_first_pipeline
mkdir -p stages/create_data configs
```

`pipeline.py` at the repo root:
```python
from yt_framework.core.pipeline import DefaultPipeline

if __name__ == "__main__":
    DefaultPipeline.main()
```

`configs/config.yaml`:
```yaml
stages:
  enabled_stages:
    - create_data
pipeline:
  mode: "dev"  # use "prod" on the cluster
```

`stages/create_data/stage.py`:
```python
from yt_framework.core.pipeline import DebugContext
from yt_framework.core.stage import BaseStage


class CreateDataStage(BaseStage):
    def run(self, debug: DebugContext) -> DebugContext:
        self.logger.info("Creating data table...")
        rows = [
            {"id": 1, "name": "Alice", "value": 100},
            {"id": 2, "name": "Bob", "value": 200},
            {"id": 3, "name": "Charlie", "value": 300},
        ]
        # The output path comes from this stage's config.yaml (client.output_table).
        self.deps.yt_client.write_table(
            table_path=self.config.client.output_table,
            rows=rows,
        )
        self.logger.info("Created table with %s rows", len(rows))
        return debug
```

`stages/create_data/config.yaml`:
```yaml
client:
  output_table: //tmp/my_first_pipeline/data
```

Run the pipeline:

```bash
python pipeline.py
```

In dev mode, rows land under something like `my_first_pipeline/.dev/data.jsonl`. In prod mode, the same logical path is a YT table at `//tmp/my_first_pipeline/data`.
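To inspect what a dev run wrote, you can read the `.jsonl` file back with the standard library. This is a small sketch that assumes the dev table holds one JSON object per line (the usual JSON Lines layout); the exact file location may differ in your checkout, and `read_jsonl` is an illustrative helper rather than a framework function.

```python
import json
from pathlib import Path


def read_jsonl(path) -> list:
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    rows = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


if __name__ == "__main__":
    dev_file = Path(".dev/data.jsonl")  # assumed location; adjust to your run
    if dev_file.exists():
        for row in read_jsonl(dev_file):
            print(row)
```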
A pipeline runs stages in order. Each stage is a class with a run method.
- `DefaultPipeline`: discovers `BaseStage` subclasses under `stages/`.
- `BasePipeline`: you register stages yourself.
- `BaseStage`: base class for stage implementations.
More: Pipelines and stages.
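To make the execution model concrete, here is a toy sketch of "stages run in order, each with a `run` method". It does not use YT Framework's classes: `ToyStage`, `ToyPipeline`, and the plain `dict` context are stand-ins invented for illustration, and ordering stages by the enabled list is an assumption of this sketch. The real `BaseStage` and `DefaultPipeline` also handle discovery, configuration, and clients.

```python
class ToyStage:
    """Stand-in for a stage base class (hypothetical, not yt_framework.BaseStage)."""

    name = "toy"

    def run(self, context: dict) -> dict:
        raise NotImplementedError


class CreateData(ToyStage):
    name = "create_data"

    def run(self, context: dict) -> dict:
        context["rows"] = [{"id": 1, "value": 100}, {"id": 2, "value": 200}]
        return context


class SumValues(ToyStage):
    name = "sum_values"

    def run(self, context: dict) -> dict:
        context["total"] = sum(row["value"] for row in context["rows"])
        return context


class ToyPipeline:
    """Runs the enabled stages one after another, passing the context along."""

    def __init__(self, stages, enabled):
        by_name = {stage.name: stage for stage in stages}
        # Execution order follows the 'enabled' list, not registration order.
        self.stages = [by_name[name] for name in enabled]

    def run(self) -> dict:
        context = {}
        for stage in self.stages:
            context = stage.run(context)
        return context


result = ToyPipeline(
    stages=[SumValues(), CreateData()],
    enabled=["create_data", "sum_values"],
).run()
print(result["total"])
```

Each stage receives what the previous one returned, which is why `SumValues` can rely on `rows` being present.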
**Start in dev**
Use dev mode first: no cluster credentials, fast feedback, files under `.dev/`.
- Dev: tables as `.jsonl` under `.dev/`, local subprocesses for map/vanilla-style work, YQL backed by DuckDB where applicable.
- Prod: real YT operations, code upload to `build_folder`, jobs on the cluster.
Dev vs prod has a full comparison.
- `configs/config.yaml`: pipeline mode, enabled stages, shared options.
- `stages/<name>/config.yaml`: settings for that stage.
- `configs/secrets.env`: credentials (not committed).
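A small structural check on the pipeline config can surface typos early. The sketch below works on an already-loaded dictionary (for example, the result of parsing `configs/config.yaml` with a YAML library of your choice) and only looks at the two keys shown in this guide, `pipeline.mode` and `stages.enabled_stages`. Requiring a non-empty stage list is an assumption of this sketch, and `check_pipeline_config` is a hypothetical helper, not a framework API.

```python
VALID_MODES = {"dev", "prod"}  # the two modes described in this guide


def check_pipeline_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []

    mode = (config.get("pipeline") or {}).get("mode")
    if mode not in VALID_MODES:
        problems.append(
            f"pipeline.mode must be one of {sorted(VALID_MODES)}, got {mode!r}"
        )

    enabled = (config.get("stages") or {}).get("enabled_stages")
    if not isinstance(enabled, list) or not enabled:
        # Assumption: a pipeline with no enabled stages is treated as a mistake.
        problems.append("stages.enabled_stages must be a non-empty list")

    return problems


example = {"stages": {"enabled_stages": ["create_data"]}, "pipeline": {"mode": "dev"}}
print(check_pipeline_config(example))
```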
Row-wise transforms with uploaded mapper code. Map operations — example 04_map_operation.
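As an illustration of a row-wise transform, the sketch below writes a mapper as a generator that takes one row and yields output rows, then applies it to an in-memory list. The generator-per-row shape and the names `double_value` and `run_mapper_locally` are assumptions made for this example; the helpers that `ytjobs` actually provides are described in the map operations page.

```python
from typing import Callable, Iterable, Iterator


def double_value(row: dict) -> Iterator[dict]:
    """Row-wise transform: emit one output row with 'value' doubled."""
    yield {**row, "value": row["value"] * 2}


def run_mapper_locally(
    mapper: Callable[[dict], Iterable[dict]], rows: Iterable[dict]
) -> list:
    """Apply a mapper to each row and collect everything it yields."""
    output = []
    for row in rows:
        output.extend(mapper(row))
    return output


rows = [{"id": 1, "value": 100}, {"id": 2, "value": 200}]
result = run_mapper_locally(double_value, rows)
print(result)
```

Because a mapper may yield zero, one, or several rows per input, the same shape covers filtering and fan-out as well as one-to-one transforms.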
Jobs without mandatory input/output tables (setup, maintenance, one-off scripts). Vanilla — example 05_vanilla_operation.
Table operations through YQL via the YT client (joins, filters, aggregates, etc.). YQL — example 03_yql_operations.
List, download, and related patterns against S3-compatible storage. S3 operations — example 06_s3_integration.
- Code upload — how job bundles are built and sent to YT.
- Docker — custom images for GPU or extra system deps — example 07_custom_docker.
- Checkpoints — model artifacts for inference-style stages.
- Multiple operations — more than one operation in a stage — example 09_multiple_operations.
- API reference: `yt_framework` (autodoc from docstrings)
- YT jobs (`ytjobs`): mapper helpers, S3, logging, job config path, Cypress checkpoints
- Environment variables
- Troubleshooting
Under examples/ on GitHub:
| Example | What it shows |
|---|---|
| 01_hello_world | Minimal pipeline |
| 02_multi_stage_pipeline | Several stages and context |
| 03_yql_operations | YQL |
| 04_map_operation | Map |
| 05_vanilla_operation | Vanilla |
| 06_s3_integration | S3 |
| 07_custom_docker | Custom Docker image |
| 08_multiple_configs | Multiple config files |
| 09_multiple_operations | Multiple operations in one stage |
| 10_custom_upload | Custom upload layout |
| environment_log | Environment logging |
| video_gpu | GPU-oriented sample |
Each example directory has its own README.