YT Framework documentation

This documentation explains how to build and run data processing pipelines on YTsaurus (YT) with the YT Framework Python package.

Table of contents

architecture/layers
configuration/index
testing/yt-cluster-integration
testing/example-pipelines
operations/index
advanced/index
reference/api
reference/ytjobs
reference/environment-variables
troubleshooting/index

Introduction

YT Framework is a Python library for defining pipelines as ordered stages and running them either against a YT cluster in production or against the local filesystem in development.

You get:

  • Pipelines built from stages under stages/, with YAML configuration per stage and for the pipeline.
  • A dev mode that mimics table and job behavior locally (no cluster required for basic work).
  • Operations such as map, vanilla, YQL (via the YT client), S3 helpers, and related utilities.
  • Packaging and upload of job code when you run on the cluster.

When it helps

  • Less wiring for stage discovery and config than rolling everything by hand.
  • One codebase: flip pipeline.mode between dev and prod instead of maintaining two runners.
  • YQL and table helpers exposed on the same client you use for reads and writes.

Installation

Prerequisites

  • Python 3.11 or newer
  • For prod mode: network access to YT, valid credentials, and a cluster whose images match your job dependencies (see Cluster requirements)

YT cluster requirements

**Cluster Docker image**

In prod mode, code from `ytjobs` runs inside jobs on the cluster. The default or custom Docker image for those jobs must include the Python packages your mappers, reducers, and vanilla scripts import.

Details: Cluster requirements.

Install from PyPI

pip install yt-framework

Install from source

pip install -e .

Verify installation

python -c "import yt_framework; print(yt_framework.__version__)"

The PyPI distribution is named yt-framework. Import paths are yt_framework (driver) and ytjobs (job-side helpers).
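
To check an environment programmatically (for example in a CI step), a minimal sketch along these lines may help. It uses only the standard library; the helper name `check_install` is hypothetical and not part of YT Framework.

```python
from importlib import metadata, util


def check_install(dist_name: str = "yt-framework",
                  modules: tuple = ("yt_framework", "ytjobs")) -> dict:
    """Report the installed distribution version and which modules are importable."""
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None  # distribution is not installed in this environment

    # find_spec returns None for a missing top-level module instead of raising.
    importable = {name: util.find_spec(name) is not None for name in modules}
    return {"version": version, "importable": importable}


if __name__ == "__main__":
    print(check_install())
```

A `version` of `None`, or any `False` entry under `importable`, suggests the package is not installed in the interpreter you are running.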

Credentials for prod

**Secrets for production**

Prod mode expects YT (and optionally S3) credentials in `configs/secrets.env`. Without them, the client cannot talk to the cluster.

Create configs/secrets.env in your pipeline repo:

# configs/secrets.env
YT_PROXY=your-yt-proxy-url
YT_TOKEN=your-yt-token

For S3-backed operations, add the keys your stage uses (names vary by operation; see Secrets):

S3_ENDPOINT=https://your-s3-endpoint.com
S3_DOWNLOAD_ACCESS_KEY=your-download-access-key
S3_DOWNLOAD_SECRET_KEY=your-download-secret-key
S3_UPLOAD_ACCESS_KEY=your-upload-access-key
S3_UPLOAD_SECRET_KEY=your-upload-secret-key

More detail: Secrets management.
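
Before a prod run it can be useful to confirm that the file defines the required keys. The sketch below is a standalone pre-flight check, not the framework's own loader; it assumes plain `KEY=VALUE` lines with `#` comments, and the helper names `parse_env` and `missing_keys` are hypothetical.

```python
from pathlib import Path

REQUIRED_KEYS = ("YT_PROXY", "YT_TOKEN")  # the two YT keys shown above


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines; skip blanks and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the first '=' only
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str, required=REQUIRED_KEYS) -> list:
    """Return required keys that are absent or empty."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    path = Path("configs/secrets.env")
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    print("missing:", missing_keys(text))
```

For S3-backed stages, pass the key names your operation needs as the `required` argument.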

Quick start

Minimal pipeline: one stage that writes a small table.

Step 1: Layout

mkdir my_first_pipeline
cd my_first_pipeline
mkdir -p stages/create_data configs

Step 2: Entry point

pipeline.py at the repo root:

from yt_framework.core.pipeline import DefaultPipeline

if __name__ == "__main__":
    DefaultPipeline.main()

Step 3: Pipeline config

configs/config.yaml:

stages:
  enabled_stages:
    - create_data

pipeline:
  mode: "dev"  # use "prod" on the cluster
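
The structure above can be sanity-checked once it has been loaded into a dictionary. This is a sketch under assumptions: it treats `dev` and `prod` as the only valid modes (the two this guide mentions), and `validate_config` is a hypothetical helper, not a framework function. Loading the YAML itself is left to whichever parser you use.

```python
VALID_MODES = {"dev", "prod"}  # assumption: the two modes named in this guide


def validate_config(cfg: dict) -> list:
    """Return a list of problems found in a parsed configs/config.yaml."""
    problems = []

    # `or {}` guards against a key that is present but empty (YAML null).
    stages = (cfg.get("stages") or {}).get("enabled_stages")
    if not isinstance(stages, list) or not all(isinstance(s, str) for s in stages):
        problems.append("stages.enabled_stages must be a list of stage names")

    mode = (cfg.get("pipeline") or {}).get("mode")
    if not isinstance(mode, str) or mode not in VALID_MODES:
        problems.append(
            f"pipeline.mode must be one of {sorted(VALID_MODES)}, got {mode!r}"
        )

    return problems


if __name__ == "__main__":
    cfg = {
        "stages": {"enabled_stages": ["create_data"]},
        "pipeline": {"mode": "dev"},
    }
    print(validate_config(cfg))
```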

Step 4: Stage

stages/create_data/stage.py:

from yt_framework.core.pipeline import DebugContext
from yt_framework.core.stage import BaseStage

class CreateDataStage(BaseStage):
    def run(self, debug: DebugContext) -> DebugContext:
        self.logger.info("Creating data table...")

        rows = [
            {"id": 1, "name": "Alice", "value": 100},
            {"id": 2, "name": "Bob", "value": 200},
            {"id": 3, "name": "Charlie", "value": 300},
        ]

        self.deps.yt_client.write_table(
            table_path=self.config.client.output_table,
            rows=rows,
        )

        self.logger.info("Created table with %s rows", len(rows))
        return debug

stages/create_data/config.yaml:

client:
  output_table: //tmp/my_first_pipeline/data

Step 5: Run

python pipeline.py

In dev mode, rows land under something like my_first_pipeline/.dev/data.jsonl. In prod mode, the same logical path is a YT table at //tmp/my_first_pipeline/data.
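
To inspect the dev output, the JSON Lines file can be read with the standard library alone. A minimal sketch, assuming one JSON object per line; the path in the usage block is a guess based on the approximate location mentioned above and may differ in your checkout, and `read_jsonl` is a hypothetical helper.

```python
import json
from pathlib import Path


def read_jsonl(path) -> list:
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    rows = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                rows.append(json.loads(line))
    return rows


if __name__ == "__main__":
    dev_table = Path(".dev/data.jsonl")  # assumed location; adjust to your layout
    if dev_table.exists():
        for row in read_jsonl(dev_table):
            print(row)
    else:
        print(f"{dev_table} not found; run the pipeline in dev mode first")
```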

Where to go next

Core concepts

Pipelines and stages

A pipeline runs stages in order. Each stage is a class with a run method.

  • DefaultPipeline: discovers BaseStage subclasses under stages/.
  • BasePipeline: you register stages yourself.
  • BaseStage: base class for stage implementations.

More: Pipelines and stages.
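
The ordering idea can be illustrated without the framework. The toy classes below are not YT Framework's implementation; they only mimic the shape described above (stages with a `run` method, executed in order, each receiving and returning a shared context). All names here are hypothetical.

```python
from typing import Optional, Protocol


class Stage(Protocol):
    def run(self, context: dict) -> dict: ...


class CreateData:
    def run(self, context: dict) -> dict:
        context["rows"] = [{"id": 1, "value": 100}, {"id": 2, "value": 200}]
        return context


class SumValues:
    def run(self, context: dict) -> dict:
        # Depends on "rows", so it must run after CreateData.
        context["total"] = sum(row["value"] for row in context["rows"])
        return context


def run_pipeline(stages: list, context: Optional[dict] = None) -> dict:
    """Run stages in list order, threading the context through each one."""
    context = {} if context is None else context
    for stage in stages:
        context = stage.run(context)
    return context


if __name__ == "__main__":
    result = run_pipeline([CreateData(), SumValues()])
    print(result["total"])
```

Swapping the two stages raises a `KeyError`, which is the point of the sketch: a later stage can rely on what earlier stages produced.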

Dev vs prod

**Start in dev**

Use dev mode first: it needs no cluster credentials, gives fast feedback, and keeps its files under `.dev/`.

  • Dev: tables as .jsonl under .dev/, local subprocesses for map/vanilla-style work, YQL backed by DuckDB where applicable.
  • Prod: real YT operations, code upload to build_folder, jobs on the cluster.

See Dev vs prod for a full comparison.

Configuration

  • configs/config.yaml: pipeline mode, enabled stages, shared options.
  • stages/<name>/config.yaml: settings for that stage.
  • configs/secrets.env: credentials (not committed).

More: Configuration.
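
A quick way to see whether a repository follows this layout is to check for the three kinds of files. The helper below is a hypothetical sketch, not part of the framework; it only tests that paths exist, and it assumes every listed stage has its own `config.yaml`, which may not hold for your stages.

```python
from pathlib import Path


def missing_files(root, stage_names, need_secrets=False) -> list:
    """List expected config files that are absent under `root`."""
    root = Path(root)
    expected = [Path("configs/config.yaml")]
    expected += [Path("stages") / name / "config.yaml" for name in stage_names]
    if need_secrets:  # credentials are only needed for prod mode
        expected.append(Path("configs/secrets.env"))
    return [p.as_posix() for p in expected if not (root / p).is_file()]


if __name__ == "__main__":
    print(missing_files(".", ["create_data"]))
```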

Operations

Map

Row-wise transforms with uploaded mapper code. See Map operations and example 04_map_operation.
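
Conceptually, a mapper is a function applied to each row independently. The sketch below shows that shape in plain Python on in-memory rows; it does not use the framework's job entry point or I/O helpers, whose conventions are covered in Map operations. The names `double_value` and `apply_mapper` are hypothetical.

```python
from typing import Callable, Iterable, Iterator


def double_value(row: dict) -> Iterator[dict]:
    """Example mapper: emit one output row per input row."""
    yield {**row, "value": row["value"] * 2}


def apply_mapper(mapper: Callable[[dict], Iterable[dict]],
                 rows: Iterable[dict]) -> Iterator[dict]:
    """Apply a mapper to each row; a mapper may yield zero or more rows."""
    for row in rows:
        yield from mapper(row)


if __name__ == "__main__":
    rows = [{"id": 1, "value": 100}, {"id": 2, "value": 200}]
    for out in apply_mapper(double_value, rows):
        print(out)
```

Because each row is handled on its own, the same function can in principle run on any slice of the table, which is what makes the pattern easy to distribute.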

Vanilla

Jobs without mandatory input/output tables (setup, maintenance, one-off scripts). See Vanilla and example 05_vanilla_operation.

YQL

Table operations through YQL via the YT client (joins, filters, aggregates, etc.). See YQL and example 03_yql_operations.

S3

List, download, and related patterns against S3-compatible storage. See S3 operations and example 06_s3_integration.

Advanced topics

Reference

Examples

Under examples/ on GitHub:

| Example | What it shows |
| --- | --- |
| 01_hello_world | Minimal pipeline |
| 02_multi_stage_pipeline | Several stages and context |
| 03_yql_operations | YQL |
| 04_map_operation | Map |
| 05_vanilla_operation | Vanilla |
| 06_s3_integration | S3 |
| 07_custom_docker | Custom Docker image |
| 08_multiple_configs | Multiple config files |
| 09_multiple_operations | Multiple operations in one stage |
| 10_custom_upload | Custom upload layout |
| environment_log | Environment logging |
| video_gpu | GPU-oriented sample |

Each example directory has its own README.