
feat: add TsFile (Apache IoTDB) packaged builder with per-device wide format#8160

Open
JackieTien97 wants to merge 5 commits into huggingface:main from JackieTien97:ly/tsfile-per-device-wide

Conversation


@JackieTien97 JackieTien97 commented Apr 29, 2026

Closes #7922

Summary

Add a packaged builder for TsFile — the columnar time-series file format used as the native storage layer of Apache IoTDB. This enables load_dataset("tsfile", data_files="...") with automatic .tsfile extension detection.

Data model

Unlike tabular formats, TsFile is time-series-aware: each file contains tables with TAG columns (device identifiers), FIELD columns (measurements), and a TIME column. The builder outputs one dataset row per device, where:

  • TAG columns are scalar string values identifying the device.
  • The time column and every FIELD column are Arrow list<...> columns holding the device's full sorted time series.
<tag_1>:    string
<tag_2>:    string
time:       list<timestamp[unit, tz]>
<field_1>:  list<original_type>
<field_2>:  list<original_type>

Key features

  • Per-device reading — data is fetched via TsFileReader.query_table with a push-down tag_filter; peak memory is bounded by a single device's payload, not the entire split.
  • Time-range pushdown — start_time / end_time are pushed down to TsFile's internal time index. Accepts int epochs, datetime, date, ISO-8601 strings, and pyarrow.TimestampScalar.
  • Schema evolution — when different files expose different FIELD columns, the loader takes the union and fills missing values with nulls. Numeric types are promoted following IoTDB's widening rules (INT32 → INT64 → DOUBLE, INT32 → FLOAT → DOUBLE).
  • Case-insensitive table names — table-name lookups use a canonical lowercase form so auto-detected and user-supplied names always match.
  • on_bad_files — "error" (default) / "warn" / "skip" to control handling of unreadable inputs.
  • Configurable batching — input_batch_size (rows per Arrow batch from the reader) and output_batch_size (devices per emitted record batch) for memory control.
  • Timestamp unit & timezone — timestamp_unit ("s" / "ms" / "us" / "ns", default "ms") and timestamp_tz (optional timezone string).
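The numeric promotion rule can be illustrated with a small sketch; the widening lattice and the `promote` helper below are hypothetical, not the builder's actual code, but they follow the rules quoted above (INT32 → INT64 → DOUBLE, INT32 → FLOAT → DOUBLE):

```python
# Illustrative sketch of the widening rules described above; the
# lattice and the `promote` helper are hypothetical, not the PR's code.
_WIDENS_TO = {
    "INT32": {"INT32", "INT64", "FLOAT", "DOUBLE"},
    "INT64": {"INT64", "DOUBLE"},
    "FLOAT": {"FLOAT", "DOUBLE"},
    "DOUBLE": {"DOUBLE"},
}

def promote(a: str, b: str) -> str:
    """Return the narrowest type that both `a` and `b` widen to."""
    common = _WIDENS_TO[a] & _WIDENS_TO[b]
    for t in ("INT32", "INT64", "FLOAT", "DOUBLE"):  # narrow-to-wide order
        if t in common:
            return t
    raise TypeError(f"no common type for {a} and {b}")
```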

Config knobs (TsFileConfig)

| Parameter | Default | Description |
| --- | --- | --- |
| `table_name` | auto-detect | Table to read (case-insensitive) |
| `columns` | all | Subset of FIELD columns to keep; TAGs are always returned |
| `start_time` / `end_time` | unbounded | Inclusive timestamp range, pushed down to the TsFile index |
| `input_batch_size` | 65,536 | Max rows per Arrow batch from the reader |
| `output_batch_size` | 32 | Devices per emitted record batch |
| `features` | inferred | Explicit `Features` schema (skips the metadata scan) |
| `on_bad_files` | `"error"` | How to handle unreadable files |
| `timestamp_unit` | `"ms"` | Time unit for the timestamp column |
| `timestamp_tz` | `None` | Time zone for the timestamp column |
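For illustration, normalizing the accepted `start_time` / `end_time` inputs could look like the sketch below. It assumes the default `"ms"` unit, covers only the int / ISO-string / datetime / date cases (not `pyarrow.TimestampScalar`), and the `to_epoch_ms` name is hypothetical, not the builder's helper:

```python
# Hypothetical helper (name and scope are illustrative) normalizing
# time-bound inputs to an epoch in milliseconds; tz-naive inputs are
# treated as UTC.
from datetime import date, datetime, timezone

def to_epoch_ms(value):
    if isinstance(value, int):
        return value  # already an epoch in the target unit
    if isinstance(value, str):
        value = datetime.fromisoformat(value)  # ISO-8601 string
    if isinstance(value, datetime):  # check before `date`: datetime is a date
        if value.tzinfo is None:
            value = value.replace(tzinfo=timezone.utc)
        return int(value.timestamp() * 1000)
    if isinstance(value, date):
        return to_epoch_ms(datetime(value.year, value.month, value.day))
    raise TypeError(f"unsupported timestamp input: {value!r}")
```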

Usage

from datasets import load_dataset

# Basic load
ds = load_dataset("tsfile", data_files="my_data.tsfile")

# With options
ds = load_dataset(
    "tsfile",
    data_files="sensor_data.tsfile",
    table_name="sensors",
    columns=["temperature", "humidity"],
    start_time="2024-01-01T00:00:00",
    end_time="2024-12-31T23:59:59",
    timestamp_unit="us",
    timestamp_tz="UTC",
)

Files changed

New files:

  • src/datasets/packaged_modules/tsfile/__init__.py + tsfile.py — the builder (770 lines)
  • tests/packaged_modules/test_tsfile.py — 47 tests (713 lines) covering: basic load, table/column selection, time-range pushdown (all accepted input types), schema evolution and numeric promotion, duplicate-timestamp rejection, multi-file × multi-device crossover, large device with small input_batch_size, timezone handling, streaming mode, on_bad_files modes, and _to_epoch boundary helper.
  • docs/source/tsfile_load.mdx — standalone Time-series loading guide

Modified files:

  • src/datasets/packaged_modules/__init__.py — register .tsfile extension and module entry
  • docs/source/_toctree.yml — add "Time-series" section to sidebar
  • docs/source/loading.mdx — add TsFile to the supported formats list and link to the guide
  • docs/source/about_dataset_load.mdx — add TsFile cross-reference
  • docs/source/package_reference/loading_methods.mdx — add TsFileConfig autodoc entry
  • setup.py — add tsfile>=2.3.0 to TESTS_REQUIRE

Dependencies

This builder requires the tsfile Python package (>=2.3.0), which is added as a test dependency only. The package is lazily imported at runtime — users who don't work with TsFile data pay no import cost.
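The lazy-import pattern can be sketched as follows; the generic `lazy_import` helper is illustrative, not the builder's actual code:

```python
# Minimal sketch of lazily importing an optional dependency: the
# third-party package is only imported when the loader actually runs,
# so users who never touch this format pay no import cost.
import importlib

def lazy_import(name: str, pip_hint: str):
    """Import `name` on first use, with an actionable error if missing."""
    try:
        return importlib.import_module(name)
    except ImportError as err:
        raise ImportError(
            f"This loader requires the `{name}` package: pip install {pip_hint}"
        ) from err
```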

Test plan

  • 47 unit tests in tests/packaged_modules/test_tsfile.py
  • CI passes on the new test suite
  • Documentation renders correctly in the docs build

Young-Leo and others added 4 commits April 28, 2026 16:57
Add a packaged builder for TsFile (table model), the columnar
time-series format used as the storage layer of Apache IoTDB.

Each output row corresponds to one device (identified by its TAG
columns); the `time` column and every FIELD column are Arrow
`list<...>` columns holding that device's full time series, sorted
in ascending time order. When a device appears in multiple files
within a split, its per-file chunks are merged and sorted; duplicate
timestamps for the same device raise `ValueError`.
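The merge-and-sort step with duplicate rejection can be sketched as follows (an illustrative sketch, not the PR's implementation):

```python
# Illustrative: merge one device's per-file chunks into a single
# ascending time series, raising ValueError on duplicate timestamps.
def merge_device_chunks(chunks):
    """chunks: list of (timestamps, values) pairs, one pair per file."""
    merged = sorted(
        ((t, v) for ts, vs in chunks for t, v in zip(ts, vs)),
        key=lambda pair: pair[0],
    )
    times = [t for t, _ in merged]
    if len(times) != len(set(times)):
        raise ValueError("duplicate timestamps for the same device")
    return times, [v for _, v in merged]
```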

Reading model
- Data is fetched per device via `TsFileReader.query_table` with a
  push-down `tag_filter`; peak memory is bounded by a single
  device's payload, not by the split size.
- `start_time` / `end_time` are pushed down to TsFile's internal
  time index. They accept `int` epochs, `datetime`, `date`,
  ISO-8601 strings, and `pyarrow.TimestampScalar`; tz-aware
  datetimes are normalized to UTC.
- Schema evolution across files: FIELD columns are unioned and
  missing values are filled with nulls; numeric FIELD types are
  promoted following IoTDB's widening rules
  (INT32 -> INT64 -> DOUBLE, INT32 -> FLOAT -> DOUBLE).
- `on_bad_files` controls handling of unreadable inputs
  ("error" | "warn" | "skip").
- `input_batch_size` bounds the per-device Arrow batch size pulled
  from the underlying tsfile reader; `output_batch_size` controls
  the number of devices packed into each emitted record batch.

Config knobs: `table_name`, `columns`, `start_time`, `end_time`,
`input_batch_size`, `output_batch_size`, `features`, `on_bad_files`,
`timestamp_unit`, `timestamp_tz`.

Tests
- 47 tests under `tests/packaged_modules/test_tsfile.py` covering:
  basic load, table/column selection, time-range pushdown (all
  accepted input types), schema evolution and numeric promotion,
  duplicate-timestamp rejection, multi-file x multi-device
  crossover, large device with small `input_batch_size`, timezone
  handling, streaming mode, `on_bad_files` modes, and the
  `_to_epoch` boundary helper.

Docs
- `docs/source/tabular_load.mdx`: dedicated TsFile section with
  data model, output schema, time-range bounds, schema evolution,
  bad-file handling, timestamp unit/tz, and batching/memory.
- `docs/source/loading.mdx`, `about_dataset_load.mdx`,
  `package_reference/loading_methods.mdx`: register and
  cross-reference the TsFile loader and `TsFileConfig` autodoc.

Other
- `setup.py`: add `tsfile>=2.2.1` to TESTS_REQUIRE.
- `src/datasets/packaged_modules/__init__.py`: register the
  `.tsfile` extension and module entry.
…p to 2.3.0

- Move the TsFile loader documentation out of tabular_load.mdx into a new top-level page docs/source/tsfile_load.mdx, and add a dedicated 'Time-series' section to the sidebar (_toctree.yml). The per-device wide layout (one row per device, list-typed time/FIELD columns) is not a generic tabular convention and warrants its own guide.

- tabular_load.mdx now points readers to the new guide via a short cross-reference instead of inlining the section.

- loading.mdx: update the 'more details' link to tsfile_load.

- setup.py: bump TESTS_REQUIRE entry from tsfile>=2.2.1 to tsfile>=2.3.0.
- Add `_schemas_by_lc` helper and route the three call sites through it so auto-detected and user-supplied table names compare in a single canonical (lowercase) form.
- Drop the now-misleading `_generate_shards` comment; the body matches the convention used by arrow.py / pandas.py / hdf5.py.
- Remove the TsFile cross-link from `tabular_load.mdx` so that page stays focused on tabular formats; time-series users land via the dedicated Time-series section in the sidebar.
- Cover tz-aware ISO-8601 strings in `_to_epoch` via a parametrized test (also drops the `__import__('datetime')` workaround now that `timedelta` is imported directly).
- gitignore local dev artifacts produced while iterating on the builder.
@JackieTien97
Author

Hi team 👋

A bit of background on me: I'm a PMC member of both Apache TsFile and Apache IoTDB, and one of the core contributors to the TsFile format specification and its Python SDK (tsfile on PyPI). I've been working on TsFile's design and implementation for several years, so I'm deeply familiar with its data model, storage internals, and the read path that this builder relies on.

The motivation for this PR is to make time-series data stored in TsFile directly accessible to the Hugging Face ecosystem. TsFile is a columnar format purpose-built for time-series workloads — it powers Apache IoTDB's storage layer and is increasingly used as a standalone interchange format for IoT and industrial data. With the growing interest in applying ML to time-series domains (forecasting, anomaly detection, foundation models, etc.), we believe a native load_dataset("tsfile", ...) integration would lower the barrier for researchers and practitioners who already have data in this format.

I'm happy to iterate on the implementation based on your feedback, and I'll be actively maintaining this builder going forward as the TsFile format evolves. Feel free to ping me on any questions about the format or the read semantics.

@JackieTien97
Author

@lhoestq Would you mind taking a look at this PR when you get a chance? I'd really appreciate your review. Thanks!

Previously, passing the time column name (e.g. columns=["time"]) added a
duplicate all-null list<float64> field that overwrote the real timestamp
list in the output schema. Now TIME is treated like TAG: silently skipped
from the requested field set so it is emitted exactly once as the real
timestamp list. Docs and tests updated.
@JackieTien97
Author

Hi @lhoestq, just a friendly ping on this PR again. I'd really appreciate it if you could take a look when you get a chance. Happy to address any feedback or make adjustments if needed. Thanks!


Development

Successfully merging this pull request may close these issues.

Support Apache TsFile Datasets
