
feat: add TsFile (Apache IoTDB) packaged builder with per-device wide format#8160

Open
JackieTien97 wants to merge 5 commits into huggingface:main from JackieTien97:ly/tsfile-per-device-wide

Conversation


@JackieTien97 JackieTien97 commented Apr 29, 2026

Closes #7922

Summary

Add a packaged builder for TsFile — the columnar time-series file format used as the native storage layer of Apache IoTDB. This enables load_dataset("tsfile", data_files="...") with automatic .tsfile extension detection.

Data model

Unlike tabular formats, TsFile is time-series-aware: each file contains tables with TAG columns (device identifiers), FIELD columns (measurements), and a TIME column. The builder outputs one dataset row per device, where:

  • TAG columns are scalar string values identifying the device.
  • The time column and every FIELD column are Arrow list<...> columns holding the device's full sorted time series.
<tag_1>:    string
<tag_2>:    string
time:       list<timestamp[unit, tz]>
<field_1>:  list<original_type>
<field_2>:  list<original_type>

Key features

  • Per-device reading — data is fetched via TsFileReader.query_table with a push-down tag_filter; peak memory is bounded by a single device's payload, not the entire split.
  • Time-range pushdown — start_time / end_time are pushed down to TsFile's internal time index. Accepts int epochs, datetime, date, ISO-8601 strings, and pyarrow.TimestampScalar.
  • Schema evolution — when different files expose different FIELD columns, the loader takes the union and fills missing values with nulls. Numeric types are promoted following IoTDB's widening rules (INT32 → INT64 → DOUBLE, INT32 → FLOAT → DOUBLE).
  • Case-insensitive table names — table-name lookups use a canonical lowercase form so auto-detected and user-supplied names always match.
  • on_bad_files — "error" (default) / "warn" / "skip" to control handling of unreadable inputs.
  • Configurable batching — input_batch_size (rows per Arrow batch from the reader) and output_batch_size (devices per emitted record batch) for memory control.
  • Timestamp unit & timezone — timestamp_unit ("s" / "ms" / "us" / "ns", default "ms") and timestamp_tz (optional timezone string).
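The numeric promotion rule can be illustrated with a small sketch; the widening lattice and the `promote` helper below are hypothetical, not the builder's actual code, but they follow the rules quoted above (INT32 → INT64 → DOUBLE, INT32 → FLOAT → DOUBLE):

```python
# Illustrative sketch of the widening rules described above; the
# lattice and the `promote` helper are hypothetical, not the PR's code.
_WIDENS_TO = {
    "INT32": {"INT32", "INT64", "FLOAT", "DOUBLE"},
    "INT64": {"INT64", "DOUBLE"},
    "FLOAT": {"FLOAT", "DOUBLE"},
    "DOUBLE": {"DOUBLE"},
}

def promote(a: str, b: str) -> str:
    """Return the narrowest type that both `a` and `b` widen to."""
    common = _WIDENS_TO[a] & _WIDENS_TO[b]
    for t in ("INT32", "INT64", "FLOAT", "DOUBLE"):  # narrow-to-wide order
        if t in common:
            return t
    raise TypeError(f"no common type for {a} and {b}")
```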

Config knobs (TsFileConfig)

| Parameter | Default | Description |
| --- | --- | --- |
| `table_name` | auto-detect | Table to read (case-insensitive) |
| `columns` | all | Subset of FIELD columns to keep; TAGs are always returned |
| `start_time` / `end_time` | unbounded | Inclusive timestamp range, pushed down to the TsFile index |
| `input_batch_size` | 65,536 | Max rows per Arrow batch from the reader |
| `output_batch_size` | 32 | Devices per emitted record batch |
| `features` | inferred | Explicit `Features` schema (skips the metadata scan) |
| `on_bad_files` | `"error"` | How to handle unreadable files |
| `timestamp_unit` | `"ms"` | Time unit for the timestamp column |
| `timestamp_tz` | `None` | Time zone for the timestamp column |
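For illustration, normalizing the accepted `start_time` / `end_time` inputs could look like the sketch below. It assumes the default `"ms"` unit, covers only the int / ISO-string / datetime / date cases (not `pyarrow.TimestampScalar`), and the `to_epoch_ms` name is hypothetical, not the builder's helper:

```python
# Hypothetical helper (name and scope are illustrative) normalizing
# time-bound inputs to an epoch in milliseconds; tz-naive inputs are
# treated as UTC.
from datetime import date, datetime, timezone

def to_epoch_ms(value):
    if isinstance(value, int):
        return value  # already an epoch in the target unit
    if isinstance(value, str):
        value = datetime.fromisoformat(value)  # ISO-8601 string
    if isinstance(value, datetime):  # check before `date`: datetime is a date
        if value.tzinfo is None:
            value = value.replace(tzinfo=timezone.utc)
        return int(value.timestamp() * 1000)
    if isinstance(value, date):
        return to_epoch_ms(datetime(value.year, value.month, value.day))
    raise TypeError(f"unsupported timestamp input: {value!r}")
```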

Usage

from datasets import load_dataset

# Basic load
ds = load_dataset("tsfile", data_files="my_data.tsfile")

# With options
ds = load_dataset(
    "tsfile",
    data_files="sensor_data.tsfile",
    table_name="sensors",
    columns=["temperature", "humidity"],
    start_time="2024-01-01T00:00:00",
    end_time="2024-12-31T23:59:59",
    timestamp_unit="us",
    timestamp_tz="UTC",
)

Files changed

New files:

  • src/datasets/packaged_modules/tsfile/__init__.py + tsfile.py — the builder (770 lines)
  • tests/packaged_modules/test_tsfile.py — 47 tests (713 lines) covering: basic load, table/column selection, time-range pushdown (all accepted input types), schema evolution and numeric promotion, duplicate-timestamp rejection, multi-file × multi-device crossover, large device with small input_batch_size, timezone handling, streaming mode, on_bad_files modes, and _to_epoch boundary helper.
  • docs/source/tsfile_load.mdx — standalone Time-series loading guide

Modified files:

  • src/datasets/packaged_modules/__init__.py — register .tsfile extension and module entry
  • docs/source/_toctree.yml — add "Time-series" section to sidebar
  • docs/source/loading.mdx — add TsFile to the supported formats list and link to the guide
  • docs/source/about_dataset_load.mdx — add TsFile cross-reference
  • docs/source/package_reference/loading_methods.mdx — add TsFileConfig autodoc entry
  • setup.py — add tsfile>=2.3.0 to TESTS_REQUIRE

Dependencies

This builder requires the tsfile Python package (>=2.3.0), which is added as a test dependency only. The package is lazily imported at runtime — users who don't work with TsFile data pay no import cost.
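The lazy-import pattern can be sketched as follows; the generic `lazy_import` helper is illustrative, not the builder's actual code:

```python
# Minimal sketch of lazily importing an optional dependency: the
# third-party package is only imported when the loader actually runs,
# so users who never touch this format pay no import cost.
import importlib

def lazy_import(name: str, pip_hint: str):
    """Import `name` on first use, with an actionable error if missing."""
    try:
        return importlib.import_module(name)
    except ImportError as err:
        raise ImportError(
            f"This loader requires the `{name}` package: pip install {pip_hint}"
        ) from err
```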

Test plan

  • 47 unit tests in tests/packaged_modules/test_tsfile.py
  • CI passes on the new test suite
  • Documentation renders correctly in the docs build

Young-Leo and others added 4 commits April 28, 2026 16:57
Add a packaged builder for TsFile (table model), the columnar
time-series format used as the storage layer of Apache IoTDB.

Each output row corresponds to one device (identified by its TAG
columns); the `time` column and every FIELD column are Arrow
`list<...>` columns holding that device's full time series, sorted
in ascending time order. When a device appears in multiple files
within a split, its per-file chunks are merged and sorted; duplicate
timestamps for the same device raise `ValueError`.
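The merge-and-sort step with duplicate rejection can be sketched as follows (an illustrative sketch, not the PR's implementation):

```python
# Illustrative: merge one device's per-file chunks into a single
# ascending time series, raising ValueError on duplicate timestamps.
def merge_device_chunks(chunks):
    """chunks: list of (timestamps, values) pairs, one pair per file."""
    merged = sorted(
        ((t, v) for ts, vs in chunks for t, v in zip(ts, vs)),
        key=lambda pair: pair[0],
    )
    times = [t for t, _ in merged]
    if len(times) != len(set(times)):
        raise ValueError("duplicate timestamps for the same device")
    return times, [v for _, v in merged]
```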

Reading model
- Data is fetched per device via `TsFileReader.query_table` with a
  push-down `tag_filter`; peak memory is bounded by a single
  device's payload, not by the split size.
- `start_time` / `end_time` are pushed down to TsFile's internal
  time index. They accept `int` epochs, `datetime`, `date`,
  ISO-8601 strings, and `pyarrow.TimestampScalar`; tz-aware
  datetimes are normalized to UTC.
- Schema evolution across files: FIELD columns are unioned and
  missing values are filled with nulls; numeric FIELD types are
  promoted following IoTDB's widening rules
  (INT32 -> INT64 -> DOUBLE, INT32 -> FLOAT -> DOUBLE).
- `on_bad_files` controls handling of unreadable inputs
  ("error" | "warn" | "skip").
- `input_batch_size` bounds the per-device Arrow batch size pulled
  from the underlying tsfile reader; `output_batch_size` controls
  the number of devices packed into each emitted record batch.

Config knobs: `table_name`, `columns`, `start_time`, `end_time`,
`input_batch_size`, `output_batch_size`, `features`, `on_bad_files`,
`timestamp_unit`, `timestamp_tz`.

Tests
- 47 tests under `tests/packaged_modules/test_tsfile.py` covering:
  basic load, table/column selection, time-range pushdown (all
  accepted input types), schema evolution and numeric promotion,
  duplicate-timestamp rejection, multi-file x multi-device
  crossover, large device with small `input_batch_size`, timezone
  handling, streaming mode, `on_bad_files` modes, and the
  `_to_epoch` boundary helper.

Docs
- `docs/source/tabular_load.mdx`: dedicated TsFile section with
  data model, output schema, time-range bounds, schema evolution,
  bad-file handling, timestamp unit/tz, and batching/memory.
- `docs/source/loading.mdx`, `about_dataset_load.mdx`,
  `package_reference/loading_methods.mdx`: register and
  cross-reference the TsFile loader and `TsFileConfig` autodoc.

Other
- `setup.py`: add `tsfile>=2.2.1` to TESTS_REQUIRE.
- `src/datasets/packaged_modules/__init__.py`: register the
  `.tsfile` extension and module entry.
…p to 2.3.0

- Move the TsFile loader documentation out of tabular_load.mdx into a new top-level page docs/source/tsfile_load.mdx, and add a dedicated 'Time-series' section to the sidebar (_toctree.yml). The per-device wide layout (one row per device, list-typed time/FIELD columns) is not a generic tabular convention and warrants its own guide.

- tabular_load.mdx now points readers to the new guide via a short cross-reference instead of inlining the section.

- loading.mdx: update the 'more details' link to tsfile_load.

- setup.py: bump TESTS_REQUIRE entry from tsfile>=2.2.1 to tsfile>=2.3.0.
- Add `_schemas_by_lc` helper and route the three call sites through it so auto-detected and user-supplied table names compare in a single canonical (lowercase) form.
- Drop the now-misleading `_generate_shards` comment; the body matches the convention used by arrow.py / pandas.py / hdf5.py.
- Remove the TsFile cross-link from `tabular_load.mdx` so that page stays focused on tabular formats; time-series users land via the dedicated Time-series section in the sidebar.
- Cover tz-aware ISO-8601 strings in `_to_epoch` via a parametrized test (also drops the `__import__('datetime')` workaround now that `timedelta` is imported directly).
- gitignore local dev artifacts produced while iterating on the builder.
@JackieTien97
Author

Hi team 👋

A bit of background on me: I'm a PMC member of both Apache TsFile and Apache IoTDB, and one of the core contributors to the TsFile format specification and its Python SDK (tsfile on PyPI). I've been working on TsFile's design and implementation for several years, so I'm deeply familiar with its data model, storage internals, and the read path that this builder relies on.

The motivation for this PR is to make time-series data stored in TsFile directly accessible to the Hugging Face ecosystem. TsFile is a columnar format purpose-built for time-series workloads — it powers Apache IoTDB's storage layer and is increasingly used as a standalone interchange format for IoT and industrial data. With the growing interest in applying ML to time-series domains (forecasting, anomaly detection, foundation models, etc.), we believe a native load_dataset("tsfile", ...) integration would lower the barrier for researchers and practitioners who already have data in this format.

I'm happy to iterate on the implementation based on your feedback, and I'll be actively maintaining this builder going forward as the TsFile format evolves. Feel free to ping me on any questions about the format or the read semantics.

@JackieTien97
Author

@lhoestq Would you mind taking a look at this PR when you get a chance? I'd really appreciate your review. Thanks!

Previously, passing the time column name (e.g. columns=["time"]) added a
duplicate all-null list<float64> field that overwrote the real timestamp
list in the output schema. Now TIME is treated like TAG: silently skipped
from the requested field set so it is emitted exactly once as the real
timestamp list. Docs and tests updated.
@JackieTien97
Author

Hi @lhoestq, just a friendly ping on this PR again. I'd really appreciate it if you could take a look when you get a chance. Happy to address any feedback or make adjustments if needed. Thanks!


Development

Successfully merging this pull request may close these issues.

Support Apache TsFile Datasets
