feat: add TsFile (Apache IoTDB) packaged builder with per-device wide format #8160
JackieTien97 wants to merge 5 commits into huggingface:main
Conversation
Add a packaged builder for TsFile (table model), the columnar
time-series format used as the storage layer of Apache IoTDB.
Each output row corresponds to one device (identified by its TAG
columns); the `time` column and every FIELD column are Arrow
`list<...>` columns holding that device's full time series, sorted
in ascending time order. When a device appears in multiple files
within a split, its per-file chunks are merged and sorted; duplicate
timestamps for the same device raise `ValueError`.
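The merge step described above can be sketched in plain Python. This is a hypothetical `merge_device_chunks` helper, not the builder's actual code: per-file chunks for one device are concatenated, sorted by ascending timestamp, and a duplicate timestamp raises `ValueError`.

```python
def merge_device_chunks(chunks):
    """Merge per-file (timestamp, value) chunks for one device.

    chunks: list of lists of (timestamp, value) pairs, one list per file.
    Returns the merged series sorted by ascending timestamp; raises
    ValueError if the same timestamp appears more than once.
    """
    merged = sorted(pair for chunk in chunks for pair in chunk)
    seen = set()
    for ts, _ in merged:
        if ts in seen:
            raise ValueError(f"duplicate timestamp {ts} for device")
        seen.add(ts)
    return merged

# Two files contribute interleaved samples for the same device:
series = merge_device_chunks([[(3, 0.3), (1, 0.1)], [(2, 0.2)]])
# series == [(1, 0.1), (2, 0.2), (3, 0.3)]
```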
Reading model
- Data is fetched per device via `TsFileReader.query_table` with a
push-down `tag_filter`; peak memory is bounded by a single
device's payload, not by the split size.
- `start_time` / `end_time` are pushed down to TsFile's internal
time index. They accept `int` epochs, `datetime`, `date`,
ISO-8601 strings, and `pyarrow.TimestampScalar`; tz-aware
datetimes are normalized to UTC.
- Schema evolution across files: FIELD columns are unioned and
missing values are filled with nulls; numeric FIELD types are
promoted following IoTDB's widening rules
(INT32 -> INT64 -> DOUBLE, INT32 -> FLOAT -> DOUBLE).
- `on_bad_files` controls handling of unreadable inputs
("error" | "warn" | "skip").
- `input_batch_size` bounds the per-device Arrow batch size pulled
from the underlying tsfile reader; `output_batch_size` controls
the number of devices packed into each emitted record batch.
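The accepted `start_time` / `end_time` input types can be illustrated with a simplified normalizer. `to_epoch_ms` below is a hypothetical stand-in, not the builder's `_to_epoch`: it skips `pyarrow.TimestampScalar` and hard-codes milliseconds, and it assumes naive datetimes are treated as UTC.

```python
from datetime import date, datetime, timezone

def to_epoch_ms(value):
    """Normalize an int epoch, date, datetime, or ISO-8601 string
    to epoch milliseconds; tz-aware inputs are normalized to UTC."""
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        value = datetime.fromisoformat(value)
    if isinstance(value, datetime):  # check before date: datetime is a date
        dt = value
    elif isinstance(value, date):
        dt = datetime(value.year, value.month, value.day)
    else:
        raise TypeError(f"unsupported time bound: {value!r}")
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive == UTC
    return int(dt.timestamp() * 1000)
```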
Config knobs: `table_name`, `columns`, `start_time`, `end_time`,
`input_batch_size`, `output_batch_size`, `features`, `on_bad_files`,
`timestamp_unit`, `timestamp_tz`.
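The widening rules above can be viewed as a least-upper-bound on a tiny type lattice. A minimal sketch with a hypothetical `promote` helper (not the builder's code), assuming that the join of the incomparable pair `INT64` / `FLOAT` falls through to `DOUBLE`:

```python
# Rank encodes the widening order from the two chains
# INT32 -> INT64 -> DOUBLE and INT32 -> FLOAT -> DOUBLE.
_RANK = {"INT32": 0, "INT64": 1, "FLOAT": 1, "DOUBLE": 2}

def promote(a, b):
    """Smallest type that can hold values of both a and b."""
    if a == b:
        return a
    if _RANK[a] == _RANK[b]:  # INT64 vs FLOAT: incomparable, widen to DOUBLE
        return "DOUBLE"
    return a if _RANK[a] > _RANK[b] else b
```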
Tests
- 47 tests under `tests/packaged_modules/test_tsfile.py` covering:
basic load, table/column selection, time-range pushdown (all
accepted input types), schema evolution and numeric promotion,
duplicate-timestamp rejection, multi-file x multi-device
crossover, large device with small `input_batch_size`, timezone
handling, streaming mode, `on_bad_files` modes, and the
`_to_epoch` boundary helper.
Docs
- `docs/source/tabular_load.mdx`: dedicated TsFile section with
data model, output schema, time-range bounds, schema evolution,
bad-file handling, timestamp unit/tz, and batching/memory.
- `docs/source/loading.mdx`, `about_dataset_load.mdx`,
`package_reference/loading_methods.mdx`: register and
cross-reference the TsFile loader and `TsFileConfig` autodoc.
Other
- `setup.py`: add `tsfile>=2.2.1` to TESTS_REQUIRE.
- `src/datasets/packaged_modules/__init__.py`: register the
`.tsfile` extension and module entry.
…p to 2.3.0

- Move the TsFile loader documentation out of `tabular_load.mdx` into a new top-level page `docs/source/tsfile_load.mdx`, and add a dedicated "Time-series" section to the sidebar (`_toctree.yml`). The per-device wide layout (one row per device, list-typed time/FIELD columns) is not a generic tabular convention and warrants its own guide.
- `tabular_load.mdx` now points readers to the new guide via a short cross-reference instead of inlining the section.
- `loading.mdx`: update the 'more details' link to `tsfile_load`.
- `setup.py`: bump the TESTS_REQUIRE entry from `tsfile>=2.2.1` to `tsfile>=2.3.0`.
- Add `_schemas_by_lc` helper and route the three call sites through it so auto-detected and user-supplied table names compare in a single canonical (lowercase) form.
- Drop the now-misleading `_generate_shards` comment; the body matches the convention used by arrow.py / pandas.py / hdf5.py.
- Remove the TsFile cross-link from `tabular_load.mdx` so that page stays focused on tabular formats; time-series users land via the dedicated Time-series section in the sidebar.
- Cover tz-aware ISO-8601 strings in `_to_epoch` via a parametrized test (also drops the `__import__('datetime')` workaround now that `timedelta` is imported directly).
- gitignore local dev artifacts produced while iterating on the builder.
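The `_schemas_by_lc` idea above can be sketched in a few lines (hypothetical names, not the builder's code): schemas are keyed by lowercased table name so auto-detected and user-supplied names compare in one canonical form.

```python
def schemas_by_lowercase(schemas):
    """Key table schemas by lowercased table name so lookups are
    case-insensitive regardless of how the name was supplied."""
    return {name.lower(): schema for name, schema in schemas.items()}

def find_table(schemas, requested):
    """Look up a table by name, ignoring case."""
    return schemas_by_lowercase(schemas).get(requested.lower())
```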
Hi team 👋 A bit of background on me: I'm a PMC member of both Apache TsFile and Apache IoTDB, and one of the core contributors to the TsFile format specification and its Python SDK.

The motivation for this PR is to make time-series data stored in TsFile directly accessible to the HuggingFace ecosystem. TsFile is a columnar format purpose-built for time-series workloads — it powers Apache IoTDB's storage layer and is increasingly used as a standalone interchange format for IoT and industrial data. With the growing interest in applying ML to time-series domains (forecasting, anomaly detection, foundation models, etc.), we believe a native TsFile loader would be a valuable addition.

I'm happy to iterate on the implementation based on your feedback, and I'll be actively maintaining this builder going forward as the TsFile format evolves. Feel free to ping me on any questions about the format or the read semantics.
@lhoestq Would you mind taking a look at this PR when you get a chance? I'd really appreciate your review. Thanks!
Previously, passing the time column name (e.g. columns=["time"]) added a duplicate all-null list<float64> field that overwrote the real timestamp list in the output schema. Now TIME is treated like TAG: silently skipped from the requested field set so it is emitted exactly once as the real timestamp list. Docs and tests updated.
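The fix amounts to filtering TIME (like TAG) out of the requested FIELD set so the timestamp list is emitted exactly once. A minimal sketch with hypothetical names, not the builder's actual code:

```python
def requested_fields(columns, tag_columns, time_column="time"):
    """Return the FIELD columns to fetch: the user's selection minus
    TAG columns and the TIME column, which are always emitted once."""
    skip = set(tag_columns) | {time_column}
    return [c for c in columns if c not in skip]

# columns=["time"] no longer yields a duplicate field to fetch:
requested_fields(["time", "temperature"], tag_columns=["device_id"])
# -> ["temperature"]
```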
Hi @lhoestq, just a friendly ping on this PR again. I'd really appreciate it if you could take a look when you get a chance. Happy to address any feedback or make adjustments if needed. Thanks!
Closes #7922
Summary
Add a packaged builder for TsFile — the columnar time-series file format used as the native storage layer of Apache IoTDB. This enables `load_dataset("tsfile", data_files="...")` with automatic `.tsfile` extension detection.

Data model
Unlike tabular formats, TsFile is time-series-aware: each file contains tables with TAG columns (device identifiers), FIELD columns (measurements), and a TIME column. The builder outputs one dataset row per device, where:
- TAG columns are emitted as `string` values identifying the device.
- the `time` column and every FIELD column are Arrow `list<...>` columns holding the device's full sorted time series.

Key features
- Data is fetched per device via `TsFileReader.query_table` with a push-down `tag_filter`; peak memory is bounded by a single device's payload, not the entire split.
- `start_time` / `end_time` are pushed down to TsFile's internal time index. Accepts `int` epochs, `datetime`, `date`, ISO-8601 strings, and `pyarrow.TimestampScalar`.
- Schema evolution across files: FIELD columns are unioned, missing values are filled with nulls, and numeric FIELD types are promoted following IoTDB's widening rules (`INT32 → INT64 → DOUBLE`, `INT32 → FLOAT → DOUBLE`).
- `on_bad_files` — `"error"` (default) / `"warn"` / `"skip"` to control handling of unreadable inputs.
- `input_batch_size` (rows per Arrow batch from the reader) and `output_batch_size` (devices per emitted record batch) for memory control.
- `timestamp_unit` (`"s"`/`"ms"`/`"us"`/`"ns"`, default `"ms"`) and `timestamp_tz` (optional timezone string).

Config knobs (`TsFileConfig`)

- `table_name`
- `columns`
- `start_time` / `end_time`
- `input_batch_size`
- `output_batch_size`
- `features`: `Features` schema (skips metadata scan)
- `on_bad_files`: default `"error"`
- `timestamp_unit`: default `"ms"`
- `timestamp_tz`: default `None`

Usage
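A usage sketch, assuming the config knobs listed above are passed as `load_dataset` keyword arguments; the file path, table name, and knob values are placeholders, not part of this PR:

```python
from datasets import load_dataset

# One dataset row per device; `time` and FIELD columns are list-typed.
ds = load_dataset(
    "tsfile",
    data_files="data/*.tsfile",            # placeholder path
    table_name="sensors",                  # illustrative table name
    columns=["temperature", "humidity"],   # FIELD selection
    start_time="2024-01-01T00:00:00",      # ISO-8601 string, pushed down
    end_time="2024-06-30T23:59:59",
    on_bad_files="warn",                   # skip unreadable inputs with a warning
    timestamp_unit="ms",
    split="train",
)
```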
Files changed
New files:
- `src/datasets/packaged_modules/tsfile/__init__.py` + `tsfile.py` — the builder (770 lines)
- `tests/packaged_modules/test_tsfile.py` — 47 tests (713 lines) covering: basic load, table/column selection, time-range pushdown (all accepted input types), schema evolution and numeric promotion, duplicate-timestamp rejection, multi-file × multi-device crossover, large device with small `input_batch_size`, timezone handling, streaming mode, `on_bad_files` modes, and the `_to_epoch` boundary helper.
- `docs/source/tsfile_load.mdx` — standalone Time-series loading guide

Modified files:
- `src/datasets/packaged_modules/__init__.py` — register `.tsfile` extension and module entry
- `docs/source/_toctree.yml` — add "Time-series" section to sidebar
- `docs/source/loading.mdx` — add TsFile to the supported formats list and link to the guide
- `docs/source/about_dataset_load.mdx` — add TsFile cross-reference
- `docs/source/package_reference/loading_methods.mdx` — add `TsFileConfig` autodoc entry
- `setup.py` — add `tsfile>=2.3.0` to `TESTS_REQUIRE`

Dependencies
This builder requires the `tsfile` Python package (`>=2.3.0`), which is added as a test dependency only. The package is lazily imported at runtime — users who don't work with TsFile data pay no import cost.

Test plan
- `tests/packaged_modules/test_tsfile.py`