Fix loading manifests from stdin (closes #810) by deekshaNVIDIA · Pull Request #1572 · lhotse-speech/lhotse

deekshaNVIDIA · 2026-06-13T22:44:29Z

Summary

Closes #810.

Piping a manifest into a lhotse CLI command that takes - for stdin (the original report used lhotse split) failed with orjson.JSONDecodeError: unexpected character: line 1 column 1 (char 0):

gunzip -c data/librispeech/raw_cuts_dev-clean.jsonl.gz \
  | lhotse split -s 16 - data/librispeech/tmp

Root cause

load_manifest_lazy_or_eager("-") routes to load_manifest_lazy, which:

1. Reads one line of stdin to detect the manifest class.
2. Returns a LazyManifestIterator that opens the input again on every iteration.

But stdin is a one-shot stream: it cannot be re-opened or rewound with seek. When the CLI later materialises the iterator (e.g. in split_sequence), the second pass reads either truncated or empty data and raises JSONDecodeError.
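The failure mode can be reproduced in miniature without lhotse. The sketch below is an illustration only: one_shot_lines is a hypothetical stand-in for sys.stdin (a generator, like a pipe, can be consumed only once), and it demonstrates the data loss rather than the exact orjson error.

```python
import json


def one_shot_lines():
    """Hypothetical stand-in for sys.stdin: a generator is exhausted after one pass."""
    yield '{"id": "rec-0"}\n'
    yield '{"id": "rec-1"}\n'
    yield '{"id": "rec-2"}\n'


stream = one_shot_lines()

# Step 1: peek at one line to detect the manifest type. This consumes it.
first = json.loads(next(stream))

# Step 2: a "lazy" pass over the same stream no longer sees the peeked line.
first_pass = [json.loads(line) for line in stream]  # rec-1 and rec-2 only

# Step 3: any further pass sees nothing at all.
second_pass = [json.loads(line) for line in stream]  # empty list
```

With a regular file the lazy iterator can simply re-open the path, which is why the problem only shows up for piped input.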

Fix

lhotse/serialization.py

When path == "-", eagerly slurp stdin into a list of dicts and dispatch to the same from_dicts candidate-class logic as load_manifest. This was extracted into a small _load_manifest_from_stdin helper to keep load_manifest_lazy strict (lazy means lazy).
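A minimal sketch of the eager approach, assuming JSONL input. It is not the code from this PR: slurp_jsonl, dispatch, load_from_stream, and the two toy builders are hypothetical names that only mimic the "try each candidate class" idea with plain tuples instead of lhotse manifest classes.

```python
import io
import json
from typing import Any, Callable, Dict, List, Optional, Sequence, TextIO

Builder = Callable[[List[Dict[str, Any]]], Any]


def slurp_jsonl(stream: TextIO) -> List[Dict[str, Any]]:
    """Read the whole stream in one pass, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]


def dispatch(dicts: List[Dict[str, Any]], candidates: Sequence[Builder]) -> Optional[Any]:
    """Return the result of the first builder that accepts the data (None if empty)."""
    if not dicts:
        return None
    for build in candidates:
        try:
            return build(dicts)
        except (KeyError, TypeError):
            continue  # wrong manifest type, try the next candidate
    raise ValueError("No candidate type could parse the input.")


# Toy builders (assumptions for illustration, not lhotse classes).
def build_recordings(dicts):
    return [("recording", d["id"], d["sampling_rate"]) for d in dicts]


def build_supervisions(dicts):
    return [("supervision", d["id"], d["text"]) for d in dicts]


CANDIDATES = [build_recordings, build_supervisions]


def load_from_stream(stream: TextIO, candidates: Sequence[Builder] = CANDIDATES):
    return dispatch(slurp_jsonl(stream), candidates)


recordings = load_from_stream(io.StringIO('{"id": "r0", "sampling_rate": 16000}\n'))
supervisions = load_from_stream(io.StringIO('{"id": "s0", "text": "hello"}\n'))
```

Because the result is a fully materialised list, it can be iterated any number of times, at the cost of holding the whole manifest in memory.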

lhotse/bin/modes/manipulation.py

lhotse split and lhotse split-lazy derived output filenames from the input path's stem/suffix. With manifest = Path("-") that produced out/-.0 (no extension) and tripped store_manifest's "Unknown serialization format" check. They now fall back to manifest.<idx>.jsonl.gz when the input is -, so the user's command works end-to-end.
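The naming fallback can be sketched as follows. split_output_path is a hypothetical helper written for illustration; the actual expression used in manipulation.py may differ, and only the manifest.<idx>.jsonl.gz fallback is taken from the description above.

```python
from pathlib import Path


def split_output_path(manifest: Path, output_dir: Path, idx: int) -> Path:
    """Build the output path for split number `idx` (hypothetical helper)."""
    if str(manifest) == "-":
        # stdin has no stem or suffix to reuse, so pick a fixed, loadable name.
        return output_dir / f"manifest.{idx}.jsonl.gz"
    suffix = "".join(manifest.suffixes)  # e.g. ".jsonl.gz"
    stem = manifest.name[: len(manifest.name) - len(suffix)] if suffix else manifest.name
    return output_dir / f"{stem}.{idx}{suffix}"


from_stdin = split_output_path(Path("-"), Path("out"), 0)
from_file = split_output_path(Path("data/cuts.jsonl.gz"), Path("out"), 3)
# from_stdin.name is "manifest.0.jsonl.gz"; from_file.name is "cuts.3.jsonl.gz"
```

Without the first branch, Path("-") has an empty suffix, so the derived name would end in ".0" and carry no recognisable serialization extension.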

Verification

test/test_load_manifest_stdin.py (new) covers the stdin path with monkeypatched sys.stdin:

- full eager load returns the right items in order
- split(num_splits=4) succeeds and preserves all items (the original #810 regression)
- the manifest can be iterated multiple times
- type auto-detection works (RecordingSet vs SupervisionSet)
- empty input returns None
- explicit manifest_cls= is honoured
- garbage input raises an error
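The general shape of such a test can be sketched without lhotse. This is an assumption-laden illustration, not the contents of the new test file: fake_stdin and load_from_stdin are hypothetical helpers, and the loader is a plain JSONL reader standing in for the function under test.

```python
import contextlib
import io
import json
import sys


@contextlib.contextmanager
def fake_stdin(text: str):
    """Temporarily replace sys.stdin with an in-memory stream."""
    original = sys.stdin
    sys.stdin = io.StringIO(text)
    try:
        yield
    finally:
        sys.stdin = original


def load_from_stdin():
    """Hypothetical stand-in for the loader under test: eager JSONL read."""
    return [json.loads(line) for line in sys.stdin if line.strip()]


def check_full_load_and_double_iteration():
    payload = "".join(json.dumps({"id": f"rec-{i}"}) + "\n" for i in range(8))
    with fake_stdin(payload):
        items = load_from_stdin()
    expected = [f"rec-{i}" for i in range(8)]
    # Items come back complete and in order, on every iteration.
    assert [d["id"] for d in items] == expected
    assert [d["id"] for d in items] == expected
    return items


items = check_full_load_and_double_iteration()
```

In a pytest suite the same swap is usually done with the built-in fixture, e.g. monkeypatch.setattr(sys, "stdin", io.StringIO(payload)), which restores the original stream automatically.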

I also ran an end-to-end CLI test that mirrors the issue's gunzip | lhotse split - invocation: 8 dummy recordings → manifest.{0..3}.jsonl.gz with all recordings preserved.

black, isort --profile black, and flake8 --select=E9,F63,F7,F82 are clean on the changed files. The local test suite has unrelated Windows-only NamedTemporaryFile PermissionError failures (present on master too) — CI on Linux is unaffected.

…ch#242)

Lhotse treats manifest objects as immutable and widely uses the `fastcopy(obj, field=...)` idiom to create modified copies. This adds a `.copy_with(**kwargs)` member method so the same can be done without importing fastcopy and in a way that reads naturally and composes with comprehensions, e.g. `supervision.copy_with(text=...)`.

The method is added to CustomFieldMixin (covering SupervisionSegment, DataCut/MonoCut/MultiCut, and text examples), to the base Cut class (covering PaddingCut and MixedCut; it delegates to the existing Cut.copy), and to Recording, Features, Array, and TemporalArray.

Tests in test/test_copy_with.py verify, for every manifest type, that copy_with overwrites the requested field, leaves the original object unmutated, and matches fastcopy semantics.
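The copy-with-overrides idiom can be illustrated on a toy dataclass. This is only a sketch: CopyWithMixin and ToySupervision are hypothetical, and dataclasses.replace is used here as a stand-in, so it should not be read as how lhotse implements copy_with or fastcopy.

```python
from dataclasses import dataclass, replace


class CopyWithMixin:
    """Hypothetical mixin: return a modified copy instead of mutating in place."""

    def copy_with(self, **kwargs):
        return replace(self, **kwargs)


@dataclass
class ToySupervision(CopyWithMixin):
    id: str
    text: str


sup = ToySupervision(id="s1", text="hello")
upper = sup.copy_with(text=sup.text.upper())

# The idiom composes with comprehensions over a collection of items.
batch = [s.copy_with(text=s.text.strip()) for s in [ToySupervision("s2", "  hi  ")]]
```

The original object is left untouched, which is the property the tests described above check for each manifest type.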

Piping a manifest into `lhotse split - <out>` (or any other command using load_manifest_lazy_or_eager) failed with a JSONDecodeError.

Root cause: load_manifest_lazy_or_eager routes "-" to load_manifest_lazy, which consumes one line of stdin to detect the manifest class and then builds a LazyManifestIterator that re-opens the input on every iteration. stdin is a one-shot stream that cannot be re-opened, so subsequent reads see either truncated or empty data.

Fix:
- load_manifest_lazy_or_eager now eagerly slurps stdin into a list and dispatches to from_dicts (mirroring load_manifest's logic), so the manifest is fully materialized in a single pass.
- The `lhotse split` and `lhotse split-lazy` CLI commands derived output filenames from the input path's stem/suffix, which produced bogus names like `out/-.0` (no extension) when reading from stdin. They now fall back to a `manifest.<idx>.jsonl.gz` naming scheme when the input is "-".

Tests:
- test/test_load_manifest_stdin.py exercises the stdin path with monkeypatched sys.stdin, covering full-load, split(), double-iteration, type detection, empty input, explicit manifest_cls, and garbage input.

deekshaNVIDIA added 2 commits June 13, 2026 15:11

deekshaNVIDIA wants to merge 2 commits into lhotse-speech:master from deekshaNVIDIA:fix-load-manifest-from-stdin
