Fix loading manifests from stdin (closes #810) #1572

Open
deekshaNVIDIA wants to merge 2 commits into lhotse-speech:master from deekshaNVIDIA:fix-load-manifest-from-stdin
@deekshaNVIDIA

Summary

Closes #810.

Piping a manifest into a lhotse CLI command that accepts "-" for stdin (the original report used lhotse split) failed with orjson.JSONDecodeError: unexpected character: line 1 column 1 (char 0):

gunzip -c data/librispeech/raw_cuts_dev-clean.jsonl.gz \
  | lhotse split -s 16 - data/librispeech/tmp

Root cause

load_manifest_lazy_or_eager("-") routes to load_manifest_lazy, which:

  1. Reads one line of stdin to detect the manifest class.
  2. Returns a LazyManifestIterator that opens the input again on every iteration.

But stdin is a one-shot stream — it cannot be re-opened or seeked. When the CLI later materialises the iterator (e.g. in split_sequence), the second pass reads either truncated or empty data and raises JSONDecodeError.
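
The two steps above can be illustrated with a minimal sketch. It does not use lhotse; a plain Python iterator stands in for piped stdin (an assumption made for illustration), and the function name detect_then_iterate is hypothetical. It only shows that data consumed during detection is gone for any later pass; the exact failure in lhotse depends on how the input is re-opened.

```python
import json


def detect_then_iterate(stream):
    """Model a 'peek one line, then iterate again' reader on a one-shot stream."""
    # Step 1: consume one line to inspect the first item.
    first = json.loads(next(stream))
    # Step 2: a later pass cannot rewind, so it only sees what is left.
    second_pass = [json.loads(line) for line in stream]
    # Any further pass finds the stream exhausted.
    third_pass = [json.loads(line) for line in stream]
    return first, second_pass, third_pass


# Three JSONL lines, exposed as a one-shot iterator (stand-in for piped stdin).
lines = iter([json.dumps({"id": f"rec-{i}"}) for i in range(3)])
first, second_pass, third_pass = detect_then_iterate(lines)
```

Here second_pass is missing the first item and third_pass is empty, which corresponds to the "truncated or empty data" described above.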

Fix

lhotse/serialization.py

  • When path == "-", eagerly slurp stdin into a list of dicts and dispatch to the same from_dicts candidate-class logic as load_manifest. This was extracted into a small _load_manifest_from_stdin helper to keep load_manifest_lazy strict (lazy means lazy).
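
A rough sketch of the eager approach, under stated assumptions: this is not the PR's _load_manifest_from_stdin, and the names read_jsonl_once, build_from_candidates, as_recordings, and as_supervisions are hypothetical toy stand-ins for lhotse's candidate-class from_dicts dispatch. It reads the stream exactly once into a list of dicts, then tries each candidate builder in turn.

```python
import io
import json
from typing import Any, Callable, Dict, List, Optional, Sequence, TextIO


def read_jsonl_once(stream: TextIO) -> List[Dict[str, Any]]:
    """Read every non-blank JSONL line from the stream in a single pass."""
    return [json.loads(line) for line in stream if line.strip()]


def as_recordings(dicts):
    # Toy candidate: every item must carry a "sampling_rate" key.
    return ("recordings", [(d["id"], d["sampling_rate"]) for d in dicts])


def as_supervisions(dicts):
    # Toy candidate: every item must carry a "text" key.
    return ("supervisions", [(d["id"], d["text"]) for d in dicts])


def build_from_candidates(
    dicts: List[Dict[str, Any]], candidates: Sequence[Callable]
) -> Optional[Any]:
    """Return the first candidate that accepts the data, or None if empty."""
    if not dicts:
        return None
    for build in candidates:
        try:
            return build(dicts)
        except (KeyError, TypeError):
            continue  # this candidate does not match; try the next one
    raise ValueError("no candidate manifest type matched the input")


stream = io.StringIO('{"id": "s1", "text": "hi"}\n')
manifest = build_from_candidates(
    read_jsonl_once(stream), [as_recordings, as_supervisions]
)
```

Because the result is built from an in-memory list, it can be iterated any number of times, at the cost of holding the whole manifest in memory.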

lhotse/bin/modes/manipulation.py

  • lhotse split and lhotse split-lazy derived output filenames from the input path's stem/suffix. With manifest = Path("-") that produced out/-.0 (no extension) and tripped store_manifest's "Unknown serialization format" check. They now fall back to manifest.<idx>.jsonl.gz when the input is "-", so the user's command works end-to-end.
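
The naming fallback can be sketched as follows. This is an illustration only: split_output_path is a hypothetical function, and the way the non-stdin name is assembled here is an assumption, not necessarily how lhotse builds it. Only the manifest.<idx>.jsonl.gz fallback is taken from the description above.

```python
from pathlib import Path


def split_output_path(manifest: Path, output_dir: Path, idx: int) -> Path:
    """Pick an output path for split number idx (hypothetical helper)."""
    if str(manifest) == "-":
        # stdin has no meaningful stem or suffix, so use a fixed scheme.
        return output_dir / f"manifest.{idx}.jsonl.gz"
    # Regular file: keep the full extension chain, e.g. ".jsonl.gz".
    suffix = "".join(manifest.suffixes)
    stem = manifest.name[: len(manifest.name) - len(suffix)]
    return output_dir / f"{stem}.{idx}{suffix}"


stdin_path = split_output_path(Path("-"), Path("out"), 0)
file_path = split_output_path(Path("data/cuts.jsonl.gz"), Path("out"), 2)
```

Path("-") has an empty suffix list, which is why a stem/suffix based scheme ends up with a name that carries no recognisable extension.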

Verification

test/test_load_manifest_stdin.py (new) — covers the stdin path with monkeypatched sys.stdin:

  • full eager load returns the right items in order
  • split(num_splits=4) succeeds and preserves all items (the original #810 regression)
  • the manifest can be iterated multiple times
  • type auto-detection works (RecordingSet vs SupervisionSet)
  • empty input returns None
  • explicit manifest_cls= is honoured
  • garbage input raises an error
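
For readers unfamiliar with the testing technique, here is a minimal sketch of replacing sys.stdin in a test. It is not taken from test/test_load_manifest_stdin.py: it swaps in unittest.mock.patch from the standard library rather than a pytest fixture, and slurp_stdin is a hypothetical helper, not a lhotse function.

```python
import io
import json
import sys
from unittest.mock import patch


def slurp_stdin():
    """Hypothetical helper: read JSONL from sys.stdin into a list, once."""
    return [json.loads(line) for line in sys.stdin if line.strip()]


def check_double_iteration():
    payload = '{"id": "a"}\n{"id": "b"}\n'
    # Temporarily replace sys.stdin with an in-memory text stream.
    with patch("sys.stdin", io.StringIO(payload)):
        items = slurp_stdin()
    # The loaded list must survive being iterated more than once.
    first = [d["id"] for d in items]
    second = [d["id"] for d in items]
    return first, second


first_ids, second_ids = check_double_iteration()
```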

I also ran an end-to-end CLI test that mirrors the issue's gunzip | lhotse split - invocation: 8 dummy recordings → manifest.{0..3}.jsonl.gz with all recordings preserved.

black, isort --profile black, and flake8 --select=E9,F63,F7,F82 are clean on the changed files. The local test suite has unrelated Windows-only NamedTemporaryFile PermissionError failures (present on master too) — CI on Linux is unaffected.

…ch#242)

Lhotse treats manifest objects as immutable and widely uses the
`fastcopy(obj, field=...)` idiom to create modified copies. This adds a
`.copy_with(**kwargs)` member method so the same can be done without
importing fastcopy and in a way that reads naturally and composes with
comprehensions, e.g. `supervision.copy_with(text=...)`.
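
A small sketch of the copy-with idiom on an immutable object. It does not use lhotse: ToySupervision is a hypothetical dataclass, and dataclasses.replace from the standard library stands in for fastcopy here as an assumption for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ToySupervision:
    """Toy immutable manifest item (not lhotse's SupervisionSegment)."""

    id: str
    text: Optional[str] = None

    def copy_with(self, **kwargs) -> "ToySupervision":
        # Build a new object with the given fields overridden;
        # the original instance is left untouched.
        return replace(self, **kwargs)


original = ToySupervision(id="s1", text="hello")
# Composes naturally with comprehensions:
upper = [s.copy_with(text=s.text.upper()) for s in [original]]
```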

The method is added to CustomFieldMixin (covering SupervisionSegment,
DataCut/MonoCut/MultiCut, and text examples), to the base Cut class
(covering PaddingCut and MixedCut; it delegates to the existing
Cut.copy), and to Recording, Features, Array, and TemporalArray.

Tests in test/test_copy_with.py verify, for every manifest type, that
copy_with overwrites the requested field, leaves the original object
unmutated, and matches fastcopy semantics.

Piping a manifest into `lhotse split - <out>` (or any other command using
load_manifest_lazy_or_eager) failed with a JSONDecodeError. Root cause:
load_manifest_lazy_or_eager routes "-" to load_manifest_lazy, which
consumes one line of stdin to detect the manifest class and then builds
a LazyManifestIterator that re-opens the input on every iteration. stdin
is a one-shot stream that cannot be re-opened, so subsequent reads see
either truncated or empty data.

Fix:
- load_manifest_lazy_or_eager now eagerly slurps stdin into a list and
  dispatches to from_dicts (mirroring load_manifest's logic), so the
  manifest is fully materialized in a single pass.
- The `lhotse split` and `lhotse split-lazy` CLI commands derived output
  filenames from the input path's stem/suffix, which produced bogus names
  like `out/-.0` (no extension) when reading from stdin. They now fall
  back to a `manifest.<idx>.jsonl.gz` naming scheme when the input is "-".

Tests:
- test/test_load_manifest_stdin.py exercises the stdin path with
  monkeypatched sys.stdin, covering full-load, split(), double-iteration,
  type detection, empty input, explicit manifest_cls, and garbage input.
Development

Successfully merging this pull request may close these issues.

Error in using pipe with split
