Fix loading manifests from stdin (closes #810) #1572
Open
deekshaNVIDIA wants to merge 2 commits into
…ch#242) Lhotse treats manifest objects as immutable and widely uses the `fastcopy(obj, field=...)` idiom to create modified copies. This adds a `.copy_with(**kwargs)` member method so the same can be done without importing fastcopy and in a way that reads naturally and composes with comprehensions, e.g. `supervision.copy_with(text=...)`. The method is added to CustomFieldMixin (covering SupervisionSegment, DataCut/MonoCut/MultiCut, and text examples), to the base Cut class (covering PaddingCut and MixedCut; it delegates to the existing Cut.copy), and to Recording, Features, Array, and TemporalArray. Tests in test/test_copy_with.py verify, for every manifest type, that copy_with overwrites the requested field, leaves the original object unmutated, and matches fastcopy semantics.
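The `copy_with` idiom described in this commit message can be illustrated with a minimal sketch. It does not use Lhotse: `ToySupervision` is a hypothetical stand-in for a manifest type, and `dataclasses.replace` is used here in place of `fastcopy`, on the assumption that both return a modified copy without mutating the original.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class ToySupervision:
    """Hypothetical stand-in for a manifest type; not a Lhotse class."""

    id: str
    start: float
    duration: float
    text: Optional[str] = None

    def copy_with(self, **kwargs) -> "ToySupervision":
        # Return a modified shallow copy; the original is left untouched.
        # dataclasses.replace stands in for Lhotse's fastcopy here.
        return replace(self, **kwargs)


original = ToySupervision(id="sup-1", start=0.0, duration=2.5, text="hello")
updated = original.copy_with(text="HELLO")

# The method form composes with comprehensions.
upper = [s.copy_with(text=s.text.upper()) for s in [original, updated]]
```

After this runs, `original.text` is still `"hello"`, while `updated` and every element of `upper` carry the upper-cased text.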
Piping a manifest into `lhotse split - <out>` (or any other command using load_manifest_lazy_or_eager) failed with a JSONDecodeError.

Root cause: load_manifest_lazy_or_eager routes "-" to load_manifest_lazy, which consumes one line of stdin to detect the manifest class and then builds a LazyManifestIterator that re-opens the input on every iteration. stdin is a one-shot stream that cannot be re-opened, so subsequent reads see either truncated or empty data.

Fix:
- load_manifest_lazy_or_eager now eagerly slurps stdin into a list and dispatches to from_dicts (mirroring load_manifest's logic), so the manifest is fully materialized in a single pass.
- The `lhotse split` and `lhotse split-lazy` CLI commands derived output filenames from the input path's stem/suffix, which produced bogus names like `out/-.0` (no extension) when reading from stdin. They now fall back to a `manifest.<idx>.jsonl.gz` naming scheme when the input is "-".

Tests:
- test/test_load_manifest_stdin.py exercises the stdin path with monkeypatched sys.stdin, covering full-load, split(), double-iteration, type detection, empty input, explicit manifest_cls, and garbage input.
Summary
Closes #810.
Piping a manifest into a `lhotse` CLI command that takes `-` for stdin (the original report used `lhotse split`) failed with `orjson.JSONDecodeError: unexpected character: line 1 column 1 (char 0)`:

```bash
gunzip -c data/librispeech/raw_cuts_dev-clean.jsonl.gz \
  | lhotse split -s 16 - data/librispeech/tmp
```

Root cause
`load_manifest_lazy_or_eager("-")` routes to `load_manifest_lazy`, which:

1. consumes one line of stdin to detect the manifest class, and
2. builds a `LazyManifestIterator` that opens the input again on every iteration.

But stdin is a one-shot stream: it cannot be re-opened or seeked. When the CLI later materialises the iterator (e.g. in `split_sequence`), the second pass reads either truncated or empty data and raises `JSONDecodeError`.

Fix
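To make the failure mode and the single-pass remedy concrete, here is a minimal sketch. It does not use Lhotse: `one_shot_lines` and `slurp_jsonl` are hypothetical helpers, a generator stands in for stdin, and the standard `json` module stands in for `orjson`.

```python
import json
from typing import Iterable, Iterator, List


def one_shot_lines(lines: Iterable[str]) -> Iterator[str]:
    """Simulate stdin: a stream that can be consumed exactly once."""
    yield from lines


def slurp_jsonl(stream: Iterable[str]) -> List[dict]:
    """Read every JSON line in a single pass and keep the result in memory."""
    return [json.loads(line) for line in stream if line.strip()]


raw = ['{"id": "rec-0"}\n', '{"id": "rec-1"}\n', '{"id": "rec-2"}\n']

# Lazy pattern: peek at one line to detect the type, then read "again".
stream = one_shot_lines(raw)
first = json.loads(next(stream))  # type detection consumes a line
second_pass = list(stream)        # only the remainder is left
third_pass = list(stream)         # the stream is exhausted

# Eager pattern: materialize once, then iterate as often as needed.
items = slurp_jsonl(one_shot_lines(raw))
ids_a = [d["id"] for d in items]
ids_b = [d["id"] for d in items]
```

In the lazy pattern the second pass is missing the first record and the third pass is empty, which corresponds to the truncated or empty reads described above. The eager list can be iterated repeatedly with identical results.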
- `lhotse/serialization.py`: when `path == "-"`, eagerly slurp stdin into a list of dicts and dispatch to the same `from_dicts` candidate-class logic as `load_manifest`. This was extracted into a small `_load_manifest_from_stdin` helper to keep `load_manifest_lazy` strict (lazy means lazy).
- `lhotse/bin/modes/manipulation.py`: `lhotse split` and `lhotse split-lazy` derived output filenames from the input path's `stem`/`suffix`. With `manifest = Path("-")` that produced `out/-.0` (no extension) and tripped `store_manifest`'s "Unknown serialization format" check. They now fall back to `manifest.<idx>.jsonl.gz` when the input is `-`, so the user's command works end-to-end.

Verification
`test/test_load_manifest_stdin.py` (new) covers the stdin path with `monkeypatch`ed `sys.stdin`:

- `split(num_splits=4)` succeeds and preserves all items (the original `#810` regression)
- type detection (`RecordingSet` vs `SupervisionSet`)
- … `None`
- explicit `manifest_cls=` is honoured

I also ran an end-to-end CLI test that mirrors the issue's `gunzip | lhotse split -` invocation: 8 dummy recordings → `manifest.{0..3}.jsonl.gz` with all recordings preserved.

`black`, `isort --profile black`, and `flake8 --select=E9,F63,F7,F82` are clean on the changed files. The local test suite has unrelated Windows-only `NamedTemporaryFile` `PermissionError` failures (present on `master` too); CI on Linux is unaffected.
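For reviewers who want to see the stdin naming fallback in isolation, a rough sketch follows. `split_output_path` is a hypothetical helper, not the code in `manipulation.py`, and the naming used for regular input paths (`cuts.jsonl.gz` becoming `cuts.0.jsonl.gz`) is an assumption made only for illustration.

```python
from pathlib import Path


def split_output_path(manifest: Path, output_dir: Path, idx: int) -> Path:
    """Hypothetical helper: pick the output file name for split number idx."""
    if str(manifest) == "-":
        # stdin has no usable stem or extension, so use a fixed scheme.
        return output_dir / f"manifest.{idx}.jsonl.gz"
    # Assumed convention for regular files: keep the stem and full suffix.
    suffix = "".join(manifest.suffixes)
    stem = manifest.name[: len(manifest.name) - len(suffix)]
    return output_dir / f"{stem}.{idx}{suffix}"


from_stdin = split_output_path(Path("-"), Path("out"), 0)
from_file = split_output_path(Path("data/cuts.jsonl.gz"), Path("out"), 3)
```

Here `from_stdin` is named `manifest.0.jsonl.gz` and `from_file` is named `cuts.3.jsonl.gz`, so both keep an extension that a serializer can recognise.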