consolidate_data_by_period(period=1d) produces 84 files from 168 daily-spanning fragments — design intent or bug? #3824
Unanswered
stefanholi
asked this question in
Q&A
Replies: 1 comment
It is a bug: a one-day period should result in seven daily files, not 84. The "halving" suggests the tool is incorrectly performing a simple pairwise merge instead of grouping data by the time intervals you defined.
I'm trying to use ParquetDataCatalog.consolidate_data_by_period to produce one file per UTC day from a live data collector's fragment-per-flush on-disk state. The docs describe this method as:
I expected "1-day period" to produce one file per UTC day. The observed behavior is different and I want to confirm whether it's intended.
Environment
Minimal reproduction
Seed a fresh tempdir catalog with 168 synthetic Bar objects spanning exactly 7 UTC days (hourly bars), each written as a separate fragment via catalog.write_data([bar], skip_disjoint_check=True). Then:
catalog.consolidate_data_by_period(
    data_cls=Bar,
    identifier=bar_type_str,
    period=pd.Timedelta(days=1),
    ensure_contiguous_files=False,
)
Observed
Bars are preserved (zero data loss), but the output file count is exactly half of the input — regardless of the period=1 day argument. It looks like pairwise adjacent-file merging rather than period-based partitioning. Running it again would presumably halve further.
For a 7-day span with period=1d, my intuition (and my reading of the "split into fixed time periods" doc) was 7 output files. Is 84 the intended behavior? If yes, could the docs clarify the cardinality relationship? If no, I'll re-file as a bug with the full minimal reproducer.
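To make the two readings concrete, here is a minimal stdlib-only sketch of the arithmetic. It does not call the catalog and is not the library's implementation; the start date and the variable names are arbitrary choices for illustration. It only shows that grouping 168 hourly timestamps by UTC day gives 7 groups, while merging adjacent items pairwise gives 84, which matches the file count I observed.

```python
from datetime import datetime, timedelta, timezone

# Assumption: an arbitrary start at UTC midnight; 168 hourly timestamps = 7 full UTC days.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
timestamps = [start + timedelta(hours=i) for i in range(168)]

# Reading 1: period-based partitioning, one group per UTC calendar day.
by_day = {}
for ts in timestamps:
    by_day.setdefault(ts.date(), []).append(ts)

# Reading 2: pairwise merging of adjacent fragments, ignoring the period.
pairs = [timestamps[i:i + 2] for i in range(0, len(timestamps), 2)]

print(len(by_day), len(pairs))  # 7 84
```

If the method were partitioning by period, I would expect the first number; the output I see corresponds to the second.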
Side note — ensure_contiguous_files=True
With ensure_contiguous_files=True on the same fragment inputs:
AssertionError: Intervals are not contiguous. When ensure_contiguous_files=True, all files in the consolidation range must have contiguous timestamps.
The docs don't mention this mode for the period variants. Is ensure_contiguous_files=True documented to be incompatible with fragment-per-write collector catalogs?
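One possible explanation, sketched below under an assumption I have not confirmed against the implementation: that "contiguous" means each file's start timestamp is exactly 1 ns after the previous file's end timestamp. The function name are_contiguous and the interval tuples are hypothetical and only illustrate that definition. Under it, single-bar fragments (where start equals end equals the bar timestamp) can never be contiguous for hourly bars.

```python
HOUR_NS = 3_600 * 1_000_000_000  # one hour in nanoseconds


def are_contiguous(intervals):
    """Hypothetical check: every interval starts exactly 1 ns after the previous one ends."""
    return all(nxt[0] - prev[1] == 1 for prev, nxt in zip(intervals, intervals[1:]))


# One single-bar fragment per hour: start == end == the bar's timestamp.
fragments = [(i * HOUR_NS, i * HOUR_NS) for i in range(168)]

# The same span, if each file's interval were extended to 1 ns before the next start.
padded = [(i * HOUR_NS, (i + 1) * HOUR_NS - 1) for i in range(168)]

print(are_contiguous(fragments))  # False
print(are_contiguous(padded))     # True
```

If that assumption holds, any catalog written one bar per fragment would trip the assertion, independent of the period argument.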
I'm using NT to write a live data collector that flushes bars in small batches (NT-native filenames, one fragment per flush). Over time these fragments accumulate to thousands per instrument-timeframe. I want a periodic consolidation that produces bounded per-day files, so my rsync transfers only touch the current day's file. The base consolidate_data works but produces one ever-growing file per instrument. consolidate_data_by_period looked like the right tool — hence the question.
If the current behavior is intended and my use case is out of scope for this method, any recommendation on how to achieve per-day consolidation would be appreciated.