
fix(job-orchestration): Clamp invalid archive timestamps during compression#2228

Open
goynam wants to merge 1 commit into y-scope:main from goynam:fix/clamp-archive-timestamps

Conversation

@goynam
Contributor

@goynam goynam commented Apr 28, 2026


Archives with bogus begin_timestamp/end_timestamp values (e.g., from upstream services emitting non-epoch values in the timestamp field) create extremely wide time spans that match virtually every search query, severely degrading query scheduler performance.

Add _clamp_archive_timestamps() validation in update_archive_metadata() that:

  • Clamps begin_timestamp before 2020-01-01 to end_timestamp
  • Clamps end_timestamp more than 1 year in the future to begin_timestamp
  • Sets both to 0 if both are invalid

This prevents new archives from being written with poisoned time ranges while preserving backward compatibility (logger parameter is optional).

Description

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved timestamp validation for archive metadata to prevent malformed timestamps from creating excessively wide time ranges. When invalid timestamp ranges are detected, the system now logs warnings and automatically corrects the timestamps to ensure data integrity.

@goynam goynam requested a review from a team as a code owner April 28, 2026 16:16
@coderabbitai
Contributor

coderabbitai Bot commented Apr 28, 2026

Walkthrough

Adds timestamp validation and clamping logic to archive metadata handling in the compression task. A new helper function validates and optionally clamps begin_timestamp and end_timestamp to prevent malformed time ranges. The update_archive_metadata function signature is updated to accept an optional logger parameter, and the StorageEngine.CLP_S compression flow is updated to pass the logger.

Changes

Cohort / File(s) Summary
Archive Metadata Timestamp Validation
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
Added helper function for timestamp validation/clamping that detects and corrects invalid time ranges, logging warnings when issues are found. Updated update_archive_metadata function signature to accept optional logger parameter and integrated clamping logic during metadata insertion. Updated compression flow to pass logger to enable timestamp validation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Description Check — ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed. The pull request title accurately and clearly summarizes the main change: clamping invalid archive timestamps during the compression process.
  • Linked Issues check — ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed. Check skipped because no linked issues were found for this pull request.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Ruff (0.15.12)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py

Unexpected Ruff issue shape at index 39



@goynam goynam changed the title from "fix(job-orchestration): Clamp invalid archive timestamps during compr…" to "fix(job-orchestration): Clamp invalid archive timestamps during compression" on Apr 28, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@components/job-orchestration/job_orchestration/executor/compress/compression_task.py`:
- Around line 103-109: The current module-level computation of
max_valid_timestamp_ms uses time.time() directly which makes tests flaky;
introduce an injectable "now" in milliseconds by replacing the module-level
max_valid_timestamp_ms with a small helper like
get_max_valid_timestamp_ms(now_ms=None) that returns int((now_ms or
int(time.time()*1000)) + 365*86400*1000), and update any callers (e.g.,
timestamp validation logic in this module) to accept an optional now_ms
parameter and call get_max_valid_timestamp_ms(now_ms) so tests can pass a fixed
now_ms for deterministic boundary checks while defaulting to the runtime clock.
- Line 151: The timestamp clamping is currently skipped when the optional logger
parameter is None; change the call sites in compression_task.py so the
timestamp-clamping helper (e.g., clamp_timestamps or the "timestamp clamping"
call) is invoked unconditionally regardless of logger presence, and instead move
the logger presence check into the helper so it only emits warnings when a
logger is provided; update the other similar call sites (the block covering
lines ~174-177) to follow the same pattern so clamping always runs but warning
emission is guarded inside the helper that accepts an optional logger.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1cf08ef6-235c-44d5-83ab-b7b25103d012

📥 Commits

Reviewing files that changed from the base of the PR and between aa28a0c and 77fd1cd.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py

Comment on lines +103 to +109
import time

# 2020-01-01 00:00:00 UTC in milliseconds
MIN_VALID_TIMESTAMP_MS = 1577836800000
# 1 year from now in milliseconds
max_valid_timestamp_ms = int((time.time() + 365 * 86400) * 1000)


🧹 Nitpick | 🔵 Trivial

Make the “current time” input injectable for deterministic boundary tests.

Using time.time() directly makes edge-case tests around the 1-year threshold flaky. Consider an optional now_ms parameter defaulting to runtime clock.


table_prefix: str,
dataset: str | None,
archive_stats: dict[str, Any],
logger=None,
⚠️ Potential issue | 🟠 Major

Always apply timestamp clamping regardless of logger presence.

Right now, omitting logger skips clamping entirely, which re-opens the poisoned time-range path. Clamp unconditionally, and only guard warning emission inside the helper.

Proposed fix
 def _clamp_archive_timestamps(
     archive_stats: dict[str, Any],
-    logger,
+    logger=None,
 ) -> None:
@@
     if bad_begin and bad_end:
-        logger.warning(
-            "Archive %s has both invalid begin_timestamp=%s and end_timestamp=%s, "
-            "setting both to 0.",
-            archive_stats.get("id", "unknown"),
-            begin_ts,
-            end_ts,
-        )
+        if logger is not None:
+            logger.warning(
+                "Archive %s has both invalid begin_timestamp=%s and end_timestamp=%s, "
+                "setting both to 0.",
+                archive_stats.get("id", "unknown"),
+                begin_ts,
+                end_ts,
+            )
         archive_stats["begin_timestamp"] = 0
         archive_stats["end_timestamp"] = 0
     elif bad_begin:
-        logger.warning(
-            "Archive %s has invalid begin_timestamp=%s (before 2020), clamping to "
-            "end_timestamp=%s.",
-            archive_stats.get("id", "unknown"),
-            begin_ts,
-            end_ts,
-        )
+        if logger is not None:
+            logger.warning(
+                "Archive %s has invalid begin_timestamp=%s (before 2020), clamping to "
+                "end_timestamp=%s.",
+                archive_stats.get("id", "unknown"),
+                begin_ts,
+                end_ts,
+            )
         archive_stats["begin_timestamp"] = end_ts
     elif bad_end:
-        logger.warning(
-            "Archive %s has invalid end_timestamp=%s (>1 year in future), clamping to "
-            "begin_timestamp=%s.",
-            archive_stats.get("id", "unknown"),
-            end_ts,
-            begin_ts,
-        )
+        if logger is not None:
+            logger.warning(
+                "Archive %s has invalid end_timestamp=%s (>1 year in future), clamping to "
+                "begin_timestamp=%s.",
+                archive_stats.get("id", "unknown"),
+                end_ts,
+                begin_ts,
+            )
         archive_stats["end_timestamp"] = begin_ts
@@
-    if logger is not None:
-        _clamp_archive_timestamps(stats_to_update, logger)
+    _clamp_archive_timestamps(stats_to_update, logger)

Also applies to: 174-177

