Skip to content

Sanitize NaN/Infinity in metrics JSON before parsing#139

Merged
abhijeet-dhumal merged 1 commit intoopendatahub-io:mainfrom
sutaakar:progress-fix
Apr 7, 2026
Merged

Sanitize NaN/Infinity in metrics JSON before parsing#139
abhijeet-dhumal merged 1 commit intoopendatahub-io:mainfrom
sutaakar:progress-fix

Conversation

@sutaakar
Copy link
Copy Markdown
Collaborator

@sutaakar sutaakar commented Apr 7, 2026

Python's json.dumps() allows NaN and Infinity by default, but these are not valid JSON per the spec. When training metrics contain these values (e.g., grad_norm when loss=0, eval_loss with bf16 on ROCm), Go's json.Unmarshal fails, causing the operator to never capture training progress.

Add sanitizeJSON() to replace NaN/Infinity with null before parsing in both PollTrainingProgress and CaptureMetricsFromTerminationMessage.

Ref: RHOAIENG-56898

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

Summary by CodeRabbit

  • Bug Fixes

    • Metrics parsing now treats non-standard numeric tokens (NaN, Infinity, -Infinity) as null in JSON payloads to prevent errors when reporting training progress or termination-message metrics; quoted string contents are preserved. Sanitized payloads emit a low-verbosity debug log when modifications occur.
  • Tests

    • Added tests for polling and termination-message handling with non-standard numeric values and a unit test verifying sanitization behavior.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 7, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 93bcd739-69dd-4a68-8154-5488562e6a6f

📥 Commits

Reviewing files that changed from the base of the PR and between d5100d1 and aadeadf.

📒 Files selected for processing (2)
  • pkg/rhai/progression/progression.go
  • pkg/rhai/progression/progression_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/rhai/progression/progression.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/rhai/progression/progression_test.go

📝 Walkthrough

Walkthrough

The PR adds a JSON sanitization helper sanitizeJSON and a compiled regex jsonSanitizer to replace non-standard numeric tokens (NaN, Infinity, -Infinity) with JSON null. PollTrainingProgress and CaptureMetricsFromTerminationMessage now sanitize payloads before unmarshalling and log at verbosity level 1 when modifications occur. Added bytes and regexp imports. Tests updated: new TestSanitizeJSON, additional TestPollTrainingProgress cases for NaN/Infinity/-Infinity, and an extended TestCaptureMetricsFromTerminationMessage case validating sanitized outputs.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Security and Code Quality Findings

  • Regex-based JSON mutation risks incorrect replacements or missed cases (e.g., edge-case token placements, escaped quotes, Unicode separators). CWE-20 (Improper Input Validation). Actionable: replace regex approach with a JSON-aware sanitizer — parse into tokens or into a map/tree and recursively replace non-finite numbers; add property-path reporting for replacements.

  • Silent conversion of non-finite numbers to null changes observable behavior and may hide upstream issues. CWE-703 (Improper Check or Handling of Exceptional Conditions). Actionable: make sanitization opt-in or emit structured telemetry (field paths/counts) and consider failing fast by default with an explicit sanitize flag.

  • Lack of preservation of original payload impedes forensic/debugging when data is mutated. Actionable: retain original payload (or cryptographic hash and snippet) when sanitization occurs; log at debug/trace level rather than mutating silently.

  • Potential DoS/CPU amplification when regex-processing large or many payloads. CWE-400 (Uncontrolled Resource Consumption). Actionable: limit payload size before sanitization, enforce timeouts, or switch to streaming/tokenized parsing to bound cost.

  • Test coverage gaps remain for complex/edge inputs: escaped quotes, keys containing numeric-like tokens, various whitespace/newline patterns, large payloads, and deeply nested structures. Actionable: add fuzz/edge-case tests and assertions that sanitized output preserves JSON structure and reports modification locations.

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely summarizes the main change: adding JSON sanitization for NaN/Infinity values before parsing in the metrics processing workflow.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@sutaakar sutaakar requested a review from abhijeet-dhumal April 7, 2026 07:47
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/rhai/progression/progression.go (1)

281-284: ⚠️ Potential issue | 🟠 Major

Missing io.LimitReader — potential memory exhaustion (CWE-400).

Unbounded io.ReadAll on an HTTP response body can exhaust memory if the metrics endpoint returns unexpectedly large payloads, whether maliciously or due to misconfiguration.

Proposed fix
-	body, err := io.ReadAll(resp.Body)
+	const maxResponseSize = 1 << 20 // 1 MiB limit for metrics JSON
+	body, err := io.ReadAll(io.LimitReader(resp.Body, maxResponseSize))

As per coding guidelines: "Use io.LimitReader for HTTP response bodies (prevent memory exhaustion)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/rhai/progression/progression.go` around lines 281 - 284, Replace the
unbounded io.ReadAll(resp.Body) call with a limited reader to prevent memory
exhaustion: define a constant like maxResponseBodySize (e.g., 1<<20 for 1MiB)
and call io.ReadAll(io.LimitReader(resp.Body, maxResponseBodySize)) instead of
io.ReadAll(resp.Body); after reading, if len(body) == maxResponseBodySize return
an error indicating the response is too large. Update the code around the
resp.Body read (the variables resp.Body and body and the existing error
handling) to use the LimitReader and the new size constant.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/rhai/progression/progression_test.go`:
- Around line 618-623: Add a new failing unit test that exercises the regex bug
by putting ": NaN" inside a string and ensuring only the actual NaN value is
replaced; specifically add a test case alongside the existing table entry (the
one named "NaN in string value not replaced") with name like "NaN in string with
colon should not be replaced", input `{"message": "Error: NaN detected",
"grad_norm": NaN}` and expected output `{"message": "Error: NaN detected",
"grad_norm": null}` so the test highlights the regex failing to respect string
boundaries in the code exercised by the progression tests.

In `@pkg/rhai/progression/progression.go`:
- Around line 296-306: The current sanitizeJSON uses regexes (nanPattern,
infPattern) that can match inside JSON strings; update sanitizeJSON to only
replace unquoted tokens by scanning the input and skipping quoted string regions
(handle escapes) and replacing bare NaN or Infinity tokens with null; keep the
nanPattern/infPattern as simple literals or remove them and perform the
replacement in the scanner logic inside sanitizeJSON so only unquoted
occurrences (e.g., the grad_norm: NaN token) are changed while string values
like "Error: NaN detected" remain untouched.

---

Outside diff comments:
In `@pkg/rhai/progression/progression.go`:
- Around line 281-284: Replace the unbounded io.ReadAll(resp.Body) call with a
limited reader to prevent memory exhaustion: define a constant like
maxResponseBodySize (e.g., 1<<20 for 1MiB) and call
io.ReadAll(io.LimitReader(resp.Body, maxResponseBodySize)) instead of
io.ReadAll(resp.Body); after reading, if len(body) == maxResponseBodySize return
an error indicating the response is too large. Update the code around the
resp.Body read (the variables resp.Body and body and the existing error
handling) to use the LimitReader and the new size constant.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: d20a1720-b647-4924-89b7-833c5c9fbd63

📥 Commits

Reviewing files that changed from the base of the PR and between 7d53779 and 0f22576.

📒 Files selected for processing (2)
  • pkg/rhai/progression/progression.go
  • pkg/rhai/progression/progression_test.go

Comment thread pkg/rhai/progression/progression_test.go
Comment thread pkg/rhai/progression/progression.go
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/rhai/progression/progression_test.go (1)

618-622: Missing edge case: Infinity in string value.

Test coverage includes NaN-in-string but not Infinity-in-string. For symmetry with the regex patterns.

Optional test case
+		{
+			name:  "Infinity in string value not replaced",
+			input: `{"message": "limit: Infinity reached", "value": Infinity}`,
+			want:  `{"message": "limit: Infinity reached", "value": null}`,
+		},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/rhai/progression/progression_test.go` around lines 618 - 622, Add a
parallel test case for Infinity-in-string symmetry with the NaN case: locate the
table entry with name "NaN in string value not replaced" in progression_test.go
and add a new case named "Infinity in string value not replaced" with input
`{"message": "Error: Infinity detected", "grad_norm": Infinity}` and expected
want `{"message": "Error: Infinity detected", "grad_norm": null}` so the
regex/path that replaces numeric Infinity is covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/rhai/progression/progression_test.go`:
- Around line 618-622: Add a parallel test case for Infinity-in-string symmetry
with the NaN case: locate the table entry with name "NaN in string value not
replaced" in progression_test.go and add a new case named "Infinity in string
value not replaced" with input `{"message": "Error: Infinity detected",
"grad_norm": Infinity}` and expected want `{"message": "Error: Infinity
detected", "grad_norm": null}` so the regex/path that replaces numeric Infinity
is covered.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 5a299bab-38ef-4d62-826a-da2f25081aab

📥 Commits

Reviewing files that changed from the base of the PR and between 0f22576 and fc8b7c9.

📒 Files selected for processing (2)
  • pkg/rhai/progression/progression.go
  • pkg/rhai/progression/progression_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/rhai/progression/progression.go

Comment thread pkg/rhai/progression/progression.go Outdated
@sutaakar sutaakar force-pushed the progress-fix branch 2 times, most recently from 154e00e to d5100d1 Compare April 7, 2026 12:02
Python's json.dumps() allows NaN and Infinity by default, but these
are not valid JSON per the spec. When training metrics contain these
values (e.g., grad_norm when loss=0, eval_loss with bf16 on ROCm),
Go's json.Unmarshal fails, causing the operator to never capture
training progress.

Add sanitizeJSON() to replace NaN/Infinity with null before parsing
in both PollTrainingProgress and CaptureMetricsFromTerminationMessage.
The regex patterns are anchored to trailing JSON delimiters to avoid
corrupting string values.

Ref: RHOAIENG-56898

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@abhijeet-dhumal abhijeet-dhumal merged commit a4e5b5a into opendatahub-io:main Apr 7, 2026
8 checks passed
@sutaakar sutaakar deleted the progress-fix branch April 7, 2026 13:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants