@jakesmith
Member

@jakesmith jakesmith commented Jan 6, 2026

The creation of split points for small JSON files was causing what should have been trailing empty split points to contain huge lengths due to unsigned underflow.
This was caused by the JSON partitioner incorrectly resetting the cursor's inputOffset, which in turn was used to calculate the length of all future splits.
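
To illustrate the failure mode described above, here is a minimal sketch of how an unsigned subtraction produces a huge length when the start offset ends up beyond the end offset. The function name `splitLength` and the use of `std::uint64_t` for offsets are assumptions made for illustration; this is not the code in daftformat.cpp.

```cpp
#include <cstdint>

// Hypothetical stand-in for a split point length calculation.
// Both offsets are unsigned, so the subtraction cannot go negative:
// if start > end, the result wraps around modulo 2^64.
std::uint64_t splitLength(std::uint64_t end, std::uint64_t start)
{
    return end - start;
}
```

With a 100-byte file and a start offset of 120 (for example, one derived from a wrongly reset cursor offset), `splitLength(100, 120)` yields 18446744073709551596 rather than 0, which matches the "huge length" symptom reported for the trailing empty split points.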

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

Copilot AI review requested due to automatic review settings January 6, 2026 18:01
@github-actions

github-actions bot commented Jan 6, 2026

Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-35567

Jirabot Action Result:
Assigning user: [email protected]
Workflow Transition To: Merge Pending
Updated PR

Contributor

Copilot AI left a comment

Pull request overview

This PR fixes a bug in JSON spray split point calculation for small files where unsigned integer underflow was causing trailing empty split points to have incorrectly large lengths. The root cause was the JSON partitioner incorrectly resetting the cursor.inputOffset to 0, which was then used to calculate split point lengths.

Key Changes

  • Modified the conditional logic for when cursor.inputOffset is reset to 0 to preserve its value when EOF is encountered
  • Added defensive assertex check to guard against length underflow
  • Improved code clarity by introducing an intermediate variable for the adjusted start offset calculation
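
The last two changes in the list above can be sketched roughly as follows. This is an assumption-laden illustration, not the actual patch: `guardedLength`, its parameters, and the use of `std::logic_error` in place of the project's assertex check are all hypothetical.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical helper: compute a partition length with an explicit
// intermediate start offset and a guard against unsigned underflow.
std::uint64_t guardedLength(std::uint64_t end,
                            std::uint64_t base,
                            std::uint64_t inputOffset)
{
    // Intermediate variable makes the adjusted start explicit.
    const std::uint64_t adjustedStart = base + inputOffset;

    // Stand-in for a defensive assertion: fail loudly instead of
    // letting (end - adjustedStart) wrap to a huge value.
    if (end < adjustedStart)
        throw std::logic_error("split point length would underflow");

    return end - adjustedStart;
}
```

Under these assumptions, `guardedLength(100, 40, 10)` returns 50, while `guardedLength(100, 90, 20)` throws instead of silently producing a wrapped length.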

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File: Description
  • dali/ft/daftformat.ipp: Modified findSplitPoint to conditionally reset cursor.inputOffset only for first split or before EOF check, preserving the offset value when EOF is reached
  • dali/ft/daftformat.cpp: Added defensive assertion and intermediate variable to prevent unsigned underflow when calculating partition point lengths

@jakesmith jakesmith requested a review from ghalliday January 6, 2026 21:59
Member

@ghalliday ghalliday left a comment

@jakesmith 1 question

if (eof) // leave cursor.inputOffset untouched
    return;

cursor.inputOffset = 0;

Member

Is this correct for the case !foundRowEnd?

Member Author

No, I agree it is not.
I don't think it is correct if !checkFoundRowStartRes either.

The creation of split points for small JSON files was causing
what should have been trailing empty split points to contain
huge lengths due to unsigned underflow.
This was caused by the JSON partitioner incorrectly resetting
the cursor inputOffset which in turn was used to calculate length
on all future splits.

Signed-off-by: Jake Smith <[email protected]>
@jakesmith jakesmith force-pushed the HPCC-35567-json-spray-split-fix branch from 4c509dc to 193056e Compare January 8, 2026 17:00
@jakesmith jakesmith requested a review from ghalliday January 9, 2026 09:21