[SPARK-47444][SQL] Validate numeric table stats in ALTER TABLE SET TBLPROPERTIES #55550

Open
shrirangmhalgi wants to merge 1 commit into apache:master from shrirangmhalgi:SPARK-47444-validate-table-stats

shrirangmhalgi commented Apr 25, 2026

What changes were proposed in this pull request?

This PR adds validation for table statistics properties (numRows, totalSize, rawDataSize) in ALTER TABLE SET TBLPROPERTIES to reject non-numeric values.

The PR changes the following four files:

  1. error-conditions.json - Added a new error condition INVALID_TABLE_STATS_VALUE (SQLSTATE 22023) with the message: "The value <value> for table statistics property <key> is not a valid numeric value."
  2. CheckAnalysis.scala - Added a new case SetTableProperties(_, properties) branch to the match in checkAnalysis0() that checks that stats property values can be parsed as BigInt. This catches invalid values at analysis time for the v2 catalog code path (e.g., ALTER TABLE ... SET TBLPROPERTIES resolved through DataSourceV2).
  3. ddl.scala - Added the same validation in AlterTableSetPropertiesCommand.run() before properties are written to the catalog. This catches invalid values at execution time for the v1 catalog code path (Hive/in-memory catalog).
  4. AlterTableSetTblPropertiesSuiteBase.scala - Added a new test that covers both invalid and valid inputs for all three stats properties.
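
The check described in items 2 and 3 can be sketched in plain Scala without any Spark dependency. This is a minimal illustration, not the PR's code: the object name TableStatsValidation and the helper invalidStats are hypothetical, and the only behavior taken from the description is that the three stats keys must hold values parseable as BigInt.

```scala
import scala.util.Try

object TableStatsValidation {
  // Statistics property keys named in the PR description.
  val statsKeys: Set[String] = Set("numRows", "totalSize", "rawDataSize")

  // Hypothetical helper (not the PR's code): returns the stats properties
  // whose values cannot be parsed as BigInt. Non-stats keys are ignored.
  def invalidStats(properties: Map[String, String]): Map[String, String] =
    properties.filter { case (key, value) =>
      statsKeys.contains(key) && Try(BigInt(value)).isFailure
    }
}
```

In this sketch, a caller on either code path would raise INVALID_TABLE_STATS_VALUE for each entry returned and proceed only when the result is empty. Empty strings and values such as 'abc' both fail BigInt parsing, so they are reported the same way.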

Why are the changes needed?

As reported in SPARK-47444, ALTER TABLE SET TBLPROPERTIES currently accepts empty strings and non-numeric values for numRows, totalSize, and rawDataSize. While SPARK-30262 added a defensive filter when reading stats (to avoid NumberFormatException), invalid values can still be written to the catalog. Downstream tools and applications that consume these stats may break or produce incorrect results.

As noted in SPARK-47444, Hive and Beeline already validate these properties on write. Spark should do the same.

Does this PR introduce any user-facing change?

Yes. ALTER TABLE SET TBLPROPERTIES now throws an AnalysisException with error condition INVALID_TABLE_STATS_VALUE if numRows, totalSize, or rawDataSize is set to a non-numeric value (including empty strings). Previously, these invalid values were silently accepted.
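
To make the new error concrete, the sketch below fills in the message template quoted for INVALID_TABLE_STATS_VALUE. It is an assumption-laden stand-in: the object StatsErrorMessage and its render function are hypothetical, simple placeholder substitution is used in place of Spark's own error framework, and how Spark quotes the key and value parameters is not stated in this description.

```scala
object StatsErrorMessage {
  // Template text quoted from the PR description's error-conditions.json entry.
  val template: String =
    "The value <value> for table statistics property <key> is not a valid numeric value."

  // Hypothetical stand-in for Spark's message formatting:
  // replaces each <name> placeholder with the supplied parameter value.
  def render(params: Map[String, String]): String =
    params.foldLeft(template) { case (msg, (name, v)) =>
      msg.replace(s"<$name>", v)
    }
}
```

For example, StatsErrorMessage.render(Map("key" -> "numRows", "value" -> "abc")) produces the text a user would see in this simplified model when numRows is set to 'abc'.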

How was this patch tested?

Added a new test in AlterTableSetTblPropertiesSuiteBase that verifies:

  • Empty string values are rejected for all three stats properties (numRows, totalSize, rawDataSize)
  • Non-numeric string values (e.g., 'abc') are rejected for all three stats properties
  • Valid numeric values (e.g., '100', '5000') continue to be accepted without error

All assertions use checkError to verify the exact error condition (INVALID_TABLE_STATS_VALUE) and parameter values. The full AlterTableSetTblPropertiesSuite passes (4/4 tests, including the new one).

Was this patch authored or co-authored using generative AI tooling?

Yes. All changes were reviewed and verified by the author.

shrirangmhalgi force-pushed the SPARK-47444-validate-table-stats branch from f3aaea1 to 38f8a27 on April 25, 2026 07:19