[SPARK-47444][SQL] Validate numeric table stats in ALTER TABLE SET TBLPROPERTIES #55550
Open
shrirangmhalgi wants to merge 1 commit into apache:master from
What changes were proposed in this pull request?
This PR adds validation for the table statistics properties (`numRows`, `totalSize`, `rawDataSize`) in `ALTER TABLE SET TBLPROPERTIES` to reject non-numeric values.

The PR changes the following 4 files:
- `error-conditions.json`: Added a new error condition `INVALID_TABLE_STATS_VALUE` (SQLSTATE 22023) with the message: "The value <value> for table statistics property <key> is not a valid numeric value."
- `CheckAnalysis.scala`: Added a new `case SetTableProperties(_, properties)` match in `checkAnalysis0()` that validates that stats property values can be parsed as `BigInt`. This catches invalid values at analysis time for the v2 catalog code path (e.g., `ALTER TABLE ... SET TBLPROPERTIES` resolved through DataSourceV2).
- `ddl.scala`: Added the same validation in `AlterTableSetPropertiesCommand.run()` before properties are written to the catalog. This catches invalid values at execution time for the v1 catalog code path (Hive/in-memory catalog).
- `AlterTableSetTblPropertiesSuiteBase.scala`: Added a new test that covers both invalid and valid inputs for all three stats properties.

Why are the changes needed?
As reported in SPARK-47444, `ALTER TABLE SET TBLPROPERTIES` currently accepts empty strings and non-numeric values for `numRows`, `totalSize`, and `rawDataSize`. While SPARK-30262 added a defensive filter when reading stats (to avoid `NumberFormatException`), invalid values can still be written to the catalog. Downstream tools and applications that consume these stats may break or produce incorrect results.

As mentioned in SPARK-47444, Hive and Beeline already validate these properties on write. Spark should do the same.
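
The write-side check can be sketched roughly as follows. This is a minimal, Spark-free illustration under stated assumptions, not the code from this PR: the object name `TableStatsValidation`, the helper `firstInvalid`, and the exception class `InvalidTableStatsValue` are hypothetical stand-ins (the PR itself reports the error through an `AnalysisException` with condition `INVALID_TABLE_STATS_VALUE`), and exact-case key matching is an assumption.

```scala
import scala.util.Try

object TableStatsValidation {
  // Stats property keys named in the PR description (exact-case match is an assumption).
  val statsKeys: Set[String] = Set("numRows", "totalSize", "rawDataSize")

  // Hypothetical stand-in for Spark's AnalysisException carrying INVALID_TABLE_STATS_VALUE.
  final case class InvalidTableStatsValue(key: String, value: String)
    extends Exception(
      s"The value $value for table statistics property $key is not a valid numeric value.")

  // Returns the first stats property whose value cannot be parsed as BigInt, if any.
  // BigInt("") and BigInt("abc") both throw NumberFormatException, so empty strings
  // and non-numeric strings are treated the same way.
  def firstInvalid(props: Map[String, String]): Option[(String, String)] =
    props.find { case (k, v) => statsKeys.contains(k) && Try(BigInt(v)).isFailure }

  // Throws on the first invalid stats value; properties outside statsKeys are ignored.
  def validate(props: Map[String, String]): Unit =
    firstInvalid(props).foreach { case (k, v) => throw InvalidTableStatsValue(k, v) }
}
```

In this sketch, `validate(Map("numRows" -> "abc"))` throws, while `validate(Map("numRows" -> "100", "comment" -> "abc"))` returns normally because only the three stats keys are checked.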
Does this PR introduce any user-facing change?
Yes. `ALTER TABLE SET TBLPROPERTIES` now throws an `AnalysisException` with error condition `INVALID_TABLE_STATS_VALUE` if `numRows`, `totalSize`, or `rawDataSize` is set to a non-numeric value (including empty strings). Previously, these invalid values were silently accepted.

How was this patch tested?
Added a new test in `AlterTableSetTblPropertiesSuiteBase` covering the three stats properties (`numRows`, `totalSize`, `rawDataSize`):
- Non-numeric values (e.g., 'abc') are rejected for all three stats properties.
- Valid numeric values (e.g., '100', '5000') continue to be accepted without error.

All assertions use `checkError` to verify the exact error condition (`INVALID_TABLE_STATS_VALUE`) and parameter values. The full `AlterTableSetTblPropertiesSuite` passes (4/4 tests, including the new one).

Was this patch authored or co-authored using generative AI tooling?
Yes. All changes were reviewed and verified by the author.