Redact sensitive information in catalog queries #24563
This should be an error, actually. It should throw IllegalArgumentException.
But then the query will fail during redaction. The idea is to avoid disrupting the natural flow and let it fail where it normally would if redaction didn't exist.
Why would a query fail during redaction if it hasn’t first failed during analysis? I.e., it’s a condition that should never occur.
We perform redaction at a very early stage (before the query state machine is created) to modify query text exposed in query events and `QueryInfo`. I believe that verifying the existence of a given connector happens only during execution, for example, in `CreateCatalogTask`.
I’m very confused about the purpose of this change, then. If redaction happens before analysis, how are the analyzer and execution engine able to see the unredacted values so that they can do their job?
Can you describe the technical approach at a high level so that I don’t have to reverse engineer what the code is trying to achieve?
Please let me know if the following helps:
Problem
Currently, when we execute a CREATE CATALOG statement containing plaintext secrets, unredacted query text is exposed via the REST API, the `system.runtime.queries` table, and query events.
Goal
Instead of displaying the query text in its raw form in the locations mentioned above, such as:
we aim to redact security-sensitive property values:
Proposed Solution
The REST API, the `system.runtime.queries` table, and query events obtain query text from the `QueryInfo` object. Based on our research, the query text contained in `QueryInfo` is not interpreted anywhere in the engine.
The `QueryInfo` object is created by the `QueryStateMachine`. To redact the query text, we propose performing redaction after the query is parsed (to ensure we have the AST available for traversal and redaction) but before the `QueryStateMachine` is created.
Since redaction occurs at an early stage of query processing, we need to duplicate some logic that is typically performed during analysis and execution. For example, this includes evaluating catalog properties. Additionally, we do not want to disrupt the normal query processing flow; therefore, we ensure the query never fails due to redaction. If, for any reason, redaction is not possible, we will resort to masking all properties.
To identify security-sensitive properties for a given connector, we propose introducing a new SPI to expose them: #24562
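To make the "never fail, mask everything as a fallback" behavior concrete, here is a minimal sketch. It is not the code in this PR: the class `CatalogPropertyRedactor`, the `sensitiveLookup` function (a stand-in for the SPI proposed in #24562), and the `***` mask are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper, not part of Trino: illustrates redaction that never throws.
public final class CatalogPropertyRedactor
{
    // Assumed mask value; the real replacement text may differ.
    public static final String MASK = "***";

    private CatalogPropertyRedactor() {}

    /**
     * Returns a copy of {@code properties} with sensitive values masked.
     *
     * @param sensitiveLookup stand-in for the proposed SPI: maps a connector name to the
     *        names of its security-sensitive properties, or empty if the connector is unknown
     */
    public static Map<String, String> redact(
            String connectorName,
            Map<String, String> properties,
            Function<String, Optional<Set<String>>> sensitiveLookup)
    {
        try {
            Optional<Set<String>> sensitive = sensitiveLookup.apply(connectorName);
            if (sensitive.isEmpty()) {
                // Unknown connector: do not throw here, so the query fails later
                // where it normally would. Mask everything to stay safe.
                return maskAll(properties);
            }
            Map<String, String> result = new LinkedHashMap<>();
            properties.forEach((name, value) ->
                    result.put(name, sensitive.get().contains(name) ? MASK : value));
            return result;
        }
        catch (RuntimeException e) {
            // Redaction must never be the reason a query fails.
            return maskAll(properties);
        }
    }

    // Fallback: keep the property names but hide every value.
    public static Map<String, String> maskAll(Map<String, String> properties)
    {
        Map<String, String> result = new LinkedHashMap<>();
        properties.keySet().forEach(name -> result.put(name, MASK));
        return result;
    }
}
```

With a lookup that reports `connection-password` as sensitive (an illustrative property name), only that value is masked. With an unknown connector, or a lookup that throws, every value is masked and no exception escapes.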
This automatically also handles things like the event listener and `QueryResource`, right?
Might be worth explicitly calling it out in the commit message (although you do imply that by mentioning anything using QueryInfo/BasicQueryInfo).
Correct.
I extracted the tests confirming that into separate commits to avoid distracting from the core redaction functionality.
I refined the commit message and included your suggestion.