-
Notifications
You must be signed in to change notification settings - Fork 3.2k
Adding native support of AWS Cloudtrail input format #24479
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Adding native support of AWS Cloudtrail input format #24479
Conversation
This pull request has gone a while without any activity. Tagging for triage help: @mosabua |
Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time. |
We had @zhaner08 and others attend the contributor call to ask about help with this PR. @pettyjamesm and @dain and @electrum will be able to help driving this so I added stale-ignore since we know this PR will come to completion over time. |
@zhaner08 let us know if you have any further questions or work planned on this PR or if you are waiting for first review beyond the input we provided during the contributor call. |
Will work on another revision of this this week |
This PR is ready to be reviewed, as discussed during the call, this currently only supports CloudTrail + Hive Json combination. |
Added @electrum as reviewer since he is file system lead and was part of the discussion in the contributor call. |
@@ -101,6 +102,10 @@ public enum HiveStorageFormat | |||
REGEX( | |||
REGEX_SERDE_CLASS, | |||
TEXT_INPUT_FORMAT_CLASS, | |||
HIVE_IGNORE_KEY_OUTPUT_FORMAT_CLASS), | |||
CLOUDTRAIL( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's document it in hive.md
so that the Trino users get to know about it.
@@ -189,6 +190,7 @@ public static Set<HivePageSourceFactory> getDefaultHivePageSourceFactories(Trino | |||
.add(new RcFilePageSourceFactory(fileSystemFactory, hiveConfig)) | |||
.add(new OrcPageSourceFactory(new OrcReaderConfig(), fileSystemFactory, stats, hiveConfig)) | |||
.add(new ParquetPageSourceFactory(fileSystemFactory, stats, new ParquetReaderConfig(), hiveConfig)) | |||
.add(new CloudTrailJsonPageSourceFactory(fileSystemFactory, hiveConfig)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add test coverage for this functionality.
Consider working with a resource directory to add coverage for the new code and showcase the new functionality.
import static java.nio.charset.StandardCharsets.UTF_8; | ||
import static java.util.Objects.requireNonNull; | ||
|
||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a javadoc comment
public CloudTrailJsonPageSourceFactory(TrinoFileSystemFactory trinoFileSystemFactory, HiveConfig config) | ||
{ | ||
super(trinoFileSystemFactory, | ||
new JsonDeserializerFactory(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this correct? The CloudTrail reader code in Hive is expected to be just a simple wrapper around the Hive JsonSerDe format, or is does it have custom code? I ask because the exact Hive JsonSerDe has some strange behavior and we need to be sure that this correct behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So from Athena side, we updated our doc to suggest customers to use this Hive json wrapper from ~2 years ago, it literally just extracting the json within the 'records' and use whatever customer choose to process the json itself. Customer can also choose to use openx json. We haven't heard any complaints about using hive/openx json underneath likely due to cloud trail logs are mostly proper formatted json. With this native wrapper, we are also planning to migrate those people that using emr.cloudtrail serde over to this, not planning to write an exact bug to bug copy of that serde
Description
Adding native support of AWS Cloudtrail input format
Additional context and related issues
Publishing this revision out to get some feedbacks while testing is ongoing and tests are being added
Specific question regarding the implementation:
I saw the current code more about supporting multiple input format mapped to single SerDe instead of single InputFormat mapped to multiple SerDe like in this CR, is there a better way to do this? Or we want to pass the input format down to
TextLineReaderFactory
so it can create Cloudtrail line reader at that the creation time.Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( X) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: