Skip to content

Adding native support of AWS Cloudtrail input format #24479

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

zhaner08
Copy link
Contributor

Description

Adding native support of AWS Cloudtrail input format

Additional context and related issues

Publishing this revision out to get some feedbacks while testing is ongoing and tests are being added

Specific question regarding the implementation:

I saw the current code more about supporting multiple input format mapped to single SerDe instead of single InputFormat mapped to multiple SerDe like in this CR, is there a better way to do this? Or we want to pass the input format down to TextLineReaderFactory so it can create Cloudtrail line reader at that the creation time.

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( X) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Dec 15, 2024
@github-actions github-actions bot added the hive Hive connector label Dec 15, 2024
@zhaner08 zhaner08 requested a review from dain December 15, 2024 22:41
Copy link

github-actions bot commented Jan 6, 2025

This pull request has gone a while without any activity. Tagging for triage help: @mosabua

@github-actions github-actions bot added the stale label Jan 6, 2025
Copy link

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

@github-actions github-actions bot closed this Jan 28, 2025
@mosabua mosabua added stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed. and removed stale labels Jan 28, 2025
@mosabua
Copy link
Member

mosabua commented Jan 28, 2025

We had @zhaner08 and others attend the contributor call to ask about help with this PR. @pettyjamesm and @dain and @electrum will be able to help driving this so I added stale-ignore since we know this PR will come to completion over time.

@mosabua mosabua reopened this Jan 28, 2025
@mosabua
Copy link
Member

mosabua commented Jan 28, 2025

@zhaner08 let us know if you have any further questions or work planned on this PR or if you are waiting for first review beyond the input we provided during the contributor call.

@zhaner08
Copy link
Contributor Author

zhaner08 commented Feb 4, 2025

Will work on another revision of this this week

@zhaner08 zhaner08 changed the title [WIP] Adding native support of AWS Cloudtrail input format Adding native support of AWS Cloudtrail input format Feb 6, 2025
@zhaner08
Copy link
Contributor Author

zhaner08 commented Feb 6, 2025

This PR is ready to be reviewed, as discussed during the call, this currently only supports CloudTrail + Hive Json combination.

@zhaner08 zhaner08 self-assigned this Feb 6, 2025
@zhaner08 zhaner08 requested a review from pettyjamesm February 6, 2025 02:35
@mosabua mosabua requested a review from electrum February 13, 2025 21:48
@mosabua
Copy link
Member

mosabua commented Feb 13, 2025

Added @electrum as reviewer since he is file system lead and was part of the discussion in the contributor call.

@@ -101,6 +102,10 @@ public enum HiveStorageFormat
REGEX(
REGEX_SERDE_CLASS,
TEXT_INPUT_FORMAT_CLASS,
HIVE_IGNORE_KEY_OUTPUT_FORMAT_CLASS),
CLOUDTRAIL(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's document it in hive.md so that the Trino users get to know about it.

@@ -189,6 +190,7 @@ public static Set<HivePageSourceFactory> getDefaultHivePageSourceFactories(Trino
.add(new RcFilePageSourceFactory(fileSystemFactory, hiveConfig))
.add(new OrcPageSourceFactory(new OrcReaderConfig(), fileSystemFactory, stats, hiveConfig))
.add(new ParquetPageSourceFactory(fileSystemFactory, stats, new ParquetReaderConfig(), hiveConfig))
.add(new CloudTrailJsonPageSourceFactory(fileSystemFactory, hiveConfig))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add test coverage for this functionality.
Consider working with a resource directory to add coverage for the new code and showcase the new functionality.

import static java.nio.charset.StandardCharsets.UTF_8;
import static java.util.Objects.requireNonNull;

/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a javadoc comment

public CloudTrailJsonPageSourceFactory(TrinoFileSystemFactory trinoFileSystemFactory, HiveConfig config)
{
super(trinoFileSystemFactory,
new JsonDeserializerFactory(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct? The CloudTrail reader code in Hive is expected to be just a simple wrapper around the Hive JsonSerDe format, or is does it have custom code? I ask because the exact Hive JsonSerDe has some strange behavior and we need to be sure that this correct behavior.

Copy link
Contributor Author

@zhaner08 zhaner08 Mar 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So from Athena side, we updated our doc to suggest customers to use this Hive json wrapper from ~2 years ago, it literally just extracting the json within the 'records' and use whatever customer choose to process the json itself. Customer can also choose to use openx json. We haven't heard any complaints about using hive/openx json underneath likely due to cloud trail logs are mostly proper formatted json. With this native wrapper, we are also planning to migrate those people that using emr.cloudtrail serde over to this, not planning to write an exact bug to bug copy of that serde

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed hive Hive connector stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed.
Development

Successfully merging this pull request may close these issues.

4 participants