Native Grok Reader Implementation #25205


Open. Wants to merge 5 commits into base: master.
Conversation

@bangtim bangtim commented Mar 3, 2025

Description

Native reader implementation for Grok format.

This PR implements a GrokDeserializer and ports over the entire Grok library (Athena depends on release 0.1.4, with some minor bug fixes and changes to support the date data type).

The Java Grok library can be found here: https://github.com/thekrakken/java-grok/tree/grok-0.1.4

  • The library includes an API that allows us to parse logs, as well as some basic unit tests
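For readers unfamiliar with Grok, the core idea is a pattern dictionary whose `%{NAME:alias}` references expand into an ordinary regex with named capture groups. A minimal sketch under that assumption (this is NOT the java-grok API; `GrokSketch`, `expand`, and the two-entry dictionary are illustrative):

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of the core idea (NOT the java-grok API): expand
// %{NAME:alias} references against a pattern dictionary, then match
// the result with java.util.regex named capture groups.
public class GrokSketch
{
    // Tiny stand-in for grok's pattern dictionary
    static final Map<String, String> PATTERNS = Map.of(
            "WORD", "\\w+",
            "NUMBER", "\\d+");

    static String expand(String format)
    {
        Matcher m = Pattern.compile("%\\{(\\w+):(\\w+)\\}").matcher(format);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // %{WORD:user} becomes (?<user>\w+)
            m.appendReplacement(sb, Matcher.quoteReplacement(
                    "(?<" + m.group(2) + ">" + PATTERNS.get(m.group(1)) + ")"));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args)
    {
        String regex = expand("%{WORD:user} %{NUMBER:code}");
        Matcher match = Pattern.compile(regex).matcher("alice 404");
        if (match.matches()) {
            System.out.println(match.group("user") + " -> " + match.group("code"));
        }
    }
}
```

The real library adds recursive pattern references, a bundled dictionary, and typed captures, but the expansion-to-regex step is the essence of what the deserializer relies on.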

Questions/concerns:

  • One thing to pay attention to is the LICENSE
  • The license header in the ported files is different, so the build fails (with the same header as other files, the build succeeds locally). How should we make sure the header properly cites the authors/contributors of the open source Grok library? cc: @martint
  • What should the getHiveSerDeClassNames value be?

The implementation of the reader (everything aside from the Java Grok library) was done in the following files:

  • trino-hive-formats module:
    • GrokDeserializer + GrokDeserializerFactory --> our implementation of the Deserializer
      • Very similar to regex
    • TestGrokFormat --> some additional unit tests, plus tests against examples found in the Athena docs (reading a line, following the format of other native reader tests)
    • pom.xml
  • trino-hive module:
    • HiveModule
    • HiveClassNames
    • HiveMetadata
    • HiveStorageFormat
    • HiveTableProperties
    • GrokFileWriterFactory
    • GrokPageSourceFactory
    • BaseHiveConnectorTest
    • HiveTestUtils
    • TestGrokTable
    • TestHivePageSink
    • pom.xml

Additional context and related issues

Athena supports the Grok SerDe, and this is a bug-for-bug-compatible implementation of what Athena currently has.

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Section
* Add native Grok file format reader. ({issue}`25205`)

@cla-bot cla-bot bot added the cla-signed label Mar 3, 2025
@github-actions github-actions bot added the hive Hive connector label Mar 3, 2025
@bangtim bangtim force-pushed the native-grok-reader branch from 242112c to 3917639 Compare March 4, 2025 16:51
@martint
Copy link
Member

martint commented Mar 4, 2025

We need to preserve the copyright notice in those files, but it doesn't need to be laid out verbatim. See how we do it in other places, such as:

// Copyright (C) 2007 The Guava Authors

@bangtim bangtim force-pushed the native-grok-reader branch 3 times, most recently from a36e4bd to 2fdc153 Compare March 4, 2025 22:30
@bangtim bangtim force-pushed the native-grok-reader branch 10 times, most recently from ca31331 to 8c86b48 Compare March 10, 2025 16:08
@bangtim bangtim requested a review from findinpath March 11, 2025 15:00
@bangtim bangtim force-pushed the native-grok-reader branch 2 times, most recently from b3b8609 to ef9f1f4 Compare March 13, 2025 15:34
@bangtim bangtim force-pushed the native-grok-reader branch from ef9f1f4 to 347f52a Compare March 13, 2025 15:50
.ifPresentOrElse(
        inputFormat -> {
            checkFormatForProperty(hiveStorageFormat, HiveStorageFormat.GROK, GROK_INPUT_FORMAT);
            // try {
Member

Why is this block commented out?

@bangtim bangtim commented Mar 13, 2025

In the portion above (getRegexPattern), there's a check to make sure that the regex pattern passed in is a valid regex.

At first, I wanted to follow a similar approach and make sure the grok input format passed in is a valid regex, but remembered that Pattern.compile() doesn't support named capture groups that contain an underscore.

I wasn't sure how to approach this and wanted to bring it to the reviewers' attention (I probably should have included that in the PR description), but I think there are two ways to approach it:

  1. Remove this check altogether
  2. Add the check back in (it should probably also validate the custom grok pattern) and implement something similar to what we did in Grok.java, where we remove the underscores from the named capture groups before passing the pattern to Pattern.compile()
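To make the constraint concrete: java.util.regex restricts group names to letters and digits starting with a letter, so a grok-expanded pattern with an underscored capture group name fails to compile until the underscore is stripped. A small demonstration (the `compiles` helper and the `remote_host` group name are illustrative):

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Demonstrates the constraint discussed above: java.util.regex group names
// must be letters and digits starting with a letter, so a grok-expanded
// pattern whose capture group name contains an underscore cannot be
// compiled as-is.
public class NamedGroupUnderscore
{
    static boolean compiles(String regex)
    {
        try {
            Pattern.compile(regex);
            return true;
        }
        catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args)
    {
        System.out.println(compiles("(?<remote_host>\\w+)")); // false: underscore rejected
        System.out.println(compiles("(?<remotehost>\\w+)"));  // true once the underscore is removed
    }
}
```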

@zhaner08 zhaner08 (Contributor) left a comment

Apart from the other comments, I would suggest going through the copied code and removing any code that is not actually being used.

@bangtim bangtim force-pushed the native-grok-reader branch 3 times, most recently from 77f2ce6 to 49ef28b Compare March 19, 2025 18:11
@bangtim bangtim requested review from pettyjamesm and zhaner08 March 19, 2025 18:57
if (BOOLEAN.equals(type)) {
    type.writeBoolean(builder, Boolean.parseBoolean(value));
}
else if (BIGINT.equals(type)) {
Contributor

What about the other types out there?

Use RegexDeserializer as inspiration.

Add coverage in TestGrokTable for all the types in this method.

@bangtim bangtim commented Mar 21, 2025

Initially this had all the types that the regex reader covered, but I reduced it to the ones we currently support in Athena.

cc: @zhaner08 @pettyjamesm should we expand to match the types we have for the RegexDeserializer?

Contributor Author

Will go ahead and add tests for the other data types in TestGrokTable.

@bangtim bangtim force-pushed the native-grok-reader branch 2 times, most recently from 9f1deb7 to ab027a6 Compare March 24, 2025 21:33
@dain dain (Member) left a comment

some comments... I'm not done yet

Comment on lines +116 to +128
Map<String, Object> map = match.toMap();
List<Object> row = new ArrayList<Object>(map.values());

for (int i = 0; i < columns.size(); i++) {
    Column column = columns.get(i);
    BlockBuilder blockBuilder = builder.getBlockBuilder(i);
    String value = i < row.size() ? String.valueOf(row.get(i)) : null;
    if (value == null) {
        blockBuilder.appendNull();
        continue;
    }
    serializeValue(value, column, blockBuilder, this.grokNullOnParseError);
}
Member

I don't understand what is going on here. It appears that grok is building a map of name/value pairs. Then the code assumes the values of the map are in a specific order that happens to match the order of values in the reader. Then it converts each value to a string so that it can be parsed again into the final value type.

This seems like a lot of work when instead we could just read the values into the final expected type directly.

Contributor Author

From my understanding, there are two types of casting occurring.

The first happens when we specify the data type in the input format. Since we're reading the log lines as strings, the default data type is string, but the input format can override it.

For example, say we have the log line:
1
with 'input.format' = "%{NUMBER:num:double}"

We are specifying that whatever NUMBER captures should be represented as a double, so 1 becomes 1.0.

The second cast occurs with the column type. Continuing the example above, say the column that will hold the value captured by NUMBER has type string. The value shown in that column will then be "1.0", the string representation.

The conversion of the value to a string does seem a bit unnecessary, but I wanted to be able to use the primitive-type wrapper classes as we do in RegexDeserializer. For instance, if the value was "123" and the column type was BIGINT, it would be beneficial to use Long.parseLong(value).

Would you instead suggest making value an Object and then casting it to whatever the column type is? I believe some additional casting would have to take place to handle the case mentioned above.
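The two-step conversion being discussed can be sketched as follows (all names here, such as `ColumnType` and `coerce`, are illustrative, not Trino or Athena APIs):

```java
// Hypothetical sketch of the two-step conversion discussed above. The grok
// type annotation (e.g. :double) first reshapes the raw capture, and the
// column type then determines the final representation. ColumnType and
// coerce are illustrative names, not Trino or Athena APIs.
public class TypeCoercionSketch
{
    enum ColumnType { VARCHAR, BIGINT, DOUBLE, BOOLEAN }

    // Parse the captured string directly into the column's type, leaning on
    // the primitive wrapper classes the way the regex deserializer does
    static Object coerce(String value, ColumnType type)
    {
        switch (type) {
            case BIGINT:
                return Long.parseLong(value);
            case DOUBLE:
                return Double.parseDouble(value);
            case BOOLEAN:
                return Boolean.parseBoolean(value);
            default:
                return value;
        }
    }

    public static void main(String[] args)
    {
        // "%{NUMBER:num:double}" turns the raw capture "1" into the double 1.0 ...
        double grokTyped = Double.parseDouble("1");
        // ... and a string-typed column then stores its string form, "1.0"
        System.out.println(coerce(Double.toString(grokTyped), ColumnType.VARCHAR));
    }
}
```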

@bangtim bangtim force-pushed the native-grok-reader branch from ab027a6 to 89d885d Compare March 25, 2025 22:07
@bangtim bangtim force-pushed the native-grok-reader branch from 89d885d to 6a59dea Compare March 25, 2025 22:25
This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack.

@bangtim bangtim force-pushed the native-grok-reader branch from 994aa05 to cc1767e Compare April 22, 2025 16:23
@github-actions github-actions bot removed the stale label Apr 22, 2025