Native Grok Reader Implementation #25205


Open. Wants to merge 5 commits into base: master.
Conversation

@bangtim bangtim commented Mar 3, 2025

Description

Native reader implementation for Grok format.

This PR implements a GrokDeserializer and ports over the entire Grok library (Athena depends on release 0.1.4, with some minor bug fixes and changes to support the date data type).

The Java Grok library can be found here: https://github.com/thekrakken/java-grok/tree/grok-0.1.4

  • The library includes an API that allows us to parse logs, as well as some basic unit tests
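For readers unfamiliar with Grok, the core idea is a pattern dictionary whose `%{NAME:alias}` references expand into an ordinary regex with named capture groups. A minimal sketch under that assumption (this is NOT the java-grok API; `GrokSketch`, `expand`, and the two-entry dictionary are illustrative):

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of the core idea (NOT the java-grok API): expand
// %{NAME:alias} references against a pattern dictionary, then match
// the result with java.util.regex named capture groups.
public class GrokSketch
{
    // Tiny stand-in for grok's pattern dictionary
    static final Map<String, String> PATTERNS = Map.of(
            "WORD", "\\w+",
            "NUMBER", "\\d+");

    static String expand(String format)
    {
        Matcher m = Pattern.compile("%\\{(\\w+):(\\w+)\\}").matcher(format);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // %{WORD:user} becomes (?<user>\w+)
            m.appendReplacement(sb, Matcher.quoteReplacement(
                    "(?<" + m.group(2) + ">" + PATTERNS.get(m.group(1)) + ")"));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args)
    {
        String regex = expand("%{WORD:user} %{NUMBER:code}");
        Matcher match = Pattern.compile(regex).matcher("alice 404");
        if (match.matches()) {
            System.out.println(match.group("user") + " -> " + match.group("code"));
        }
    }
}
```

The real library adds recursive pattern references, a bundled dictionary, and typed captures, but the expansion-to-regex step is the essence of what the deserializer relies on.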

Questions/concerns:

  • One thing to pay attention to is the LICENSE
  • The license header in the ported files is different, so the build fails (with the same header as other files, the build succeeds locally). How should we make sure the header properly cites the authors/contributors of the open source Grok library? cc: @martint
  • What should the getHiveSerDeClassNames value be?

The implementation of the reader (everything aside from the Java Grok library) was done in the following files:

  • trino-hive-formats module:
    • GrokDeserializer + GrokDeserializerFactory --> our implementation of the Deserializer
      • Very similar to regex
    • TestGrokFormat --> some additional unit tests, plus tests against examples found in the Athena docs (reading a line, following the format of other native reader tests)
    • pom.xml
  • trino-hive module:
    • HiveModule
    • HiveClassNames
    • HiveMetadata
    • HiveStorageFormat
    • HiveTableProperties
    • GrokFileWriterFactory
    • GrokPageSourceFactory
    • BaseHiveConnectorTest
    • HiveTestUtils
    • TestGrokTable
    • TestHivePageSink
    • pom.xml

Additional context and related issues

Athena supports the Grok SerDe, and this is a bug-for-bug-compatible implementation of what Athena currently has.

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Section
* Add native Grok file format reader. ({issue}`25205`)

@cla-bot cla-bot bot added the cla-signed label Mar 3, 2025
@github-actions github-actions bot added the hive Hive connector label Mar 3, 2025
@bangtim bangtim force-pushed the native-grok-reader branch from 242112c to 3917639 Compare March 4, 2025 16:51
@martint
Copy link
Member

martint commented Mar 4, 2025

We need to preserve the copyright notice in those files, but it doesn't need to be laid out verbatim. See how we do it in other places, such as:

// Copyright (C) 2007 The Guava Authors

@bangtim bangtim force-pushed the native-grok-reader branch 3 times, most recently from a36e4bd to 2fdc153 Compare March 4, 2025 22:30
@bangtim bangtim force-pushed the native-grok-reader branch 10 times, most recently from ca31331 to 8c86b48 Compare March 10, 2025 16:08
@bangtim bangtim requested a review from findinpath March 11, 2025 15:00
@bangtim bangtim force-pushed the native-grok-reader branch 2 times, most recently from b3b8609 to ef9f1f4 Compare March 13, 2025 15:34
@bangtim bangtim force-pushed the native-grok-reader branch from ef9f1f4 to 347f52a Compare March 13, 2025 15:50
.ifPresentOrElse(
        inputFormat -> {
            checkFormatForProperty(hiveStorageFormat, HiveStorageFormat.GROK, GROK_INPUT_FORMAT);
            // try {
Member

Why is this block commented out?

@bangtim bangtim commented Mar 13, 2025

In the portion above (getRegexPattern), there's a check to make sure that the regex pattern passed in is a valid regex.

At first, I wanted to follow a similar approach and make sure the grok input format passed in is a valid regex, but remembered that Pattern.compile() doesn't support named capture groups that contain an underscore.

I wasn't sure how to approach this and wanted to bring it to the reviewers' attention (I probably should have included that in the PR description), but I think there are two ways to approach it:

  1. Remove this check altogether
  2. Add the check back in (it should probably also validate the custom grok pattern) and implement something similar to what we did in Grok.java, where we remove the underscores from the named capture groups before passing the pattern to Pattern.compile()
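To make the constraint concrete: java.util.regex restricts group names to letters and digits starting with a letter, so a grok-expanded pattern with an underscored capture group name fails to compile until the underscore is stripped. A small demonstration (the `compiles` helper and the `remote_host` group name are illustrative):

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Demonstrates the constraint discussed above: java.util.regex group names
// must be letters and digits starting with a letter, so a grok-expanded
// pattern whose capture group name contains an underscore cannot be
// compiled as-is.
public class NamedGroupUnderscore
{
    static boolean compiles(String regex)
    {
        try {
            Pattern.compile(regex);
            return true;
        }
        catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args)
    {
        System.out.println(compiles("(?<remote_host>\\w+)")); // false: underscore rejected
        System.out.println(compiles("(?<remotehost>\\w+)"));  // true once the underscore is removed
    }
}
```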

@zhaner08 zhaner08 (Contributor) left a comment

Apart from the other comments, I would suggest going through the copied code and removing any code that is not actually being used.

@bangtim bangtim force-pushed the native-grok-reader branch 3 times, most recently from 77f2ce6 to 49ef28b Compare March 19, 2025 18:11
@bangtim bangtim requested review from pettyjamesm and zhaner08 March 19, 2025 18:57
if (BOOLEAN.equals(type)) {
    type.writeBoolean(builder, Boolean.parseBoolean(value));
}
else if (BIGINT.equals(type)) {
Contributor

What about the other types out there?

Use RegexDeserializer as inspiration.

Add coverage in TestGrokTable for all the types in this method.

@bangtim bangtim commented Mar 21, 2025

Initially this had all the types that the regex reader covered, but I reduced it to the ones we currently support in Athena.

cc: @zhaner08 @pettyjamesm should we expand to match the types we have for the RegexDeserializer?

Contributor Author

Will go ahead and add tests for the other data types in TestGrokTable.

@bangtim bangtim force-pushed the native-grok-reader branch 2 times, most recently from 9f1deb7 to ab027a6 Compare March 24, 2025 21:33
@dain dain (Member) left a comment

some comments... I'm not done yet

Comment on lines +116 to +128
Map<String, Object> map = match.toMap();
List<Object> row = new ArrayList<Object>(map.values());

for (int i = 0; i < columns.size(); i++) {
    Column column = columns.get(i);
    BlockBuilder blockBuilder = builder.getBlockBuilder(i);
    String value = i < row.size() ? String.valueOf(row.get(i)) : null;
    if (value == null) {
        blockBuilder.appendNull();
        continue;
    }
    serializeValue(value, column, blockBuilder, this.grokNullOnParseError);
}
Member

I don't understand what is going on here. It appears that grok is building a map of name/value pairs. Then the code assumes the values of the map are in a specific order that happens to match the order of values in the reader. Then it converts each value to a string so that it can be parsed again into the final value type.

This seems like a lot of work when instead we could just read the values into the final expected type directly.

Contributor Author

From my understanding, there are two types of casting occurring.

The first happens when we specify the data type in the input format. Since we're reading the log lines as strings, the default data type is string, but the input format can override it.

For example, say we have the log line:
1
with 'input.format' = "%{NUMBER:num:double}"

We are specifying that whatever NUMBER captures should be represented as a double, so 1 becomes 1.0.

The second cast occurs with the column type. Continuing the example above, say the column that will hold the value captured by NUMBER has type string. The value shown in that column will then be "1.0", the string representation.

The conversion of the value to a string does seem a bit unnecessary, but I wanted to be able to use the primitive-type wrapper classes as we do in RegexDeserializer. For instance, if the value was "123" and the column type was BIGINT, it would be beneficial to use Long.parseLong(value).

Would you instead suggest making value an Object and then casting it to whatever the column type is? I believe some additional casting would have to take place to handle the case mentioned above.
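The two-step conversion being discussed can be sketched as follows (all names here, such as `ColumnType` and `coerce`, are illustrative, not Trino or Athena APIs):

```java
// Hypothetical sketch of the two-step conversion discussed above. The grok
// type annotation (e.g. :double) first reshapes the raw capture, and the
// column type then determines the final representation. ColumnType and
// coerce are illustrative names, not Trino or Athena APIs.
public class TypeCoercionSketch
{
    enum ColumnType { VARCHAR, BIGINT, DOUBLE, BOOLEAN }

    // Parse the captured string directly into the column's type, leaning on
    // the primitive wrapper classes the way the regex deserializer does
    static Object coerce(String value, ColumnType type)
    {
        switch (type) {
            case BIGINT:
                return Long.parseLong(value);
            case DOUBLE:
                return Double.parseDouble(value);
            case BOOLEAN:
                return Boolean.parseBoolean(value);
            default:
                return value;
        }
    }

    public static void main(String[] args)
    {
        // "%{NUMBER:num:double}" turns the raw capture "1" into the double 1.0 ...
        double grokTyped = Double.parseDouble("1");
        // ... and a string-typed column then stores its string form, "1.0"
        System.out.println(coerce(Double.toString(grokTyped), ColumnType.VARCHAR));
    }
}
```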

@bangtim bangtim force-pushed the native-grok-reader branch from ab027a6 to 89d885d Compare March 25, 2025 22:07
@bangtim bangtim force-pushed the native-grok-reader branch from 89d885d to 6a59dea Compare March 25, 2025 22:25
This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack.

@bangtim bangtim force-pushed the native-grok-reader branch from 994aa05 to cc1767e Compare April 22, 2025 16:23
@github-actions github-actions bot removed the stale label Apr 22, 2025