-
Notifications
You must be signed in to change notification settings - Fork 3.2k
Native Grok Reader Implementation #25205
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
242112c
to
3917639
Compare
We need to preserve the copyright notice in those files, but it doesn't need to be laid out verbatim. See how we do it in other places, such as:
|
a36e4bd
to
2fdc153
Compare
lib/trino-hive-formats/src/test/java/io/trino/hive/formats/line/grok/TestApache.java
Outdated
Show resolved
Hide resolved
lib/trino-hive-formats/src/test/java/io/trino/hive/formats/line/grok/TestApache.java
Outdated
Show resolved
Hide resolved
ca31331
to
8c86b48
Compare
b3b8609
to
ef9f1f4
Compare
ef9f1f4
to
347f52a
Compare
.ifPresentOrElse( | ||
inputFormat -> { | ||
checkFormatForProperty(hiveStorageFormat, HiveStorageFormat.GROK, GROK_INPUT_FORMAT); | ||
// try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this block commented out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the portion above(getRegexPattern), there's a check done to make sure that the regex pattern passed is a valid pattern regex.
At first, I wanted to follow a similar approach and make sure the grok input format passed in is valid regex, but remembered that Pattern.compile() doesn't support when the named capture groups have an underscore
I wasn't sure how to approach this and wanted to bring this to reviewer's attention (I probably should have included that in the PR details) but I think there's two ways to approach:
- Remove this all check all together
- Add this check back in (should probably also be checking the grok custom pattern too) but implement something similar to what we did in Grok.java where we remove the underscores from the named capture groups before passing it in to Pattern.compile()
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/HiveClassNames.java
Outdated
Show resolved
Hide resolved
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/Grok.java
Show resolved
Hide resolved
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/GrokDeserializer.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from the other comments, would suggest to go through the copied code and remove any code that is not actually being used
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/Grok.java
Outdated
Show resolved
Hide resolved
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/KeyValue.java
Outdated
Show resolved
Hide resolved
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/Pile.java
Outdated
Show resolved
Hide resolved
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/exception/GrokError.java
Outdated
Show resolved
Hide resolved
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/Grok.java
Show resolved
Hide resolved
77f2ce6
to
49ef28b
Compare
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/Converter.java
Outdated
Show resolved
Hide resolved
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/Converter.java
Outdated
Show resolved
Hide resolved
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/Grok.java
Outdated
Show resolved
Hide resolved
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/GrokDeserializer.java
Outdated
Show resolved
Hide resolved
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/GrokDeserializer.java
Outdated
Show resolved
Hide resolved
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/GrokDeserializer.java
Outdated
Show resolved
Hide resolved
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/Match.java
Outdated
Show resolved
Hide resolved
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/Match.java
Outdated
Show resolved
Hide resolved
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/Match.java
Outdated
Show resolved
Hide resolved
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/Match.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/test/java/io/trino/plugin/hive/TestGrokTable.java
Outdated
Show resolved
Hide resolved
if (BOOLEAN.equals(type)) { | ||
type.writeBoolean(builder, Boolean.parseBoolean(value)); | ||
} | ||
else if (BIGINT.equals(type)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about the other types out there?
Use RegexDeserializer
as inspiration.
Add coverage in TestGrokTable
for all the types in this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially had all the types that regex covered, but reduced it down to cover all the ones that we currently support in Athena
cc: @zhaner08 @pettyjamesm should we expand to match the types we have for the RegexDeserializer
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will go ahead and tests for other data types in TestGrokTable
9f1deb7
to
ab027a6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some comments... I'm not done yet
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/Converter.java
Outdated
Show resolved
Hide resolved
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/Discovery.java
Outdated
Show resolved
Hide resolved
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/Converter.java
Outdated
Show resolved
Hide resolved
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/BooleanConverter.java
Outdated
Show resolved
Hide resolved
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/IConverter.java
Outdated
Show resolved
Hide resolved
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/Discovery.java
Outdated
Show resolved
Hide resolved
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/Discovery.java
Outdated
Show resolved
Hide resolved
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/Discovery.java
Outdated
Show resolved
Hide resolved
lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/grok/Discovery.java
Outdated
Show resolved
Hide resolved
Map<String, Object> map = match.toMap(); | ||
List<Object> row = new ArrayList<Object>(map.values()); | ||
|
||
for (int i = 0; i < columns.size(); i++) { | ||
Column column = columns.get(i); | ||
BlockBuilder blockBuilder = builder.getBlockBuilder(i); | ||
String value = i < row.size() ? String.valueOf(row.get(i)) : null; | ||
if (value == null) { | ||
blockBuilder.appendNull(); | ||
continue; | ||
} | ||
serializeValue(value, column, blockBuilder, this.grokNullOnParseError); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what is going on here. It appears that grok is building a map of name/value pairs. Then the code assumes the values of the map are in specific order that happens to match the order of values in the reader. Then it converts the value to string, so it can be parsed again as the final value type.
This seems like a lot of work when instead we could just read the values into the final expected type directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my understanding there are two types of casting occurring.
The first happens when we specify the data type in the input format. Since we're reading the log lines as strings, the default data type is string. However, we can specify the data type in the input format.
For example, let's say we have a log:
1
with an 'input.format' = "%{NUMBER:num:double}"
We are specifying that we want whatever NUMBER
captures to be represented as a type double, so 1
becomes 1.0
The second type casting occurs with the column type. With the example above, let's say the column that will hold the value captured by NUMBER
has a type of String
. That means the value shown in that column will be "1.0"
as a string representation.
The conversion of value to string does seem a bit unnecessary but I wanted to be able to utilize the primitive type wrapper classes like we do in regex deserializer. For instance, if value
was "123"
and the column type was BIGINT
then it would beneficial to make use of the Long.parseLong(value)
Would you instead suggest having value
be a type of Object
and then casting value to be whatever we specify the column type to be? I believe some additional casting would have to take place to handle the case mentioned above
ab027a6
to
89d885d
Compare
89d885d
to
6a59dea
Compare
This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack. |
994aa05
to
cc1767e
Compare
Description
Native reader implementation for Grok format.
This PR is implementing a
GrokDeserializer
as well as porting over the entire Grok library (Athena depends on release 0.1.4 with some minor bug fixes and changes to support date data type).The Java Grok library can be found here: https://github.com/thekrakken/java-grok/tree/grok-0.1.4
Questions/concerns:
getHiveSerDeClassNames
value be?The implementation(everything aside from java grok library) for the reader was done in the following files:
trino-hive-formats
module:GrokDeserializer
+GrokDeserializerFactory
--> our implementation of the DeserializerTestGrokFormat
--> some additional unit tests + tests against examples found in athena docs (reading line, following format of other native reader tests)pom.xml
trino-hive
module:HiveModule
HiveClassNames
HiveMetadata
HiveStorageFormat
HiveTableProperties
GrokFileWriterFactory
GrokPageSourceFactory
BaseHiveConnectorTest
HiveTestUtils
TestGrokTable
TestHivePageSink
pom.xml
Additional context and related issues
Athena supports the GrokSerde and this is a bug-for-bug implementation for what Athena currently has.
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: