WIP: Scribe data persistence to gcs #15
base: wayfair_release_1_0_2-18
Conversation
…ct conversion and flattening nested object
Changes look good. Please clean up comments and any test code before merge.
```java
import org.slf4j.LoggerFactory;

/**
 * Implementation of {@link File} to support Parquet file format
 */
```
Let's expand this description and add details around what this connector does at a high level.
Also, change all references to Scribe in the comments to call out Scribe 2.0.
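A sketch of the kind of expanded class-level comment being requested; the wording below is illustrative only, not the final text:

```java
/**
 * Implementation of {@link File} that writes Scribe 2.0 events in the
 * Parquet file format. Incoming Avro records are converted to a
 * Parquet-compatible schema (flattening nested objects along the way)
 * before the file is committed to cloud storage.
 */
```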
```java
DateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS ZZ");
String date = format.format(new Date(value));
```
You should set it to EST explicitly since GCP is UTC by default.
As discussed offline, this needed further investigation. Here is the ticket for it
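For illustration, a minimal sketch of the explicit-zone variant suggested above, assuming America/New_York is the zone meant by "EST":

```java
DateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS ZZ");
// Pin the zone explicitly; GCP hosts default to UTC
// (America/New_York covers both EST and EDT; requires java.util.TimeZone)
format.setTimeZone(TimeZone.getTimeZone("America/New_York"));
String date = format.format(new Date(value));
```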
…ll unit test inputs
```java
} catch (Exception e) {
  LOG.error("Unable to write to WriteLog {}", e);
  aPackage.getAckCallback().onCompletion(new DatastreamRecordMetadata(
      aPackage.getCheckpoint(), aPackage.getTopic(), aPackage.getPartition()), e);
```
I'd let Santosh confirm if it's safe to do this here.
Please don't catch Exception here. Instead, identify the specific exceptions that can be raised and handle them.
Make sure you have full coverage of exception handling: an unhandled exception will result in a dead object builder thread that could be serving other streams.
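A sketch of the narrower handling being asked for, assuming IOException is the main checked failure of the write path (the real set should be taken from the write API's signature):

```java
} catch (IOException e) {
  // Known write failure: report it through the ack callback
  LOG.error("Unable to write to WriteLog", e);
  aPackage.getAckCallback().onCompletion(new DatastreamRecordMetadata(
      aPackage.getCheckpoint(), aPackage.getTopic(), aPackage.getPartition()), e);
} catch (RuntimeException e) {
  // Last-resort guard so an unexpected error cannot kill the object builder
  // thread, which may be serving other streams
  LOG.error("Unexpected error while building object", e);
  aPackage.getAckCallback().onCompletion(new DatastreamRecordMetadata(
      aPackage.getCheckpoint(), aPackage.getTopic(), aPackage.getPartition()), e);
}
```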
Did you run style checks and bug checks by running …?
```java
// scribe parquet file structure: events/event_name/eventdate=2020-12-21/scribeKafkatopic+partition+startOffset+endOffset+suffix.parquet
// Eg: events/healthcheck_evaluated/eventdate=2021-02-22/scribe_internal-healthcheck_evaluated+0+187535+187631+1613970121085.parquet
if (isScribeParquetFileStructure) {
```
Users may have different object name requirements. We should work on making this a dynamic datastream config parameter.
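A rough sketch of what that could look like; the property name objectNameTemplate and the template format below are hypothetical:

```java
// Hypothetical datastream config property, not part of this change
private static final String CONFIG_OBJECT_NAME_TEMPLATE = "objectNameTemplate";

// Fall back to the Scribe layout when the datastream does not override it
String template = properties.getProperty(CONFIG_OBJECT_NAME_TEMPLATE,
    "events/%s/eventdate=%s/%s+%d+%d+%d+%d.parquet");
```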
```java
 * @param schema parquet compatible avro schema
 * @param avroRecord the incoming record
 * @return GenericRecord the record converted to match the new schema
 * @throws Exception
```
Should the method be declared as throws Exception? Also, why generic Exception object?
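One possible narrowing, sketched with a hypothetical ScribeConversionException wrapper and an illustrative method name (the actual signature lives in ScribeParquetAvroConverter):

```java
/**
 * @throws ScribeConversionException if the record cannot be mapped onto the
 *         parquet compatible avro schema
 */
public static GenericRecord convert(Schema schema, GenericRecord avroRecord)
    throws ScribeConversionException {
  // ...
}
```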
```java
    }
  }
} catch (Exception e) {
  LOG.error(String.format("Exception in getting avro field schema types in ScribeParquetAvroConverter: Schema: %s, field: %s, typeName: %s, exception: %s", schema.getName(), fieldName, e));
```
You are logging an error. Should you continue? If you can continue, log this message as a warning.
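A sketch of the warn-level variant. Note the original format string has four %s placeholders but only three arguments, so this sketch drops the typeName placeholder (assuming typeName is not needed here) and passes the exception as the throwable argument:

```java
} catch (Exception e) {
  // Recoverable lookup miss: warn and continue processing the record
  LOG.warn(String.format(
      "Exception in getting avro field schema types in ScribeParquetAvroConverter: Schema: %s, field: %s",
      schema.getName(), fieldName), e);
}
```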
```java
      }
    }
  }
} catch (NullPointerException e) {
```
I would never catch NullPointerException. Please identify the case and handle it properly. This may mask other issues in your code.
Also, if you are experiencing a NullPointerException, it's most likely an unrecoverable error. You need to understand the implication of it: is it data loss, or something else you should be concerned about?
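A sketch of handling the null case explicitly instead of catching NullPointerException, assuming a missing field value is the source (fieldName and the fail-fast choice here are illustrative):

```java
// Test the nullable value up front rather than catching NullPointerException
Object fieldValue = avroRecord.get(fieldName);
if (fieldValue == null) {
  // Make the failure mode deliberate: skip, apply a default, or fail fast
  // so potential data loss is visible rather than silently swallowed
  throw new IllegalStateException(String.format(
      "Field %s is null in record for schema %s", fieldName, schema.getName()));
}
```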
In order to persist Scribe 2.0 data to GCP, we are using Brooklin connectors. The connectors consume from event Kafka topics, convert each Avro record to Parquet format within Brooklin, and store the resulting Parquet files in a GCS bucket.
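As a rough illustration of the core conversion step (not the actual connector code), parquet-avro's AvroParquetWriter can write Avro GenericRecords as Parquet; the path and record source below are placeholders:

```java
import java.io.IOException;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

/** Writes already-converted records to a Parquet file (GCS upload not shown). */
static void writeParquet(Schema schema, List<GenericRecord> records, String path)
    throws IOException {
  try (ParquetWriter<GenericRecord> writer =
           AvroParquetWriter.<GenericRecord>builder(new Path(path))
               .withSchema(schema)
               .withCompressionCodec(CompressionCodecName.SNAPPY)
               .build()) {
    for (GenericRecord record : records) {
      writer.write(record);
    }
  }
}
```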
RFC Doc
Puppet Change for Brooklin config: https://github.csnzoo.com/secure/puppet-cloud/pull/2352/files
Testing
Staging
This code is already being used to store pilot data in GCS buckets.