
Conversation


@gujjariramya commented Dec 22, 2020

In order to persist Scribe 2.0 data to GCP, we are using Brooklin connectors. We will create Brooklin connectors that consume from the event Kafka topics, convert each Avro record to Parquet format within Brooklin, and store the resulting Parquet records in a GCS bucket.
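For illustration, here is a minimal sketch of the Avro-to-Parquet conversion step using parquet-avro's AvroParquetWriter. The class name, method, batch shape, and codec choice are illustrative assumptions, not the connector's actual code:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

import java.io.IOException;
import java.util.List;

// Illustrative sketch only: write a batch of consumed Avro records to a local
// Parquet file, which the committer would then upload to the GCS bucket.
public final class AvroToParquetSketch {
  public static void writeBatch(Schema schema, List<GenericRecord> records, String localPath)
      throws IOException {
    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path(localPath))
        .withSchema(schema)
        .withCompressionCodec(CompressionCodecName.SNAPPY) // codec choice is an assumption
        .build()) {
      for (GenericRecord record : records) {
        writer.write(record);
      }
    }
  }
}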

RFC Doc
Puppet Change for Brooklin config: https://github.csnzoo.com/secure/puppet-cloud/pull/2352/files

Testing

  1. Tested the code changes by copying the tar file to a Brooklin node, then created datastreams for the Kafka topics and verified that data is populating in the GCS buckets.
  2. On the Scribe side, validated offline data counts against the datastream; the deviation is less than 1%.

Staging
This code is already being used to store pilot data in GCS buckets.

Server properties

##### ScribeAvroParquetEventFile connector config
brooklin.server.transportProvider.GCSTransportProviderScribeAvroParquetEventFile.factoryClassName=com.linkedin.datastream.cloud.storage.CloudStorageTransportProviderAdminFactory
brooklin.server.transportProvider.GCSTransportProviderScribeAvroParquetEventFile.packageQueueSize=1000
brooklin.server.transportProvider.GCSTransportProviderScribeAvroParquetEventFile.objectBuilderThreadCount=3
brooklin.server.transportProvider.GCSTransportProviderScribeAvroParquetEventFile.maxFileSize=120108864
brooklin.server.transportProvider.GCSTransportProviderScribeAvroParquetEventFile.maxFileAge=900000
brooklin.server.transportProvider.GCSTransportProviderScribeAvroParquetEventFile.inflightCommits=2
brooklin.server.transportProvider.GCSTransportProviderScribeAvroParquetEventFile.committer.class=com.linkedin.datastream.cloud.storage.committer.GCSObjectCommitter
brooklin.server.transportProvider.GCSTransportProviderScribeAvroParquetEventFile.committer.threads=3
brooklin.server.transportProvider.GCSTransportProviderScribeAvroParquetEventFile.committer.credentialsPath=/wayfair/app/brooklin/config/brooklingcp.json
brooklin.server.transportProvider.GCSTransportProviderScribeAvroParquetEventFile.committer.scribeParquetFileStructure=true
brooklin.server.transportProvider.GCSTransportProviderScribeAvroParquetEventFile.committer.writeAtOnceMaxFileSize=1048576
brooklin.server.transportProvider.GCSTransportProviderScribeAvroParquetEventFile.io.class=com.linkedin.datastream.cloud.storage.io.ScribeAvroParquetEventFile
brooklin.server.transportProvider.GCSTransportProviderScribeAvroParquetEventFile.io.directory=/wayfair/data/brooklincloudstorage
brooklin.server.transportProvider.GCSTransportProviderScribeAvroParquetEventFile.io.schemaRegistryURL=http://kube-kafka-schema-c1.service.intrabo1.consul.csnzoo.com:80
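For context on the rolling settings above: maxFileSize is in bytes (~120 MB here) and maxFileAge is in milliseconds (15 minutes). A minimal sketch of the rollover semantics these two settings imply, as an illustration of the config rather than the connector's source:

// Illustration only: a file is committed once it is large enough or old enough.
static boolean shouldRollFile(long currentFileBytes, long fileCreatedAtMs, long nowMs) {
  final long maxFileSizeBytes = 120108864L; // from ...maxFileSize (~120 MB)
  final long maxFileAgeMs = 900000L;        // from ...maxFileAge (15 minutes)
  return currentFileBytes >= maxFileSizeBytes || (nowMs - fileCreatedAtMs) >= maxFileAgeMs;
}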


@gujjariramya changed the title from "WIP Scribe data persistence to gcs" to "Scribe data persistence to gcs" on Jan 14, 2021
@ckommini

Changes look good. Please clean up comments and any test code before merge.

import org.slf4j.LoggerFactory;

/**
* Implementation of {@link File} to support Parquet file format


Let's expand this description and add details about what this connector does at a high level.


Also, change all references to Scribe in the comments to call out Scribe 2.0.

Comment on lines +332 to +333
DateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS ZZ");
String date = format.format(new Date(value));


You should set the time zone to EST explicitly, since GCP defaults to UTC.
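For reference, a minimal sketch of pinning the formatter's zone as suggested; using the America/New_York zone ID (which covers both EST and EDT) is an assumption here:

// Pin the formatter's time zone explicitly instead of relying on the host default.
DateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS ZZ");
format.setTimeZone(TimeZone.getTimeZone("America/New_York"));
String date = format.format(new Date(value));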

Author


As discussed offline, this needed further investigation. Here is the ticket for it

Comment on lines +143 to +146
} catch (Exception e) {
LOG.error("Unable to write to WriteLog {}", e);
aPackage.getAckCallback().onCompletion(new DatastreamRecordMetadata(
aPackage.getCheckpoint(), aPackage.getTopic(), aPackage.getPartition()), e);
@ckommini Feb 23, 2021


I'd let Santosh confirm if it's safe to do this here.

Collaborator


Please don't catch Exception here. Instead, identify the specific exceptions that can be raised and handle them.

Make sure you have full coverage of exception handling. An unhandled exception will result in a dead object builder thread that could be serving other streams.
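A sketch of what narrowing the catch might look like; the exact exception types are assumptions and depend on what the write path can actually throw:

// Illustrative only: catch the specific failure first, and keep a last-resort guard
// so an unexpected failure acks with an error instead of killing the builder thread.
} catch (IOException e) {
  LOG.error("Unable to write to WriteLog", e);
  aPackage.getAckCallback().onCompletion(new DatastreamRecordMetadata(
      aPackage.getCheckpoint(), aPackage.getTopic(), aPackage.getPartition()), e);
} catch (RuntimeException e) {
  LOG.error("Unexpected failure while writing package", e);
  aPackage.getAckCallback().onCompletion(new DatastreamRecordMetadata(
      aPackage.getCheckpoint(), aPackage.getTopic(), aPackage.getPartition()), e);
}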

@sdomalap
Collaborator

sdomalap commented Mar 5, 2021

Did you run style checks and bug checks by running ./gradlew clean build?
As per https://github.com/linkedin/Brooklin/wiki/Developer-Guide


// scribe parquet file structure: events/event_name/eventdate=2020-12-21/scribeKafkatopic+partition+startOffset+endOffset+suffix.parquet
// Eg: events/healthcheck_evaluated/eventdate=2021-02-22/scribe_internal-healthcheck_evaluated+0+187535+187631+1613970121085.parquet
if (isScribeParquetFileStructure) {
Collaborator


Users may have different object name requirements. We should work on making this a dynamic datastream config parameter.
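For illustration, the naming scheme documented in the code comment above could be assembled along these lines (the variable names are illustrative, not the connector's actual code):

// Illustration of the documented object-name layout (not connector source):
// events/<event_name>/eventdate=<yyyy-MM-dd>/<topic>+<partition>+<startOffset>+<endOffset>+<suffix>.parquet
String objectName = String.format("events/%s/eventdate=%s/%s+%d+%d+%d+%d.parquet",
    eventName, eventDate, topic, partition, startOffset, endOffset, System.currentTimeMillis());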

* @param schema parquet compatible avro schema
* @param avroRecord the incoming record
* @return GenericRecord the record converted to match the new schema
* @throws Exception
Collaborator


Should the method be declared as throws Exception? Also, why a generic Exception object?

}
}
} catch (Exception e) {
LOG.error(String.format("Exception in getting avro field schema types in ScribeParquetAvroConverter: Schema: %s, field: %s, typeName: %s", schema.getName(), fieldName, typeName), e);
Collaborator


You are logging an error and then continuing. Should you?

If continuing here is acceptable, log this message as a warning instead.

}
}
}
} catch (NullPointerException e){
Collaborator


I would never catch NullPointerException. Please identify the case and handle it properly; catching it may mask other issues in your code.

Collaborator


Also, if you are experiencing a NullPointerException, it's most likely an unrecoverable error. You need to understand its implications: is it data loss, or something else you should be concerned about?
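A sketch of handling the null case explicitly instead of catching NullPointerException (the field and variable names are illustrative):

// Illustrative only: check for null up front rather than catching NullPointerException.
if (fieldValue == null) {
  LOG.warn("Null value for field {} in schema {}; skipping conversion", fieldName, schema.getName());
  return null; // or substitute the schema's default value, depending on the intended semantics
}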

@gujjariramya changed the title from "Scribe data persistence to gcs" to "WIP: Scribe data persistence to gcs" on Apr 2, 2021