Hi DuckLake team,
I am a big fan of DuckDB and have been following the recent developments of DuckLake. I built a custom project that I think might fit, and I would like some advice on whether it is even possible to integrate it with DuckLake.
In my company we have a data platform team to which all systems send the HTTP request-response of each interaction, and we stitch these together using simple identifiers similar to Google Analytics IDs (client ID and session ID). We usually process this data, store maybe 20% of it, and discard the rest.
We then got a requirement to store the raw request-responses so that they can be used for various future use cases, for example backfilling ETL data, data science work, debugging, etc.
Each of these JSON request-responses follows a generic contract that stipulates a few mandatory attributes. They include:
event_source: the system that generates the event
event_type: the type of the event
time: the request/response time
country: the country for which the event was generated
...and a few more (a made-up example event is shown below).
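To make that concrete, here is a made-up example of a single event. The payload field names are purely illustrative, and whether the payload sits next to or under the mandatory envelope is just an assumption for this sketch:

# A made-up example event; only the envelope attributes are mandatory.
example_event = {
    "event_source": "checkout-service",   # the system which generated the event
    "event_type": "order.created",        # the type of the event
    "time": "2024-06-01T10:15:23Z",       # the request/response time
    "country": "IN",                      # the country for which the event was generated
    "payload": {                          # the raw API request/response body (fluid, per API contract/version)
        "A": "abc",
        "B": [{"C": 1, "D": "x", "E": 123456789012}]
    }
}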
The problem with storing these as compressed JSON-lines text is that it is easy to store, but extracting information from it requires a lot of effort and infrastructure.
Since this data consists of JSON API request-responses, it already has a structure (the API contracts); the only problem is that this structure is fluid and keeps changing (API versions).
So I devised a solution where I infer this structure at run time and group data that matches the same structure; often a match also causes the structure to evolve.
To keep the schema evolution accurate, I partition the data based on the generic metadata available within each event: event_source/event_type/country.
Within that, I further partition the data by hourly event_time.
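As a rough sketch of the layout (the Hive-style path and the names here are illustrative, not necessarily the exact layout I use):

from datetime import datetime, timezone

def partition_prefix(event: dict) -> str:
    # Derive the partition prefix from the mandatory envelope attributes plus the event hour.
    ts = datetime.fromisoformat(event["time"].replace("Z", "+00:00")).astimezone(timezone.utc)
    return (
        f"event_source={event['event_source']}/"
        f"event_type={event['event_type']}/"
        f"country={event['country']}/"
        f"event_hour={ts:%Y-%m-%d-%H}/"
    )

# partition_prefix(example_event)
# -> 'event_source=checkout-service/event_type=order.created/country=IN/event_hour=2024-06-01-10/'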
The schema evolution is a very simple algorithm; some of the rules are just basic data type cast rules (a simplified sketch follows the list):
If two structures are disjoint, they can be merged into a single structure.
If two data types match they are kept as-is; if one can be upcast to the other (e.g. INT -> LONG), the wider type is used.
If the elements of an array have more than one type, the entire array is stored as a JSON blob.
...and a few more such rules.
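Here is a heavily simplified sketch of those rules over a flat field -> type mapping (the real implementation works on the nested HNode tree, and treating irreconcilable scalar types as "no match" is my simplification here):

def merge_types(a: str, b: str):
    """Simplified cast rules; returns None when two types cannot be reconciled."""
    widening = {("INT", "LONG"): "LONG", ("LONG", "INT"): "LONG",
                ("FLOAT", "DOUBLE"): "DOUBLE", ("DOUBLE", "FLOAT"): "DOUBLE"}
    if a == b:
        return a
    return widening.get((a, b))

def merge_schemas(s1: dict, s2: dict):
    """Merge two flat schemas: disjoint fields are unioned, overlapping fields are widened.
    Arrays whose elements disagree on type are collapsed to a JSON blob before this step."""
    merged = {}
    for field in s1.keys() | s2.keys():
        if field not in s1:
            merged[field] = s2[field]          # disjoint -> simply add the field
        elif field not in s2:
            merged[field] = s1[field]
        else:
            t = merge_types(s1[field], s2[field])
            if t is None:
                return None                    # irreconcilable -> this schema does not match
            merged[field] = t
    return merged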
Because this is structural schema evolution, the overall grouping of a schema, the 'equivalent of a table' you could say, is what I call a 'bucket'.
Within a bucket the schema can evolve, and each evolution is tracked with a simple timestamp (similar to Apache Druid schema versions).
Each event is compared against the buckets within its partition grouping to find which one matches; the first matching bucket is chosen (a simplified sketch of this routing is below).
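Conceptually the routing looks something like this (heavily simplified; the Bucket/SchemaVersion classes are placeholders and merge_schemas is the sketch from above):

import time
from dataclasses import dataclass

@dataclass
class SchemaVersion:
    fields: dict      # flat field -> type, as in the merge sketch above
    ts: float         # evolution timestamp (similar to Druid schema versions)

@dataclass
class Bucket:
    versions: list    # SchemaVersion entries, ordered by timestamp

def route_event(event_schema: dict, buckets: list) -> Bucket:
    """The first bucket whose latest schema can absorb the event's inferred schema wins;
    if absorbing it changes the structure, a new timestamped version is recorded."""
    for bucket in buckets:                                   # already scoped to event_source/event_type/country
        latest = bucket.versions[-1]
        merged = merge_schemas(latest.fields, event_schema)  # from the earlier sketch
        if merged is not None:                               # compatible (possibly after widening/merging)
            if merged != latest.fields:
                bucket.versions.append(SchemaVersion(fields=merged, ts=time.time()))
            return bucket
    new_bucket = Bucket(versions=[SchemaVersion(fields=event_schema, ts=time.time())])
    buckets.append(new_bucket)                               # no match -> a new bucket ('table') is created
    return new_bucket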
The inferred schema, which I call the HNode (Hierarchical Node) schema, is my own custom representation (it mostly mimics the JSON structure). It can be converted into a Parquet schema, and with that Parquet schema the data is written out to Parquet. Whenever the schema evolves, an equivalent Parquet schema is created.
The metadata shown above is also consistently stored in a struct within each row; these become the 'consistent filter' attributes across all event types.
So overall the structure of the data could look like this:
-event_src_1
HNODE schema:
{
  "A": STRING,
  "B": [
    {
      "C": INTEGER,
      "D": STRING,
      "E": LONG
    }
  ]
}
parquet schema:
message root {
  optional binary A (UTF8);
  optional group B (LIST) {
    optional group list {
      optional group element {
        optional int32 C;
        optional binary D (UTF8);
        optional int64 E;
      }
    }
  }
}
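Just to show the mapping programmatically, here is the same example expressed with pyarrow. This is for illustration only; the actual HNode-to-Parquet conversion is custom, and the physical LIST field names pyarrow emits may differ slightly from the handwritten schema above.

import pyarrow as pa
import pyarrow.parquet as pq

# The HNODE example above, expressed as an equivalent Arrow schema.
schema = pa.schema([
    pa.field("A", pa.string()),
    pa.field("B", pa.list_(pa.struct([
        pa.field("C", pa.int32()),
        pa.field("D", pa.string()),
        pa.field("E", pa.int64()),
    ]))),
])

# Write a small batch of conforming rows using GZIP, as in the archive.
rows = [{"A": "abc", "B": [{"C": 1, "D": "x", "E": 123456789012}]}]
pq.write_table(pa.Table.from_pylist(rows, schema=schema),
               "part-00000.parquet", compression="gzip")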
So this project evolved from a simple log archival system into something that closely resembles a lakehouse. My idea was to make the raw data easily accessible to anyone and avoid a data swamp (the raw data is more valuable than the processed data).
This experiment was quite successful: by storing the data this way with GZIP compression, I achieved roughly a 93% size reduction. To put numbers on it, we ingest 2.1 TiB of raw JSON response data daily, and with this method we can store it in ~148 GiB in S3 (about a 14.5x reduction).
We also built a simple Python Lambda function (not the best solution, but it was just a POC) that understands how to filter over the partition structure in S3 (as shown above), then reads each file and filters on the metadata. Using this Lambda we can filter and extract about two hours' worth of data in roughly 2 to 3 minutes, which is quite amazing.
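Roughly, the Lambda does something like this (a simplified sketch: the bucket name, layout, and filter wiring are illustrative, and S3 list pagination is omitted):

import boto3
import pyarrow.dataset as ds
from pyarrow import fs

def extract(event_source: str, event_type: str, country: str, hours, row_filter):
    """Prune by listing only the matching partition prefixes, then apply a
    row-level filter (a pyarrow compute expression) on the metadata columns."""
    s3 = boto3.client("s3")
    bucket = "raw-events-archive"                                  # illustrative bucket name
    keys = []
    for hour in hours:                                             # e.g. ["2024-06-01-10", "2024-06-01-11"]
        prefix = (f"event_source={event_source}/event_type={event_type}/"
                  f"country={country}/event_hour={hour}/")
        resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
        keys += [obj["Key"] for obj in resp.get("Contents", [])]
    dataset = ds.dataset([f"{bucket}/{k}" for k in keys],
                         format="parquet", filesystem=fs.S3FileSystem())
    return dataset.to_table(filter=row_filter)

# usage (illustrative):
# extract("checkout-service", "order.created", "IN", ["2024-06-01-10"], ds.field("A") == "abc")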
This structure in many ways resembles Iceberg; however, when I checked with the Iceberg community, they said it would be difficult to integrate (I did not fully understand their explanation, but they said the kind of schema evolution I have would not work in Iceberg).
I wanted to check whether this method can be integrated with DuckLake, because I was already looking at building a metadata database for this, and amazingly DuckLake was released just a month ago.
So overall:
1. Is it possible to integrate the above system into DuckLake? (I am assuming I can think of a bucket_id as a table, my timestamped versions as schema evolution, and the schema as event_src_event_type, etc.; a rough sketch of this mapping is below.)
2. If DuckLake can't be used as-is, do you have suggestions or improvements for this system so that I can evolve it toward a DuckLake lakehouse architecture?
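Just to make question 1 concrete, here is roughly the mapping I have in mind. This is an unverified sketch: the ATTACH options are taken from the DuckLake announcement, the table and S3 names are made up, and it assumes S3 credentials are already configured.

import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# A DuckDB file as the metadata catalog, an S3 prefix as the data path (both are
# assumptions based on the DuckLake announcement; I have not verified the exact options).
con.execute("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 's3://raw-events-archive/ducklake/')")

# One DuckLake table per bucket; each of my timestamped schema versions would map to a
# DuckLake schema-evolution step (e.g. ALTER TABLE ... ADD COLUMN when new fields appear).
# Note: CTAS rewrites the data into DuckLake-managed files; whether my existing Parquet
# files can be registered in place instead is exactly what I am hoping to learn.
con.execute("""
    CREATE TABLE lake.checkout_service_order_created_in AS
    SELECT *
    FROM read_parquet('s3://raw-events-archive/event_source=checkout-service/event_type=order.created/country=IN/*/*.parquet')
""")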
Regards,
Nitish