
Commit 4690761

Merge pull request #229 from databrickslabs/feature/v0.0.10

- Added support for CDC multiple sequence columns (PR)
- Added custom function support for Kafka and Delta tables (PR)
- Updated project overview and features tables in docs and README
- Updated release notes and changelog

2 parents aa83f2b + 2aa7d94, commit 4690761

File tree

8 files changed: +73 -14 lines changed

- .github/workflows/onpush.yml
- CHANGELOG.md
- README.md
- docs/content/_index.md
- docs/content/faq/execution.md
- docs/content/releases/_index.md
- src/dataflow_pipeline.py
- tests/test_dataflow_pipeline.py

.github/workflows/onpush.yml

Lines changed: 4 additions & 1 deletion

@@ -61,15 +61,18 @@ jobs:
       - name: Run Unit Tests
         run: python -m coverage run -m pytest tests/ -v
 
+      - name: Generate coverage XML
+        run: python -m coverage xml -o coverage.xml
+
       - name: Publish test coverage
         if: startsWith(matrix.os,'ubuntu')
         uses: codecov/codecov-action@v3
         with:
           token: ${{ secrets.CODECOV_TOKEN }}
+          files: ./coverage.xml
           env_vars: OS,PYTHON
           fail_ci_if_error: true
           flags: unittests
           name: codecov-umbrella
-          path_to_write_report: ./coverage/codecov_report.txt
           verbose: true
 

CHANGELOG.md

Lines changed: 2 additions & 0 deletions

@@ -18,6 +18,8 @@
 - Fixed issue Silver Data Quality not working [PR](https://github.com/databrickslabs/dlt-meta/issues/156)
 - Fixed issue Removed DPM flag check inside dataflowpipeline [PR](https://github.com/databrickslabs/dlt-meta/issues/177)
 - Fixed issue Updated dlt-meta demos into Delta Live Tables Notebook github [PR](https://github.com/databrickslabs/dlt-meta/issues/158)
+- Added multiple sequence column support for the auto_cdc API [PR](https://github.com/databrickslabs/dlt-meta/pull/224)
+- Added support for custom transformations for Kafka/Delta sources [PR](https://github.com/databrickslabs/dlt-meta/pull/228)
 
 
 ## [v.0.0.9]

README.md

Lines changed: 3 additions & 1 deletion

@@ -21,6 +21,8 @@
 
 In practice, a single generic pipeline reads the Dataflowspec and uses it to orchestrate and run the necessary data processing workloads. This approach streamlines the development and management of data pipelines, allowing for a more efficient and scalable data processing workflow
 
+[Lakeflow Declarative Pipelines](https://www.databricks.com/product/data-engineering/lakeflow-declarative-pipelines) and `DLT-META` are designed to complement each other. [Lakeflow Declarative Pipelines](https://www.databricks.com/product/data-engineering/lakeflow-declarative-pipelines) provides a declarative, intent-driven foundation for building and managing data workflows, while DLT-META adds a configuration-driven layer that automates and scales pipeline creation. Combining the two lets teams move beyond manual coding and templatize and automate pipelines for any scale of modern data-driven business, with enterprise-level agility, governance, and efficiency.
+
 ### Components:
 
 #### Metadata Interface
@@ -45,7 +47,7 @@ In practice, a single generic pipeline reads the Dataflowspec and uses it to orc
 
 ![DLT-META Stages](./docs/static/images/dlt-meta_stages.png)
 
-## DLT-META Lakeflow Declarative Pipeline Features support
+## DLT-META `Lakeflow Declarative Pipelines` Features support
 | Features | DLT-META Support |
 | ------------- | ------------- |
 | Input data sources | Autoloader, Delta, Eventhub, Kafka, snapshot |

docs/content/_index.md

Lines changed: 12 additions & 5 deletions

@@ -7,10 +7,14 @@ draft: false
 
 
 ## Project Overview
-DLT-META is a metadata-driven framework designed to work with Databricks Lakeflow Declarative Pipelines . This framework enables the automation of bronze and silver data pipelines by leveraging metadata recorded in an onboarding JSON file. This file, known as the Dataflowspec, serves as the data flow specification, detailing the source and target metadata required for the pipelines.
+`DLT-META` is a metadata-driven framework designed to work with [Lakeflow Declarative Pipelines](https://www.databricks.com/product/data-engineering/lakeflow-declarative-pipelines). This framework enables the automation of bronze and silver data pipelines by leveraging metadata recorded in an onboarding JSON file. This file, known as the Dataflowspec, serves as the data flow specification, detailing the source and target metadata required for the pipelines.
 
 In practice, a single generic pipeline reads the Dataflowspec and uses it to orchestrate and run the necessary data processing workloads. This approach streamlines the development and management of data pipelines, allowing for a more efficient and scalable data processing workflow
 
+[Lakeflow Declarative Pipelines](https://www.databricks.com/product/data-engineering/lakeflow-declarative-pipelines) and `DLT-META` are designed to complement each other. [Lakeflow Declarative Pipelines](https://www.databricks.com/product/data-engineering/lakeflow-declarative-pipelines) provides a declarative, intent-driven foundation for building and managing data workflows, while DLT-META adds a configuration-driven layer that automates and scales pipeline creation. Combining the two lets teams move beyond manual coding and templatize and automate pipelines for any scale of modern data-driven business, with enterprise-level agility, governance, and efficiency.
+
+
+
 ### DLT-META components:
 
 #### Metadata Interface
@@ -40,7 +44,7 @@ In practice, a single generic pipeline reads the Dataflowspec and uses it to orc
 - Option#1: [DLT-META CLI](https://databrickslabs.github.io/dlt-meta/getting_started/dltmeta_cli/#dataflow-dlt-pipeline)
 - Option#2: [DLT-META MANUAL](https://databrickslabs.github.io/dlt-meta/getting_started/dltmeta_manual/#dataflow-dlt-pipeline)
 
-## DLT-META DLT Features support
+## DLT-META `Lakeflow Declarative Pipelines` Features support
 | Features | DLT-META Support |
 | ------------- | ------------- |
 | Input data sources | Autoloader, Delta, Eventhub, Kafka, snapshot |
@@ -50,11 +54,14 @@ In practice, a single generic pipeline reads the Dataflowspec and uses it to orc
 | Quarantine table support | Bronze layer |
 | [create_auto_cdc_flow](https://docs.databricks.com/aws/en/dlt-ref/dlt-python-ref-apply-changes) API support | Bronze, Silver layer |
 | [create_auto_cdc_from_snapshot_flow](https://docs.databricks.com/aws/en/dlt-ref/dlt-python-ref-apply-changes-from-snapshot) API support | Bronze layer|
-| [append_flow](https://docs.databricks.com/aws/en/dlt-ref/dlt-python-ref-append-flow) API support | Bronze layer|
-| Liquid cluster support | Bronze, Bronze Quarantine, Silver, Silver Quarantine tables|
+| [append_flow](https://docs.databricks.com/en/delta-live-tables/flows.html#use-append-flow-to-write-to-a-streaming-table-from-multiple-source-streams) API support | Bronze layer|
+| Liquid cluster support | Bronze, Bronze Quarantine, Silver tables|
 | [DLT-META CLI](https://databrickslabs.github.io/dlt-meta/getting_started/dltmeta_cli/) | ```databricks labs dlt-meta onboard```, ```databricks labs dlt-meta deploy``` |
 | Bronze and Silver pipeline chaining | Deploy dlt-meta pipeline with ```layer=bronze_silver``` option using default publishing mode |
-| [DLT Sinks](https://docs.databricks.com/aws/en/dlt/dlt-sinks) | Supported formats:external ```delta table```, ```kafka```.Bronze, Silver layers|
+| [create_sink](https://docs.databricks.com/aws/en/dlt-ref/dlt-python-ref-sink) API support | Supported formats: external ```delta table```, ```kafka```. Bronze, Silver layers |
+| [Databricks Asset Bundles](https://docs.databricks.com/aws/en/dev-tools/bundles/) | Supported |
+| [DLT-META UI](https://github.com/databrickslabs/dlt-meta/tree/main/lakehouse_app#dlt-meta-lakehouse-app-setup) | Uses Databricks Lakehouse DLT-META App |
+
 ## How much does it cost ?
 DLT-META does not have any **direct cost** associated with it other than the cost to run the Databricks Lakeflow Declarative Pipelines
 on your environment.The overall cost will be determined primarily by the [Databricks Lakeflow Declarative Pipelines Pricing] (https://www.databricks.com/product/pricing/lakeflow-declarative-pipelines)

docs/content/faq/execution.md

Lines changed: 11 additions & 1 deletion

@@ -114,7 +114,7 @@ When you launch Lakeflow Declarative Pipeline it will read silver onboarding and
    "keys":[
       "customer_id"
    ],
-   "sequence_by":"dmsTimestamp",
+   "sequence_by":"dmsTimestamp,enqueueTimestamp,sequenceId",
    "scd_type":"2",
    "apply_as_deletes":"Op = 'D'",
    "except_column_list":[
@@ -180,3 +180,13 @@ DLT-META have tag [source_metadata](https://github.com/databrickslabs/dlt-meta/b
 - `autoloader_metadata_col_name` if this provided then will be used to rename _metadata to this value otherwise default is `source_metadata`
 - `select_metadata_cols:{key:value}` will be used to extract columns from _metadata. key is target dataframe column name and value is expression used to add column from _metadata column
 
+**Q. After upgrading dlt-meta, why do Lakeflow Declarative Pipelines fail with the message “Materializing tables in custom schemas is not supported,” and how can this be fixed?**
+
+This failure happens because the pipeline was created using Legacy Publishing mode, which does not support saving tables with catalog or schema qualifiers (such as catalog.schema.table). As a result, using qualified table names leads to an error:
+
+``
+com.databricks.pipelines.common.errors.DLTAnalysisException: Materializing tables in custom schemas is not supported. Please remove the database qualifier from table 'catalog_name.schema_name.table_name'
+``
+
+To resolve this, migrate the pipeline to the default publishing mode by following Databricks’ guide: [Migrate to the default publishing mode](https://docs.databricks.com/aws/en/dlt/migrate-to-dpm#migrate-to-the-default-publishing-mode).
+

docs/content/releases/_index.md

Lines changed: 3 additions & 0 deletions

@@ -22,6 +22,8 @@ draft: false
 - Fixed issue Silver Data Quality not working [PR](https://github.com/databrickslabs/dlt-meta/issues/156)
 - Fixed issue Removed DPM flag check inside dataflowpipeline [PR](https://github.com/databrickslabs/dlt-meta/issues/177)
 - Fixed issue Updated dlt-meta demos into Delta Live Tables Notebook github [PR](https://github.com/databrickslabs/dlt-meta/issues/158)
+- Added multiple sequence column support for the auto_cdc API [PR](https://github.com/databrickslabs/dlt-meta/pull/224)
+- Added support for custom transformations for Kafka/Delta sources [PR](https://github.com/databrickslabs/dlt-meta/pull/228)
 
 # v0.0.9
 ## Enhancements
@@ -44,6 +46,7 @@ draft: false
 - Fixed issue DLT-META CLI should use pypi lib instead of whl : [PR](https://github.com/databrickslabs/dlt-meta/pull/79)
 - Fixed issue Onboarding with multiple partition columns errors out: [PR](https://github.com/databrickslabs/dlt-meta/pull/134)
 
+
 # v0.0.8
 ## Enhancements
 - Added dlt append_flow api support: [PR](https://github.com/databrickslabs/dlt-meta/pull/58)

src/dataflow_pipeline.py

Lines changed: 22 additions & 6 deletions

@@ -5,7 +5,7 @@
 import ast
 import dlt
 from pyspark.sql import DataFrame
-from pyspark.sql.functions import expr
+from pyspark.sql.functions import expr, struct
 from pyspark.sql.types import StructType, StructField
 from src.dataflow_spec import BronzeDataflowSpec, SilverDataflowSpec, DataflowSpecUtils
 from src.pipeline_writers import AppendFlowWriter, DLTSinkWriter
@@ -315,9 +315,9 @@ def read_bronze(self) -> DataFrame:
         if bronze_dataflow_spec.sourceFormat == "cloudFiles":
             input_df = pipeline_reader.read_dlt_cloud_files()
         elif bronze_dataflow_spec.sourceFormat == "delta" or bronze_dataflow_spec.sourceFormat == "snapshot":
-            return pipeline_reader.read_dlt_delta()
+            input_df = pipeline_reader.read_dlt_delta()
         elif bronze_dataflow_spec.sourceFormat == "eventhub" or bronze_dataflow_spec.sourceFormat == "kafka":
-            return pipeline_reader.read_kafka()
+            input_df = pipeline_reader.read_kafka()
         else:
             raise Exception(f"{bronze_dataflow_spec.sourceFormat} source format not supported")
         return self.apply_custom_transform_fun(input_df)
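With this change, Delta, snapshot, Eventhub, and Kafka bronze reads flow through `apply_custom_transform_fun` just as Autoloader reads already did, so a user-supplied transformation can reshape the raw stream before it is written. Below is a minimal sketch of what such a DataFrame-in/DataFrame-out callable might look like; the function name, payload schema, and column names are illustrative assumptions, and how the callable is registered with the pipeline follows the dlt-meta documentation rather than this sketch.

```python
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Assumed layout of the Kafka `value` payload (illustrative only).
PAYLOAD_SCHEMA = StructType([
    StructField("customer_id", StringType()),
    StructField("event_type", StringType()),
])


def parse_kafka_payload(df: DataFrame) -> DataFrame:
    """Decode the Kafka value bytes and flatten the JSON payload into columns."""
    return (
        df.withColumn("payload", from_json(col("value").cast("string"), PAYLOAD_SCHEMA))
        .select("key", "timestamp", "payload.*")
    )
```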
@@ -630,11 +630,18 @@ def cdc_apply_changes(self):
         target_table = (
             f"{target_cl_name}{target_db_name}.{target_table_name}"
         )
+
+        # Handle comma-separated sequence columns using struct
+        sequence_by = cdc_apply_changes.sequence_by
+        if ',' in sequence_by:
+            sequence_cols = [col.strip() for col in sequence_by.split(',')]
+            sequence_by = struct(*sequence_cols)  # Use struct() from pyspark.sql.functions
+
         dlt.create_auto_cdc_flow(
             target=target_table,
             source=self.view_name,
             keys=cdc_apply_changes.keys,
-            sequence_by=cdc_apply_changes.sequence_by,
+            sequence_by=sequence_by,
             where=cdc_apply_changes.where,
             ignore_null_updates=cdc_apply_changes.ignore_null_updates,
             apply_as_deletes=apply_as_deletes,
@@ -673,8 +680,17 @@ def modify_schema_for_cdc_changes(self, cdc_apply_changes):
             for field in struct_schema.fields:
                 if field.name not in cdc_apply_changes.except_column_list:
                     modified_schema.add(field)
-                if field.name == cdc_apply_changes.sequence_by:
-                    sequenced_by_data_type = field.dataType
+                # For SCD Type 2, get data type of first sequence column
+                sequence_by = cdc_apply_changes.sequence_by.strip()
+                if ',' not in sequence_by:
+                    # Single column sequence
+                    if field.name == sequence_by:
+                        sequenced_by_data_type = field.dataType
+                else:
+                    # Multiple column sequence - use first column's type
+                    first_sequence_col = sequence_by.split(',')[0].strip()
+                    if field.name == first_sequence_col:
+                        sequenced_by_data_type = field.dataType
             struct_schema = modified_schema
         else:
             raise Exception(f"Schema is None for {self.dataflowSpec} for cdc_apply_changes! ")

tests/test_dataflow_pipeline.py

Lines changed: 16 additions & 0 deletions

@@ -1363,6 +1363,22 @@ def test_write_bronze_cdc_apply_changes(self, mock_cdc_apply_changes):
         pipeline.write_bronze()
         assert mock_cdc_apply_changes.called
 
+    @patch.object(DataflowPipeline, 'cdc_apply_changes', return_value=None)
+    def test_write_bronze_cdc_apply_changes_multiple_sequence(self, mock_cdc_apply_changes):
+        """Test write_bronze with CDC apply changes using multiple sequence columns."""
+        bronze_dataflow_spec = BronzeDataflowSpec(**self.bronze_dataflow_spec_map)
+        bronze_dataflow_spec.cdcApplyChanges = json.dumps({
+            "keys": ["id"],
+            "sequence_by": "event_timestamp, enqueue_timestamp, sequence_id",
+            "scd_type": "1",
+            "apply_as_deletes": "operation = 'DELETE'",
+            "except_column_list": ["operation", "event_timestamp", "enqueue_timestamp", "sequence_id", "_rescued_data"]
+        })
+        view_name = f"{bronze_dataflow_spec.targetDetails['table']}_inputview"
+        pipeline = DataflowPipeline(self.spark, bronze_dataflow_spec, view_name, None)
+        pipeline.write_bronze()
+        assert mock_cdc_apply_changes.called
+
     @patch('pyspark.sql.SparkSession.readStream')
     def test_get_silver_schema_uc_enabled(self, mock_read_stream):
         """Test get_silver_schema with Unity Catalog enabled."""
