This repository is structured to support a dynamic and production-ready ETL (Extract, Transform, Load) solution using Azure Data Factory (ADF).
The pipelines developed are capable of handling complex real-world data engineering challenges, including:
- Modern Incremental Loading: Utilizing Change Data Capture (CDC) logic, often stored in JSON files, to track the last loaded timestamp. This ensures only new or updated records are processed, saving compute resources.
- Backdated Refresh: Pipelines are designed to handle bulk loads and incremental loads, and they also allow users to provide a custom historical date parameter to refresh data from a specific point in the past (backdated refresh).
- Dynamic REST API Data Extraction: Fetching data from external APIs, complete with complex rules for handling pagination (such as using Absolute URL or Query Parameters like
offsetandlimit). - Dynamic Schema Mapping: Handling multiple source files with potentially varying schemas using dynamic mappings provided via nested IF conditions and parameters, eliminating the need for creating a unique copy activity for every single table.
- Data Routing/Switch Logic: Creating "Router Pipelines" that use the
Switchactivity within aFor Eachloop to distribute incoming files to different destination folders based on file name or type. - Upsert with Delta Lake: Using Data Flows to transform data and write it directly into the Delta Lake format, applying upsert logic (Update + Insert) using a primary merge key (e.g.,
location ID). - Ad Hoc Event-Driven Triggers: Building pipelines that remain dormant and only execute when a specific external condition is met (e.g., a manager uploads a designated trigger file), leveraging the
Validationactivity and subsequentDeleteactivity to manage the trigger file state.
| Component Category | Artifacts Found | Purpose in ETL |
|---|---|---|
| Pipelines | pipeline |
Orchestrate activities, control data flow, and implement ETL logic. |
| Data Interaction | dataflow |
Used for defining transformations (T) via Spark compute, often used for moving data into Silver/Gold layers. |
| Data Definition | dataset |
Defines the structure and format of the data (e.g., JSON, Parquet, CSV) and can be made dynamic using parameters. |
| Connectivity | linkedService |
Securely stores connection information (e.g., credentials, hostnames) for sources like Azure SQL Database or Data Lake Storage Gen 2. |
| Configuration | factory, publish_config.json |
Configuration and deployment metadata for the ADF instance. |
| Scheduling | trigger |
Defines how and when pipelines run (e.g., scheduled, event-based). |
Pipelines leverage a variety of activities for sophisticated control flow:
- Data Movement:
Copy Data(for data migration). - Data Transformation (SQL/DB):
Script(to run DDL/DML commands likeCOUNTorMAX). - Metadata/Lookup:
Lookup(to fetch control data like the last CDC timestamp from a JSON),Get Metadata(to inspect files/folders, returning items as an array). - Control Flow:
If Condition(to conditionally run activities, e.g., only load data if the record count > 0),For Each(to iterate over arrays, enabling parallelism),Switch(for complex routing based on conditions). - Monitoring/External Calls:
Web(for making HTTP API calls, often used for sending external alerts via Logic Apps),Execute Pipeline(for creating modular, scheduled parent pipelines).
The ETL process uses both Copy Activities for efficient bulk movement and Data Flows for intensive transformation:
- Incremental Load Logic: The system uses
Lookupto read the last loaded timestamp from a control file (e.g.,cdc.json) stored in Azure Data Lake. This timestamp is then used dynamically within theCopy Datasource query (e.g.,WHERE last_updated >= [CDC timestamp]). After a successful load, the pipeline updates the control file with the new maximum timestamp found in the source. - Transformation (Data Flow): Data flows are powered by an underlying Spark cluster. They enable transformations through low-code/no-code methods, including
Derived Column(for standardizing data, e.g., converting text to UPPERCASE),Select(for column removal or renaming), andAlter Row(essential for setting Upsert conditions before sinking to Delta Lake).
A core principle of this project is the use of dynamic expressions and parameters to make pipelines reusable and production-ready.
| Parameter Example | Use Case | Implementation Detail |
|---|---|---|
Schema / Table |
Reusable Injection | Passed into the Copy Data activity's source query to dynamically select the target table (e.g., SELECT * FROM @{pipeline().parameters.Schema}.@{pipeline().parameters.Table}). |
Back Date |
Historical Refresh | An optional parameter used in an IF expression to override the standard CDC timestamp if the user needs to reload data from a custom historical date. |
- Dynamic Datasets: Datasets are defined with parameters (e.g.,
P container,P folder,P file) so the same dataset definition can be reused across multiple activities or files by passing different values dynamically. - Dynamic Mappings: Custom JSON objects containing schema mapping definitions are stored as pipeline parameters or external files. These are selected and injected into the
Copy Dataactivity'smappingproperty using nested IF conditions based on the incoming file name (e.g.,if(equals(item().name, 'customers.csv'), pipeline().parameters.customerSchema, ...)). - System Variables: System variables like
@pipeline().Pipelineand@pipeline().RunIdare used dynamically in theWebactivity payload to send specific execution details when configuring alerts via Logic Apps.
The repository structure implies a connection to a Git provider, supporting collaborative development and robust deployment practices.
- Version Control: The ADF instance is connected to a Git repository (e.g., GitHub or Azure DevOps).
- Branching Strategy: Developers work in isolated feature branches (e.g.,
anbranch) before merging changes into themainbranch via a Pull Request (PR). - Deployment Artifacts: Publishing from the
mainbranch automatically generates theADF_publishbranch, which contains deployable ARM templates (YAML/JSON templates) for use by DevOps engineers to deploy the changes across different environments (QA, Production).