Skip to content

TamilvananK/ADF-ETL-Project

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ADF-ETL-Project

This repository is structured to support a dynamic and production-ready ETL (Extract, Transform, Load) solution using Azure Data Factory (ADF).

df1

💡 Key Pipeline Scenarios Implemented

The pipelines developed are capable of handling complex real-world data engineering challenges, including:

  • Modern Incremental Loading: Utilizing Change Data Capture (CDC) logic, often stored in JSON files, to track the last loaded timestamp. This ensures only new or updated records are processed, saving compute resources.
  • Backdated Refresh: Pipelines are designed to handle bulk loads and incremental loads, and they also allow users to provide a custom historical date parameter to refresh data from a specific point in the past (backdated refresh).
  • Dynamic REST API Data Extraction: Fetching data from external APIs, complete with complex rules for handling pagination (such as using Absolute URL or Query Parameters like offset and limit).
  • Dynamic Schema Mapping: Handling multiple source files with potentially varying schemas using dynamic mappings provided via nested IF conditions and parameters, eliminating the need for creating a unique copy activity for every single table.
  • Data Routing/Switch Logic: Creating "Router Pipelines" that use the Switch activity within a For Each loop to distribute incoming files to different destination folders based on file name or type.
  • Upsert with Delta Lake: Using Data Flows to transform data and write it directly into the Delta Lake format, applying upsert logic (Update + Insert) using a primary merge key (e.g., location ID).
  • Ad Hoc Event-Driven Triggers: Building pipelines that remain dormant and only execute when a specific external condition is met (e.g., a manager uploads a designated trigger file), leveraging the Validation activity and subsequent Delete activity to manage the trigger file state.

router main

⚙️ Available Components and Artifacts

Component Category Artifacts Found Purpose in ETL
Pipelines pipeline Orchestrate activities, control data flow, and implement ETL logic.
Data Interaction dataflow Used for defining transformations (T) via Spark compute, often used for moving data into Silver/Gold layers.
Data Definition dataset Defines the structure and format of the data (e.g., JSON, Parquet, CSV) and can be made dynamic using parameters.
Connectivity linkedService Securely stores connection information (e.g., credentials, hostnames) for sources like Azure SQL Database or Data Lake Storage Gen 2.
Configuration factory, publish_config.json Configuration and deployment metadata for the ADF instance.
Scheduling trigger Defines how and when pipelines run (e.g., scheduled, event-based).

Key Activities Used in Pipelines 🧩

Pipelines leverage a variety of activities for sophisticated control flow:

  • Data Movement: Copy Data (for data migration).
  • Data Transformation (SQL/DB): Script (to run DDL/DML commands like COUNT or MAX).
  • Metadata/Lookup: Lookup (to fetch control data like the last CDC timestamp from a JSON), Get Metadata (to inspect files/folders, returning items as an array).
  • Control Flow: If Condition (to conditionally run activities, e.g., only load data if the record count > 0), For Each (to iterate over arrays, enabling parallelism), Switch (for complex routing based on conditions).
  • Monitoring/External Calls: Web (for making HTTP API calls, often used for sending external alerts via Logic Apps), Execute Pipeline (for creating modular, scheduled parent pipelines).

dataflow

🌊 Data Flow Mechanisms

The ETL process uses both Copy Activities for efficient bulk movement and Data Flows for intensive transformation:

  1. Incremental Load Logic: The system uses Lookup to read the last loaded timestamp from a control file (e.g., cdc.json) stored in Azure Data Lake. This timestamp is then used dynamically within the Copy Data source query (e.g., WHERE last_updated >= [CDC timestamp]). After a successful load, the pipeline updates the control file with the new maximum timestamp found in the source.
  2. Transformation (Data Flow): Data flows are powered by an underlying Spark cluster. They enable transformations through low-code/no-code methods, including Derived Column (for standardizing data, e.g., converting text to UPPERCASE), Select (for column removal or renaming), and Alter Row (essential for setting Upsert conditions before sinking to Delta Lake).

df2

🔗 Dynamic Variables and Parameterization

A core principle of this project is the use of dynamic expressions and parameters to make pipelines reusable and production-ready.

Pipeline Parameters

Parameter Example Use Case Implementation Detail
Schema / Table Reusable Injection Passed into the Copy Data activity's source query to dynamically select the target table (e.g., SELECT * FROM @{pipeline().parameters.Schema}.@{pipeline().parameters.Table}).
Back Date Historical Refresh An optional parameter used in an IF expression to override the standard CDC timestamp if the user needs to reload data from a custom historical date.

Component Parameters and Variables

  • Dynamic Datasets: Datasets are defined with parameters (e.g., P container, P folder, P file) so the same dataset definition can be reused across multiple activities or files by passing different values dynamically.
  • Dynamic Mappings: Custom JSON objects containing schema mapping definitions are stored as pipeline parameters or external files. These are selected and injected into the Copy Data activity's mapping property using nested IF conditions based on the incoming file name (e.g., if(equals(item().name, 'customers.csv'), pipeline().parameters.customerSchema, ...)).
  • System Variables: System variables like @pipeline().Pipeline and @pipeline().RunId are used dynamically in the Web activity payload to send specific execution details when configuring alerts via Logic Apps.

🌐 CI/CD Integration

The repository structure implies a connection to a Git provider, supporting collaborative development and robust deployment practices.

  • Version Control: The ADF instance is connected to a Git repository (e.g., GitHub or Azure DevOps).
  • Branching Strategy: Developers work in isolated feature branches (e.g., an branch) before merging changes into the main branch via a Pull Request (PR).
  • Deployment Artifacts: Publishing from the main branch automatically generates the ADF_publish branch, which contains deployable ARM templates (YAML/JSON templates) for use by DevOps engineers to deploy the changes across different environments (QA, Production).

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors