tap-azure-cloud-storage

A Singer tap for extracting data from Azure Blob Storage (Azure Cloud Storage).

Overview

tap-azure-cloud-storage is a Singer tap that extracts data from Azure Blob Storage containers. It reads CSV, JSON-L, Parquet, Avro, and Excel files, including Gzip- and Zip-compressed ones, and supports both full-table and incremental replication.

Features

Supported File Formats

Delimited text files: CSV, TSV, PSV (pipe-separated), and custom delimiters
JSON: JSON-L (newline-delimited JSON)
Parquet: Apache Parquet columnar format
Avro: Apache Avro format
Excel: .xlsx files
Compressed files: Gzip (.gz) and Zip (.zip) archives

Authentication Methods

Service Principal: Using Azure AD application credentials (recommended for production)
Connection String: Direct connection string authentication
Account Key: Storage account key authentication
Managed Identity: DefaultAzureCredential for Azure-hosted environments

Key Capabilities

Flexible file selection: Specify folder paths and use regex patterns to match files
Multiple tables: Define multiple table configurations within a single connection
Incremental replication: Support for incremental sync based on file modification time
Primary key support: Define primary keys for incremental replication
DateTime field detection: Automatic detection and handling of datetime fields
Schema inference: Automatic schema detection from file contents
UTF-8 encoding: Full UTF-8 support
Compression: Automatic handling of Gzip (.gz) and Zip (.zip) compressed files

Installation

Prerequisites

Python 3.7 or higher
pip

Install from source

git clone <repository-url>
cd tap-azure-cloud-storage
pip install -e .

Configuration

Required Configuration

Create a config.json file with the following structure:

{
  "storage_account_name": "your_storage_account",
  "container_name": "your_container_name",
  "start_date": "2024-01-01T00:00:00Z",
  "tables": [
    {
      "table_name": "my_table",
      "search_pattern": ".*\\.csv$",
      "search_prefix": "path/to/files/",
      "key_properties": ["id"],
      "date_overrides": ["created_at", "updated_at"],
      "delimiter": ","
    }
  ]
}
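
Before running the tap, it can help to check that a config file contains every key this README marks as required. The following is a minimal sketch, not part of the tap: the function name `find_missing_keys` and the two key tuples are hypothetical and simply mirror the "Required" columns in the Configuration Parameters and Table Configuration tables.

```python
import json

# Keys marked "Required" in this README (top level and per table).
REQUIRED_TOP_LEVEL = ("storage_account_name", "container_name", "start_date", "tables")
REQUIRED_PER_TABLE = ("table_name", "search_pattern", "key_properties")


def find_missing_keys(config):
    """Return the names of required keys that are absent from the config."""
    problems = [key for key in REQUIRED_TOP_LEVEL if key not in config]
    for index, table in enumerate(config.get("tables", [])):
        for key in REQUIRED_PER_TABLE:
            if key not in table:
                problems.append(f"tables[{index}].{key}")
    return problems


# A deliberately incomplete config: no start_date, no key_properties.
example = json.loads(r"""
{
  "storage_account_name": "your_storage_account",
  "container_name": "your_container_name",
  "tables": [{"table_name": "my_table", "search_pattern": ".*\\.csv$"}]
}
""")
print(find_missing_keys(example))  # ['start_date', 'tables[0].key_properties']
```

An empty list means every required key is present; it does not validate the values themselves.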

Authentication Configuration

Method 1: Service Principal (Recommended for Production)

Add these fields to your config.json:

{
  "storage_account_name": "your_storage_account",
  "tenant_id": "your-tenant-id",
  "client_id": "your-client-id",
  "client_secret": "your-client-secret",
  ...
}

Setting up Service Principal:

Create an Azure AD Application:

az ad app create --display-name "tap-azure-storage-app"

Create a Service Principal:
```
az ad sp create --id <application-id>
```

Assign Storage Blob Data Reader role:

az role assignment create \
  --role "Storage Blob Data Reader" \
  --assignee <service-principal-id> \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"

Create a client secret:

az ad app credential reset --id <application-id>

Method 2: Connection String

{
  "storage_account_name": "your_storage_account",
  "connection_string": "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net",
  ...
}

Method 3: Account Key

{
  "storage_account_name": "your_storage_account",
  "account_key": "your-account-key",
  ...
}

Method 4: Managed Identity

For Azure-hosted environments (Azure VMs, Azure Functions, etc.), omit the credential fields. The tap then authenticates through DefaultAzureCredential:

{
  "storage_account_name": "your_storage_account",
  ...
}
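
The four methods differ only in which credential keys appear in config.json. As a rough illustration, the selection could look like the sketch below. The function name `choose_auth_method` is hypothetical, and the order of precedence when several credentials are present is an assumption; the tap's actual order may differ.

```python
def choose_auth_method(config):
    """Name the authentication method implied by the keys in the config.

    The precedence used here (service principal, then connection string,
    then account key) is an assumption made for this sketch.
    """
    if all(config.get(key) for key in ("tenant_id", "client_id", "client_secret")):
        return "service_principal"
    if config.get("connection_string"):
        return "connection_string"
    if config.get("account_key"):
        return "account_key"
    # No explicit credentials: rely on DefaultAzureCredential (Managed Identity).
    return "managed_identity"


print(choose_auth_method({"storage_account_name": "acct", "account_key": "your-account-key"}))
print(choose_auth_method({"storage_account_name": "acct"}))
```

Note that a partial Service Principal config (for example, `tenant_id` without `client_secret`) falls through to the next method in this sketch rather than raising an error.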

Configuration Parameters

| Parameter | Required | Description |
|---|---|---|
| `storage_account_name` | Yes | Azure Storage account name |
| `container_name` | Yes | Azure Blob Storage container name |
| `start_date` | Yes | Start date for incremental sync (ISO 8601 format) |
| `tables` | Yes | Array of table configurations (see below) |
| `tenant_id` | No | Azure AD tenant ID (for Service Principal auth) |
| `client_id` | No | Azure AD application client ID (for Service Principal auth) |
| `client_secret` | No | Azure AD application client secret (for Service Principal auth) |
| `connection_string` | No | Azure Storage connection string |
| `account_key` | No | Azure Storage account key |
| `root_path` | No | Root path prefix within the container |

Table Configuration

Each table in the tables array supports the following parameters:

| Parameter | Required | Description |
|---|---|---|
| `table_name` | Yes | Name of the table/stream |
| `search_pattern` | Yes | Regex pattern to match file names (e.g., `".*\\.csv$"`) |
| `search_prefix` | No | Folder path prefix to narrow file search |
| `key_properties` | Yes | Array of field names to use as primary keys (use `[]` for full table sync) |
| `date_overrides` | No | Array of field names to treat as datetime fields |
| `delimiter` | No | Custom delimiter for CSV files (auto-detected for .tsv, .psv) |
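
To see how `search_prefix` and `search_pattern` combine, the sketch below filters a list of blob names locally. It assumes the prefix is a plain "starts with" test and that the regex is applied with `re.search` against the full blob path; the tap's exact matching rules are not documented here, so treat both as assumptions. The function name `match_files` is hypothetical.

```python
import re


def match_files(blob_names, search_pattern, search_prefix=""):
    """Keep blob names that start with the prefix and match the regex."""
    pattern = re.compile(search_pattern)
    return [
        name
        for name in blob_names
        if name.startswith(search_prefix) and pattern.search(name)
    ]


names = [
    "exports/customers/customers_2024.csv",
    "exports/customers/readme.txt",
    "exports/orders/orders_01.parquet",
    "archive/customers_old.csv",
]
print(match_files(names, r"customers.*\.csv$", "exports/customers/"))
# ['exports/customers/customers_2024.csv']
```

In config.json the backslash must be doubled (`"customers.*\\.csv$"`) because JSON strings treat `\` as an escape character.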

Example Configurations

Multiple Tables with Different Formats

{
  "storage_account_name": "myaccount",
  "container_name": "data",
  "tenant_id": "...",
  "client_id": "...",
  "client_secret": "...",
  "start_date": "2024-01-01T00:00:00Z",
  "tables": [
    {
      "table_name": "customers",
      "search_pattern": "customers.*\\.csv$",
      "search_prefix": "exports/customers/",
      "key_properties": ["customer_id"],
      "date_overrides": ["created_at"],
      "delimiter": ","
    },
    {
      "table_name": "orders",
      "search_pattern": "orders.*\\.parquet$",
      "search_prefix": "exports/orders/",
      "key_properties": ["order_id"],
      "date_overrides": ["order_date"]
    },
    {
      "table_name": "products",
      "search_pattern": "products.*\\.jsonl$",
      "search_prefix": "exports/products/",
      "key_properties": [],
      "date_overrides": []
    }
  ]
}

TSV Files with Custom Settings

{
  "storage_account_name": "myaccount",
  "container_name": "data",
  "account_key": "...",
  "start_date": "2024-01-01T00:00:00Z",
  "tables": [
    {
      "table_name": "events",
      "search_pattern": "events_.*\\.tsv$",
      "search_prefix": "logs/",
      "key_properties": ["event_id"],
      "date_overrides": ["event_timestamp"],
      "delimiter": "\t"
    }
  ]
}

Usage

Discovery Mode

Discover the schema of your data:

tap-azure-cloud-storage --config config.json --discover > catalog.json

Select Streams

Edit the catalog.json to select streams and fields:

{
  "streams": [
    {
      "tap_stream_id": "my_table",
      "stream": "my_table",
      "schema": {...},
      "metadata": [
        {
          "breadcrumb": [],
          "metadata": {
            "selected": true,
            "table-key-properties": ["id"]
          }
        },
        ...
      ]
    }
  ]
}
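
Editing catalog.json by hand is error-prone when there are many streams. The sketch below sets `"selected": true` on the top-level metadata entry (the one with an empty `breadcrumb`) of each named stream. The helper name `select_streams` is hypothetical; the catalog shape follows the example above.

```python
import json


def select_streams(catalog, stream_names):
    """Mark the named streams as selected in a Singer catalog dict."""
    for stream in catalog.get("streams", []):
        if stream.get("tap_stream_id") not in stream_names:
            continue
        for entry in stream.get("metadata", []):
            # The stream-level entry is the one with an empty breadcrumb.
            if entry.get("breadcrumb") == []:
                entry.setdefault("metadata", {})["selected"] = True
    return catalog


# In-memory demo; in practice, load and rewrite catalog.json with json.load/json.dump.
catalog = {
    "streams": [
        {
            "tap_stream_id": "my_table",
            "stream": "my_table",
            "schema": {},
            "metadata": [
                {"breadcrumb": [], "metadata": {"table-key-properties": ["id"]}}
            ],
        }
    ]
}
select_streams(catalog, {"my_table"})
print(json.dumps(catalog["streams"][0]["metadata"][0]["metadata"]))
```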

Sync Data

Run the tap to extract data:

tap-azure-cloud-storage --config config.json --catalog catalog.json

With a Singer Target

Pipe data to a Singer target:

tap-azure-cloud-storage --config config.json --catalog catalog.json | target-jsonl > output.jsonl
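
Singer taps write one JSON message per line to standard output, each with a `type` of `SCHEMA`, `RECORD`, or `STATE`. For a quick sanity check without a target, the sketch below counts `RECORD` messages per stream. The function name `count_records` and the sample lines are illustrative only.

```python
import json
from collections import Counter


def count_records(lines):
    """Count Singer RECORD messages per stream from an iterable of lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        if message.get("type") == "RECORD":
            counts[message["stream"]] += 1
    return dict(counts)


sample_output = [
    '{"type": "SCHEMA", "stream": "my_table", "schema": {}, "key_properties": ["id"]}',
    '{"type": "RECORD", "stream": "my_table", "record": {"id": 1}}',
    '{"type": "RECORD", "stream": "my_table", "record": {"id": 2}}',
    '{"type": "STATE", "value": {}}',
]
print(count_records(sample_output))  # {'my_table': 2}
```

To use it on real output, pass `sys.stdin` as `lines` and pipe the tap into the script.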

Incremental Sync with State

tap-azure-cloud-storage --config config.json --catalog catalog.json --state state.json | target-jsonl > output.jsonl

Replication Methods

Full Table Replication

When key_properties is empty ([]), the tap performs a full table replication on every run, extracting all matched files.

{
  "table_name": "my_table",
  "search_pattern": ".*\\.csv$",
  "key_properties": []
}

Incremental Replication

When key_properties contains field names, the tap performs incremental replication based on file modification time. Only files modified since the last sync are processed.

{
  "table_name": "my_table",
  "search_pattern": ".*\\.csv$",
  "key_properties": ["id"]
}
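
The incremental logic described above amounts to comparing each file's last-modified time with a bookmark (the `start_date` on the first run, the saved state afterwards). The sketch below illustrates that comparison. The names `parse_utc` and `files_to_sync` are hypothetical, and whether the tap uses a strict "greater than" comparison is an assumption.

```python
from datetime import datetime, timezone


def parse_utc(value):
    """Parse an ISO 8601 timestamp such as "2024-01-01T00:00:00Z"."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def files_to_sync(files, bookmark):
    """Return (name, modified) pairs modified after the bookmark, oldest first."""
    cutoff = parse_utc(bookmark)
    newer = [(name, modified) for name, modified in files if modified > cutoff]
    return sorted(newer, key=lambda item: item[1])


files = [
    ("a.csv", datetime(2023, 12, 31, tzinfo=timezone.utc)),
    ("b.csv", datetime(2024, 1, 2, tzinfo=timezone.utc)),
    ("c.csv", datetime(2024, 1, 1, 12, tzinfo=timezone.utc)),
]
print([name for name, _ in files_to_sync(files, "2024-01-01T00:00:00Z")])
# ['c.csv', 'b.csv']
```

Processing files oldest first lets the bookmark advance monotonically, so an interrupted run can resume from the last file that completed.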

Metadata Columns

The tap automatically adds metadata columns to each record:

_sdc_source_container: Azure Blob Storage container name
_sdc_source_file: Full path to the source file
_sdc_source_lineno: Line number within the source file
_sdc_extra: Extra fields not defined in the schema (for JSON-L files)
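
As an illustration of what an emitted record might look like, the sketch below attaches the metadata columns to one parsed row. The helper name `build_record` is hypothetical, and storing the leftover fields in `_sdc_extra` as a single dict is an assumption; the tap may use a different shape.

```python
def build_record(row, schema_fields, container, file_path, line_number):
    """Attach the _sdc_* columns to one parsed row.

    Fields not named in schema_fields are moved into _sdc_extra.
    Representing them as one dict is an assumption of this sketch.
    """
    record = {key: value for key, value in row.items() if key in schema_fields}
    extra = {key: value for key, value in row.items() if key not in schema_fields}
    record["_sdc_source_container"] = container
    record["_sdc_source_file"] = file_path
    record["_sdc_source_lineno"] = line_number
    if extra:
        record["_sdc_extra"] = extra
    return record


row = {"id": 1, "name": "widget", "color": "red"}
print(build_record(row, {"id", "name"}, "data", "exports/products/products.jsonl", 2))
```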

File Format Specifics

CSV/TSV/PSV Files

Automatic delimiter detection based on file extension
Custom delimiter support via delimiter parameter
Header row required
UTF-8 encoding
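
The delimiter rules above can be sketched with the standard `csv` module. In this sketch an explicit `delimiter` setting takes priority over the file extension, which is an assumption; the names `pick_delimiter` and `read_rows` are hypothetical.

```python
import csv
import io

# Extension-based defaults described above; anything else falls back to a comma.
EXTENSION_DELIMITERS = {".tsv": "\t", ".psv": "|"}


def pick_delimiter(file_name, configured=None):
    """Choose a delimiter: explicit setting first, then extension, then comma."""
    if configured:
        return configured
    for extension, delimiter in EXTENSION_DELIMITERS.items():
        if file_name.lower().endswith(extension):
            return delimiter
    return ","


def read_rows(text, file_name, configured=None):
    """Parse delimited text with a header row into a list of dicts."""
    delimiter = pick_delimiter(file_name, configured)
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


print(read_rows("event_id|name\n1|login\n", "logs/events.psv"))
# [{'event_id': '1', 'name': 'login'}]
```

All values come back as strings; type inference and `date_overrides` handling would be applied afterwards.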

JSON-L Files

One JSON object per line
Automatic schema inference
Supports nested objects and arrays
Extra fields stored in _sdc_extra column
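
To give a feel for schema inference on JSON-L, the sketch below collects the JSON types seen for each top-level field across all lines. It is a simplified illustration, not the tap's algorithm: it does not descend into nested objects, and the names `json_type` and `infer_schema` are hypothetical.

```python
import json

# bool must be checked before int because bool is a subclass of int in Python.
JSON_TYPE_NAMES = [
    (bool, "boolean"),
    (int, "integer"),
    (float, "number"),
    (str, "string"),
    (dict, "object"),
    (list, "array"),
]


def json_type(value):
    """Map a decoded JSON value to its JSON Schema type name."""
    if value is None:
        return "null"
    for python_type, name in JSON_TYPE_NAMES:
        if isinstance(value, python_type):
            return name
    raise TypeError(f"unsupported value: {value!r}")


def infer_schema(jsonl_text):
    """Build a flat JSON Schema from newline-delimited JSON objects."""
    seen = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        for field, value in json.loads(line).items():
            seen.setdefault(field, set()).add(json_type(value))
    properties = {field: {"type": sorted(types)} for field, types in seen.items()}
    return {"type": "object", "properties": properties}


sample = '{"id": 1, "name": "a"}\n{"id": 2, "name": null, "tags": ["x"]}\n'
print(json.dumps(infer_schema(sample)["properties"]))
# {"id": {"type": ["integer"]}, "name": {"type": ["null", "string"]}, "tags": {"type": ["array"]}}
```

A field that appears with different types in different files ends up with a union type, which is one reason inconsistent files lead to surprising schemas.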

Parquet Files

Schema read directly from Parquet metadata
Efficient columnar reading
Supports nested structures

Avro Files

Schema read from Avro file metadata
Supports complex types
Efficient binary format

Excel Files (.xlsx)

First row treated as headers
Multiple worksheets supported: each sheet is processed, and its name is appended to _sdc_source_file
Automatic type inference

Compressed Files

Gzip (.gz): Automatically decompressed
Zip (.zip): All contained files processed
Nested compression not supported
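
The behaviour listed above can be approximated with the standard `gzip` and `zipfile` modules. The sketch below takes a blob name and its raw bytes and yields the contained files, without unpacking archives inside archives. The function name `extract_members` is hypothetical, and detecting the format by file extension is an assumption.

```python
import gzip
import io
import zipfile


def extract_members(file_name, data):
    """Yield (member_name, raw_bytes) for a blob, decompressing one level only."""
    if file_name.endswith(".gz"):
        # Strip the .gz suffix so the inner extension drives format detection.
        yield file_name[:-3], gzip.decompress(data)
    elif file_name.endswith(".zip"):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for info in archive.infolist():
                if not info.is_dir():
                    yield info.filename, archive.read(info)
    else:
        yield file_name, data


compressed = gzip.compress(b"id,name\n1,a\n")
for name, payload in extract_members("exports/customers.csv.gz", compressed):
    print(name, payload)
```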

Permissions and Security

Minimum Required Permissions

For production use with Service Principal authentication, assign the Storage Blob Data Reader role. The scope below limits access to a single container:

az role assignment create \
  --role "Storage Blob Data Reader" \
  --assignee <service-principal-id> \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>/blobServices/default/containers/<container>"

Recommended Security Practices

Use Service Principal authentication in production
Apply the principle of least privilege: grant read access only to the specific containers the tap needs
Rotate credentials regularly
Use Azure Key Vault to store secrets
Enable Azure Storage logging for audit trails
Use private endpoints when possible to avoid public internet exposure

Troubleshooting

Connection Issues

Problem: "Failed to connect to Azure"

Solutions:

Verify storage_account_name is correct
Check authentication credentials (tenant_id, client_id, client_secret)
Ensure Service Principal has proper permissions
Verify network connectivity to Azure

Authentication Issues

Problem: "Authentication failed"

Solutions:

For Service Principal: Verify client_secret is correct and not expired
Check that the Service Principal has "Storage Blob Data Reader" role
Ensure tenant_id and client_id are correct

File Discovery Issues

Problem: "No objects matched for table"

Solutions:

Check search_pattern regex syntax
Verify search_prefix path exists in container
Ensure files have correct extensions
Check file modification dates against start_date

Schema Detection Issues

Problem: Schema not detected correctly

Solutions:

Ensure files have consistent structure
Check file encoding (must be UTF-8)
For CSV files, verify headers are present
Increase sampling by adding more files matching the pattern

Development

Running Tests

python -m pytest tests/

Code Style

pylint tap_azure_cloud_storage

Support

For issues, questions, or contributions, please contact your Qlik support representative or open an issue in the project repository.

License

MIT License - See LICENSE file for details

Changelog

Version 0.0.1

Initial release
Support for CSV, TSV, PSV, JSON-L, Parquet, Avro, and Excel files
Multiple authentication methods (Service Principal, Connection String, Account Key, Managed Identity)
Full and incremental replication
Gzip and Zip compression support
Regex pattern matching for file selection
Automatic schema inference
