# tap-azure-cloud-storage

A Singer tap for extracting data from Azure Blob Storage.

tap-azure-cloud-storage is a Singer tap that extracts data from Azure Blob Storage containers. It supports multiple file formats, including CSV, JSON-L, Parquet, Avro, and Excel, and can perform both full table and incremental replication.

## Supported File Formats
- Delimited text files: CSV, TSV, PSV (pipe-separated), and custom delimiters
- JSON: JSON-L (newline-delimited JSON)
- Parquet: Apache Parquet columnar format
- Avro: Apache Avro format
- Excel: .xlsx files
- Compressed files: Gzip (.gz) and Zip (.zip) archives
## Authentication Methods

- Service Principal: Azure AD application credentials (recommended for production)
- Connection String: Direct connection string authentication
- Account Key: Storage account key authentication
- Managed Identity: DefaultAzureCredential for Azure-hosted environments
## Features

- Flexible file selection: Specify folder paths and use regex patterns to match files
- Multiple tables: Define multiple table configurations within a single connection
- Incremental replication: Support for incremental sync based on file modification time
- Primary key support: Define primary keys for incremental replication
- DateTime field detection: Automatic detection and handling of datetime fields
- Schema inference: Automatic schema detection from file contents
- UTF-8 encoding: Full UTF-8 support
- Compression: Automatic handling of Gzip compressed files
## Requirements

- Python 3.7 or higher
- pip

## Installation

```bash
git clone <repository-url>
cd tap-azure-cloud-storage
pip install -e .
```

## Configuration

Create a config.json file with the following structure:
```json
{
  "storage_account_name": "your_storage_account",
  "container_name": "your_container_name",
  "start_date": "2024-01-01T00:00:00Z",
  "tables": [
    {
      "table_name": "my_table",
      "search_pattern": ".*\\.csv$",
      "search_prefix": "path/to/files/",
      "key_properties": ["id"],
      "date_overrides": ["created_at", "updated_at"],
      "delimiter": ","
    }
  ]
}
```

### Authentication Configuration

#### Service Principal (Recommended for Production)

Add these fields to your config.json:
```json
{
  "storage_account_name": "your_storage_account",
  "tenant_id": "your-tenant-id",
  "client_id": "your-client-id",
  "client_secret": "your-client-secret",
  ...
}
```

Setting up a Service Principal:
1. Create an Azure AD application:

   ```bash
   az ad app create --display-name "tap-azure-storage-app"
   ```

2. Create a Service Principal:

   ```bash
   az ad sp create --id <application-id>
   ```

3. Assign the Storage Blob Data Reader role:

   ```bash
   az role assignment create \
     --role "Storage Blob Data Reader" \
     --assignee <service-principal-id> \
     --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
   ```

4. Create a client secret:

   ```bash
   az ad app credential reset --id <application-id>
   ```
#### Connection String

```json
{
  "storage_account_name": "your_storage_account",
  "connection_string": "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net",
  ...
}
```

#### Account Key

```json
{
  "storage_account_name": "your_storage_account",
  "account_key": "your-account-key",
  ...
}
```

#### Managed Identity

For Azure-hosted environments (Azure VMs, Azure Functions, etc.), supply only the storage account name; the tap authenticates via DefaultAzureCredential:

```json
{
  "storage_account_name": "your_storage_account",
  ...
}
```

### Configuration Parameters

| Parameter | Required | Description |
|---|---|---|
| `storage_account_name` | Yes | Azure Storage account name |
| `container_name` | Yes | Azure Blob Storage container name |
| `start_date` | Yes | Start date for incremental sync (ISO 8601 format) |
| `tables` | Yes | Array of table configurations (see below) |
| `tenant_id` | No | Azure AD tenant ID (for Service Principal auth) |
| `client_id` | No | Azure AD application client ID (for Service Principal auth) |
| `client_secret` | No | Azure AD application client secret (for Service Principal auth) |
| `connection_string` | No | Azure Storage connection string |
| `account_key` | No | Azure Storage account key |
| `root_path` | No | Root path prefix within the container |
### Table Parameters

Each table in the `tables` array supports the following parameters:

| Parameter | Required | Description |
|---|---|---|
| `table_name` | Yes | Name of the table/stream |
| `search_pattern` | Yes | Regex pattern to match file names (e.g., `".*\\.csv$"`) |
| `search_prefix` | No | Folder path prefix to narrow the file search |
| `key_properties` | Yes | Array of field names to use as primary keys (use `[]` for full table sync) |
| `date_overrides` | No | Array of field names to treat as datetime fields |
| `delimiter` | No | Custom delimiter for CSV files (auto-detected for .tsv, .psv) |
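To illustrate how `search_prefix` and `search_pattern` combine to select files, here is a minimal sketch using Python's `re` module. The helper `match_blobs` is hypothetical, not the tap's actual implementation, and it assumes the pattern is applied to the full blob path.

```python
import re

def match_blobs(blob_names, search_pattern, search_prefix=""):
    """Illustrative helper: keep blobs under search_prefix whose
    full path matches the search_pattern regex."""
    pattern = re.compile(search_pattern)
    return [
        name
        for name in blob_names
        if name.startswith(search_prefix) and pattern.search(name)
    ]

blobs = [
    "exports/customers/customers_2024-01.csv",
    "exports/customers/notes.txt",
    "exports/orders/orders_2024-01.parquet",
]

print(match_blobs(blobs, r".*\.csv$", search_prefix="exports/customers/"))
# ['exports/customers/customers_2024-01.csv']
```

Anchoring the pattern with `$` avoids accidentally matching files such as `data.csv.bak`.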
## Example Configurations

### Multiple Tables (Service Principal)

```json
{
  "storage_account_name": "myaccount",
  "container_name": "data",
  "tenant_id": "...",
  "client_id": "...",
  "client_secret": "...",
  "start_date": "2024-01-01T00:00:00Z",
  "tables": [
    {
      "table_name": "customers",
      "search_pattern": "customers.*\\.csv$",
      "search_prefix": "exports/customers/",
      "key_properties": ["customer_id"],
      "date_overrides": ["created_at"],
      "delimiter": ","
    },
    {
      "table_name": "orders",
      "search_pattern": "orders.*\\.parquet$",
      "search_prefix": "exports/orders/",
      "key_properties": ["order_id"],
      "date_overrides": ["order_date"]
    },
    {
      "table_name": "products",
      "search_pattern": "products.*\\.jsonl$",
      "search_prefix": "exports/products/",
      "key_properties": [],
      "date_overrides": []
    }
  ]
}
```

### TSV Files (Account Key)

```json
{
  "storage_account_name": "myaccount",
  "container_name": "data",
  "account_key": "...",
  "start_date": "2024-01-01T00:00:00Z",
  "tables": [
    {
      "table_name": "events",
      "search_pattern": "events_.*\\.tsv$",
      "search_prefix": "logs/",
      "key_properties": ["event_id"],
      "date_overrides": ["event_timestamp"],
      "delimiter": "\t"
    }
  ]
}
```

## Usage

### Discovery

Discover the schema of your data:

```bash
tap-azure-cloud-storage --config config.json --discover > catalog.json
```

### Stream Selection

Edit the catalog.json to select streams and fields:
```json
{
  "streams": [
    {
      "tap_stream_id": "my_table",
      "stream": "my_table",
      "schema": {...},
      "metadata": [
        {
          "breadcrumb": [],
          "metadata": {
            "selected": true,
            "table-key-properties": ["id"]
          }
        },
        ...
      ]
    }
  ]
}
```

### Sync

Run the tap to extract data:
```bash
tap-azure-cloud-storage --config config.json --catalog catalog.json
```

Pipe data to a Singer target:

```bash
tap-azure-cloud-storage --config config.json --catalog catalog.json | target-jsonl > output.jsonl
```

To persist sync progress between runs, pass a state file:

```bash
tap-azure-cloud-storage --config config.json --catalog catalog.json --state state.json | target-jsonl > output.jsonl
```

## Replication Modes

### Full Table Replication

When key_properties is empty ([]), the tap performs a full table replication on every run, extracting all matched files.
```json
{
  "table_name": "my_table",
  "search_pattern": ".*\\.csv$",
  "key_properties": []
}
```

### Incremental Replication

When key_properties contains field names, the tap performs incremental replication based on file modification time. Only files modified since the last sync are processed.
```json
{
  "table_name": "my_table",
  "search_pattern": ".*\\.csv$",
  "key_properties": ["id"]
}
```

## Metadata Columns

The tap automatically adds metadata columns to each record:
- `_sdc_source_container`: Azure Blob Storage container name
- `_sdc_source_file`: Full path to the source file
- `_sdc_source_lineno`: Line number within the source file
- `_sdc_extra`: Extra fields not defined in the schema (for JSONL files)
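The incremental selection described above (skip files already synced, advance a bookmark, stamp each record with `_sdc_*` columns) can be sketched as follows. This is illustrative only; `sync_new_files` is a hypothetical helper, and real records would also contain the parsed file data.

```python
from datetime import datetime, timezone

def sync_new_files(files, bookmark, container):
    """files: list of (blob_name, last_modified) tuples.
    Returns (records, new_bookmark) for files modified after bookmark."""
    records = []
    max_seen = bookmark
    for name, modified in files:
        if modified <= bookmark:
            continue  # already processed in a previous run
        # Only the metadata columns are shown; parsed fields would be merged in.
        records.append({
            "_sdc_source_container": container,
            "_sdc_source_file": name,
        })
        max_seen = max(max_seen, modified)
    return records, max_seen

files = [
    ("exports/a.csv", datetime(2024, 1, 1, tzinfo=timezone.utc)),
    ("exports/b.csv", datetime(2024, 3, 1, tzinfo=timezone.utc)),
]
bookmark = datetime(2024, 2, 1, tzinfo=timezone.utc)
records, new_bookmark = sync_new_files(files, bookmark, "data")
print([r["_sdc_source_file"] for r in records])  # ['exports/b.csv']
```

The returned bookmark would be written to state.json so the next run resumes from it.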
## File Format Details

### Delimited Text (CSV, TSV, PSV)

- Automatic delimiter detection based on file extension
- Custom delimiter support via the `delimiter` parameter
- Header row required
- UTF-8 encoding
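Extension-based delimiter detection with a configured override, as described above, can be sketched like this. The function `detect_delimiter` is hypothetical; the mapping for `.tsv` and `.psv` follows the README, and the `.csv` default is an assumption.

```python
def detect_delimiter(filename, configured=None):
    """Pick a delimiter: an explicit "delimiter" config wins,
    otherwise fall back to the file extension."""
    if configured:
        return configured
    if filename.endswith(".tsv"):
        return "\t"   # tab-separated
    if filename.endswith(".psv"):
        return "|"    # pipe-separated
    return ","        # assumed default for .csv and anything else

print(detect_delimiter("logs/events_2024.tsv"))       # '\t'
print(detect_delimiter("data.csv"))                   # ','
print(detect_delimiter("data.txt", configured=";"))   # ';'
```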
### JSON-L

- One JSON object per line
- Automatic schema inference
- Supports nested objects and arrays
- Extra fields stored in the `_sdc_extra` column
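A minimal sketch of how fields outside the discovered schema could end up in `_sdc_extra`: keys present in a JSON-L record but absent from the schema are moved into a single catch-all column. The helper `apply_schema` is illustrative, not the tap's actual code.

```python
def apply_schema(record, schema_fields):
    """Split a record into schema-known fields and an _sdc_extra catch-all."""
    known = {k: v for k, v in record.items() if k in schema_fields}
    extra = {k: v for k, v in record.items() if k not in schema_fields}
    if extra:
        known["_sdc_extra"] = extra
    return known

record = {"id": 1, "name": "widget", "color": "red"}
print(apply_schema(record, {"id", "name"}))
# {'id': 1, 'name': 'widget', '_sdc_extra': {'color': 'red'}}
```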
### Parquet

- Schema read directly from Parquet metadata
- Efficient columnar reading
- Supports nested structures

### Avro

- Schema read from Avro file metadata
- Supports complex types
- Efficient binary format

### Excel (.xlsx)

- First row treated as headers
- Multiple worksheets supported; each sheet is processed and the sheet name is appended to `_sdc_source_file`
- Automatic type inference
### Compressed Files

- Gzip (.gz): Automatically decompressed
- Zip (.zip): All contained files processed
- Nested compression not supported
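The decompression behavior above can be sketched with the standard library: gzip blobs yield one inner file, zip archives yield every member, and anything else passes through unchanged. `decompress` is a hypothetical helper, not the tap's real code path.

```python
import gzip
import io
import zipfile

def decompress(blob_name, data):
    """Return a list of (inner_name, bytes) pairs ready for parsing."""
    if blob_name.endswith(".gz"):
        # Strip the .gz suffix to recover the inner file name.
        return [(blob_name[:-3], gzip.decompress(data))]
    if blob_name.endswith(".zip"):
        # Every file inside the archive is processed.
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return [(name, zf.read(name)) for name in zf.namelist()]
    return [(blob_name, data)]  # not compressed

payload = gzip.compress(b"id,name\n1,widget\n")
print(decompress("exports/customers.csv.gz", payload))
# [('exports/customers.csv', b'id,name\n1,widget\n')]
```

Note that a `.csv.gz` inside a `.zip` would not be decompressed again, matching the "nested compression not supported" limitation.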
## Azure Permissions

For production use with Service Principal authentication, assign the Storage Blob Data Reader role:

```bash
az role assignment create \
  --role "Storage Blob Data Reader" \
  --assignee <service-principal-id> \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>/blobServices/default/containers/<container>"
```

## Security Best Practices

- Use Service Principal authentication in production
- Apply the principle of least privilege: grant read access only to the specific containers you need
- Rotate credentials regularly
- Use Azure Key Vault to store secrets
- Enable Azure Storage logging for audit trails
- Use private endpoints where possible to avoid public internet exposure
## Troubleshooting

### Connection Errors

Problem: "Failed to connect to Azure"

Solutions:

- Verify `storage_account_name` is correct
- Check the authentication credentials (tenant_id, client_id, client_secret)
- Ensure the Service Principal has the proper permissions
- Verify network connectivity to Azure
### Authentication Failures

Problem: "Authentication failed"

Solutions:

- For Service Principal: verify the client_secret is correct and not expired
- Check that the Service Principal has the "Storage Blob Data Reader" role
- Ensure tenant_id and client_id are correct
### No Files Matched

Problem: "No objects matched for table"

Solutions:

- Check the `search_pattern` regex syntax
- Verify the `search_prefix` path exists in the container
- Ensure files have the correct extensions
- Check file modification dates against `start_date`
### Schema Detection Problems

Problem: Schema not detected correctly

Solutions:

- Ensure files have a consistent structure
- Check file encoding (must be UTF-8)
- For CSV files, verify headers are present
- Increase sampling by adding more files matching the pattern
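To show why inconsistent values break schema detection, here is a naive sketch of value-based type inference for a CSV column. This is an assumption about how inference generally works, not the tap's actual sampling logic: a single non-numeric value demotes the whole column to string.

```python
def infer_type(values):
    """Naive column type inference over sampled string values."""
    def is_int(v):
        try:
            int(v)
            return True
        except ValueError:
            return False

    def is_float(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"          # nothing to infer from
    if all(is_int(v) for v in non_empty):
        return "integer"
    if all(is_float(v) for v in non_empty):
        return "number"
    return "string"              # mixed values fall back to string

print(infer_type(["1", "2", ""]))    # integer
print(infer_type(["1.5", "2"]))      # number
print(infer_type(["abc", "1"]))      # string
```

This is why sampling more files helps: a column that looks numeric in one file may contain text in another, and the broader sample catches that early.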
## Development

Run the test suite:

```bash
python -m pytest tests/
```

Run the linter:

```bash
pylint tap_azure_cloud_storage
```

## Support

For issues, questions, or contributions, please contact your Qlik support representative or open an issue in the project repository.
## License

MIT License - see the LICENSE file for details.
## Changelog

- Initial release
- Support for CSV, TSV, PSV, JSON-L, Parquet, Avro, and Excel files
- Multiple authentication methods (Service Principal, Connection String, Account Key, Managed Identity)
- Full and incremental replication
- Gzip and Zip compression support
- Regex pattern matching for file selection
- Automatic schema inference