Commit 1afcfa1

update docs
1 parent 08d36bc commit 1afcfa1

5 files changed

Lines changed: 510 additions & 1 deletion

docs/authentication.md

Lines changed: 245 additions & 0 deletions
@@ -0,0 +1,245 @@
1+
# Authentication
2+
3+
The dbt-fabric-samdebruyn adapter supports a variety of authentication methods so you can connect to Microsoft Fabric from any environment. This guide walks through each method, explains when to use it, and provides ready-to-use `profiles.yml` examples.
4+
5+
!!! tip "Quick recommendation"
6+
7+
| Scenario | Recommended method |
8+
| --- | --- |
9+
| Local development | [`CLI`](#azure-cli) or [`auto`](#automatic-defaultazurecredential) |
10+
| CI/CD pipelines | [`environment`](#environment-variables) or [`ActiveDirectoryServicePrincipal`](#service-principal) |
11+
| Fabric Notebook | [`environment`](#environment-variables) or [`ActiveDirectoryServicePrincipal`](#service-principal) |
12+
13+
All examples below assume the following base profile structure. Only the authentication-related keys change per method.
14+
15+
```yaml
default:
  target: dev
  outputs:
    dev:
      type: fabric
      workspace: My Workspace
      database: my_data_warehouse
      schema: dbt
      # + authentication keys shown below
```
26+
27+
??? tip "Use environment variables for secrets"
28+
29+
Never hardcode secrets in your `profiles.yml`. Use Jinja to reference environment variables:
30+
31+
```yaml
32+
client_secret: "{{ env_var('AZURE_CLIENT_SECRET') }}"
33+
```
34+
35+
---
36+
37+
## Local development
38+
39+
### Azure CLI
40+
41+
The simplest way to authenticate during local development. Log in once with the Azure CLI and dbt will reuse that session.
42+
43+
**Step 1 — Log in**
44+
45+
```bash
46+
az login
47+
```
48+
49+
Your account does not need access to any Azure subscription — it only needs access to your Fabric workspace.
50+
51+
**Step 2 — Configure your profile**
52+
53+
```yaml
default:
  target: dev
  outputs:
    dev:
      type: fabric
      database: my_data_warehouse
      schema: dbt
      workspace: My Workspace # or use host
      authentication: CLI
```
64+
65+
!!! info "Keep your Azure CLI up to date"
66+
67+
There have been reports of issues when using an outdated version of the Azure CLI. Run `az upgrade` to make sure you are on the latest version.
68+
69+
The Azure CLI itself supports [multiple login methods](https://learn.microsoft.com/cli/azure/authenticate-azure-cli?view=azure-cli-latest&WT.mc_id=MVP_310840) (browser, device code, service principal, managed identity, …), making this a flexible option that adapts to many scenarios.
70+
71+
### Automatic (`DefaultAzureCredential`)
72+
73+
Set `authentication` to `auto`, or omit it entirely, since it is the default. The adapter uses the Azure Identity SDK's [`DefaultAzureCredential`](https://learn.microsoft.com/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python&WT.mc_id=MVP_310840), which tries several credential sources in order:
74+
75+
1. Environment variables
76+
2. Workload identity
77+
3. Managed identity
78+
4. Azure CLI
79+
5. Azure PowerShell
80+
6. Azure Developer CLI
81+
7. Interactive browser (if available)
82+
83+
```yaml
default:
  target: dev
  outputs:
    dev:
      type: fabric
      database: my_data_warehouse
      schema: dbt
      workspace: My Workspace
      # authentication: auto ← this is the default, can be omitted
```
94+
95+
This means that if you are logged in with **Azure PowerShell** (`Connect-AzAccount`), it will automatically be picked up — no extra configuration needed.
96+
97+
!!! tip "When to use `auto` vs `CLI`"
98+
99+
`auto` tries multiple credential sources in a chain, which means it can be slightly slower on first connection. It can also pick up credentials you don't intend to use — for example, a managed identity or environment variables left over from another tool. If you know you will always use the Azure CLI, setting `authentication: CLI` explicitly skips the chain, connects faster, and ensures no unexpected credentials are used.
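The "first source that works wins" behaviour described above can be illustrated with a toy model. This is a minimal sketch, not the Azure Identity SDK: `resolve_first`, the source names, and the token strings are all hypothetical stand-ins, and nothing here calls Azure.

```python
from typing import Callable, Optional

# Each hypothetical source returns a token string, or None when it has nothing to offer.
CredentialSource = Callable[[], Optional[str]]


def resolve_first(sources: list[tuple[str, CredentialSource]]) -> tuple[str, str]:
    """Return (source_name, token) from the first source that yields a token."""
    for name, source in sources:
        token = source()
        if token is not None:
            return name, token
    raise RuntimeError("no credential source produced a token")


# Stand-ins for real credential sources, listed in chain order.
chain = [
    ("environment", lambda: "token-from-env"),  # variables left over from another tool
    ("managed_identity", lambda: None),         # not available on this machine
    ("azure_cli", lambda: "token-from-cli"),    # the login you meant to use
]

print(resolve_first(chain))      # ('environment', 'token-from-env')
print(resolve_first(chain[2:]))  # ('azure_cli', 'token-from-cli')
```

The second call mimics pinning a single method: with only one source in play, leftover environment variables can no longer shadow the CLI login.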
100+
101+
---
102+
103+
## CI/CD & automated environments
104+
105+
### Service Principal
106+
107+
Use a Microsoft Entra ID app registration (service principal) with a client secret. This is ideal for unattended, automated runs.
108+
109+
**Prerequisites:**
110+
111+
- A registered application in Microsoft Entra ID
112+
- The application must have access to your Fabric workspace
113+
- You need the **client ID**, **client secret**, and **tenant ID**
114+
115+
```yaml
default:
  target: ci
  outputs:
    ci:
      type: fabric
      database: my_data_warehouse
      schema: dbt
      workspace: My Workspace
      authentication: ActiveDirectoryServicePrincipal
      tenant_id: "{{ env_var('AZURE_TENANT_ID') }}"
      client_id: "{{ env_var('AZURE_CLIENT_ID') }}"
      client_secret: "{{ env_var('AZURE_CLIENT_SECRET') }}"
```
129+
130+
!!! warning "Tenant ID is required"
131+
132+
When using `ActiveDirectoryServicePrincipal` together with [`workspace_name`](configuration.md#workspace_name) or [`workspace_id`](configuration.md#workspace_id) — or when running Python models — the `tenant_id` must be provided.
133+
134+
### Environment variables
135+
136+
Set `authentication` to `environment` and configure credentials through environment variables. The adapter uses Azure Identity's [`EnvironmentCredential`](https://learn.microsoft.com/python/api/azure-identity/azure.identity.environmentcredential?view=azure-python&WT.mc_id=MVP_310840), which supports the following variables:
137+
138+
=== "Service principal with secret"
139+
140+
| Variable | Description |
141+
| --- | --- |
142+
| `AZURE_TENANT_ID` | Microsoft Entra tenant ID |
143+
| `AZURE_CLIENT_ID` | Application (client) ID |
144+
| `AZURE_CLIENT_SECRET` | Client secret |
145+
146+
=== "Service principal with certificate"
147+
148+
| Variable | Description |
149+
| --- | --- |
150+
| `AZURE_TENANT_ID` | Microsoft Entra tenant ID |
151+
| `AZURE_CLIENT_ID` | Application (client) ID |
152+
| `AZURE_CLIENT_CERTIFICATE_PATH` | Path to a PEM or PKCS12 certificate |
153+
| `AZURE_CLIENT_CERTIFICATE_PASSWORD` | *(optional)* Certificate password |
154+
155+
=== "Username & password"
156+
157+
| Variable | Description |
158+
| --- | --- |
159+
| `AZURE_TENANT_ID` | Microsoft Entra tenant ID |
160+
| `AZURE_CLIENT_ID` | Application (client) ID |
161+
| `AZURE_USERNAME` | Username |
162+
| `AZURE_PASSWORD` | Password |
163+
164+
```yaml
default:
  target: ci
  outputs:
    ci:
      type: fabric
      database: my_data_warehouse
      schema: dbt
      workspace: My Workspace
      authentication: environment
```
175+
176+
This method keeps your `profiles.yml` completely free of secrets, which is an advantage over the explicit `ActiveDirectoryServicePrincipal` method.
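Because nothing in `profiles.yml` shows which variables are set, a small pre-flight check in a CI job can help catch a half-configured environment before dbt runs. The sketch below is an illustration only: `detect_variable_set` is a hypothetical helper, the variable sets are copied from the tables above, and the order in which they are checked is an assumption rather than documented adapter behaviour.

```python
import os
from typing import Mapping, Optional

# Variable sets copied from the tables above; the optional certificate password is left out.
VARIABLE_SETS = {
    "service principal with secret": (
        "AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET",
    ),
    "service principal with certificate": (
        "AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_CERTIFICATE_PATH",
    ),
    "username & password": (
        "AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_USERNAME", "AZURE_PASSWORD",
    ),
}


def detect_variable_set(env: Mapping[str, str]) -> Optional[str]:
    """Return the label of the first fully populated variable set, or None."""
    for label, names in VARIABLE_SETS.items():
        # An empty string counts as "not set".
        if all(env.get(name) for name in names):
            return label
    return None


if __name__ == "__main__":
    found = detect_variable_set(os.environ)
    print(found or "no complete variable set found")
```

Running the script in the same shell or pipeline step that later invokes dbt shows which of the three options, if any, is fully configured.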
177+
178+
---
179+
180+
## Fabric Notebook
181+
182+
When running dbt inside a **Fabric Notebook**, the recommended approach is to use **environment variable** or **service principal** authentication.
183+
184+
Configure your notebook to set the required environment variables (e.g. `AZURE_TENANT_ID`, `AZURE_CLIENT_ID`, `AZURE_CLIENT_SECRET`) and use the [`environment`](#environment-variables) or [`ActiveDirectoryServicePrincipal`](#service-principal) method.
185+
186+
```yaml
default:
  target: notebook
  outputs:
    notebook:
      type: fabric
      database: my_data_warehouse
      schema: dbt
      workspace: My Workspace
      authentication: environment
```
197+
198+
Alternatively, with explicit service principal configuration:
199+
200+
```yaml
default:
  target: notebook
  outputs:
    notebook:
      type: fabric
      database: my_data_warehouse
      schema: dbt
      workspace: My Workspace
      authentication: ActiveDirectoryServicePrincipal
      tenant_id: "{{ env_var('AZURE_TENANT_ID') }}"
      client_id: "{{ env_var('AZURE_CLIENT_ID') }}"
      client_secret: "{{ env_var('AZURE_CLIENT_SECRET') }}"
```
214+
215+
!!! warning "`FabricSpark` is currently broken"
216+
217+
The adapter also has a `FabricSpark` (alias `SynapseSpark`) authentication method that uses [NotebookUtils](https://learn.microsoft.com/fabric/data-engineering/notebook-utilities?WT.mc_id=MVP_310840) to obtain an access token from the notebook session. However, this method is **not working** at the moment: the token returned by the Fabric notebook runtime has a scope that is not allowed to access Data Warehouses and SQL Endpoints. Use one of the alternatives above instead.
218+
219+
---
220+
221+
## Other methods
222+
223+
The adapter supports several additional authentication methods such as managed identity, interactive browser, and pre-acquired access tokens. For a complete list of all supported methods and their configuration options, see the [configuration documentation](configuration.md#authentication).
224+
225+
---
226+
227+
## Troubleshooting
228+
229+
### Which authentication method is being used?
230+
231+
Run `dbt debug` to see the resolved connection information, including the active authentication method.
232+
233+
```bash
234+
dbt debug
235+
```
236+
237+
### Common issues
238+
239+
| Symptom | Likely cause | Fix |
240+
| --- | --- | --- |
241+
| `Login timeout expired` | Slow network or restrictive firewall | Increase [`login_timeout`](configuration.md#login_timeout) (e.g. `30`) |
242+
| `AADSTS700016: Application not found` | Wrong `client_id` or the app isn't registered in the correct tenant | Verify the app registration in Microsoft Entra ID |
243+
| `DefaultAzureCredential failed` | No valid credential source found | Make sure you are logged in (`az login` / `Connect-AzAccount`) or that environment variables are set |
244+
| `Token expired` when using `access_token` | The pre-acquired token has expired | Refresh the token before running dbt |
245+
| `notebookutils not found` | Using `FabricSpark` outside of a Fabric/Synapse notebook | Switch to a different authentication method |

docs/feature-comparison.md

Lines changed: 8 additions & 0 deletions
@@ -35,6 +35,10 @@ While most authentication methods have been contributed back to dbt-fabric, some
3535

3636
## MERGE in incremental and microbatch models
3737

38+
!!! info
39+
40+
MERGE has recently been added in Microsoft's version as well.
41+
3842
Incremental models in dbt-fabric support the `append`, `insert_overwrite`, and `delete+insert` strategies.
3943

4044
This adapter adds support for [MERGE](https://learn.microsoft.com/sql/t-sql/statements/merge-transact-sql?view=sql-server-ver17&WT.mc_id=MVP_310840).
@@ -72,6 +76,10 @@ select * from source('my_source', 'my_table')
7276
{% endif %}
7377
```
7478

79+
## Better support for [warehouse snapshots](warehouse-snapshots.md)
80+
81+
Both adapters support Fabric [warehouse snapshots](https://learn.microsoft.com/fabric/data-warehouse/warehouse-snapshot?WT.mc_id=MVP_310840), but Microsoft's implementation hijacks Python runtime components and does not respect the proper dbt lifecycle. This adapter exposes a macro you can call from `on-run-start`, `on-run-end`, `post-hook`, or any other Jinja context — giving you full control over when and how often snapshots are taken.
82+
7583
## Better support for popular packages
7684

7785
[dbt-utils](https://hub.getdbt.com/dbt-labs/dbt_utils/latest/) is already fully supported and more packages are being tested and added.

docs/python-models.md

Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
1+
# Python models
2+
3+
The dbt-fabric-samdebruyn adapter supports [Python models](https://docs.getdbt.com/docs/build/python-models), allowing you to use PySpark DataFrames to transform data in your Fabric Data Warehouse. This is a feature exclusive to this adapter — Microsoft's upstream dbt-fabric does not support it.
4+
5+
Python models are useful when you need transformations that are difficult or impossible to express in SQL, such as machine learning inference, complex string parsing, or calling external APIs.
6+
7+
---
8+
9+
## Prerequisites
10+
11+
To use Python models, your `profiles.yml` must include the following additional configuration options on top of the standard connection settings:
12+
13+
| Option | Description |
14+
| --- | --- |
15+
| [`workspace`](configuration.md#workspace_name) or [`workspace_id`](configuration.md#workspace_id) | Identifies your Fabric Workspace. Required so the adapter can locate the Livy API endpoint. |
16+
| [`lakehouse`](configuration.md#lakehouse_name) or [`lakehouse_id`](configuration.md#lakehouse_id) | Identifies the Lakehouse where Spark sessions run. A Lakehouse must exist in your workspace. |
17+
18+
!!! warning "Tenant ID required for service principal auth"
19+
20+
If you are using [`ActiveDirectoryServicePrincipal`](configuration.md#activedirectoryserviceprincipal) authentication, you must also provide the [`tenant_id`](configuration.md#tenant_id) option.
21+
22+
### Example profile
23+
24+
```yaml
default:
  target: dev
  outputs:
    dev:
      type: fabric
      workspace: My Workspace
      database: my_data_warehouse
      schema: dbt
      lakehouse: My Lakehouse
      authentication: CLI
```
36+
37+
---
38+
39+
## Writing a Python model
40+
41+
A Python model is a `.py` file in your `models/` directory that defines a `model()` function. This function receives a `dbt` object and a `spark` session, and must return a PySpark DataFrame.
42+
43+
```python
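from pyspark.sql import functions as F


def model(dbt, spark):
    # dbt.ref() returns a PySpark DataFrame for the upstream model
    source_df = dbt.ref("my_upstream_model")

    # withColumn() expects a Column; F.expr() parses a SQL snippet into one
    result_df = source_df.withColumn(
        "full_name",
        F.expr("concat(first_name, ' ', last_name)"),
    )

    return result_df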
```
53+
54+
### The `dbt` object
55+
56+
The `dbt` object provides the same interface as in other adapters:
57+
58+
- **`dbt.ref("model_name")`** — Returns a PySpark DataFrame for the referenced model.
59+
- **`dbt.source("source_name", "table_name")`** — Returns a PySpark DataFrame for the referenced source.
60+
- **`dbt.config.get("key")`** — Access the model's configuration.
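As an illustration of the three calls listed above, here is a minimal sketch of a model that reads a source and filters it by a configured threshold. The source `shop.orders`, the `amount` column, and the `min_amount` config key are hypothetical; the sketch assumes `min_amount` is set in the model's YAML config and falls back to `0` otherwise.

```python
def model(dbt, spark):
    # Python models on this adapter only support the table materialization.
    dbt.config(materialized="table")

    # "shop" and "orders" are hypothetical source and table names.
    orders = dbt.source("shop", "orders")

    # "min_amount" is a hypothetical config key; use 0 when it is not set.
    min_amount = dbt.config.get("min_amount") or 0

    # Selecting a column by name and comparing it yields a PySpark Column,
    # which filter() accepts as a row predicate.
    return orders.filter(orders["amount"] >= min_amount)
```

The `spark` argument is unused here because the transformation only needs DataFrame methods; it is still part of the required signature.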
61+
62+
### The `spark` object
63+
64+
The `spark` object is a standard PySpark `SparkSession`. Behind the scenes, the adapter configures it with Fabric's [synapsesql connector](https://learn.microsoft.com/fabric/data-engineering/spark-data-warehouse-connector?WT.mc_id=MVP_310840) so that `dbt.ref()` and `dbt.source()` read directly from your Data Warehouse.
65+
66+
---
67+
68+
## How it works
69+
70+
Understanding the execution flow can help with debugging:
71+
72+
1. **Code generation** — dbt compiles your Python model and wraps it with boilerplate that configures the Spark session and sets up the `synapsesql` connector for reads and writes.
73+
2. **Livy session** — The adapter connects to the [Livy API](https://learn.microsoft.com/fabric/data-engineering/lakehouse-api?WT.mc_id=MVP_310840) on your Lakehouse and either reuses an existing Spark session named `dbt-fabric` or creates a new one.
74+
3. **Statement execution** — The compiled code is submitted as a PySpark statement to the Livy session.
75+
4. **Write back** — The returned DataFrame is written to your Data Warehouse using `synapsesql` in `overwrite` mode.
76+
77+
All Python models in a single dbt run share the same Livy session, which avoids the overhead of starting a new Spark session for each model.
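Steps 2 and 3 both involve waiting for a remote state change, and the Limitations section notes a timeout of roughly 5 minutes. The general pattern is a poll loop with a deadline. The sketch below is not the adapter's code: `wait_for_state`, its parameters, the 2-second interval, and the state names in the usage example are assumptions chosen for illustration.

```python
import time
from typing import Callable


def wait_for_state(
    get_state: Callable[[], str],
    done_states: set[str],
    timeout_s: float = 300.0,   # roughly 5 minutes
    interval_s: float = 2.0,    # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_state() until it returns one of done_states or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state in done_states:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"still in state {state!r} after {timeout_s} seconds")
        sleep(interval_s)


# Usage with a fake session that becomes ready on the third poll.
states = iter(["starting", "starting", "idle"])
print(wait_for_state(lambda: next(states), {"idle"}, sleep=lambda _: None))  # idle
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting.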
78+
79+
---
80+
81+
## Limitations
82+
83+
| Limitation | Details |
84+
| --- | --- |
85+
| **Table materialization only** | Python models only support the `table` materialization. Incremental models are not supported. |
86+
| **PySpark DataFrames only** | Your `model()` function must return a PySpark DataFrame. Pandas DataFrames are not supported. |
87+
| **Always full refresh** | The table is fully replaced (`overwrite` mode) on each run. |
88+
| **Session timeout** | The adapter polls for session and statement completion with a timeout of approximately 5 minutes. Long-running Spark jobs may hit this limit. |
89+
90+
---
91+
92+
## Troubleshooting
93+
94+
### Common issues
95+
96+
| Symptom | Likely cause | Fix |
97+
| --- | --- | --- |
98+
| `workspace_id must be provided` | Missing workspace configuration | Add [`workspace`](configuration.md#workspace_name) or [`workspace_id`](configuration.md#workspace_id) to your profile |
99+
| `lakehouse_id must be provided` | Missing lakehouse configuration | Add [`lakehouse`](configuration.md#lakehouse_name) or [`lakehouse_id`](configuration.md#lakehouse_id) to your profile |
100+
| Livy session times out | The Spark session took too long to start | Retry — Fabric Spark sessions can be slow to start on first use |
101+
| Statement fails with `synapsesql` error | Connection between Spark and the Data Warehouse failed | Verify that the Lakehouse and Data Warehouse are in the same workspace |
102+
| `HTTP 429` errors in logs | Fabric API rate limiting | The adapter handles this automatically with retries — no action needed |
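For readers curious what automatic handling of `HTTP 429` typically looks like, here is a generic retry sketch. It is an assumption about the common pattern (honour a server-provided wait when present, otherwise back off exponentially), not a description of the adapter's actual implementation; `call_with_retries` and the shape of the `request` callable are hypothetical.

```python
import time
from typing import Callable, Optional, Tuple

# A request callable returns (status_code, body, retry_after_seconds_or_None).
Response = Tuple[int, str, Optional[float]]


def call_with_retries(
    request: Callable[[], Response],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Retry request() while it is rate limited, up to max_attempts calls."""
    for attempt in range(max_attempts):
        status, body, retry_after = request()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        # Prefer the server's hint; otherwise wait 1s, 2s, 4s, ...
        delay = retry_after if retry_after is not None else base_delay_s * 2 ** attempt
        sleep(delay)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```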
