Commit 9047840

[K9VULN-14660] fix(agentless-azure): harden agentless azure setup (#185)
* fix(agentless-azure): handle unpurged secret key during re-setup with a different RG
* fix(agentless-azure): handle resource group mismatch on deploy/destroy
* test(agentless-azure): improve test coverage
* docs(agentless-azure): update readme
* refactor(agentless-azure): remove dead code
* add(agentless-azure): make permissions check softer in preflight if RG already exists
* perf(agentless-azure): parallelize lookup checks and resource creation
* fix(agentless-azure): skip Key Vault secret retries when Secrets Officer already exists
* perf(agentless-azure): bump Terraform parallelism 10 -> 20
* perf(agentless-azure): run preflight checks in parallel
* test(agentless-azure): improve test coverage
* fix(agentless): mark active workflow step FAILED when Azure deploy exits with error
* add(agentless-azure): add roleDefinitions/write permission in preflight
* agentless azure: derive install_id and discover existing deployments via RG tag
* agentless azure: scope storage account and key vault names to install_id
* docs(agentless-azure): update readme with required permissions and document new RG behavior
* fix(agentless-azure): drop unused local in cmd_deploy
* fix(agentless-azure): treat missing state storage account as missing metadata on first deploy
* build(agentless-azure): re-build dist scripts
1 parent 6afe253 commit 9047840

16 files changed

Lines changed: 1851 additions & 393 deletions

azure/agentless/README.md

Lines changed: 97 additions & 14 deletions
Original file line number | Diff line number | Diff line change
@@ -7,9 +7,9 @@ This script automates the deployment of Datadog Agentless Scanner on Azure using
77
- Azure Cloud Shell or a machine with:
88
- `az` CLI installed and authenticated (`az login`)
99
- `terraform` CLI installed (>= 1.0)
10-
- Azure subscriptions with appropriate permissions:
11-
- `Owner` (or a role granting role-assignment write + resource creation) on the scanner subscription
12-
- A role granting `Microsoft.Authorization/roleAssignments/write` on each scanned subscription (e.g., `User Access Administrator` or `Owner`), so the scanner's managed identity can be granted the roles it needs to snapshot and read disks
10+
- Azure permissions (see [Permissions](#permissions) for the delegated, least-privilege variant if the SWE running the script is not subscription `Owner`):
11+
- `Owner` on the scanner subscription, **or** the custom role described below at the scanner subscription scope plus both `Contributor` and `User Access Administrator` on the scanner resource group (the latter is required so the script can grant itself `Storage Blob Data Contributor` on the state Storage Account and `Key Vault Secrets Officer` on the Key Vault)
12+
- A role granting `Microsoft.Authorization/roleAssignments/write` and `Microsoft.Authorization/roleDefinitions/write` on each scanned subscription (e.g., `User Access Administrator` or `Owner`), so the scanner's managed identity can be granted scan permissions and the custom scanning role can list the scan-target subscription in its `assignableScopes`
1313
- The following resource providers registered in the scanner subscription (auto-registered by the script when possible): `Microsoft.Compute`, `Microsoft.Network`, `Microsoft.ManagedIdentity`, `Microsoft.Storage`, `Microsoft.KeyVault`, `Microsoft.Authorization`
1414

1515
## Usage
@@ -40,11 +40,13 @@ python azure_agentless_setup.pyz deploy
4040
| `SCANNER_SUBSCRIPTION` | Yes | Azure subscription ID where the scanner will be deployed |
4141
| `SCANNER_LOCATIONS` | Yes | Comma-separated list of Azure locations (max 4) for scanners (e.g., `eastus` or `eastus,westeurope`) |
4242
| `SUBSCRIPTIONS_TO_SCAN` | Yes | Comma-separated list of Azure subscription IDs to scan |
43-
| `SCANNER_RESOURCE_GROUP` | No | Resource group name for scanner resources (default: `datadog-agentless-scanner`) |
43+
| `SCANNER_RESOURCE_GROUP` | No | Resource group name for scanner resources (default: `datadog-agentless-scanner`). When unset, the script auto-discovers the resource group from the `DatadogAgentlessScanner=true` tag the previous deploy applied — this is what makes re-runs from a fresh Cloud Shell session work without re-setting the env var. To relocate an existing deployment to a different resource group, run `destroy` first. |
4444
| `TF_STATE_STORAGE_ACCOUNT` | No | Custom Azure Storage Account for Terraform state (see below) |
4545

4646
Re-running `deploy` with new `SCANNER_LOCATIONS` or `SUBSCRIPTIONS_TO_SCAN` values merges them with the existing deployment (stored in the Terraform state storage account) instead of replacing it.
4747

48+
Only one Agentless Scanner deployment is supported per scanner subscription. If the script detects more than one tagged deployment in the scanner subscription, `deploy` fails fast and lists the resource groups so you can `destroy` the ones you no longer need — it never picks one silently. (On `destroy`, set `SCANNER_RESOURCE_GROUP` to choose which one to remove.)
49+
4850
### Destroy
4951

5052
To remove the scanner infrastructure:
@@ -67,22 +69,25 @@ If only one installation exists locally, `SCANNER_SUBSCRIPTION` can be omitted a
6769
| `DD_APP_KEY` | Yes | Datadog Application key |
6870
| `DD_SITE` | Yes | Datadog site |
6971
| `SCANNER_SUBSCRIPTION` | No* | Scanner subscription ID (*inferred if only one installation exists locally) |
70-
| `SCANNER_RESOURCE_GROUP` | No* | Resource group name (*required only if metadata does not contain it, e.g., installations created before this field was added) |
72+
| `SCANNER_RESOURCE_GROUP` | No* | Resource group name (*auto-discovered from the `DatadogAgentlessScanner=true` tag when exactly one tagged deployment exists in the scanner subscription; required when the resource group was not tagged at deploy time — typically an admin-pre-created resource group — or when multiple tagged deployments exist) |
7173
| `TF_STATE_STORAGE_ACCOUNT` | No | Custom Storage Account (if used during deploy) |
7274
| `SCANNER_LOCATIONS` | No* | Locations to destroy (*fallback only if deployment metadata cannot be read) |
7375
| `SUBSCRIPTIONS_TO_SCAN` | No* | Subscriptions to clean up (*fallback only if deployment metadata cannot be read) |
7476

7577
The destroy command will:
7678
1. Run `terraform destroy` (prompts for confirmation)
7779
2. Disable the Agentless scan options in Datadog for each previously configured subscription
78-
3. Ask if you want to delete the Key Vault holding the API key (kept by default to allow reuse)
79-
4. Leave the resource group and Terraform state storage account intact (manual deletion instructions provided)
80+
3. Remove the deployment metadata blob from the state Storage Account on a successful run; if scan-options cleanup partially failed, the metadata is kept so a follow-up `destroy` can still find the subscription list
81+
4. Ask if you want to delete the Key Vault holding the API key (kept by default to allow reuse)
82+
5. Leave the resource group and Terraform state storage account intact (manual deletion instructions provided)
83+
84+
Unlike `deploy`, `destroy` does not run a permission preflight: missing role assignments surface as `terraform destroy` errors rather than a pre-run summary.
8085

8186
### Terraform State Storage
8287

8388
Terraform state is stored in an Azure Storage Account (blob container `tfstate`, key `datadog-agentless.tfstate`) to ensure persistence across runs and enable future updates or teardown.
8489

85-
**Default behavior:** A storage account with a deterministic name derived from the scanner subscription ID (e.g., `datadog<hash>`) is automatically created inside the scanner resource group. If it already exists (e.g., from a previous run), it is reused. The `azurerm` backend is configured with `use_azuread_auth = true`, and the script grants the current user the `Storage Blob Data Contributor` role on the account.
90+
**Default behavior:** A storage account named `datadog<install-id>` (where `install-id` is the first 12 hex chars of `sha256("<scanner-subscription>|<resource-group>")`) is created inside the scanner resource group. Two deploys against the same `(SCANNER_SUBSCRIPTION, SCANNER_RESOURCE_GROUP)` pair resolve to the same Storage Account name and are therefore the same install. Re-running with a different `SCANNER_RESOURCE_GROUP` resolves to a different Storage Account, which is why this combination uniquely identifies an installation. The `azurerm` backend is configured with `use_azuread_auth = true`, and the script grants the current user the `Storage Blob Data Contributor` role on the account.
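The naming rule in this paragraph can be sketched as follows. `compute_install_id` mirrors the helper this commit adds to `config.py`; `state_storage_account_name` is a hypothetical helper introduced only to illustrate the `datadog<install-id>` pattern, and the subscription ID is a placeholder.

```python
import hashlib

INSTALL_ID_LEN = 12  # same truncation length as config.py in this commit


def compute_install_id(scanner_subscription: str, resource_group: str) -> str:
    """First 12 hex chars of sha256("<scanner-subscription>|<resource-group>")."""
    digest = hashlib.sha256(
        f"{scanner_subscription}|{resource_group}".encode()
    ).hexdigest()
    return digest[:INSTALL_ID_LEN]


def state_storage_account_name(scanner_subscription: str, resource_group: str) -> str:
    """Hypothetical helper: build the default `datadog<install-id>` name."""
    return "datadog" + compute_install_id(scanner_subscription, resource_group)


if __name__ == "__main__":
    sub = "00000000-0000-0000-0000-000000000000"  # placeholder subscription ID
    # Same (subscription, resource group) pair -> same name on every run.
    print(state_storage_account_name(sub, "datadog-agentless-scanner"))
    # A different resource group resolves to a different Storage Account.
    print(state_storage_account_name(sub, "another-rg"))
```

The resulting name is 19 characters of lowercase letters and digits, which fits within the 24-character Storage Account name limit.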
8691

8792
**Custom storage account:** Set `TF_STATE_STORAGE_ACCOUNT` to use your own account:
8893
```bash
@@ -95,13 +100,14 @@ The custom storage account must already exist in `SCANNER_RESOURCE_GROUP`; the s
95100

96101
## What it does
97102

98-
1. **Validates prerequisites** - Checks Datadog credentials, Azure authentication, subscription access, required RBAC actions, and registers required resource providers
99-
2. **Creates state storage** - Ensures the resource group, Storage Account, and `tfstate` blob container exist, and grants the current user blob data access
100-
3. **Stores API key in Key Vault** - Creates an RBAC-authorized Key Vault (or recovers a soft-deleted one) and stores the Datadog API key as a secret
101-
4. **Generates Terraform configuration** - Creates `main.tf` referencing the `terraform-module-datadog-agentless-scanner` Azure sub-modules (managed identity, roles, custom data, virtual network, VMSS), one virtual network + VMSS per location
102-
5. **Runs Terraform** - Executes `terraform init` and `terraform apply`
103+
1. **Discovers existing deployment** - Lists resource groups in the scanner subscription tagged `DatadogAgentlessScanner=true`. If exactly one is found and `SCANNER_RESOURCE_GROUP` is unset, the deployment is silently reused; if it disagrees with an explicitly set `SCANNER_RESOURCE_GROUP`, deploy fails with guidance to either reuse it or destroy first; if more than one is found, deploy fails (single-install policy)
104+
2. **Validates prerequisites** - Checks Datadog credentials, Azure authentication, subscription access, required RBAC actions, and registers required resource providers
105+
3. **Creates state storage** - Ensures the resource group, Storage Account, and `tfstate` blob container exist, and grants the current user blob data access. The resource group is tagged with `Datadog=true` and `DatadogAgentlessScanner=true` only when the script creates it; resource groups that already exist (e.g., admin-pre-created) are left untagged so the marker never appears on resources the script does not own
106+
4. **Stores API key in Key Vault** - Creates an RBAC-authorized Key Vault (or recovers a soft-deleted one) and stores the Datadog API key as a secret
107+
5. **Generates Terraform configuration** - Creates `main.tf` referencing the `terraform-module-datadog-agentless-scanner` Azure sub-modules (managed identity, roles, custom data, virtual network, VMSS), one virtual network + VMSS per location
108+
6. **Runs Terraform** - Executes `terraform init` and `terraform apply`
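The discovery rules in step 1 can be sketched as a small decision function. This is an illustration under stated assumptions, not the script's actual code: `resolve_resource_group` and `DiscoveryError` are hypothetical names, resource group names are compared as exact strings, and the "no tagged deployment" branch (fall back to the env var, else the documented default) is inferred from the `SCANNER_RESOURCE_GROUP` table entry.

```python
from typing import Optional

DEFAULT_RESOURCE_GROUP = "datadog-agentless-scanner"


class DiscoveryError(Exception):
    """Hypothetical error type: deploy should stop rather than guess."""


def resolve_resource_group(
    tagged_groups: list[str], env_resource_group: Optional[str]
) -> str:
    """Pick the resource group to deploy into.

    `tagged_groups` stands for the names of resource groups in the scanner
    subscription that carry the DatadogAgentlessScanner=true tag.
    `env_resource_group` is SCANNER_RESOURCE_GROUP, or None when unset.
    """
    if len(tagged_groups) > 1:
        # Single-install policy: never pick one silently.
        names = ", ".join(sorted(tagged_groups))
        raise DiscoveryError(f"more than one tagged deployment found: {names}")
    if not tagged_groups:
        # Assumption: first deploy, so use the env var or the default name.
        return env_resource_group or DEFAULT_RESOURCE_GROUP
    existing = tagged_groups[0]
    if env_resource_group is None or env_resource_group == existing:
        return existing  # reuse the existing deployment
    raise DiscoveryError(
        f"SCANNER_RESOURCE_GROUP={env_resource_group!r} does not match the "
        f"existing deployment in {existing!r}; reuse it or run destroy first"
    )
```

For example, `resolve_resource_group(["rg-a"], None)` returns `"rg-a"`, while `resolve_resource_group(["rg-a"], "rg-b")` raises `DiscoveryError`.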
103109

104-
Deployment metadata (locations, subscriptions, resource group) is written to the state storage account after a successful apply so that later `deploy` runs can merge new inputs and `destroy` runs can recover the full configuration without local state.
110+
Deployment metadata (locations, subscriptions, resource group, `install-id`) is written to the state storage account after a successful apply so that later `deploy` runs can merge new inputs and `destroy` runs can recover the full configuration without local state.
105111

106112
## Resources Created
107113

@@ -115,6 +121,83 @@ Deployment metadata (locations, subscriptions, resource group) is written to the
115121
- **Scanned Subscriptions:**
116122
- Role assignments granting the scanner's managed identity the permissions needed to snapshot and read disks
117123

124+
## Permissions
125+
126+
The simplest path is to run the setup as `Owner` on the scanner subscription and on every scanned subscription. Many enterprise tenants instead pre-create the resource group and grant the engineer running the setup a least-privilege custom role; this section documents that delegated path.
127+
128+
The setup needs three independent grants:
129+
130+
1. **Scanner resource group** — write access to the resources created inside the RG (Storage Account, Key Vault, managed identity, VNets, VMSS) and the ability to grant the running user data-plane access on the SA and KV.
131+
2. **Scanner subscription** — read access for discovery + write access to create the custom scanning role definition at the subscription scope.
132+
3. **Each scanned subscription** — write access to attach the scanning role to the managed identity at the scan target's scope, plus the matching `roleDefinitions/write` so the custom role can declare the scan target in its `assignableScopes`.
133+
134+
### 1. Scanner resource group
135+
136+
Pre-create the resource group with the desired name and grant the engineer:
137+
138+
- `Contributor` on the RG — covers Storage Account, Key Vault, managed identity, virtual network, NAT gateway, and VMSS creation.
139+
- `User Access Administrator` on the RG — covers the `roleAssignments/write` needed by the script to grant itself `Storage Blob Data Contributor` on the state Storage Account and `Key Vault Secrets Officer` on the Key Vault.
140+
141+
The Terraform-state Storage Account is created **inside this RG** by default, so the engineer does not need any additional subscription-wide Storage permissions for state.
142+
143+
### 2. Scanner subscription — custom role for the engineer
144+
145+
Create the following custom role at the scanner subscription scope. It bundles every read action the setup performs at the subscription level plus the `roleDefinitions/write` introduced by the custom scanning role:
146+
147+
```json
148+
{
149+
"Name": "Datadog Agentless Scanner Deployer (scanner subscription)",
150+
"Description": "Permissions required by the engineer running the Datadog Agentless Scanner Azure setup on the scanner subscription.",
151+
"Actions": [
152+
"Microsoft.Resources/subscriptions/resourceGroups/read",
153+
"Microsoft.Resources/subscriptions/resourceProviders/read",
154+
"Microsoft.Resources/subscriptions/resourceProviders/register/action",
155+
"Microsoft.KeyVault/locations/deletedVaults/read",
156+
"Microsoft.Authorization/permissions/read",
157+
"Microsoft.Authorization/roleAssignments/read",
158+
"Microsoft.Authorization/roleAssignments/write",
159+
"Microsoft.Authorization/roleAssignments/delete",
160+
"Microsoft.Authorization/roleDefinitions/read",
161+
"Microsoft.Authorization/roleDefinitions/write",
162+
"Microsoft.Authorization/roleDefinitions/delete"
163+
],
164+
"AssignableScopes": [
165+
"/subscriptions/<scanner-subscription-id>"
166+
]
167+
}
168+
```
169+
170+
Notes:
171+
172+
- `resourceProviders/register/action` is only exercised when the required providers are not pre-registered. You can drop it from the role and register the providers manually instead (see the prerequisites list).
173+
- `subscriptions/resourceGroups/read` is needed for tag-based discovery of existing deployments (the `az group list --tag DatadogAgentlessScanner=true` lookup).
174+
- `roleDefinitions/write` is needed because the Terraform module creates a custom scanning role whose primary scope is the scanner subscription.
175+
176+
### 3. Each scanned subscription — custom role for the engineer
177+
178+
For every subscription listed in `SUBSCRIPTIONS_TO_SCAN` (other than the scanner subscription), create the same custom role at the scan-target scope with just the role-management actions:
179+
180+
```json
181+
{
182+
"Name": "Datadog Agentless Scanner Deployer (scan target)",
183+
"Description": "Permissions required by the engineer running the Datadog Agentless Scanner setup, on each scanned subscription.",
184+
"Actions": [
185+
"Microsoft.Authorization/permissions/read",
186+
"Microsoft.Authorization/roleAssignments/read",
187+
"Microsoft.Authorization/roleAssignments/write",
188+
"Microsoft.Authorization/roleAssignments/delete",
189+
"Microsoft.Authorization/roleDefinitions/read",
190+
"Microsoft.Authorization/roleDefinitions/write",
191+
"Microsoft.Authorization/roleDefinitions/delete"
192+
],
193+
"AssignableScopes": [
194+
"/subscriptions/<scan-target-subscription-id>"
195+
]
196+
}
197+
```
198+
199+
The same role is sufficient for both `deploy` and `destroy`. On `deploy`, the preflight will fail fast with a clear error listing the missing actions if any of the above are not granted; on `destroy`, missing actions surface during `terraform destroy` (no preflight is run).
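As a rough illustration of a preflight that lists the missing actions, the sketch below compares the actions from the scan-target role against a single set of granted `actions` / `notActions` patterns. It is a simplification and not the script's implementation: `is_granted` and `missing_actions` are hypothetical names, matching is done with case-insensitive `*` wildcards, and real Azure RBAC evaluation combines several role assignments.

```python
from fnmatch import fnmatchcase

# Actions taken from the scan-target custom role shown above.
REQUIRED_SCAN_TARGET_ACTIONS = [
    "Microsoft.Authorization/permissions/read",
    "Microsoft.Authorization/roleAssignments/read",
    "Microsoft.Authorization/roleAssignments/write",
    "Microsoft.Authorization/roleAssignments/delete",
    "Microsoft.Authorization/roleDefinitions/read",
    "Microsoft.Authorization/roleDefinitions/write",
    "Microsoft.Authorization/roleDefinitions/delete",
]


def is_granted(action: str, actions: list[str], not_actions: list[str]) -> bool:
    """True when `action` matches an allowed pattern and no excluded one."""
    name = action.lower()
    allowed = any(fnmatchcase(name, pattern.lower()) for pattern in actions)
    excluded = any(fnmatchcase(name, pattern.lower()) for pattern in not_actions)
    return allowed and not excluded


def missing_actions(required, actions, not_actions=()):
    """Return the required actions that are not granted, in input order."""
    return [
        action
        for action in required
        if not is_granted(action, list(actions), list(not_actions))
    ]


if __name__ == "__main__":
    # Contributor-like grant: everything except Authorization writes/deletes.
    gaps = missing_actions(
        REQUIRED_SCAN_TARGET_ACTIONS,
        actions=["*"],
        not_actions=[
            "Microsoft.Authorization/*/Write",
            "Microsoft.Authorization/*/Delete",
        ],
    )
    for action in gaps:
        print("missing:", action)
```

With a `Contributor`-like grant, the four `write` and `delete` actions are reported as missing, which is why the README pairs `Contributor` with `User Access Administrator` or a custom role.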
200+
118201
## Building
119202

120203
From the `azure/` directory:
38.2 KB
Binary file not shown.

azure/agentless/src/azure_agentless_setup/config.py

Lines changed: 64 additions & 13 deletions
Original file line number | Diff line number | Diff line change
@@ -3,8 +3,9 @@
33

44
"""Configuration parsing from environment variables."""
55

6+
import hashlib
67
import os
7-
from dataclasses import dataclass
8+
from dataclasses import dataclass, replace
89
from pathlib import Path
910
from typing import Optional
1011

@@ -17,10 +18,36 @@
1718

1819
DEFAULT_RESOURCE_GROUP = "datadog-agentless-scanner"
1920

21+
# Length of the install identifier embedded in resource names and local
22+
# paths. Truncated SHA-256 hex; 12 characters matches the existing storage
23+
# account / Key Vault name budget (24 chars, lowercase alphanumeric) and
24+
# keeps the install_id readable in log lines without padding everything.
25+
INSTALL_ID_LEN = 12
2026

21-
def get_config_dir(scanner_subscription: str) -> Path:
22-
"""Get the configuration directory for a scanner subscription."""
23-
return CONFIG_BASE_DIR / scanner_subscription
27+
28+
def compute_install_id(scanner_subscription: str, resource_group: str) -> str:
29+
"""Derive a stable per-install identifier from ``(subscription, RG)``.
30+
31+
Two deploys that share both the scanner subscription and the resource
32+
group resolve to the same install (same Storage Account, Key Vault,
33+
local working dir); changing the resource group produces a different
34+
install. The function is deterministic — no state is needed to
35+
recompute the identifier on a fresh shell.
36+
"""
37+
digest = hashlib.sha256(
38+
f"{scanner_subscription}|{resource_group}".encode()
39+
).hexdigest()
40+
return digest[:INSTALL_ID_LEN]
41+
42+
43+
def get_config_dir(scanner_subscription: str, install_id: str) -> Path:
44+
"""Return the per-install local working directory.
45+
46+
Nested under the subscription so we can keep enumerating subscriptions
47+
in destroy (one folder per scanner subscription) while reserving room
48+
for future multi-install support (one folder per install_id).
49+
"""
50+
return CONFIG_BASE_DIR / scanner_subscription / install_id
2451

2552

2653
@dataclass
@@ -42,6 +69,23 @@ class Config:
4269
# Optional: custom Azure Storage Account for Terraform state
4370
state_storage_account: Optional[str] = None
4471

72+
# Whether ``SCANNER_RESOURCE_GROUP`` was set by the user. Tag-based
73+
# discovery uses this to decide whether to reuse a tagged RG silently
74+
# (env var unset → reuse) or fail with a mismatch error (env var set
75+
# to a different value).
76+
resource_group_explicit: bool = False
77+
78+
@property
79+
def install_id(self) -> str:
80+
"""Stable per-install identifier derived from ``(subscription, RG)``.
81+
82+
Exposed as a property so callers don't have to keep the field in
83+
sync with ``resource_group``: ``with_resource_group`` is the only
84+
way the resource group changes after parsing and the new
85+
``install_id`` is recomputed automatically.
86+
"""
87+
return compute_install_id(self.scanner_subscription, self.resource_group)
88+
4589
@property
4690
def all_subscriptions(self) -> list[str]:
4791
"""All subscriptions including scanner subscription (deduplicated)."""
@@ -61,18 +105,22 @@ def scan_scopes(self) -> list[str]:
61105

62106
def with_merged(self, locations: list[str], subscriptions_to_scan: list[str]) -> "Config":
63107
"""Return a copy with merged locations and subscriptions."""
64-
return Config(
65-
api_key=self.api_key,
66-
app_key=self.app_key,
67-
site=self.site,
68-
workflow_id=self.workflow_id,
69-
scanner_subscription=self.scanner_subscription,
108+
return replace(
109+
self,
70110
locations=locations,
71111
subscriptions_to_scan=subscriptions_to_scan,
72-
resource_group=self.resource_group,
73-
state_storage_account=self.state_storage_account,
74112
)
75113

114+
def with_resource_group(self, resource_group: str) -> "Config":
115+
"""Return a copy targeting ``resource_group``.
116+
117+
Used by tag-based RG discovery to switch to an existing tagged
118+
resource group when the user did not pin one with
119+
``SCANNER_RESOURCE_GROUP``. ``install_id`` is recomputed
120+
automatically because it's a property.
121+
"""
122+
return replace(self, resource_group=resource_group)
123+
76124

77125
def parse_config() -> Config:
78126
"""Parse configuration from environment variables.
@@ -110,7 +158,9 @@ def parse_config() -> Config:
110158
if not subscriptions_str:
111159
errors.append("SUBSCRIPTIONS_TO_SCAN is required (comma-separated list)")
112160

113-
resource_group = os.environ.get("SCANNER_RESOURCE_GROUP", "").strip() or DEFAULT_RESOURCE_GROUP
161+
rg_env = os.environ.get("SCANNER_RESOURCE_GROUP", "").strip()
162+
resource_group = rg_env or DEFAULT_RESOURCE_GROUP
163+
resource_group_explicit = bool(rg_env)
114164

115165
if errors:
116166
usage = """
@@ -164,4 +214,5 @@ def parse_config() -> Config:
164214
subscriptions_to_scan=subscriptions_to_scan,
165215
resource_group=resource_group,
166216
state_storage_account=state_storage_account,
217+
resource_group_explicit=resource_group_explicit,
167218
)
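The design choice in this file (deriving `install_id` as a property and copying with `dataclasses.replace`) can be illustrated with a trimmed-down stand-in. `MiniConfig` is hypothetical and keeps only the fields needed for the demonstration; the real `Config` has more fields and delegates to `compute_install_id`.

```python
import hashlib
from dataclasses import dataclass, replace


@dataclass
class MiniConfig:
    """Trimmed-down stand-in for Config, for illustration only."""

    scanner_subscription: str
    resource_group: str = "datadog-agentless-scanner"
    resource_group_explicit: bool = False

    @property
    def install_id(self) -> str:
        # Derived on access, so it can never go stale after a copy.
        digest = hashlib.sha256(
            f"{self.scanner_subscription}|{self.resource_group}".encode()
        ).hexdigest()
        return digest[:12]

    def with_resource_group(self, resource_group: str) -> "MiniConfig":
        # replace() copies every other field unchanged.
        return replace(self, resource_group=resource_group)


if __name__ == "__main__":
    cfg = MiniConfig("00000000-0000-0000-0000-000000000000")  # placeholder ID
    moved = cfg.with_resource_group("existing-tagged-rg")
    print(cfg.install_id, moved.install_id)  # two different identifiers
```

Because `install_id` is not a stored field, `replace` has nothing to keep in sync: the copy reports the identifier for its own resource group, and the original object is left untouched.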
