Commit 6a23ff9

feat(google_bigquery_syndicated_dataset): initial revision (#444)
* feat(google_bigquery_syndicated_dataset): initial revision
* feat(google_bigquery_syndicated_dataset): support nonauthoritative config
1 parent cc953e0 commit 6a23ff9

6 files changed, +533 -0 lines changed
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
content: |-
  {{ .Header }}

  ## Example

  ```hcl
  module "treeherder" {
    source = "github.com/mozilla/terraform-modules//google_bigquery_syndicated_dataset?ref=main"

    dataset_id            = "for_treeherder_1"
    syndicated_dataset_id = "treeherder_db"
    realm                 = var.realm

    access = [
      { role = "OWNER", special_group = "projectOwners" },
      # projectReaders/projectWriters usage is discouraged, see DSRE-1497
      { role = "READER", special_group = "projectReaders" },
      { role = "WRITER", special_group = "projectWriters" },
    ]
  }
  ```

  {{ .Requirements }}

  {{ .Providers }}

  {{ .Modules }}

  {{ .Resources }}

  {{ .Inputs }}

  {{ .Outputs }}
Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@
<!-- BEGIN_TF_DOCS -->
# google\_bigquery\_syndicated\_dataset

Creates a BigQuery dataset configured for syndication to Mozilla Data Platform
infrastructure (mozdata and data-shared projects). This module is meant to
simplify the steps in [Importing Data from OLTP Databases to BigQuery via Federated Queries](https://mozilla-hub.atlassian.net/wiki/spaces/IP/pages/473727279/Importing+Data+from+OLTP+Databases+to+BigQuery+via+Federated+Queries).

This module abstracts away the syndication boilerplate:
- Resolves syndication service accounts via workgroup
- Looks up the org custom role for syndication
- Auto-discovers whether syndicated datasets exist in data platform projects
- Adds dataset authorizations only when targets exist

## Target Inference

The `syndicated_dataset_id` (defaults to `dataset_id`) determines targets:
- Does NOT end in `_syndicate` → user-facing → both mozdata and data-shared
- Ends in `_syndicate` → data-shared only
- Eventually the syndication datasets themselves will be inferred from bqetl metadata available to all MozCloud tenant infrastructure

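As an illustration of the suffix rule, here is a minimal sketch of a module call whose syndicated name ends in `_syndicate`, so that only data-shared would be targeted. The module label and the `treeherder_db_syndicate` name are hypothetical and only mirror the naming used in this README's example.

```hcl
# Hypothetical call: the "_syndicate" suffix means mozdata is not targeted.
module "treeherder_syndicate" {
  source = "github.com/mozilla/terraform-modules//google_bigquery_syndicated_dataset?ref=main"

  dataset_id            = "for_treeherder_1"
  syndicated_dataset_id = "treeherder_db_syndicate" # hypothetical name
  realm                 = var.realm
}
```

Omitting `syndicated_dataset_id` would fall back to `dataset_id`, which here does not end in `_syndicate`, so both mozdata and data-shared would be considered.
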
## State propagation

While this module reduces the number of PRs required to set up syndication, it will not automatically
propagate those changes. You still need to follow the steps at
https://mozilla-hub.atlassian.net/wiki/spaces/SRE/pages/27924945/Atlantis+-+Terraform+Automation#Invoking-Atlantis-without-terraform-changes
to authorize datasets on the tenant infra side. Eventually policy-as-code and drift
detection automation will make these manual steps unnecessary.

## Example

```hcl
module "treeherder" {
  source = "github.com/mozilla/terraform-modules//google_bigquery_syndicated_dataset?ref=main"

  dataset_id            = "for_treeherder_1"
  syndicated_dataset_id = "treeherder_db"
  realm                 = var.realm

  access = [
    { role = "OWNER", special_group = "projectOwners" },
    # projectReaders/projectWriters usage is discouraged, see DSRE-1497
    { role = "READER", special_group = "projectReaders" },
    { role = "WRITER", special_group = "projectWriters" },
  ]
}
```
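
The non-authoritative configuration from the second commit bullet can be sketched in the same way. This is an assumption-laden example, not a documented recipe: the module label is hypothetical, the dataset is assumed to be managed elsewhere, and the realm values `"nonprod"` and `"prod"` are inferred from the `target_realm` input description.

```hcl
# Hypothetical call: manage only syndication access on an existing dataset.
module "treeherder_syndication_only" {
  source = "github.com/mozilla/terraform-modules//google_bigquery_syndicated_dataset?ref=main"

  dataset_id     = "for_treeherder_1" # assumed to exist already, managed elsewhere
  create_dataset = false

  # Assumed realm values: a nonprod source syndicating to prod targets.
  realm        = "nonprod"
  target_realm = "prod"
}
```

Judging by the `main.tf` in this commit, `access` is only consumed by the `google_bigquery_dataset` resource, so passing it together with `create_dataset = false` would likely have no effect.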

## Requirements

| Name | Version |
|------|---------|
| <a name="requirement_terraform"></a> [terraform](#requirement\_terraform) | >= 1.0 |
| <a name="requirement_google"></a> [google](#requirement\_google) | >= 4.0 |

## Providers

| Name | Version |
|------|---------|
| <a name="provider_google"></a> [google](#provider\_google) | >= 4.0 |
| <a name="provider_terraform"></a> [terraform](#provider\_terraform) | n/a |

## Modules

| Name | Source | Version |
|------|--------|---------|
| <a name="module_syndication_workgroup"></a> [syndication\_workgroup](#module\_syndication\_workgroup) | github.com/mozilla/terraform-modules//mozilla_workgroup | main |

## Resources

| Name | Type |
|------|------|
| [google_bigquery_dataset.dataset](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigquery_dataset) | resource |
| [google_bigquery_dataset_access.syndicated_authorization](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigquery_dataset_access) | resource |
| [google_bigquery_dataset_access.syndication_role](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigquery_dataset_access) | resource |
| [terraform_remote_state.org](https://registry.terraform.io/providers/hashicorp/terraform/latest/docs/data-sources/remote_state) | data source |
| [terraform_remote_state.syndication_target](https://registry.terraform.io/providers/hashicorp/terraform/latest/docs/data-sources/remote_state) | data source |

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_access"></a> [access](#input\_access) | Application-specific access blocks for this dataset. projectOwners OWNER access is included by default unless disable\_project\_owners\_access is set. | <pre>set(object({<br/> role = optional(string)<br/> user_by_email = optional(string)<br/> group_by_email = optional(string)<br/> special_group = optional(string)<br/> domain = optional(string)<br/> iam_member = optional(string)<br/> dataset = optional(object({<br/> dataset = object({<br/> project_id = string<br/> dataset_id = string<br/> })<br/> target_types = list(string)<br/> }))<br/> view = optional(object({<br/> project_id = string<br/> dataset_id = string<br/> table_id = string<br/> }))<br/> }))</pre> | `[]` | no |
| <a name="input_create_dataset"></a> [create\_dataset](#input\_create\_dataset) | Whether to create the BigQuery dataset. Set to false to only manage syndication access on an existing dataset. | `bool` | `true` | no |
| <a name="input_dataset_id"></a> [dataset\_id](#input\_dataset\_id) | A unique ID for this dataset, without the project name. | `string` | n/a | yes |
| <a name="input_default_partition_expiration_ms"></a> [default\_partition\_expiration\_ms](#input\_default\_partition\_expiration\_ms) | The default partition expiration for all partitioned tables, in milliseconds. | `number` | `null` | no |
| <a name="input_default_table_expiration_ms"></a> [default\_table\_expiration\_ms](#input\_default\_table\_expiration\_ms) | The default lifetime of all tables in the dataset, in milliseconds. | `number` | `null` | no |
| <a name="input_delete_contents_on_destroy"></a> [delete\_contents\_on\_destroy](#input\_delete\_contents\_on\_destroy) | If true, delete all tables in the dataset when destroying the resource. | `bool` | `false` | no |
| <a name="input_description"></a> [description](#input\_description) | A user-friendly description of the dataset. | `string` | `null` | no |
| <a name="input_disable_project_owners_access"></a> [disable\_project\_owners\_access](#input\_disable\_project\_owners\_access) | Disable the implied projectOwners OWNER access on this dataset. This should almost never be set. | `bool` | `false` | no |
| <a name="input_friendly_name"></a> [friendly\_name](#input\_friendly\_name) | A descriptive name for the dataset. | `string` | `null` | no |
| <a name="input_labels"></a> [labels](#input\_labels) | Labels to apply to the dataset. | `map(string)` | `{}` | no |
| <a name="input_location"></a> [location](#input\_location) | The geographic location where the dataset should reside. | `string` | `"US"` | no |
| <a name="input_max_time_travel_hours"></a> [max\_time\_travel\_hours](#input\_max\_time\_travel\_hours) | Defines the time travel window in hours. | `number` | `null` | no |
| <a name="input_realm"></a> [realm](#input\_realm) | Source infrastructure realm. | `string` | n/a | yes |
| <a name="input_syndicated_dataset_id"></a> [syndicated\_dataset\_id](#input\_syndicated\_dataset\_id) | Name of the dataset in target projects. Defaults to dataset\_id. If name ends in '\_syndicate', only data-shared is targeted (no mozdata). | `string` | `null` | no |
| <a name="input_syndication_workgroup_ids"></a> [syndication\_workgroup\_ids](#input\_syndication\_workgroup\_ids) | Workgroup identifiers for service accounts that perform syndication. | `list(string)` | <pre>[<br/> "workgroup:dataplatform/jenkins"<br/>]</pre> | no |
| <a name="input_target_realm"></a> [target\_realm](#input\_target\_realm) | Target realm for syndication. Defaults to realm. Set this to override the target, e.g. for a nonprod source syndicating to prod targets. | `string` | `null` | no |

## Outputs

| Name | Description |
|------|-------------|
| <a name="output_dataset_id"></a> [dataset\_id](#output\_dataset\_id) | The dataset ID. |
| <a name="output_id"></a> [id](#output\_id) | The fully-qualified dataset ID (projects/PROJECT/datasets/DATASET). |
| <a name="output_self_link"></a> [self\_link](#output\_self\_link) | The URI of the created resource. |
| <a name="output_syndication_role_id"></a> [syndication\_role\_id](#output\_syndication\_role\_id) | The custom role ID used for syndication access. |
| <a name="output_syndication_service_accounts"></a> [syndication\_service\_accounts](#output\_syndication\_service\_accounts) | The service account emails used for syndication. |
| <a name="output_syndication_targets_active"></a> [syndication\_targets\_active](#output\_syndication\_targets\_active) | Map of syndication target names to whether authorized dataset access is active. |
<!-- END_TF_DOCS -->
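
Because authorizations are only added once the targets exist, a caller may want to see which targets are still inactive. The following is a minimal sketch, assuming a module call labelled `treeherder` as in the example and assuming `syndication_targets_active` is a map of target name to boolean, as the Outputs table describes. The local and output names are hypothetical.

```hcl
locals {
  # Names of targets whose authorized dataset access is not yet in place.
  inactive_syndication_targets = [
    for name, active in module.treeherder.syndication_targets_active : name
    if !active
  ]
}

# Hypothetical output: a non-empty list suggests another plan/apply is still
# needed after the target dataset has been created.
output "inactive_syndication_targets" {
  description = "Syndication targets that are not yet authorized on this dataset."
  value       = local.inactive_syndication_targets
}
```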
Lines changed: 218 additions & 0 deletions
@@ -0,0 +1,218 @@
/**
 * # google_bigquery_syndicated_dataset
 *
 * Creates a BigQuery dataset configured for syndication to Mozilla Data Platform
 * infrastructure (mozdata and data-shared projects). This module is meant to
 * simplify the steps in [Importing Data from OLTP Databases to BigQuery via Federated Queries](https://mozilla-hub.atlassian.net/wiki/spaces/IP/pages/473727279/Importing+Data+from+OLTP+Databases+to+BigQuery+via+Federated+Queries).
 *
 * This module abstracts away the syndication boilerplate:
 * - Resolves syndication service accounts via workgroup
 * - Looks up the org custom role for syndication
 * - Auto-discovers whether syndicated datasets exist in data platform projects
 * - Adds dataset authorizations only when targets exist
 *
 * ## Target Inference
 *
 * The `syndicated_dataset_id` (defaults to `dataset_id`) determines targets:
 * - Does NOT end in `_syndicate` → user-facing → both mozdata and data-shared
 * - Ends in `_syndicate` → data-shared only
 * - Eventually the syndication datasets themselves will be inferred from bqetl metadata available to all MozCloud tenant infrastructure
 *
 * ## State propagation
 *
 * While this module reduces the number of PRs required to set up syndication, it will not automatically
 * propagate those changes. You still need to follow the steps at
 * https://mozilla-hub.atlassian.net/wiki/spaces/SRE/pages/27924945/Atlantis+-+Terraform+Automation#Invoking-Atlantis-without-terraform-changes
 * to authorize datasets on the tenant infra side. Eventually policy-as-code and drift
 * detection automation will make these manual steps unnecessary.
 *
 */

locals {
  target_realm          = coalesce(var.target_realm, var.realm)
  syndicated_dataset_id = coalesce(var.syndicated_dataset_id, var.dataset_id)
  is_user_facing        = !endswith(local.syndicated_dataset_id, "_syndicate")

  target_env = local.target_realm == "prod" ? "prod" : "stage"

  # Syndication target configuration: data-shared always, mozdata only for user-facing datasets
  target_config = merge(
    {
      data-shared = {
        project_ids = { prod = "moz-fx-data-shared-prod", nonprod = "moz-fx-data-shar-nonprod-efed" }
        state_path  = "bigquery-new"
      }
    },
    local.is_user_facing ? {
      mozdata = {
        project_ids = { prod = "mozdata", nonprod = "mozdata-nonprod" }
        state_path  = "bigquery"
      }
    } : {}
  )

  targets = {
    for name, cfg in local.target_config :
    name => {
      project_id   = cfg.project_ids[local.target_realm]
      state_prefix = "projects/${name}/${local.target_realm}/envs/${local.target_env}/${cfg.state_path}"
    }
  }
}

# Remote state from syndication targets to check if datasets exist
data "terraform_remote_state" "syndication_target" {
  for_each = local.targets

  backend = "gcs"

  config = {
    bucket = "${each.value.project_id}-tf"
    prefix = each.value.state_prefix
  }
}

locals {
  # Authorized dataset access for targets where the syndicated dataset exists
  syndication_dataset_access = [
    for name, target in local.targets : {
      project_id = target.project_id
      dataset_id = local.syndicated_dataset_id
    }
    if contains(
      values(data.terraform_remote_state.syndication_target[name].outputs.syndicate_datasets),
      local.syndicated_dataset_id
    )
  ]
}

data "terraform_remote_state" "org" {
  backend = "gcs"

  config = {
    bucket = "moz-fx-platform-mgmt-global-tf"
    prefix = "projects/org"
  }
}

# Service accounts that perform syndication
# Currently Jenkins with plans to move to Airflow, see https://mozilla-hub.atlassian.net/browse/SVCSE-3005
module "syndication_workgroup" {
  source = "github.com/mozilla/terraform-modules//mozilla_workgroup?ref=main"
  ids    = var.syndication_workgroup_ids
  # TODO this config will need to be removed when SVCSE-4008 is complete
  terraform_remote_state_bucket = "moz-fx-data-terraform-state-global"
  terraform_remote_state_prefix = "projects/data-shared/global/access-groups"
}

resource "google_bigquery_dataset" "dataset" {
  count = var.create_dataset ? 1 : 0

  dataset_id                      = var.dataset_id
  location                        = var.location
  friendly_name                   = var.friendly_name
  description                     = var.description
  labels                          = var.labels
  default_table_expiration_ms     = var.default_table_expiration_ms
  default_partition_expiration_ms = var.default_partition_expiration_ms
  max_time_travel_hours           = var.max_time_travel_hours
  delete_contents_on_destroy      = var.delete_contents_on_destroy

  # projectOwners access is implied unless explicitly disabled
  dynamic "access" {
    for_each = var.disable_project_owners_access ? [] : [1]
    content {
      role          = "OWNER"
      special_group = "projectOwners"
    }
  }

  # App-specific IAM access
  dynamic "access" {
    for_each = [for a in var.access : a if a.role != null && a.dataset == null && a.view == null]
    content {
      role           = access.value.role
      user_by_email  = access.value.user_by_email
      group_by_email = access.value.group_by_email
      special_group  = access.value.special_group
      domain         = access.value.domain
      iam_member     = access.value.iam_member
    }
  }

  # App-specific non-syndicate authorized dataset access
  dynamic "access" {
    for_each = [for a in var.access : a if a.dataset != null]
    content {
      dataset {
        dataset {
          project_id = access.value.dataset.dataset.project_id
          dataset_id = access.value.dataset.dataset.dataset_id
        }
        target_types = access.value.dataset.target_types
      }
    }
  }

  # App-specific authorized views
  dynamic "access" {
    for_each = [for a in var.access : a if a.view != null]
    content {
      view {
        project_id = access.value.view.project_id
        dataset_id = access.value.view.dataset_id
        table_id   = access.value.view.table_id
      }
    }
  }

  # Syndication service account access
  dynamic "access" {
    for_each = module.syndication_workgroup.service_accounts
    content {
      role          = data.terraform_remote_state.org.outputs.bigquery_jobs_manage_syndicate_dataset_role_id
      user_by_email = access.value
    }
  }

  # Syndication authorized dataset access for syndicates
  dynamic "access" {
    for_each = local.syndication_dataset_access
    content {
      dataset {
        dataset {
          project_id = access.value.project_id
          dataset_id = access.value.dataset_id
        }
        target_types = ["VIEWS"]
      }
    }
  }
}

# Non-authoritative syndication access for externally-managed datasets
resource "google_bigquery_dataset_access" "syndication_role" {
  for_each = var.create_dataset ? {} : {
    for sa in module.syndication_workgroup.service_accounts : sa => sa
  }

  dataset_id    = var.dataset_id
  role          = data.terraform_remote_state.org.outputs.bigquery_jobs_manage_syndicate_dataset_role_id
  user_by_email = each.value
}

resource "google_bigquery_dataset_access" "syndicated_authorization" {
  for_each = var.create_dataset ? {} : {
    for entry in local.syndication_dataset_access : "${entry.project_id}/${entry.dataset_id}" => entry
  }

  dataset_id = var.dataset_id

  dataset {
    dataset {
      project_id = each.value.project_id
      dataset_id = each.value.dataset_id
    }
    target_types = ["VIEWS"]
  }
}
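
The auto-discovery step depends on each target's remote state exposing a `syndicate_datasets` output whose values are dataset IDs, since `main.tf` calls `values(...)` on it and checks membership with `contains`. The target-side configuration is not part of this commit, so the following is only a sketch of an output that would satisfy that lookup. The map keys and the `treeherder_db` entry are hypothetical.

```hcl
# For target_realm = "prod", the locals in main.tf resolve the data-shared lookup to:
#   bucket = "moz-fx-data-shared-prod-tf"
#   prefix = "projects/data-shared/prod/envs/prod/bigquery-new"
#
# Hypothetical output in the state stored at that location. Only the map
# values are compared against syndicated_dataset_id; the keys are illustrative.
output "syndicate_datasets" {
  value = {
    treeherder_db = "treeherder_db"
  }
}
```

If the target state has no such output, the `outputs.syndicate_datasets` reference would presumably fail at plan time rather than evaluate to an empty list, which is worth keeping in mind when adding new targets.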