Merged
Changes from 15 commits
23 commits
8b7c0f5
Update document data extraction module
laurencegoolsby Dec 12, 2025
d488d37
Update per PR feedback.
laurencegoolsby Dec 16, 2025
1a96c19
Update per PR feedback.
laurencegoolsby Dec 17, 2025
2459d71
Fix override_configuration variable naming
laurencegoolsby Dec 17, 2025
57f7673
Update infra/app-flask/service/main.tf, infra/app-flask/service/outpu…
laurencegoolsby Dec 19, 2025
ddf24c6
Update per PR feedback.
laurencegoolsby Dec 22, 2025
07e4928
Rename bucket_policy_arns to data_access_policy_arns
laurencegoolsby Jan 5, 2026
84b07a3
Address PR feedback
laurencegoolsby Jan 8, 2026
50a2bbf
Add missing document data extraction infrastructure configuration
laurencegoolsby Jan 8, 2026
22097c6
Update document data extraction configuration per PR feedback
laurencegoolsby Jan 9, 2026
6f46a1d
Rename dde_profile_arn to bda_profile_arn
laurencegoolsby Jan 9, 2026
153c233
Fix DDE S3 environment variables to use S3 URI format
laurencegoolsby Jan 10, 2026
d919bd1
Address PR feedback - add comments, GovCloud TODO, update bucket naming
laurencegoolsby Jan 22, 2026
436c674
Update DDE and storage modules
laurencegoolsby Jan 23, 2026
a8c76aa
Add InvokeDataAutomationAsync permission and profile resource access
laurencegoolsby Jan 23, 2026
5eb248f
Refactor BDA blueprint configuration and remove unused permissions
laurencegoolsby Jan 27, 2026
1c2397c
Merge branch 'main' into lgoolsby/add-bedrock-data-automation
laurencegoolsby Feb 20, 2026
ce9d7d5
Add required_providers blocks to storage and document-data-extraction…
laurencegoolsby Feb 20, 2026
627ce48
Merge branch 'lgoolsby/add-bedrock-data-automation' of https://github…
laurencegoolsby Feb 20, 2026
6164c7e
Update storage module encryption.tf and main.tf to match main
laurencegoolsby Feb 20, 2026
2d8671a
Update document_data_extraction.tf to match platform-test
laurencegoolsby Feb 20, 2026
f41b2ff
Update README.md
laurencegoolsby Feb 23, 2026
b63cc32
Update README
laurencegoolsby Feb 23, 2026
192 changes: 192 additions & 0 deletions infra/modules/document-data-extraction/resources/README.md
@@ -0,0 +1,192 @@
# Bedrock Data Automation Terraform Module

This module provisions Amazon Bedrock Data Automation (BDA) resources: the data automation project, custom blueprints, and an IAM policy for accessing the project.


## Overview

The module creates:
- **Bedrock Data Automation Project** - Main project resource for data automation workflows
- **Bedrock Blueprints** - Custom extraction blueprints configured via a map

## Important Notes

- **BDA uses its own internal service role** - This module does not create a custom IAM role for BDA. Bedrock Data Automation uses an AWS-managed internal service role for S3 access.
- **S3 bucket encryption** - S3 buckets used with BDA should use S3-managed encryption (SSE-S3, `AES256`), not customer-managed KMS keys.
- **Lambda permissions** - Any Lambda function invoking BDA must have S3 permissions for both input and output buckets directly attached to its execution role.
Contributor

Not sure why this is if BDA is actually accessing the buckets?

Contributor Author

@doshitan The Lambdas need S3 permissions because the DocumentAI Lambda uploads the input document to S3, a separate Lambda retrieves the output from S3, and another Lambda invokes BDA.

Updated Lambda permission verbiage accordingly.

Contributor

Ah, but this is true of anything wanting to use DDE, not just an AWS Lambda right?

And a service wanting to use BDA could theoretically not have write access to the input S3 bucket, but as long as it can construct the S3 URI to pass in as a parameter to BDA, it could still use the module, right? An unlikely setup, but this module doesn't really make any assumptions about the caller like this? (BDA itself just needs to be able to read/write the input/output locations.)

- **No bucket policies needed** - BDA does not require bucket policies allowing the `bedrock.amazonaws.com` service principal.
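
The encryption and caller-permission notes above could look roughly like the following sketch. The bucket name, the policy name, and the `caller_role_name` variable are hypothetical and not part of this module; an output bucket would be configured the same way.

```hcl
# Hypothetical: name of the IAM role used by whatever service calls BDA.
variable "caller_role_name" {
  type = string
}

# Hypothetical input bucket, for illustration only.
resource "aws_s3_bucket" "dde_input" {
  bucket = "my-app-prod-dde-input"
}

# SSE-S3 (AES256) instead of a customer-managed KMS key.
resource "aws_s3_bucket_server_side_encryption_configuration" "dde_input" {
  bucket = aws_s3_bucket.dde_input.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

# Object-level access for the caller, attached directly to its role.
data "aws_iam_policy_document" "dde_bucket_access" {
  statement {
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["${aws_s3_bucket.dde_input.arn}/*"]
  }
}

resource "aws_iam_role_policy" "caller_s3_access" {
  name   = "my-app-prod-dde-s3-access"
  role   = var.caller_role_name
  policy = data.aws_iam_policy_document.dde_bucket_access.json
}
```

No bucket policy for `bedrock.amazonaws.com` appears in the sketch, in line with the last note above.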

## Features
- Creates resources required for Bedrock Data Automation workflows
- Uses a `name` variable to prefix all resource names for uniqueness and consistency
- Supports both standard and custom output configurations
- Flexible blueprint creation through a map of blueprint definitions
- Complies with Checkov recommendations for security and compliance
- Designed for cross-layer usage (see project module conventions)

## Usage

```hcl
module "bedrock_data_automation" {
  source = "../../modules/document-data-extraction/resources"

  name = "my-app-prod"

  blueprints_map = {
    invoice = {
      schema = file("${path.module}/schemas/invoice.json")
      type   = "DOCUMENT"
      tags = {
        Environment = "production"
        ManagedBy   = "terraform"
      }
    }
  }

  standard_output_configuration = {
    document = {
      extraction = {
        granularity = {
          types = ["PAGE", "ELEMENT"]
        }
      }
    }
  }

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
```
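
When there are several schema files, the map can be built rather than written by hand. The following is a minimal sketch, assuming one JSON schema per blueprint in a `schemas/` directory (a hypothetical layout) and that every blueprint is of type `DOCUMENT`:

```hcl
locals {
  # Assumed location of the blueprint schema files.
  schema_dir = "${path.module}/schemas"

  # One entry per *.json file, keyed by the file name without its extension.
  blueprints_map = {
    for f in fileset(local.schema_dir, "*.json") :
    trimsuffix(f, ".json") => {
      schema = file("${local.schema_dir}/${f}")
      type   = "DOCUMENT"
      tags   = {}
    }
  }
}

module "bedrock_data_automation" {
  source = "../../modules/document-data-extraction/resources"

  name           = "my-app-prod"
  blueprints_map = local.blueprints_map
}
```

Because the module names each blueprint `${var.name}-${each.key}`, a file named `invoice.json` would yield a blueprint named `my-app-prod-invoice`.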

## Inputs

### Required Variables

| Name | Description | Type | Required |
|-------|-------------|------|----------|
| `name` | Prefix to use for resource names (e.g., "my-app-prod") | `string` | yes |
| `blueprints_map` | Map of unique blueprints with keys as blueprint identifiers and values as blueprint objects | `map(object)` | yes |

#### `blueprints_map` Object Structure
```hcl
{
  schema = string      # JSON schema defining the extraction structure
  type   = string      # Blueprint type (e.g., "DOCUMENT")
  tags   = map(string) # Resource tags as key-value pairs
}
```
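
As an alternative to loading the schema with `file()`, it can be written inline with `jsonencode`. The sketch below is illustrative only: the field name, class, and instruction text are made up, and the schema keys (`class`, `inferenceType`, `instruction`) follow the blueprint format as we understand it from the AWS documentation, so verify them against the current BDA blueprint reference before relying on them.

```hcl
locals {
  # Illustrative schema; the field and its instruction are hypothetical.
  invoice_schema = jsonencode({
    "$schema"   = "http://json-schema.org/draft-07/schema#"
    description = "Invoice with a total amount"
    class       = "invoice"
    type        = "object"
    definitions = {}
    properties = {
      invoice_total = {
        type          = "number"
        inferenceType = "explicit"
        instruction   = "The total amount due on the invoice"
      }
    }
  })

  blueprints_map = {
    invoice = {
      schema = local.invoice_schema
      type   = "DOCUMENT"
      tags   = { ManagedBy = "terraform" }
    }
  }
}
```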

### Optional Variables

| Name | Description | Type | Default |
|------|-------------|------|---------|
| `standard_output_configuration` | Standard output configuration for extraction | `object` | `null` |
| `override_configuration` | Override configuration for standard BDA behavior | `string` | `null` |
| `tags` | Resource tags as key-value pairs | `map(string)` | `{}` |
| `aws_managed_blueprints` | ARNs of AWS-managed blueprints to attach to the project at the `LIVE` stage (see `main.tf`) | `list(string)` | `null` |


#### `standard_output_configuration` Object Structure

Complex nested object supporting extraction configuration for audio, document, image, and video content types. Each content type supports:
- **extraction** - Category, bounding box, and granularity configuration
- **generative_field** - State and types for generative AI fields
- **output_format** (document only) - Additional file format and text format settings

See `variables.tf` for complete structure details.

## Outputs

| Name | Description |
|------|-------------|
| `bda_project_arn` | The ARN of the Bedrock Data Automation project |
| `access_policy_arn` | The ARN of the IAM policy for accessing the Bedrock Data Automation project |
| `bda_profile_arn` | The profile ARN for cross-region inference |
| `bda_blueprint_arns` | List of created blueprint ARNs |
| `bda_blueprint_names` | List of created blueprint names |
| `bda_blueprint_arn_to_name` | Map of blueprint ARNs to names |
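
A caller would typically attach the access policy to its own role and pass the project and profile ARNs to the application. The following is a sketch under assumptions: the `app_role_name` variable and the environment variable names are hypothetical, and `module.bedrock_data_automation` refers to the module block from the Usage section.

```hcl
# Hypothetical: name of the IAM role used by the calling service.
variable "app_role_name" {
  type = string
}

# Grant the caller access to the BDA project and blueprints.
resource "aws_iam_role_policy_attachment" "bda_access" {
  role       = var.app_role_name
  policy_arn = module.bedrock_data_automation.access_policy_arn
}

locals {
  # Hypothetical environment variable names for the application.
  dde_environment_variables = {
    DDE_PROJECT_ARN = module.bedrock_data_automation.bda_project_arn
    DDE_PROFILE_ARN = module.bedrock_data_automation.bda_profile_arn
  }
}
```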

## Resources Created

- `awscc_bedrock_data_automation_project.bda_project` - Main BDA project
- `awscc_bedrock_blueprint.bda_blueprint` - One or more blueprints (created from `blueprints_map`)
- `aws_iam_policy.bedrock_access` - IAM policy for accessing the BDA project (exposed as `access_policy_arn`)

## Project Conventions

- All resource names are prefixed with `var.name`
- For cross-layer modules, use the interface/data/resources pattern as described in project documentation
- Write code that complies with Checkov recommendations
- Follow Terraform best practices for naming and organization

## File Structure

- `main.tf` - Resource definitions
- `variables.tf` - Input variable definitions
- `outputs.tf` - Output values
- `providers.tf` - Provider configuration
- `README.md` - This documentation

## Examples

### Minimal Configuration
```hcl
module "bedrock_data_automation" {
  source = "../../modules/document-data-extraction/resources"

  name = "my-app"

  blueprints_map = {} # No custom blueprints
}
```

### With Standard Output Configuration
```hcl
module "bedrock_data_automation" {
  source = "../../modules/document-data-extraction/resources"

  name           = "my-app"
  blueprints_map = { /* ... */ }

  standard_output_configuration = {
    document = {
      extraction = {
        bounding_box = {
          state = "ENABLED"
        }
        granularity = {
          types = ["PAGE", "ELEMENT", "LINE"]
        }
      }
      generative_field = {
        state = "ENABLED"
      }
      output_format = {
        text_format = {
          types = ["MARKDOWN", "HTML"]
        }
      }
    }
    image = {
      extraction = {
        category = {
          state = "ENABLED"
          # Image category types are CONTENT_MODERATION, TEXT_DETECTION, and LOGOS
          types = ["TEXT_DETECTION", "LOGOS"]
        }
      }
    }
  }
}
```

## Prerequisites

- AWS provider configured
- AWS Cloud Control provider (awscc) configured
- Appropriate AWS permissions to create Bedrock and IAM resources
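
A root module that satisfies these prerequisites might declare the providers as in the sketch below. The region is an assumption, and the `aws` provider is left unpinned because this README does not state a version for it; the `awscc` constraint matches the module's `required_providers` block.

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
    awscc = {
      source  = "hashicorp/awscc"
      version = ">= 1.63.0"
    }
  }
}

# Assumed region; both providers should target the same region.
provider "aws" {
  region = "us-east-1"
}

provider "awscc" {
  region = "us-east-1"
}
```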

## References

- [AWS Bedrock Data Automation](https://docs.aws.amazon.com/bedrock/latest/userguide/data-automation.html)
- [Project Terraform Conventions](../../../../.github/copilot-instructions.md)
- [Checkov Documentation](https://www.checkov.io/)
@@ -0,0 +1,26 @@
resource "aws_iam_policy" "bedrock_access" {
  name   = "${var.name}-access"
  policy = data.aws_iam_policy_document.bedrock_access.json
}

data "aws_iam_policy_document" "bedrock_access" {
  statement {
    actions = [
      "bedrock:InvokeModel",
      "bedrock:InvokeModelWithResponseStream",
      "bedrock:InvokeDataAutomationAsync",
      "bedrock:GetDataAutomationProject",
      "bedrock:GetBlueprint",
      "bedrock:StartDataAutomationJob",
      "bedrock:GetDataAutomationJob",
      "bedrock:ListDataAutomationJobs"
    ]
    effect = "Allow"
    resources = [
      awscc_bedrock_data_automation_project.bda_project.project_arn,
      "${awscc_bedrock_data_automation_project.bda_project.project_arn}/*",
      "arn:aws:bedrock:*:*:blueprint/*",
      "arn:aws:bedrock:*:*:data-automation-profile/*"
    ]
  }
}
44 changes: 44 additions & 0 deletions infra/modules/document-data-extraction/resources/main.tf
@@ -0,0 +1,44 @@
locals {
  # convert standard terraform tags to bedrock data automation format
  bda_tags = [
    for key, value in var.tags : {
      key   = key
      value = value
    }
  ]

  all_blueprints = concat(
    # custom blueprints created from json schemas
    [for k, v in awscc_bedrock_blueprint.bda_blueprint : {
      blueprint_arn   = v.blueprint_arn
      blueprint_stage = v.blueprint_stage
    }],
    # aws managed blueprints referenced by arn
    var.aws_managed_blueprints != null ? [
      for arn in var.aws_managed_blueprints : {
        blueprint_arn   = arn
        blueprint_stage = "LIVE"
      }
    ] : []
  )
}

resource "awscc_bedrock_data_automation_project" "bda_project" {
  project_name                  = "${var.name}-project"
  project_description           = "Project for ${var.name}"
  tags                          = local.bda_tags
  standard_output_configuration = var.standard_output_configuration
  custom_output_configuration = length(local.all_blueprints) > 0 ? {
    blueprints = local.all_blueprints
  } : null
  override_configuration = var.override_configuration
}

resource "awscc_bedrock_blueprint" "bda_blueprint" {
  for_each = var.blueprints_map

  blueprint_name = "${var.name}-${each.key}"
  schema         = each.value.schema
  type           = each.value.type
  tags           = local.bda_tags
}
40 changes: 40 additions & 0 deletions infra/modules/document-data-extraction/resources/outputs.tf
@@ -0,0 +1,40 @@
data "aws_region" "current" {}
data "aws_caller_identity" "current" {}

output "access_policy_arn" {
  description = "The ARN of the IAM policy for accessing the Bedrock Data Automation project"
  value       = aws_iam_policy.bedrock_access.arn
}

output "bda_project_arn" {
  description = "The ARN of the Bedrock Data Automation project"
  value       = awscc_bedrock_data_automation_project.bda_project.project_arn
}

# Amazon Bedrock Data Automation requires cross-Region inference when processing
# files. The following link lists the profile ARNs for the available inference
# profiles:
# https://docs.aws.amazon.com/bedrock/latest/userguide/bda-cris.html
# TODO(https://github.com/navapbc/template-infra/issues/993) Add GovCloud Support
output "bda_profile_arn" {
  description = "The profile ARN associated with the BDA project"
  value       = "arn:aws:bedrock:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:data-automation-profile/us.data-automation-v1"
}

output "bda_blueprint_arns" {
  description = "List of created blueprint ARNs"
  value = [
    for key, bp in awscc_bedrock_blueprint.bda_blueprint : bp.blueprint_arn
  ]
}

output "bda_blueprint_names" {
  description = "List of created blueprint names"
  value = [
    for key, bp in awscc_bedrock_blueprint.bda_blueprint : bp.blueprint_name
  ]
}

output "bda_blueprint_arn_to_name" {
  description = "Map of blueprint ARNs to names"
  value = {
    for key, bp in awscc_bedrock_blueprint.bda_blueprint : bp.blueprint_arn => bp.blueprint_name
  }
}
@@ -0,0 +1,8 @@
terraform {
  required_providers {
    awscc = {
      source  = "hashicorp/awscc"
      version = ">= 1.63.0"
    }
  }
}