Add Document Data Extraction Module #989
Changes from 15 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,192 @@ | ||
| # Bedrock Data Automation Terraform Module | ||
|
|
||
| This module provisions Amazon Bedrock Data Automation (BDA) resources: a data automation project, its blueprints, and an IAM policy for accessing them. | ||
|
|
||
|
|
||
| ## Overview | ||
|
|
||
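| The module creates: | ||
| - **Bedrock Data Automation Project** - Main project resource for data automation workflows | ||
| - **Bedrock Blueprints** - Custom extraction blueprints configured via a map | ||
| - **IAM Access Policy** - Policy granting access to the project and blueprints (exposed as `access_policy_arn`) | ||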
|
|
||
| ## Important Notes | ||
|
|
||
| - **BDA uses its own internal service role** - This module does not create a custom IAM role for BDA. Bedrock Data Automation uses an AWS-managed internal service role for S3 access. | ||
| - **S3 bucket encryption** - S3 buckets used with BDA should use AWS-managed encryption (AES256), not customer-managed KMS keys. | ||
| - **Lambda permissions** - Any Lambda function invoking BDA must have S3 permissions for both input and output buckets directly attached to its execution role. | ||
|
||
| - **No bucket policies needed** - BDA does not require bucket policies allowing the `bedrock.amazonaws.com` service principal. | ||
|
||
|
|
||
| ## Features | ||
| - Creates resources required for Bedrock Data Automation workflows | ||
| - Uses a `name` variable to prefix all resource names for uniqueness and consistency | ||
| - Supports both standard and custom output configurations | ||
| - Flexible blueprint creation through a map of blueprint definitions | ||
| - Complies with Checkov recommendations for security and compliance | ||
| - Designed for cross-layer usage (see project module conventions) | ||
|
|
||
| ## Usage | ||
|
|
||
| ```hcl | ||
| module "bedrock_data_automation" { | ||
| source = "../../modules/document-data-extraction/resources" | ||
|
|
||
| name = "my-app-prod" | ||
|
|
||
| blueprints_map = { | ||
| invoice = { | ||
| schema = file("${path.module}/schemas/invoice.json") | ||
| type = "DOCUMENT" | ||
| tags = { | ||
| Environment = "production" | ||
| ManagedBy = "terraform" | ||
| } | ||
|
||
| } | ||
| } | ||
|
|
||
| standard_output_configuration = { | ||
| document = { | ||
| extraction = { | ||
| granularity = { | ||
| types = ["PAGE", "ELEMENT"] | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| tags = { | ||
| Environment = "production" | ||
| ManagedBy = "terraform" | ||
|
||
| } | ||
| } | ||
| ``` | ||
|
|
||
| ## Inputs | ||
|
|
||
| ### Required Variables | ||
|
|
||
| | Name | Description | Type | Required | | ||
| |-------|-------------|------|----------| | ||
| | `name` | Prefix to use for resource names (e.g., "my-app-prod") | `string` | yes | | ||
| | `blueprints_map` | Map of unique blueprints with keys as blueprint identifiers and values as blueprint objects | `map(object)` | yes | | ||
|
|
||
| #### `blueprints_map` Object Structure | ||
| ```hcl | ||
| { | ||
| schema = string # JSON schema defining the extraction structure | ||
| type = string # Blueprint type (e.g., "DOCUMENT") | ||
| tags = map(string) # Resource tags as key-value pairs | ||
| } | ||
| ``` | ||
|
|
||
| ### Optional Variables | ||
|
|
||
| | Name | Description | Type | Default | | ||
| |------|-------------|------|---------| | ||
| | `standard_output_configuration` | Standard output configuration for extraction | `object` | `null` | | ||
| | `override_configuration` | Override configuration for standard BDA behavior | `string` | `null` | | ||
| | `tags` | Resource tags as key-value pairs | `map(string)` | `{}` | | ||
|
|
||
|
|
||
| #### `standard_output_configuration` Object Structure | ||
|
|
||
| Complex nested object supporting extraction configuration for audio, document, image, and video content types. Each content type supports: | ||
| - **extraction** - Category, bounding box, and granularity configuration | ||
| - **generative_field** - State and types for generative AI fields | ||
| - **output_format** (document only) - Additional file format and text format settings | ||
|
|
||
| See `variables.tf` for complete structure details. | ||
|
|
||
| ## Outputs | ||
|
|
||
| | Name | Description | | ||
| |------|-------------| | ||
| | `bda_project_arn` | The ARN of the Bedrock Data Automation project | | ||
| | `access_policy_arn` | The ARN of the IAM policy for accessing the Bedrock Data Automation project | | ||
| | `bda_profile_arn` | The profile ARN for cross-region inference | | ||
| | `bda_blueprint_arns` | List of created blueprint ARNs | | ||
| | `bda_blueprint_names` | List of created blueprint names | | ||
| | `bda_blueprint_arn_to_name` | Map of blueprint ARNs to names | | ||
|
|
||
| ## Resources Created | ||
|
|
||
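| - `awscc_bedrock_data_automation_project.bda_project` - Main BDA project | ||
| - `awscc_bedrock_blueprint.bda_blueprint` - One or more blueprints (created from `blueprints_map`) | ||
| - `aws_iam_policy.bedrock_access` - IAM policy for accessing the BDA project (exposed as `access_policy_arn`) | ||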
|
|
||
| ## Project Conventions | ||
|
|
||
| - All resource names are prefixed with `var.name` | ||
| - For cross-layer modules, use the interface/data/resources pattern as described in project documentation | ||
| - Write code that complies with Checkov recommendations | ||
| - Follow Terraform best practices for naming and organization | ||
|
|
||
| ## File Structure | ||
|
|
||
| - `main.tf` - Resource definitions | ||
| - `variables.tf` - Input variable definitions | ||
| - `outputs.tf` - Output values | ||
| - `providers.tf` - Provider configuration | ||
| - `README.md` - This documentation | ||
|
|
||
| ## Examples | ||
|
|
||
| ### Minimal Configuration | ||
| ```hcl | ||
| module "bedrock_data_automation" { | ||
| source = "../../modules/document-data-extraction/resources" | ||
|
|
||
| name = "my-app" | ||
|
|
||
| blueprints_map = {} # No custom blueprints | ||
| } | ||
| ``` | ||
|
|
||
| ### With Standard Output Configuration | ||
| ```hcl | ||
| module "bedrock_data_automation" { | ||
| source = "../../modules/document-data-extraction/resources" | ||
|
|
||
| name = "my-app" | ||
| blueprints_map = { /* ... */ } | ||
|
|
||
| standard_output_configuration = { | ||
| document = { | ||
| extraction = { | ||
| bounding_box = { | ||
| state = "ENABLED" | ||
| } | ||
| granularity = { | ||
| types = ["PAGE", "ELEMENT", "LINE"] | ||
| } | ||
| } | ||
| generative_field = { | ||
| state = "ENABLED" | ||
| } | ||
| output_format = { | ||
| text_format = { | ||
| types = ["MARKDOWN", "HTML"] | ||
| } | ||
| } | ||
| } | ||
| image = { | ||
| extraction = { | ||
| category = { | ||
| state = "ENABLED" | ||
| types = ["TABLES", "CHARTS"] | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| - AWS provider configured | ||
| - AWS Cloud Control provider (awscc) configured | ||
| - Appropriate AWS permissions to create Bedrock and IAM resources | ||
|
|
||
| ## References | ||
|
|
||
| - [AWS Bedrock Data Automation](https://docs.aws.amazon.com/bedrock/latest/userguide/data-automation.html) | ||
| - [Project Terraform Conventions](../../../../.github/copilot-instructions.md) | ||
| - [Checkov Documentation](https://www.checkov.io/) | ||
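
The "Lambda permissions" note in the README can be made concrete with a consumer-side sketch. This is a minimal illustration and not part of the PR: the role name, bucket ARNs, resource labels, and the exact S3 actions are assumptions; only `access_policy_arn` and the module `source` path come from the README above.

```hcl
# Hypothetical consumer configuration. Names, bucket ARNs, and S3 actions
# are illustrative assumptions, not values taken from this PR.
module "bedrock_data_automation" {
  source = "../../modules/document-data-extraction/resources"

  name           = "my-app"
  blueprints_map = {}
}

# Trust policy that lets Lambda assume the execution role.
data "aws_iam_policy_document" "lambda_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "extractor" {
  name               = "my-app-extractor" # hypothetical role name
  assume_role_policy = data.aws_iam_policy_document.lambda_assume.json
}

# Attach the access policy exported by the module.
resource "aws_iam_role_policy_attachment" "bda_access" {
  role       = aws_iam_role.extractor.name
  policy_arn = module.bedrock_data_automation.access_policy_arn
}

# S3 permissions attached directly to the execution role, covering both
# the input and the output bucket (bucket names are placeholders).
data "aws_iam_policy_document" "bda_buckets" {
  statement {
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::my-app-input/*"]
  }

  statement {
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::my-app-output/*"]
  }
}

resource "aws_iam_role_policy" "bda_buckets" {
  name   = "bda-buckets"
  role   = aws_iam_role.extractor.id
  policy = data.aws_iam_policy_document.bda_buckets.json
}
```

Keeping the S3 statements on the execution role, rather than in a bucket policy, follows the README's notes that BDA needs no `bedrock.amazonaws.com` bucket policy.
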
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,26 @@ | ||
| resource "aws_iam_policy" "bedrock_access" { | ||
| name = "${var.name}-access" | ||
| policy = data.aws_iam_policy_document.bedrock_access.json | ||
| } | ||
|
|
||
| data "aws_iam_policy_document" "bedrock_access" { | ||
| statement { | ||
| actions = [ | ||
| "bedrock:InvokeModel", | ||
| "bedrock:InvokeModelWithResponseStream", | ||
|
||
| "bedrock:InvokeDataAutomationAsync", | ||
| "bedrock:GetDataAutomationProject", | ||
| "bedrock:GetBlueprint", | ||
| "bedrock:StartDataAutomationJob", | ||
| "bedrock:GetDataAutomationJob", | ||
| "bedrock:ListDataAutomationJobs" | ||
| ] | ||
| effect = "Allow" | ||
| resources = [ | ||
| awscc_bedrock_data_automation_project.bda_project.project_arn, | ||
| "${awscc_bedrock_data_automation_project.bda_project.project_arn}/*", | ||
| "arn:aws:bedrock:*:*:blueprint/*", | ||
| "arn:aws:bedrock:*:*:data-automation-profile/*" | ||
| ] | ||
| } | ||
| } | ||
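
The policy above allows access to every blueprint in any account and Region through `arn:aws:bedrock:*:*:blueprint/*`. If tighter scoping is wanted, one option is to list only the blueprints the project actually uses. The fragment below is a sketch meant to live inside the module, not the PR's implementation: the names `blueprint_arns` and `bedrock_access_scoped` are hypothetical, the action list is trimmed for brevity, and it assumes `var.aws_managed_blueprints` is a nullable list of ARN strings, as its use in the project resource suggests.

```hcl
locals {
  # Custom blueprints created by this module plus any AWS-managed
  # blueprint ARNs passed in by the caller (assumed to be list(string)).
  blueprint_arns = concat(
    [for bp in awscc_bedrock_blueprint.bda_blueprint : bp.blueprint_arn],
    var.aws_managed_blueprints != null ? var.aws_managed_blueprints : []
  )
}

# Hypothetical, narrower variant of the access policy document.
data "aws_iam_policy_document" "bedrock_access_scoped" {
  statement {
    effect = "Allow"
    actions = [
      "bedrock:InvokeDataAutomationAsync",
      "bedrock:GetDataAutomationProject",
      "bedrock:GetBlueprint",
    ]
    resources = concat(
      [awscc_bedrock_data_automation_project.bda_project.project_arn],
      local.blueprint_arns,
      # Left as a wildcard: cross-Region inference may route to profiles
      # in other Regions.
      ["arn:aws:bedrock:*:*:data-automation-profile/*"]
    )
  }
}
```
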
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,44 @@ | ||
| locals { | ||
| # convert standard terraform tags to bedrock data automation format | ||
| bda_tags = [ | ||
| for key, value in var.tags : { | ||
| key = key | ||
| value = value | ||
| } | ||
| ] | ||
|
|
||
| all_blueprints = concat( | ||
| # custom blueprints created from json schemas | ||
| [for k, v in awscc_bedrock_blueprint.bda_blueprint : { | ||
| blueprint_arn = v.blueprint_arn | ||
| blueprint_stage = v.blueprint_stage | ||
| }], | ||
| # aws managed blueprints referenced by arn | ||
| var.aws_managed_blueprints != null ? [ | ||
| for arn in var.aws_managed_blueprints : { | ||
| blueprint_arn = arn | ||
| blueprint_stage = "LIVE" | ||
| } | ||
| ] : [] | ||
| ) | ||
| } | ||
|
|
||
| resource "awscc_bedrock_data_automation_project" "bda_project" { | ||
| project_name = "${var.name}-project" | ||
| project_description = "Project for ${var.name}" | ||
| tags = local.bda_tags | ||
| standard_output_configuration = var.standard_output_configuration | ||
| custom_output_configuration = length(local.all_blueprints) > 0 ? { | ||
| blueprints = local.all_blueprints | ||
| } : null | ||
| override_configuration = var.override_configuration | ||
| } | ||
|
|
||
| resource "awscc_bedrock_blueprint" "bda_blueprint" { | ||
| for_each = var.blueprints_map | ||
|
|
||
| blueprint_name = "${var.name}-${each.key}" | ||
| schema = each.value.schema | ||
| type = each.value.type | ||
| tags = local.bda_tags | ||
| } |
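
To see what the two locals above evaluate to, the conversion logic can be run on its own without any provider. The following stand-alone sketch copies the expressions from the diff; the variable defaults and the declaration of `aws_managed_blueprints` are assumptions, since `variables.tf` is not shown here.

```hcl
# Stand-alone sketch: defaults are illustrative, not taken from the PR.
variable "tags" {
  type = map(string)
  default = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

# Assumed declaration; the diff references this variable but does not
# show how it is defined.
variable "aws_managed_blueprints" {
  type    = list(string)
  default = null
}

locals {
  # map(string) -> list of { key, value } objects, the tag shape the
  # awscc Bedrock resources receive in the diff above.
  bda_tags = [
    for key, value in var.tags : {
      key   = key
      value = value
    }
  ]

  # A null input yields an empty list; otherwise every ARN is wrapped
  # with the "LIVE" stage, mirroring the expression in the diff.
  managed_blueprints = var.aws_managed_blueprints != null ? [
    for arn in var.aws_managed_blueprints : {
      blueprint_arn   = arn
      blueprint_stage = "LIVE"
    }
  ] : []
}

output "bda_tags" {
  value = local.bda_tags
}

output "managed_blueprints" {
  value = local.managed_blueprints
}
```

With the defaults shown, `bda_tags` is a two-element list (keys `Environment` and `ManagedBy`, in that order, because Terraform iterates map keys lexically) and `managed_blueprints` is an empty list. In the module, an empty combined blueprint list is what makes `custom_output_configuration` resolve to `null`.
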
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,40 @@ | ||
| data "aws_region" "current" {} | ||
| data "aws_caller_identity" "current" {} | ||
|
|
||
| output "access_policy_arn" { | ||
| description = "The ARN of the IAM policy for accessing the Bedrock Data Automation project" | ||
| value = aws_iam_policy.bedrock_access.arn | ||
| } | ||
|
|
||
| output "bda_project_arn" { | ||
| description = "The ARN of the Bedrock Data Automation project" | ||
| value = awscc_bedrock_data_automation_project.bda_project.project_arn | ||
| } | ||
|
|
||
| # Bedrock Data Automation requires cross-Region inference when processing | ||
| # files. The page linked below lists the profile ARNs for the different | ||
| # inference profiles. | ||
| # https://docs.aws.amazon.com/bedrock/latest/userguide/bda-cris.html | ||
| # TODO(https://github.com/navapbc/template-infra/issues/993) Add GovCloud Support | ||
| output "bda_profile_arn" { | ||
| description = "The profile ARN associated with the BDA project" | ||
| value = "arn:aws:bedrock:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:data-automation-profile/us.data-automation-v1" | ||
|
||
| } | ||
|
|
||
| output "bda_blueprint_arns" { | ||
| description = "List of created blueprint ARNs" | ||
| value = [ | ||
| for key, bp in awscc_bedrock_blueprint.bda_blueprint : bp.blueprint_arn | ||
| ] | ||
| } | ||
|
|
||
| output "bda_blueprint_names" { | ||
| description = "List of created blueprint names" | ||
| value = [ | ||
| for key, bp in awscc_bedrock_blueprint.bda_blueprint : bp.blueprint_name | ||
| ] | ||
| } | ||
|
|
||
| output "bda_blueprint_arn_to_name" { | ||
| description = "Map of blueprint ARNs to names" | ||
| value = { | ||
| for key, bp in awscc_bedrock_blueprint.bda_blueprint : bp.blueprint_arn => bp.blueprint_name | ||
| } | ||
| } | ||
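
The `bda_profile_arn` output is a plain string interpolation rather than a resource attribute, so its result can be worked out by hand. The sketch below substitutes literal placeholder values (a Region and a dummy account ID, both assumptions) for the two data sources to show the shape of the result.

```hcl
locals {
  example_region     = "us-east-1"    # placeholder Region
  example_account_id = "123456789012" # placeholder account ID

  # Same template as the bda_profile_arn output, with literals in place
  # of data.aws_region and data.aws_caller_identity.
  example_profile_arn = "arn:aws:bedrock:${local.example_region}:${local.example_account_id}:data-automation-profile/us.data-automation-v1"
}

output "example_profile_arn" {
  # Evaluates to:
  # arn:aws:bedrock:us-east-1:123456789012:data-automation-profile/us.data-automation-v1
  value = local.example_profile_arn
}
```

The `aws` partition and the `us.` profile prefix are hard-coded in the template, which is the limitation the GovCloud TODO in the diff refers to.
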
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| terraform { | ||
| required_providers { | ||
| awscc = { | ||
| source = "hashicorp/awscc" | ||
| version = ">= 1.63.0" | ||
| } | ||
| } | ||
| } |
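
The module also uses `aws_iam_policy`, `aws_iam_policy_document`, `aws_region`, and `aws_caller_identity`, which belong to the `hashicorp/aws` provider, while the block above declares only `awscc`. Terraform resolves `hashicorp/aws` implicitly, but declaring it makes the dependency explicit. A possible variant is sketched below; the `aws` version constraint is an assumption and is not taken from this PR.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0" # assumed constraint; align with the project's baseline
    }
    awscc = {
      source  = "hashicorp/awscc"
      version = ">= 1.63.0"
    }
  }
}
```
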