Add Document Data Extraction Module #989
Merged
Changes from 12 commits
23 commits
8b7c0f5
Update document data extraction module
laurencegoolsby d488d37
Update per PR feedback.
laurencegoolsby 1a96c19
Update per PR feedback.
laurencegoolsby 2459d71
Fix override_configuration variable naming
laurencegoolsby 57f7673
Update infra/app-flask/service/main.tf, infra/app-flask/service/outpu…
laurencegoolsby ddf24c6
Update per PR feedback.
laurencegoolsby 07e4928
Rename bucket_policy_arns to data_access_policy_arns
laurencegoolsby 84b07a3
Address PR feedback
laurencegoolsby 50a2bbf
Add missing document data extraction infrastructure configuration
laurencegoolsby 22097c6
Update document data extraction configuration per PR feedback
laurencegoolsby 6f46a1d
Rename dde_profile_arn to bda_profile_arn
laurencegoolsby 153c233
Fix DDE S3 environment variables to use S3 URI format
laurencegoolsby d919bd1
Address PR feedback - add comments, GovCloud TODO, update bucket naming
laurencegoolsby 436c674
Update DDE and storage modules
laurencegoolsby a8c76aa
Add InvokeDataAutomationAsync permission and profile resource access
laurencegoolsby 5eb248f
Refactor BDA blueprint configuration and remove unused permissions
laurencegoolsby 1c2397c
Merge branch 'main' into lgoolsby/add-bedrock-data-automation
laurencegoolsby ce9d7d5
Add required_providers blocks to storage and document-data-extraction…
laurencegoolsby 627ce48
Merge branch 'lgoolsby/add-bedrock-data-automation' of https://github…
laurencegoolsby 6164c7e
Update storage module encryption.tf and main.tf to match main
laurencegoolsby 2d8671a
Update document_data_extraction.tf to match platform-test
laurencegoolsby f41b2ff
Update README.md
laurencegoolsby b63cc32
Update README
laurencegoolsby
159 changes: 159 additions & 0 deletions
infra/modules/document-data-extraction/resources/README.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,159 @@ | ||
| # Bedrock Data Automation Terraform Module | ||
|
|
||
| This module provisions AWS Bedrock Data Automation resources: the data automation project, its blueprints, and an IAM role for accessing the input and output S3 buckets. | ||
|
|
||
| ## Overview | ||
|
|
||
| The module creates: | ||
| - **Bedrock Data Automation Project** - Main project resource for data automation workflows | ||
| - **Bedrock Blueprints** - Custom extraction blueprints configured via a map | ||
| - **IAM Role** - Role for the Bedrock service to assume, with access to the input/output S3 buckets | ||
|
|
||
| ## Features | ||
| - Creates resources required for Bedrock Data Automation workflows | ||
| - Uses a `name` variable to prefix all resource names for uniqueness and consistency | ||
| - Supports both standard and custom output configurations | ||
| - Flexible blueprint creation through a map of blueprint definitions | ||
| - Complies with Checkov recommendations for security and compliance | ||
| - Designed for cross-layer usage (see project module conventions) | ||
|
|
||
| ## Inputs | ||
|
|
||
| ### Required Variables | ||
|
|
||
| | Name | Description | Type | Required | | ||
| |-------|-------------|------|----------| | ||
| | `name` | Prefix to use for resource names (e.g., "my-app-prod") | `string` | yes | | ||
| | `data_access_policy_arns` | Map of policy ARNs for input and output locations to attach to the BDA role | `map(string)` | yes | | ||
| | `blueprints_map` | Map of unique blueprints with keys as blueprint identifiers and values as blueprint objects | `map(object)` | yes | | ||
|
|
||
| #### `blueprints_map` Object Structure | ||
| ```hcl | ||
| { | ||
| schema = string # JSON schema defining the extraction structure | ||
| type = string # Blueprint type (e.g., "DOCUMENT") | ||
| tags = map(string) # Resource tags as key-value pairs | ||
| } | ||
| ``` | ||
|
|
||
| ### Optional Variables | ||
|
|
||
| | Name | Description | Type | Default | | ||
| |------|-------------|------|---------| | ||
| | `standard_output_configuration` | Standard output configuration for extraction | `object` | `null` | | ||
| | `override_configuration` | Override configuration for standard BDA behavior | `string` | `null` | | ||
| | `tags` | Resource tags as key-value pairs | `map(string)` | `{}` | | ||
|
|
||
|
|
||
| #### `standard_output_configuration` Object Structure | ||
|
|
||
| Complex nested object supporting extraction configuration for audio, document, image, and video content types. Each content type supports: | ||
| - **extraction** - Category, bounding box, and granularity configuration | ||
| - **generative_field** - State and types for generative AI fields | ||
| - **output_format** (document only) - Additional file format and text format settings | ||
|
|
||
| See `variables.tf` for complete structure details. | ||
|
|
||
| ## Outputs | ||
|
|
||
| | Name | Description | | ||
| |------|-------------| | ||
| | `bda_project_arn` | The ARN of the Bedrock Data Automation project | | ||
| | `bda_role_name` | The name of the IAM role used by Bedrock Data Automation | | ||
| | `bda_role_arn` | The ARN of the IAM role used by Bedrock Data Automation | | ||
| | `access_policy_arn` | The ARN of the IAM policy for accessing the Bedrock Data Automation project | | ||
|
|
||
|
|
||
| ## Resources Created | ||
|
|
||
| - `awscc_bedrock_data_automation_project.bda_project` - Main BDA project | ||
| - `awscc_bedrock_blueprint.bda_blueprint` - One or more blueprints (created from `blueprints_map`) | ||
| - `aws_iam_role.bda_role` - IAM role for Bedrock service | ||
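| - `aws_iam_role_policy_attachment.role_policy_attachments` - Policy attachments for S3 access | ||
| - `aws_iam_policy.bedrock_access` - IAM policy for accessing the BDA project (exposed as `access_policy_arn`) | ||
| - `aws_kms_key.bedrock_data_automation` - KMS key used to encrypt the project and blueprints | ||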
|
|
||
| ## Project Conventions | ||
|
|
||
| - All resource names are prefixed with `var.name` | ||
| - For cross-layer modules, use the interface/data/resources pattern as described in project documentation | ||
| - Write code that complies with Checkov recommendations | ||
| - Follow Terraform best practices for naming and organization | ||
|
|
||
| ## File Structure | ||
|
|
||
| - `main.tf` - Resource definitions | ||
| - `variables.tf` - Input variable definitions | ||
| - `outputs.tf` - Output values | ||
| - `providers.tf` - Provider configuration | ||
| - `README.md` - This documentation | ||
|
|
||
| ## Examples | ||
|
|
||
| ### Minimal Configuration | ||
| ```hcl | ||
| module "bedrock_data_automation" { | ||
| source = "../../modules/document-data-extraction/resources" | ||
|
|
||
| name = "my-app" | ||
|
|
||
| data_access_policy_arns = { | ||
| input = aws_iam_policy.input.arn | ||
| output = aws_iam_policy.output.arn | ||
| } | ||
|
|
||
| blueprints_map = {} # No custom blueprints | ||
| } | ||
| ``` | ||
|
|
||
| ### With Standard Output Configuration | ||
| ```hcl | ||
| module "bedrock_data_automation" { | ||
| source = "../../modules/document-data-extraction/resources" | ||
|
|
||
| name = "my-app" | ||
| data_access_policy_arns = { /* ... */ } | ||
| blueprints_map = { /* ... */ } | ||
|
|
||
| standard_output_configuration = { | ||
| document = { | ||
| extraction = { | ||
| bounding_box = { | ||
| state = "ENABLED" | ||
| } | ||
| granularity = { | ||
| types = ["PAGE", "ELEMENT", "LINE"] | ||
| } | ||
| } | ||
| generative_field = { | ||
| state = "ENABLED" | ||
| } | ||
| output_format = { | ||
| text_format = { | ||
| types = ["MARKDOWN", "HTML"] | ||
| } | ||
| } | ||
| } | ||
| image = { | ||
| extraction = { | ||
| category = { | ||
| state = "ENABLED" | ||
| types = ["TABLES", "CHARTS"] | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| - AWS provider configured | ||
| - AWS Cloud Control provider (awscc) configured | ||
| - Appropriate AWS permissions to create Bedrock and IAM resources | ||
| - KMS keys | ||
| - S3 bucket policies defined for input/output buckets | ||
|
|
||
| ## References | ||
|
|
||
| - [AWS Bedrock Data Automation](https://docs.aws.amazon.com/bedrock/latest/userguide/data-automation.html) | ||
| - [Project Terraform Conventions](../../../../.github/copilot-instructions.md) | ||
| - [Checkov Documentation](https://www.checkov.io/) |
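The README examples above leave `blueprints_map` empty or elided. The following is a minimal sketch of a populated map, assuming the object structure documented in the README (`schema`, `type`, `tags`). The `invoice` key, the `blueprints/invoice.json` path, and the `aws_iam_policy.input` / `aws_iam_policy.output` resources are hypothetical and are not part of this PR.

```hcl
module "bedrock_data_automation" {
  source = "../../modules/document-data-extraction/resources"

  name = "my-app"

  data_access_policy_arns = {
    input  = aws_iam_policy.input.arn  # hypothetical policy defined by the caller
    output = aws_iam_policy.output.arn # hypothetical policy defined by the caller
  }

  blueprints_map = {
    # The map key becomes part of the blueprint name: "my-app-invoice".
    invoice = {
      # Hypothetical file holding the blueprint's JSON schema.
      schema = file("${path.module}/blueprints/invoice.json")
      type   = "DOCUMENT"
      tags   = {}
    }
  }
}
```

Keeping the schema in a separate JSON file is one option; passing an inline `jsonencode(...)` value should work equally well, since the variable only expects a string.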
22 changes: 22 additions & 0 deletions
infra/modules/document-data-extraction/resources/access_control.tf
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,22 @@ | ||
| resource "aws_iam_policy" "bedrock_access" { | ||
| name = "${var.name}-access" | ||
| policy = data.aws_iam_policy_document.bedrock_access.json | ||
| } | ||
|
|
||
| data "aws_iam_policy_document" "bedrock_access" { | ||
| statement { | ||
| actions = [ | ||
| "bedrock:InvokeModel", | ||
| "bedrock:InvokeModelWithResponseStream", | ||
| "bedrock:GetDataAutomationProject", | ||
| "bedrock:StartDataAutomationJob", | ||
| "bedrock:GetDataAutomationJob", | ||
| "bedrock:ListDataAutomationJobs" | ||
| ] | ||
| effect = "Allow" | ||
| resources = [ | ||
| awscc_bedrock_data_automation_project.bda_project.project_arn, | ||
| "${awscc_bedrock_data_automation_project.bda_project.project_arn}/*" | ||
| ] | ||
| } | ||
| } | ||
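For context, a calling layer would typically attach this policy to the role of whatever service invokes the BDA project. The sketch below assumes the module is instantiated as `module.bedrock_data_automation` (as in the README examples) and that the caller already manages a role named `aws_iam_role.app`; both names are hypothetical here.

```hcl
# Grants the application's role the Bedrock permissions defined in
# aws_iam_policy.bedrock_access, via the module's access_policy_arn output.
resource "aws_iam_role_policy_attachment" "app_bda_access" {
  role       = aws_iam_role.app.name # hypothetical role owned by the caller
  policy_arn = module.bedrock_data_automation.access_policy_arn
}
```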
5 changes: 5 additions & 0 deletions
infra/modules/document-data-extraction/resources/encryption.tf
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| resource "aws_kms_key" "bedrock_data_automation" { | ||
| description = "KMS key for Bedrock Data Automation ${var.name}" | ||
| deletion_window_in_days = 10 | ||
| enable_key_rotation = true | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,64 @@ | ||
| locals { | ||
| # convert standard terraform tags to bedrock data automation format | ||
| bda_tags = [ | ||
| for key, value in var.tags : { | ||
| key = key | ||
| value = value | ||
| } | ||
| ] | ||
|
|
||
| kms_encryption_context = { | ||
| Environment = lookup(var.tags, "environment", "unknown") | ||
| } | ||
| } | ||
|
|
||
| resource "awscc_bedrock_data_automation_project" "bda_project" { | ||
| project_name = "${var.name}-project" | ||
| project_description = "Project for ${var.name}" | ||
| kms_encryption_context = local.kms_encryption_context | ||
| kms_key_id = aws_kms_key.bedrock_data_automation.arn | ||
| tags = local.bda_tags | ||
| standard_output_configuration = var.standard_output_configuration | ||
| custom_output_configuration = { | ||
| blueprints = [for k, v in awscc_bedrock_blueprint.bda_blueprint : { | ||
| blueprint_arn = v.blueprint_arn | ||
| blueprint_stage = v.blueprint_stage | ||
| }] | ||
| } | ||
| override_configuration = var.override_configuration | ||
| } | ||
|
|
||
| resource "awscc_bedrock_blueprint" "bda_blueprint" { | ||
| for_each = var.blueprints_map | ||
|
|
||
| blueprint_name = "${var.name}-${each.key}" | ||
| schema = each.value.schema | ||
| type = each.value.type | ||
| kms_encryption_context = local.kms_encryption_context | ||
| kms_key_id = aws_kms_key.bedrock_data_automation.arn | ||
| tags = local.bda_tags | ||
| } | ||
|
|
||
| resource "aws_iam_role" "bda_role" { | ||
| name = "${var.name}-bda_role" | ||
|
|
||
| assume_role_policy = jsonencode({ | ||
| Version = "2012-10-17" | ||
| Statement = [ | ||
| { | ||
| Effect = "Allow" | ||
| Principal = { | ||
| Service = "bedrock.amazonaws.com" | ||
| } | ||
| Action = "sts:AssumeRole" | ||
| } | ||
| ] | ||
| }) | ||
| } | ||
|
|
||
| resource "aws_iam_role_policy_attachment" "role_policy_attachments" { | ||
| for_each = var.data_access_policy_arns | ||
|
|
||
| role = aws_iam_role.bda_role.name | ||
| policy_arn = each.value | ||
| } |
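To make the `locals` block above easier to review, here is a small standalone sketch of the same two expressions evaluated against a fixed tag map. The `example_*` names and the tag values are hypothetical; only the expression shapes are taken from the diff.

```hcl
locals {
  # Hypothetical input standing in for var.tags.
  example_tags = {
    environment = "dev"
    project     = "demo"
  }

  # Same shape as local.bda_tags: converts a map of tags into a list of
  # { key, value } objects. Terraform iterates map keys in lexical order.
  example_bda_tags = [
    for key, value in local.example_tags : {
      key   = key
      value = value
    }
  ]

  # Same shape as local.kms_encryption_context. The lookup key is
  # case-sensitive, so only a lowercase "environment" tag is picked up.
  example_kms_encryption_context = {
    Environment = lookup(local.example_tags, "environment", "unknown")
  }
}

# Evaluates to:
# [
#   { key = "environment", value = "dev" },
#   { key = "project", value = "demo" },
# ]
output "example_bda_tags" {
  value = local.example_bda_tags
}

# Evaluates to: { Environment = "dev" }
output "example_kms_encryption_context" {
  value = local.example_kms_encryption_context
}
```

One consequence worth noting: a caller that tags resources with `Environment` (capitalized) rather than `environment` would get `"unknown"` in the encryption context.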
39 changes: 39 additions & 0 deletions
infra/modules/document-data-extraction/resources/outputs.tf
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,39 @@ | ||
| data "aws_region" "current" {} | ||
| data "aws_caller_identity" "current" {} | ||
|
|
||
| output "access_policy_arn" { | ||
| description = "The ARN of the IAM policy for accessing the Bedrock Data Automation project" | ||
| value = aws_iam_policy.bedrock_access.arn | ||
| } | ||
|
|
||
| output "bda_project_arn" { | ||
| description = "The ARN of the Bedrock Data Automation project" | ||
| value = awscc_bedrock_data_automation_project.bda_project.project_arn | ||
| } | ||
|
|
||
| # AWS Bedrock Data Automation requires cross-Region inference when processing | ||
| # files. The following link lists the profile ARNs for the available inference | ||
| # profiles: | ||
| # https://docs.aws.amazon.com/bedrock/latest/userguide/bda-cris.html | ||
| output "bda_profile_arn" { | ||
| description = "The profile ARN associated with the BDA project" | ||
| value = "arn:aws:bedrock:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:data-automation-profile/us.data-automation-v1" | ||
|
||
| } | ||
|
|
||
| output "bda_blueprint_arns" { | ||
| value = [ | ||
| for key, bp in awscc_bedrock_blueprint.bda_blueprint : bp.blueprint_arn | ||
| ] | ||
| } | ||
|
|
||
| output "bda_blueprint_names" { | ||
| value = [ | ||
| for key, bp in awscc_bedrock_blueprint.bda_blueprint : bp.blueprint_name | ||
| ] | ||
| } | ||
|
|
||
| output "bda_blueprint_arn_to_name" { | ||
| value = { | ||
| for key, bp in awscc_bedrock_blueprint.bda_blueprint : bp.blueprint_arn => bp.blueprint_name | ||
| } | ||
| } | ||
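As an illustration of how these outputs might be consumed, the sketch below collects them into a map that a service layer could pass to its container as environment variables. The module instance name `bedrock_data_automation` and the variable names (`BDA_PROJECT_ARN`, `BDA_PROFILE_ARN`, `BDA_BLUEPRINT_ARNS`) are assumptions for illustration, not names defined in this diff.

```hcl
locals {
  # Hypothetical environment variable names; adjust to the application's needs.
  bda_environment_variables = {
    BDA_PROJECT_ARN = module.bedrock_data_automation.bda_project_arn
    BDA_PROFILE_ARN = module.bedrock_data_automation.bda_profile_arn

    # bda_blueprint_arns is a list, so it is joined into a single string here.
    BDA_BLUEPRINT_ARNS = join(",", module.bedrock_data_automation.bda_blueprint_arns)
  }
}
```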
8 changes: 8 additions & 0 deletions
infra/modules/document-data-extraction/resources/providers.tf
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| terraform { | ||
| required_providers { | ||
| awscc = { | ||
| source = "hashicorp/awscc" | ||
| version = ">= 1.63.0" | ||
| } | ||
| } | ||
| } |
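The module also declares `aws_*` resources (the IAM role, IAM policy, and KMS key), so it may be worth declaring the `aws` provider alongside `awscc`. A possible sketch follows; the `awscc` constraint is copied from the diff, while the `aws` entry is an assumption and deliberately carries no version constraint, since the PR does not state one.

```hcl
terraform {
  required_providers {
    # Assumption: not present in this diff. Add a version constraint that
    # matches the rest of the project.
    aws = {
      source = "hashicorp/aws"
    }
    awscc = {
      source  = "hashicorp/awscc"
      version = ">= 1.63.0"
    }
  }
}
```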