Add Document Data Extraction Module#989

Merged
laurencegoolsby merged 23 commits into main from lgoolsby/add-bedrock-data-automation
Feb 23, 2026

@laurencegoolsby
Contributor

@laurencegoolsby laurencegoolsby commented Jan 6, 2026

Fixes #986

Changes

Add a new document-data-extraction module that provides AWS Bedrock Data Automation (BDA) resources for document data extraction workflows.

DDE Module

  • Add BDA project and profile resources
  • Add custom blueprint support from JSON schemas
  • Add AWS-managed blueprint support
  • Add IAM access policies with bedrock permissions
  • Add cross-region inference profile support
  • Add standard and custom output configuration options

Storage Module

  • Uses KMS encryption (storage module provides the encryption key)

Template Files

  • Update service and env-config files to use new module interface
  • Add aws_managed_blueprints parameter

Important Notes

  • BDA uses its own internal service role - This module does not create a custom IAM role for BDA
  • S3 bucket encryption - S3 buckets used with BDA should use the KMS encryption key provided by the storage module
  • Lambda permissions - Lambda functions invoking BDA must have S3 permissions for both input and output buckets directly attached to their execution role
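
The Lambda permissions note can be illustrated with a minimal sketch of an IAM policy document covering both buckets. The bucket names, the `Sid`, and the exact action list are assumptions for illustration; the policy the module or service actually attaches may differ.

```python
import json


def caller_s3_policy(input_bucket: str, output_bucket: str) -> dict:
    """Return an IAM policy document covering both DDE buckets."""
    # Assumption: object-level read/write is sufficient for the caller.
    actions = ["s3:GetObject", "s3:PutObject"]
    # Object ARNs for every key in the input and output buckets.
    resources = [f"arn:aws:s3:::{bucket}/*" for bucket in (input_bucket, output_bucket)]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DdeBucketAccess",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": actions,
                "Resource": resources,
            }
        ],
    }


# Hypothetical bucket names, used only for this example.
policy = caller_s3_policy("example-dde-input", "example-dde-output")
print(json.dumps(policy, indent=2))
```

The resulting document would be attached directly to the execution role of whichever function reads from or writes to the buckets.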

Testing

Tested in navapbc/platform-test#237

- Add enable_document_data_extraction boolean
- Add environment variables and bucket policies to service
- Add additional outputs for BDA project/blueprint
- Move infra/modules/document-data-extraction/resources/bedrock-data-automation/* to infra/modules/document-data-extraction/resources
- Rename infra/app-flask/service/blueprints to document-data-extraction-blueprints
- Remove bda_ prefix from infra/modules/document-data-extraction/resources/variables.tf and infra/app-flask/service/document_data_extraction.tf
- Update infra/app-flask/app-config/env-config/document_data_extraction.tf name and path
- Create KMS key
- Remove KMS configuration parameters from module interface
- Add Bedrock Data Automation access_policy_arn output for service integration
- Update blueprints_map to use map(string) tags instead of complex objects
- Remove enabled_blueprint logic in favor of reading blueprints directory
- Update README
…ts.tf, infra/modules/document-data-extraction/resources/variables.tf
- Remove Bedrock Data Automation KMS key alias
- Remove unused Document Data Extraction outputs
- Format code via make format
@laurencegoolsby laurencegoolsby requested a review from a team as a code owner January 6, 2026 22:11
Contributor

@doshitan doshitan left a comment

We also need the infra/{{app_name}}/* changes for configuring/using the module.

- Rename override_config_state to override_configuration in README
- Add usage examples with service configuration and blueprint schema
@laurencegoolsby
Contributor Author

> We also need the infra/{{app_name}}/* changes for configuring/using the module.

@doshitan - Added usage examples in infra/{{app_name}}/service/

@doshitan
Contributor

doshitan commented Jan 8, 2026

> > We also need the infra/{{app_name}}/* changes for configuring/using the module.
>
> @doshitan - Added usage examples in infra/{{app_name}}/service/

@laurencegoolsby looks like we are still missing most of the infra/{{app_name}}/* changes? We need all the config and supporting plumbing for using the module.

@laurencegoolsby
Contributor Author

> > > We also need the infra/{{app_name}}/* changes for configuring/using the module.
> >
> > @doshitan - Added usage examples in infra/{{app_name}}/service/
>
> @laurencegoolsby looks like we are still missing most of the infra/{{app_name}}/* changes? We need all the config and supporting plumbing for using the module.

@doshitan - Added/updated files in infra/{{app_name}}/*

Contributor

@doshitan doshitan left a comment

Please update the PR description to mention the associated ticket, matching the PR template for the project (and so also probably removing the "Summary" section/fold that into the "Changes" to match the PR template): #986

Please link to the test PR, not the branch, in the PR description.

Avoid the past tense description of changes in the PR description. For background context about how a decision was arrived at that's fine, but for what the current changes do the descriptive text should be more in the imperative mood, like the title.

We'll also need the changes in navapbc/platform-test#238

@laurencegoolsby
Contributor Author

laurencegoolsby commented Jan 9, 2026

> Please update the PR description to mention the associated ticket, matching the PR template for the project (and so also probably removing the "Summary" section/fold that into the "Changes" to match the PR template): #986
>
> Please link to the test PR, not the branch, in the PR description.
>
> Avoid the past tense description of changes in the PR description. For background context about how a decision was arrived at that's fine, but for what the current changes do the descriptive text should be more in the imperative mood, like the title.
>
> We'll also need the changes in navapbc/platform-test#238

@doshitan Added changes from navapbc/platform-test#238

- Change environment variables to DDE_INPUT_LOCATION and DDE_OUTPUT_LOCATION
- Add BDA region configuration with us-east-1 default
- Create region-specific AWS providers for BDA deployment
- Move profile ARN generation to module output
- Add comment explaining DOCUMENT vs IMAGE blueprint types
- Add support for AWS-managed encryption (Bedrock Data Automation requires AWS-managed encryption)
- Add support for AWS-managed Bedrock Data Automation blueprints
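
As a rough illustration of how a service might consume the `DDE_INPUT_LOCATION` and `DDE_OUTPUT_LOCATION` variables, here is a minimal sketch. It assumes the values are S3 URIs, and the `DDE_REGION` variable name is hypothetical (the commit only states that the BDA region defaults to us-east-1).

```python
import os
from urllib.parse import urlparse


def parse_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3://bucket/prefix URI into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def load_dde_config(env=os.environ) -> dict:
    """Read DDE settings from an environment-like mapping."""
    return {
        # Assumption: both locations are expressed as S3 URIs.
        "input": parse_s3_uri(env["DDE_INPUT_LOCATION"]),
        "output": parse_s3_uri(env["DDE_OUTPUT_LOCATION"]),
        # Hypothetical variable name; the default mirrors the us-east-1 default above.
        "region": env.get("DDE_REGION", "us-east-1"),
    }


# Example values only; real values would come from the service's environment.
config = load_dde_config(
    {
        "DDE_INPUT_LOCATION": "s3://example-dde-input/incoming/",
        "DDE_OUTPUT_LOCATION": "s3://example-dde-output/results/",
    }
)
print(config)
```

Passing the environment as a mapping keeps the function easy to exercise without touching the real process environment.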
Contributor

@doshitan doshitan left a comment

Any changes to the storage module should be a separate PR for cleaner review please.

And what exactly are we seeing that leads us to believe BDA can't use S3 buckets with customer managed keys?

Also removed was configuration for custom KMS key encryption for BDA itself. Why is that?

@laurencegoolsby
Contributor Author

> Any changes to the storage module should be a separate PR for cleaner review please.
>
> And what exactly are we seeing that leads us to believe BDA can't use S3 buckets with customer managed keys?
>
> Also removed was configuration for custom KMS key encryption for BDA itself. Why is that?

@doshitan

Re: Storage module changes

BDA requires storage module changes for both encryption approaches:

  • CMK: Storage module needs kms_service_principals variable (default []) for bedrock.amazonaws.com access and KMS key ARN output
  • AWS-managed: Storage module needs use_aws_managed_encryption variable to skip CMK creation

The current storage module always creates CMKs with a basic key policy but does not expose them for use by BDA.
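
To make the CMK option concrete, here is a minimal sketch of how a `kms_service_principals` list could translate into an extra key policy statement. The function name, the `Sid`, and the action list are assumptions for illustration; which KMS actions BDA actually requires should be confirmed against AWS documentation.

```python
import json


def service_principal_statements(kms_service_principals=()):
    """Return extra KMS key policy statements for the given service principals."""
    if not kms_service_principals:
        # Mirrors the proposed default of []: no extra statement is added.
        return []
    return [
        {
            "Sid": "AllowServicePrincipals",  # hypothetical statement ID
            "Effect": "Allow",
            "Principal": {"Service": list(kms_service_principals)},
            # Assumption: these actions are enough for the service to use the key.
            "Action": ["kms:Decrypt", "kms:GenerateDataKey", "kms:DescribeKey"],
            # In a key policy, "*" refers to the key the policy is attached to.
            "Resource": "*",
        }
    ]


statements = service_principal_statements(["bedrock.amazonaws.com"])
print(json.dumps(statements, indent=2))
```

With the default empty list the key policy is unchanged, so existing users of the storage module would not be affected.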

Options:

  1. Include minimal storage module changes in this PR
  2. Create BDA buckets directly (not using storage module) to unblock this PR
  3. Wait for separate storage module PR

Re: BDA encryption_configuration

Removed - caused deployment errors, parameter unsupported in current BDA API.

- Consolidate blueprints_path and aws_managed_blueprints into single blueprints list
- Support mixed list of file paths and ARNs in blueprints variable
- Add glob pattern expansion in service layer for blueprint files
- Remove InvokeModel and InvokeModelWithResponseStream permissions (not needed for BDA)
- Update README to reflect new blueprints list interface
@laurencegoolsby
Contributor Author

@doshitan Addressed feedback:

  • Consolidated blueprints_path and aws_managed_blueprints into single blueprints list
  • Removed InvokeModel/InvokeModelWithResponseStream permissions (not needed for BDA)
  • Updated README
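
The consolidated `blueprints` list can be sketched as follows: entries that look like ARNs are passed through, and everything else is treated as a glob pattern for schema files. This is only an illustration in Python of the idea; the actual module does this in Terraform, and the function name, directory layout, and example ARN here are assumptions.

```python
import tempfile
from pathlib import Path


def resolve_blueprints(entries, base_dir):
    """Split a mixed blueprint list into ARNs and expanded schema file paths."""
    arns, files = [], []
    for entry in entries:
        if entry.startswith("arn:"):
            # AWS-managed (or otherwise pre-existing) blueprint, referenced by ARN.
            arns.append(entry)
        else:
            # Anything else is treated as a glob pattern relative to base_dir.
            matches = Path(base_dir).glob(entry)
            files.extend(sorted(p.relative_to(base_dir).as_posix() for p in matches))
    return arns, files


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout with two custom blueprint schemas.
    blueprint_dir = Path(tmp) / "document-data-extraction-blueprints"
    blueprint_dir.mkdir()
    for name in ("invoice.json", "receipt.json"):
        (blueprint_dir / name).write_text("{}")

    arns, files = resolve_blueprints(
        [
            # Illustrative ARN only.
            "arn:aws:bedrock:us-east-1:aws:blueprint/bedrock-data-automation-public-invoice",
            "document-data-extraction-blueprints/*.json",
        ],
        tmp,
    )

print(arns)
print(files)
```

A single list keeps the interface small, at the cost of relying on the `arn:` prefix to tell the two kinds of entry apart.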

Re: Storage module - BDA requires storage module changes regardless of encryption approach. Awaiting your feedback on how to proceed.

@doshitan
Contributor

> BDA requires storage module changes for both encryption approaches:

And whatever they are, please put them in a separate PR. This PR can depend on it.

> CMK: Storage module needs kms_service_principals variable (default []) for bedrock.amazonaws.com access and KMS key ARN output

This seems like the more useful functionality to have. The storage module could export the KMS ARN and the DDE module could attach whatever additional policies are needed to the key/bucket. Not sure if just the caller of BDA needs to be able to provide a grant for the KMS key and/or BDA needs to be declared in the key policy directly.

But either option is fine for now.

> Re: BDA encryption_configuration
> Removed - caused deployment errors, parameter unsupported in current BDA API.

It was working previously in navapbc/platform-test#237? What deployment? What exactly is the error message?

Contributor

@doshitan doshitan left a comment

@laurencegoolsby
Contributor Author

> > BDA requires storage module changes for both encryption approaches:
>
> And whatever they are, please put them in a separate PR. This PR can depend on it.
>
> > CMK: Storage module needs kms_service_principals variable (default []) for bedrock.amazonaws.com access and KMS key ARN output
>
> This seems like the more useful functionality to have. The storage module could export the KMS ARN and the DDE module could attach whatever additional policies are needed to the key/bucket. Not sure if just the caller of BDA needs to be able to provide a grant for the KMS key and/or BDA needs to be declared in the key policy directly.
>
> But either option is fine for now.
>
> > Re: BDA encryption_configuration
> > Removed - caused deployment errors, parameter unsupported in current BDA API.
>
> It was working previously in navapbc/platform-test#237? What deployment? What exactly is the error message?

@doshitan

PR #237 invoked BDA directly from ECS; the DocumentAI infrastructure uses Lambda (triggered by EventBridge) to invoke BDA.

TL;DR:

  • ECS worked without issues
  • Lambda execution role lacks KMS permissions resulting in an "AccessDeniedException: Invalid KMS key. Check KMS key and associated permissions." error.

I'll create a separate PR for the storage module changes (kms_service_principals, KMS ARN export, optional AWS-managed keys)

@laurencegoolsby laurencegoolsby requested a review from a team as a code owner February 20, 2026 02:39
… modules

- Add providers.tf to infra/modules/storage
- Add providers.tf to infra/modules/document-data-extraction/resources
- Resolves Terraform warnings about undefined provider references when passing providers between modules
Contributor

@doshitan doshitan left a comment

Lint and test failures, please fix. Looks like this code is not completely up-to-date.

Contributor

@doshitan doshitan left a comment

The module README.md and PR description are out of date. But after those are updated to reflect the current code, looks good!


- **BDA uses its own internal service role** - This module does not create a custom IAM role for BDA. Bedrock Data Automation uses an AWS-managed internal service role for S3 access.
- **S3 bucket encryption** - S3 buckets used with BDA should use AWS-managed encryption (AES256), not customer-managed KMS keys.
- **Lambda permissions** - Any Lambda function invoking BDA must have S3 permissions for both input and output buckets directly attached to its execution role.
Contributor

Not sure why this is if BDA is actually accessing the buckets?

Contributor Author

@doshitan Lambda needs S3 permissions because the DocumentAI Lambda uploads the input document to S3. A separate Lambda retrieves the output from S3, while another Lambda invokes BDA.

Updated Lambda permission verbiage accordingly.

Contributor

Ah, but this is true of anything wanting to use DDE, not just an AWS Lambda, right?

And a service wanting to use BDA could theoretically not have write access to the input S3 bucket, but as long as it can construct the S3 URI to pass in as a parameter to BDA it could still utilize the module, right? Unlikely setup, but this module doesn't really make any assumptions about the caller like this? (BDA itself just needs to be able to r/w the input/output locations)

@doshitan doshitan removed the request for review from a team February 20, 2026 22:11
- Remove references to S3 AWS-managed encryption
- Fix indentation in usage section
@laurencegoolsby
Contributor Author

> The module README.md and PR description are out of date. But after those are updated to reflect the current code, looks good!

Updated README.md and PR description.

@laurencegoolsby laurencegoolsby merged commit 8f4515d into main Feb 23, 2026
9 checks passed
@laurencegoolsby laurencegoolsby deleted the lgoolsby/add-bedrock-data-automation branch February 23, 2026 18:01

Development

Successfully merging this pull request may close these issues.

Create Document Data Extraction module for AWS

2 participants