Add Document Data Extraction Module#989
Conversation
- Add enable_document_data_extraction boolean - Add environment variables and bucket policies to service - Add additional outputs for BDA project/blueprint
- Move infra/modules/document-data-extraction/resources/bedrock-data-automation/* to infra/modules/document-data-extraction/resources - Rename infra/app-flask/service/blueprints to document-data-extraction-blueprints - Remove bda_ prefix from infra/modules/document-data-extraction/resources/variables.tf and infra/app-flask/service/document_data_extraction.tf - Update infra/app-flask/app-config/env-config/document_data_extraction.tf name and path
- Create KMS key - Remove KMS configuration parameters from module interface - Add Bedrock Data Automation access_policy_arn output for service integration - Update blueprints_map to use map(string) tags instead of complex objects - Remove enabled_blueprint logic in favor of reading blueprints directory - Update README
…ts.tf, infra/modules/document-data-extraction/resources/variables.tf
- Remove Bedrock Data Automation KMS key alias - Remove unused Document Data Extraction outputs - Format code via make format
- Rename override_config_state to override_configuration in README - Add usage examples with service configuration and blueprint schema
@doshitan - Added usage examples in |
@laurencegoolsby looks like we are still missing most of the |
@doshitan - Added/updated files in infra/{{app_name}}/* |
There was a problem hiding this comment.
Please update the PR description to mention the associated ticket, matching the PR template for the project (and so also probably removing the "Summary" section/fold that into the "Changes" to match the PR template): #986
Please link to the test PR, not the branch the in the PR description.
Avoid the past tense description of changes in the PR description. For background context about how a decision was arrived at that's fine, but for what the current changes do the descriptive text should be more in the imperative mood, like the title.
We'll also need the changes in navapbc/platform-test#238
@doshitan Added changes from navapbc/platform-test#238 |
- Change environment variables to DDE_INPUT_LOCATION and DDE_OUTPUT_LOCATION - Add BDA region configuration with us-east-1 default - Create region-specific AWS providers for BDA deployment - Move profile ARN generation to module output - Add comment explaining DOCUMENT vs IMAGE blueprint types
infra/{{app_name}}/app-config/env-config/document_data_extraction.tf
Outdated
Show resolved
Hide resolved
infra/{{app_name}}/app-config/env-config/document_data_extraction.tf
Outdated
Show resolved
Hide resolved
- Add support for AWS-managed encryption (Bedrock Data Automation requires AWS-managed encryption) - Add support for AWS-managed Bedrock Data Automation blueprints
doshitan
left a comment
There was a problem hiding this comment.
Any changes to the storage module should be a separate PR for cleaner review please.
And what exactly are we seeing that leads us to believe BDA can't use S3 buckets with customer managed keys?
Also removed was configuration for custom KMS key encryption for BDA itself. Why is that?
infra/{{app_name}}/app-config/env-config/document_data_extraction.tf
Outdated
Show resolved
Hide resolved
infra/modules/document-data-extraction/resources/access_control.tf
Outdated
Show resolved
Hide resolved
Re: Storage module changesBDA requires storage module changes for both encryption approaches:
Current storage module always creates CMKs with a basic key policy but does not expose for use by BDA. Options:
Re: BDA encryption_configurationRemoved - caused deployment errors, parameter unsupported in current BDA API. |
- Consolidate blueprints_path and aws_managed_blueprints into single blueprints list - Support mixed list of file paths and ARNs in blueprints variable - Add glob pattern expansion in service layer for blueprint files - Remove InvokeModel and InvokeModelWithResponseStream permissions (not needed for BDA) - Update README to reflect new blueprints list interface
|
@doshitan Addressed feedback:
Re: Storage module - BDA requires storage module changes regardless of encryption approach. Awaiting your feedback on how to proceed. |
And whatever they are, please put them in a separate PR. This PR can depend on it.
This seems like the more useful functionality to have. The storage module could export the KMS ARN and the DDE module could attach whatever additional policies needed to the key/bucket. Not sure if just the caller of BDA needs to be able to provide a grant for the KMS key and/or BDA needs declared in the key policy directly. But either option is fine for now.
It was working previously in navapbc/platform-test#237? What deployment? What exactly is the error message? |
PR #237 invoked BDA directly from ECS; DocumentAI infra structure uses Lambda (triggered by EventBridge) to invoke BDA. TLDR;
I'll create a separate PR for the storage module changes (kms_service_principals, KMS ARN export, optional AWS-managed keys) |
… modules - Add providers.tf to infra/modules/storage - Add providers.tf to infra/modules/document-data-extraction/resources - Resolves Terraform warnings about undefined provider references when passing providers between modules
….com/navapbc/template-infra into lgoolsby/add-bedrock-data-automation
doshitan
left a comment
There was a problem hiding this comment.
Lint and test failures, please fix. Looks like this code is not completely up-to-date.
Remove obsolete use_aws_managed_encryption variable and related logic
doshitan
left a comment
There was a problem hiding this comment.
The module README.md and PR description are out of date. But after those are updated to reflect current code. Looks good!
|
|
||
| - **BDA uses its own internal service role** - This module does not create a custom IAM role for BDA. Bedrock Data Automation uses an AWS-managed internal service role for S3 access. | ||
| - **S3 bucket encryption** - S3 buckets used with BDA should use AWS-managed encryption (AES256), not customer-managed KMS keys. | ||
| - **Lambda permissions** - Any Lambda function invoking BDA must have S3 permissions for both input and output buckets directly attached to its execution role. |
There was a problem hiding this comment.
Not sure why this is if BDA is actually accessing the buckets?
There was a problem hiding this comment.
@doshitan Lambda needs S3 permissions as the DocumentAI Lambda uploading the input document to S3. A separate Lambda retrieving the output from S3, while another Lambda invokes BDA.
Updated Lambda permission verbiage accordingly.
There was a problem hiding this comment.
Ah, but this is true of anything wanting to use DDE, not just an AWS Lambda right?
And a service wanting to use BDA could theoretically not have write access to the input S3 bucket, but as long as they can construct the S3 URI to pass in as a parameter to BDA they could still utilize the module right? Unlikely set up, but this module doesn't really make any assumptions about the caller like this? (BDA itself just needs to be able to r/w the input/ouput locations)
- Remove references to S3 AWS-managed encryption - Fix indentation in usage section
Updated README.md and PR description. |
Fixes #986
Changes
Add a new
document-data-extractionmodule that provides AWS Bedrock Data Automation (BDA) resources for document data extraction workflows.DDE Module
Storage Module
Template Files
aws_managed_blueprintsparameterImportant Notes
Testing
Tested in navapbc/platform-test#237