Conversation
doshitan
left a comment
There was a problem hiding this comment.
This looks like a good start! Some refactoring and questions throughout, but I think the main thing we should do is ensure the module exports the necessary info such that it could be used in a service.
Also, could you post some examples of the changes working in the “Testing” section of the PR description? Like a short video from the AWS console or whatever of calling the created BDA project with example input and seeing the output.
We'll want to avoid product/vendor-specific names for modules (following the style guide). I couldn't think of anything better than document_data_extraction, so that's what I'm suggesting here, but open to other options ("document" doesn't obviously capture the potential scope of images/videos, but feels like the most common usecase).
Refactorings to that end:
-
Move
/infra/modules/bedrock-data-automation/*to/infra/modules/document-data-extraction/resources/.- Add an
access_policy_arnoutput which has the policy a client would need to be able to invoke the document processing. Like https://github.com/navapbc/template-infra/blob/5afca08b040604d25ee934ec7afea4ecca060a69/infra/modules/storage/access_control.tf#L35-L65 - Add outputs for loaded blueprint arns, or maybe just the blueprint ARN prefix, since the app will need to know some generic identifier (e.g., the name) for the blueprint anyway?
- Add an
-
Add a
enable_document_data_extractionto/infra/{{app_name}}/app-config/main.tf -
Add a
/infra/{{app_name}}/app-config/env-config/document_data_extraction.tffile, holding various config, likename(passed in to the module as/replacing the currentname_prefixI think) and probably theblueprints_map(at least theschemaandtype, or maybe nest schemas under a type key), like https://github.com/navapbc/template-infra/blob/5afca08b040604d25ee934ec7afea4ecca060a69/infra/%7B%7Bapp_name%7D%7D/app-config/env-config/notifications.tf -
Move
/infra/{{app_name}}/service/bda.tfto/infra/{{app_name}}/service/document_data_extraction.tfand make resource creation dependent on theenable_document_data_extractionconfig boolean.- Add a local variable holding env vars with the parameters app code would need in order to use the service (generally the module’s outputs, like the input and output bucket names/uris, BDA project ARN, and probably the blueprint ARNs/prefix?), like https://github.com/navapbc/template-infra/blob/5afca08b040604d25ee934ec7afea4ecca060a69/infra/%7B%7Bapp_name%7D%7D/service/identity_provider.tf#L8-L11
- Then connect those env vars to the service, like https://github.com/navapbc/template-infra/blob/5afca08b040604d25ee934ec7afea4ecca060a69/infra/%7B%7Bapp_name%7D%7D/service/main.tf#L100
-
Add access policy to service conditional on
enable_document_data_extractionas well as the input & output buckets: https://github.com/navapbc/template-infra/blob/5afca08b040604d25ee934ec7afea4ecca060a69/infra/%7B%7Bapp_name%7D%7D/service/main.tf#L117
infra/app-flask/service/bda.tf
Outdated
| @@ -0,0 +1,48 @@ | |||
| module "bda_input_bucket" { | |||
| source = "../../modules/storage" | |||
| name = "bda-test-input-app-flask" | |||
There was a problem hiding this comment.
If we create new buckets, the naming should follow the existing convention and uniqueness constraints, which the simpliest way would probably be something like:
| name = "bda-test-input-app-flask" | |
| name = "${local.bucket_name}-bda-input" |
Re-using the existing storage bucket name (which should handle most of the conventions) and tack on the specific purpose. Ideally we'd use it something more clear like -document-data-extraction-, but that's like 40% the max length for a bucket name, so should probably default to something shorter.
There was a problem hiding this comment.
@doshitan We should use separate buckets. Rather than the existing storage bucket. Reasons here:
@doshitan We should 100% use separate input/output buckets for BDA. Reasons include, but are not limited to:
Physical separation of resources
- Input bucket: BDA read-only
- Output bucket: BDA write-only
Retention policies
- Input bucket: longer retention for compliance/regulatory reasons
- Output bucket: shorter retention for processed results
Cost/storage management
- Input bucket: transition to Infrequent Access after 30 days, Glacier > 90 days, etc.
- Output bucket: standard storage class with n-day automatic deletion
Monitoring/alerts:
- Per-bucket S3 metrics and CloudWatch alerts
infra/modules/document-data-extraction/resources/bedrock-data-automation/main.tf
Outdated
Show resolved
Hide resolved
infra/modules/document-data-extraction/resources/bedrock-data-automation/variables.tf
Outdated
Show resolved
Hide resolved
- Add enable_document_data_extraction boolean - Add environment variables and bucket policies to service - Add additional outputs for BDA project/blueprint
doshitan
left a comment
There was a problem hiding this comment.
It looks like most of my previous in-line comments haven't been addressed. I think most critically could use an answer to #237 (comment)
infra/app-flask/app-config/env-config/document_data_extraction.tf
Outdated
Show resolved
Hide resolved
infra/app-flask/app-config/env-config/document_data_extraction.tf
Outdated
Show resolved
Hide resolved
infra/app-flask/app-config/env-config/document_data_extraction.tf
Outdated
Show resolved
Hide resolved
- Move infra/modules/document-data-extraction/resources/bedrock-data-automation/* to infra/modules/document-data-extraction/resources - Rename infra/app-flask/service/blueprints to document-data-extraction-blueprints - Remove bda_ prefix from infra/modules/document-data-extraction/resources/variables.tf and infra/app-flask/service/document_data_extraction.tf - Update infra/app-flask/app-config/env-config/document_data_extraction.tf name and path
- Added local.prefix to bucket names and BDA project name - Updated environment variables to use prefixed bucket names
doshitan
left a comment
There was a problem hiding this comment.
I think we can tighten up the document data extract module interface and outputs a bit. And we should sync on the bucket setup for BDA in general.
There are still some older unresolved comments:
infra/app-flask/app-config/env-config/document_data_extraction.tf
Outdated
Show resolved
Hide resolved
| override_configuration = { | ||
| document = { | ||
| splitter = { | ||
| state = var.override_config_state |
There was a problem hiding this comment.
The name of the variable and its use seem a little inconsistent. Should like override_configuration = var.override_config?
Also what is the purpose/function of override_configuration? Like is there a less BDA-specific name we could give the module variable?
There was a problem hiding this comment.
- Updated the variables names to be consistent (e.g. renamed
override_config_statetooverride_configuration) https://docs.aws.amazon.com/bedrock/latest/APIReference/API_data-automation_OverrideConfiguration.html - re: Less BDA-specific? We could use
modality_override_configuration?
- Create KMS key - Remove KMS configuration parameters from module interface - Add Bedrock Data Automation access_policy_arn output for service integration - Update blueprints_map to use map(string) tags instead of complex objects - Remove enabled_blueprint logic in favor of reading blueprints directory - Update README
| for blueprint in fileset(local.document_data_extraction_config.blueprints_path, "*") : | ||
| split(".", blueprint)[0] => { | ||
| schema = file("${local.document_data_extraction_config.blueprints_path}/${blueprint}") | ||
| type = "DOCUMENT" |
There was a problem hiding this comment.
As opposed to image, video, or audio right? And it looks like you can only have one custom blueprint for these other types right? Have we found value in custom blueprints for any of these?
And so not in scope, but if projects wanted to have custom blueprints for multiple kinds of images/audio, they'd have to create multiple BDA projects, hmm.
But JPG, PNG, TIFF are counted as "documents" not "images" right?
There was a problem hiding this comment.
@doshitan All of our blueprints were defined with a DOCUMENT modality rather than IMAGE.
JPG, PNG, and TIFF documents can match to blueprints with either a DOCUMENT or IMAGE modality. In other words, if you upload a jpg, BDA can choose a "document" blueprint if it's the best fit.
| File Types | Default Routing Behavior |
|---|---|
| PNG | Semantic classifier; either Image or Document |
| JPEG | Semantic classifier; either Image or Document |
| PDF, TIFF | Document |
| MP4, MOV | Video |
| AMR, FLAC, M4A, MP3, OGG, WEBM, WAV | Audio |
There was a problem hiding this comment.
Hmm, it looks like there's also a difference in behavior using InvokeDataAutomationAsync and InvokeDataAutomation, the latter rejecting JPG/PNG that get detected as DOCUMENT? Feels like that nuance should be captured somewhere. infra/modules/document-data-extraction/README.md doesn't feel quite right, but a usage guide would be good. I guess for now the usage guide will be the API app we build to use this module.
Also let's put a short comment here summarizing type thing? Something like "JPG/PNG can be processed as DOCUMENT or IMAGE types, but IMAGE types can only have a single custom blueprint so generally the blueprints will be for the DOCUMENT type"
…ment data extraction integration - Create /test-document-data-extraction endpoint in Flask app - Update boto3/botocore to support latest BDA API (1.38.0+) - Fix Terraform outputs to include bucket name prefixes - Add DDE environment variables to local.env template
f65de85 to
d5036ce
Compare
- Consolidate blueprints_path and aws_managed_blueprints into single blueprints list - Support mixed list of file paths and ARNs in blueprints variable - Add glob pattern expansion in service layer for blueprint files - Remove InvokeModel and InvokeModelWithResponseStream permissions (not needed for BDA)
d5036ce to
ee6483e
Compare
|
@doshitan
Re: Storage module - BDA requires storage module changes regardless of encryption approach. Awaiting your feedback on how to proceed. |
|
CI app-flask PR Environment Checks / update / Update environment (pull_request)Failing after 21s Currently receiving an AccessDeniedException for Bedrock Blueprint operations. This appears to be related to missing bedrock and cloudformation permissions in the CI/CD role. These services were added to cc @doshitan |
|
@laurencegoolsby the changes from #238 need re-applied against the account layer. |
Updated! |
doshitan
left a comment
There was a problem hiding this comment.
Service test still look to be failing after re-run: https://github.com/navapbc/platform-test/actions/runs/21992515262/job/64041863229?pr=237
As well as warning:
TestService 2026-02-18T17:48:59Z logger.go:67: ╷
TestService 2026-02-18T17:48:59Z logger.go:67: │ Warning: Reference to undefined provider
TestService 2026-02-18T17:48:59Z logger.go:67: │
TestService 2026-02-18T17:48:59Z logger.go:67: │ on document_data_extraction.tf line 33, in module "dde_input_bucket":
TestService 2026-02-18T17:48:59Z logger.go:67: │ 33: aws = aws.dde
TestService 2026-02-18T17:48:59Z logger.go:67: │
TestService 2026-02-18T17:48:59Z logger.go:67: │ There is no explicit declaration for local provider name "aws" in
TestService 2026-02-18T17:48:59Z logger.go:67: │ module.dde_input_bucket, so Terraform is assuming you mean to pass a
TestService 2026-02-18T17:48:59Z logger.go:67: │ configuration for "hashicorp/aws".
TestService 2026-02-18T17:48:59Z logger.go:67: │
TestService 2026-02-18T17:48:59Z logger.go:67: │ If you also control the child module, add a required_providers entry named
TestService 2026-02-18T17:48:59Z logger.go:67: │ "aws" with the source address "hashicorp/aws".
TestService 2026-02-18T17:48:59Z logger.go:67: │
TestService 2026-02-18T17:48:59Z logger.go:67: │ (and 2 more similar warnings elsewhere)
CI test is failing with ECS service stability timeout: The manual deployment in p-237 workspace is healthy and running. This seems to suggest the CI failure is independent from the PR. Any chance this is related to the quota issues mentioned in Slack or is it something else entirely? Re: the Terraform provider warning - it does not appear to be a blocking issue as the deploy succeeded. I can add an explicit AWS provider declaration if desired. |
I'd re-run the infra tests and watch the logs/deployment events for the infra test.
Yes, I think we can a required_providers block to |
|
As mentioned, And is the testing up-to-date/have we validated the functionality? The current demo looks to be from January? But if those are all good, should update the template PR to get ready to merge. |
|
Fixes #986 ## Changes Add a new `document-data-extraction` module that provides AWS Bedrock Data Automation (BDA) resources for document data extraction workflows. ### DDE Module - Add BDA project and profile resources - Add custom blueprint support from JSON schemas - Add AWS-managed blueprint support - Add IAM access policies with bedrock permissions - Add cross-region inference profile support - Add standard and custom output configuration options ### Storage Module - Uses KMS encryption (storage module provides the encryption key) ### Template Files - Update service and env-config files to use new module interface - Add `aws_managed_blueprints` parameter ## Important Notes - **BDA uses its own internal service role** - This module does not create a custom IAM role for BDA - **S3 bucket encryption** - S3 buckets used with BDA should use the KMS encryption key provided by the storage module - **Lambda permissions** - Lambda functions invoking BDA must have S3 permissions for both input and output buckets directly attached to their execution role ## Testing Tested in navapbc/platform-test#237
Ticket
navapbc/template-infra#986
Changes
Adding BDA module for testing before integration into template-infra.
Context for reviewers
This is just to create the BDA resources, i.e., the project and blueprints, any integrations with how the project will be triggered are not in scope for this PR.
Testing
2026-01-23
platform-test-bda-invocation-p-237.mov
2026-02-19
platform-test-bda-invocation-p-237-2026-02-19.mov
Preview environment for app
♻️ Environment destroyed ♻️
Preview environment for app-nextjs
♻️ Environment destroyed ♻️
Preview environment for app-rails
♻️ Environment destroyed ♻️
Preview environment for app-flask
♻️ Environment destroyed ♻️