Add Strata DocumentAI API app #278

Draft
doshitan wants to merge 8 commits into main from doshitan/add-documentai

@doshitan commented Apr 28, 2026

Ticket

Resolves #253

Changes

TODO extract useful notes for configuring DocumentAI API

TODO add domain/HTTPS config

TODO update or remove E2E tests

TODO the infra test prefix still causes issues with the app-documentai name; PR envs are fine (into the 1000s at least, and dependent on env name). We could use app-docuai, which fits within the length of the other (current) longest app names.

Context for reviewers

Have not (yet) copied over all the custom templates from #274. It is a bit unclear whether those should ship as part of the DDE module in the template or be DocumentAI API specific.

Review all the TODO(pre-merge): comments.

Testing

Video demo of PR environment (public link): https://drive.google.com/file/d/1t3-cBPaE-4i2_XkEPgbSpbBqhIU36elO/view?usp=drive_link

Using https://www.paystubhero.com/wp-content/uploads/2023/11/896-1.jpg

See results:

{
  "jobId": "ab2e401b-e2f1-4e79-8b60-a21562aa07e5",
  "jobStatus": "completed",
  "message": "Document processed successfully",
  "createdAt": "2026-05-07T21:03:47.833187Z",
  "completedAt": "2026-05-07T21:05:03Z",
  "totalProcessingTimeSeconds": 75.17,
  "matchedDocumentClass": "W2",
  "fields": {
    "employerInfo.employerAddress": {
      "confidence": 0.84,
      "value": "41980 Ann Arbor Rd. E Plymouth, NC"
    },
    "employerInfo.controlNumber": {
      "confidence": 0.92,
      "value": ""
    },
    "employerInfo.employerName": {
      "confidence": 0.98,
      "value": "Paystub Hero"
    },
    "employerInfo.ein": {
      "confidence": 0.97,
      "value": "39-3598535"
    },
    "employerInfo.employerZipCode": {
      "confidence": 0.97,
      "value": 48170
    },
    "filingInfo.ombNumber": {
      "confidence": 0.97,
      "value": "1545-0029"
    },
    "filingInfo.verificationCode": {
      "confidence": 0.96,
      "value": ""
    },
    "other": {
      "confidence": 0.95,
      "value": ""
    },
    "federalTaxInfo.federalIncomeTax": {
      "confidence": 0.98,
      "value": 9467
    },
    "federalTaxInfo.allocatedTips": {
      "confidence": 0.96,
      "value": 0
    },
    "federalTaxInfo.socialSecurityTax": {
      "confidence": 0.97,
      "value": 4960
    },
    "federalTaxInfo.medicareTax": {
      "confidence": 0.97,
      "value": 1160
    },
    "employeeGeneralInfo.employeeNameSuffix": {
      "confidence": 0.96,
      "value": ""
    },
    "employeeGeneralInfo.employeeAddress": {
      "confidence": 0.13,
      "value": "41980 Ann Arbor Rd. E Plymouth, CA"
    },
    "employeeGeneralInfo.employeeLastName": {
      "confidence": 0.96,
      "value": "Jesan"
    },
    "employeeGeneralInfo.employeeZipCode": {
      "confidence": 0.98,
      "value": 48170
    },
    "employeeGeneralInfo.firstName": {
      "confidence": 0.97,
      "value": "Abdur Rahaman"
    },
    "employeeGeneralInfo.ssn": {
      "confidence": 0.95,
      "value": "498-74-9874"
    },
    "federalWageInfo.socialSecurityTips": {
      "confidence": 0.96,
      "value": 0
    },
    "federalWageInfo.wagesTipsOtherCompensation": {
      "confidence": 0.96,
      "value": 80000
    },
    "federalWageInfo.medicareWagesTips": {
      "confidence": 0.96,
      "value": 80000
    },
    "federalWageInfo.socialSecurityWages": {
      "confidence": 0.96,
      "value": 80000
    },
    "nonqualifiedPlansIncom": {
      "confidence": 0.98,
      "value": 0
    }
  },
  "error": null,
  "additionalInfo": null
}
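
Most fields in the response above score 0.9 or higher, but employeeGeneralInfo.employeeAddress comes back at 0.13. As a minimal sketch of how a caller might flag such fields for manual review (the helper name `low_confidence_fields` and the 0.5 threshold are assumptions for illustration, not part of the DocumentAI API):

```python
import json

# Assumed review threshold; the PR does not define one.
CONFIDENCE_THRESHOLD = 0.5


def low_confidence_fields(response: dict, threshold: float = CONFIDENCE_THRESHOLD) -> dict:
    """Return {field_name: confidence} for fields scored below the threshold."""
    flagged = {}
    for name, field in (response.get("fields") or {}).items():
        confidence = field.get("confidence", 0.0)  # treat a missing score as 0
        if confidence < threshold:
            flagged[name] = confidence
    return flagged


# Trimmed copy of the response above, keeping three of its fields.
sample = json.loads("""
{
  "jobStatus": "completed",
  "matchedDocumentClass": "W2",
  "fields": {
    "employerInfo.employerName": {"confidence": 0.98, "value": "Paystub Hero"},
    "employeeGeneralInfo.employeeAddress": {"confidence": 0.13, "value": "41980 Ann Arbor Rd. E Plymouth, CA"},
    "employeeGeneralInfo.ssn": {"confidence": 0.95, "value": "498-74-9874"}
  }
}
""")

flagged = low_confidence_fields(sample)
print(flagged)
```

On the trimmed sample this flags only the employee address field.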

Preview environment for app

Preview environment for app-catala

Preview environment for app-rails

Preview environment for app-flask

Preview environment for app-nextjs

Preview environment for app-documentai

@doshitan (Contributor, Author)

@laurencegoolsby could you review and respond to any TODO(pre-merge): comments (well, the ones that are questions, haha)? I came across some things I didn't understand in the course of going through the #274 changes.

@laurencegoolsby laurencegoolsby left a comment

Reviewed/responded to all TODO(pre-merge): comments.
Created navapbc/strata-template-documentai-api#52 and navapbc/strata-template-documentai-api#53.

}
}

# TODO(pre-merge): this is new, what should we be using?

The document block is needed for PDF/document extraction. The image block handles image files (photos of documents). We need both - document for PDFs/TIFFs, image for JPEGs/PNGs.

The document config extracts page-level text with bounding boxes, which is what BDA needs for field extraction.
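
To make the split concrete, here is a small sketch of the rule as stated in this comment: PDFs/TIFFs go through the document block, JPEGs/PNGs through the image block. The function name `config_block_for` and the exact extension sets are assumptions for illustration; how BDA actually selects a block is not shown in this PR.

```python
from pathlib import Path

# Extension sets inferred from the comment above (assumed, not from BDA docs).
DOCUMENT_EXTENSIONS = {".pdf", ".tif", ".tiff"}
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def config_block_for(filename: str) -> str:
    """Return which config block ("document" or "image") would apply to a file."""
    suffix = Path(filename).suffix.lower()
    if suffix in DOCUMENT_EXTENSIONS:
        return "document"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    raise ValueError(f"unsupported file type: {suffix or '(none)'}")


print(config_block_for("w2.PDF"))     # a PDF upload
print(config_block_for("896-1.jpg"))  # the test image from the PR description
```

Under this rule, the JPEG used in the Testing section would be covered by the image block, which is why both blocks are needed.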

@doshitan (Contributor, Author)

Are these document settings highly specific to the Strata DocumentAI implementation? Or are they reasonable general baselines we should just ship as a part of the Document Data Extraction config (like we are doing for the image part)?

@doshitan - Reasonable general baselines. Ship as part of DDE.

Comment on lines +13 to +15
# TODO(pre-merge): create ticket for documentai-api to respect standard DDE env vars
# and/or update DDE module to provide other env vars (like the BDA_ ones?)
DOCUMENTAI_INPUT_LOCATION = "${local.document_data_extraction_environment_variables.DDE_INPUT_LOCATION}/input"

Agree — the app code should use DDE_INPUT_LOCATION etc. instead of DOCUMENTAI_* prefixed vars.

Created navapbc/strata-template-documentai-api#52

@doshitan I retract my previous statement. DOCUMENTAI_INPUT_LOCATION makes sense here: the DOCUMENTAI_* prefix is appropriate for the application's environment variables.

The DDE module provides DDE_INPUT_LOCATION as its output, but the app should map that to its own namespace (DOCUMENTAI_INPUT_LOCATION).
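
The mapping itself lives in the Terraform shown in the diff above; the Python below only illustrates the same idea in isolation. It is a sketch: the function name `documentai_env` and the bucket value are placeholders, and only the `/input` suffix and the two variable names come from this thread.

```python
def documentai_env(dde_env: dict[str, str]) -> dict[str, str]:
    """Map DDE module outputs into the app's own DOCUMENTAI_* namespace."""
    return {
        # Mirrors the Terraform line in the diff: DDE input location plus "/input".
        "DOCUMENTAI_INPUT_LOCATION": f"{dde_env['DDE_INPUT_LOCATION']}/input",
    }


# Placeholder value; the real one is provided by the DDE module.
app_env = documentai_env({"DDE_INPUT_LOCATION": "s3://example-bucket"})
print(app_env)
```

Keeping the translation in one place means the app only ever reads DOCUMENTAI_* names, while the DDE module stays free to expose its generic DDE_* outputs.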

Comment thread infra/app-documentai/service/documentai_api.tf Outdated
Comment thread infra/app-documentai/service/documentai_api.tf Outdated
Comment thread infra/app-documentai/app-config/env-config/document_data_extraction.tf Outdated

Development

Successfully merging this pull request may close these issues.

Create DocumentAI API instance