This Azure Functions app provides the extract_pdf_content HTTP endpoint implemented in function_app.py, which downloads a PDF from Azure Blob Storage and uses Azure AI Content Understanding to extract field data per page, returning the results as JSON.
Key features:
- Managed identity authentication via `DefaultAzureCredential`.
- Blob retrieval using `BlobServiceClient` (configurable account/container).
- PDF splitting into single-page byte streams via `PyPDF2`.
- Parallel page analysis using `ThreadPoolExecutor` for improved throughput.
- Field extraction per page, with a best-effort `TotalTaxesDue` value that falls back to `TotalDueWithDiscount` or `TaxAfterDiscount`.
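
The fallback behaviour in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the code in function_app.py: the helper name `resolve_total_tax`, its signature, and the rule that `None` or empty values are skipped are all assumptions.

```python
from typing import Any, Mapping, Optional, Sequence

# Candidate order taken from the feature list above.
DEFAULT_CANDIDATES = ("TotalTaxesDue", "TotalDueWithDiscount", "TaxAfterDiscount")


def resolve_total_tax(
    fields: Mapping[str, Any],
    candidates: Sequence[str] = DEFAULT_CANDIDATES,
) -> Optional[Any]:
    """Return the first candidate field that is present and non-empty.

    Hypothetical helper: treating None and "" as "missing" is an assumption.
    """
    for name in candidates:
        value = fields.get(name)
        if value is not None and value != "":
            return value
    # None of the candidates were extracted for this page.
    return None
```

With this ordering, a page that has both `TotalTaxesDue` and `TaxAfterDiscount` resolves to `TotalTaxesDue`, and a page with none of the candidates resolves to `None`.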
flowchart LR
Client(["Client / Caller"])
AzFunc["Azure Function\nextract_pdf_content\n(HTTP trigger)"]
Blob[("Azure Blob Storage\nInput PDFs")]
CU["Azure AI\nContent Understanding"]
AppInsights["Application Insights\n(Logs & Telemetry)"]
Client -- "HTTP GET\n?document=<name>.pdf" --> AzFunc
AzFunc -- "Download PDF blob" --> Blob
Blob -- "PDF bytes" --> AzFunc
AzFunc -- "Analyze pages in parallel\n(ThreadPoolExecutor)" --> CU
CU -- "Extracted fields per page" --> AzFunc
AzFunc -- "JSON response" --> Client
AzFunc -. "Logs" .-> AppInsights
- Python 3.10+
- Azure Functions Core Tools (for local development)
- Azure Storage account containing the PDFs
- Azure AI Content Understanding endpoint and analyzer (default `prebuilt-documentFields`)
Set the following environment variables (or add them to local.settings.json for local runs):
| Setting | Description |
|---|---|
| `BLOB_ACCOUNT_URL` | Blob storage endpoint, e.g., `https://<account>.blob.core.windows.net`. |
| `BLOB_CONTAINER_NAME` | Storage container containing PDFs (default `documents`). |
| `CONTENT_UNDERSTANDING_ENDPOINT` | Azure AI Content Understanding endpoint. |
| `CONTENT_UNDERSTANDING_ANALYZER` | Analyzer ID (default `prebuilt-documentFields`). |
| `MANAGED_IDENTITY_CLIENT_ID` | Optional user-assigned managed identity client ID for `DefaultAzureCredential`. |
| `CONTENT_UNDERSTANDING_MAX_CONCURRENCY` | Max parallel page analyses (default `4`). |
| `TOTAL_TAX_FIELD_CANDIDATES` | Comma-separated list of field names treated as "total taxes" (default `TotalTaxesDue,TotalDueWithDiscount,TaxAfterDiscount`). |
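
As an illustration of how these settings and their defaults might be read at startup, here is a minimal sketch. The `Settings` dataclass and `load_settings` function are hypothetical names, and treating the two endpoint settings as required (raising `KeyError` when absent) is an assumption; only the variable names and default values come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class Settings:
    blob_account_url: str
    blob_container_name: str
    cu_endpoint: str
    cu_analyzer: str
    managed_identity_client_id: Optional[str]
    max_concurrency: int
    total_tax_candidates: Tuple[str, ...]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables (hypothetical helper)."""
    raw_candidates = env.get(
        "TOTAL_TAX_FIELD_CANDIDATES",
        "TotalTaxesDue,TotalDueWithDiscount,TaxAfterDiscount",
    )
    return Settings(
        # Assumption: no documented default, so a missing value raises KeyError.
        blob_account_url=env["BLOB_ACCOUNT_URL"],
        blob_container_name=env.get("BLOB_CONTAINER_NAME", "documents"),
        cu_endpoint=env["CONTENT_UNDERSTANDING_ENDPOINT"],
        cu_analyzer=env.get("CONTENT_UNDERSTANDING_ANALYZER", "prebuilt-documentFields"),
        # An empty string is treated the same as "not set".
        managed_identity_client_id=env.get("MANAGED_IDENTITY_CLIENT_ID") or None,
        # Clamp to at least one worker so a value of 0 cannot stall analysis.
        max_concurrency=max(1, int(env.get("CONTENT_UNDERSTANDING_MAX_CONCURRENCY", "4"))),
        # Split on commas and drop blank entries and surrounding whitespace.
        total_tax_candidates=tuple(
            name.strip() for name in raw_candidates.split(",") if name.strip()
        ),
    )
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to check in a unit test without touching the real environment.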
Example local.settings.json snippet:
{
"IsEncrypted": false,
"Values": {
"AzureWebJobsStorage": "UseDevelopmentStorage=true",
"FUNCTIONS_WORKER_RUNTIME": "python",
"BLOB_ACCOUNT_URL": "https://<account>.blob.core.windows.net",
"BLOB_CONTAINER_NAME": "documents",
"CONTENT_UNDERSTANDING_ENDPOINT": "https://<name>.cognitiveservices.azure.com",
"CONTENT_UNDERSTANDING_ANALYZER": "prebuilt-documentFields",
"TOTAL_TAX_FIELD_CANDIDATES": "TotalTaxesDue,TotalDueWithDiscount,TaxAfterDiscount"
}
}

Listed in `requirements.txt`:
azure-functions
azure-storage-blob
azure-ai-contentunderstanding
azure-identity
PyPDF2
Install locally with:
python -m pip install -r requirements.txt

- Configure `local.settings.json` with the required settings.
- Start the Functions host: `func start`
- Invoke the endpoint: `curl "http://localhost:7071/api/extract_pdf_content?document=<blob-name>.pdf"`
The response indicates the total number of pages and includes per-page entries with a full fields dictionary and the resolved total tax value.
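
For callers that prefer Python over curl, the same request might be made along these lines. This is a sketch using only the standard library; `build_url` and `extract` are hypothetical names, the base URL is the local host address from the curl example, and the 120-second timeout is an arbitrary assumption. The sketch makes no assumption about the key names inside the JSON response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Local Functions host address, as in the curl example above.
DEFAULT_BASE = "http://localhost:7071/api/extract_pdf_content"


def build_url(document: str, base: str = DEFAULT_BASE) -> str:
    """Build the request URL, percent-encoding the blob name."""
    return f"{base}?{urlencode({'document': document})}"


def extract(document: str, base: str = DEFAULT_BASE, timeout: float = 120.0):
    """Call the endpoint and return the decoded JSON body."""
    with urlopen(build_url(document, base), timeout=timeout) as response:
        return json.load(response)
```

Encoding the query string matters for blob names that contain spaces or characters such as `&`, which would otherwise be misread as part of the URL.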
- Deploy using standard Azure Functions workflows (CLI, VS Code, CI/CD).
- Ensure the Function App's managed identity has access to Blob Storage and the Content Understanding resource.
- Tune `CONTENT_UNDERSTANDING_MAX_CONCURRENCY` according to your hosting plan and analyzer limits.
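
To make the effect of the concurrency setting concrete, the sketch below shows one way a bounded `ThreadPoolExecutor` could fan out per-page analysis. It is an illustration under assumptions, not the code in function_app.py: `analyze_pages` is a hypothetical name, `analyze_page` stands in for the call to Content Understanding, and the `page` key in the result is invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def analyze_pages(
    pages: Sequence[bytes],
    analyze_page: Callable[[bytes], Dict],
    max_concurrency: int = 4,
) -> List[Dict]:
    """Analyze single-page PDFs in parallel with a bounded worker pool.

    `analyze_page` is a placeholder for the real service call (assumption).
    """
    # Never start more threads than pages, and always at least one.
    workers = max(1, min(max_concurrency, len(pages) or 1))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so page numbering is preserved
        # even when later pages finish first.
        results = list(pool.map(analyze_page, pages))
    # Attach a 1-based page number to each per-page result.
    return [{"page": number, **result} for number, result in enumerate(results, start=1)]
```

Raising `max_concurrency` shortens wall-clock time only until the analyzer's rate limits or the hosting plan's resources become the bottleneck, which is why the value is worth tuning per environment.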