@vblagoje
Member

Why

Adds a new integration for Azure Document Intelligence (azure-doc-intelligence-haystack), providing a Haystack component that converts documents (PDF, images, Office files) to Haystack Documents using Azure's Document Intelligence service.

Azure deprecated azure-ai-formrecognizer in favor of azure-ai-documentintelligence (v1.0.0, GA Dec 2024). This new integration provides a clean slate with:

  • Markdown output format (GitHub Flavored Markdown): Better suited for RAG/LLM applications, because tables stay inline with their context, document structure (headings, lists) is preserved, and no manual assembly is required
  • Modern API: Uses the 2024-11-30 API version with improved table and structure detection
  • Simplified API: Removed deprecated parameters and streamlined the interface

What

Added AzureDocumentIntelligenceConverter component:

  • Uses azure-ai-documentintelligence>=1.0.0 package
  • Markdown output mode (default): Single document with inline tables and preserved structure
  • Text output mode (backward compatibility): Separate CSV table documents or markdown tables
  • Multiple model support: prebuilt-read (fast OCR), prebuilt-layout (enhanced structure), prebuilt-document (general), or custom models
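
To make the difference between the two table representations concrete, here is a minimal sketch using only the standard library. It is not the component's implementation: the `rows` data and the helpers `table_to_csv` and `table_to_markdown` are hypothetical names chosen for illustration, and the real converter may format cells differently.

```python
import csv
import io

# Hypothetical table as a header row plus data rows (illustration only).
rows = [
    ["Item", "Qty", "Price"],
    ["Widget, large", "2", "9.50"],
    ["Gadget", "1", "4.00"],
]


def table_to_csv(rows):
    """Serialize rows as CSV text, the shape a separate table document might hold."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def table_to_markdown(rows):
    """Render rows as a GitHub Flavored Markdown table suitable for inline use.

    Cells containing a pipe character are not escaped in this sketch.
    """
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


csv_text = table_to_csv(rows)
markdown_text = table_to_markdown(rows)
print(csv_text)
print(markdown_text)
```

The CSV form suits downstream tabular processing, while the Markdown form keeps the table readable next to the surrounding text, which is the property the markdown output mode relies on.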

Usage

  import os

  from haystack.utils import Secret
  from haystack_integrations.components.converters.azure_doc_intelligence import (
      AzureDocumentIntelligenceConverter,
  )

  # Markdown mode (recommended for RAG)
  converter = AzureDocumentIntelligenceConverter(
      endpoint=os.environ["AZURE_DI_ENDPOINT"],
      api_key=Secret.from_env_var("AZURE_AI_API_KEY"),
      output_format="markdown",
  )
  results = converter.run(sources=["invoice.pdf"])
  # Returns a single document with markdown, tables inline

  # Text mode with CSV tables (backward compatibility)
  converter = AzureDocumentIntelligenceConverter(
      endpoint=os.environ["AZURE_DI_ENDPOINT"],
      api_key=Secret.from_env_var("AZURE_AI_API_KEY"),
      output_format="text",
      table_format="csv",
  )
  results = converter.run(sources=["invoice.pdf"])
  # Returns separate CSV table documents plus a text document

Testing

  • 3 unit tests (init, to_dict, from_dict)
  • 4 integration tests with real Azure API (markdown output, text+CSV tables, metadata handling, multiple files)
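
For readers unfamiliar with the serialization tests, the round-trip pattern they exercise can be sketched as follows. `StubConverter` is a hypothetical stand-in, not the real `AzureDocumentIntelligenceConverter`; its field defaults are assumptions, and it omits the `Secret` handling the real component needs for `api_key`.

```python
from dataclasses import asdict, dataclass


@dataclass
class StubConverter:
    """Hypothetical stand-in used only to illustrate the round-trip check."""

    endpoint: str
    output_format: str = "markdown"  # assumed default, for illustration
    table_format: str = "markdown"  # assumed default, for illustration

    def to_dict(self):
        # Mirrors the common "type" + "init_parameters" layout.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": asdict(self),
        }

    @classmethod
    def from_dict(cls, data):
        return cls(**data["init_parameters"])


def test_round_trip():
    original = StubConverter(
        endpoint="https://example.invalid",
        output_format="text",
        table_format="csv",
    )
    restored = StubConverter.from_dict(original.to_dict())
    # A faithful round trip reproduces every init parameter.
    assert restored == original


test_round_trip()
```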

Notes for reviewer

  • Package follows the standard haystack-core-integrations structure
  • Includes optional [csv] extra for tabulate dependency
  • CI workflow added: .github/workflows/azure_doc_intelligence.yml
  • Integration added to root README.md inventory table

@github-actions github-actions bot added topic:CI type:documentation Improvements or additions to documentation labels Jan 12, 2026
@vblagoje vblagoje marked this pull request as ready for review January 12, 2026 15:03
@vblagoje vblagoje requested a review from a team as a code owner January 12, 2026 15:03
@vblagoje vblagoje requested review from julian-risch and sjrl and removed request for a team January 12, 2026 15:03
Development

Successfully merging this pull request may close these issues.

feat: Add a Azure OCR Converter that uses the azure-ai-documentintelligence library
