Feature request: Create Textract Middleware #46

@HQarroum

Use case

Implement a middleware that exposes the Textract capabilities within a Lakechain document processing pipeline.

Solution/User Experience

Below is a tentative API design for this middleware; it may change before implementation.

Table data extraction.
Input(s) : PDF, Images
Output(s) : 'markdown' and/or 'text' and/or 'excel' and/or 'csv' and/or 'html'

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new TableExtractionTask.Builder()
    // Accepted values: 'markdown', 'text', 'excel', 'csv', 'html'.
    .withOutputType('markdown')
    // Defines whether a document will be created for each table,
    // or whether to group them all in one document.
    .withGroupOutput(false)
    .build())
  .build();
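
To make the two options above more concrete, here is a minimal sketch of how they could be typed and how `withGroupOutput` could affect the number of output documents. All names here (`TableExtractionOptionsBuilder`, `toDocuments`) are hypothetical and not part of Lakechain; accepting several output types in one call is an assumption derived from the "and/or" in the output list.

```typescript
// Hypothetical sketch: none of these names exist in Lakechain.
type TableOutputType = 'markdown' | 'text' | 'excel' | 'csv' | 'html';

interface TableExtractionOptions {
  outputTypes: TableOutputType[];
  groupOutput: boolean;
}

class TableExtractionOptionsBuilder {
  private outputTypes: TableOutputType[] = [];
  private groupOutput = false;

  // Assumption: one or more output types can be requested at once.
  withOutputType(...outputTypes: TableOutputType[]): this {
    this.outputTypes = outputTypes;
    return this;
  }

  withGroupOutput(groupOutput: boolean): this {
    this.groupOutput = groupOutput;
    return this;
  }

  build(): TableExtractionOptions {
    if (this.outputTypes.length === 0) {
      throw new Error('At least one output type is required.');
    }
    return {
      outputTypes: this.outputTypes.slice(),
      groupOutput: this.groupOutput,
    };
  }
}

// Returns one document per table, or a single document holding
// every table when grouping is enabled.
function toDocuments(tables: string[], groupOutput: boolean): string[] {
  if (tables.length === 0) {
    return [];
  }
  return groupOutput ? [tables.join('\n\n')] : tables.slice();
}
```

With this shape, the compiler rejects an unsupported output type, and the grouping decision stays in one place instead of being repeated per output format.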

Key value pair extraction.
Input(s) : PDF, Images
Output(s) : 'json' | 'csv'

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new KvExtractionTask.Builder()
    // Accepted values: 'json', 'csv'.
    .withOutputType('json')
    .build())
  .build();
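
As an illustration of the two output formats, the sketch below serializes extracted key-value pairs to JSON or CSV. The `KeyValue` shape, the `key,value` CSV header and the `serializeKeyValues` helper are assumptions made for this example, not a proposed final format.

```typescript
// Hypothetical sketch: the pair shape and CSV layout are assumptions.
interface KeyValue {
  key: string;
  value: string;
}

type KvOutputType = 'json' | 'csv';

// Quotes a CSV field when it contains a comma, a quote or a line break.
function escapeCsv(field: string): string {
  return /[",\r\n]/.test(field) ? `"${field.replace(/"/g, '""')}"` : field;
}

function serializeKeyValues(pairs: KeyValue[], outputType: KvOutputType): string {
  if (outputType === 'json') {
    return JSON.stringify(pairs);
  }
  const rows = pairs.map((p) => `${escapeCsv(p.key)},${escapeCsv(p.value)}`);
  return ['key,value'].concat(rows).join('\n');
}
```

For example, `serializeKeyValues([{ key: 'Name', value: 'Doe, John' }], 'csv')` yields a header line followed by `Name,"Doe, John"`.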

Visualization task.
Input(s) : PDF, Images
Output(s) : One or multiple images

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new ImageVisualizationTask.Builder()
    .withCheckboxes(true)
    .withKeyValues(true)
    .withTables(true)
    .withSearch('rent', { top_k: 10 })
    .build())
  .build();

Expense analysis.
Input(s) : PDF, Images
Output(s) : CSV

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new ExpenseAnalysisTask.Builder()
    .withOutputType('csv')
    .build())
  .build();

ID Analysis.
Input(s) : PDF, Images
Output(s) : JSON, CSV

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new IdAnalysisTask.Builder()
    // Accepted values: 'json', 'csv'.
    .withOutputType('json')
    .build())
  .build();

Layout Analysis.
Input(s) : PDF, Images
Output(s) : PDF, Images + Metadata
Exports layout information in a structured way in the document metadata.

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new LayoutAnalysisTask.Builder()
    .build())
  .build();
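
The issue does not specify what the structured layout metadata would look like. As one possible sketch, the function below maps Textract layout blocks (block types prefixed with `LAYOUT_`, such as `LAYOUT_TITLE` or `LAYOUT_SECTION_HEADER`) to a flat list of elements. The `LayoutElement` shape and the `toLayoutMetadata` name are hypothetical; `Block` is a minimal local subset of the Textract block structure, declared here so the example stays self-contained.

```typescript
// Minimal local subset of a Textract block; only the fields used below.
interface Block {
  BlockType: string;
  Page?: number;
  Geometry?: {
    BoundingBox?: { Left: number; Top: number; Width: number; Height: number };
  };
}

// Hypothetical metadata shape, not a Lakechain type.
interface LayoutElement {
  type: string; // e.g. 'title', 'section_header', 'table'
  page: number;
  // Coordinates are ratios of the page size, as returned by Textract.
  box: { left: number; top: number; width: number; height: number };
}

function toLayoutMetadata(blocks: Block[]): LayoutElement[] {
  const prefix = 'LAYOUT_';
  const elements: LayoutElement[] = [];
  for (const block of blocks) {
    const box = block.Geometry?.BoundingBox;
    // Skip non-layout blocks and blocks without geometry.
    if (block.BlockType.indexOf(prefix) !== 0 || !box) {
      continue;
    }
    elements.push({
      type: block.BlockType.slice(prefix.length).toLowerCase(),
      // Assumption: a missing page number means a single-page document.
      page: block.Page ?? 1,
      box: { left: box.Left, top: box.Top, width: box.Width, height: box.Height },
    });
  }
  return elements;
}
```

Keeping the metadata flat and page-indexed would let downstream middlewares filter by element type without parsing the raw Textract response.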

Alternative solutions

No response

Metadata

Assignees

Labels

new-middleware (a label associated with a new middleware), triage

Type

No type

Projects

Status

In progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
