Open
Labels
new-middleware: A label associated with a new middleware.
triage
Description
Use case
Implement a middleware that exposes the Textract capabilities within a Lakechain document processing pipeline.
Solution/User Experience
Below is a tentative API design for this middleware.
Table data extraction.
Input(s) : PDF, Images
Output(s) : 'markdown' and/or 'text' and/or 'excel' and/or 'csv' and/or 'html'
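To make the output types and the grouping option of this task more concrete, here is a minimal sketch of how extracted tables might be rendered. It covers only 'markdown' and 'csv', models a table as rows of strings with the first row as header, and every name in it (`Table`, `toMarkdown`, `toCsv`, `renderTables`) is hypothetical rather than part of Lakechain or Textract.

```typescript
// Hypothetical types and helpers for illustration only.
type TableOutputType = 'markdown' | 'csv';

// A table is modeled as rows of cell strings; the first row is the header.
type Table = string[][];

// Renders a table as Markdown, escaping pipes so cell text cannot break a row.
const toMarkdown = (table: Table): string => {
  if (table.length === 0) return '';
  const [header, ...rows] = table;
  const line = (cells: string[]): string =>
    `| ${cells.map((c) => c.replace(/\|/g, '\\|')).join(' | ')} |`;
  const separator = `| ${header.map(() => '---').join(' | ')} |`;
  return [line(header), separator, ...rows.map(line)].join('\n');
};

// Renders a table as CSV, quoting cells that contain commas, quotes, or newlines.
const toCsv = (table: Table): string =>
  table
    .map((row) =>
      row
        .map((c) => (/[",\n]/.test(c) ? `"${c.replace(/"/g, '""')}"` : c))
        .join(',')
    )
    .join('\n');

// Returns one document per table, or a single combined document when grouping.
const renderTables = (
  tables: Table[],
  outputType: TableOutputType,
  groupOutput: boolean
): string[] => {
  const render = outputType === 'markdown' ? toMarkdown : toCsv;
  const documents = tables.map(render);
  return groupOutput ? [documents.join('\n\n')] : documents;
};

// Usage: two tables, one output document each.
const sampleTables: Table[] = [
  [['Item', 'Price'], ['Rent', '1,200']],
  [['A'], ['b|c']],
];
const perTableDocuments = renderTables(sampleTables, 'markdown', false);
```

With grouping disabled, each table maps to its own output document; with grouping enabled, the same renderings are concatenated into one.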
const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new TableExtractionTask.Builder()
    // Accepted values: 'markdown' | 'text' | 'excel' | 'csv' | 'html'.
    .withOutputType('markdown')
    // Defines whether a document will be created for each table,
    // or whether to group them all in one document.
    .withGroupOutput(false)
    .build())
  .build();

Key value pair extraction.
Input(s) : PDF, Images
Output(s) : 'json' | 'csv'
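The two output types of this task can be expressed as a TypeScript string-literal union, which lets the compiler reject unsupported values. The sketch below shows one possible way to serialize extracted pairs under that typing; `KvOutputType`, `KeyValuePair`, and `serializeKeyValues` are hypothetical names, and the CSV layout (a `key,value` header row) is an assumption, not a decided format.

```typescript
// Hypothetical types and helpers for illustration only.
type KvOutputType = 'json' | 'csv';

interface KeyValuePair {
  key: string;
  value: string;
}

// Quotes a CSV cell when it contains a comma, a quote, or a newline.
const csvCell = (text: string): string =>
  /[",\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;

// Serializes key-value pairs to the requested output type.
const serializeKeyValues = (
  pairs: KeyValuePair[],
  outputType: KvOutputType
): string => {
  switch (outputType) {
    case 'json':
      return JSON.stringify(pairs);
    case 'csv':
      return [
        'key,value',
        ...pairs.map((p) => `${csvCell(p.key)},${csvCell(p.value)}`),
      ].join('\n');
  }
};

// Usage: the same pairs rendered in both formats.
const samplePairs: KeyValuePair[] = [{ key: 'Name', value: 'Doe, Jane' }];
const asJson = serializeKeyValues(samplePairs, 'json');
const asCsv = serializeKeyValues(samplePairs, 'csv');
```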
const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new KvExtractionTask.Builder()
    // Accepted values: 'json' | 'csv'.
    .withOutputType('json')
    .build())
  .build();

Visualize task.
Input(s) : PDF, Images
Output(s) : One or multiple images
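Producing annotated images requires mapping Textract geometry onto image pixels. Textract reports each bounding box as `Left`, `Top`, `Width`, and `Height` ratios of the page dimensions, so a rendering step would need a conversion along these lines. This is a sketch only: `PixelRect` and `toPixelRect` are hypothetical names, and rounding to whole pixels is an assumption.

```typescript
// Shape of a Textract bounding box: all values are ratios in [0, 1].
interface BoundingBox {
  Width: number;
  Height: number;
  Left: number;
  Top: number;
}

// Hypothetical pixel-space rectangle used by a drawing step.
interface PixelRect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Scales a ratio-based bounding box to pixel coordinates for a given image size.
const toPixelRect = (
  box: BoundingBox,
  imageWidth: number,
  imageHeight: number
): PixelRect => ({
  x: Math.round(box.Left * imageWidth),
  y: Math.round(box.Top * imageHeight),
  width: Math.round(box.Width * imageWidth),
  height: Math.round(box.Height * imageHeight),
});

// Usage: a box covering half the width of a 1000x800 image.
const sampleRect = toPixelRect(
  { Left: 0.1, Top: 0.2, Width: 0.5, Height: 0.25 },
  1000,
  800
);
```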
const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new ImageVisualizationTask.Builder()
    .withCheckboxes(true)
    .withKeyValues(true)
    .withTables(true)
    .withSearch('rent', { top_k: 10 })
    .build())
  .build();

Expense analysis.
Input(s) : PDF, Images
Output(s) : CSV
const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new ExpenseAnalysisTask.Builder()
    .withOutputType('csv')
    .build())
  .build();

ID Analysis.
Input(s) : PDF, Images
Output(s) : JSON, CSV
const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new IdAnalysisTask.Builder()
    // Accepted values: 'json' | 'csv'.
    .withOutputType('json')
    .build())
  .build();

Layout Analysis.
Input(s) : PDF, Images
Output(s) : PDF, Images + Metadata
Exports layout information in a structured way in the document metadata.
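The structure of the exported layout metadata is not specified here. As one possible shape, the sketch below keeps only Textract blocks whose `BlockType` starts with `LAYOUT_` (for example `LAYOUT_TITLE` or `LAYOUT_TABLE`) and maps them to a compact list. The `LayoutElement` and `LayoutMetadata` interfaces, the `toLayoutMetadata` function, the lowercased type names, and the default page of 1 are all assumptions made for illustration.

```typescript
// Subset of a Textract block that this sketch relies on.
interface TextractBlock {
  BlockType: string;
  Page?: number;
  Geometry?: {
    BoundingBox?: { Width: number; Height: number; Left: number; Top: number };
  };
}

// Hypothetical metadata shape; not a Lakechain schema.
interface LayoutElement {
  type: string;
  page: number;
  boundingBox: { Width: number; Height: number; Left: number; Top: number } | null;
}

interface LayoutMetadata {
  layout: LayoutElement[];
}

const LAYOUT_PREFIX = 'LAYOUT_';

// Keeps layout blocks only and maps them to the hypothetical metadata shape.
const toLayoutMetadata = (blocks: TextractBlock[]): LayoutMetadata => ({
  layout: blocks
    .filter((block) => block.BlockType.startsWith(LAYOUT_PREFIX))
    .map((block) => ({
      // 'LAYOUT_SECTION_HEADER' becomes 'section_header'.
      type: block.BlockType.slice(LAYOUT_PREFIX.length).toLowerCase(),
      // Assumption: treat a missing page number as page 1.
      page: block.Page ?? 1,
      boundingBox: block.Geometry?.BoundingBox ?? null,
    })),
});

// Usage: only the layout block survives the filter.
const sampleMetadata = toLayoutMetadata([
  { BlockType: 'LINE', Page: 1 },
  { BlockType: 'LAYOUT_SECTION_HEADER', Page: 2 },
]);
```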
const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new LayoutAnalysisTask.Builder()
    .build())
  .build();

Alternative solutions
No response
Metadata
Status
In progress