[FEA]: Clean up returned results objects #1768
Description
Is this a new feature, an improvement, or a change to existing functionality?
New Feature
How would you describe the priority of this feature request
Currently preventing usage
Please provide a clear description of problem this feature solves
Currently, ingestor.ingest (in batch mode) returns a very large amount of data from the Ray workers.
Returning all of it takes a long time, and the results include far more fields than desired.
Describe the feature, and optionally a solution or implementation and any alternatives
- source_name - document name (filename)
- source_location - fully qualified path to the ingested file
- raw_location - fully qualified path for accessing the related page image, cropped images, audio/video chunks, or frames. Related to #1714, "(retriever) Add .store() task for persisting extracted images (#1675)"
- element_type - text / image / table / chart / infographic / audio / custom (can be populated by UDFs)
- sequence_number - for citation references. This is the page number for PDFs, but the audio/video/text chunk number for other content types. "sequence_number" is less clear than "page_number" for PDFs, but "page_number" is not clear for non-PDF content types :)
- bounding box - [x1, y1, x2, y2] coordinates, if applicable, for image/table overlays
- page dimensions - W/H for bbox normalization
- content_type - top-level file/content type
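To make the proposed shape concrete, here is a rough sketch of the fields listed above as a small Python record. This is illustrative only, not an existing type in the library: the class name `ResultElement`, the attribute names `bounding_box` and `page_dimensions`, the choice of which fields are optional, and the `normalized_bbox` helper are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ResultElement:
    """Hypothetical slimmed-down result record (not the library's actual type)."""

    source_name: str        # document name (filename)
    source_location: str    # fully qualified path to the ingested file
    element_type: str       # text / image / table / chart / infographic / audio / custom
    sequence_number: int    # page number for PDFs, chunk number for other content
    content_type: str       # top-level file/content type
    # Assumed optional: not every element has a stored artifact or a bbox.
    raw_location: Optional[str] = None
    bounding_box: Optional[Tuple[float, float, float, float]] = None  # x1, y1, x2, y2
    page_dimensions: Optional[Tuple[float, float]] = None             # width, height

    def normalized_bbox(self) -> Optional[Tuple[float, float, float, float]]:
        """Scale the bbox into the 0..1 range using the page dimensions."""
        if self.bounding_box is None or self.page_dimensions is None:
            return None
        width, height = self.page_dimensions
        if width <= 0 or height <= 0:
            return None
        x1, y1, x2, y2 = self.bounding_box
        return (x1 / width, y1 / height, x2 / width, y2 / height)


# Example values are made up for illustration.
element = ResultElement(
    source_name="report.pdf",
    source_location="/data/docs/report.pdf",
    element_type="table",
    sequence_number=3,
    content_type="pdf",
    bounding_box=(200.0, 250.0, 400.0, 500.0),
    page_dimensions=(800.0, 1000.0),
)
print(element.normalized_bbox())
```

Keeping the page dimensions next to the bounding box means an overlay can be drawn at any render size without going back to the source file.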