[FEA]: Clean up returned results objects #1768

@randerzander

Description

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Currently preventing usage

Please provide a clear description of problem this feature solves

Currently, ingestor.ingest (in batch mode) returns a very large amount of data from Ray workers.

Returning this data takes a long time, and the results include far more fields than desired.

Describe the feature, and optionally a solution or implementation and any alternatives

Proposed fields for each returned result:

  1. source_name - document name (filename)

  2. source_location - fully qualified path to ingested file

  3. raw_location - fully qualified path for accessing the related page image, cropped images, or audio/video chunks or frames. Related to #1714, "(retriever) Add .store() task for persisting extracted images (#1675)"

  4. element_type - text / image / table / chart / infographic / audio / custom (can be populated by UDFs)

  5. sequence_number - for citation references. This is the page number for PDFs, and the audio/video/text chunk number for other content types. "sequence_number" is less clear than "page_number" for PDFs, but "page_number" is not clear for non-PDF content types :)

  6. bounding box - [x1, y1, x2, y2] coordinates, if applicable, for image/table overlays

  7. page dimensions - width/height, for bbox normalization

  8. content_type - top-level file/content type
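
To make the proposal concrete, here is a minimal sketch of what one slimmed-down result record could look like. It is illustrative only: `ResultRecord` is a hypothetical class, not something that exists in the library today. The names `bbox`, `page_width`, and `page_height` are assumptions for fields 6 and 7, and the sample values are made up.

```python
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ResultRecord:
    """Hypothetical slim result record covering the eight proposed fields."""

    source_name: str        # document name (filename)
    source_location: str    # fully qualified path to the ingested file
    element_type: str       # text / image / table / chart / infographic / audio / custom
    sequence_number: int    # page number for PDFs, chunk number otherwise
    content_type: str       # top-level file/content type
    raw_location: Optional[str] = None  # path to stored page image, crop, or media chunk
    # Assumed field names for "bounding box" and "page dimensions":
    bbox: Optional[Tuple[float, float, float, float]] = None  # (x1, y1, x2, y2)
    page_width: Optional[float] = None
    page_height: Optional[float] = None

    def normalized_bbox(self) -> Optional[Tuple[float, float, float, float]]:
        """Scale bbox into the 0..1 range using the page dimensions, if available."""
        if self.bbox is None or not self.page_width or not self.page_height:
            return None
        x1, y1, x2, y2 = self.bbox
        return (
            x1 / self.page_width,
            y1 / self.page_height,
            x2 / self.page_width,
            y2 / self.page_height,
        )


# Made-up example values, for illustration only.
record = ResultRecord(
    source_name="report.pdf",
    source_location="/data/docs/report.pdf",
    element_type="table",
    sequence_number=3,
    content_type="pdf",
    bbox=(100.0, 200.0, 300.0, 400.0),
    page_width=1000.0,
    page_height=800.0,
)

print(asdict(record))
print(record.normalized_bbox())
```

A flat record like this serializes to a small dict via `asdict`, which should keep the payload returned from each Ray worker compact. Keeping the page dimensions next to the bounding box lets a consumer normalize coordinates for overlays without reopening the source file.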
