Metadata in ConversionResult #2273

@dolfim-ibm

New feature

Allow document backends to populate a generic metadata field in the ConversionResult object.

For example: PDF metadata, USPTO patent metadata, etc.

Specs

  • ConversionResult:

    • Add metadata: Dict[str, Any] = {}
    • class ConversionResult(BaseModel):
          input: InputDocument
          status: ConversionStatus = ConversionStatus.PENDING  # failure, success
          errors: List[ErrorItem] = []  # structure to keep errors
          pages: List[Page] = []
          assembled: AssembledUnit = AssembledUnit()
          timings: Dict[str, ProfilingItem] = {}
          confidence: ConfidenceReport = Field(default_factory=ConfidenceReport)
          document: DoclingDocument = _EMPTY_DOCLING_DOC
  • Backends

    • Add extract_metadata() to the AbstractDocumentBackend. It can just be a default implementation returning the empty dict.
    • class AbstractDocumentBackend(ABC):
          @abstractmethod
          def __init__(self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]):
              self.file = in_doc.file
              self.path_or_stream = path_or_stream
              self.document_hash = in_doc.document_hash
              self.input_format = in_doc.format

          @abstractmethod
          def is_valid(self) -> bool:
              pass

          @classmethod
          @abstractmethod
          def supports_pagination(cls) -> bool:
              pass

          def unload(self):
              if isinstance(self.path_or_stream, BytesIO):
                  self.path_or_stream.close()

              self.path_or_stream = None

          @classmethod
          @abstractmethod
          def supported_formats(cls) -> Set["InputFormat"]:
              pass
  • Pipeline

    • Start with the SimplePipeline: add something like conv_res.metadata = conv_res.input._backend.extract_metadata()
    • with TimeRecorder(conv_res, "doc_build", scope=ProfilingScope.DOCUMENT):
          conv_res.document = conv_res.input._backend.convert()
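
To make the three pieces of the spec concrete, here is a rough, dependency-free sketch of how they could fit together. It is only an illustration under several assumptions: the classes below are simplified stand-ins using the standard library (a plain dataclass instead of the pydantic model, and a backend base class stripped down to one abstract method), `PlainBackend`, `PdfLikeBackend` and `build()` are hypothetical names, and the metadata values are made up. Only the `metadata` field, the `extract_metadata()` hook with its empty-dict default, and the assignment in the pipeline step come from the proposal above.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict


class AbstractDocumentBackend(ABC):
    """Simplified stand-in for the real backend base class."""

    @abstractmethod
    def is_valid(self) -> bool:
        ...

    def extract_metadata(self) -> Dict[str, Any]:
        # Proposed default: backends with nothing to report return an empty dict.
        return {}


class PlainBackend(AbstractDocumentBackend):
    """Hypothetical backend that relies on the default implementation."""

    def is_valid(self) -> bool:
        return True


class PdfLikeBackend(AbstractDocumentBackend):
    """Hypothetical backend that overrides the hook (values are made up)."""

    def is_valid(self) -> bool:
        return True

    def extract_metadata(self) -> Dict[str, Any]:
        return {"title": "Example", "page_count": 3}


@dataclass
class ConversionResult:
    """Stand-in for the pydantic model; only the proposed field is shown."""

    # default_factory gives every result its own dict instance.
    metadata: Dict[str, Any] = field(default_factory=dict)


def build(backend: AbstractDocumentBackend) -> ConversionResult:
    """Hypothetical helper mirroring the proposed pipeline step."""
    conv_res = ConversionResult()
    conv_res.metadata = backend.extract_metadata()
    return conv_res


if __name__ == "__main__":
    print(build(PlainBackend()).metadata)
    print(build(PdfLikeBackend()).metadata)
```

Keeping `extract_metadata()` non-abstract means existing backends would not need any change; only the ones that actually have metadata to expose would override it.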
