Open
Labels
enhancement: New feature or request
Description
New feature
Allow document backends to populate a generic metadata field in the ConversionResult object.
For example: PDF metadata, USPTO patent metadata, etc.
Specs
- ConversionResult
  - Add metadata: Dict[str, Any] = {}

docling/docling/datamodel/document.py
Lines 198 to 210 in ff351fd

    class ConversionResult(BaseModel):
        input: InputDocument

        status: ConversionStatus = ConversionStatus.PENDING  # failure, success
        errors: List[ErrorItem] = []  # structure to keep errors

        pages: List[Page] = []
        assembled: AssembledUnit = AssembledUnit()
        timings: Dict[str, ProfilingItem] = {}
        confidence: ConfidenceReport = Field(default_factory=ConfidenceReport)

        document: DoclingDocument = _EMPTY_DOCLING_DOC
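A minimal sketch of the proposed field, using a stdlib dataclass as a stand-in for the pydantic model so that it runs without dependencies. ConversionResultSketch is a hypothetical name, and the use of default_factory (rather than a literal {} default) is an assumption about how the field might be declared, not something the proposal specifies.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ConversionResultSketch:
    """Hypothetical stand-in for ConversionResult; only the new field is shown."""

    # default_factory gives every instance its own dict, so metadata written
    # for one conversion never leaks into another.
    metadata: Dict[str, Any] = field(default_factory=dict)


first = ConversionResultSketch()
second = ConversionResultSketch()
first.metadata["title"] = "Example"  # only `first` is affected
```

In a pydantic model, Field(default_factory=dict) would be the analogous spelling, mirroring how the confidence field is already declared.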
- Backends
  - Add extract_metadata() to the AbstractDocumentBackend. It can just be a default implementation returning the empty dict.

docling/docling/backend/abstract_backend.py
Lines 13 to 39 in ff351fd

    class AbstractDocumentBackend(ABC):
        @abstractmethod
        def __init__(self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]):
            self.file = in_doc.file
            self.path_or_stream = path_or_stream
            self.document_hash = in_doc.document_hash
            self.input_format = in_doc.format

        @abstractmethod
        def is_valid(self) -> bool:
            pass

        @classmethod
        @abstractmethod
        def supports_pagination(cls) -> bool:
            pass

        def unload(self):
            if isinstance(self.path_or_stream, BytesIO):
                self.path_or_stream.close()

            self.path_or_stream = None

        @classmethod
        @abstractmethod
        def supported_formats(cls) -> Set["InputFormat"]:
            pass
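The backend change could look roughly like the sketch below: a non-abstract extract_metadata() on the base class, so existing backends keep working and only those with something to report override it. BackendSketch, PlainBackend, and PdfLikeBackend are hypothetical classes written for illustration; they are not docling's backends, and the metadata keys are made up.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class BackendSketch(ABC):
    """Hypothetical, trimmed-down stand-in for AbstractDocumentBackend."""

    @abstractmethod
    def is_valid(self) -> bool:
        ...

    def extract_metadata(self) -> Dict[str, Any]:
        # Default implementation: no metadata. Not abstract, so subclasses
        # are free to ignore it.
        return {}


class PlainBackend(BackendSketch):
    """A backend that does not override extract_metadata()."""

    def is_valid(self) -> bool:
        return True


class PdfLikeBackend(BackendSketch):
    """A backend that reports format-specific metadata (illustrative keys)."""

    def __init__(self, info: Dict[str, Any]):
        self._info = info

    def is_valid(self) -> bool:
        return True

    def extract_metadata(self) -> Dict[str, Any]:
        # Return a copy so callers cannot mutate the backend's own state.
        return dict(self._info)


plain_meta = PlainBackend().extract_metadata()
pdf_meta = PdfLikeBackend({"Title": "Report", "Producer": "SomeTool"}).extract_metadata()
```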
- Pipeline
  - Start with the SimplePipeline, add something like

    conv_res.metadata = conv_res.input._backend.extract_metadata()

docling/docling/pipeline/simple_pipeline.py
Lines 39 to 40 in ff351fd

    with TimeRecorder(conv_res, "doc_build", scope=ProfilingScope.DOCUMENT):
        conv_res.document = conv_res.input._backend.convert()
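As a rough illustration of the pipeline step, the sketch below stores the backend's metadata on the result right after building the document. It assumes the metadata belongs in a separate metadata attribute and leaves document untouched. StubBackend and build_document are hypothetical names, SimpleNamespace objects stand in for the real ConversionResult and InputDocument, and the TimeRecorder context is omitted.

```python
from types import SimpleNamespace
from typing import Any, Dict


class StubBackend:
    """Hypothetical backend exposing the two calls the pipeline would use."""

    def convert(self) -> str:
        return "converted-document"  # placeholder for a DoclingDocument

    def extract_metadata(self) -> Dict[str, Any]:
        return {"author": "unknown"}  # illustrative metadata only


def build_document(conv_res):
    """Populate both the document and the proposed metadata field."""
    backend = conv_res.input._backend
    conv_res.document = backend.convert()
    # Stored alongside the document rather than overwriting it.
    conv_res.metadata = backend.extract_metadata()
    return conv_res


conv_res = SimpleNamespace(
    input=SimpleNamespace(_backend=StubBackend()),
    document=None,
    metadata={},
)
build_document(conv_res)
```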