Hi,
I'm new to unstructured. When I run the sample code to partition several PDF/DOC files, the extracted images are saved to a separate folder called figures. The naming convention seems to be figure-{page_number}-{#}. As a result, images extracted from different documents that appear on the same page number overwrite each other, and the metadata in the resulting JSON points to the wrong file.
For instance, the link to figures/figure-3-2.jpg is included in three separate JSON files.
Here is my code:
Pipeline.from_configs(
    context=ProcessorConfig(),
    indexer_config=LocalIndexerConfig(input_path=str(INPUT_DIR)),
    downloader_config=LocalDownloaderConfig(),
    source_connection_config=LocalConnectionConfig(),
    partitioner_config=PartitionerConfig(
        ocr_languages=["eng"],
        strategy="hi_res",
        partition_by_api=False,
        additional_partition_args={"extract_image_block_types": ["Image", "Table"]},
    ),
    # chunker_config=ChunkerConfig(
    #     chunking_strategy="by_title",
    #     chunk_max_characters=512,
    #     chunk_combine_text_under_n_chars=200,
    # ),
    # embedder_config=EmbedderConfig(embedding_provider="huggingface"),
    uploader_config=LocalUploaderConfig(output_dir=str(OUTPUT_DIR)),
).run()
Is this a bug, or am I doing something wrong?
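
To make the collision concrete, here is a minimal sketch in plain Python. It does not call unstructured at all: the figure_name function is only my assumption about the naming convention, inferred from the file names I see, and the document names (a.pdf, b.pdf, c.docx) are made up. It shows that a name built from page number and index alone cannot tell documents apart, while adding a per-document subfolder (one possible scheme, not something I know the library to do) keeps the paths distinct.

```python
from pathlib import Path


def figure_name(page_number: int, index: int) -> str:
    # Assumed convention, inferred from observed names such as figure-3-2.jpg.
    return f"figure-{page_number}-{index}.jpg"


def shared_dir_paths(docs: dict, out_dir: str = "figures") -> dict:
    """Map each output path to the documents that would write to it
    when every document shares one output folder."""
    writers = {}
    for doc, figures in docs.items():
        for page, idx in figures:
            path = Path(out_dir) / figure_name(page, idx)
            writers.setdefault(path.as_posix(), []).append(doc)
    return writers


def per_document_paths(docs: dict, out_dir: str = "figures") -> dict:
    """Same mapping, but with a subfolder named after each document's stem.
    Note: two inputs with the same stem (report.pdf, report.docx) would
    still collide under this hypothetical scheme."""
    writers = {}
    for doc, figures in docs.items():
        for page, idx in figures:
            path = Path(out_dir) / Path(doc).stem / figure_name(page, idx)
            writers.setdefault(path.as_posix(), []).append(doc)
    return writers


# Hypothetical inputs: each document has one image at page 3, index 2.
docs = {
    "a.pdf": [(3, 2)],
    "b.pdf": [(3, 2)],
    "c.docx": [(3, 2)],
}

# Paths that more than one document would write to.
collisions = {p: d for p, d in shared_dir_paths(docs).items() if len(d) > 1}
print(collisions)
print(per_document_paths(docs))
```

With the shared folder, all three documents map to the same path, which matches what I see in the JSON output.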