I'm not sure if I'm overlooking something. I can hardly imagine that I'm the first one to come across this. Therefore, I appreciate any information or support. Thank you very much.
Describe the bug
I am trying to process a Markdown file using the Ingest library. However, partitioning/chunking via the API fails with a 500 status. This only happens when the Markdown file contains umlauts (ö ä ü). Or is it not intended for partitioning and chunking to be performed separately using the API?
To Reproduce
- Create the directory inputdir, containing only one Markdown file with the following (or similar) content (which includes umlauts):
  können
- Use Pipeline.from_configs:
import os
from pathlib import Path

# The unstructured-ingest config classes and Secret are imported elsewhere
# in the application; those imports were not part of the original snippet.

Pipeline.from_configs(
    context=ProcessorConfig(
        delete_cache=True,
    ),
    # Source: read every file below ./inputdir from the local filesystem
    indexer_config=LocalIndexerConfig(input_path=Path("./inputdir"), recursive=True),
    downloader_config=LocalDownloaderConfig(),
    source_connection_config=LocalConnectionConfig(),
    # Partitioning is delegated to the API
    partitioner_config=PartitionerConfig(
        partition_by_api=True,
        api_key=os.getenv("UNSTRUCTURED_API_KEY"),
        partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
        additional_partition_args={
            "split_pdf_page": True,
            "split_pdf_allow_failed": True,
            "split_pdf_concurrency_level": 15,
        },
    ),
    # Chunking is a separate step that is also delegated to the API
    chunker_config=ChunkerConfig(
        chunking_strategy="by_title",
        chunk_by_api=True,
        chunking_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
        chunk_api_key=os.getenv("UNSTRUCTURED_API_KEY"),
    ),
    embedder_config=EmbedderConfig(
        embedding_provider=os.getenv("LLM_PROVIDER"),
        embedding_model_name=os.getenv("EMBEDDING_MODEL_NAME"),
        embedding_api_key=os.getenv("OPENAI_API_KEY"),
    ),
    # Destination: Qdrant server
    destination_connection_config=ServerQdrantConnectionConfig(
        access_config=Secret(ServerQdrantAccessConfig()),
        url=os.getenv("QDRANT_URL"),
    ),
    stager_config=ServerQdrantUploadStagerConfig(),
    uploader_config=ServerQdrantUploaderConfig(
        collection_name=os.getenv("QDRANT_COLLECTION"),
        batch_size=50,
        num_processes=1,
    ),
).run()
- Filetype: Markdown file, containing umlauts
- Any additional API parameters: -
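To rule out the editor or shell as the source of a wrong encoding, the input file from the first step can be generated by a script that writes it explicitly as UTF-8. This is only a sketch: the helper name `write_fixture` and the file name `umlauts.md` are made up for illustration and do not come from the Ingest library.

```python
from pathlib import Path


def write_fixture(directory: Path, name: str = "umlauts.md", text: str = "können") -> Path:
    """Create `directory` and write one Markdown file encoded as UTF-8.

    `name` and the default `text` are illustrative placeholders.
    """
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / name
    # Pass the encoding explicitly; the platform default may not be UTF-8.
    target.write_text(text, encoding="utf-8")
    return target


if __name__ == "__main__":
    path = write_fixture(Path("./inputdir"))
    # Show the raw bytes so the UTF-8 encoding of "ö" (0xC3 0xB6) is visible.
    print(path, path.read_bytes())
```

If the 500 still occurs with a file produced this way, the input file itself is unlikely to be the cause.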
Environment:
- API running in a docker container (tested with version 0.0.82/latest)
- Calling the API via the Ingest library (unstructured-ingest 0.4.2)
- Python 3.12
Additional context
- The encoding of the markdown file is UTF-8
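The UTF-8 claim can be checked independently of any editor by strictly decoding the raw bytes of each input file. The following is a minimal sketch; the helper `first_invalid_utf8_byte` is a hypothetical name, and the `./inputdir` path is taken from the reproduction steps.

```python
from pathlib import Path
from typing import Optional, Tuple


def first_invalid_utf8_byte(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (offset, byte value) of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start]
    return None


if __name__ == "__main__":
    # Report every Markdown file below ./inputdir; None means the file is valid UTF-8.
    for path in sorted(Path("./inputdir").rglob("*.md")):
        print(path, first_invalid_utf8_byte(path.read_bytes()))
```

A result of `None` for every file would support the statement above and point at a later stage of the pipeline.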
Errors
In my python application:
2025-01-29 18:15:35,691 MainProcess INFO calling ChunkStep with 1 docs
INFO: calling ChunkStep with 1 docs
2025-01-29 18:15:35,691 MainProcess INFO processing content async
INFO: processing content async
INFO: HTTP Request: POST http://localhost:9500/general/v0/general "HTTP/1.1 500 Internal Server Error"
INFO: Failed to process a request due to API server error with status code 500. Attempting retry number 1 after sleep.
INFO: Server message - {"detail":"'utf-8' codec can't decode byte 0xf6 in position 99: invalid start byte"}
ERROR: Failed to partition the document.
ERROR: Server responded with 500 - {"detail":"'utf-8' codec can't decode byte 0xf6 in position 99: invalid start byte"}
2025-01-29 18:15:36,935 MainProcess ERROR Uncaught Error calling API: {"detail": "'utf-8' codec can't decode byte 0xf6 in position 99: invalid start byte"}
ERROR: Uncaught Error calling API: {"detail":"'utf-8' codec can't decode byte 0xf6 in position 99: invalid start byte"}
2025-01-29 18:15:36,945 MainProcess INFO chunk finished in 1.2495855s, attributes: file_id=6176397801e1
INFO: chunk finished in 1.2495855s, attributes: file_id=6176397801e1
2025-01-29 18:15:36,945 MainProcess ERROR Exception raised while running chunk
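For context on the byte named in the server message: 0xf6 is how "ö" is stored in single-byte encodings such as Latin-1 and Windows-1252, whereas UTF-8 stores it as the two bytes 0xC3 0xB6. The sketch below only illustrates that mapping; `legacy_readings` is a made-up helper, and which encoding was actually involved is not known from the report.

```python
from typing import Dict, Sequence


def legacy_readings(byte_value: int, encodings: Sequence[str] = ("latin-1", "cp1252")) -> Dict[str, str]:
    """Decode a single byte under each candidate single-byte encoding."""
    return {enc: bytes([byte_value]).decode(enc) for enc in encodings}


if __name__ == "__main__":
    # The byte from the error message, read as Latin-1 and as Windows-1252.
    print(legacy_readings(0xF6))
    # The same character encoded as UTF-8, for comparison.
    print("ö".encode("utf-8"))
```

This suggests that the bytes reaching the API were not UTF-8 at that point, even though the source file is.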
In the logs of my API docker container:
File "/home/notebook-user/prepline_general/api/general.py", line 428, in pipeline_api
raise e
File "/home/notebook-user/prepline_general/api/general.py", line 388, in pipeline_api
elements = partition(**partition_kwargs) # type: ignore # pyright: ignore[reportGeneralTypeIssues]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/partition/auto.py", line 292, in partition
elements = partition_json(filename=filename, file=file, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/documents/elements.py", line 581, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/file_utils/filetype.py", line 725, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/file_utils/filetype.py", line 683, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/partition/json.py", line 63, in partition_json
file_text = file_content if isinstance(file_content, str) else file_content.decode()
^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 1945: invalid continuation byte
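The traceback ends in partition_json, so the payload that fails to decode appears to be the JSON produced by the partition step rather than the Markdown file itself. One possible explanation, which is an assumption and not confirmed anywhere in this report, is that this intermediate JSON was written with a non-UTF-8 platform default encoding. The sketch below reproduces that failure mode in isolation; the element dictionary and the `round_trip` helper are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in for the partition output; not the library's real schema.
elements = [{"type": "NarrativeText", "text": "können Übung"}]


def round_trip(encoding: str) -> str:
    """Write the elements as JSON in `encoding`, then decode the bytes as UTF-8."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "elements.json"
        # ensure_ascii=False keeps ö/Ü as literal characters instead of \uXXXX escapes.
        path.write_text(json.dumps(elements, ensure_ascii=False), encoding=encoding)
        file_content = path.read_bytes()
    try:
        # Same call shape as the failing line in the traceback:
        # bytes.decode() without arguments assumes UTF-8.
        file_content.decode()
    except UnicodeDecodeError as exc:
        return f"failed: {exc.reason}"
    return "ok"


if __name__ == "__main__":
    for enc in ("utf-8", "cp1252"):
        print(enc, "->", round_trip(enc))
```

If this is the cause, writing the intermediate file with an explicit `encoding="utf-8"` (or with ASCII-escaped JSON) on the client side would avoid the error; whether the Ingest library does so is something the maintainers would need to confirm.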