Partitioning markdown file with umlauts (ö ä ü) - Status code 500 #489

Open
@navlisData

I'm not sure if I'm overlooking something. I can hardly imagine that I'm the first one to come across this. Therefore, I appreciate any information or support. Thank you very much.

Describe the bug

I am trying to process a Markdown file with the unstructured-ingest library. Partitioning/chunking via the API fails with status code 500, but only when the Markdown file contains umlauts (ö, ä, ü). Or are partitioning and chunking not meant to be performed as separate steps through the API?

To Reproduce

  1. Create the directory inputdir containing a single Markdown file with the following (or similar) content, which includes umlauts:
können
  2. Use Pipeline.from_configs:
Pipeline.from_configs(
    context=ProcessorConfig(
        delete_cache=True,
    ),
    indexer_config=LocalIndexerConfig(input_path=Path("./inputdir"), recursive=True),
    downloader_config=LocalDownloaderConfig(),
    source_connection_config=LocalConnectionConfig(),
    partitioner_config=PartitionerConfig(
        partition_by_api=True,
        api_key=os.getenv("UNSTRUCTURED_API_KEY"),
        partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
        additional_partition_args={
            "split_pdf_page": True,
            "split_pdf_allow_failed": True,
            "split_pdf_concurrency_level": 15,
        },
    ),
    chunker_config=ChunkerConfig(
        chunking_strategy="by_title",
        chunk_by_api=True,
        chunking_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
        chunk_api_key=os.getenv("UNSTRUCTURED_API_KEY"),
    ),
    embedder_config=EmbedderConfig(
        embedding_provider=os.getenv("LLM_PROVIDER"),
        embedding_model_name=os.getenv("EMBEDDING_MODEL_NAME"),
        embedding_api_key=os.getenv("OPENAI_API_KEY"),
    ),
    destination_connection_config=ServerQdrantConnectionConfig(
        access_config=Secret(ServerQdrantAccessConfig()),
        url=os.getenv("QDRANT_URL"),
    ),
    stager_config=ServerQdrantUploadStagerConfig(),
    uploader_config=ServerQdrantUploaderConfig(
        collection_name=os.getenv("QDRANT_COLLECTION"),
        batch_size=50,
        num_processes=1,
    ),
).run()
  • Filetype: Markdown file, containing umlauts
  • Any additional API parameters: -

Environment:

  • API running in a Docker container (tested with version 0.0.82 / latest)
  • Calling the API via the unstructured-ingest library
  • unstructured-ingest 0.4.2
  • Python 3.12

Additional context

  • The encoding of the markdown file is UTF-8
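
The UTF-8 claim above can be checked independently of the pipeline. The following is a minimal sketch using only the Python standard library; the helper name `find_invalid_utf8` is hypothetical and not part of unstructured-ingest. It reads the raw bytes of a file and reports the first byte that is not valid UTF-8, if there is one.

```python
from pathlib import Path


def find_invalid_utf8(path):
    """Return None if the file decodes as UTF-8, else (offset, byte value).

    Hypothetical diagnostic helper, not part of any Unstructured library.
    """
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded.
        return exc.start, data[exc.start]
    return None
```

If this returns None for every file in inputdir, the input itself is valid UTF-8 and the undecodable bytes reported by the server would have to originate in a later step. That is an inference from the check, not something the logs confirm.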

Errors
In my Python application:

2025-01-29 18:15:35,691 MainProcess INFO     calling ChunkStep with 1 docs
INFO: calling ChunkStep with 1 docs
2025-01-29 18:15:35,691 MainProcess INFO     processing content async
INFO: processing content async
INFO: HTTP Request: POST http://localhost:9500/general/v0/general "HTTP/1.1 500 Internal Server Error"
INFO: Failed to process a request due to API server error with status code 500. Attempting retry number 1 after sleep.
INFO: Server message - {"detail":"'utf-8' codec can't decode byte 0xf6 in position 99: invalid start byte"}
ERROR: Failed to partition the document.
ERROR: Server responded with 500 - {"detail":"'utf-8' codec can't decode byte 0xf6 in position 99: invalid start byte"}
2025-01-29 18:15:36,935 MainProcess ERROR    Uncaught Error calling API: {"detail": "'utf-8' codec can't decode byte 0xf6 in position 99: invalid start byte"}
ERROR: Uncaught Error calling API: {"detail":"'utf-8' codec can't decode byte 0xf6 in position 99: invalid start byte"}
2025-01-29 18:15:36,945 MainProcess INFO     chunk finished in 1.2495855s, attributes: file_id=6176397801e1
INFO: chunk finished in 1.2495855s, attributes: file_id=6176397801e1
2025-01-29 18:15:36,945 MainProcess ERROR    Exception raised while running chunk

In the logs of my API Docker container:

  File "/home/notebook-user/prepline_general/api/general.py", line 428, in pipeline_api
    raise e
  File "/home/notebook-user/prepline_general/api/general.py", line 388, in pipeline_api
    elements = partition(**partition_kwargs)  # type: ignore # pyright: ignore[reportGeneralTypeIssues]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/partition/auto.py", line 292, in partition
    elements = partition_json(filename=filename, file=file, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/documents/elements.py", line 581, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/file_utils/filetype.py", line 725, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/file_utils/filetype.py", line 683, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/partition/json.py", line 63, in partition_json
    file_text = file_content if isinstance(file_content, str) else file_content.decode()
                                                                   ^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 1945: invalid continuation byte
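
For reference, 0xf6 and 0xdc are the single-byte Latin-1 (and Windows-1252) encodings of ö and Ü. The sketch below reproduces both error messages from the logs with plain Python, without the API or the Ingest library. It only illustrates how such messages arise. It is an assumption, not something this report confirms, that text in a single-byte encoding reaches `partition_json` in the real pipeline. The helper name `utf8_decode_error` is hypothetical.

```python
def utf8_decode_error(raw: bytes) -> str:
    """Return the UnicodeDecodeError message for raw, or "" if it is valid UTF-8."""
    try:
        raw.decode()  # bytes.decode() defaults to UTF-8, as in the traceback
    except UnicodeDecodeError as exc:
        return str(exc)
    return ""


# UTF-8 encoded umlauts decode without error, so this prints an empty line.
print(utf8_decode_error("können".encode("utf-8")))

# Latin-1 encodes "ö" as the single byte 0xf6, which is never valid in UTF-8:
# 'utf-8' codec can't decode byte 0xf6 in position 1: invalid start byte
print(utf8_decode_error("können".encode("latin-1")))

# Latin-1 encodes "Ü" as 0xdc, a UTF-8 lead byte that needs a continuation byte:
# 'utf-8' codec can't decode byte 0xdc in position 0: invalid continuation byte
print(utf8_decode_error("Über".encode("latin-1")))
```

The positions differ from those in the logs because the sample strings are much shorter than the payload sent to the server.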
