Partitioning markdown file with umlauts (ö ä ü) - Status code 500

I'm not sure if I'm overlooking something. I can hardly imagine that I'm the first one to come across this. Therefore, I appreciate any information or support. Thank you very much.

**Describe the bug**

I am trying to process a Markdown file using the Ingest library. However, partitioning/chunking via the **API fails** with status code `500`. This only happens when the Markdown file contains umlauts (ö ä ü). Or are partitioning and chunking not intended to be performed as separate steps via the API?

**To Reproduce**

1. Create the directory `inputdir` containing a single Markdown file with the following (or similar) content, which includes an umlaut:
```
können
```

2. Run the pipeline with `Pipeline.from_configs`:
```
import os
from pathlib import Path

# The imports of the unstructured-ingest classes used below are omitted here.
Pipeline.from_configs(
    context=ProcessorConfig(
        delete_cache=True,
    ),
    # Source: every file below ./inputdir on the local file system
    indexer_config=LocalIndexerConfig(input_path=Path("./inputdir"), recursive=True),
    downloader_config=LocalDownloaderConfig(),
    source_connection_config=LocalConnectionConfig(),
    # Partition via the API instead of locally
    partitioner_config=PartitionerConfig(
        partition_by_api=True,
        api_key=os.getenv("UNSTRUCTURED_API_KEY"),
        partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
        additional_partition_args={
            "split_pdf_page": True,
            "split_pdf_allow_failed": True,
            "split_pdf_concurrency_level": 15,
        },
    ),
    # Chunk via the same API endpoint, as a separate step
    chunker_config=ChunkerConfig(
        chunking_strategy="by_title",
        chunk_by_api=True,
        chunking_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
        chunk_api_key=os.getenv("UNSTRUCTURED_API_KEY"),
    ),
    embedder_config=EmbedderConfig(
        embedding_provider=os.getenv("LLM_PROVIDER"),
        embedding_model_name=os.getenv("EMBEDDING_MODEL_NAME"),
        embedding_api_key=os.getenv("OPENAI_API_KEY"),
    ),
    # Destination: Qdrant server
    destination_connection_config=ServerQdrantConnectionConfig(
        access_config=Secret(ServerQdrantAccessConfig()),
        url=os.getenv("QDRANT_URL"),
    ),
    stager_config=ServerQdrantUploadStagerConfig(),
    uploader_config=ServerQdrantUploaderConfig(
        collection_name=os.getenv("QDRANT_COLLECTION"),
        batch_size=50,
        num_processes=1,
    ),
).run()
```

- **Filetype**: Markdown file, containing umlauts
- **Any additional API parameters**: -
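
To make step 1 independent of any editor's default encoding, the input file can also be generated from Python. This is only a minimal sketch: the helper `write_sample` and the file name `umlauts.md` are hypothetical names chosen for illustration, not part of the Ingest library.

```python
from pathlib import Path


def write_sample(input_dir: Path, text: str = "können\n") -> Path:
    """Write a one-line Markdown file, explicitly encoded as UTF-8."""
    input_dir.mkdir(parents=True, exist_ok=True)
    md_file = input_dir / "umlauts.md"  # hypothetical file name
    # Passing encoding="utf-8" avoids falling back to the platform default.
    md_file.write_text(text, encoding="utf-8")
    return md_file


if __name__ == "__main__":
    path = write_sample(Path("./inputdir"))
    # Show the raw bytes so the encoding of the umlaut is visible.
    print(path.read_bytes())
```

In UTF-8 the `ö` is stored as the two bytes `0xc3 0xb6`, so a single `0xf6` byte should not appear in a file written this way.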

**Environment:**
 - API running in a Docker container (tested with version `0.0.82` / `latest`)
 - Calling the API via the Ingest library
 - `unstructured-ingest 0.4.2`
 - `Python 3.12`

**Additional context**
- The encoding of the markdown file is UTF-8
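
The UTF-8 claim can be double-checked with a strict decode of the raw file bytes. The following is a rough sketch, with `is_valid_utf8` being a hypothetical helper name and `./inputdir` the directory from the reproduction steps:

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # Check every Markdown file below the input directory.
    for md_file in Path("./inputdir").rglob("*.md"):
        print(md_file, is_valid_utf8(md_file.read_bytes()))
```

If this reports `True` for the input file, the invalid bytes seen by the API would have to be introduced at a later point, for example in an intermediate file written by the pipeline. That is a guess on my side, not something I have confirmed.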

**Errors**
In my Python application:
```
2025-01-29 18:15:35,691 MainProcess INFO     calling ChunkStep with 1 docs
INFO: calling ChunkStep with 1 docs
2025-01-29 18:15:35,691 MainProcess INFO     processing content async
INFO: processing content async
INFO: HTTP Request: POST http://localhost:9500/general/v0/general "HTTP/1.1 500 Internal Server Error"
INFO: Failed to process a request due to API server error with status code 500. Attempting retry number 1 after sleep.
INFO: Server message - {"detail":"'utf-8' codec can't decode byte 0xf6 in position 99: invalid start byte"}
ERROR: Failed to partition the document.
ERROR: Server responded with 500 - {"detail":"'utf-8' codec can't decode byte 0xf6 in position 99: invalid start byte"}
2025-01-29 18:15:36,935 MainProcess ERROR    Uncaught Error calling API: {"detail": "'utf-8' codec can't decode byte 0xf6 in position 99: invalid start byte"}
ERROR: Uncaught Error calling API: {"detail":"'utf-8' codec can't decode byte 0xf6 in position 99: invalid start byte"}
2025-01-29 18:15:36,945 MainProcess INFO     chunk finished in 1.2495855s, attributes: file_id=6176397801e1
INFO: chunk finished in 1.2495855s, attributes: file_id=6176397801e1
2025-01-29 18:15:36,945 MainProcess ERROR    Exception raised while running chunk
```

In the logs of my API Docker container:
```
  File "/home/notebook-user/prepline_general/api/general.py", line 428, in pipeline_api
    raise e
  File "/home/notebook-user/prepline_general/api/general.py", line 388, in pipeline_api
    elements = partition(**partition_kwargs)  # type: ignore # pyright: ignore[reportGeneralTypeIssues]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/partition/auto.py", line 292, in partition
    elements = partition_json(filename=filename, file=file, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/documents/elements.py", line 581, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/file_utils/filetype.py", line 725, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/file_utils/filetype.py", line 683, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured/partition/json.py", line 63, in partition_json
    file_text = file_content if isinstance(file_content, str) else file_content.decode()
                                                                   ^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 1945: invalid continuation byte

```
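
The two bytes named in the error messages are consistent with umlauts stored in a single-byte encoding. As a sketch under that assumption (Windows-1252 is my guess, Latin-1 gives the same result for these bytes), the snippet below shows which characters `0xf6` and `0xdc` map to and reproduces both decoder messages. `utf8_failure_reason` is a hypothetical helper, and `Übung` is just an example word starting with `Ü`.

```python
from typing import Optional


def utf8_failure_reason(data: bytes) -> Optional[str]:
    """Return the UTF-8 decoder's failure reason, or None if data is valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.reason
    return None


# Assumption: the offending bytes are Windows-1252 encoded characters.
CHAR_F6 = bytes([0xF6]).decode("cp1252")  # byte from the API response
CHAR_DC = bytes([0xDC]).decode("cp1252")  # byte from the container traceback

if __name__ == "__main__":
    print(CHAR_F6, CHAR_DC)
    # An umlaut written as a single byte is rejected by a UTF-8 decoder.
    print(utf8_failure_reason("können".encode("cp1252")))
    print(utf8_failure_reason("Übung".encode("cp1252")))
    # The same text encoded as UTF-8 decodes without error.
    print(utf8_failure_reason("können".encode("utf-8")))
```

Under this assumption `0xf6` is `ö` and `0xdc` is `Ü`, and the two failure reasons match the ones in the logs ("invalid start byte" and "invalid continuation byte"). This would suggest that the data reaching `partition_json` was not UTF-8 encoded, even though the Markdown file itself is.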

Partitioning markdown file with umlauts (ö ä ü) - Status code 500 #489
