
bug/Very High Memory Utilisation: 7mb Excel sheet taking more than 10gb memory space #3872

Open
@Akashtyagi

Description


Describe the bug
When parsing and chunking a 7 MB .xls file, the Unstructured server's memory usage balloons far out of proportion to the input size; for me it climbs past 10 GB and the server crashes.

To Reproduce
Input file -

xlsx4.xls

import time

from unstructured_client.models import operations, shared
from unstructured_client.models.errors import SDKError

# ChunkingStrategy and PartitionStrategy are enum imports from the SDK as used
# in my environment (exact import lines not shown here); unstructured_client is
# an already-configured UnstructuredClient instance.

# Read the spreadsheet into the request payload.
with open(file_path, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=file_path,
    )

req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=files,
        chunking_strategy=ChunkingStrategy.BY_TITLE,
        strategy=PartitionStrategy.HI_RES,
        multipage_sections=False,
    )
)

try:
    start = time.time()
    print("File name: ", file_path)
    partitioned_data = unstructured_client.general.partition(req)
    elapsed = time.time() - start
    print("Time taken in seconds: ", elapsed)

    # Count the Table elements that carry an HTML rendering of the table.
    tables = 0
    for element in partitioned_data.elements:
        if element["type"] == "Table" and element["metadata"]["text_as_html"] is not None:
            tables += 1
    print("Total table count: ", tables)

except SDKError:
    raise
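To turn "beyond 10 GB" into a hard number in the report, peak allocation can be measured with the stdlib `tracemalloc` module. This is a minimal sketch: `measure_peak` is a hypothetical helper, and `simulated_partition` is a stand-in workload; in the real repro it would be replaced by the `unstructured_client.general.partition(req)` call above.

```python
import tracemalloc


def measure_peak(fn, *args, **kwargs):
    """Run fn and return (result, peak_bytes) as seen by tracemalloc."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def simulated_partition():
    # Stand-in workload: builds 100k small dicts, loosely mimicking how
    # spreadsheet rows can fan out into many Python objects during parsing.
    return [{"type": "Table", "row": i} for i in range(100_000)]


elements, peak = measure_peak(simulated_partition)
print(f"elements: {len(elements)}, peak: {peak / 1024 / 1024:.1f} MiB")
```

Note that `tracemalloc` only traces Python-level allocations; if the blow-up happens inside a C extension, an RSS-based measurement (e.g. via `psutil`) would be needed instead.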

Expected behavior
Partitioning a 7 MB spreadsheet should complete with memory usage roughly proportional to the input size, and well within the pod's 10 GB limit, instead of crashing the server.

Screenshots
(memory-usage screenshot attached to the original issue)

Environment Info
Please run python scripts/collect_env.py and paste the output here.
This will help us understand more about the environment in which the bug occurred.

Additional context
Unstructured Pod Config
Resources:
CPU: 4
Memory: 10000
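
The memory figure above has no unit; assuming it is MiB (which would match the ~10 GB crash threshold described), the equivalent Kubernetes resources stanza would look like the following sketch. The field names come from the standard pod spec; the specific values are only the assumed reading of "CPU: 4 / Memory: 10000".

```yaml
# Hypothetical pod spec fragment; units assumed to be MiB.
resources:
  requests:
    cpu: "4"
    memory: "10000Mi"
  limits:
    cpu: "4"
    memory: "10000Mi"
```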

Labels: bug (Something isn't working)