Scalability issues when storing binary file in pyspark column #44

@aleksandrskrivickis

Dear @aamend @alexott @nfx,
I appreciate your work on making the Tika file format possible.

After reviewing the serialiser code, I noticed that the binary file content is stored as one of the DataFrame columns.

This design does not run stably once the input grows beyond roughly 1000 large documents.

It may be prudent to store the binary files outside of the result DataFrame.
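
To illustrate what I mean, here is a minimal sketch in plain Python, not a proposal for the actual implementation. The helper name `externalize_content`, the record fields, and the content-addressed file layout are all my own assumptions for illustration; none of them come from this library. The idea is that each row keeps only a small reference to the file rather than the bytes themselves.

```python
import hashlib
import tempfile
from pathlib import Path


def externalize_content(content: bytes, store_dir: str) -> dict:
    """Write raw bytes under store_dir and return a small reference record.

    Hypothetical helper: the returned record (path, length, checksum) is what
    a row would carry instead of the binary payload itself.
    """
    digest = hashlib.sha256(content).hexdigest()
    target = Path(store_dir) / f"{digest}.bin"
    # Files are named by their hash, so identical payloads are written once.
    if not target.exists():
        target.write_bytes(content)
    return {
        "content_path": str(target),
        "length": len(content),
        "sha256": digest,
    }


if __name__ == "__main__":
    # A temporary directory stands in for shared storage in this sketch.
    with tempfile.TemporaryDirectory() as store:
        ref = externalize_content(b"example document bytes", store)
        print(ref["content_path"], ref["length"])
```

In a Spark job, something along these lines could presumably be applied per row or per partition, with `store_dir` pointing at storage reachable from every executor, so that the result DataFrame only carries the reference fields.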

Let me know your thoughts.
