Scalability issues when storing binary file in pyspark column #44

Dear @aamend @alexott @nfx, 
I appreciate your work on making the `tika` file format possible.

After reviewing the serialiser [code](https://github.com/databrickslabs/tika-ocr/blob/fbf1976911eb4aebf9c95bff4c17517cfd2c6318/src/main/scala/com/databricks/labs/tika/TikaSerializer.scala#L23C32-L23C42), I noticed that you store the binary file itself as one of the columns.

This design does not allow a stable flow at a scale of more than 1000 large documents, because every row carries the full bytes of its file in addition to the extracted text.

It could be prudent to store the binary files outside of the result dataframe and keep only a reference to them, such as the file path.
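To illustrate the idea, here is a minimal sketch of what I mean on the consumer side. It assumes the result dataframe has a binary column named `content` and a `path` column pointing at the original file; both names are my assumption and may not match the actual schema. The helpers `without_binary` and `load_bytes` are hypothetical and not part of the library.

```python
from pathlib import Path
from urllib.parse import urlparse


def without_binary(df, binary_col="content"):
    """Return the dataframe without the binary column.

    `binary_col` is an assumed column name. The remaining columns
    (for example the path and the extracted text) are kept unchanged.
    """
    return df.drop(binary_col)


def load_bytes(path):
    """Re-read the original file on demand from its path.

    Assumes a locally readable file, given either as a plain path or
    as a `file:` URI. Other storage schemes would need their own client.
    """
    return Path(urlparse(path).path).read_bytes()
```

With something like this, the dataframe only carries the path and the extracted fields, and the raw bytes are read again only for the few documents where they are actually needed.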

Let me know your thoughts.
