Feature Description
Accept args and kwargs in load_data() on the Unstructured base and forward them when calling partition() or partition_via_api().
This would add the flexibility to control the (far too many) kwargs that partition accepts.
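To make the request concrete, here is a minimal sketch of the forwarding pattern I have in mind. It is not the loader's actual code: SketchReader and fake_partition are hypothetical names made up for illustration, and fake_partition is only a stand-in that records what it receives, so the snippet runs without unstructured installed.

```python
from typing import Any, Callable, Dict, List


def fake_partition(filename: str, **kwargs: Any) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for partition(); it only echoes its inputs."""
    return [{"filename": filename, "kwargs": dict(kwargs)}]


class SketchReader:
    """Hypothetical reader used for illustration, not the real loader class."""

    def __init__(
        self, partition_fn: Callable[..., List[Dict[str, Any]]] = fake_partition
    ) -> None:
        # The partition callable is injectable so the sketch stays self-contained.
        self._partition = partition_fn

    def load_data(self, file: str, **partition_kwargs: Any) -> List[Dict[str, Any]]:
        # Forward every caller-supplied keyword argument unchanged.
        return self._partition(filename=file, **partition_kwargs)


elements = SketchReader().load_data(
    "report.pdf", strategy="hi_res", infer_table_structure=True
)
print(elements[0]["kwargs"])
```

With something along these lines, options such as strategy or infer_table_structure would reach the partition call without any subclassing. Whether the real change should take plain **kwargs or a dedicated dict parameter is a design choice I am leaving open.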
Reason
Over the last week, I tried to take advantage of the many features partition offers through this loader. To give a few examples:
- For .docx, I intended to use include_page_breaks, which defaults to True in their docx.py but to False in their "auto" partition method, which is the one the loader calls.
- For .pdf, I intended to use features such as infer_table_structure or strategy (to set hi_res). I also intended to use infer_table_structure for .pptx.
Because I cannot control the kwargs passed on to partition, I cannot shape data extraction the way I intend, and I am forced to subclass and override behavior for a very simple change.
Value of Feature
As explained above, users would be able to take advantage of the many great functionalities unstructured offers, such as infer_table_structure, strategy, and include_page_breaks, simply by passing args and kwargs through to partition() or partition_via_api().