bug/Discrepancy between CLI and Python Runner for Box to Azure Cognitive Search Ingestion #73

Open
@ron-unstructured

Describe the bug
The CLI and the Python runner handle the download_dir parameter differently in unstructured-ingest when running Box -> Azure Cognitive Search. The CLI downloads files to the specified directory, while the Python runner attempts to write files under the filesystem root, which fails with a "Read-only file system" error.

To Reproduce

  • CLI (working):
    unstructured-ingest box \
      --box-app-config box_config_test.json \
      --remote-url box://12345 \
      --work-dir ./unstructured/ \
      --output-dir ./unstructured/ \
      --download-dir ./unstructured/ \
      --num-processes 1 \
      --raise-on-error \
      --verbose \
      --recursive \
      --re-download

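  • Python Runner (throws an error):
    from unstructured.ingest.connector.box import BoxAccessConfig, SimpleBoxConfig
    from unstructured.ingest.interfaces import (
        PartitionConfig,
        ProcessorConfig,
        ReadConfig,
    )
    from unstructured.ingest.runner import BoxRunner

    runner = BoxRunner(
        processor_config=ProcessorConfig(
            work_dir="./unstructured/",
            verbose=True,
            raise_on_error=True,
            output_dir="./unstructured/",
            num_processes=1,
        ),
        read_config=ReadConfig(
            # Same directory as --download-dir in the CLI call
            download_dir="./unstructured/",
            re_download=True,
        ),
        partition_config=PartitionConfig(),
        connector_config=SimpleBoxConfig(
            remote_url="box://12345",
            recursive=True,
            access_config=BoxAccessConfig(
                box_app_config="./box_config_test.json",
            ),
        ),
    )
    runner.run()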

Error message: "unstructured.ingest.error.SourceConnectionError: Error in getting data from upstream data source: [Errno 30] Read-only file system: '/{here is the folder as in the box itself}'"
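
A possible explanation, which I have not confirmed against the connector code: the path in the error starts at "/", which is what you get in plain Python when a relative download directory is joined with a remote path that has a leading slash. The sketch below only demonstrates that standard-library behavior. The remote path is a hypothetical stand-in for the Box folder name, and nothing here calls unstructured itself.

```python
import posixpath
from pathlib import PurePosixPath

download_dir = "./unstructured/"

# Hypothetical remote path standing in for the Box folder in the error message.
remote_path = "/some_box_folder/file.pdf"

# Joining with an absolute second component discards download_dir entirely.
joined_os = posixpath.join(download_dir, remote_path)
joined_pathlib = str(PurePosixPath(download_dir) / remote_path)

# Stripping the leading slash keeps the result under download_dir.
joined_fixed = posixpath.join(download_dir, remote_path.lstrip("/"))

print(joined_os)
print(joined_pathlib)
print(joined_fixed)
```

If the runner builds the local path this way, or ends up with an empty download directory, the write would target the filesystem root and produce the Errno 30 above. That is an assumption about the cause, not something the traceback proves.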

Expected behavior
The Python implementation should respect the download_dir parameter in the ReadConfig and download files to the specified directory, just like the CLI does.

Environment Info
unstructured: 0.14.9 (issue also present in version 0.12.x)

Labels

bug (Something isn't working)
