
Problem in using unstructured workflows #254

Open
@edoproch

Description

Hi!
I am developing an ETL pipeline using Restate and Unstructured in Python, and I have the following problem:

  • I do not want to use the Partition Endpoint because I also need the enrichment step.
  • Using a workflow, I noticed that if a second user requests to process a document while the first one is still being processed, there are two possible cases:
      1. the second request arrives very soon after the first (i.e., the first job is still reading files from the source, in my case a folder on S3);
      2. the second request arrives during a later stage of the job: in this case the request is lost (a request consists of uploading a new file to the S3 source folder and then starting the workflow).

Since the first case is also problematic (if multiple users upload files almost simultaneously, they all have to wait until every file is processed before seeing their output, because Unstructured first processes everything and only then saves the results together), I came up with this solution:

  • When a new request comes in, I create a source folder on S3 specific to that request.
  • I create and start a workflow specific to that request, which I destroy when it finishes.

This way there should be no problem. Do you have any better suggestions?
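A minimal sketch of the per-request isolation idea described above (the `make_request_prefix` helper, the `requests/` key layout, and `start_request` are illustrative assumptions, not Restate or Unstructured APIs):

```python
import uuid


def make_request_prefix(bucket: str, request_id: str) -> str:
    """Build an S3 source prefix dedicated to one request, so that
    concurrent requests never share a source folder."""
    return f"s3://{bucket}/requests/{request_id}/input/"


def start_request(bucket: str) -> tuple[str, str]:
    """Mint a unique request id and its isolated source prefix.

    In the real pipeline, the caller would upload the new file under
    this prefix and then start a workflow keyed by the same request id,
    tearing both down once the workflow finishes."""
    request_id = uuid.uuid4().hex
    return request_id, make_request_prefix(bucket, request_id)


# Two near-simultaneous requests get disjoint source folders, so
# neither one has to wait for the other's files to be processed.
r1, p1 = start_request("my-bucket")
r2, p2 = start_request("my-bucket")
assert p1 != p2
```

Keying the workflow by the same request id also means a retry of the same request is idempotent, while unrelated requests run fully in parallel.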
