Goal
I'm using Prefect for the first time to create a flow that scrapes a website and transcribes the files I've scraped.
Simple Flow:
Extract n URLs (Parameter) that have "todo-status" from the DB.
Download files from a website
Upload to transcription API
Transform & store the transcription.
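The four steps above, chained per URL, can be sketched in plain Python. This is a stand-in sketch only: the functions below (fetch_todo_urls, download, upload_for_transcription, transform_and_store) are hypothetical stubs, not real Prefect tasks, DB queries, or transcription-API calls.

```python
# Hypothetical stubs standing in for the real DB, download, and API steps.

def fetch_todo_urls(n):
    # Would query the DB for n URLs with "todo-status".
    return [f"https://example.com/file{i}" for i in range(n)]

def download(url):
    # Would fetch the file behind the URL.
    return f"bytes-of({url})"

def upload_for_transcription(blob):
    # Would upload to the transcription API and wait for the result.
    return f"transcript-of({blob})"

def transform_and_store(transcript, store):
    # Would reformat the transcript and INSERT it into Postgres;
    # here we just append to an in-memory list.
    store.append(transcript.upper())

store = []
for url in fetch_todo_urls(2):  # the flow "maps" over these URLs
    blob = download(url)
    transcript = upload_for_transcription(blob)
    transform_and_store(transcript, store)
```

In the real flow each of these steps would be its own Prefect task mapped over the URLs, rather than a plain loop.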
The transcription is a long process, depending on some API, and after it is done, I want to reformat the results and push them into a Postgres instance.
Problem
Because the process is long, I would like to store each result as soon as its iteration finishes. However, when working with map I need to wait for all transcriptions to end and only then push them to the DB (similar to this Prefect Advanced tutorial).
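The difference can be sketched with plain Python stand-ins (concurrent.futures in place of Prefect's mapping; transcribe and store_result are hypothetical stubs): with map-then-reduce, nothing reaches the DB until every transcription is done, whereas running the whole chain as one unit of work per item persists each transcript as soon as it finishes.

```python
from concurrent.futures import ThreadPoolExecutor

def transcribe(url):
    # Stand-in for the slow transcription API call.
    return f"transcript({url})"

def store_result(transcript, db):
    # Stand-in for the Postgres INSERT.
    db.append(transcript)

urls = ["u1", "u2", "u3"]

# map-then-reduce: all transcriptions must finish before any store runs.
db_batched = []
with ThreadPoolExecutor() as pool:
    transcripts = list(pool.map(transcribe, urls))  # blocks for the whole batch
for t in transcripts:
    store_result(t, db_batched)

# Per-item unit of work: each transcript is stored the moment it is ready.
db_incremental = []

def transcribe_and_store(url):
    store_result(transcribe(url), db_incremental)

with ThreadPoolExecutor() as pool:
    list(pool.map(transcribe_and_store, urls))
```

The second pattern is what I'd like to express with small, separate tasks instead of one combined function.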
Tried solutions
Creating one "megatask" that does download, upload, transform & store together. This works, but I lose many of Prefect's benefits of having small tasks.
Using a for loop in the flow. This requires declaring the number of URLs n beforehand (as nout).
apply_map (not relevant: under the hood it still behaves like map).
Questions
How can I run a for loop of multiple tasks serially on entities without defining the number of entities beforehand?
Small unrelated question: how can I initiate a run from the UI with the LocalDaskExecutor?
Currently I define flow.executor = LocalDaskExecutor(scheduler="threads", num_workers=CONCURRENCY_LIMIT) before registering the flow; however, I don't think it parallelizes runs.