
The endpoint /api/flow_runs/filter can contain large amounts of data #9947

Open
@catermelon

Description

First check

  • I added a descriptive title to this issue.
  • I used the GitHub search to find a similar issue and didn't find it.
  • I refreshed the page and this issue still occurred.
  • I checked if this issue was specific to the browser I was using by testing with a different browser.

Bug summary

TLDR: The /api/flow_runs/filter endpoint returns flow-run parameters in full, even when they contain gigabytes of file contents. This can make the UI very slow.

Hi Prefect friends. I'm running self-hosted Prefect, and the UI had become ungodly slow, sometimes taking 15-20 seconds or more to load. No worries, still usable, new product, etc., etc.

I went in to see if it was something I could submit a bugfix for, and discovered the culprit is /api/flow_runs/filter. That call's response contains all of the data files I processed that day, because they were parameters to a subflow.

I don't think I actually need a subflow here, so I can remove it and probably solve my issue. But there should probably be an idiot-proof truncation limit on parameters? Or maybe they shouldn't be returned unless specifically requested?
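
To make that concrete, here's a rough sketch of the kind of truncation I have in mind. This is made-up illustration code, not actual Prefect internals; the helper name and the size cap are mine:

import json

# Hypothetical helper (not real Prefect code): cap each parameter value
# before it goes into a list-endpoint response like /flow_runs/filter.
MAX_PARAM_BYTES = 4096  # arbitrary cap, just for illustration

def truncate_parameters(parameters: dict) -> dict:
    capped = {}
    for key, value in parameters.items():
        blob = json.dumps(value, default=str)
        if len(blob) > MAX_PARAM_BYTES:
            # Keep a breadcrumb instead of the multi-MB value itself.
            capped[key] = f"<truncated: {len(blob):,} bytes>"
        else:
            capped[key] = value
    return capped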

Reproduction

My setup looks something like this. There's a main orchestrator flow that first downloads all the data and persists it to cache. It then hands that list of cached files to a subflow to transform.

  1. main_flow() is the orchestrator.
  2. pages ends up being a list of cached files (PersistedResults, I think, since they're saved to a local filesystem).
  3. csv_transform(pages) is the subflow that gets handed the list of pages.

This is mostly pseudocode. I just wanted to give a flavor of what I'm doing, but I can sanitize real code if you want.

from datetime import timedelta

import requests
from prefect import flow, task
from prefect.serializers import CompressedJSONSerializer


@flow
def main_flow():
    pages = crawl_dumb_enterprise_api(endpoint)
    csv_transform(pages)  # subflow call: `pages` becomes its parameters
    kickoff_warehouse_load()


def crawl_dumb_enterprise_api(endpoint):
    result = []
    for page in endpoint:
        result.append(fetch_url(page))

    return result


@task(
    retries=3,
    retry_delay_seconds=60,
    timeout_seconds=600,  # ten minutes
    cache_result_in_memory=False,
    result_serializer=CompressedJSONSerializer(),
    cache_key_fn=fetch_url_cache_key,  # defined elsewhere
    cache_expiration=timedelta(hours=6),
)
def fetch_url(url):
    return requests.get(url).text  # the giant XML body


@flow
def csv_transform(pages):
    for giant_xml_thing in pages:
        extract_and_put_in_csv_format(giant_xml_thing)

(If you're wondering why I have to crawl the entire API before starting anything, it's because this is a terrible, slow, fragile API that uses unguessable IDs for pagination and embeds the URL of the next page in the giant result XML, so I have to wait until each page downloads anyway before I can start loading the next one. Not that I'm bitter or anything.)
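
For what it's worth, the workaround I'm leaning toward is to stop passing the XML bodies across the subflow boundary entirely. A rough sketch of my own plan (the cache directory and names are mine, nothing Prefect-specific): persist each page to disk during the crawl and hand the subflow a list of short path strings instead.

from pathlib import Path

from prefect import flow

CACHE_DIR = Path("/tmp/crawl_cache")  # assumption: local filesystem cache


@flow
def csv_transform_from_paths(page_paths: list[str]):
    # The subflow's parameters are now a handful of short strings, so
    # the /api/flow_runs/filter response stays small.
    for path in page_paths:
        giant_xml_thing = Path(path).read_text()
        extract_and_put_in_csv_format(giant_xml_thing)  # same helper as above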

Error

It's specifically this UI call (copied out of devtools as curl and lightly edited for privacy):

curl 'https://werehaus.site/api/flow_runs/filter' \
  -H 'authority: werehaus.site' \
  -H 'accept: application/json, text/plain, */*' \
  -H 'accept-language: en-US,en;q=0.9' \
  -H 'content-type: application/json' \
  -H 'cookie: logged_out_marketing_header_id=blahblah; _ga=blahblah; _ga_blahblah=blahblahblah' \
  -H 'origin: https://werehaus.site/' \
  -H 'referer: https://werehaus.site/flow-runs' \
  -H 'sec-ch-ua: "Not.A/Brand";v="8", "Chromium";v="114", "Google Chrome";v="114"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "macOS"' \
  -H 'sec-fetch-dest: empty' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-site: same-origin' \
  -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36' \
  -H 'x-prefect-ui: true' \
  --data-raw '{"flow_runs":{"expected_start_time":{"before_":"2023-06-15T06:59:59.999Z","after_":"2023-06-06T07:00:00.000Z"}},"sort":"START_TIME_DESC"}' \
  --compressed

If I run that, I get about a 1 GB download that has a few normal entries and then this thing:

{
  "id":"246bba3b-d01b-427a-8e1e-eb59e42d585b",
  "created":"2023-06-13T11:34:44.821804+00:00",
  "updated":"2023-06-13T11:43:46.647193+00:00",
  "name":"curly-impala",
  "flow_id":"0a5a23d3-3d0d-4a81-b114-ceb45fad54ca","state_id":"99d54758-53f8-4dfb-af4c-5d51cfb9aeb5",
  "deployment_id":null,
  "work_queue_id":null,
  "work_queue_name":null,
  "flow_version":"2d50993280a293fe8a999910e662b216",
  "parameters":{
    "pages":[
        "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n<!DOCTYPE 50O MB CHONKYTONK XML FILE...."

Browsers

  • Chrome
  • Firefox
  • Safari
  • Edge

Prefect version

Version:             2.10.13
API version:         0.8.4
Python version:      3.10.11
Git commit:          179edeac
Built:               Thu, Jun 8, 2023 4:10 PM
OS/Arch:             darwin/arm64
Profile:             prod
Server type:         server

Additional context

It's possible I'm doing something dumb. If there's a better way to do this, please feel free to point it out. I appreciate it.

I didn't see an easy way to fix this without refactoring way too much, but I'm happy to take a crack at it if it's simple.

Thanks so much. I really do like working with Prefect, it's solid.

Labels

api (Related to the Prefect REST API), bug (Something isn't working)
