
Image Renderer Fails with Timeout & Retries on 60s Interval w/ InfluxDB Datasources #785

@nharper-usgs

Description

What happened:
I have recently set up the Docker version of the Grafana image renderer. This service is part of a larger Compose stack that also includes InfluxDB and Grafana. The services can talk to each other, and I can successfully download rendered images via curl and other methods.

However, I am seeing fairly sporadic latency spikes that I believe I've traced back to the renderer retrying after an initial failed network request.

Most of the time the image downloads in about 2 s; however, roughly every 2-5 requests, the download takes almost exactly 61-62 s.

(base) [user@server grafana]$ curl -L "http://[URL]" -o tmp.png
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 52310  100 52310    0     0  31474      0  0:00:01  0:00:01 --:--:-- 31474
(base) [user@server grafana]$ curl -L "http://[URL]" -o tmp.png
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11196  100 11196    0     0   6397      0  0:00:01  0:00:01 --:--:--  6401
(base) [user@server grafana]$ curl -L "http://[URL]" -o tmp.png
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11187  100 11187    0     0    181      0  0:01:01  0:01:01 --:--:--  3199
(base) [user@server grafana]$ curl -L "http://admin:admin@localhost:3001/render/d-solo/cemi669r5v5s0f?orgId=1&from=now()&to=now()-1m&var-Drainages=Carbon&panelId=panel-67&&width=400&height=300&tz=UTC&theme=light" -o tmp.png
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10246  100 10246    0     0    165      0  0:01:02  0:01:01  0:00:01  2180
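
To narrow down where the ~60 s goes, a curl write-out that breaks the request into connect / time-to-first-byte / total phases can be used (a sketch; [URL] stands for the same render URL as above). It should show whether the extra time is spent waiting for the first byte (i.e. the renderer holding the request) or in the transfer itself:

# print connect time, time to first byte, and total time for one render request
curl -L -s -o /dev/null \
  -w 'connect=%{time_connect}s  ttfb=%{time_starttransfer}s  total=%{time_total}s\n' \
  "http://[URL]"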

In the renderer logs, I get the following only when I experience the 60s+ requests:

renderer  | {"failure":"net::ERR_ABORTED","level":"error","message":"Browser request failed","method":"POST","url":"http://grafana:3000/api/ds/query?ds_type=influxdb&requestId=SQR100"}
renderer  | {"err":"TimeoutError: Waiting failed: 60000ms exceeded\n    at new WaitTask (/home/nonroot/node_modules/puppeteer-core/lib/cjs/puppeteer/common/WaitTask.js:50:34)\n    at IsolatedWorld.waitForFunction (/home/nonroot/node_modules/puppeteer-core/lib/cjs/puppeteer/api/Realm.js:25:26)\n    at CdpFrame.waitForFunction (/home/nonroot/node_modules/puppeteer-core/lib/cjs/puppeteer/api/Frame.js:561:43)\n    at CdpFrame.<anonymous> (/home/nonroot/node_modules/puppeteer-core/lib/cjs/puppeteer/util/decorators.js:98:27)\n    at CdpPage.waitForFunction (/home/nonroot/node_modules/puppeteer-core/lib/cjs/puppeteer/api/Page.js:1366:37)\n    at waitForQueriesAndVisualizations (/home/nonroot/build/browser/browser.js:595:16)\n    at /home/nonroot/build/browser/browser.js:375:19\n    at callback (/home/nonroot/build/browser/browser.js:546:34)\n    at ClusteredBrowser.withMonitoring (/home/nonroot/build/browser/browser.js:553:16)\n    at ClusteredBrowser.performStep (/home/nonroot/build/browser/browser.js:509:36)","level":"error","message":"Error while performing step","step":"panelsRendered","url":"http://grafana:3000/d-solo/[URL....]"}

The reason I think a retry is involved is that the image always downloads successfully after the 60s timeout elapses. It also only seems to happen with the InfluxDB queries: I tried replicating the issue against a Prometheus backend and never hit it. To be clear, I don't see any delays when running the same Influx query in Grafana directly.

What you expected to happen:
I'd expect consistent download times. It feels like this could be mitigated simply by shortening the retry period.

How to reproduce it (as minimally and precisely as possible):
Repeatedly issue the same image rendering request; roughly every 2-5 requests, one takes ~60 s longer than usual (see the loop sketch below).
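
A minimal loop like the following (a sketch; [URL] is the same render URL as in the transcript above) surfaces the slow requests within a handful of iterations by printing the total time per request:

# fetch the same rendered panel 10 times and report how long each request took
for i in $(seq 1 10); do
  curl -L -s -o /dev/null -w "request $i: %{time_total}s\n" "http://[URL]"
done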

Anything else we need to know?:

Environment:

  • Grafana Image Renderer version: latest (4.x.x+)
  • Grafana version: latest (12.x.x+)
  • Installed plugin or remote renderer service: remote
  • OS Grafana Image Renderer is installed on: default docker OS
  • User OS & Browser: RHEL
  • Others:
    • InfluxDB v2.7
    • I've included the Compose setup for the grafana and renderer services below:
  renderer:
    image: grafana/grafana-image-renderer:latest
    container_name: renderer
    shm_size: 1g
    environment: 
      - AUTH_TOKEN=test-token
      - RENDERING_MODE=clustered
      - RENDERING_CLUSTERING_TIMEOUT=600
      - RENDERING_VIEWPORT_MAX_WIDTH=3000
      - RENDERING_VIEWPORT_MAX_HEIGHT=3000
      - ENABLE_METRICS=true
      - RENDERING_TIMING_METRICS=true
      # Try timeout
      - LOG_LEVEL=debug
    ports:
      - "8081:8081"
    networks: 
      - test_network

  grafana:
    build:
      context: ./grafana
      dockerfile: Dockerfile
    image: grafana:latest
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_USER=test
      - GF_SECURITY_ADMIN_PASSWORD=test
      - GF_SERVER_DOMAIN=grafana
      - GF_SERVER_ROOT_URL=http://grafana:3000/
      - GF_RENDERING_CALLBACK_URL=http://grafana:3000/
      - GF_RENDERING_SERVER_URL=http://renderer:8081/render
      - GF_RENDERING_RENDERER_TOKEN=test-token
      - GF_RENDERING_RENDERING_TIMEOUT=30
    ports:
      - "3001:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    depends_on:
      - influxdb
      - renderer
    networks:
      - test_network
