[Bug?]: Image downloads eventually stall and time out #3776

@dtroyer-salad

Description

zot version

v2.1.13-0-g4ad3fad

Describe the bug

We are running zot 2.1.13 and periodically see pulls hang, usually a bunch at the same time. At worst, no data is downloading at all and the open TCP connections eventually time out. All images on this server are pushed by an automated process; none are pulled by zot from other registries. The latency reported in the log for completed transfers gets quite long (e.g., >15 min for a manifest), and the connections that time out are not logged by zot as far as I can find, even at trace level (am I missing something here?).

We do have Bearer auth configured. In the worst state, the initial requests without an Authorization header are returned a 401 very quickly as expected, but the follow-up request with the token included is either slow or never arrives because the connection times out.
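For reference, the sketch below (Go standard library only; the registry, repository, tag, and token-service URLs are placeholders, not our real endpoints) walks the same challenge/token/retry flow and times each leg separately, which should make it clear whether the token service or zot itself is the slow leg:

// bearer_probe.go: time each leg of the registry token auth flow separately.
// The registry, repository, tag, and token URL below are placeholders.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

func timed(client *http.Client, req *http.Request) (*http.Response, time.Duration, error) {
	start := time.Now()
	resp, err := client.Do(req)
	return resp, time.Since(start), err
}

func main() {
	const (
		manifestURL = "https://registry.example.com:8443/v2/myrepo/manifests/latest"                  // placeholder
		tokenURL    = "https://auth.example.com/token?service=registry&scope=repository:myrepo:pull" // placeholder
	)
	client := &http.Client{Timeout: 60 * time.Second}

	// Leg 1: unauthenticated request; expect a fast 401 with a WWW-Authenticate challenge.
	req, _ := http.NewRequest(http.MethodGet, manifestURL, nil)
	resp, d, err := timed(client, req)
	if err != nil {
		panic(err)
	}
	fmt.Printf("challenge: %s in %v (WWW-Authenticate: %s)\n", resp.Status, d, resp.Header.Get("WWW-Authenticate"))
	resp.Body.Close()

	// Leg 2: fetch a bearer token from the auth service named in the challenge.
	// Some token services return "access_token" instead of "token".
	req, _ = http.NewRequest(http.MethodGet, tokenURL, nil)
	resp, d, err = timed(client, req)
	if err != nil {
		panic(err)
	}
	var tok struct {
		Token string `json:"token"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&tok); err != nil {
		panic(err)
	}
	resp.Body.Close()
	fmt.Printf("token: %s in %v\n", resp.Status, d)

	// Leg 3: retry the manifest request with the token; this is the leg that stalls for us.
	req, _ = http.NewRequest(http.MethodGet, manifestURL, nil)
	req.Header.Set("Authorization", "Bearer "+tok.Token)
	resp, d, err = timed(client, req)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
	fmt.Printf("manifest: %s in %v\n", resp.Status, d)
}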

I need some suggestions on how to track down what zot is doing internally that might get it into this 'stuck' state. When it happened again this morning, outbound bandwidth pegged at maximum for a while after restarting the zot service, but zot handled the pent-up demand and eventually settled back down to normal levels. Memory usage of the zot process also climbs gradually: this morning, as things were falling over, it climbed from 1.5 GB to 2.2 GB; after the restart it settled at 0.3 GB.
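Since metrics is the one extension we do have enabled, a rough way to track the climb between incidents is to poll /metrics and log the Go runtime series. A minimal sketch, assuming zot's Prometheus handler exports the standard Go and process collectors (go_goroutines, go_memstats_heap_inuse_bytes, process_resident_memory_bytes) and that the endpoint is reachable with our auth setup; the URL and the 30s interval are placeholders:

// metrics_watch.go: poll the registry's Prometheus endpoint and print the
// runtime series most relevant to a leak (goroutine count, heap in use, RSS,
// open file descriptors). Metric names depend on what the build exports.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
	"time"
)

func main() {
	const metricsURL = "https://registry.example.com:8443/metrics" // placeholder
	watched := []string{
		"go_goroutines",
		"go_memstats_heap_inuse_bytes",
		"process_resident_memory_bytes",
		"process_open_fds",
	}

	client := &http.Client{Timeout: 15 * time.Second}
	for {
		resp, err := client.Get(metricsURL)
		if err != nil {
			fmt.Println("scrape failed:", err)
		} else {
			sc := bufio.NewScanner(resp.Body)
			for sc.Scan() {
				line := sc.Text()
				for _, name := range watched {
					if strings.HasPrefix(line, name+" ") {
						fmt.Printf("%s %s\n", time.Now().Format(time.RFC3339), line)
					}
				}
			}
			resp.Body.Close()
		}
		time.Sleep(30 * time.Second)
	}
}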

We have looked for things on the server that might be blocking zot but do not see any issues with CPU, RAM, disk I/O, etc. My gut says something internal to the service is happening here, but I am not sure where to start ruling out possibilities.
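One probe that might help narrow this down is timing the phases of a single manifest GET with net/http/httptrace: if the stall happens before the first response byte arrives, the problem is on the server side of the request rather than in the transfer itself. A minimal sketch (the URL and bearer token are placeholders):

// phase_probe.go: break one manifest GET into connect / TLS / first-byte / body
// phases so a stall can be attributed to a specific step.
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"net/http/httptrace"
	"time"
)

func main() {
	const url = "https://registry.example.com:8443/v2/myrepo/manifests/latest" // placeholder
	const token = "REDACTED"                                                   // placeholder bearer token

	start := time.Now()
	var connectDone, tlsDone, firstByte time.Time

	trace := &httptrace.ClientTrace{
		ConnectDone:          func(network, addr string, err error) { connectDone = time.Now() },
		TLSHandshakeDone:     func(_ tls.ConnectionState, _ error) { tlsDone = time.Now() },
		GotFirstResponseByte: func() { firstByte = time.Now() },
	}

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	resp, err := (&http.Client{Timeout: 10 * time.Minute}).Do(req)
	if err != nil {
		panic(err)
	}
	n, _ := io.Copy(io.Discard, resp.Body)
	resp.Body.Close()

	fmt.Printf("status=%s bytes=%d\n", resp.Status, n)
	fmt.Printf("connect=%v tls=%v first-byte=%v total=%v\n",
		connectDone.Sub(start), tlsDone.Sub(start), firstByte.Sub(start), time.Since(start))
}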

To reproduce

  1. Configuration
{
  "distSpecVersion": "1.1.1",
  "storage": {
    "rootDirectory": "/storage/zot",
    "dedupe": "true",
    "gc": "true",
    "gcDelay": "1h",
    "gcInterval": "24h"
  },
  "http": {
    "address": "0.0.0.0",
    "port": "8443",
    "realm": "****",
    "tls": {
      "cert": "/etc/opt/zot/fullchain.pem",
      "key": "/etc/opt/zot/privkey.pem"
    },
    "Ratelimit": {
      "Rate": 1000
    },
    "auth": {
      "failDelay": 5,
      "bearer": {
        "realm": "****",
        "service": "****",
        "cert": "/etc/zot/auth.crt"
      }
    }
  },
  "log": {
    "level": "trace",
    "output": "/var/log/zot/zot.log",
    "audit": "/var/log/zot/zot-audit.log"
  },
  "extensions": {
    "metrics": {
      "enable": true,
      "prometheus": {
        "path": "/metrics"
      }
    },
    "lint": {
      "enable": false
    },
    "scrub": {
      "enable": false
    },
    "search": {
      "enable": false
    },
    "sync": {
      "enable": false
    },
    "trust": {
      "enable": false
    },
    "ui": {
      "enable": false
    }
  }
}

  2. Client tool used

Our internal client, built with oras-go, is used for pulling images (a stripped-down sketch of a comparable pull follows at the end of this list).

  3. Seen error

No actual errors are reported, but downloading eventually stops completely and the TCP sessions time out.
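For completeness, here is a stripped-down sketch of the kind of pull the client performs, written against oras-go v2 with explicit context and HTTP timeouts so a stalled transfer surfaces as an error instead of hanging; the registry, repository, tag, and output directory are placeholders, and our production client is more involved:

// pull_probe.go: minimal oras-go v2 pull with explicit timeouts so a stalled
// transfer fails loudly instead of hanging. All names below are placeholders.
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"

	"oras.land/oras-go/v2"
	"oras.land/oras-go/v2/content/file"
	"oras.land/oras-go/v2/registry/remote"
	"oras.land/oras-go/v2/registry/remote/auth"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Minute)
	defer cancel()

	repo, err := remote.NewRepository("registry.example.com:8443/myrepo") // placeholder
	if err != nil {
		panic(err)
	}
	// Bearer auth: the auth.Client handles the 401 challenge, token fetch and retry.
	repo.Client = &auth.Client{
		Client: &http.Client{Timeout: 10 * time.Minute}, // per-request cap
		Cache:  auth.NewCache(),
	}

	dst, err := file.New("./pulled") // placeholder output directory
	if err != nil {
		panic(err)
	}
	defer dst.Close()

	desc, err := oras.Copy(ctx, repo, "latest", dst, "latest", oras.DefaultCopyOptions)
	if err != nil {
		panic(err)
	}
	fmt.Println("pulled", desc.Digest)
}

With the per-request timeout set, a pull that would otherwise sit on a dead connection returns an error we can log and retry, which also gives us a timestamped record of when the registry went quiet.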

Expected behavior

No response

Screenshots

No response

Additional context

  • zot v2.1.13-0-g4ad3fad
    {"time":"2026-02-04T23:05:42.445833401Z","level":"info","message":"version","distribution-spec":"1.1.1","commit":"v2.1.13-0-g4ad3fad","binary-type":"-events-imagetrust-lint-metrics-mgmt-profile-scrub-search-sync-ui-userprefs","go version":"go1.25.5","caller":"zotregistry.dev/zot/v2/pkg/cli/server/root.go:220","func":"zotregistry.dev/zot/v2/pkg/cli/server.NewServerRootCmd.func1","goroutine":1}
  • 128GB RAM
  • 2 x10Gb bonded NICs
  • 66TB used on a local 556TB RAID (md) array
  • bearer auth enabled
  • Prometheus metrics is the only enabled extension


Labels

bug (Something isn't working)
rm-external (Roadmap item submitted by non-maintainers)
