Occasional 429 errors due to limits introduced to 11.X #8102

@ArturAkh

Dear dCache developers,

After the switch to dCache 11.2.3, we at KIT started to see HTTP 429 errors more often on our WebDAV doors and on our Web Frontends.

According to our investigations, this seems to be related to the rate limiters introduced in 11.X (commits: 3f6d72a and 8e7c5b6).

We now understand that we have to adjust the related rate limits:

| Property | Meaning |
| --- | --- |
| .limits.rate.overall | Maximum overall request rate accepted by the service |
| .limits.rate.per-client.fractions | Maximum share of the overall request budget one client may use |
| .limits.error.max-allowed | Number of auth or permission failures allowed before temporary blocking |
| .limits.error.block.window.time | How long a client is blocked after too many auth or permission failures |
| .limits.rate.per-client.block.window.time | How long a client is blocked after exceeding its per-client request rate |
| .limits.blocked-clients.idle-time | How long blocked-client state is retained while idle |
| .limits.max-blocked-clients | Upper bound for tracked blocked-client entries |
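
As a rough illustration of how the first two properties might interact, here is a minimal sketch. It assumes the per-client budget is simply the overall rate multiplied by a single fraction, which is one reading of the descriptions above and not confirmed against the dCache implementation. The function name and the numbers are hypothetical and are not dCache defaults.

```python
def per_client_budget(overall_rate: float, fraction: float) -> float:
    """Return the request rate one client may use.

    Assumption (not confirmed by the dCache sources): the per-client
    budget is the overall rate multiplied by the per-client fraction.
    """
    if overall_rate < 0:
        raise ValueError("overall_rate must not be negative")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    return overall_rate * fraction


# Illustrative values only; these are NOT dCache defaults.
overall = 1000.0   # hypothetical value for .limits.rate.overall
fraction = 0.25    # hypothetical value for .limits.rate.per-client.fractions
print(per_client_budget(overall, fraction))
```

Under this reading, a site with a few very active clients would need either a higher overall rate or a larger fraction to avoid per-client rejections.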

We are unsure which numbers to use, since the current dCache defaults are too small for our operations. In addition, we cannot increase the limits incrementally and then observe for a few days, since each change requires a restart of the door and/or the frontend, which strictly speaking usually means a downtime.
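
Because the limits cannot be tuned incrementally, one possible way to narrow down candidate values offline is to measure the peak request rate actually seen per client. The following is a minimal sketch, assuming request timestamps (in seconds) have already been extracted for one client; the extraction step and the log format are not shown, and the function name is hypothetical.

```python
from collections import deque


def peak_requests_per_window(timestamps, window=1.0):
    """Return the largest number of requests in any interval (t - window, t].

    `timestamps` is an iterable of request times in seconds for one client.
    """
    recent = deque()  # timestamps still inside the current window
    peak = 0
    for t in sorted(timestamps):
        recent.append(t)
        # Drop requests that fell out of the half-open window (t - window, t].
        while recent and recent[0] <= t - window:
            recent.popleft()
        peak = max(peak, len(recent))
    return peak


# Made-up timestamps for illustration; not real measurements.
sample = [0.0, 0.25, 0.5, 1.5, 1.75, 3.0]
print(peak_requests_per_window(sample, window=1.0))
```

Running this per client and over all clients combined gives observed peaks that can be compared against the per-client and overall limits before committing to a restart.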

So far, we have found that rejections due to too many requests are only logged completely at DEBUG level, so we have to change the logging of the domain at runtime with the appropriate admin interface command:

log set stdout org.dcache.util.jetty.RateLimitedHandlerList DEBUG

We therefore wonder whether it is intentional that some of these messages are logged only at DEBUG level by default.

All in all, we would like to discuss with you the current state of dCache in that context (logging, conditions for rejection, corresponding workflow, etc.), and ask for some guidance on appropriate limit numbers.

For documentation purposes, I am also attaching at the end of this issue the AI-based investigations that helped us understand the situation a bit.

Thank you very much for your help in advance!

Best,

Artur on behalf of dCache admin team at KIT
