
feat: token ratelimit based on token bucket algorithm #1974

Open — wants to merge 1 commit into main from feat-token-bucket-ratelimit
Conversation

hzhswyz (Contributor)
@hzhswyz hzhswyz commented Mar 28, 2025

Ⅰ. Describe what this PR did

Token rate limiting based on the token bucket algorithm.
The current fixed-time-window rate-limiting strategy is not well suited to streaming output. On average, each round of dialogue with the LLM takes about 3 minutes; with the token_per_minute strategy, the tokens budgeted for the first two minutes are wasted.

Ⅱ. Does this pull request fix one issue?

Ⅲ. Why don't you add test cases (unit test/integration test)?

Ⅳ. Describe how to verify it

Ⅴ. Special notes for reviews

@codecov-commenter commented Mar 28, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 43.74%. Comparing base (ef31e09) to head (64c007d).
Report is 420 commits behind head on main.

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #1974      +/-   ##
==========================================
+ Coverage   35.91%   43.74%   +7.83%     
==========================================
  Files          69       79      +10     
  Lines       11576    12728    +1152     
==========================================
+ Hits         4157     5568    +1411     
+ Misses       7104     6814     -290     
- Partials      315      346      +31     

see 75 files with indirect coverage changes


@hzhswyz force-pushed the feat-token-bucket-ratelimit branch 2 times, most recently from 5b55b36 to 079f7b8 on March 28, 2025 13:06
@hzhswyz force-pushed the feat-token-bucket-ratelimit branch from e1418c7 to 64c007d on April 5, 2025 08:18
@cr7258 (Collaborator) left a comment


🐉 LGTM

@johnlanni (Collaborator) left a comment


If we are going to support the token bucket algorithm, it should be offered as a switchable algorithm on top of the existing rate-limit configuration.
The choice of algorithm should not change how the limit is configured: the same qpd setting, for example, could be implemented with either a fixed window or a token bucket.
We should keep the configuration fields well abstracted, so they remain easy for both users and AI to understand.

@cr7258 (Collaborator) commented Apr 12, 2025

> The same qpd setting, for example, could be implemented with either a fixed window or a token bucket

The existing rate-limit fields (token_per_second, token_per_minute) already imply, by name, a fixed-window rule. To let the existing limit windows also support the new token bucket algorithm, I can think of two approaches:

One is to build on the current PR and add a unit field to the token_bucket_strategy configuration, leaving the existing fixed-window fields (token_per_second, token_per_minute, etc.) unchanged. This minimizes configuration changes: users who do not need the token bucket algorithm keep their existing configuration as-is.

token_bucket_strategy:
  rate: 1000
  unit: minute
  capacity: 10000

The other is to also adjust the existing fields: unify token_per_second, token_per_minute, etc. under a unit field, and distinguish fixed-window from token-bucket limiting with a type setting. Users upgrading the plugin would then need to update their configuration accordingly.

# Existing fixed-window rate limit
rate_limit:
  # type: fixed_window  # default, may be omitted
  requests: 1000    # 1000 tokens per minute
  unit: minute

# Token bucket rate limit
rate_limit:
  type: token_bucket
  rate: 1000  # add 1000 tokens to the bucket per minute
  unit: minute
  capacity: 10000  # the bucket holds at most 10000 tokens

@johnlanni WDYT?

@johnlanni (Collaborator)

I lean toward the design here:
https://www.envoyproxy.io/docs/envoy/latest/api-v3/type/v3/ratelimit_strategy.proto.html#envoy-v3-api-msg-type-v3-ratelimitstrategy-requestspertimeunit

Although that design is well thought out, it still ends up with a token_bucket alongside requests_per_time_unit 🐶
But I think defaulting the token bucket parameters would also be fine: for example, 100 qps would mean a max bucket of 100, refilled with 100 tokens every second.

@johnlanni (Collaborator) commented Apr 12, 2025

Consider this configuration style:

token_per_minute: 1000
strategy: 
  type: token_bucket
  interval_factor: 0.5
  fill_factor: 0.5
  capacity_factor: 1

This configuration is equivalent to:

capacity: 1000
tokens_per_fill: 500
fill_interval: 30s

The default value of strategy is:

strategy: 
  type: fixed_window

Since a fixed window covers about 90% of scenarios, this configuration stays easy to understand for 90% of users. The token_bucket form is somewhat more complex to grasp, but expert users who want to switch algorithms will not find it hard. It also extends well; in the future we could support:

strategy: 
  type: leaky_bucket
  capacity_factor: 1
  leak_rate_factor: 1

@hzhswyz (Contributor, Author) commented Apr 14, 2025

> Consider this configuration style:
>
> token_per_minute: 1000
> strategy: 
>   type: token_bucket
>   interval_factor: 0.5
>   fill_factor: 0.5
>   capacity_factor: 1

Most existing token-bucket rate limiters only expose rate (tokens filled per second) and capacity (bucket size). The configuration above effectively lets users customize both the fill interval and the amount added per fill, which makes the Lua-script implementation more complex. For example, with token_per_minute=10 the per-second rate is only about 0.17, which is awkward to implement.

@hzhswyz (Contributor, Author) commented Apr 14, 2025

Could the different strategies be configured separately? That might be easier to understand.

Token bucket strategy:

token_bucket_strategy: 
  rate: 2
  capacity: 300

Fixed time window strategy:

fixed_window_strategy: 
  tokens_per_window: 200
  window_unit: minute

This would remain compatible with the existing token_per_minute and similar configuration fields.
