Add quota limit for auto evaluations#720
Conversation
a8be0c2 to
5d8e312
Compare
5d8e312 to
3fbe2f0
Compare
9b5ae54 to
e2550d1
Compare
e2550d1 to
7cdcb58
Compare
ff73c90 to
884228c
Compare
7cdcb58 to
67881a2
Compare
884228c to
e4adb49
Compare
67881a2 to
e4850b6
Compare
e4adb49 to
f96ea91
Compare
e4850b6 to
dfda1b4
Compare
f96ea91 to
2f964ea
Compare
dfda1b4 to
b563490
Compare
2f964ea to
9883cc1
Compare
b563490 to
7fb2370
Compare
09e9852 to
3e18724
Compare
7fb2370 to
7296534
Compare
524abaa to
8d2835a
Compare
bf41a84 to
5517936
Compare
e3709b8 to
4cabe13
Compare
5517936 to
df2b79b
Compare
57c49c1 to
43aeb61
Compare
kevindew
left a comment
There was a problem hiding this comment.
Great job, just a few small changes
|
Thanks for the review @kevindew i've made those changes with the exception of one comment i left. |
|
I ran into an interesting issue when testing on Integration where it wasn't working as implemented. I pushed up a new commit which outlines the issue and gives a few suggestions. Any feedback the approach we should take would be great. |
|
Ooh tricky problem - thanks for outlining options. I think I like: Though I would suggest doing: Did you have problems with expires_at? I notice that's changed from expires_in |
|
I did. it didn't seem to work so i had a look at the docs and increment expects expires_in |
| Rack::Attack.cache.store = old_store | ||
| end | ||
|
|
||
| config.before(:each, :with_memory_store) do |
There was a problem hiding this comment.
this is required for the system specs that run auto evaluation because NullStore returns nil when you call increment which causes this line to blow up.
There was a problem hiding this comment.
I think rather than doing this - because it's pretty heavy, it'd be better to change the usage code to be:
# fallback to 1 in scenarios where we have a null cache (dev/test) and this returns nil
count = Rails.cache.increment("auto_evaluations_count", 1, expires_in:) || 1
|
Thanks for the feedback @kevindew i've made those changes. |
So it should be fine because the the merged_options method is supposed to convert it to an |
Hmm I've had a bit of a play with it and while it does work, there is an edge case you can hit where if you were very close to the end of an hour you'd get a negative number and an exception: which would be annoying to hit. I then considered that it might be better to do expires_in but I realised that if we hit the same sort of edge case on that we'd end up with the cache potentially spanning more than one hour. So, I've tried to think of a way to do it without these edge cases that might be a little more robust. Rather than us relying on expiry I think we should set the key based on the time and then just allow it to expire after an hour. I think that's actually what Rack::Attack does if I remember right. So something like: I think this would also have an advantage that for every hour we get a fresh cache key and that is something explicit we can debug on if we hit problems, whereas relying on the TTL for MemCache is very cryptic because you can't look it up. I learnt this in my little terminal session when I accidentally ran fetch on a key once and then it seemed to never expire: |
|
@kevindewgood shout on that. I'll remember that approach going forward 🎉 I've made the requested change. |
This adds the ability to limit the number of autoevaluations that can be run per hour. If the limit is reached, further evaluations are skipped and a warning is logged. It uses the 'auto_evaluations_count_' prefix combined with the start of the current hour's timestamp (to_i) to form the cache key. It will expire after the last evaluation in that hour plus an hour. When the first evaluation is run in the next hour and new key is created. We've set the limit to 300 evaluations per hour based on discussion with data science. We're treating topic tagging as an auto evaluation. In order to reuse the quota functionality i've updated the AnswerTopicsJob to ingerit from the base job and added the quota checks.
There's essentially duplicate tests for retries in the topics and relevancy job specs. Any 4xx or 5xx error is raised via the OpenAI or Anthropic clients, so we can generalise the shared example to cover both cases.
Description
If we run our auto-evaluation on every answer that's generated we could end up spending a significant amount of money as many metrics have multiple LLM calls. Each of which is run 3 times.
This adds a quota for evaluations per hour which we've set to 300. Each job that runs an evaluation:
auto_evaluations_countkey in the cacheI've added the functionality to the base worker so it can be reused across the workers and added a shared example to test it.
I've updated the
AnswerTopicsJobto inherit from the base worker and updated it to adhere from the quota.I've also added another shared example for retry logic to remove some duplication
Trello card
https://trello.com/c/lfwcnNaz/3045-add-quota-for-auto-evaluation-llm-requests