Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
docs/api-inference/rate-limits.md (Outdated)
> You get charged for every inference request, based on the compute time x price of the underlying hardware.
>
> Serverless API is not meant to be used for heavy production applications. If you need higher rate limits, consider [Inference Endpoints](https://huggingface.co/docs/inference-endpoints) to have dedicated resources.
>
> For instance, a request to [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) that takes 10 seconds to complete on a GPU machine that costs $0.00012 per second to run, will be billed $0.0012.
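For reference, the formula quoted above is just duration times per-second price; a minimal sketch with the numbers from the example (purely illustrative):

```python
# Illustrative only: the compute-time billing formula from the quoted docs.
# cost = compute time (seconds) x per-second hardware price (USD)
duration_seconds = 10        # time the request took on the GPU
price_per_second = 0.00012   # USD per second for that machine
cost = duration_seconds * price_per_second
print(f"${cost:.4f}")        # -> $0.0012
```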
This is slightly confusing - isn't it more based on the inference provider and what they charge? And specifically for LLMs, more on the basis of the tokens processed and generated?
Maybe we can use a different example here, say Flux: each generation is X dollars?
(so it's a bit easier to grok)
essentially talking about the routed requests here (maybe it's supposed to be somewhere else and I'm confused)
this doc is about HF's own Inference, not Inference Providers, but I agree it's a tad confusing :)
Let's take an example that actually runs on HF's Inference API then?
yes, do you have one on hand?
"deepseek-ai/DeepSeek-R1-Distill-Qwen-32B" if we still want to ride the deepseek wave
Oops, I just merged vb's suggestion (Flux), but we can add more examples in the future
Totally fine with Flux! Makes sense to use an example with a fixed cost
It gets tons of abuse though, and is down quite a lot. I'd recommend BFL Flux instead, which always works 😅
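A rough sketch of what that call could look like with `huggingface_hub` (the `black-forest-labs/FLUX.1-dev` model id and env-based `HF_TOKEN` auth are assumptions, not from this thread):

```python
from huggingface_hub import InferenceClient

# Sketch only: text-to-image against a BFL Flux repo (model id assumed).
# InferenceClient picks up the HF_TOKEN environment variable for auth.
client = InferenceClient()
image = client.text_to_image(
    "an astronaut riding a horse on the moon",
    model="black-forest-labs/FLUX.1-dev",
)
image.save("astronaut.png")  # text_to_image returns a PIL image
```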
Co-authored-by: Lucain <lucain@huggingface.co>
Co-authored-by: vb <vaibhavs10@gmail.com>
the important part is in rate-limits.md