
Conversation

gregoirevilde-se commented Jan 29, 2026

Added the KV cache tokens' contribution to GPU power consumption and model latency, based on the calculations in the references below, which detail how LLM transformers interact with the KV cache.

https://kipp.ly/transformer-inference-arithmetic/
https://kipp.ly/transformer-param-count/
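
For context, a minimal sketch of the per-token KV cache memory arithmetic those articles walk through (the model dimensions below are illustrative, not taken from any specific model):

```python
# Per-token KV cache size, following the arithmetic in the first article:
# 2 tensors (K and V) * n_layers * d_model values, stored at fp16 (2 bytes each).
def kv_cache_bytes_per_token(n_layers: int, d_model: int, bytes_per_value: int = 2) -> int:
    return 2 * n_layers * d_model * bytes_per_value

# Illustrative dimensions (n_layers=64, d_model=8192): roughly 2 MB of cache per token.
print(kv_cache_bytes_per_token(64, 8192) / 1e6)  # ~2.1 MB
```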

I'm not an expert on this topic yet and am still learning how transformers work in detail, so the way KV cache emissions are calculated here can certainly be improved. This PR is a first step after four months of trying to find the right information on how to measure cache emissions.


I'm already using EcoLogits to analyse my team's Cursor (https://cursor.com/) emissions. Our usage logs report four types of tokens:

  • Input (w/o cache)
  • Input (w/ cache)
  • Cache Read
  • Output Tokens

With a quick calculation, I estimate a 25x increase in our team's emissions once Cache Read tokens (currently omitted) are included using the "1/6 factor" method, as sketched below.
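
To make the "1/6 factor" concrete, here is a minimal sketch under the assumption that it means weighting each Cache Read token at one sixth of a regular token (roughly the ratio of the k/v-projection FLOPs to the full per-token forward pass derived in the first article linked above); the token counts and helper name are made up for illustration:

```python
# Assumption: the "1/6 factor" weights each Cache Read token at 1/6 of a regular token.
CACHE_READ_WEIGHT = 1 / 6

def weighted_tokens(input_no_cache: int, input_with_cache: int,
                    cache_read: int, output: int) -> float:
    """Collapse the four Cursor token counters into a single weighted total."""
    return input_no_cache + input_with_cache + output + CACHE_READ_WEIGHT * cache_read

# Example with made-up counts: cache reads dominate by orders of magnitude,
# so even at 1/6 weight they drive the total (hence the ~25x jump mentioned above).
without_cache = weighted_tokens(10_000, 5_000, 0, 8_000)
with_cache = weighted_tokens(10_000, 5_000, 3_300_000, 8_000)
print(with_cache / without_cache)  # ~25x with these illustrative numbers
```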

I'm open to further discussion on this topic to help research and improve the calculation method for large context windows (mainly to compare the emissions of developers using LLMs for code generation against the environmental cost of actually running the applications they build).
