
Add TPM Controller to Respect LLM API Token Rate Limits #2841


Open · Ahmed22347 wants to merge 2 commits into main

Conversation

Ahmed22347

Overview

This PR addresses the issue of exceeding the tokens-per-minute (TPM) limits imposed by LLM API providers (e.g., Groq). Exceeding these limits can trigger throttling or errors, reducing application reliability.

Changes Introduced

  • Enhanced Agent Class

    • Introduced a new variable: max_tpm (maximum tokens per minute).
    • Integrated a TPM controller mechanism, modeled after the existing RPM (requests per minute) controller.
  • Created TPMController

    • Designed similarly to the RPMController.
    • Tracks and limits token usage on a per-minute basis.
    • Prevents sending more tokens than allowed in a rolling 60-second window (a minimal sketch follows this list).
  • Improved Error Handling

    • Captures exceptions caused by TPM limit violations.
    • Implements a 60-second wait before retrying requests that exceed the token rate limit.
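As a rough illustration of the mechanism described above (a shared counter reset by a background timer), the sketch below uses a fixed 60-second reset window, a simplification of a true rolling window. The class name and the record_tokens helper are hypothetical; check_or_wait and stop_tpm_counter only mirror method names that appear in the review below, and none of this is the PR's actual code:

    import threading
    import time

    class TPMControllerSketch:
        """Illustrative tokens-per-minute limiter; not the PR's implementation."""

        def __init__(self, max_tpm: int):
            self.max_tpm = max_tpm
            self._current_tokens = 0
            self._lock = threading.Lock()
            self._shutdown = threading.Event()
            self._timer = None
            self._reset_counter()  # start the per-minute reset cycle

        def _reset_counter(self) -> None:
            # Zero the counter, then schedule the next reset in 60 seconds.
            with self._lock:
                self._current_tokens = 0
            if not self._shutdown.is_set():
                self._timer = threading.Timer(60.0, self._reset_counter)
                self._timer.daemon = True
                self._timer.start()

        def record_tokens(self, tokens: int) -> None:
            # Hypothetical helper: account for tokens consumed by a call.
            with self._lock:
                self._current_tokens += tokens

        def check_or_wait(self, wait: int = 0) -> bool:
            # Return True once usage is under the cap; with wait=0, fail
            # fast instead of sleeping between checks.
            while True:
                with self._lock:
                    if self._current_tokens < self.max_tpm:
                        return True
                if wait <= 0:
                    return False
                time.sleep(wait)

        def stop_tpm_counter(self) -> None:
            # Cancel the background timer so the object can be collected.
            self._shutdown.set()
            if self._timer is not None:
                self._timer.cancel()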

Benefits

  • Ensures compliance with API usage policies.
  • Reduces the likelihood of service disruption.
  • Improves overall stability when interacting with high-throughput LLM APIs.

Notes

  • The TPM controller is opt-in via the max_tpm parameter and integrates seamlessly with the existing rate-limiting logic (see the usage sketch below).
  • Future improvements may include adding max_tpm at the crew level.
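For illustration, enabling the limiter might look like the following. This assumes the Agent constructor exposes max_tpm alongside the existing max_rpm; the field values are arbitrary:

    from crewai import Agent

    agent = Agent(
        role="Researcher",
        goal="Summarize recent LLM papers",
        backstory="An analyst working under a strict provider quota.",
        max_rpm=30,    # existing requests-per-minute limit
        max_tpm=6000,  # new opt-in tokens-per-minute limit from this PR
    )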

@joaomdmoura
Collaborator

Disclaimer: This review was made by a crew of AI Agents.

Code Review Comment: Added Tokens Per Minute Counter PR

Overview

This pull request introduces a Tokens Per Minute (TPM) rate-limiting feature alongside the existing Requests Per Minute (RPM) controller. The changes span six files, enhancing the system's capacity to manage token consumption effectively.

Detailed Insights

1. New TPM Controller Implementation (tpm_controller.py)

Strengths:

  • Clean and efficient implementation using threading for TPM tracking.
  • Proper integration with the TokenProcess functionality.
  • Thread-safe operations using locks enhance robustness.

Issues and Recommendations:

  • Docstrings: Add method documentation to improve maintainability.

    def check_or_wait(self, wait: int = 0):
        """Check that token usage is within limits, or wait if exceeded.

        Args:
            wait (int): Wait time in seconds if the limit is exceeded.

        Returns:
            bool: True if the operation can proceed, else False.
        """
  • Logging: Replace debug print statements with a logger for consistent output control.

    self.logger.debug(f"Tokens increased: {self._current_tokens}")
  • Destructor Cleanup: Ensure resources are released correctly.

    def __del__(self):
        """Cleanup timer resources."""
        self.stop_tpm_counter()

2. Modifications in Agent Class (agent.py)

Notable Issues:

  • Documentation Typo: Ensure descriptions for max_tpm are precise.

    description="Maximum number of tokens per minute that can be processed."
  • Validation: Implement validation to check max_tpm for positive values only.

    @validator('max_tpm')
    def validate_max_tpm(cls, v):
        if v is not None and v <= 0:
            raise ValueError("max_tpm must be positive.")
        return v  # pydantic validators must return the validated value

3. Base Agent Implementation (base_agent.py)

Key Issue:

  • Parameter Reference: Correct the reference from max_rpm to max_tpm to avoid confusion and ensure functionality.
    if self.max_tpm and not self._tpm_controller:
        self._tpm_controller = TPMController(max_tpm=self.max_tpm)

4. Agent Utilities (agent_utils.py)

Enhancements Suggested:

  • Error Handling: Add robust error handling to manage failures related to token limits.
    def handle_exceeded_token_limits(tpm_controller: TPMController):
        ...

General Recommendations

  1. Unit Testing: Comprehensive tests for TPM functionality are crucial.
  2. Metrics Collection: Implement monitoring for token usage.
  3. Configuration Options: Enable TPM limits to be set via environment variables (a sketch follows this list).
  4. Logging Warnings: Introduce warnings when approaching limits.
  5. Graceful Degradation: Ensure seamless user experience under limit conditions.
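For recommendation 3, a minimal sketch of the idea follows; the variable name CREWAI_MAX_TPM is hypothetical, since the PR does not define an environment-variable convention:

    import os

    # Hypothetical variable name; fall back to None (no limit) when unset.
    _env_tpm = os.environ.get("CREWAI_MAX_TPM")
    default_max_tpm = int(_env_tpm) if _env_tpm else None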

Security Considerations

  1. Add sanitization for rate limiting information in logs.
  2. Secure methods for counting tokens to mitigate manipulation risks.
  3. Protection against potential timer exploitation.

Suggested Testing Scenarios

  1. Test concurrent enforcement of token limits (a sketch follows this list).
  2. Verify resource cleanup under various scenarios.
  3. Validate integration performance across multiple LLM providers.
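As a starting point for scenario 1, a concurrency test could look like the sketch below, reusing the illustrative TPMControllerSketch from earlier (sizes and names are arbitrary):

    import threading

    def test_concurrent_token_accounting():
        # Ten workers recording 10 tokens each exactly fill a 100-token
        # window; none should block, and the counter must land on 100.
        controller = TPMControllerSketch(max_tpm=100)

        def worker():
            assert controller.check_or_wait(wait=1)
            controller.record_tokens(10)

        threads = [threading.Thread(target=worker) for _ in range(10)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        assert controller._current_tokens == 100
        controller.stop_tpm_counter()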

Documentation Enhancements

  1. Update the README with TPM configuration instructions.
  2. Document best practices for setting TPM limits.
  3. Add troubleshooting guides for common token-limit issues.

This PR is a substantial improvement to token management, but the issues and recommendations outlined above should be addressed before it is production-ready. The foundation is promising; with thorough testing and the suggested improvements, it will robustly enhance the existing functionality.

@@ -43,6 +43,7 @@ class BaseAgent(ABC, BaseModel):
config (Optional[Dict[str, Any]]): Configuration for the agent.
verbose (bool): Verbose mode for the Agent Execution.
max_rpm (Optional[int]): Maximum number of requests per minute for the agent execution.
max_tpm (Optional[int]): Maximum number of tokens to ne used per minute for the agent execution.
Contributor

typo here

max_tpm (Optional[int]): Maximum number of tokens to be used per minute for the agent execution

@@ -237,6 +251,12 @@ def set_private_attrs(self):
)
if not self._token_process:
self._token_process = TokenProcess()

if self.max_tpm and not self._tpm_controller:
Contributor

Any reason to try to re-initialize `_tpm_controller` here?

Comment on lines +213 to +215
if is_token_limit_exceeded(e):
handle_exceeded_token_limits(self.request_within_tpm_limit)
continue
Contributor

Should we handle this error after the litellm check?

"""Handle token limit error by waiting.

Args:
token_counter: Class with Sleep function
Contributor

The arg is actually tpm_controller, isn't it? Can you fix it?

Args:
token_counter: Class with Sleep function
"""
tpm_controller(1)
Contributor

I’m assuming 1 is meant to represent max_tpm. Should we instead use the value provided by the agent? Also, consider using named parameters to make the code clearer.

@lucasgomide (Contributor) left a comment

@Ahmed22347 great work here!

I dropped some comments. I'm also missing tests covering the max_tpm feature; we have some max_rpm examples you can use as a reference.
