Add LogitTrackingProcessor #1408

Open
wants to merge 30 commits into
base: main
@cpfiffer (Contributor) commented Feb 8, 2025

This PR adds LogitTrackingProcessor, a logit processor that wraps any other processor and stores the unstructured and structured logits at each step of the sampled sequence. I needed this code elsewhere and figured it is of broad enough interest to upstream into outlines.

I have included documentation on the processors, which doesn't exist currently. Tests are included as well.

LogitTrackingProcessor makes it easy to perform analysis on disagreements between structured and unstructured tokens. It will be of benefit to researchers, educators, and users who wish to debug their Outlines generators.

An example plot for a regex requiring four digits. This is the distribution of token probabilities on the first token.

[Image: distribution of token probabilities at the first token]

Using the tracker is simple:

from outlines import generate, models
from outlines.processors import add_tracking
from pydantic import BaseModel
import pandas as pd

model = models.transformers("HuggingFaceTB/SmolLM2-135M-Instruct")
tokenizer = model.tokenizer.tokenizer

class Person(BaseModel):
    name: str
    age: int

# Create generator with tracking
generator = generate.json(model, Person)

# Convenience wrapper to add tracking
generator = add_tracking(generator)

# Apply templating
prompt = tokenizer.apply_chat_template(
    [{"role": "system", "content": "You are a helpful assistant, responding in JSON."}, {"role": "user", "content": "Make me a person with a name, age, zip code, and state. Return the JSON only."}],
    tokenize=False,
    add_bos=True,
    add_generation_prompt=True,
)

# Generate the response
generator(prompt)

# Retrieve the top-k tokens
top_k = generator.logits_processor.get_top_tokens(k=5)

# Inspect the tracked tokens at each position
for position_dict in top_k:
    position_dict['position'] # 0,1,2, etc
    position_dict['text_so_far'] # Text at this point in the sequence

    for token in position_dict['tokens']:
        token['token'] # The token
        token['unstructured_prob'] # Probability of the token in the unstructured distribution
        token['structured_prob'] # Probability of the token in the structured distribution
        token['unstructured_logit'] # Logit of the token in the unstructured distribution
        token['structured_logit'] # Logit of the token in the structured distribution
        token['is_chosen'] # Whether the token was actually sampled

# Convert to dataframe
df = generator.logits_processor.to_dataframe(show="probs", min_value=0.01)
#    position token   natural  constrained  chosen
# 0         0   You  0.021324          0.0   False
# 1         0   The  0.021959          0.0   False
# 2         0  Sure  0.025492          0.0   False
# 3         0  JSON  0.031045          0.0   False
# 4         0    To  0.031047          0.0   False

# Get the token sequence up to position 5
generator.logits_processor.sequence(5)

NOTE: Currently, the tracking processor does not support batch processing. I recommend deferring this to a later PR.
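
As a rough illustration of the "disagreement" analysis mentioned above, the per-position dictionaries returned by get_top_tokens can be scanned for positions where the sampled token is not the one the unstructured distribution preferred. This is a minimal sketch: `find_disagreements` is a hypothetical helper that is not part of this PR, and the sample data is hand-written in the documented shape rather than real model output.

```python
from typing import Any, Dict, List


def find_disagreements(top_k: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return positions where the chosen token differs from the
    token with the highest unstructured probability."""
    disagreements = []
    for position_dict in top_k:
        tokens = position_dict["tokens"]
        if not tokens:
            continue
        # Token the model would have favoured without structure
        preferred = max(tokens, key=lambda t: t["unstructured_prob"])
        # Token that was actually sampled (may be absent from the top-k list)
        chosen = next((t for t in tokens if t["is_chosen"]), None)
        if chosen is not None and chosen["token"] != preferred["token"]:
            disagreements.append(
                {
                    "position": position_dict["position"],
                    "preferred": preferred["token"],
                    "chosen": chosen["token"],
                }
            )
    return disagreements


# Hand-written sample in the documented shape (assumption: not real output)
sample = [
    {
        "position": 0,
        "text_so_far": "",
        "tokens": [
            {"token": "Sure", "unstructured_prob": 0.6, "structured_prob": 0.0, "is_chosen": False},
            {"token": "{", "unstructured_prob": 0.1, "structured_prob": 1.0, "is_chosen": True},
        ],
    },
    {
        "position": 1,
        "text_so_far": "{",
        "tokens": [
            {"token": '"', "unstructured_prob": 0.9, "structured_prob": 1.0, "is_chosen": True},
        ],
    },
]

disagreements = find_disagreements(sample)
```

Positions where the chosen token falls outside the returned top-k list are skipped, so a larger k gives a more complete picture.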

Related:

This was removed as it is more easily and flexibly constructed on the user side.
- Adds smoke tests for rolling tracking windows
- Remove get_statistics method, as that method is better handled by the user
- Add tests for getting probabilities and logits
- Add tests to check that the correct top_k tokens are being returned
- Add tests to reconstruct the token sequence
- Add tests to track appropriate to_dataframe columns and values
- Tests to clear logit tracking
- Add tests for add_tracking
- Smoke tests to validate that the logit tracker does not modify logits
- Test for missing tokenizers
- Simplify shape checking code for returned logits/probabilities
- Update __init__.py to include add_tracking in __all__
- Expose add_tracking alongside LogitTrackingProcessor for easier import
…ility

- Completely rewrote tracking processor to be more flexible and intuitive
- Simplified tracking of unstructured and structured logits
- Added support for converting tracking data to pandas DataFrame
- Improved token tracking and sequence reconstruction
- Updated tests to validate new implementation
- Enforced single-batch processing for more predictable behavior
The message test was incorrect, as the error message for multi-batch processing has changed to

LogitTrackingProcessor only supports single-batch processing. Got batch size <n>
…examples

- Revise documentation for LogitTrackingProcessor in processors.md
- Add detailed code examples for tracking token probabilities
- Explain memory management and important usage notes
- Highlight key methods for analyzing generation results
- Clarify limitations and best practices for logit tracking
@cpfiffer cpfiffer marked this pull request as ready for review February 14, 2025 18:25
@cpfiffer (Contributor, Author) commented

There are a few mypy-related errors to resolve, but I think the meat of this is ready for review.

Current issues:

outlines/processors/tracking.py:22: error: Library stubs not installed for "pandas"  [import-untyped]
outlines/processors/tracking.py:22: note: Hint: "python3 -m pip install pandas-stubs"
outlines/processors/tracking.py:22: note: (or run "mypy --install-types" to install all missing stub packages)
outlines/processors/tracking.py:22: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
outlines/processors/tracking.py:105: error: Item "list[Any]" of "Any | Any | list[Any] | Any" has no attribute "shape"  [union-attr]
outlines/processors/tracking.py:108: error: Item "list[Any]" of "Any | Any | list[Any] | Any" has no attribute "shape"  [union-attr]
outlines/processors/tracking.py:275: error: Value of type "dict[str, list[Any] | Any] | None" is not indexable  [index]
outlines/processors/tracking.py:276: error: Value of type "dict[str, list[Any] | Any] | None" is not indexable  [index]
outlines/processors/tracking.py:508: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
outlines/processors/tracking.py:512: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
outlines/processors/tracking.py:515: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
outlines/processors/tracking.py:516: error: "LogitTrackingProcessor" has no attribute "tokenizer"  [attr-defined]
outlines/processors/tracking.py:516: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
outlines/processors/tracking.py:519: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
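
One possible way to address the `attr-defined` errors above is to describe the attribute with a `typing.Protocol` and narrow with `isinstance`. This is only a sketch: `SupportsLogitsProcessor`, `require_logits_processor`, and `DummyGenerator` are hypothetical names, and whether this fits the real `SequenceGenerator` hierarchy is an assumption.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SupportsLogitsProcessor(Protocol):
    """Structural type for any generator that exposes a logits processor."""

    logits_processor: Any


def require_logits_processor(generator: object) -> SupportsLogitsProcessor:
    """Return the generator, narrowed for the type checker, or raise."""
    if not isinstance(generator, SupportsLogitsProcessor):
        raise TypeError(
            f"{type(generator).__name__} does not expose a logits_processor attribute"
        )
    return generator


class DummyGenerator:
    """Stand-in for a generator; not the real SequenceGenerator."""

    def __init__(self) -> None:
        self.logits_processor = None
```

After the `isinstance` check, a type checker treats `generator.logits_processor` as a known attribute, so reads and assignments no longer trigger `attr-defined`.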

return "".join(tokenizer.decode(tokens_to_decode))


def add_tracking(generator: "SequenceGenerator") -> "SequenceGenerator":
Review comment (Member):

I would name this track_logits instead.

- Token decoding requires the wrapped processor to have a tokenizer attribute
- Memory usage grows linearly with sequence length
- The tracking processor only supports single-batch processing
- Tracking logits can incur significant overhead -- do not use it in production environments
Review comment (Member):

I would add an example of how we can use these logits processors directly with e.g. transformer pipes

@rlouf (Member) commented Feb 15, 2025

Looks good to me, it's a great addition. I just have a few minor comments

@rlouf (Member) left a review comment

Great addition! I left a few comments that need to be addressed before merging.

We default to returning a dict {'structured':..., 'unstructured':...} where the values of the dict are n_vocab by n_positions matrices.
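
To make the n_vocab by n_positions layout concrete, here is a small sketch that pulls out one position's column and turns its logits into probabilities. Nested lists stand in for the real arrays, and `softmax` and `probabilities_at` are hypothetical helpers, not functions from this PR.

```python
import math
from typing import Dict, List

Matrix = List[List[float]]  # n_vocab rows by n_positions columns


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax; masked tokens (-inf) get probability 0."""
    finite = [x for x in logits if x != -math.inf]
    if not finite:
        raise ValueError("every token is masked at this position")
    peak = max(finite)
    exps = [math.exp(x - peak) for x in logits]  # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]


def probabilities_at(tracked: Dict[str, Matrix], key: str, position: int) -> List[float]:
    """Read the column for one position and convert it to probabilities."""
    column = [row[position] for row in tracked[key]]
    return softmax(column)


# Toy data: 3 vocabulary entries, 2 positions (assumption: illustrative only)
tracked = {
    "unstructured": [[2.0, 0.0], [1.0, 0.0], [0.0, 0.0]],
    "structured": [[-math.inf, 0.0], [1.0, -math.inf], [0.0, -math.inf]],
}

first = probabilities_at(tracked, "structured", 0)
second = probabilities_at(tracked, "structured", 1)
```

Here the structured column at position 0 masks the first vocabulary entry, so its probability is exactly zero while the remaining mass is shared by the other two entries.
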
pandas is not imported at the module level, so the return type is not available statically.
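
One common pattern for keeping an optional dependency out of module import time while still exposing the return type to static checkers is a `TYPE_CHECKING` import with a string annotation. The sketch below assumes that pattern is acceptable here; `rows_to_dataframe` is a hypothetical function, not the PR's `to_dataframe`. Note that mypy would still need pandas-stubs (or an ignore) to resolve the import.

```python
from typing import TYPE_CHECKING, Any, Dict, List

if TYPE_CHECKING:
    # Evaluated by type checkers only, never at runtime
    import pandas as pd


def rows_to_dataframe(rows: List[Dict[str, Any]]) -> "pd.DataFrame":
    """Build a DataFrame lazily so pandas stays an optional dependency."""
    try:
        import pandas as pd  # imported only when the function is called
    except ImportError as exc:
        raise ImportError(
            "pandas is required for DataFrame output; install it with `pip install pandas`"
        ) from exc
    return pd.DataFrame(rows)
```

The string annotation is never evaluated at runtime, so importing the module works without pandas installed.
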
@cpfiffer (Contributor, Author) commented

I've addressed the comments, much appreciated!

I have some remaining code style (mypy) issues that are beyond my knowledge at the moment. If anyone has tips on these, I'd love them. Otherwise I'll have to come back to this in a week or two.

outlines/processors/tracking.py:242: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]"  [call-overload]
outlines/processors/tracking.py:242: note: Possible overload variants:
outlines/processors/tracking.py:242: note:     def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:242: note:     def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:243: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]"  [call-overload]
outlines/processors/tracking.py:243: note: Possible overload variants:
outlines/processors/tracking.py:243: note:     def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:243: note:     def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:246: error: Value of type "dict[str, Any | Any | list[Any] | Any] | None" is not indexable  [index]
outlines/processors/tracking.py:246: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]"  [call-overload]
outlines/processors/tracking.py:246: note: Possible overload variants:
outlines/processors/tracking.py:246: note:     def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:246: note:     def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:247: error: Value of type "dict[str, Any | Any | list[Any] | Any] | None" is not indexable  [index]
outlines/processors/tracking.py:247: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]"  [call-overload]
outlines/processors/tracking.py:247: note: Possible overload variants:
outlines/processors/tracking.py:247: note:     def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:247: note:     def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:348: error: Library stubs not installed for "pandas"  [import-untyped]
outlines/processors/tracking.py:348: note: Hint: "python3 -m pip install pandas-stubs"
outlines/processors/tracking.py:348: note: (or run "mypy --install-types" to install all missing stub packages)
outlines/processors/tracking.py:348: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
outlines/processors/tracking.py:368: error: Item "list[Any]" of "Any | Any | list[Any] | Any" has no attribute "shape"  [union-attr]
outlines/processors/tracking.py:369: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]"  [call-overload]
outlines/processors/tracking.py:369: note: Possible overload variants:
outlines/processors/tracking.py:369: note:     def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:369: note:     def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:370: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]"  [call-overload]
outlines/processors/tracking.py:370: note: Possible overload variants:
outlines/processors/tracking.py:370: note:     def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:370: note:     def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:428: error: Item "None" of "OutlinesLogitsProcessor | None" has no attribute "tokenizer"  [union-attr]
outlines/processors/tracking.py:472: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
outlines/processors/tracking.py:476: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
outlines/processors/tracking.py:479: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
outlines/processors/tracking.py:480: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
outlines/processors/tracking.py:483: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
tests/test_types.py:37: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/test_types.py:93: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/test_prompts.py:222: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/test_prompts.py:223: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/test_prompts.py:233: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/test_prompts.py:234: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/test_prompts.py:243: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/test_prompts.py:244: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/test_function.py:16: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/processors/test_tracking.py:4: error: Library stubs not installed for "pandas"  [import-untyped]
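
Two of the error families above, `[index]` on a value that may be `None` and `[call-overload]` from indexing a possible `list` with `[:, i]`, can usually be resolved by narrowing the optional first and avoiding tuple indexing on anything typed as a list. A minimal sketch, assuming the store is a dict of nested lists; `column_at` is a hypothetical helper and the real code may hold arrays or tensors instead.

```python
from typing import Dict, List, Optional

Matrix = List[List[float]]  # n_vocab rows by n_positions columns


def column_at(
    store: Optional[Dict[str, Matrix]], key: str, position: int
) -> List[float]:
    """Return one position's column from an optional store of matrices."""
    # Narrow Optional[...] before indexing; this removes the [index] error
    if store is None:
        raise ValueError("No logits have been tracked yet")
    matrix = store[key]
    # A list has no [:, position] overload, so build the column explicitly
    return [row[position] for row in matrix]


example_store: Dict[str, Matrix] = {"structured": [[1.0, 2.0], [3.0, 4.0]]}
second_column = column_at(example_store, "structured", 1)
```

If the stored values are always arrays in practice, tightening the annotation so that `list` is no longer part of the union would remove the `[union-attr]` errors on `.shape` as well.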

@rlouf (Member) commented Feb 22, 2025

I can take a look.

@rlouf (Member) commented Feb 22, 2025

I fixed the formatting issues. I'll do a little refactoring and then we'll be good to merge.

@cpfiffer (Contributor, Author) commented

As an idle thought here -- is there an interface available to us where we could wrap the resulting generated object with the logits, rather than store it in the logit processor as I have here?

Currently we only return strings from generator calls, but is there an obvious + simple interface for providing a Result(value=..., logits=...) object? My sense is that this isn't likely to be simple, but the devex would probably be better.

Might be a "kick it down the road" thing, but curious if @rlouf @torymur @RobinPicard had an idea of whether this would be simple to do.

If it is simple, I could try refactoring this code to store logits in a response value.

@rlouf (Member) commented Mar 5, 2025

You'd need to create an object that stores the logits but otherwise behaves like a string to keep an intuitive UX. You can try something like (haven't tested it):

class State(str):
    def __new__(cls, value, logits=None):
        instance = super().__new__(cls, value)
        instance.logits = logits
        return instance
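
A quick usage sketch of the idea above, restating the class so the snippet runs on its own. The sample JSON and logits values are made up for illustration.

```python
import json


class State(str):
    """A string that also carries the logits that produced it."""

    def __new__(cls, value, logits=None):
        instance = super().__new__(cls, value)
        instance.logits = logits
        return instance


result = State('{"name": "Ada", "age": 36}', logits=[[0.1, 0.9]])

# Behaves like an ordinary string for comparison and parsing
parsed = json.loads(result)

# String methods return a plain str, so the logits do not carry over
upper = result.upper()
```

One design caveat: because `str` methods return plain `str` objects, any transformation of the result drops the `logits` attribute, so callers would need to read it before post-processing.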
