Add LogitTrackingProcessor #1408

cpfiffer · 2025-02-08T02:13:40Z

This PR adds LogitTrackingProcessor, a logits processor that wraps any other processor and stores both the unstructured and the structured logits for every position of the sampled sequence. I needed this code elsewhere and figured it would be useful enough to upstream into Outlines.

I have also included documentation for the processors, which did not exist previously. Tests are included as well.

LogitTrackingProcessor makes it easy to perform analysis on disagreements between structured and unstructured tokens. It will be of benefit to researchers, educators, and users who wish to debug their Outlines generators.
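
To illustrate the wrapping idea, here is a minimal sketch in plain Python. It is not the implementation in this PR: `TrackingWrapper` and `mask_non_digits` are hypothetical names, logits are plain lists of floats rather than tensors, and the wrapped processor is assumed to be any callable that takes `(input_ids, logits)` and returns new logits.

```python
import math
from typing import Callable, List

Logits = List[float]
Processor = Callable[[List[int], Logits], Logits]


class TrackingWrapper:
    """Hypothetical wrapper: records logits before and after a processor."""

    def __init__(self, processor: Processor) -> None:
        self.processor = processor
        self.unstructured: List[Logits] = []  # raw model logits, one entry per position
        self.structured: List[Logits] = []    # logits after the wrapped processor

    def __call__(self, input_ids: List[int], logits: Logits) -> Logits:
        # Store copies so later mutation by the caller cannot alter the record.
        self.unstructured.append(list(logits))
        processed = self.processor(input_ids, logits)
        self.structured.append(list(processed))
        # Return the processor's output untouched: tracking must not change it.
        return processed


def mask_non_digits(input_ids: List[int], logits: Logits) -> Logits:
    """Toy structured processor (assumption): only token ids 0-9 are allowed."""
    return [x if i < 10 else -math.inf for i, x in enumerate(logits)]


tracker = TrackingWrapper(mask_non_digits)
raw = [0.5] * 12
out = tracker([], raw)
```

After the call, `tracker.unstructured[0]` holds the raw values and `tracker.structured[0]` holds the masked ones, which is the pair of views the analysis described above compares.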

[Figure: example plot for a regex requiring four digits, showing the distribution of token probabilities for the first token.]

Using the tracker is simple:

from outlines import generate, models
from outlines.processors import add_tracking
from pydantic import BaseModel
import pandas as pd

model = models.transformers("HuggingFaceTB/SmolLM2-135M-Instruct")
tokenizer = model.tokenizer.tokenizer

class Person(BaseModel):
    name: str
    age: int

# Create generator with tracking
generator = generate.json(model, Person)

# Convenience wrapper to add tracking
generator = add_tracking(generator)

# Apply templating
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "You are a helpful assistant, responding in JSON."},
        {"role": "user", "content": "Make me a person with a name, age, zip code, and state. Return the JSON only."},
    ],
    tokenize=False,
    add_bos=True,
    add_generation_prompt=True,
)

# Generate the response
generator(prompt)

# Retrieve the top-k tokens
top_k = generator.logits_processor.get_top_tokens(k=5)

# Inspect the top-k tokens recorded at each position
for position_dict in top_k:
    position_dict['position'] # 0,1,2, etc
    position_dict['text_so_far'] # Text at this point in the sequence

    for token in position_dict['tokens']:
        token['token'] # The token
        token['unstructured_prob'] # Probability of the token in the unstructured distribution
        token['structured_prob'] # Probability of the token in the structured distribution
        token['unstructured_logit'] # Logit of the token in the unstructured distribution
        token['structured_logit'] # Logit of the token in the structured distribution
        token['is_chosen'] # Whether the token was actually sampled

# Convert to dataframe
df = generator.logits_processor.to_dataframe(show="probs", min_value=0.01)
#    position token   natural  constrained  chosen
# 0         0   You  0.021324          0.0   False
# 1         0   The  0.021959          0.0   False
# 2         0  Sure  0.025492          0.0   False
# 3         0  JSON  0.031045          0.0   False
# 4         0    To  0.031047          0.0   False

# Get the token sequence up to position 5
generator.logits_processor.sequence(5)
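
The per-token fields above can be derived from the two logit vectors with a softmax. The sketch below assumes plain Python lists and a toy three-token vocabulary. `softmax` and `top_tokens` are illustrative helpers, not functions from Outlines, and ranking by unstructured probability is an assumption.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax; logits of -inf get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_tokens(
    vocab: List[str],
    unstructured: List[float],
    structured: List[float],
    chosen_id: int,
    k: int,
) -> List[Dict]:
    """Build one record per token for the k most likely unstructured tokens."""
    u_probs = softmax(unstructured)
    s_probs = softmax(structured)
    ranked = sorted(range(len(vocab)), key=lambda i: u_probs[i], reverse=True)[:k]
    return [
        {
            "token": vocab[i],
            "unstructured_prob": u_probs[i],
            "structured_prob": s_probs[i],
            "unstructured_logit": unstructured[i],
            "structured_logit": structured[i],
            "is_chosen": i == chosen_id,
        }
        for i in ranked
    ]


vocab = ["Sure", "{", "The"]
unstructured = [2.0, 1.0, 0.0]
structured = [-math.inf, 1.0, -math.inf]  # only "{" is allowed by the constraint
rows = top_tokens(vocab, unstructured, structured, chosen_id=1, k=2)
```

In this toy case the unconstrained model prefers "Sure", while the constraint leaves "{" as the only token with non-zero structured probability.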

NOTE: Currently, the tracking processor does not support batch processing. I recommend deferring this to a later PR.
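
A guard of the following shape is one way to enforce the single-batch restriction. This is a sketch, not the code from the PR: the function name, the nested-list input, and the choice of `ValueError` are assumptions. Only the wording of the message comes from a commit note in this PR.

```python
from typing import Sequence


def check_single_batch(logits: Sequence[Sequence[float]]) -> None:
    """Raise if `logits` (shape: batch x vocab) holds more than one sequence."""
    batch_size = len(logits)
    if batch_size != 1:
        raise ValueError(
            "LogitTrackingProcessor only supports single-batch processing. "
            f"Got batch size {batch_size}"
        )


check_single_batch([[0.1, 0.2]])  # a single sequence passes silently
```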

Add probability distribution to choices #479 for the primary discussion of having access to the underlying logits
Allow Debug Logging of Logits #614
Suite of outlines.processors for Sampling Techniques and Debug Logging #1055 as this can be used as a debugging tool

- Adds smoke tests for rolling tracking windows
- Remove get_statistics method, as that method is better handled by the user
- Add tests for getting probabilities and logits
- Add tests to check that the correct top_k tokens are being returned
- Add tests to reconstruct the token sequence
- Add tests to track appropriate to_dataframe columns and values
- Tests to clear logit tracking
- Add tests for add_tracking
- Smoke tests to validate that the logit tracker does not modify logits
- Test for missing tokenizers
- Simplify shape checking code for returned logits/probabilities

…ility
- Completely rewrote tracking processor to be more flexible and intuitive
- Simplified tracking of unstructured and structured logits
- Added support for converting tracking data to pandas DataFrame
- Improved token tracking and sequence reconstruction
- Updated tests to validate new implementation
- Enforced single-batch processing for more predictable behavior

…examples
- Revise documentation for LogitTrackingProcessor in processors.md
- Add detailed code examples for tracking token probabilities
- Explain memory management and important usage notes
- Highlight key methods for analyzing generation results
- Clarify limitations and best practices for logit tracking

cpfiffer · 2025-02-14T18:26:41Z

There are a few mypy errors left to resolve, but I think the meat of this is ready for review.

Current issues:

outlines/processors/tracking.py:22: error: Library stubs not installed for "pandas"  [import-untyped]
outlines/processors/tracking.py:22: note: Hint: "python3 -m pip install pandas-stubs"
outlines/processors/tracking.py:22: note: (or run "mypy --install-types" to install all missing stub packages)
outlines/processors/tracking.py:22: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
outlines/processors/tracking.py:105: error: Item "list[Any]" of "Any | Any | list[Any] | Any" has no attribute "shape"  [union-attr]
outlines/processors/tracking.py:108: error: Item "list[Any]" of "Any | Any | list[Any] | Any" has no attribute "shape"  [union-attr]
outlines/processors/tracking.py:275: error: Value of type "dict[str, list[Any] | Any] | None" is not indexable  [index]
outlines/processors/tracking.py:276: error: Value of type "dict[str, list[Any] | Any] | None" is not indexable  [index]
outlines/processors/tracking.py:508: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
outlines/processors/tracking.py:512: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
outlines/processors/tracking.py:515: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
outlines/processors/tracking.py:516: error: "LogitTrackingProcessor" has no attribute "tokenizer"  [attr-defined]
outlines/processors/tracking.py:516: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
outlines/processors/tracking.py:519: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
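
Errors such as `[index]` on a value whose type includes `None` are usually resolved by narrowing the type before use. The sketch below shows that general pattern on a hypothetical `Tracker` class. It is not taken from `tracking.py`, and the attribute and method names are invented for illustration.

```python
from typing import Dict, List, Optional


class Tracker:
    """Hypothetical stand-in for a class with an optional logit store."""

    def __init__(self) -> None:
        self.store: Optional[Dict[str, List[float]]] = None

    def first_unstructured(self) -> float:
        # Indexing `self.store` directly would trigger mypy's [index] error,
        # because the declared type includes None. Narrow it first.
        if self.store is None:
            raise RuntimeError("no logits have been tracked yet")
        return self.store["unstructured"][0]


tracker = Tracker()
tracker.store = {"unstructured": [0.25, 0.75]}
value = tracker.first_unstructured()
```

For the missing pandas stubs, the hint in the log above already names one fix: installing `pandas-stubs`.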

rlouf · 2025-02-15T10:02:28Z

outlines/processors/tracking.py

+        return "".join(tokenizer.decode(tokens_to_decode))
+
+
+def add_tracking(generator: "SequenceGenerator") -> "SequenceGenerator":


I would name this track_logits instead.

rlouf · 2025-02-15T10:04:52Z

docs/reference/processors.md

+- Token decoding requires the wrapped processor to have a tokenizer attribute
+- Memory usage grows linearly with sequence length
+- The tracking processor only supports single-batch processing
+- Tracking logits can incur significant overhead -- do not use it in production environments


I would add an example of how we can use these logits processors directly with e.g. transformer pipes

rlouf · 2025-02-15T10:11:11Z

Looks good to me, it's a great addition. I just have a few minor comments.

rlouf

Great addition! I left a few comments that need to be addressed before merging.

cpfiffer · 2025-02-22T01:03:31Z

I've addressed the comments, appreciated!

I have some remaining code style issues that are beyond my knowledge at the moment. If anyone has tips on these, I'd love to hear them. Otherwise I'll have to come back to this in a week or two.

outlines/processors/tracking.py:242: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]"  [call-overload]
outlines/processors/tracking.py:242: note: Possible overload variants:
outlines/processors/tracking.py:242: note:     def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:242: note:     def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:243: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]"  [call-overload]
outlines/processors/tracking.py:243: note: Possible overload variants:
outlines/processors/tracking.py:243: note:     def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:243: note:     def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:246: error: Value of type "dict[str, Any | Any | list[Any] | Any] | None" is not indexable  [index]
outlines/processors/tracking.py:246: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]"  [call-overload]
outlines/processors/tracking.py:246: note: Possible overload variants:
outlines/processors/tracking.py:246: note:     def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:246: note:     def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:247: error: Value of type "dict[str, Any | Any | list[Any] | Any] | None" is not indexable  [index]
outlines/processors/tracking.py:247: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]"  [call-overload]
outlines/processors/tracking.py:247: note: Possible overload variants:
outlines/processors/tracking.py:247: note:     def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:247: note:     def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:348: error: Library stubs not installed for "pandas"  [import-untyped]
outlines/processors/tracking.py:348: note: Hint: "python3 -m pip install pandas-stubs"
outlines/processors/tracking.py:348: note: (or run "mypy --install-types" to install all missing stub packages)
outlines/processors/tracking.py:348: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
outlines/processors/tracking.py:368: error: Item "list[Any]" of "Any | Any | list[Any] | Any" has no attribute "shape"  [union-attr]
outlines/processors/tracking.py:369: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]"  [call-overload]
outlines/processors/tracking.py:369: note: Possible overload variants:
outlines/processors/tracking.py:369: note:     def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:369: note:     def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:370: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]"  [call-overload]
outlines/processors/tracking.py:370: note: Possible overload variants:
outlines/processors/tracking.py:370: note:     def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:370: note:     def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:428: error: Item "None" of "OutlinesLogitsProcessor | None" has no attribute "tokenizer"  [union-attr]
outlines/processors/tracking.py:472: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
outlines/processors/tracking.py:476: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
outlines/processors/tracking.py:479: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
outlines/processors/tracking.py:480: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
outlines/processors/tracking.py:483: error: "SequenceGenerator" has no attribute "logits_processor"  [attr-defined]
tests/test_types.py:37: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/test_types.py:93: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/test_prompts.py:222: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/test_prompts.py:223: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/test_prompts.py:233: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/test_prompts.py:234: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/test_prompts.py:243: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/test_prompts.py:244: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/test_function.py:16: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
tests/processors/test_tracking.py:4: error: Library stubs not installed for "pandas"  [import-untyped]

rlouf · 2025-02-22T08:30:45Z

I can take a look.

rlouf · 2025-02-22T17:45:03Z

I fixed the formatting issues. I'll do a little refactoring and then we'll be good to merge.

cpfiffer · 2025-02-26T20:13:46Z

As an idle thought here -- is there an interface available to us where we could wrap the resulting generated object with the logits, rather than store it in the logit processor as I have here?

Currently we only return strings from generator calls, but is there an obvious + simple interface for providing a Result(value=..., logits=...) object? My sense is that this isn't likely to be simple, but the devex would probably be better.

Might be a "kick it down the road" thing, but curious if @rlouf @torymur @RobinPicard had an idea of whether this would be simple to do.

If it is simple, I could try refactoring this code to store logits in a response value.
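
For reference, the `Result(value=..., logits=...)` idea could look roughly like the following. This is a hypothetical shape, not an existing Outlines interface. The field types and the frozen dataclass are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Result:
    """Hypothetical container pairing generated text with tracked logits."""

    value: str
    # Assumed layout: one matrix of logits per distribution name.
    logits: Dict[str, List[List[float]]] = field(default_factory=dict)

    def __str__(self) -> str:
        return self.value


res = Result(value="1234", logits={"structured": [[0.0]], "unstructured": [[1.5]]})
text = str(res)
```

With this shape, callers that expect a plain string would have to read `res.value` or call `str(res)`, since `res` itself is not a `str`.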

rlouf · 2025-03-05T06:38:29Z

You'd need to create an object that stores the logits but otherwise behaves like a string to keep an intuitive UX. You can try something like (haven't tested it):

class State(str):
    def __new__(cls, value, logits=None):
        instance = super().__new__(cls, value)
        instance.logits = logits
        return instance
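
A quick check of how such a subclass behaves, under the same caveat that it is a sketch rather than a proposed interface. The class is repeated so the snippet runs on its own, and the sample text and logit values are made up. Note that built-in `str` methods return plain `str` objects, so the `logits` attribute is not carried over to derived strings.

```python
class State(str):
    def __new__(cls, value, logits=None):
        instance = super().__new__(cls, value)
        instance.logits = logits
        return instance


result = State('{"name": "Ada"}', logits=[0.1, 0.9])

is_str = isinstance(result, str)         # usable anywhere a str is expected
same_text = result == '{"name": "Ada"}'  # compares equal to the plain text
derived = result.upper()                 # str methods return a plain str, not a State
keeps_logits = hasattr(derived, "logits")
```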

cpfiffer · 2025-05-07T16:43:08Z

I am going to close this for now -- I'll revisit this later for v1.0 and generally tidy up the interface.

cpfiffer added 20 commits February 7, 2025 16:28

Add LogitTrackingProcessor

a9c90f4

Adds documentation for processors

33d1966

Add processors.md to documentation table of contents

ed41c6d

Correct trailing whitespace

b2bce09

Added LogitTrackingProcessor tests

72ceb32

Handle circular imports in tracking.py

6df1d91

Add check for tensor shapes in LogitTrackingProcessor.process_logits

dd3489a

Remove LogitTrackingProcessor.get_sequence

ec1086c

This was removed as it is more easily and flexibly constructed on the user side.

Adds example code to plot and demonstrate simple usage of logit tracking

6b40204

Export add_tracking function in processors module

4187ed2

- Update __init__.py to include add_tracking in __all__ - Expose add_tracking alongside LogitTrackingProcessor for easier import

Moves plotting utilities into logit_tracking_demo

be46363

Remove extra import from logit_tracking_demo

30cc1fa

Correct batch size smoke check

eaa6f5b

The message test was incorrect, as the error message for multi-batch processing has changed to LogitTrackingProcessor only supports single-batch processing. Got batch size <n>

Make logit_tracking_demo adhere to code style

80584d8

Make tracking and test_tracking adhere to code style

d8837ca

Change incorrect SequenceGenerator import

fd68ba6

Remove extraneous f-string from test_tracking

de12af1

cpfiffer marked this pull request as ready for review February 14, 2025 18:25

rlouf reviewed Feb 15, 2025

rlouf approved these changes Feb 15, 2025

rlouf mentioned this pull request Feb 17, 2025

Probabilities for choices #1230

Open

cpfiffer mentioned this pull request Feb 19, 2025

Token-triggered processors #1407

Open

rlouf requested changes Feb 21, 2025

cpfiffer added 2 commits February 21, 2025 14:16

Remove documentation of base processor

ce89212

Import pandas directly in to_dataframe LogitTrackingProcessor method

28e83d1

cpfiffer added 6 commits February 21, 2025 15:40

Add transformers pipeline example for LogitTrackingProcessor

d64177c

Make logit tracking require a base processor as an argument

d93c4be

Remove as_matrix for LogitTrackingProcessor

7dd3824

We default to returning a dict {'structured':..., 'unstructured':...} where the values of the dict are n_vocab by n_positions matrices.

Remove as_matrix for get probs/get logits, change add_tracking to tra…

78a7d63

…ck_logits

trim trailing whitespace

25d3bdc

Removing return typing to to_dataframe

19e6486

pandas is not imported at the module level, so the return type is not available statically.

rlouf force-pushed the main branch from 953d04d to 641deb7 Compare February 22, 2025 17:27

rlouf force-pushed the main branch from 641deb7 to 8a728b1 Compare February 22, 2025 18:59

Add LogitTrackingProcessor

6d82c91

rlouf force-pushed the main branch from 8a728b1 to 6d82c91 Compare February 22, 2025 19:19

rlouf self-assigned this Feb 23, 2025

Merge branch 'main' of https://github.com/cpfiffer/outlines

b19cdd0

rlouf mentioned this pull request Mar 8, 2025

Make it possible to pass a logits processor to Generator #1487

Closed

cpfiffer mentioned this pull request Apr 8, 2025

Add backward compatibility with v0 #1518

Merged

cpfiffer closed this May 7, 2025
