Add LogitTrackingProcessor #1408
base: main
This was removed as it is more easily and flexibly constructed on the user side.
- Add smoke tests for rolling tracking windows
- Remove get_statistics method, as that method is better handled by the user
- Add tests for getting probabilities and logits
- Add tests to check that the correct top_k tokens are being returned
- Add tests to reconstruct the token sequence
- Add tests to track appropriate to_dataframe columns and values
- Add tests to clear logit tracking
- Add tests for add_tracking
- Add smoke tests to validate that the logit tracker does not modify logits
- Add test for missing tokenizers
- Simplify shape checking code for returned logits/probabilities
- Update __init__.py to include add_tracking in __all__
- Expose add_tracking alongside LogitTrackingProcessor for easier import
…ility
- Completely rewrote tracking processor to be more flexible and intuitive
- Simplified tracking of unstructured and structured logits
- Added support for converting tracking data to pandas DataFrame
- Improved token tracking and sequence reconstruction
- Updated tests to validate new implementation
- Enforced single-batch processing for more predictable behavior
The message test was incorrect, as the error message for multi-batch processing has changed to "LogitTrackingProcessor only supports single-batch processing. Got batch size <n>".
…examples
- Revise documentation for LogitTrackingProcessor in processors.md
- Add detailed code examples for tracking token probabilities
- Explain memory management and important usage notes
- Highlight key methods for analyzing generation results
- Clarify limitations and best practices for logit tracking
There are a few current issues:
outlines/processors/tracking.py:22: error: Library stubs not installed for "pandas" [import-untyped]
outlines/processors/tracking.py:22: note: Hint: "python3 -m pip install pandas-stubs"
outlines/processors/tracking.py:22: note: (or run "mypy --install-types" to install all missing stub packages)
outlines/processors/tracking.py:22: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
outlines/processors/tracking.py:105: error: Item "list[Any]" of "Any | Any | list[Any] | Any" has no attribute "shape" [union-attr]
outlines/processors/tracking.py:108: error: Item "list[Any]" of "Any | Any | list[Any] | Any" has no attribute "shape" [union-attr]
outlines/processors/tracking.py:275: error: Value of type "dict[str, list[Any] | Any] | None" is not indexable [index]
outlines/processors/tracking.py:276: error: Value of type "dict[str, list[Any] | Any] | None" is not indexable [index]
outlines/processors/tracking.py:508: error: "SequenceGenerator" has no attribute "logits_processor" [attr-defined]
outlines/processors/tracking.py:512: error: "SequenceGenerator" has no attribute "logits_processor" [attr-defined]
outlines/processors/tracking.py:515: error: "SequenceGenerator" has no attribute "logits_processor" [attr-defined]
outlines/processors/tracking.py:516: error: "LogitTrackingProcessor" has no attribute "tokenizer" [attr-defined]
outlines/processors/tracking.py:516: error: "SequenceGenerator" has no attribute "logits_processor" [attr-defined]
outlines/processors/tracking.py:519: error: "SequenceGenerator" has no attribute "logits_processor" [attr-defined]
outlines/processors/tracking.py
    return "".join(tokenizer.decode(tokens_to_decode))


def add_tracking(generator: "SequenceGenerator") -> "SequenceGenerator":
I would name this track_logits instead.
- Token decoding requires the wrapped processor to have a tokenizer attribute
- Memory usage grows linearly with sequence length
- The tracking processor only supports single-batch processing
- Tracking logits can incur significant overhead -- do not use it in production environments
I would add an example of how we can use these logits processors directly with e.g. transformer pipes
Looks good to me, it's a great addition. I just have a few minor comments.
Great addition! I left a few comments that need to be addressed before merging.
We default to returning a dict {'structured':..., 'unstructured':...} where the values of the dict are n_vocab by n_positions matrices.
pandas is not imported at the module level, so the return type is not available statically.
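One common way to keep a static return type while leaving pandas out of the module-level imports is a TYPE_CHECKING-guarded import combined with a string annotation. The following is only a minimal sketch of that pattern, not the PR's code; rows_to_dataframe and its argument are hypothetical names. It does not address the separate "Library stubs not installed" error, for which mypy itself suggests installing pandas-stubs.

```python
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Seen only by type checkers; nothing is imported here at runtime.
    import pandas as pd


def rows_to_dataframe(rows: "list[dict[str, Any]]") -> "pd.DataFrame":
    """Hypothetical helper: build a DataFrame, importing pandas lazily."""
    try:
        import pandas as pd  # deferred import keeps pandas optional
    except ImportError as exc:
        raise ImportError("pandas is required for rows_to_dataframe") from exc
    return pd.DataFrame(rows)
```

Because the annotations are strings, importing the module never touches pandas; the import cost is paid only when the helper is actually called.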
I've addressed the comments, appreciated! I have some remaining code style issues that are beyond my knowledge at the moment; if people have tips on these, I'd love them. Otherwise I'll have to come back to this in a week or two.
outlines/processors/tracking.py:242: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]" [call-overload]
outlines/processors/tracking.py:242: note: Possible overload variants:
outlines/processors/tracking.py:242: note: def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:242: note: def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:243: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]" [call-overload]
outlines/processors/tracking.py:243: note: Possible overload variants:
outlines/processors/tracking.py:243: note: def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:243: note: def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:246: error: Value of type "dict[str, Any | Any | list[Any] | Any] | None" is not indexable [index]
outlines/processors/tracking.py:246: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]" [call-overload]
outlines/processors/tracking.py:246: note: Possible overload variants:
outlines/processors/tracking.py:246: note: def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:246: note: def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:247: error: Value of type "dict[str, Any | Any | list[Any] | Any] | None" is not indexable [index]
outlines/processors/tracking.py:247: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]" [call-overload]
outlines/processors/tracking.py:247: note: Possible overload variants:
outlines/processors/tracking.py:247: note: def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:247: note: def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:348: error: Library stubs not installed for "pandas" [import-untyped]
outlines/processors/tracking.py:348: note: Hint: "python3 -m pip install pandas-stubs"
outlines/processors/tracking.py:348: note: (or run "mypy --install-types" to install all missing stub packages)
outlines/processors/tracking.py:348: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
outlines/processors/tracking.py:368: error: Item "list[Any]" of "Any | Any | list[Any] | Any" has no attribute "shape" [union-attr]
outlines/processors/tracking.py:369: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]" [call-overload]
outlines/processors/tracking.py:369: note: Possible overload variants:
outlines/processors/tracking.py:369: note: def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:369: note: def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:370: error: No overload variant of "__getitem__" of "list" matches argument type "tuple[slice[None, None, None], int]" [call-overload]
outlines/processors/tracking.py:370: note: Possible overload variants:
outlines/processors/tracking.py:370: note: def __getitem__(self, SupportsIndex, /) -> Any
outlines/processors/tracking.py:370: note: def __getitem__(self, slice[Any, Any, Any], /) -> list[Any]
outlines/processors/tracking.py:428: error: Item "None" of "OutlinesLogitsProcessor | None" has no attribute "tokenizer" [union-attr]
outlines/processors/tracking.py:472: error: "SequenceGenerator" has no attribute "logits_processor" [attr-defined]
outlines/processors/tracking.py:476: error: "SequenceGenerator" has no attribute "logits_processor" [attr-defined]
outlines/processors/tracking.py:479: error: "SequenceGenerator" has no attribute "logits_processor" [attr-defined]
outlines/processors/tracking.py:480: error: "SequenceGenerator" has no attribute "logits_processor" [attr-defined]
outlines/processors/tracking.py:483: error: "SequenceGenerator" has no attribute "logits_processor" [attr-defined]
tests/test_types.py:37: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs [annotation-unchecked]
tests/test_types.py:93: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs [annotation-unchecked]
tests/test_prompts.py:222: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs [annotation-unchecked]
tests/test_prompts.py:223: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs [annotation-unchecked]
tests/test_prompts.py:233: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs [annotation-unchecked]
tests/test_prompts.py:234: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs [annotation-unchecked]
tests/test_prompts.py:243: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs [annotation-unchecked]
tests/test_prompts.py:244: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs [annotation-unchecked]
tests/test_function.py:16: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs [annotation-unchecked]
tests/processors/test_tracking.py:4: error: Library stubs not installed for "pandas" [import-untyped]
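The [index] and [union-attr] errors on values typed "... | None" are usually resolved by narrowing the Optional before it is used. A minimal sketch of that pattern, assuming an attribute that starts as None; the Tracker class below is a hypothetical stand-in and not the code in tracking.py:

```python
from typing import Dict, List, Optional


class Tracker:
    """Hypothetical stand-in used only to illustrate Optional narrowing."""

    def __init__(self) -> None:
        # None until the first value has been recorded.
        self.logits: Optional[Dict[str, List[float]]] = None

    def record(self, kind: str, value: float) -> None:
        if self.logits is None:
            self.logits = {"unstructured": [], "structured": []}
        self.logits[kind].append(value)

    def last(self, kind: str) -> float:
        # Rule out None explicitly so the subscript below type-checks.
        if self.logits is None:
            raise ValueError("no logits have been tracked yet")
        return self.logits[kind][-1]
```

The same explicit check, or initializing the attribute to an empty container instead of None, removes the need for the Optional in the first place.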
I can take a look.
I fixed the formatting issues. I'll do a little refactoring and then we'll be good to merge.
As an idle thought here -- is there an interface available to us where we could wrap the resulting generated object with the logits, rather than store it in the logit processor as I have here? Currently we only return strings from Might be a "kick it down the road" thing, but curious if @rlouf @torymur @RobinPicard had an idea of whether this would be simple to do. If it is simple, I could try refactoring this code to store logits in a response value.
You'd need to create an object that stores the logits but otherwise behaves like a string to keep an intuitive UX. You can try something like (haven't tested it):
class State(str):
    def __new__(cls, value, logits=None):
        instance = super().__new__(cls, value)
        instance.logits = logits
        return instance
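A quick check of how such a str subclass behaves, as a sketch only. The logits values are made-up placeholders, and the caveat at the end is standard Python behavior rather than anything specific to Outlines:

```python
class State(str):
    """A str that also carries the logits it was generated from."""

    def __new__(cls, value, logits=None):
        instance = super().__new__(cls, value)
        instance.logits = logits
        return instance


# Placeholder logits, purely illustrative.
result = State("1234", logits=[[0.1, 0.9], [0.7, 0.3]])

assert result == "1234"          # compares like a plain string
assert isinstance(result, str)   # accepted anywhere a str is expected
assert result.logits[0] == [0.1, 0.9]

# Caveat: str methods and concatenation return a plain str,
# so the logits attribute is not carried over to derived strings.
assert not hasattr(result.upper(), "logits")
assert type(result + "!") is str
```

Any downstream code that slices, strips, or concatenates the result would therefore lose the logits unless those methods are overridden as well.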
This PR adds LogitTrackingProcessor, a logit processor that wraps around any other processor to store unstructured and structured logits through the sampled sequence. I needed this code elsewhere and figured it is popular enough to upstream into outlines.
I have included documentation on the processors, which doesn't exist currently. Tests are included as well.
LogitTrackingProcessor makes it easy to perform analysis on disagreements between structured and unstructured tokens. It will be of benefit to researchers, educators, and users who wish to debug their Outlines generators.
An example plot for a regex requiring four digits. This is the distribution of token probabilities on the first token.
Using the tracker is simple:
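To illustrate the idea behind the tracker without relying on the PR's exact API, here is a toy, self-contained sketch in plain Python. It records the logits before and after a wrapped processor and returns the wrapped processor's output untouched. Every name in it (ToyTrackingProcessor, allow_only, softmax, the four-token vocabulary) is hypothetical and is not part of Outlines.

```python
import math
from typing import Callable, List, Set

Logits = List[float]
Processor = Callable[[Logits], Logits]


def softmax(logits: Logits) -> List[float]:
    """Convert logits to probabilities; -inf entries get probability 0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def allow_only(allowed: Set[int]) -> Processor:
    """Toy 'structured' processor: mask every token id not in `allowed`."""
    def process(logits: Logits) -> Logits:
        return [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    return process


class ToyTrackingProcessor:
    """Hypothetical wrapper that stores logits before and after masking."""

    def __init__(self, processor: Processor) -> None:
        self.processor = processor
        self.unstructured: List[Logits] = []
        self.structured: List[Logits] = []

    def __call__(self, logits: Logits) -> Logits:
        self.unstructured.append(list(logits))    # raw model logits
        processed = self.processor(logits)
        self.structured.append(list(processed))   # logits after masking
        return processed                          # passed through unmodified

    def clear(self) -> None:
        self.unstructured.clear()
        self.structured.clear()


vocab = ["0", "1", "a", "b"]
tracker = ToyTrackingProcessor(allow_only({0, 1}))  # only digits allowed
processed = tracker([1.0, 2.0, 3.0, 0.5])

before = softmax(tracker.unstructured[0])
after = softmax(tracker.structured[0])
print("favoured before masking:", vocab[before.index(max(before))])
print("favoured after masking: ", vocab[after.index(max(after))])
```

Both lists gain one entry per generated position, which is why memory use grows with sequence length and why clearing between runs matters.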
NOTE: Currently, the tracking processor does not support batch processing. I recommend deferring this to a later PR.
Related: outlines.processors for Sampling Techniques and Debug Logging #1055, as this can be used as a debugging tool.