Do we support generation constrained to multi-token phrases in a custom vocabulary? #1502

kuailehaha · 2025-03-18T02:17:46Z

Description:
Hi! I’d like to use Outlines to generate text constrained to sequences from a predefined vocabulary that includes multi-token phrases (e.g., ["New York", "once upon a time"]).
The current choice-like functionality (via logits restriction) works for single tokens, but for phrases spanning multiple tokens, this approach isn't sufficient.
For example:
If my vocabulary contains the phrase "New York", I need the generation to commit to the full sequence once "New" is chosen, rather than treating "York" as an independent choice.
Arbitrary combinations (e.g., "New Paris") should be disallowed unless explicitly included in the vocabulary.
Is there a way to achieve this with Outlines, perhaps by extending the regex/FSM-guided generation to handle predefined phrase choices?

Desired Behavior:
A method like model.generate(vocab=my_phrases) where my_phrases is a list of strings (potentially multi-token), ensuring the output is a valid concatenation of these phrases.
Any guidance or workarounds would be appreciated!
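
To make the desired behavior concrete, here is a minimal sketch in plain Python (no Outlines calls) of what "a valid concatenation of these phrases" could mean. The helper names `build_phrase_regex` and `is_valid`, and the choice of a single space as the separator between phrases, are assumptions made for illustration and are not part of any library.

```python
import re

def build_phrase_regex(phrases, sep=" "):
    """Return a regex matching one or more phrases joined by `sep`.

    Hypothetical helper for illustration; `sep` is an assumed separator.
    """
    # Escape each phrase so characters like "." or "?" are matched literally.
    alternatives = "|".join(re.escape(p) for p in phrases)
    return rf"(?:{alternatives})(?:{re.escape(sep)}(?:{alternatives}))*"

my_phrases = ["New York", "once upon a time", "Paris"]
pattern = re.compile(build_phrase_regex(my_phrases))

def is_valid(text):
    # fullmatch ensures the whole string is built from whitelisted phrases.
    return pattern.fullmatch(text) is not None
```

With this pattern, "New York" and "once upon a time Paris" are accepted, while "New Paris" is rejected because "New" on its own is not in the whitelist. A pattern of this shape could in principle be handed to regex-guided generation, though whether it compiles under a given Outlines version would need to be checked.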

cpfiffer · 2025-03-18T17:42:42Z

If I understand your question correctly, you may want something like

# Assumes `model` was already loaded, e.g. with outlines.models.transformers(...)
generator = outlines.generate.choice(model, ["New York", "once upon a time"])

# Must return either "New York" or "once upon a time"
generator("Choose new york or once upon a time")

2 replies

kuailehaha Mar 19, 2025
Author

Thank you for your attention. Let me clarify my question. This is a whitelist generation problem. I have a vocabulary of roughly 10,000 terms, each of which may be a single token or span multiple tokens. The output should be a sequence of these words or phrases. To solve this, I believe I would need to call outlines.generate.choice repeatedly, passing the entire vocabulary as the options at every step, which seems very time-consuming. Do you have any guidance on optimizing this process?
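
One way to picture why a single precompiled constraint may be cheaper than repeated `choice` calls: the whitelist can be stored once as a prefix tree, and each generation step then only inspects the children of the currently active nodes. The sketch below is an illustration under assumptions, not how Outlines implements guided generation. It splits phrases on whitespace as a stand-in for a real tokenizer, and `build_trie` and `allowed_next` are hypothetical names.

```python
END = object()  # sentinel key meaning "a phrase may finish at this node"

def build_trie(phrases):
    """Build a nested-dict prefix tree over whitespace-split tokens."""
    root = {}
    for phrase in phrases:
        node = root
        for token in phrase.split():
            node = node.setdefault(token, {})
        node[END] = {}
    return root

def allowed_next(root, tokens):
    """Return the tokens allowed after `tokens`, or an empty set if invalid."""
    states = [root]  # all trie nodes consistent with the tokens seen so far
    for token in tokens:
        next_states = []
        for node in states:
            child = node.get(token)
            if child is not None:
                next_states.append(child)
                if END in child:
                    # A phrase just finished, so a new one may start at the root.
                    next_states.append(root)
        states = next_states
        if not states:
            return set()
    allowed = set()
    for node in states:
        allowed.update(t for t in node if t is not END)
    return allowed

trie = build_trie(["New York", "once upon a time", "Paris"])
```

After "New" the only allowed continuation is "York", and the sequence "New", "Paris" is rejected. Tracking a list of active nodes rather than a single node keeps the lookup correct when one phrase is a prefix of another.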

cpfiffer Mar 19, 2025

Can you give me a slightly more concrete example? In the meantime, let me describe my understanding of the problem.

A common example here is using only allowed words or phrases, which I think is what you mean. You can usually handle this with a regular expression, though they can be a pain to compile sometimes.

Here's an example using 2,000 words (lowercase and titlecase):

# https://github.com/dottxt-ai/outlines/discussions/1502#discussioncomment-12546444
import outlines
from transformers import AutoTokenizer

# model_name = "NousResearch/Hermes-3-Llama-3.1-8B"
model_name = "HuggingFaceTB/SmolLM2-1.7B-Instruct"

model = outlines.models.transformers(model_name, device="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_name)

def template(prompt, system_prompt="You are a helpful assistant."):
    return tokenizer.apply_chat_template(
        [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
        tokenize=False,
        add_generation_prompt=True,
    )

# Load the 1000 most common words in English
with open('english_words.txt', 'r') as f:
    common_words = f.read().splitlines()

# Add all title case words to the list
common_words.extend([word.title() for word in common_words])

# Regular expression to match a word from the list
common_words_str = '|'.join(common_words)
word_regex = rf'(?:[\s]? (?:{common_words_str})[,.!? -]?)+'
print(word_regex)

generator = outlines.generate.regex(model, word_regex)

# Prompt the model to write a short story using only the allowed words
word_list = '\n- '.join(common_words)
prompts = [f'You can only use the words: {word_list}\n\n Write a short story.']

for prompt in prompts:
    print(generator(template(prompt), max_tokens=100))

You may have to modify `word_regex` to include additional punctuation or separators as needed.
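
As a sketch of that adjustment, the pattern can be built by a small helper that takes the punctuation set as a parameter. `make_word_regex` and its `punctuation` argument are hypothetical names. The shape mirrors `word_regex` above, with `re.escape` added so that words or punctuation containing regex metacharacters are matched literally. It is checked here with Python's `re` module only; the regex compiler used by Outlines may support a different subset of syntax.

```python
import re

def make_word_regex(words, punctuation=",.!? -"):
    """Build a pattern of whitelisted words, each optionally followed by one
    punctuation character. Hypothetical helper for illustration."""
    words_alt = "|".join(re.escape(w) for w in words)
    # Escape each mark so characters like "-" are literal inside the class.
    punct_class = "".join(re.escape(c) for c in punctuation)
    return rf"(?:\s? (?:{words_alt})[{punct_class}]?)+"

# Example: allow semicolons and colons, but no hyphen or trailing space.
pattern = make_word_regex(["the", "cat", "sat"], punctuation=",.!?;:")
```

As in the original pattern, every word (including the first) must be preceded by a space, so a matching string looks like " the cat sat." rather than "the cat sat.".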
