
Structured output support #40

Open: wants to merge 2 commits into base: main

dhicks commented Mar 23, 2025

This PR provides a minimal implementation of the approach to structured outputs I sketched in #39. Here's a not-quite-minimal working example of using structured outputs:

library(mall)

library(rjson) # For parsing structured JSON output

library(dplyr) # For efficiently wrangling parsed JSON columns
library(tidyr)
library(purrr)

# Define a JSON schema as a list to constrain a model's output
format <- list(
  type = "object",
  properties = list(
    name = list(type = "string"),
    capital = list(type = "string"),
    languages = list(type = "array",
                     items = list(type = "string")
    )
  ),
  required = list("name", "capital", "languages")
)

# Set up the model, passing `output` and `format`
llm_use('ollama', 'llama3.2',
        output = 'structured',
        format = format,
        seed = 2025-03-23) # NB: R evaluates this as subtraction, so the seed is 1999

# Vectorized version can be piped directly into JSON parser
llm_vec_custom('Canada', 'tell me about the following country') |> 
  fromJSON()
# $name
# [1] "Canada"
# 
# $capital
# [1] "Ottawa"
# 
# $languages
# [1] "English"              "French"              
# [3] "indigenous languages"

# Data frame version requires more wrangling
dataf <- data.frame(country = c('Canada', 'Mexico', 'United States'))

llm_custom(dataf, country, 'tell me about the following country') |>
  mutate(output = map(.pred, fromJSON)) |> 
  unnest_wider(output) |> 
  str()
# tibble [3 × 5] (S3: tbl_df/tbl/data.frame)
# $ country  : chr [1:3] "Canada" "Mexico" "United States"
# $ .pred    : chr [1:3] "{ \"name\": \"Canada\", \"capital\": \"Ottawa\", \"languages\": [\"English\",\"French\" , \"indigenous language"| __truncated__ "{\"name\": \"Mexico\", \"capital\": \"Mexico City\", \"languages\": [\"Spanish\", \"Maya\", \"Nahua\", \"Zapote"| __truncated__ "{\n\"name\": \"United States of America\",\n\"capital\": \"Washington D.C.\",\n\"languages\": [\"English\", \"S"| __truncated__
# $ name     : chr [1:3] "Canada" "Mexico" "United States of America"
# $ capital  : chr [1:3] "Ottawa" "Mexico City" "Washington D.C."
# $ languages:List of 3
# ..$ : chr [1:3] "English" "French" "indigenous languages"
# ..$ : chr [1:5] "Spanish" "Maya" "Nahua" "Zapotec" ...
# ..$ : chr [1:14] "English" "Spanish" "French" "Chinese" ...

Text outputs appear to work as before. Four tests fail. Three of them fail because of small differences in a code snapshot, e.g., the order of arguments:

  • Failure (test-llm-classify.R:38:3): Preview works
  • Failure (test-llm-use.R:28:3): Stops cache
  • Failure (test-llm-verify.R:36:3): Preview works

The fourth failure appears to relate to the number of objects in Ollama's cache:

  • Failure (test-zzz-cache.R:3:3): Ollama cache exists and delete (actual is 61.0 vs. expected 59.0)

Since I'm not sure how these tests work, I'm not updating the snapshots or otherwise changing the tests at this time. I'm also not adding a test for the structured output use case. It would probably be desirable to define functions llm_structured and llm_structured_vec specifically for structured output, and then write tests for those.
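
As a rough illustration of the parsing half such functions might need, here is a minimal sketch that runs without a model. The helper name `parse_structured` is hypothetical and not part of mall, and the hard-coded JSON strings merely stand in for model responses. It parses each response with `rjson::fromJSON()` and checks it against the schema's `required` fields, which is also the kind of behavior a test could assert on without depending on Ollama.

```r
library(rjson)

# Hypothetical helper (not part of mall): parse a character vector of JSON
# responses and check each one for the fields listed in `required`.
parse_structured <- function(responses, required) {
  lapply(responses, function(resp) {
    parsed <- fromJSON(resp)
    missing <- setdiff(required, names(parsed))
    if (length(missing) > 0) {
      stop("Response is missing required field(s): ",
           paste(missing, collapse = ", "))
    }
    parsed
  })
}

# Hard-coded strings standing in for model output (assumed, not real output)
responses <- c(
  '{"name": "Canada", "capital": "Ottawa", "languages": ["English", "French"]}',
  '{"name": "Mexico", "capital": "Mexico City", "languages": ["Spanish"]}'
)

parsed <- parse_structured(responses, c("name", "capital", "languages"))
parsed[[1]]$capital
```

Failing loudly on a missing field is only one possible design; returning `NA` for that row instead might suit the data frame workflow better.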
