Structured output support #40

Open · wants to merge 3 commits into main

Conversation


@dhicks dhicks commented Mar 23, 2025

This PR provides a minimal implementation of the approach to structured outputs I sketched in #39. Here's a not-quite-minimal working example of using structured outputs:

library(mall)

library(rjson) # For parsing structured JSON output

library(dplyr) # For efficiently wrangling parsed JSON columns
library(tidyr)
library(purrr)

# Define a JSON schema as a list to constrain a model's output
format <- list(
  type = "object",
  properties = list(
    name = list(type = "string"),
    capital = list(type = "string"),
    languages = list(
      type = "array",
      items = list(type = "string")
    )
  ),
  required = list("name", "capital", "languages")
)

# Set up the model, passing `output` and `format`
llm_use('ollama', 'llama3.2', 
        output = 'structured', 
        format = format, 
        seed = 2025-03-23)  # NB: unquoted, this evaluates arithmetically (2025 - 3 - 23 = 1999)

# Vectorized version can be piped directly into JSON parser
llm_vec_custom('Canada', 'tell me about the following country') |> 
  fromJSON()
# $name
# [1] "Canada"
# 
# $capital
# [1] "Ottawa"
# 
# $languages
# [1] "English"              "French"              
# [3] "indigenous languages"

# Data frame version requires more wrangling
dataf = data.frame(country = c('Canada', 'Mexico', 'United States'))

llm_custom(dataf, country, 'tell me about the following country') |>
  mutate(output = map(.pred, fromJSON)) |> 
  unnest_wider(output) |> 
  str()
# tibble [3 × 5] (S3: tbl_df/tbl/data.frame)
# $ country  : chr [1:3] "Canada" "Mexico" "United States"
# $ .pred    : chr [1:3] "{ \"name\": \"Canada\", \"capital\": \"Ottawa\", \"languages\": [\"English\",\"French\" , \"indigenous language"| __truncated__ "{\"name\": \"Mexico\", \"capital\": \"Mexico City\", \"languages\": [\"Spanish\", \"Maya\", \"Nahua\", \"Zapote"| __truncated__ "{\n\"name\": \"United States of America\",\n\"capital\": \"Washington D.C.\",\n\"languages\": [\"English\", \"S"| __truncated__
# $ name     : chr [1:3] "Canada" "Mexico" "United States of America"
# $ capital  : chr [1:3] "Ottawa" "Mexico City" "Washington D.C."
# $ languages:List of 3
# ..$ : chr [1:3] "English" "French" "indigenous languages"
# ..$ : chr [1:5] "Spanish" "Maya" "Nahua" "Zapotec" ...
# ..$ : chr [1:14] "English" "Spanish" "French" "Chinese" ...

Text outputs appear to be working as before. Four tests fail: three are due to small differences in code snapshots, e.g., the order of arguments:

  • Failure (test-llm-classify.R:38:3): Preview works
  • Failure (test-llm-use.R:28:3): Stops cache
  • Failure (test-llm-verify.R:36:3): Preview works

The fourth test appears to relate to the number of objects in Ollama's cache:

  • Failure (test-zzz-cache.R:3:3): Ollama cache exists and delete (actual is 61.0 vs. expected 59.0)

Since I'm not sure how these tests work, I'm not going to update the snapshots or make other changes to the tests at this time. I'm also not adding a test for the structured output use case. It would probably be desirable to define functions llm_structured() and llm_structured_vec() (the latter for vectorized output), and then write tests for those.
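For what it's worth, a hypothetical llm_structured_vec() could be a thin wrapper around llm_vec_custom() plus a parse step. The helper below sketches just the parse-and-validate half (the function name is illustrative, not part of mall; it uses the rjson package from the example above):

```r
library(rjson)

# Hypothetical helper: parse one structured response and check that the
# schema's required fields are all present
parse_structured <- function(response, required = character(0)) {
  parsed <- fromJSON(response)
  missing <- setdiff(required, names(parsed))
  if (length(missing) > 0) {
    stop('response is missing required fields: ',
         paste(missing, collapse = ', '))
  }
  parsed
}

parse_structured('{"name": "Canada", "capital": "Ottawa"}',
                 required = c('name', 'capital'))
```

A wrapper along these lines would fail loudly when the model returns malformed output, rather than silently producing NAs downstream.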

@edgararuiz
Collaborator

Hi @dhicks, thank you for this PR. I'm wondering whether llm_use() is really the best place for this feature. Maybe it's something we should add to llm_custom() to start with, something like: llm_vec_custom('Canada', 'tell me about the following country', format = format). We could default format to NULL, so that if you pass a format, the output is inferred to be a structured object. Thoughts?

@dhicks
Author

dhicks commented May 14, 2025

I'm not tied to any particular implementation — this was the result of me playing around with the package to see if it would work for my needs — so llm_custom() would probably be fine.

It did occur to me a couple weeks ago, though, that structured outputs would provide a better way of controlling output throughout the whole package. IIRC currently invalid responses are detected after they're returned and converted to NAs. Structured outputs can be used to ensure valid responses on the model side, as in this script:

library(tidyverse)

library(readxl)
library(here)
library(tictoc)

library(mall)

source('/Users/danhicks/Google Drive/Coding/*ST text mining/R/load_coding.R')

data_dir = here('data')
out_dir = here('out')

## Load comment text and manual coding ----
dataf = load_manual_coding() |> 
    select(comment_id, support) |> 
    left_join(read_excel(file.path(out_dir, '06_docs_2024-01-16.xlsx')) |> 
                  select(comment_id, text), 
              by = 'comment_id')

## Values are `oppose`, `support`, and `NA`
count(dataf, support)
# # A tibble: 3 × 2
# support     n
# <chr>   <int>
# 1 oppose    625
# 2 support   156
# 3 NA         33

## Set up LLM ----
schema = list(
    type = 'object', 
    properties = list(
        think = list(type = 'string'), 
        support = list(enum = c('support', 
                                'oppose', 
                                'unclear'))
    ),
    required = list('think', 'support')
)

prompt = 'This is a comment on Strengthening Transparency in Regulatory Science, a rule proposed by the Environmental Protection Agency in the first Trump Administration. The rule would have introduced a strong open data requirement at EPA. Classify the comment as supporting the rule (`support`) or opposing it (`oppose`), or as `unclear` if you are uncertain whether the comment supports or opposes the rule.'

llm_use('ollama', 'gemma3:12b', seed = 2025-03-26, .cache = '',  # NB: unquoted, the seed evaluates arithmetically to 1996
        output = 'structured', 
        format = schema)


tic()
llm_vec_custom(dataf$text[1], prompt = prompt)
toc()

tic()
out_df = dataf |> 
    # slice(1:100) |>
    llm_custom(text, 
               prompt = prompt)
toc()

out_df |> 
    rename(gt = support) |> 
    mutate(output = map(.pred, rjson::fromJSON)) |> 
    unnest_wider(output) |> 
    count(gt, support)

## llama3.2 ----
# # A tibble: 8 × 3
# support .classify     n
# <chr>   <chr>     <int>
# 1 oppose  oppose      620
# 2 oppose  NA            5
# 3 support oppose      145
# 4 support support       7
# 5 support NA            4
# 6 NA      oppose       31
# 7 NA      support       1
# 8 NA      NA            1

(Sorry for just dumping that; it's finals week here and I need to finish my grading!)
