25 changes: 24 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -116,7 +116,7 @@ Above, `use_best=True` implements mitigation so that the uncertainty-minimized r
* Exact Match ([Cole et al., 2023](https://arxiv.org/abs/2305.14613); [Chen & Mueller, 2023](https://arxiv.org/abs/2308.16175))
* BERTScore ([Manakul et al., 2023](https://arxiv.org/abs/2303.08896); [Zheng et al., 2020](https://arxiv.org/abs/1904.09675))
* Cosine Similarity ([Shorinwa et al., 2024](https://arxiv.org/abs/2412.05563))
* Functional Entropy for Code Generation ([Bouchard et al., 2026](https://arxiv.org/abs/2605.28500))
* Code-adapted scorers via [`CodeGenUQ`](#code-generation-uq).

### White-Box Scorers (Token-Probability-Based)

Expand Down Expand Up @@ -322,6 +322,29 @@ Above `response` and `entailment` reflect the original response and response-lev
* Graph-based scorers ([Jiang et al., 2024](https://arxiv.org/abs/2410.20783))
* Generalized long-form semantic entropy ([Farquhar et al., 2024](https://www.nature.com/articles/s41586-024-07421-0))


### Code Generation UQ

For code-generation tasks, UQLM provides `CodeGenUQ`, a specialized interface for predicting whether LLM-generated code is functionally correct without requiring execution. `CodeGenUQ` includes white-box scorers, code-adapted black-box scorers based on functional equivalence, and reflexive self-evaluation scorers. The white-box methods are the same token-probability-based scorers available through `WhiteBoxUQ`.

**Example Usage:**

```python
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini")

from uqlm import CodeGenUQ
cguq = CodeGenUQ(
    llm=llm,
    scorers=["functional_equivalence_rate"]
)

# `prompts` is a list of prompt strings defined beforehand
results = await cguq.generate_and_score(prompts=prompts, num_responses=5)
results.to_df()
```

For a more detailed demo, refer to our [`CodeGenUQ` Demo](./examples/codegen_demo.ipynb). More details on code generation scorers are available in [Bouchard et al., 2026](https://arxiv.org/abs/2605.28500).

## Documentation
Check out our [documentation site](https://cvs-health.github.io/uqlm/latest/index.html) for detailed instructions on using this package, including API reference and more.

52 changes: 52 additions & 0 deletions docs/source/scorer_definitions/code_generation/code_similarity.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
Code Similarity Scorers
=======================

.. currentmodule:: uqlm.scorers

Definition
----------

Code similarity scorers generate sampled code responses from the same prompt and compare each sampled response with the original response. Higher average similarity indicates higher confidence.

``cosine_sim`` embeds the original and sampled code responses with a code embedding model, then computes normalized average cosine similarity:

.. math::

    NCS(y; \tilde{\mathbf{y}}) = \frac{1}{2} + \frac{1}{2m} \sum_{j=1}^{m} \frac{V(y) \cdot V(\tilde{y}_j)}{\|V(y)\| \cdot \|V(\tilde{y}_j)\|}
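
A minimal sketch of the normalization above, assuming the embeddings are already available as plain lists of floats. The helper names and the toy 2-D vectors are illustrative only and are not part of UQLM, which obtains :math:`V(\cdot)` from a code embedding model:

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def normalized_cosine_similarity(
    original_vec: Sequence[float], sampled_vecs: Sequence[Sequence[float]]
) -> float:
    """Average cosine similarity to the original, rescaled from [-1, 1] to [0, 1]."""
    m = len(sampled_vecs)
    mean_cos = sum(cosine(original_vec, v) for v in sampled_vecs) / m
    return 0.5 + 0.5 * mean_cos


# Hypothetical embeddings: one identical, one orthogonal, one opposite sample.
original = [1.0, 0.0]
samples = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
score = normalized_cosine_similarity(original, samples)
print(score)
```

The three toy cosines are 1, 0, and -1, so their mean is 0 and the rescaled score falls at the midpoint of the range.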

``code_bleu`` computes average CodeBLEU similarity between the original code response and sampled responses:

.. math::

    CBC(y; \tilde{\mathbf{y}}) = \frac{1}{m} \sum_{j=1}^{m} \text{CodeBLEU}(y, \tilde{y}_j)

**Key Properties:**

- Code-adapted black-box consistency scoring
- Uses structural or embedding-based similarity rather than natural-language entailment
- Score range: :math:`[0, 1]`

Parameters
----------

When using :class:`CodeGenUQ`, specify ``"cosine_sim"`` or ``"code_bleu"`` in the ``scorers`` list. You can also set ``sentence_transformer`` for ``cosine_sim`` and ``language`` for ``code_bleu``.

Example
-------

.. code-block:: python

    from uqlm import CodeGenUQ

    # `llm` is a LangChain chat model and `prompts` is a list of prompt strings,
    # both defined beforehand.
    code_uq = CodeGenUQ(
        llm=llm,
        scorers=["cosine_sim", "code_bleu"],
        language="python",
    )

    results = await code_uq.generate_and_score(prompts=prompts, num_responses=5)

See Also
--------

- :class:`CodeGenUQ` - Class for code-generation uncertainty quantification
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
Functional Equivalence Scorers
==============================

.. currentmodule:: uqlm.scorers

Definition
----------

Functional equivalence scorers use an LLM to judge whether two code snippets are functionally equivalent, meaning they would produce the same outputs for all valid inputs. These scorers were proposed by Bouchard et al. (2026).

``functional_equivalence_rate`` estimates the proportion of sampled responses that are functionally equivalent to the original response:

.. math::

    FER(y; \tilde{\mathbf{y}}) = \frac{1}{m} \sum_{j=1}^{m} \mathbb{I}[y \equiv \tilde{y}_j]
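
The rate above can be sketched as follows. The function names are hypothetical and not UQLM's API, and ``toy_judge`` is only a stand-in for the LLM equivalence judgment: it evaluates two lambda expressions on a few probe inputs, whereas the actual scorers do not execute code:

```python
from typing import Callable, Sequence


def functional_equivalence_rate(
    original: str,
    sampled: Sequence[str],
    is_equivalent: Callable[[str, str], bool],
) -> float:
    """Fraction of sampled snippets judged equivalent to the original response."""
    if not sampled:
        raise ValueError("at least one sampled response is required")
    matches = sum(1 for candidate in sampled if is_equivalent(original, candidate))
    return matches / len(sampled)


def toy_judge(code_a: str, code_b: str) -> bool:
    """Stand-in judge (assumption): compare two lambda expressions on probe inputs."""
    f, g = eval(code_a), eval(code_b)
    return all(f(x) == g(x) for x in range(-3, 4))


original = "lambda x: x * 2"
samples = ["lambda x: x + x", "lambda x: 2 * x", "lambda x: x ** 2", "lambda x: x * 3"]
fer = functional_equivalence_rate(original, samples, toy_judge)
print(fer)  # 0.5: two of the four samples double their input
```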

``functional_negentropy`` clusters the original and sampled responses by functional equivalence, computes entropy over the cluster distribution, and normalizes it to a confidence score. Let :math:`\mathcal{C}` denote the set of functional equivalence clusters, and let :math:`P(C)` denote the proportion of responses in cluster :math:`C`. Functional entropy is:

.. math::

    FE(y; \tilde{\mathbf{y}}) = -\sum_{C \in \mathcal{C}} P(C) \log P(C)

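Dividing by :math:`\log(m + 1)`, the maximum entropy, which is reached when every one of the :math:`m + 1` responses forms its own cluster, gives the normalized confidence score: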

.. math::

    NFN(y; \tilde{\mathbf{y}}) = 1 - \frac{FE(y; \tilde{\mathbf{y}})}{\log(m + 1)}

``functional_sets_confidence`` counts the number of functional equivalence clusters and normalizes it to :math:`[0, 1]`:

.. math::

    FSC(y; \tilde{\mathbf{y}}) = \frac{m + 1 - |\mathcal{C}|}{m}

where :math:`|\mathcal{C}|` is the number of functional equivalence clusters among the original response and :math:`m` sampled responses.
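
The two cluster-based scores can be sketched together. This is an illustration under stated assumptions, not UQLM's implementation: the greedy clustering strategy (compare each response with the first member of each existing cluster) is assumed, all function names are hypothetical, and ``same_outputs`` stands in for the LLM judge by comparing Python callables on a few probe inputs:

```python
import math
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def cluster_by_equivalence(
    responses: Sequence[T], is_equivalent: Callable[[T, T], bool]
) -> List[List[T]]:
    """Greedy clustering (assumed strategy): join the first cluster whose
    representative is judged equivalent, otherwise start a new cluster."""
    clusters: List[List[T]] = []
    for response in responses:
        for cluster in clusters:
            if is_equivalent(cluster[0], response):
                cluster.append(response)
                break
        else:
            clusters.append([response])
    return clusters


def functional_negentropy(clusters: Sequence[Sequence[T]], m: int) -> float:
    """NFN = 1 - FE / log(m + 1), with FE the entropy of the cluster proportions."""
    total = sum(len(c) for c in clusters)  # m + 1 responses in total
    entropy = -sum((len(c) / total) * math.log(len(c) / total) for c in clusters)
    return 1.0 - entropy / math.log(m + 1)


def functional_sets_confidence(clusters: Sequence[Sequence[T]], m: int) -> float:
    """FSC = (m + 1 - |C|) / m."""
    return (m + 1 - len(clusters)) / m


def same_outputs(f: Callable[[int], int], g: Callable[[int], int]) -> bool:
    """Stand-in judge (assumption): agreement on a handful of probe inputs."""
    return all(f(x) == g(x) for x in range(-3, 4))


original = lambda x: x * 2
samples = [lambda x: x + x, lambda x: x ** 2, lambda x: x * x]
m = len(samples)

clusters = cluster_by_equivalence([original, *samples], same_outputs)
nfn = functional_negentropy(clusters, m)
fsc = functional_sets_confidence(clusters, m)
print(len(clusters), nfn, fsc)
```

Here the four responses split into two clusters of two, so the entropy is :math:`\log 2`, the negentropy score is :math:`1 - \log 2 / \log 4 = 0.5`, and the sets confidence is :math:`(3 + 1 - 2) / 3 = 2/3`.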

**Key Properties:**

- Directly targets functional agreement rather than textual similarity
- Requires an LLM for equivalence judgments
- Score range: :math:`[0, 1]`

Parameters
----------

When using :class:`CodeGenUQ`, specify ``"functional_equivalence_rate"``, ``"functional_negentropy"``, or ``"functional_sets_confidence"`` in the ``scorers`` list. You can set ``equivalence_llm`` to use a separate model for equivalence judgments.

Example
-------

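.. code-block:: python

    from uqlm import CodeGenUQ

    # `llm` generates the code responses; `equivalence_llm` is a separate model
    # used for equivalence judgments. Both, along with `prompts`, are defined beforehand.
    code_uq = CodeGenUQ(
        llm=llm,
        equivalence_llm=equivalence_llm,
        scorers=[
            "functional_equivalence_rate",
            "functional_negentropy",
            "functional_sets_confidence",
        ],
        language="python",
    )

    results = await code_uq.generate_and_score(prompts=prompts, num_responses=5)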

References
----------

- Bouchard, D., et al. (2026). `Functional Entropy: Predicting Functional Correctness in LLM-Generated Code with Uncertainty Quantification <https://arxiv.org/pdf/2605.28500>`_. *arXiv*.

See Also
--------

- :class:`CodeGenUQ` - Class for code-generation uncertainty quantification
29 changes: 29 additions & 0 deletions docs/source/scorer_definitions/code_generation/index.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
Code-Generation Scorers
=======================

.. currentmodule:: uqlm.scorers

Code-generation uncertainty quantification uses :class:`CodeGenUQ` to score generated code. These scorers either reuse existing short-form UQ methods or adapt black-box consistency scoring to code by comparing structural similarity or functional equivalence across sampled generations.

**Key Characteristics:**

- **White-box compatibility:** Token-probability scorers are identical to the corresponding :doc:`white-box scorers <../white_box/index>`.
- **Code-aware consistency:** Black-box scorers compare sampled code generations using code embeddings, CodeBLEU, or LLM-judged functional equivalence.
- **Score range:** :math:`[0, 1]`, where higher values indicate higher confidence.

**Trade-offs:**

- **Dependency requirements:** Some code-aware scorers require code-specific models or language tooling.
- **Higher cost:** Functional equivalence scorers require additional LLM calls.

Code-Generation Scoring Methods
-------------------------------

There are three main categories of code-generation scoring methods offered by UQLM:

.. toctree::
    :maxdepth: 1

    token_probability
    code_similarity
    functional_equivalence
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
Token-Probability Code Scorers
==============================

.. currentmodule:: uqlm.scorers

Definition
----------

Token-probability code scorers are the same methods used by :class:`WhiteBoxUQ`, applied to generated code responses through :class:`CodeGenUQ`.

Available scorers:

- ``sequence_probability``
- ``min_probability``
- ``mean_token_negentropy``
- ``min_token_negentropy``
- ``probability_margin``
- ``monte_carlo_probability``
- ``p_true``

**Key Properties:**

- Requires token probabilities from the LLM/API
- Uses the same definitions as the corresponding white-box short-form scorers
- Score range: :math:`[0, 1]`
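
As a rough illustration of how two of these scores relate to token log-probabilities, the sketch below assumes that ``sequence_probability`` is the joint probability of the generated tokens and that ``min_probability`` is the smallest single-token probability. Both definitions are assumptions here, so consult the white-box scorer definitions for the authoritative formulas. The function names and log-probabilities are made up for the example:

```python
import math
from typing import Sequence


def sequence_probability(token_logprobs: Sequence[float]) -> float:
    """Joint probability of the sequence (assumed definition): exp of the summed log-probs."""
    return math.exp(sum(token_logprobs))


def min_probability(token_logprobs: Sequence[float]) -> float:
    """Smallest token probability in the sequence (assumed definition)."""
    return math.exp(min(token_logprobs))


# Hypothetical log-probabilities for a three-token response.
token_logprobs = [math.log(0.9), math.log(0.8), math.log(0.5)]
seq_prob = sequence_probability(token_logprobs)
min_prob = min_probability(token_logprobs)
print(seq_prob, min_prob)
```

Both values lie in :math:`[0, 1]` because each token probability does.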

Parameters
----------

When using :class:`CodeGenUQ`, specify one or more token-probability scorer names in the ``scorers`` list.

Example
-------

.. code-block:: python

    from uqlm import CodeGenUQ

    # `llm` must expose token probabilities; `llm` and `prompts` are defined beforehand.
    code_uq = CodeGenUQ(
        llm=llm,
        scorers=["sequence_probability", "min_probability"],
        language="python",
    )

    results = await code_uq.generate_and_score(prompts=prompts)

See Also
--------

- :class:`CodeGenUQ` - Class for code-generation uncertainty quantification
- :class:`WhiteBoxUQ` - Class for white-box uncertainty quantification
2 changes: 1 addition & 1 deletion docs/source/scorer_definitions/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -15,4 +15,4 @@ For detailed API documentation and usage examples, see the :doc:`API Reference <
llm_judges/index
ensemble/index
long_text/index

code_generation/index
14 changes: 9 additions & 5 deletions examples/long_text_graph_demo.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -16,8 +16,8 @@
"* Betweenness Centrality ([Jiang et al., 2024](https://arxiv.org/abs/2410.20783))\n",
"* PageRank ([Jiang et al., 2024](https://arxiv.org/abs/2410.20783))\n",
"* Degree Centrality ([Zhang et al., 2024](https://arxiv.org/abs/2403.20279))\n",
"* Harmonic Centrality\n",
"* Laplacian Centrality\n",
"* Harmonic Centrality ([Bouchard et al., 2026](https://arxiv.org/abs/2602.17431))\n",
"* Laplacian Centrality ([Bouchard et al., 2026](https://arxiv.org/abs/2602.17431))\n",
"\n",
"</div>\n",
"\n",
Expand Down Expand Up @@ -106,12 +106,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this demo, we will illustrate this approach using the [FactScore](https://github.com/shmsw25/FActScore/tree/main/factscore) longform QA dataset. To implement with your use case, simply **replace the example prompts with your data**. "
"In this demo, we will illustrate this approach using the [FactScore](https://github.com/shmsw25/FActScore/tree/main/factscore) longform QA dataset. Alternatively, pass `factscore-stem-geo` to `load_example_dataset` to use the FactScore-STEM-Geo dataset from [Bouchard et al., 2026](https://arxiv.org/abs/2602.17431). To implement with your use case, simply **replace the example prompts with your data**. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": null,
"metadata": {
"tags": []
},
Expand Down Expand Up @@ -204,7 +204,11 @@
"source": [
"# Load example dataset (FactScore)\n",
"factscore = load_example_dataset(\"factscore\", n=5)[[\"hundredw_prompt\", \"wikipedia_text\"]].rename(columns={\"hundredw_prompt\": \"prompt\"})\n",
"factscore.head()"
"factscore.head()\n",
"\n",
"# # Alternative dataset (FactScore-STEM-Geo)\n",
"# factscore_stem_geo = load_example_dataset(\"factscore-stem-geo\", n=5)\n",
"# factscore_stem_geo.head()\n"
]
},
{
21 changes: 11 additions & 10 deletions examples/long_text_qa_demo.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@
" </p>\n",
" \n",
"* Long-form Semantic Entropy ([Farquhar et al., 2024](https://www.nature.com/articles/s41586-024-07421-0))\n",
"* Black-Box Generalizations of Long-form Semantic Entropy\n",
"* Black-Box Generalizations of Long-form Semantic Entropy ([Bouchard et al., 2026](https://arxiv.org/abs/2602.17431))\n",
"\n",
"</div>\n",
"\n",
Expand Down Expand Up @@ -92,15 +92,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this demo, we will illustrate this approach using the [FactScore](https://github.com/shmsw25/FActScore/tree/main/factscore) longform QA dataset. To implement with your use case, simply **replace the example prompts with your data**. "
"In this demo, we will illustrate this approach using the [FactScore](https://github.com/shmsw25/FActScore/tree/main/factscore) longform QA dataset. Alternatively, pass `factscore-stem-geo` to `load_example_dataset` to use the FactScore-STEM-Geo dataset from [Bouchard et al., 2026](https://arxiv.org/abs/2602.17431). To implement with your use case, simply **replace the example prompts with your data**. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": []
},
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
Expand Down Expand Up @@ -182,15 +180,18 @@
"4 Jan Sariusz Zamoyski (Latin: Ioannes Zamoyski ... "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
"output_type": "display_data"
}
],
"source": [
"# Load example dataset (FactScore)\n",
"factscore = load_example_dataset(\"factscore\", n=15)[[\"hundredw_prompt\", \"wikipedia_text\"]].rename(columns={\"hundredw_prompt\": \"prompt\"})\n",
"factscore.head()"
"factscore = load_example_dataset(\"factscore\", n=5)[[\"hundredw_prompt\", \"wikipedia_text\"]].rename(columns={\"hundredw_prompt\": \"prompt\"})\n",
"factscore.head()\n",
"\n",
"# # Alternative dataset (FactScore-STEM-Geo)\n",
"# factscore_stem_geo = load_example_dataset(\"factscore-stem-geo\", n=5)\n",
"# factscore_stem_geo.head()\n"
]
},
{