cvs-health · dylanbouchard · Jun 8, 2026 · May 31, 2026 · May 31, 2026 · May 31, 2026
diff --git a/README.md b/README.md
@@ -116,7 +116,7 @@ Above, `use_best=True` implements mitigation so that the uncertainty-minimized r
 *   Exact Match ([Cole et al., 2023](https://arxiv.org/abs/2305.14613); [Chen & Mueller, 2023](https://arxiv.org/abs/2308.16175))
 *   BERTScore ([Manakul et al., 2023](https://arxiv.org/abs/2303.08896); [Zheng et al., 2020](https://arxiv.org/abs/1904.09675))
 *   Cosine Similarity ([Shorinwa et al., 2024](https://arxiv.org/abs/2412.05563))
-* Functional Entropy for Code Generation ([Bouchard et al., 2026](https://arxiv.org/abs/2605.28500))
+*   Code-adapted scorers via [`CodeGenUQ`](#code-generation-uq)
 
 ### White-Box Scorers (Token-Probability-Based)
 
@@ -322,6 +322,29 @@ Above `response` and `entailment` reflect the original response and response-lev
 *   Graph-based scorers ([Jiang et al., 2024](https://arxiv.org/abs/2410.20783))
 *   Generalized long-form semantic entropy ([Farquhar et al., 2024](https://www.nature.com/articles/s41586-024-07421-0))
 
+
+### Code Generation UQ
+
+For code-generation tasks, UQLM provides `CodeGenUQ`, a specialized interface for predicting whether LLM-generated code is functionally correct without executing it. `CodeGenUQ` includes white-box scorers, code-adapted black-box scorers based on code similarity or functional equivalence, and reflexive self-evaluation scorers. The white-box scorers are the same token-probability-based scorers available through `WhiteBoxUQ`.
+
+**Example Usage:**
+
+```python
+from langchain_openai import ChatOpenAI
+llm = ChatOpenAI(model="gpt-4o-mini")
+
+from uqlm import CodeGenUQ
+cguq = CodeGenUQ(
+    llm=llm,
+    scorers=["functional_equivalence_rate"]
+)
+
+results = await cguq.generate_and_score(prompts=prompts, num_responses=5)
+results.to_df()
+```
+
+For a more detailed demo, refer to our [`CodeGenUQ` Demo](./examples/codegen_demo.ipynb). More details on code generation scorers are available in [Bouchard et al., 2026](https://arxiv.org/abs/2605.28500).
+
 ## Documentation
 Check out our [documentation site](https://cvs-health.github.io/uqlm/latest/index.html) for detailed instructions on using this package, including API reference and more.
 

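The README snippet above uses a bare `await`, which works in a notebook but not in a plain Python script. A minimal sketch of the script form follows. `StubScorer` is a hypothetical stand-in (not part of UQLM) so the example runs without API access; in practice the `CodeGenUQ` instance from the snippet would presumably take its place.

```python
import asyncio


class StubScorer:
    """Hypothetical stand-in exposing an async generate_and_score method."""

    async def generate_and_score(self, prompts, num_responses=5):
        await asyncio.sleep(0)  # placeholder for the actual LLM calls
        return [{"prompt": p, "num_responses": num_responses} for p in prompts]


async def main():
    # Assumed input: a list of code-generation prompt strings.
    prompts = ["Write a Python function that reverses a string."]
    scorer = StubScorer()
    return await scorer.generate_and_score(prompts=prompts, num_responses=5)


# asyncio.run drives the coroutine when no event loop is already running.
results = asyncio.run(main())
print(results)
```
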
diff --git a/docs/source/scorer_definitions/code_generation/code_similarity.rst b/docs/source/scorer_definitions/code_generation/code_similarity.rst
@@ -0,0 +1,52 @@
+Code Similarity Scorers
+=======================
+
+.. currentmodule:: uqlm.scorers
+
+Definition
+----------
+
+Code similarity scorers generate sampled code responses from the same prompt and compare each sampled response with the original response. Higher average similarity indicates higher confidence.
+
+``cosine_sim`` embeds the original response :math:`y` and the :math:`m` sampled responses :math:`\tilde{y}_1, \dots, \tilde{y}_m` with a code embedding model :math:`V`, then computes the normalized average cosine similarity:
+
+.. math::
+
+    NCS(y; \tilde{\mathbf{y}}) = \frac{1}{2} + \frac{1}{2m} \sum_{j=1}^{m} \frac{V(y) \cdot V(\tilde{y}_j)}{\|V(y)\| \cdot \|V(\tilde{y}_j)\|}
+
+``code_bleu`` computes the average CodeBLEU similarity between the original code response and the sampled responses:
+
+.. math::
+
+    CBC(y; \tilde{\mathbf{y}}) = \frac{1}{m} \sum_{j=1}^{m} \text{CodeBLEU}(y, \tilde{y}_j)
+
+**Key Properties:**
+
+- Code-adapted black-box consistency scoring
+- Uses structural or embedding-based similarity rather than natural-language entailment
+- Score range: :math:`[0, 1]`
+
+Parameters
+----------
+
+When using :class:`CodeGenUQ`, specify ``"cosine_sim"`` or ``"code_bleu"`` in the ``scorers`` list. You can also set ``sentence_transformer`` for ``cosine_sim`` and ``language`` for ``code_bleu``.
+
+Example
+-------
+
+.. code-block:: python
+
+    from uqlm import CodeGenUQ
+
+    code_uq = CodeGenUQ(
+        llm=llm,
+        scorers=["cosine_sim", "code_bleu"],
+        language="python",
+    )
+
+    results = await code_uq.generate_and_score(prompts=prompts, num_responses=5)
+
+See Also
+--------
+
+- :class:`CodeGenUQ` - Class for code-generation uncertainty quantification
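The NCS formula in `code_similarity.rst` can be checked by hand with a small sketch. This is an illustration only, not the library's implementation: the two-dimensional vectors are toy stand-ins for real code embeddings, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def normalized_cosine_similarity(original_vec, sampled_vecs):
    """NCS = 1/2 + 1/(2m) * sum of cosine similarities, which lies in [0, 1]."""
    m = len(sampled_vecs)
    mean_cos = sum(cosine(original_vec, v) for v in sampled_vecs) / m
    return 0.5 + 0.5 * mean_cos


# Toy embeddings: one identical, one orthogonal, one opposite sample.
original = [1.0, 0.0]
samples = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
ncs = normalized_cosine_similarity(original, samples)
print(ncs)
```

The shift and rescale map the raw cosine range of -1 to 1 onto 0 to 1, so identical samples give 1 and exactly opposite embeddings give 0.
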
diff --git a/docs/source/scorer_definitions/code_generation/functional_equivalence.rst b/docs/source/scorer_definitions/code_generation/functional_equivalence.rst
@@ -0,0 +1,76 @@
+Functional Equivalence Scorers
+==============================
+
+.. currentmodule:: uqlm.scorers
+
+Definition
+----------
+
+Functional equivalence scorers use an LLM to judge whether two code snippets are functionally equivalent, meaning they would produce the same outputs for valid inputs. These scorers were proposed by Bouchard et al. (2026).
+
+``functional_equivalence_rate`` is the proportion of the :math:`m` sampled responses :math:`\tilde{y}_j` that the LLM judges functionally equivalent (:math:`\equiv`) to the original response :math:`y`:
+
+.. math::
+
+    FER(y; \tilde{\mathbf{y}}) = \frac{1}{m} \sum_{j=1}^{m} \mathbb{I}[y \equiv \tilde{y}_j]
+
+``functional_negentropy`` clusters the original and sampled responses by functional equivalence, computes entropy over the cluster distribution, and normalizes it to a confidence score. Let :math:`\mathcal{C}` denote the set of functional equivalence clusters, and let :math:`P(C)` denote the proportion of the :math:`m + 1` responses in cluster :math:`C`. Functional entropy is:
+
+.. math::
+
+    FE(y; \tilde{\mathbf{y}}) = -\sum_{C \in \mathcal{C}} P(C) \log P(C)
+
+The normalized confidence score is:
+
+.. math::
+
+    NFN(y; \tilde{\mathbf{y}}) = 1 - \frac{FE(y; \tilde{\mathbf{y}})}{\log(m + 1)}
+
+``functional_sets_confidence`` counts the number of functional equivalence clusters and normalizes it to :math:`[0, 1]`, so that a single cluster yields 1 and :math:`m + 1` singleton clusters yield 0:
+
+.. math::
+
+    FSC(y; \tilde{\mathbf{y}}) = \frac{m + 1 - |\mathcal{C}|}{m}
+
+where :math:`|\mathcal{C}|` is the number of functional equivalence clusters among the original response and the :math:`m` sampled responses.
+
+**Key Properties:**
+
+- Directly targets functional agreement rather than textual similarity
+- Requires an LLM for equivalence judgments
+- Score range: :math:`[0, 1]`
+
+Parameters
+----------
+
+When using :class:`CodeGenUQ`, specify ``"functional_equivalence_rate"``, ``"functional_negentropy"``, or ``"functional_sets_confidence"`` in the ``scorers`` list. You can set ``equivalence_llm`` to use a separate model for equivalence judgments.
+
+Example
+-------
+
+.. code-block:: python
+
+    from uqlm import CodeGenUQ
+
+    code_uq = CodeGenUQ(
+        llm=llm,
+        equivalence_llm=equivalence_llm,
+        scorers=[
+            "functional_equivalence_rate",
+            "functional_negentropy",
+            "functional_sets_confidence",
+        ],
+        language="python",
+    )
+
+    results = await code_uq.generate_and_score(prompts=prompts, num_responses=5)
+
+References
+----------
+
+- Bouchard, D., et al. (2026). `Functional Entropy: Predicting Functional Correctness in LLM-Generated Code with Uncertainty Quantification <https://arxiv.org/pdf/2605.28500>`_. *arXiv*.
+
+See Also
+--------
+
+- :class:`CodeGenUQ` - Class for code-generation uncertainty quantification
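The three formulas in `functional_equivalence.rst` can be worked through on a toy example. The sketch below is not the library's implementation. It assumes the equivalence judgments have already been turned into one cluster label per response, with the original response first, and that at least one response was sampled. It also derives FER from cluster membership, which matches the pairwise definition only when the judgments are transitive. All function names are hypothetical.

```python
import math
from collections import Counter


def functional_equivalence_rate(cluster_ids):
    """Share of sampled responses in the same cluster as the original (index 0)."""
    original, samples = cluster_ids[0], cluster_ids[1:]
    return sum(c == original for c in samples) / len(samples)


def functional_entropy(cluster_ids):
    """FE = -sum over clusters of P(C) * log P(C), over all m + 1 responses."""
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    return -sum((k / n) * math.log(k / n) for k in counts.values())


def functional_negentropy(cluster_ids):
    """NFN = 1 - FE / log(m + 1)."""
    m = len(cluster_ids) - 1
    return 1.0 - functional_entropy(cluster_ids) / math.log(m + 1)


def functional_sets_confidence(cluster_ids):
    """FSC = (m + 1 - |C|) / m."""
    m = len(cluster_ids) - 1
    return (m + 1 - len(set(cluster_ids))) / m


# Original response plus m = 5 samples; three samples agree with the original.
cluster_ids = [0, 0, 0, 1, 0, 2]
fer = functional_equivalence_rate(cluster_ids)
nfn = functional_negentropy(cluster_ids)
fsc = functional_sets_confidence(cluster_ids)
print(fer, round(nfn, 4), fsc)
```

Here the clusters have sizes 4, 1, and 1, so FE is about 0.868 against a maximum of log 6, about 1.792. That gives an NFN near 0.516, while FER and FSC are both 0.6.
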
diff --git a/docs/source/scorer_definitions/code_generation/index.rst b/docs/source/scorer_definitions/code_generation/index.rst
@@ -0,0 +1,29 @@
+Code-Generation Scorers
+=======================
+
+.. currentmodule:: uqlm.scorers
+
+Code-generation uncertainty quantification uses :class:`CodeGenUQ` to score generated code. These scorers either reuse existing short-form UQ methods or adapt black-box consistency scoring to code by comparing structural similarity or functional equivalence across sampled generations.
+
+**Key Characteristics:**
+
+- **White-box compatibility:** Token-probability scorers are identical to the corresponding :doc:`white-box scorers <../white_box/index>`.
+- **Code-aware consistency:** Black-box scorers compare sampled code generations using code embeddings, CodeBLEU, or LLM-judged functional equivalence.
+- **Score range:** :math:`[0, 1]`, where higher values indicate higher confidence.
+
+**Trade-offs:**
+
+- **Dependency requirements:** Some code-aware scorers require code-specific models or language tooling.
+- **Higher cost:** Functional equivalence scorers require additional LLM calls.
+
+Code-Generation Scoring Methods
+-------------------------------
+
+UQLM offers three main categories of code-generation scoring methods:
+
+.. toctree::
+   :maxdepth: 1
+
+   token_probability
+   code_similarity
+   functional_equivalence
diff --git a/docs/source/scorer_definitions/code_generation/token_probability.rst b/docs/source/scorer_definitions/code_generation/token_probability.rst
@@ -0,0 +1,51 @@
+Token-Probability Code Scorers
+==============================
+
+.. currentmodule:: uqlm.scorers
+
+Definition
+----------
+
+Token-probability code scorers are the same methods used by :class:`WhiteBoxUQ`, applied to generated code responses through :class:`CodeGenUQ`.
+
+Available scorers:
+
+- ``sequence_probability``
+- ``min_probability``
+- ``mean_token_negentropy``
+- ``min_token_negentropy``
+- ``probability_margin``
+- ``monte_carlo_probability``
+- ``p_true``
+
+**Key Properties:**
+
+- Requires token probabilities from the LLM/API
+- Uses the same definitions as the corresponding white-box short-form scorers
+- Score range: :math:`[0, 1]`
+
+Parameters
+----------
+
+When using :class:`CodeGenUQ`, specify one or more token-probability scorer names in the ``scorers`` list.
+
+Example
+-------
+
+.. code-block:: python
+
+    from uqlm import CodeGenUQ
+
+    code_uq = CodeGenUQ(
+        llm=llm,
+        scorers=["sequence_probability", "min_probability"],
+        language="python",
+    )
+
+    results = await code_uq.generate_and_score(prompts=prompts)
+
+See Also
+--------
+
+- :class:`CodeGenUQ` - Class for code-generation uncertainty quantification
+- :class:`WhiteBoxUQ` - Class for white-box uncertainty quantification
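`token_probability.rst` defers to the white-box definitions rather than restating them. As a rough illustration only, the sketch below assumes the common readings of two of the listed scorers: `sequence_probability` as the product of token probabilities and `min_probability` as the smallest token probability. These definitions and the helper names are assumptions, not taken from this diff.

```python
import math


def sequence_probability(logprobs):
    """Assumed definition: joint probability, i.e. exp of the summed log-probabilities."""
    return math.exp(sum(logprobs))


def min_probability(logprobs):
    """Assumed definition: probability of the least likely generated token."""
    return math.exp(min(logprobs))


# Toy per-token probabilities for a short generated snippet.
token_probs = [0.9, 0.8, 0.5]
logprobs = [math.log(p) for p in token_probs]

seq_prob = sequence_probability(logprobs)
min_prob = min_probability(logprobs)
print(round(seq_prob, 4), round(min_prob, 4))
```

Both values stay in the 0 to 1 range because each token probability does. The joint probability shrinks as the response gets longer, whereas the minimum does not depend directly on length.
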
diff --git a/docs/source/scorer_definitions/index.rst b/docs/source/scorer_definitions/index.rst
@@ -15,4 +15,4 @@ For detailed API documentation and usage examples, see the :doc:`API Reference <
    llm_judges/index
    ensemble/index
    long_text/index
-
+   code_generation/index
diff --git a/examples/long_text_graph_demo.ipynb b/examples/long_text_graph_demo.ipynb
@@ -16,8 +16,8 @@
     "*   Betweenness Centrality ([Jiang et al., 2024](https://arxiv.org/abs/2410.20783))\n",
     "*   PageRank ([Jiang et al., 2024](https://arxiv.org/abs/2410.20783))\n",
     "*   Degree Centrality ([Zhang et al., 2024](https://arxiv.org/abs/2403.20279))\n",
-    "*   Harmonic Centrality\n",
-    "*   Laplacian Centrality\n",
+    "*   Harmonic Centrality ([Bouchard et al., 2026](https://arxiv.org/abs/2602.17431))\n",
+    "*   Laplacian Centrality ([Bouchard et al., 2026](https://arxiv.org/abs/2602.17431))\n",
     "\n",
     "</div>\n",
     "\n",
@@ -106,12 +106,12 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "In this demo, we will illustrate this approach using the [FactScore](https://github.com/shmsw25/FActScore/tree/main/factscore) longform QA dataset. To implement with your use case, simply **replace the example prompts with your data**.  "
+    "In this demo, we will illustrate this approach using the [FactScore](https://github.com/shmsw25/FActScore/tree/main/factscore) long-form QA dataset. Alternatively, pass `factscore-stem-geo` to `load_example_dataset` to use the FactScore-STEM-Geo dataset from [Bouchard et al., 2026](https://arxiv.org/abs/2602.17431). To apply this to your own use case, simply **replace the example prompts with your data**.  "
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": null,
    "metadata": {
     "tags": []
    },
@@ -204,7 +204,11 @@
    "source": [
     "# Load example dataset (FactScore)\n",
     "factscore = load_example_dataset(\"factscore\", n=5)[[\"hundredw_prompt\", \"wikipedia_text\"]].rename(columns={\"hundredw_prompt\": \"prompt\"})\n",
-    "factscore.head()"
+    "factscore.head()\n",
+    "\n",
+    "# # Alternative dataset (FactScore-STEM-Geo)\n",
+    "# factscore_stem_geo = load_example_dataset(\"factscore-stem-geo\", n=5)\n",
+    "# factscore_stem_geo.head()"
    ]
   },
   {

diff --git a/examples/long_text_qa_demo.ipynb b/examples/long_text_qa_demo.ipynb
@@ -12,7 +12,7 @@
     "  </p>\n",
     "      \n",
     "*   Long-form Semantic Entropy ([Farquhar et al., 2024](https://www.nature.com/articles/s41586-024-07421-0))\n",
-    "*   Black-Box Generalizations of Long-form Semantic Entropy\n",
+    "*   Black-Box Generalizations of Long-form Semantic Entropy ([Bouchard et al., 2026](https://arxiv.org/abs/2602.17431))\n",
     "\n",
     "</div>\n",
     "\n",
@@ -92,15 +92,13 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "In this demo, we will illustrate this approach using the [FactScore](https://github.com/shmsw25/FActScore/tree/main/factscore) longform QA dataset. To implement with your use case, simply **replace the example prompts with your data**.  "
+    "In this demo, we will illustrate this approach using the [FactScore](https://github.com/shmsw25/FActScore/tree/main/factscore) long-form QA dataset. Alternatively, pass `factscore-stem-geo` to `load_example_dataset` to use the FactScore-STEM-Geo dataset from [Bouchard et al., 2026](https://arxiv.org/abs/2602.17431). To apply this to your own use case, simply **replace the example prompts with your data**.  "
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
-   "metadata": {
-    "tags": []
-   },
+   "execution_count": null,
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -182,15 +180,18 @@
        "4  Jan Sariusz Zamoyski (Latin: Ioannes Zamoyski ...  "
       ]
      },
-     "execution_count": 2,
      "metadata": {},
-     "output_type": "execute_result"
+     "output_type": "display_data"
     }
    ],
    "source": [
     "# Load example dataset (FactScore)\n",
-    "factscore = load_example_dataset(\"factscore\", n=15)[[\"hundredw_prompt\", \"wikipedia_text\"]].rename(columns={\"hundredw_prompt\": \"prompt\"})\n",
-    "factscore.head()"
+    "factscore = load_example_dataset(\"factscore\", n=5)[[\"hundredw_prompt\", \"wikipedia_text\"]].rename(columns={\"hundredw_prompt\": \"prompt\"})\n",
+    "factscore.head()\n",
+    "\n",
+    "# # Alternative dataset (FactScore-STEM-Geo)\n",
+    "# factscore_stem_geo = load_example_dataset(\"factscore-stem-geo\", n=5)\n",
+    "# factscore_stem_geo.head()"
    ]
   },
   {