
Commit 7acd188

Merge pull request #321 from cvs-health/patch/v0.5.2
Patch release: `v0.5.2`
2 parents 7bd62f1 + 2c015e2 commit 7acd188

24 files changed

Lines changed: 1236 additions & 1651 deletions

docs/source/_notebooks/examples/long_text_graph_demo.ipynb

Lines changed: 8 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -252,6 +252,11 @@
252252
" <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">A langchain llm `BaseChatModel` to be used for decomposing responses into individual claims. Also used for claim refinement. If granularity=\"claim\" and claim_decomposition_llm is None, the provided `llm` will be used for claim decomposition.</td>\n",
253253
" </tr>\n",
254254
" <tr>\n",
255+
" <td style=\"font-weight: bold; padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">nli_llm</td>\n",
256+
" <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">BaseChatModel<br><code>default=None</code></td>\n",
257+
" <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">A LangChain chat model for LLM-based NLI inference. If provided, takes precedence over nli_model_name. Only used for mode=\"unit_response\"</td>\n",
258+
" </tr>\n",
259+
" <tr>\n",
255260
" <td style=\"font-weight: bold; padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">device</td>\n",
256261
" <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">str or torch.device<br><code>default=None</code></td>\n",
257262
" <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">Specifies the device that NLI model use for prediction. If None, detects and returns the best available PyTorch device. Prioritizes CUDA (NVIDIA GPU), then MPS (macOS), then CPU.</td>\n",
@@ -297,6 +302,8 @@
297302
" <li><code>llm</code></li>\n",
298303
" <li><code>system_prompt</code></li>\n",
299304
" <li><code>sampling_temperature</code></li>\n",
305+
" <li><code>claim_decomposition_llm</code></li>\n",
306+
" <li><code>nli_llm</code></li>\n",
300307
" </ul>\n",
301308
" </div>\n",
302309
" <div style=\"flex: 1; padding: 10px; background-color: rgba(0, 200, 0, 0.1); border-radius: 5px; border: 1px solid rgba(0, 200, 0, 0.2);\">\n",
@@ -1356,7 +1363,7 @@
13561363
"\n",
13571364
" - **PageRank** - $ \\frac{1-d}{|V|} + d \\sum_{v \\in N(s)} \\frac{C_{PR}(v)}{N(v)}$ is the stationary distribution probability of a random walk with restart probability $(1-d)$, where $N(s)$ denotes the set of neighboring nodes of $s$ and $C_{PR}(v)$ is PageRank of node $v$.\n",
13581365
"\n",
1359-
"where $\\mathbf{y}^{(s)}_{\\text{cand}} = {y_1^{(s)}, ..., y_m^{(s)}}$ are $m$ candidate responses."
1366+
"where $\\mathbf{y}^{(s)}_{\\text{cand}} = \\{y_1^{(s)}, ..., y_m^{(s)}\\}$ are $m$ candidate responses."
13601367
]
13611368
},
13621369
{

docs/source/_notebooks/examples/long_text_qa_demo.ipynb

Lines changed: 1 addition & 1 deletion
@@ -1313,7 +1313,7 @@
13131313
"\n",
13141314
"$$c_g(s; y_0^{(s)}, \\mathbf{y}^{(s)}_{\\text{cand}}) = \\frac{1}{m} \\sum_{j=1}^m \\eta(y_0^{(s)}, y_j^{(s)})$$\n",
13151315
"\n",
1316-
"where $y_0^{(s)}$ is the original unit response, $\\mathbf{y}^{(s)}_{\\text{cand}} = {y_1^{(s)}, ..., y_m^{(s)}}$ are $m$ candidate responses to the unit's question, and $\\eta$ is a consistency function such as contradiction probability, cosine similarity, or BERTScore F1. Semantic entropy, which follows a slightly different functional form, can also be used to measure consistency."
1316+
"where $y_0^{(s)}$ is the original unit response, $\\mathbf{y}^{(s)}_{\\text{cand}} = \\{y_1^{(s)}, ..., y_m^{(s)}\\}$ are $m$ candidate responses to the unit's question, and $\\eta$ is a consistency function such as contradiction probability, cosine similarity, or BERTScore F1. Semantic entropy, which follows a slightly different functional form, can also be used to measure consistency."
13171317
]
13181318
},
13191319
{

docs/source/_notebooks/examples/long_text_uq_demo.ipynb

Lines changed: 7 additions & 0 deletions
@@ -291,6 +291,11 @@
291291
" <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">A langchain llm `BaseChatModel` to be used for decomposing responses into individual claims. Also used for claim refinement. If granularity=\"claim\" and claim_decomposition_llm is None, the provided `llm` will be used for claim decomposition.</td>\n",
292292
" </tr>\n",
293293
" <tr>\n",
294+
" <td style=\"font-weight: bold; padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">nli_llm</td>\n",
295+
" <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">BaseChatModel<br><code>default=None</code></td>\n",
296+
" <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">A LangChain chat model for LLM-based NLI inference. If provided, takes precedence over nli_model_name. Only used for mode=\"unit_response\"</td>\n",
297+
" </tr>\n",
298+
" <tr>\n",
294299
" <td style=\"font-weight: bold; padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">device</td>\n",
295300
" <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">str or torch.device<br><code>default=None</code></td>\n",
296301
" <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">Specifies the device that NLI model use for prediction. If None, detects and returns the best available PyTorch device. Prioritizes CUDA (NVIDIA GPU), then MPS (macOS), then CPU.</td>\n",
@@ -336,6 +341,8 @@
336341
" <li><code>llm</code></li>\n",
337342
" <li><code>system_prompt</code></li>\n",
338343
" <li><code>sampling_temperature</code></li>\n",
344+
" <li><code>claim_decomposition_llm</code></li>\n",
345+
" <li><code>nli_llm</code></li>\n",
339346
" </ul>\n",
340347
" </div>\n",
341348
" <div style=\"flex: 1; padding: 10px; background-color: rgba(0, 200, 0, 0.1); border-radius: 5px; border: 1px solid rgba(0, 200, 0, 0.2);\">\n",

docs/source/api.rst

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ API
1010
uqlm.black_box
1111
uqlm.white_box
1212
uqlm.judges
13+
uqlm.longform
1314
uqlm.nli
1415
uqlm.calibration
1516
uqlm.resources

docs/source/getstarted.rst

Lines changed: 1 addition & 0 deletions
@@ -155,6 +155,7 @@ These scorers take a fine-grained approach and score confidence/uncertainty at t
155155
Below is a sample of code illustrating how to use the LongTextUQ class to conduct claim-level hallucination detection and uncertainty-aware response refinement.
156156

157157
.. code-block:: python
158+
158159
from langchain_openai import ChatOpenAI
159160
llm = ChatOpenAI(model="gpt-4o")
160161

docs/source/scorer_definitions/long_text/graph.rst

Lines changed: 4 additions & 4 deletions
@@ -9,20 +9,20 @@ Definition
99
Graph-based scorers, proposed by Jiang et al. (2024), decompose original and sampled responses into claims, obtain the union of unique claims across all responses, and compute graph centrality metrics on the bipartite graph of claim-response entailment to measure uncertainty. These scorers operate only at the claim level, as sentences typically contain multiple claims, meaning their union is not well-defined. Formally, we denote a bipartite graph :math:`G` with node set :math:`V = \mathbf{s} \cup \mathbf{y}`, where :math:`\mathbf{y}` is a set of :math:`m` responses generated from the same prompt and :math:`\mathbf{s}` is the union of all unique claims across those decomposed responses. In particular, an edge exists between a claim-response pair :math:`(s, y) \in \mathbf{s} \times \mathbf{y}` if and only if claim :math:`s` is entailed in response :math:`y`. We define the following graph metrics for claim :math:`s`:
1010

1111

12-
* **Degree Centrality** - :math:`\frac{1}{m} \sum_{j=1}^m P(\text{entail}|y_j, s)$ is the average edge weight, measured by entailment probability for claim node $s$.
12+
* **Degree Centrality** - :math:`\frac{1}{m} \sum_{j=1}^m P(\text{entail}|y_j, s)` is the average edge weight, measured by entailment probability for claim node :math:`s`.
1313

14-
* **Betweenness Centrality** - :math:`\frac{1}{B_{\text{max}}}\sum_{u \neq v \neq s} \frac{\sigma_{uv}(s)}{\sigma_{uv}}` measures uncertainty by calculating the proportion of shortest paths between node pairs that pass through node :math:`s`, where :math:`\sigma_{uv}` represents all shortest paths between nodes :math:`u` and :math:`v`, and :math:`B_{\text{max}}` is the maximum possible value, given by :math:`B_{\text{max}}=\frac{1}{2} [m^2 (p + 1)^2 + m (p + 1)(2t - p - 1) - t (2p - t + 3)]$, $p = \frac{(|\mathbf{s}| - 1)}{m}$, and $t = (|\mathbf{s}| - 1) \mod m`.
14+
* **Betweenness Centrality** - :math:`\frac{1}{B_{\text{max}}}\sum_{u \neq v \neq s} \frac{\sigma_{uv}(s)}{\sigma_{uv}}` measures uncertainty by calculating the proportion of shortest paths between node pairs that pass through node :math:`s`, where :math:`\sigma_{uv}` represents all shortest paths between nodes :math:`u` and :math:`v`, and :math:`B_{\text{max}}` is the maximum possible value, given by :math:`B_{\text{max}}=\frac{1}{2} [m^2 (p + 1)^2 + m (p + 1)(2t - p - 1) - t (2p - t + 3)]`, :math:`p = \frac{(|\mathbf{s}| - 1)}{m}`, and :math:`t = (|\mathbf{s}| - 1) \mod m`.
1515

1616

1717
* **Closeness Centrality** - :math:`\frac{m + 2(|\mathbf{s}| - 1) }{\sum_{v \neq s}dist(s, v)}` measures the inverse sum of distances to all other nodes, normalized by the minimum possible distance.
1818

1919
* **Harmonic Centrality** - :math:`\frac{1}{H_{\text{max}}}\sum_{v \neq s}\frac{1}{dist(s, v)}` is the sum of inverse of distances to all other nodes, normalized by the maximum possible value, where :math:`H_{\text{max}}=m + \frac{ |\mathbf{s}| - 1}{2}`.
2020

21-
* **Laplacian Centrality** - :math:`\frac{E_L (G)-E_L (G_{\text{-} s})}{E_L (G)}` is the proportional drop in Laplacian energy :math:`E_L (G)` resulting from dropping node $s$ from the graph, where :math:`G_{\text{-}s}` denotes the graph :math:`G` with node $s$ removed, :math:`E_L (G) = \sum_{i} \lambda_i^2`, and :math:`\lambda_i` are the eigenvalues of :math:`G`'s Laplacian matrix.
21+
* **Laplacian Centrality** - :math:`\frac{E_L (G)-E_L (G_{\text{-} s})}{E_L (G)}` is the proportional drop in Laplacian energy :math:`E_L (G)` resulting from dropping node :math:`s` from the graph, where :math:`G_{\text{-}s}` denotes the graph :math:`G` with node :math:`s` removed, :math:`E_L (G) = \sum_{i} \lambda_i^2`, and :math:`\lambda_i` are the eigenvalues of :math:`G`'s Laplacian matrix.
2222

2323
* **PageRank** - :math:`\frac{1-d}{|V|} + d \sum_{v \in N(s)} \frac{C_{PR}(v)}{N(v)}` is the stationary distribution probability of a random walk with restart probability :math:`(1-d)`, where :math:`N(s)` denotes the set of neighboring nodes of :math:`s` and :math:`C_{PR}(v)` is PageRank of node :math:`v`.
2424

25-
where :math:`\mathbf{y}^{(s)}_{\text{cand}} = {y_1^{(s)}, ..., y_m^{(s)}}` are $m$ candidate responses.
25+
where :math:`\mathbf{y}^{(s)}_{\text{cand}} = \{y_1^{(s)}, ..., y_m^{(s)}\}` are :math:`m` candidate responses.
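
To illustrate the harmonic centrality definition in this hunk, here is a minimal sketch in plain Python. It is not the library's implementation: the function name, the edge representation (a set of entailed claim-response pairs) and the treatment of unreachable nodes as contributing zero are assumptions made for the example.

```python
from collections import deque


def harmonic_centrality(claim, edges, claims, responses):
    """Normalized harmonic centrality of `claim` in a bipartite claim-response graph.

    `edges` is a set of (claim, response) pairs for which the claim is entailed.
    All names here are hypothetical and chosen for illustration only.
    """
    # Build an undirected adjacency map; tag nodes so claim and response ids cannot clash.
    adjacency = {("claim", c): set() for c in claims}
    adjacency.update({("resp", y): set() for y in responses})
    for c, y in edges:
        adjacency[("claim", c)].add(("resp", y))
        adjacency[("resp", y)].add(("claim", c))

    # Breadth-first search yields shortest-path distances from the claim node.
    source = ("claim", claim)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)

    # Unreachable nodes never enter `dist`, so they add nothing to the sum (assumption).
    total = sum(1.0 / d for node, d in dist.items() if node != source)

    # H_max = m + (|s| - 1) / 2, as in the definition above.
    h_max = len(responses) + (len(claims) - 1) / 2
    return total / h_max
```

When every claim is entailed by every response, each response sits at distance 1 and each other claim at distance 2, so the sum equals :math:`H_{\text{max}}` and the score is 1.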
2626

2727
**Key Properties:**
2828

docs/source/scorer_definitions/long_text/index.rst

Lines changed: 2 additions & 2 deletions
@@ -3,9 +3,9 @@ Long-Text Scorers
33

44
Long-form uncertainty quantification implements a three-stage pipeline after response generation:
55

6-
1. Response Decomposition: The response :math:`y` is decomposed into units (claims or sentences), where a unit as denoted as $s$.
6+
1. Response Decomposition: The response :math:`y` is decomposed into units (claims or sentences), where each unit is denoted as :math:`s`.
77

8-
2. Unit-Level Confidence Scoring: Confidence scores are computed using a unit-level scoring function with values in :math:`[0, 1]`. Higher scores indicate greater likelihood of factual correctness. Units with scores below threshold $\tau$ are flagged as potential hallucinations.
8+
2. Unit-Level Confidence Scoring: Confidence scores are computed using a unit-level scoring function with values in :math:`[0, 1]`. Higher scores indicate greater likelihood of factual correctness. Units with scores below threshold :math:`\tau` are flagged as potential hallucinations.
99

1010
3. Response-Level Aggregation: Unit scores are combined to provide an overall response confidence.
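
The three stages listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's API: both function names are hypothetical, the regex sentence splitter is a naive stand-in for decomposition, the unit scores are assumed to be supplied by some unit-level scorer, and the mean is just one possible aggregation.

```python
import re
from statistics import mean


def decompose_into_sentences(response):
    # Stage 1 (hypothetical stand-in): naive split after sentence-ending punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]


def flag_and_aggregate(unit_scores, tau=0.5):
    # Stage 2 is assumed done: one confidence score in [0, 1] per unit.
    # Units scoring below the threshold tau are flagged as potential hallucinations.
    flagged = [i for i, score in enumerate(unit_scores) if score < tau]
    # Stage 3: combine unit scores into a response-level confidence (mean is an assumption).
    return flagged, mean(unit_scores)
```

For example, `flag_and_aggregate([0.9, 0.2, 0.7], tau=0.5)` flags only the unit at index 1 and returns a response-level confidence of about 0.6.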
1111

docs/source/scorer_definitions/long_text/luq.rst

Lines changed: 2 additions & 2 deletions
@@ -10,9 +10,9 @@ The Long-text UQ (LUQ) approach demonstrated here is adapted from Zhang et al. (
1010

1111
.. math::
1212
13-
c_g(s; \mathbf{y}_{\text{cand}}) = \frac{1}{m} \sum_{j=1}^m P(\text{entail}|y_j, s
13+
c_g(s; \mathbf{y}_{\text{cand}}) = \frac{1}{m} \sum_{j=1}^m P(\text{entail}|y_j, s)
1414
15-
where :math:`\mathbf{y}^{(s)}_{\text{cand}} = {y_1^{(s)}, ..., y_m^{(s)}}` are :math:`m` candidate responses, and :math:`P(\text{entail}|y_j, s)` denotes the NLI-estimated probability that $s$ is entailed in :math:`y_j`.
15+
where :math:`\mathbf{y}^{(s)}_{\text{cand}} = \{y_1^{(s)}, ..., y_m^{(s)}\}` are :math:`m` candidate responses, and :math:`P(\text{entail}|y_j, s)` denotes the NLI-estimated probability that :math:`s` is entailed in :math:`y_j`.
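
As a small worked illustration of the LUQ formula in this hunk, the unit score is simply the mean of the entailment probabilities over the :math:`m` candidate responses. The function name below is hypothetical, and the probabilities are assumed to come from an NLI model that is not shown here.

```python
def luq_unit_score(entail_probs):
    """Average entailment probability of one unit s across m candidate responses.

    `entail_probs[j]` plays the role of P(entail | y_j, s); producing these values
    with an NLI model is outside the scope of this sketch.
    """
    if not entail_probs:
        raise ValueError("at least one candidate response is required")
    if any(p < 0.0 or p > 1.0 for p in entail_probs):
        raise ValueError("entailment probabilities must lie in [0, 1]")
    return sum(entail_probs) / len(entail_probs)
```

With four candidates and probabilities 0.9, 0.8, 0.1 and 0.6, the score is 2.4 / 4 = 0.6.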
1616

1717
**Key Properties:**
1818

docs/source/scorer_definitions/long_text/qa.rst

Lines changed: 2 additions & 2 deletions
@@ -10,9 +10,9 @@ The Claim-QA approach demonstrated here is adapted from Farquhar et al. (2024).
1010

1111
.. math::
1212
13-
c_g(s; y_0^{(s)}, \mathbf{y}^{(s)}_{\text{cand}}) = \frac{1}{m} \sum_{j=1}^m \eta(y_0^{(s)}, y_j^{(s)}), s
13+
c_g(s; y_0^{(s)}, \mathbf{y}^{(s)}_{\text{cand}}) = \frac{1}{m} \sum_{j=1}^m \eta(y_0^{(s)}, y_j^{(s)})
1414
15-
where :math:`y_0^{(s)}` is the original unit response, :math:`\mathbf{y}^{(s)}_{\text{cand}} = {y_1^{(s)}, ..., y_m^{(s)}}` are :math:`m` candidate responses to the unit's question, and :math:`\eta` is a consistency function such as contradiction probability, cosine similarity, or BERTScore F1. Semantic entropy, which follows a slightly different functional form, can also be used to measure consistency.
15+
where :math:`y_0^{(s)}` is the original unit response, :math:`\mathbf{y}^{(s)}_{\text{cand}} = \{y_1^{(s)}, ..., y_m^{(s)}\}` are :math:`m` candidate responses to the unit's question, and :math:`\eta` is a consistency function such as contradiction probability, cosine similarity, or BERTScore F1. Semantic entropy, which follows a slightly different functional form, can also be used to measure consistency.
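
The Claim-QA formula in this hunk averages a consistency function :math:`\eta` between the original unit response and each candidate. A minimal sketch follows; the function names are hypothetical, and the exact-match function is a toy stand-in for the consistency functions named above (cosine similarity, BERTScore F1, and so on), not something the library is known to provide.

```python
def claim_qa_score(original_answer, candidate_answers, eta):
    """Mean consistency between the original unit response and m candidate responses."""
    if not candidate_answers:
        raise ValueError("at least one candidate response is required")
    total = sum(eta(original_answer, y) for y in candidate_answers)
    return total / len(candidate_answers)


def exact_match(a, b):
    # Toy consistency function (assumption): 1.0 if the answers match after
    # trimming whitespace and lowercasing, else 0.0.
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0
```

Any function that takes two answers and returns a value in :math:`[0, 1]` can be passed as `eta`, which keeps the averaging step separate from the choice of consistency measure.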
1616

1717
**Key Properties:**
1818
