cvs-health
diff --git a/docs/source/_notebooks/examples/long_text_graph_demo.ipynb b/docs/source/_notebooks/examples/long_text_graph_demo.ipynb
Lines changed: 8 additions & 1 deletion
diff --git a/docs/source/_notebooks/examples/long_text_qa_demo.ipynb b/docs/source/_notebooks/examples/long_text_qa_demo.ipynb
Lines changed: 1 addition & 1 deletion
diff --git a/docs/source/_notebooks/examples/long_text_uq_demo.ipynb b/docs/source/_notebooks/examples/long_text_uq_demo.ipynb
Lines changed: 7 additions & 0 deletions
diff --git a/docs/source/api.rst b/docs/source/api.rst
Lines changed: 1 addition & 0 deletions
diff --git a/docs/source/getstarted.rst b/docs/source/getstarted.rst
Lines changed: 1 addition & 0 deletions
diff --git a/docs/source/scorer_definitions/long_text/graph.rst b/docs/source/scorer_definitions/long_text/graph.rst
Lines changed: 4 additions & 4 deletions
diff --git a/docs/source/scorer_definitions/long_text/index.rst b/docs/source/scorer_definitions/long_text/index.rst
Lines changed: 2 additions & 2 deletions
diff --git a/docs/source/scorer_definitions/long_text/luq.rst b/docs/source/scorer_definitions/long_text/luq.rst
Lines changed: 2 additions & 2 deletions
diff --git a/docs/source/scorer_definitions/long_text/qa.rst b/docs/source/scorer_definitions/long_text/qa.rst
Lines changed: 2 additions & 2 deletions
@@ -252,6 +252,11 @@
     "    <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">A langchain llm `BaseChatModel` to be used for decomposing responses into individual claims. Also used for claim refinement. If granularity=\"claim\" and claim_decomposition_llm is None, the provided `llm` will be used for claim decomposition.</td>\n",
     "  </tr>\n",
     "  <tr>\n",
+    "    <td style=\"font-weight: bold; padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">nli_llm</td>\n",
+    "    <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">BaseChatModel<br><code>default=None</code></td>\n",
+    "    <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">A LangChain chat model for LLM-based NLI inference. If provided, takes precedence over nli_model_name. Only used for mode=\"unit_response\".</td>\n",
+    "  </tr>\n",
+    "  <tr>\n",
     "    <td style=\"font-weight: bold; padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">device</td>\n",
     "    <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">str or torch.device<br><code>default=None</code></td>\n",
     "    <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">Specifies the device that NLI model use for prediction. If None, detects and returns the best available PyTorch device. Prioritizes CUDA (NVIDIA GPU), then MPS (macOS), then CPU.</td>\n",
@@ -297,6 +302,8 @@
     "      <li><code>llm</code></li>\n",
     "      <li><code>system_prompt</code></li>\n",
     "      <li><code>sampling_temperature</code></li>\n",
+    "      <li><code>claim_decomposition_llm</code></li>\n",
+    "      <li><code>nli_llm</code></li>\n",
     "    </ul>\n",
     "  </div>\n",
     "  <div style=\"flex: 1; padding: 10px; background-color: rgba(0, 200, 0, 0.1); border-radius: 5px; border: 1px solid rgba(0, 200, 0, 0.2);\">\n",
@@ -1356,7 +1363,7 @@
     "\n",
     " - **PageRank** - $ \\frac{1-d}{|V|} + d \\sum_{v \\in N(s)} \\frac{C_{PR}(v)}{N(v)}$ is the stationary distribution probability of a random walk with restart probability $(1-d)$, where $N(s)$ denotes the set of neighboring nodes of $s$ and $C_{PR}(v)$ is PageRank of node $v$.\n",
     "\n",
-    "where $\\mathbf{y}^{(s)}_{\\text{cand}} = {y_1^{(s)}, ..., y_m^{(s)}}$ are $m$ candidate responses."
+    "where $\\mathbf{y}^{(s)}_{\\text{cand}} = \\{y_1^{(s)}, ..., y_m^{(s)}\\}$ are $m$ candidate responses."
    ]
   },
   {
 
@@ -1313,7 +1313,7 @@
     "\n",
     "$$c_g(s; y_0^{(s)}, \\mathbf{y}^{(s)}_{\\text{cand}}) = \\frac{1}{m} \\sum_{j=1}^m \\eta(y_0^{(s)}, y_j^{(s)})$$\n",
     "\n",
-    "where $y_0^{(s)}$ is the original unit response, $\\mathbf{y}^{(s)}_{\\text{cand}} = {y_1^{(s)}, ..., y_m^{(s)}}$ are $m$ candidate responses to the unit's question, and $\\eta$ is a consistency function such as contradiction probability, cosine similarity, or BERTScore F1. Semantic entropy, which follows a slightly different functional form, can also be used to measure consistency."
+    "where $y_0^{(s)}$ is the original unit response, $\\mathbf{y}^{(s)}_{\\text{cand}} = \\{y_1^{(s)}, ..., y_m^{(s)}\\}$ are $m$ candidate responses to the unit's question, and $\\eta$ is a consistency function such as contradiction probability, cosine similarity, or BERTScore F1. Semantic entropy, which follows a slightly different functional form, can also be used to measure consistency."
    ]
   },
   {
 
@@ -291,6 +291,11 @@
     "    <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">A langchain llm `BaseChatModel` to be used for decomposing responses into individual claims. Also used for claim refinement. If granularity=\"claim\" and claim_decomposition_llm is None, the provided `llm` will be used for claim decomposition.</td>\n",
     "  </tr>\n",
     "  <tr>\n",
+    "    <td style=\"font-weight: bold; padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">nli_llm</td>\n",
+    "    <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">BaseChatModel<br><code>default=None</code></td>\n",
+    "    <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">A LangChain chat model for LLM-based NLI inference. If provided, takes precedence over nli_model_name. Only used for mode=\"unit_response\".</td>\n",
+    "  </tr>\n",
+    "  <tr>\n",
     "    <td style=\"font-weight: bold; padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">device</td>\n",
     "    <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">str or torch.device<br><code>default=None</code></td>\n",
     "    <td style=\"padding: 8px; border: 1px solid rgba(127, 127, 127, 0.2);\">Specifies the device that NLI model use for prediction. If None, detects and returns the best available PyTorch device. Prioritizes CUDA (NVIDIA GPU), then MPS (macOS), then CPU.</td>\n",
@@ -336,6 +341,8 @@
     "      <li><code>llm</code></li>\n",
     "      <li><code>system_prompt</code></li>\n",
     "      <li><code>sampling_temperature</code></li>\n",
+    "      <li><code>claim_decomposition_llm</code></li>\n",
+    "      <li><code>nli_llm</code></li>\n",
     "    </ul>\n",
     "  </div>\n",
     "  <div style=\"flex: 1; padding: 10px; background-color: rgba(0, 200, 0, 0.1); border-radius: 5px; border: 1px solid rgba(0, 200, 0, 0.2);\">\n",
 
@@ -10,6 +10,7 @@ API
     uqlm.black_box
     uqlm.white_box
     uqlm.judges
+    uqlm.longform
     uqlm.nli
     uqlm.calibration
     uqlm.resources
 
@@ -155,6 +155,7 @@ These scorers take a fine-grained approach and score confidence/uncertainty at t
 Below is a sample of code illustrating how to use the LongTextUQ class to conduct claim-level hallucination detection and uncertainty-aware response refinement.
 
 .. code-block:: python
+
     from langchain_openai import ChatOpenAI
     llm = ChatOpenAI(model="gpt-4o")
 
 
@@ -9,20 +9,20 @@ Definition
 Graph-based scorers, proposed by Jiang et al. (2024), decompose original and sampled responses into claims, obtain the union of unique claims across all responses, and compute graph centrality metrics on the bipartite graph of claim-response entailment to measure uncertainty. These scorers operate only at the claim level, as sentences typically contain multiple claims, meaning their union is not well-defined. Formally, we denote a bipartite graph :math:`G` with node set :math:`V = \mathbf{s} \cup  \mathbf{y}`, where :math:`\mathbf{y}` is a set of :math:`m` responses generated from the same prompt and :math:`\mathbf{s}` is the union of all unique claims across those decomposed responses. In particular, an edge exists between a claim-response pair :math:`(s, y) \in  \mathbf{s} \times \mathbf{y}` if and only if claim :math:`s` is entailed in response :math:`y`. We define the following graph metrics for claim :math:`s`:
 
 
-* **Degree Centrality** - :math:`\frac{1}{m} \sum_{j=1}^m P(\text{entail}|y_j, s)$ is the average edge weight, measured by entailment probability for claim node $s$. 
+* **Degree Centrality** - :math:`\frac{1}{m} \sum_{j=1}^m P(\text{entail}|y_j, s)` is the average edge weight, measured by entailment probability for claim node :math:`s`.
 
-* **Betweenness Centrality** - :math:`\frac{1}{B_{\text{max}}}\sum_{u \neq v \neq s} \frac{\sigma_{uv}(s)}{\sigma_{uv}}` measures uncertainty by calculating the proportion of shortest paths between node pairs that pass through node :math:`s`, where :math:`\sigma_{uv}` represents all shortest paths between nodes :math:`u` and :math:`v`, and :math:`B_{\text{max}}` is the maximum possible value, given by :math:`B_{\text{max}}=\frac{1}{2} [m^2 (p + 1)^2 + m (p + 1)(2t - p - 1) - t (2p - t + 3)]$, $p = \frac{(|\mathbf{s}| - 1)}{m}$, and $t = (|\mathbf{s}| - 1) \mod m`.
+* **Betweenness Centrality** - :math:`\frac{1}{B_{\text{max}}}\sum_{u \neq v \neq s} \frac{\sigma_{uv}(s)}{\sigma_{uv}}` measures uncertainty by calculating the proportion of shortest paths between node pairs that pass through node :math:`s`, where :math:`\sigma_{uv}` represents all shortest paths between nodes :math:`u` and :math:`v`, and :math:`B_{\text{max}}` is the maximum possible value, given by :math:`B_{\text{max}}=\frac{1}{2} [m^2 (p + 1)^2 + m (p + 1)(2t - p - 1) - t (2p - t + 3)]`, :math:`p = \frac{(|\mathbf{s}| - 1)}{m}`, and :math:`t = (|\mathbf{s}| - 1) \mod m`.
 
 
 * **Closeness Centrality** - :math:`\frac{m + 2(|\mathbf{s}| - 1) }{\sum_{v \neq s}dist(s, v)}` measures the inverse sum of distances to all other nodes, normalized by the minimum possible distance.
 
 * **Harmonic Centrality** - :math:`\frac{1}{H_{\text{max}}}\sum_{v \neq s}\frac{1}{dist(s, v)}` is the sum of inverse of distances to all other nodes, normalized by the maximum possible value, where :math:`H_{\text{max}}=m + \frac{ |\mathbf{s}| - 1}{2}`.
 
-* **Laplacian Centrality** - :math:`\frac{E_L (G)-E_L (G_{\text{-} s})}{E_L (G)}` is the proportional drop in Laplacian energy :math:`E_L (G)` resulting from dropping node $s$ from the graph, where :math:`G_{\text{-}s}` denotes the graph :math:`G` with node $s$ removed, :math:`E_L (G)  = \sum_{i} \lambda_i^2`, and :math:`\lambda_i` are the eigenvalues of :math:`G`'s Laplacian matrix.
+* **Laplacian Centrality** - :math:`\frac{E_L (G)-E_L (G_{\text{-} s})}{E_L (G)}` is the proportional drop in Laplacian energy :math:`E_L (G)` resulting from dropping node :math:`s` from the graph, where :math:`G_{\text{-}s}` denotes the graph :math:`G` with node :math:`s` removed, :math:`E_L (G)  = \sum_{i} \lambda_i^2`, and :math:`\lambda_i` are the eigenvalues of :math:`G`'s Laplacian matrix.
 
 * **PageRank** - :math:`\frac{1-d}{|V|} + d \sum_{v \in N(s)} \frac{C_{PR}(v)}{N(v)}` is the stationary distribution probability of a random walk with restart probability :math:`(1-d)`, where :math:`N(s)` denotes the set of neighboring nodes of :math:`s` and :math:`C_{PR}(v)` is PageRank of node :math:`v`.
 
-where :math:`\mathbf{y}^{(s)}_{\text{cand}} = {y_1^{(s)}, ..., y_m^{(s)}}` are $m$ candidate responses.
+where :math:`\mathbf{y}^{(s)}_{\text{cand}} = \{y_1^{(s)}, ..., y_m^{(s)}\}` are :math:`m` candidate responses.
 
 **Key Properties:**
 
 
@@ -3,9 +3,9 @@ Long-Text Scorers
 
 Long-form uncertainty quantification implements a three-stage pipeline after response generation:
 
-1. Response Decomposition: The response :math:`y` is decomposed into units (claims or sentences), where a unit as denoted as $s$.
+1. Response Decomposition: The response :math:`y` is decomposed into units (claims or sentences), where a unit is denoted as :math:`s`.
 
-2. Unit-Level Confidence Scoring: Confidence scores are computed using a unit-level scoring function with values in :math:`[0, 1]`. Higher scores indicate greater likelihood of factual correctness. Units with scores below threshold $\tau$ are flagged as potential hallucinations.
+2. Unit-Level Confidence Scoring: Confidence scores are computed using a unit-level scoring function with values in :math:`[0, 1]`. Higher scores indicate greater likelihood of factual correctness. Units with scores below threshold :math:`\tau` are flagged as potential hallucinations.
 
 3. Response-Level Aggregation: Unit scores are combined to provide an overall response confidence.
 
 
@@ -10,9 +10,9 @@ The Long-text UQ (LUQ) approach demonstrated here is adapted from Zhang et al. (
 
 .. math::
 
-    c_g(s; \mathbf{y}_{\text{cand}}) = \frac{1}{m} \sum_{j=1}^m P(\text{entail}|y_j, s
+    c_g(s; \mathbf{y}_{\text{cand}}) = \frac{1}{m} \sum_{j=1}^m P(\text{entail}|y_j, s)
 
-where :math:`\mathbf{y}^{(s)}_{\text{cand}} = {y_1^{(s)}, ..., y_m^{(s)}}` are :math:`m` candidate responses, and :math:`P(\text{entail}|y_j, s)` denotes the NLI-estimated probability that $s$ is entailed in :math:`y_j`.
+where :math:`\mathbf{y}^{(s)}_{\text{cand}} = \{y_1^{(s)}, ..., y_m^{(s)}\}` are :math:`m` candidate responses, and :math:`P(\text{entail}|y_j, s)` denotes the NLI-estimated probability that :math:`s` is entailed in :math:`y_j`.
 
 **Key Properties:**
 
 
@@ -10,9 +10,9 @@ The Claim-QA approach demonstrated here is adapted from Farquhar et al. (2024).
 
 .. math::
 
-    c_g(s; y_0^{(s)}, \mathbf{y}^{(s)}_{\text{cand}}) = \frac{1}{m} \sum_{j=1}^m \eta(y_0^{(s)}, y_j^{(s)}), s
+    c_g(s; y_0^{(s)}, \mathbf{y}^{(s)}_{\text{cand}}) = \frac{1}{m} \sum_{j=1}^m \eta(y_0^{(s)}, y_j^{(s)})
 
-where :math:`y_0^{(s)}` is the original unit response, :math:`\mathbf{y}^{(s)}_{\text{cand}} = {y_1^{(s)}, ..., y_m^{(s)}}` are :math:`m` candidate responses to the unit's question, and :math:`\eta` is a consistency function such as contradiction probability, cosine similarity, or BERTScore F1. Semantic entropy, which follows a slightly different functional form, can also be used to measure consistency.
+where :math:`y_0^{(s)}` is the original unit response, :math:`\mathbf{y}^{(s)}_{\text{cand}} = \{y_1^{(s)}, ..., y_m^{(s)}\}` are :math:`m` candidate responses to the unit's question, and :math:`\eta` is a consistency function such as contradiction probability, cosine similarity, or BERTScore F1. Semantic entropy, which follows a slightly different functional form, can also be used to measure consistency.
 
 **Key Properties:**
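
For readers checking the corrected formulas, the unit-level LUQ score and the thresholding step described in the docs can be sketched in a few lines of plain Python. This is an illustrative sketch only and does not use the `uqlm` API: the function names, the entailment probabilities, the threshold value `tau=0.5`, and the use of a simple mean for response-level aggregation are all assumptions made for the example.

```python
from statistics import mean


def luq_unit_score(entail_probs):
    """Average P(entail | y_j, s) over the m candidate responses for one unit s."""
    if not entail_probs:
        raise ValueError("need at least one candidate response")
    return sum(entail_probs) / len(entail_probs)


def flag_units(unit_scores, tau=0.5):
    """Return True for each unit whose score falls below the threshold tau."""
    return [score < tau for score in unit_scores]


# Hypothetical NLI entailment probabilities: one row per unit s,
# one column per candidate response y_j (m = 3).
entail_matrix = [
    [0.9, 0.8, 1.0],  # unit supported by most candidates
    [0.2, 0.1, 0.3],  # unit rarely entailed by the candidates
]

unit_scores = [luq_unit_score(row) for row in entail_matrix]
flags = flag_units(unit_scores, tau=0.5)

# Assumed aggregation: the mean of the unit scores.
response_confidence = mean(unit_scores)

print(unit_scores, flags, response_confidence)
```

With these made-up inputs the second unit falls below the threshold and would be flagged as a potential hallucination, while the first would not.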