Enable Token probability based Semantic Entropy by mohitcek · Pull Request #76 · cvs-health/uqlm

mohitcek · 2025-06-27T14:02:54Z

This PR updates the NLIScorer class to enable computation of the Semantic Entropy score using token probabilities (Issue #24). The BlackBoxUQ and SemanticEntropy classes are updated accordingly to reflect the relevant changes.

The input attribute discrete is deprecated; users can now provide logprobs directly. If logprobs are not provided, uqlm falls back to the discrete approach.
Unit tests are updated to validate computation using logprobs (without changing or rerunning the data generation files, and with 100% code coverage maintained).

To see how the different scenarios are handled, refer to these notebooks on a different branch, which include print statements at various steps:
Semantic Entropy
Black Box

dylanbouchard

I would suggest the following changes:

Black box uses only discrete entropy (so we are consistent across all black-box scorers that no token probabilities are used).
For the SemanticEntropy scorer class, let's compute only discrete entropy if logprobs are not available, and compute both if they are.
Let's enable simultaneous computation of discrete and token-based entropy in the NLI class.

dylanbouchard · 2025-06-28T16:36:56Z

        if self.use_nli:
            compute_entropy = "semantic_negentropy" in self.scorers
-            nli_scores = self.nli_scorer.evaluate(responses=self.responses, sampled_responses=self.sampled_responses, use_best=self.use_best, compute_entropy=compute_entropy)
+            responses_logprobs = self.logprobs if hasattr(self.llm, "logprobs") else None


My preference is for black box we avoid using token probabilities altogether. Let's just stick with discrete entropy here.

dylanbouchard · 2025-06-28T16:37:55Z

-            best_responses[i], semantic_entropy[i], scores = tmp
+
+            candidate_logprobs = [self.logprobs[i]] + self.multiple_logprobs[i] if (self.logprobs and self.multiple_logprobs) else None
+            tmp = self.nli_scorer._semantic_entropy_process(candidates=candidates, i=i, logprobs_results=candidate_logprobs)


Perhaps we enable computation of both simultaneously? Let me know what you think. It's barely any extra time/effort to compute both after NLI clustering is done
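
To illustrate why computing both is cheap once clustering is done, here is a minimal sketch. It is not the uqlm implementation: the function names `cluster_probabilities`, `entropy`, and `both_entropies` are hypothetical, and the sketch assumes the clustering step yields a list of index lists (one per semantic cluster) and that token-based response probabilities, when available, are plain floats.

```python
import math
from typing import Dict, List, Optional


def cluster_probabilities(response_probs: List[float], cluster_indices: List[List[int]]) -> List[float]:
    """Sum response probabilities within each cluster, then normalize to sum to 1."""
    totals = [sum(response_probs[j] for j in cluster) for cluster in cluster_indices]
    norm = sum(totals)
    return [t / norm for t in totals]


def entropy(probs: List[float]) -> float:
    """Shannon entropy with natural log; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def both_entropies(
    cluster_indices: List[List[int]],
    tokenprob_response_probs: Optional[List[float]] = None,
) -> Dict[str, Optional[float]]:
    """Reuse one clustering result for both entropy variants (hypothetical helper)."""
    n = sum(len(cluster) for cluster in cluster_indices)
    uniform = [1 / n] * n  # discrete variant: every response weighted equally
    result: Dict[str, Optional[float]] = {
        "discrete": entropy(cluster_probabilities(uniform, cluster_indices)),
        "token_based": None,  # stays None when no logprobs were available
    }
    if tokenprob_response_probs is not None:
        result["token_based"] = entropy(cluster_probabilities(tokenprob_response_probs, cluster_indices))
    return result


# Three responses, the first two clustered together as semantically equivalent.
scores = both_entropies([[0, 1], [2]], tokenprob_response_probs=[0.5, 0.3, 0.2])
```

The NLI clustering is the expensive step; both variants only differ in the response weights fed into the same cluster assignment.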

dylanbouchard · 2025-07-07T20:06:40Z

+        clustered_responses, cluster_indices, nli_scores = self._cluster_responses(responses=candidates, response_probabilities=response_probabilities)
+        # Compute discrete semantic entropy
+        cluster_probabilities = self._compute_cluster_probabilities(response_probabilities=response_probabilities, cluster_indices=cluster_indices)
+        best_response = clustered_responses[cluster_probabilities.index(max(cluster_probabilities))][0]


Let's have this be the default calculation for best response

dylanbouchard · 2025-07-07T20:06:56Z

+        tokenprob_semantic_entropy = None
+        if tokenprob_response_probabilities:
+            tokenprob_cluster_probabilities = self._compute_cluster_probabilities(response_probabilities=tokenprob_response_probabilities, cluster_indices=cluster_indices)
+            best_response = clustered_responses[tokenprob_cluster_probabilities.index(max(tokenprob_cluster_probabilities))][0]


let's create a parameter that determines how best response is selected

This can be used if users deviate from default selection approach
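
A rough sketch of what such a parameter could look like, assuming a string-valued option with the values "discrete" and "token-based" that come up later in this thread. The function `select_best_response` and the fallback to discrete selection when token-based probabilities are missing are assumptions for illustration, not the merged uqlm code.

```python
from typing import List, Optional

VALID_SELECTIONS = ("discrete", "token-based")


def select_best_response(
    clustered_responses: List[List[str]],
    discrete_cluster_probs: List[float],
    tokenprob_cluster_probs: Optional[List[float]] = None,
    best_response_selection: str = "discrete",
) -> str:
    """Return the first response of the most probable cluster (hypothetical helper)."""
    if best_response_selection not in VALID_SELECTIONS:
        raise ValueError(f"best_response_selection must be one of {VALID_SELECTIONS}")
    if best_response_selection == "token-based" and tokenprob_cluster_probs is not None:
        probs = tokenprob_cluster_probs
    else:
        # Assumed fallback: use discrete probabilities when logprobs are unavailable.
        probs = discrete_cluster_probs
    return clustered_responses[probs.index(max(probs))][0]


clusters = [["Paris", "paris"], ["Lyon"]]
default_choice = select_best_response(clusters, [2 / 3, 1 / 3], [0.4, 0.6])
token_choice = select_best_response(clusters, [2 / 3, 1 / 3], [0.4, 0.6], best_response_selection="token-based")
```

Keeping discrete selection as the default means callers without logprobs access get the same best response as before.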

dylanbouchard · 2025-07-07T20:53:29Z

-            best_response, semantic_negentropy, scores = tmp
+            all_logprobs = [self.logprobs[i]] + self.multiple_logprobs[i] if (self.logprobs and self.multiple_logprobs) else None
+            tmp = self._semantic_entropy_process(candidates=all_responses, i=i, logprobs_results=all_logprobs)
+            best_response, semantic_negentropy, scores, tokenprob_semantic_entropy = tmp


Let's rename semantic_entropy -> discrete entropy. This will be a better naming convention and more consistent. Please note this line will need to be updated as well:
https://github.com/cvs-health/uqlm/blob/main/uqlm/scorers/black_box.py#L164

dylanbouchard · 2025-07-07T20:58:38Z

+    def _compute_response_probabilities(self, logprobs_results: List[List[Dict[str, Any]]], num_responses: int = None) -> List[float]:
+        """Compute response probabilities"""
+        uniform_response_probabilities = [1 / num_responses] * num_responses
+        tokenprob_response_probabilities = [self.avg_logprob(logprobs_i) if logprobs_i else np.nan for logprobs_i in logprobs_results] if logprobs_results else None


Let's update the token-probability-based response probabilities as discussed.

dylanbouchard · 2025-07-07T20:59:56Z

        Helper function to compute semantic entropy score from cluster probabilities
        """
-        return abs(sum([p * math.log(p) for p in cluster_probabilities]))
+        return abs(sum([p * math.log(p) if p > 0.0 else 0 for p in cluster_probabilities]))


Is it possible that a cluster has a non-positive probability? I don't think that should be possible

@mohitcek what do you think about this?
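
For context on what the guard changes, here is a small standalone sketch (the function names are hypothetical, and this is not taken from the uqlm code). Without the guard, a zero probability makes `math.log` raise `ValueError`; with it, the term contributes 0. Whether a zero can actually occur in practice is the open question above. One conceivable route, stated here only as an assumption, is floating-point underflow when exponentiating very negative log probabilities.

```python
import math
from typing import List


def guarded_entropy(cluster_probabilities: List[float]) -> float:
    """Entropy as in the proposed change: non-positive probabilities contribute 0."""
    return abs(sum(p * math.log(p) if p > 0.0 else 0 for p in cluster_probabilities))


def unguarded_entropy(cluster_probabilities: List[float]) -> float:
    """Entropy as in the original line: raises ValueError if any probability is 0."""
    return abs(sum(p * math.log(p) for p in cluster_probabilities))


# With strictly positive probabilities, both versions agree.
same = math.isclose(guarded_entropy([0.8, 0.2]), unguarded_entropy([0.8, 0.2]))

# A probability that underflows to exactly 0.0 (illustrative, not observed in uqlm).
underflowed = math.exp(-1000)
```

If cluster probabilities are always sums of strictly positive response probabilities, the guard is never triggered and only serves as a safety net.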

dylanbouchard · 2025-07-07T21:00:22Z

+    @staticmethod
+    def avg_logprob(logprobs: List[Dict[str, Any]]) -> float:
+        "Compute average logprob"
+        return np.mean([np.exp(d["logprob"]) for d in logprobs])


let's update this as discussed
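
The thread does not record what was agreed, so the following is only a point of reference, not the chosen fix. It contrasts the calculation in the diff above (the arithmetic mean of per-token probabilities) with two other common ways to score a response from its token logprobs. The function names are hypothetical; the input format (a list of dicts with a "logprob" key) follows the signature shown in the diff.

```python
import math
from typing import Any, Dict, List


def mean_token_probability(logprobs: List[Dict[str, Any]]) -> float:
    """Arithmetic mean of token probabilities (what the diff above computes)."""
    return sum(math.exp(d["logprob"]) for d in logprobs) / len(logprobs)


def length_normalized_probability(logprobs: List[Dict[str, Any]]) -> float:
    """exp of the mean logprob, i.e. the geometric mean of token probabilities."""
    return math.exp(sum(d["logprob"] for d in logprobs) / len(logprobs))


def joint_probability(logprobs: List[Dict[str, Any]]) -> float:
    """exp of the summed logprobs, i.e. the probability of the whole token sequence."""
    return math.exp(sum(d["logprob"] for d in logprobs))


# Two tokens with probabilities 0.5 and 0.25.
example = [{"logprob": math.log(0.5)}, {"logprob": math.log(0.25)}]
```

Only the second function actually averages log probabilities, which is what the name avg_logprob suggests; the joint probability penalizes longer responses, while the two averaged forms do not.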

dylanbouchard · 2025-07-14T16:56:32Z

+        if hasattr(self.llm, "logprobs"):
+            print("UQLM: Using logprobs to compute response probabilities for semantic entropy score")
+            self.llm.logprobs = True
+


How about we instead check whether logprobs is unavailable and warn that only Discrete Semantic Entropy will be used? Maybe something like this:

    if not hasattr(self.llm, "logprobs"):
        warnings.warn("The provided LLM does not support logprobs access. Only discrete semantic entropy will be computed.")
    else:
        self.llm.logprobs = True

dylanbouchard · 2025-07-14T17:00:00Z

+        best_response_selection : Callable, default=None
+            Specifies the function to select the best response from the clustered responses.
+            If None, the default function will be used.


Is this parameter actually needed for this class?

Removed this callable attribute from entropy.py and nli.py. The name best_response_selection is now used for what was previously the entropy_type variable.

dylanbouchard · 2025-07-14T17:00:58Z

+        entropy_type : str, default="discrete"
+            Specifies the type of entropy confidence score to compute best response. Must be one of "discrete" or "token-level".
+


Since we are returning both entropy types, we should rename this parameter to best_response_selection. Also, can we replace "token-level" with "token-based"?

* update NLIScorer to handle logprobs
* update SemanticEntropy class to input logprobs to NLIScorer methods
* changes related to edge cases and minor refactoring
* updated unit tests
* ruff formatting
* remove changes from BlackBoxUQ
* nliscorer class returns both SE scores
* update and rerun demo notebook
* ruff format nli.py file
* updates based on reviewer's comment
* update unit tests
* updated example notebook
* ruff formating
* ruff format nli module
* renamed entropy_type variable and additional changes
* ruff formatting

mohitcek added 5 commits June 26, 2025 09:54

update NLIScorer to handle logprobs

7b97a0f

update SemanticEntropy class to input logprobs to NLIScorer methods

9450011

changes related to edge cases and minor refactoring

faffe98

updated unit tests

f8ca19c

ruff formatting

377b587

dylanbouchard requested changes Jun 28, 2025

remove changes from BlackBoxUQ

3fceae3

dylanbouchard linked an issue Jun 30, 2025 that may be closed by this pull request

Enable token-probability-based semantic entropy #24

Closed

mohitcek added 4 commits July 2, 2025 11:46

nliscorer class returns both SE scores

1f8d81f

update and rerun demo notebook

d2c5029

ruff format nli.py file

4a43347

Merge remote-tracking branch 'upstream/develop' into tokenProb/Semant…

09a2499

…icEntropy

dylanbouchard requested changes Jul 7, 2025

mohitcek added 5 commits July 10, 2025 12:42

updates based on reviewer's comment

12f897d

update unit tests

59e940f

updated example notebook

5be1154

ruff formating

8378596

ruff format nli module

e87630d

mohitcek requested a review from dylanbouchard July 10, 2025 17:10

dylanbouchard requested changes Jul 14, 2025

dylanbouchard reviewed Jul 14, 2025

mohitcek added 2 commits July 14, 2025 13:45

renamed entropy_type variable and additional changes

8cf698b

ruff formatting

a6c088b

mohitcek requested a review from dylanbouchard July 14, 2025 17:59

dylanbouchard approved these changes Jul 14, 2025

dylanbouchard merged commit 71ec419 into cvs-health:develop Jul 14, 2025
16 checks passed
