Enable Token probability based Semantic Entropy #76

Merged: dylanbouchard merged 17 commits into cvs-health:develop from mohitcek:tokenProb/SemanticEntropy on Jul 14, 2025

@mohitcek mohitcek commented Jun 27, 2025

Contributor

This PR updates the NLIScorer class to enable computation of the Semantic Entropy score using token probabilities (Issue #24). The BlackBoxUQ and SemanticEntropy classes are updated accordingly to reflect the relevant changes.

  • The input attribute discrete is deprecated. Users can now provide logprobs directly; if logprobs are not provided, uqlm falls back to the discrete approach.
  • Unit tests are updated to validate the logprobs-based computation, without changing or rerunning the data generation files, while maintaining 100% code coverage.
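
The fallback described in the first bullet can be sketched as follows. This is a minimal illustration, not the uqlm implementation: the function name `response_weights` is hypothetical, the logprobs layout (one list of `{"logprob": ...}` dicts per response) is taken from the diff snippets in this thread, and scoring each response by its mean token probability is only one possible choice.

```python
import math
from typing import Any, Dict, List, Optional


def response_weights(
    num_responses: int,
    logprobs_results: Optional[List[List[Dict[str, Any]]]] = None,
) -> List[float]:
    """Return one weight per response; the weights sum to 1."""
    if not logprobs_results:
        # Discrete fallback: every sampled response counts equally.
        return [1.0 / num_responses] * num_responses
    # Token-based: score each response by its mean token probability
    # (an illustrative assumption), then normalize across responses.
    scores = [
        sum(math.exp(tok["logprob"]) for tok in toks) / len(toks)
        for toks in logprobs_results
    ]
    total = sum(scores)
    return [s / total for s in scores]


# No logprobs: uniform weights.
uniform = response_weights(num_responses=4)

# With logprobs: the more likely response gets the larger weight.
weighted = response_weights(
    num_responses=2,
    logprobs_results=[
        [{"logprob": math.log(0.9)}, {"logprob": math.log(0.9)}],
        [{"logprob": math.log(0.3)}, {"logprob": math.log(0.3)}],
    ],
)
```

With uniform weights, a cluster's probability is just its share of the sampled responses, which is why this case is called discrete.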

To see how the different scenarios are handled, refer to these notebooks on a different branch, which include print statements at various steps:
Semantic Entropy
Black Box

[Image: screenshot, 2025-06-27 9:47:28 AM]

@dylanbouchard dylanbouchard left a comment

Collaborator

I would suggest the following changes:

  1. Black box uses only discrete entropy (so that we are consistent across all black box scorers in not using token probabilities).
  2. For the SemanticEntropy scorer class, let's compute only discrete entropy if logprobs are not available, and compute both if they are.
  3. Let's enable simultaneous computation of discrete and token-based entropy in the NLI class.

Comment thread uqlm/scorers/black_box.py Outdated
if self.use_nli:
    compute_entropy = "semantic_negentropy" in self.scorers
    nli_scores = self.nli_scorer.evaluate(responses=self.responses, sampled_responses=self.sampled_responses, use_best=self.use_best, compute_entropy=compute_entropy)
    responses_logprobs = self.logprobs if hasattr(self.llm, "logprobs") else None

Collaborator

My preference is for black box we avoid using token probabilities altogether. Let's just stick with discrete entropy here.

Comment thread uqlm/scorers/entropy.py Outdated
best_responses[i], semantic_entropy[i], scores = tmp

candidate_logprobs = [self.logprobs[i]] + self.multiple_logprobs[i] if (self.logprobs and self.multiple_logprobs) else None
tmp = self.nli_scorer._semantic_entropy_process(candidates=candidates, i=i, logprobs_results=candidate_logprobs)

Collaborator

Perhaps we enable computation of both simultaneously? Let me know what you think. It's barely any extra time/effort to compute both after NLI clustering is done
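
To illustrate why computing both is cheap once clustering is done, here is a rough sketch. It assumes the clustering step yields `cluster_indices` as a list of lists of response indices; the helper names (`cluster_probabilities`, `entropy`, `both_entropies`) are hypothetical and are not the NLIScorer methods.

```python
import math
from typing import List, Optional, Tuple


def cluster_probabilities(weights: List[float], cluster_indices: List[List[int]]) -> List[float]:
    """Sum the response weights inside each cluster and renormalize."""
    sums = [sum(weights[i] for i in cluster) for cluster in cluster_indices]
    total = sum(sums)
    return [s / total for s in sums]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def both_entropies(
    cluster_indices: List[List[int]],
    num_responses: int,
    token_weights: Optional[List[float]] = None,
) -> Tuple[float, Optional[float]]:
    """Reuse one clustering for the discrete and the token-based entropy."""
    uniform = [1.0 / num_responses] * num_responses
    discrete = entropy(cluster_probabilities(uniform, cluster_indices))
    token_based = None
    if token_weights is not None:
        token_based = entropy(cluster_probabilities(token_weights, cluster_indices))
    return discrete, token_based


# Four responses: three agree with each other, one is an outlier.
clusters = [[0, 1, 2], [3]]
discrete, token_based = both_entropies(clusters, 4, token_weights=[0.4, 0.3, 0.2, 0.1])
```

The only per-variant work is a weighted sum over the cluster members, so the expensive NLI calls are shared between the two scores.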

@dylanbouchard dylanbouchard linked an issue Jun 30, 2025 that may be closed by this pull request
Comment thread uqlm/black_box/nli.py Outdated
clustered_responses, cluster_indices, nli_scores = self._cluster_responses(responses=candidates, response_probabilities=response_probabilities)
# Compute discrete semantic entropy
cluster_probabilities = self._compute_cluster_probabilities(response_probabilities=response_probabilities, cluster_indices=cluster_indices)
best_response = clustered_responses[cluster_probabilities.index(max(cluster_probabilities))][0]

Collaborator

Let's have this be the default calculation for best response

Comment thread uqlm/black_box/nli.py Outdated
tokenprob_semantic_entropy = None
if tokenprob_response_probabilities:
    tokenprob_cluster_probabilities = self._compute_cluster_probabilities(response_probabilities=tokenprob_response_probabilities, cluster_indices=cluster_indices)
    best_response = clustered_responses[tokenprob_cluster_probabilities.index(max(tokenprob_cluster_probabilities))][0]

Collaborator

let's create a parameter that determines how best response is selected

Collaborator

This can be used if users deviate from default selection approach
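
One way such a parameter could work is sketched below. The parameter name `best_response_selection` and the values "discrete" and "token-based" are the ones settled on later in this thread; the function `select_best_response` itself and its fallback to discrete selection when token-based probabilities are missing are assumptions made for illustration.

```python
from typing import List, Optional


def select_best_response(
    clustered_responses: List[List[str]],
    discrete_cluster_probs: List[float],
    tokenprob_cluster_probs: Optional[List[float]] = None,
    best_response_selection: str = "discrete",
) -> str:
    """Return the first response of the most probable cluster."""
    if best_response_selection not in ("discrete", "token-based"):
        raise ValueError("best_response_selection must be 'discrete' or 'token-based'")
    use_token = best_response_selection == "token-based" and tokenprob_cluster_probs is not None
    # Assumed behavior: fall back to discrete when no token-based probabilities exist.
    probs = tokenprob_cluster_probs if use_token else discrete_cluster_probs
    return clustered_responses[probs.index(max(probs))][0]


clustered = [["Paris", "It is Paris"], ["Lyon"]]
discrete_probs = [2 / 3, 1 / 3]
token_probs = [0.4, 0.6]

best_discrete = select_best_response(clustered, discrete_probs, token_probs)
best_token = select_best_response(clustered, discrete_probs, token_probs, "token-based")
best_fallback = select_best_response(clustered, discrete_probs, None, "token-based")
```

In this toy example the two criteria disagree: the larger cluster wins under discrete selection, while the smaller but higher-probability cluster wins under token-based selection.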

Comment thread uqlm/black_box/nli.py Outdated
best_response, semantic_negentropy, scores = tmp
all_logprobs = [self.logprobs[i]] + self.multiple_logprobs[i] if (self.logprobs and self.multiple_logprobs) else None
tmp = self._semantic_entropy_process(candidates=all_responses, i=i, logprobs_results=all_logprobs)
best_response, semantic_negentropy, scores, tokenprob_semantic_entropy = tmp

Collaborator

Let's rename semantic_entropy -> discrete entropy. This is a better naming convention and more consistent. Please note this line will need to be updated as well:
https://github.com/cvs-health/uqlm/blob/main/uqlm/scorers/black_box.py#L164

Comment thread uqlm/black_box/nli.py
def _compute_response_probabilities(self, logprobs_results: List[List[Dict[str, Any]]], num_responses: int = None) -> List[float]:
    """Compute response probabilities"""
    uniform_response_probabilities = [1 / num_responses] * num_responses
    tokenprob_response_probabilities = [self.avg_logprob(logprobs_i) if logprobs_i else np.nan for logprobs_i in logprobs_results] if logprobs_results else None

Collaborator

Let's update the token-probability-based response probabilities as discussed.

Comment thread uqlm/black_box/nli.py
    Helper function to compute semantic entropy score from cluster probabilities
    """
-   return abs(sum([p * math.log(p) for p in cluster_probabilities]))
+   return abs(sum([p * math.log(p) if p > 0.0 else 0 for p in cluster_probabilities]))

Collaborator

Is it possible that a cluster has a non-positive probability? I don't think that should be possible

Collaborator

@mohitcek what do you think about this?
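
For context on the question, the sketch below shows what the guard changes. Whether a zero cluster probability can actually occur in uqlm is not settled in this thread; the example only demonstrates one hypothetical route, floating-point underflow of a very small probability, and the function names are illustrative.

```python
import math
from typing import List


def entropy_unguarded(cluster_probabilities: List[float]) -> float:
    # Original form: math.log(0.0) raises ValueError.
    return abs(sum([p * math.log(p) for p in cluster_probabilities]))


def entropy_guarded(cluster_probabilities: List[float]) -> float:
    # Guarded form: a zero-probability cluster contributes 0,
    # matching the limit of p * log(p) as p approaches 0.
    return abs(sum([p * math.log(p) if p > 0.0 else 0 for p in cluster_probabilities]))


# exp(-800) is below the smallest positive float, so it underflows to 0.0.
tiny = math.exp(-800.0)
probs = [1.0 - tiny, tiny]

try:
    entropy_unguarded(probs)
    unguarded_failed = False
except ValueError:
    unguarded_failed = True

guarded_value = entropy_guarded(probs)
```

With uniform (discrete) weights every non-empty cluster has positive probability, so the guard would only matter on the token-based path, if at all.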

Comment thread uqlm/black_box/nli.py Outdated
@staticmethod
def avg_logprob(logprobs: List[Dict[str, Any]]) -> float:
    "Compute average logprob"
    return np.mean([np.exp(d["logprob"]) for d in logprobs])

Collaborator

let's update this as discussed
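
The thread does not record what was discussed, so the following is only a comparison of common options, not the change that was made. Note that the snippet above exponentiates each logprob before averaging, so despite its name it returns a mean token probability rather than a mean logprob. The function names below are hypothetical.

```python
import math
from typing import Any, Dict, List


def mean_token_probability(logprobs: List[Dict[str, Any]]) -> float:
    """Arithmetic mean of token probabilities (what the snippet above computes)."""
    return sum(math.exp(d["logprob"]) for d in logprobs) / len(logprobs)


def length_normalized_probability(logprobs: List[Dict[str, Any]]) -> float:
    """exp of the mean logprob, i.e. the geometric mean of token probabilities."""
    return math.exp(sum(d["logprob"] for d in logprobs) / len(logprobs))


def joint_probability(logprobs: List[Dict[str, Any]]) -> float:
    """exp of the summed logprobs, i.e. the probability of the whole sequence."""
    return math.exp(sum(d["logprob"] for d in logprobs))


# One confident token and one unlikely token.
tokens = [{"logprob": math.log(0.9)}, {"logprob": math.log(0.1)}]

arithmetic = mean_token_probability(tokens)
geometric = length_normalized_probability(tokens)
joint = joint_probability(tokens)
```

The arithmetic mean is the least sensitive to a single unlikely token, the joint probability penalizes longer responses, and the geometric mean sits between the two.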

@mohitcek mohitcek requested a review from dylanbouchard July 10, 2025 17:10
Comment thread uqlm/scorers/entropy.py
Comment on lines +124 to +127
if hasattr(self.llm, "logprobs"):
    print("UQLM: Using logprobs to compute response probabilities for semantic entropy score")
    self.llm.logprobs = True

Collaborator

How about we instead check if logprobs is not available and warn that only Discrete Semantic Entropy will be used. Maybe something like this:

        if not hasattr(self.llm, "logprobs"):
            warnings.warn("The provided LLM does not support logprobs access. Only discrete semantic entropy will be computed.")
        else:    
            self.llm.logprobs = True

Contributor Author

done
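
A self-contained version of the suggested check might look like the sketch below. The two stand-in LLM classes and the helper `enable_logprobs` are hypothetical; only the `hasattr` test, the warning text, and setting `logprobs = True` come from the suggestion above.

```python
import warnings


class NoLogprobsLLM:
    """Stand-in for an LLM object without a logprobs attribute."""


class LogprobsLLM:
    """Stand-in for an LLM object that exposes a logprobs switch."""

    logprobs = False


def enable_logprobs(llm) -> bool:
    """Turn on logprobs if supported; return whether token-based entropy is possible."""
    if not hasattr(llm, "logprobs"):
        warnings.warn(
            "The provided LLM does not support logprobs access. "
            "Only discrete semantic entropy will be computed."
        )
        return False
    llm.logprobs = True
    return True


llm = LogprobsLLM()
token_based_available = enable_logprobs(llm)
```

Using `warnings.warn` instead of `print` lets callers filter or escalate the message with the standard `warnings` machinery.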

Comment thread uqlm/scorers/entropy.py
Comment thread uqlm/scorers/entropy.py Outdated
Comment on lines +63 to +65
best_response_selection : Callable, default=None
    Specifies the function to select the best response from the clustered responses.
    If None, the default function will be used.

Collaborator

Is this parameter actually needed for this class?

Contributor Author

Removed this callable attribute from entropy.py and nli.py. The name best_response_selection is now used for what was the entropy_type variable.

Comment thread uqlm/scorers/entropy.py Outdated
Comment on lines +60 to +62
entropy_type : str, default="discrete"
    Specifies the type of entropy confidence score to compute best response. Must be one of "discrete" or "token-level".

Collaborator

Since we are returning both entropy types, we should rename this parameter to best_response_selection. Also, can we replace "token-level" with "token-based"?

Contributor Author

done

@mohitcek mohitcek requested a review from dylanbouchard July 14, 2025 17:59
@dylanbouchard dylanbouchard merged commit 71ec419 into cvs-health:develop Jul 14, 2025
16 checks passed
nklswld pushed a commit to nklswld/uqlm_white_box_scorer that referenced this pull request Nov 7, 2025
* update NLIScorer to handle logprobs

* update SemanticEntropy class to input logprobs to NLIScorer methods

* changes related to edge cases and minor refactoring

* updated unit tests

* ruff formatting

* remove changes from BlackBoxUQ

* nliscorer class returns both SE scores

* update and rerun demo notebook

* ruff format nli.py file

* updates based on reviewer's comment

* update unit tests

* updated example notebook

* ruff formating

* ruff format nli module

* renamed entropy_type variable and additional changes

* ruff formatting
Development

Successfully merging this pull request may close these issues.

Enable token-probability-based semantic entropy