Enable Token probability based Semantic Entropy #76
There was a problem hiding this comment.
I would suggest the following changes:
- Black box uses only discrete entropy (so we are consistent across all black box scorers that no token probabilities are used).
- For the SemanticEntropy scorer class, let's compute only discrete entropy if logprobs are not available and compute both if they are.
- Let's enable simultaneous computation of discrete and token-based entropy in the NLI class.
| if self.use_nli: | ||
| compute_entropy = "semantic_negentropy" in self.scorers | ||
| nli_scores = self.nli_scorer.evaluate(responses=self.responses, sampled_responses=self.sampled_responses, use_best=self.use_best, compute_entropy=compute_entropy) | ||
| responses_logprobs = self.logprobs if hasattr(self.llm, "logprobs") else None |
There was a problem hiding this comment.
My preference is that for black box we avoid using token probabilities altogether. Let's just stick with discrete entropy here.
| best_responses[i], semantic_entropy[i], scores = tmp | ||
| candidate_logprobs = [self.logprobs[i]] + self.multiple_logprobs[i] if (self.logprobs and self.multiple_logprobs) else None | ||
| tmp = self.nli_scorer._semantic_entropy_process(candidates=candidates, i=i, logprobs_results=candidate_logprobs) |
There was a problem hiding this comment.
Perhaps we enable computation of both simultaneously? Let me know what you think. It's barely any extra time/effort to compute both after NLI clustering is done
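As a rough illustration of why computing both is cheap once clustering is done, the sketch below reuses a single set of cluster assignments for both scores. The function names (`cluster_probabilities`, `entropy`, `both_entropies`) and the input layout (one list of response indices per cluster) are assumptions made for this example, not the PR's actual code.

```python
import math
from typing import List, Optional, Tuple


def cluster_probabilities(response_probs: List[float], cluster_indices: List[List[int]]) -> List[float]:
    """Sum response probabilities within each cluster, then normalize to 1."""
    totals = [sum(response_probs[j] for j in idx) for idx in cluster_indices]
    norm = sum(totals)
    return [t / norm for t in totals]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def both_entropies(
    cluster_indices: List[List[int]],
    num_responses: int,
    tokenprob_response_probs: Optional[List[float]] = None,
) -> Tuple[float, Optional[float]]:
    """Return (discrete, token_based); token_based is None without logprob-derived inputs."""
    # Discrete variant: every sampled response carries equal weight.
    uniform = [1 / num_responses] * num_responses
    discrete = entropy(cluster_probabilities(uniform, cluster_indices))

    # Token-based variant: same clusters, different response weights.
    token_based = None
    if tokenprob_response_probs is not None:
        token_based = entropy(cluster_probabilities(tokenprob_response_probs, cluster_indices))
    return discrete, token_based


# Hypothetical example: three responses, the first two clustered together.
discrete, token_based = both_entropies([[0, 1], [2]], 3, [0.5, 0.3, 0.2])
```

The only per-variant work is one weighted sum over clusters, so the expensive NLI clustering step is not repeated.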
| clustered_responses, cluster_indices, nli_scores = self._cluster_responses(responses=candidates, response_probabilities=response_probabilities) | ||
| # Compute discrete semantic entropy | ||
| cluster_probabilities = self._compute_cluster_probabilities(response_probabilities=response_probabilities, cluster_indices=cluster_indices) | ||
| best_response = clustered_responses[cluster_probabilities.index(max(cluster_probabilities))][0] |
There was a problem hiding this comment.
Let's have this be the default calculation for best response
| tokenprob_semantic_entropy = None | ||
| if tokenprob_response_probabilities: | ||
| tokenprob_cluster_probabilities = self._compute_cluster_probabilities(response_probabilities=tokenprob_response_probabilities, cluster_indices=cluster_indices) | ||
| best_response = clustered_responses[tokenprob_cluster_probabilities.index(max(tokenprob_cluster_probabilities))][0] |
There was a problem hiding this comment.
let's create a parameter that determines how best response is selected
There was a problem hiding this comment.
This can be used if users deviate from default selection approach
| best_response, semantic_negentropy, scores = tmp | ||
| all_logprobs = [self.logprobs[i]] + self.multiple_logprobs[i] if (self.logprobs and self.multiple_logprobs) else None | ||
| tmp = self._semantic_entropy_process(candidates=all_responses, i=i, logprobs_results=all_logprobs) | ||
| best_response, semantic_negentropy, scores, tokenprob_semantic_entropy = tmp |
There was a problem hiding this comment.
Let's rename semantic_entropy -> discrete entropy. This will be a better naming convention and more consistent. Please note that this line will need to be updated as well:
https://github.com/cvs-health/uqlm/blob/main/uqlm/scorers/black_box.py#L164
| def _compute_response_probabilities(self, logprobs_results: List[List[Dict[str, Any]]], num_responses: int = None) -> List[float]: | ||
| """Compute response probabilities""" | ||
| uniform_response_probabilities = [1 / num_responses] * num_responses | ||
| tokenprob_response_probabilities = [self.avg_logprob(logprobs_i) if logprobs_i else np.nan for logprobs_i in logprobs_results] if logprobs_results else None |
There was a problem hiding this comment.
Let's update the token-probability-based response probabilities as discussed.
| Helper function to compute semantic entropy score from cluster probabilities | ||
| """ | ||
| return abs(sum([p * math.log(p) for p in cluster_probabilities])) | ||
| return abs(sum([p * math.log(p) if p > 0.0 else 0 for p in cluster_probabilities])) |
There was a problem hiding this comment.
Is it possible that a cluster has a non-positive probability? I don't think that should be possible
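For context on this question, the sketch below shows one hypothetical way a zero could reach the helper: with uniform weights every non-empty cluster has probability of at least 1/n, but a probability derived from token logprobs could underflow to 0.0 in floating point. Whether that actually happens in this code path is an assumption, not something the thread confirms; the two function names are illustrative.

```python
import math
from typing import List


def entropy_unguarded(cluster_probabilities: List[float]) -> float:
    """Original form of the helper: fails if any probability is exactly 0."""
    return abs(sum(p * math.log(p) for p in cluster_probabilities))


def entropy_guarded(cluster_probabilities: List[float]) -> float:
    """Guarded form from the diff: p * log(p) tends to 0 as p tends to 0, so zeros are skipped."""
    return abs(sum(p * math.log(p) if p > 0.0 else 0 for p in cluster_probabilities))


# Hypothetical case: a very unlikely response whose probability underflows.
underflowed = math.exp(-1000.0)  # evaluates to 0.0 in double precision
probs = [1.0 - underflowed, underflowed]

try:
    entropy_unguarded(probs)
except ValueError as err:
    # math.log(0.0) raises "math domain error"
    print(f"unguarded helper failed: {err}")

print(entropy_guarded(probs))
```

If cluster probabilities are guaranteed to be strictly positive, the guard is harmless but unnecessary; it only matters for degenerate inputs like the one above.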
| @staticmethod | ||
| def avg_logprob(logprobs: List[Dict[str, Any]]) -> float: | ||
| "Compute average logprob" | ||
| return np.mean([np.exp(d["logprob"]) for d in logprobs]) |
There was a problem hiding this comment.
let's update this as discussed
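The agreed-upon change is not recorded in this thread, so the sketch below does not attempt to reproduce it. It only makes explicit what the diff line computes (despite its name, `avg_logprob` returns the mean of token probabilities) and contrasts it with two other common ways of scoring a response from its token logprobs. All three function names are illustrative assumptions, and plain `math` is used in place of `numpy` to keep the example dependency-free.

```python
import math
from typing import Any, Dict, List


def mean_token_probability(logprobs: List[Dict[str, Any]]) -> float:
    """Arithmetic mean of token probabilities (what the diff line computes)."""
    return sum(math.exp(d["logprob"]) for d in logprobs) / len(logprobs)


def length_normalized_probability(logprobs: List[Dict[str, Any]]) -> float:
    """exp(mean logprob): the geometric mean of token probabilities."""
    return math.exp(sum(d["logprob"] for d in logprobs) / len(logprobs))


def joint_probability(logprobs: List[Dict[str, Any]]) -> float:
    """exp(sum of logprobs): probability of the whole sequence; shrinks with length."""
    return math.exp(sum(d["logprob"] for d in logprobs))


# Hypothetical two-token response with token probabilities 0.9 and 0.4.
tokens = [
    {"token": "Paris", "logprob": math.log(0.9)},
    {"token": ".", "logprob": math.log(0.4)},
]
scores = (
    mean_token_probability(tokens),
    length_normalized_probability(tokens),
    joint_probability(tokens),
)
```

The three scores differ on the same input, so whichever definition is chosen affects the token-based cluster probabilities downstream.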
| if hasattr(self.llm, "logprobs"): | ||
| print("UQLM: Using logprobs to compute response probabilities for semantic entropy score") | ||
| self.llm.logprobs = True | ||
There was a problem hiding this comment.
How about we instead check if logprobs is not available and warn that only Discrete Semantic Entropy will be used. Maybe something like this:
if not hasattr(self.llm, "logprobs"):
    warnings.warn("The provided LLM does not support logprobs access. Only discrete semantic entropy will be computed.")
else:
    self.llm.logprobs = True
| best_response_selection : Callable, default=None | ||
| Specifies the function to select the best response from the clustered responses. | ||
| If None, the default function will be used. |
There was a problem hiding this comment.
Is this parameter actually needed for this class?
There was a problem hiding this comment.
Removed this callable attribute from entropy.py and nli.py. This variable name is now used for the entropy_type variable.
| entropy_type : str, default="discrete" | ||
| Specifies the type of entropy confidence score to compute best response. Must be one of "discrete" or "token-level". | ||
There was a problem hiding this comment.
Since we are returning both entropy types, we should rename this parameter to best_response_selection. Also, can we replace "token-level" with "token-based"?
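A minimal sketch of how such a parameter could drive selection, assuming the two string values proposed above. The function name `select_best_response`, its signature, and the fallback to discrete probabilities when token-based ones are unavailable are illustrative assumptions, not the merged implementation.

```python
from typing import List, Optional


def select_best_response(
    clustered_responses: List[List[str]],
    discrete_cluster_probs: List[float],
    tokenprob_cluster_probs: Optional[List[float]] = None,
    best_response_selection: str = "discrete",
) -> str:
    """Return the first response of the most probable cluster."""
    if best_response_selection not in ("discrete", "token-based"):
        raise ValueError('best_response_selection must be "discrete" or "token-based"')

    # Assumed behavior: fall back to discrete when no token-based probabilities exist.
    if best_response_selection == "token-based" and tokenprob_cluster_probs is not None:
        probs = tokenprob_cluster_probs
    else:
        probs = discrete_cluster_probs

    return clustered_responses[probs.index(max(probs))][0]


# Hypothetical clusters where the two weightings disagree.
clusters = [["Paris", "It is Paris"], ["Lyon"]]
by_discrete = select_best_response(clusters, [2 / 3, 1 / 3], [0.3, 0.7])
by_tokens = select_best_response(clusters, [2 / 3, 1 / 3], [0.3, 0.7], best_response_selection="token-based")
```

Because both entropy values are returned regardless, the parameter only changes which cluster supplies the best response.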
* update NLIScorer to handle logprobs
* update SemanticEntropy class to input logprobs to NLIScorer methods
* changes related to edge cases and minor refactoring
* updated unit tests
* ruff formatting
* remove changes from BlackBoxUQ
* nliscorer class returns both SE scores
* update and rerun demo notebook
* ruff format nli.py file
* updates based on reviewer's comment
* update unit tests
* updated example notebook
* ruff formating
* ruff format nli module
* renamed entropy_type variable and additional changes
* ruff formatting
This PR updates the NLIScorer class to enable computation of the Semantic Entropy score using token probabilities (Issue #24). The BlackBoxUQ and SemanticEntropy classes are updated accordingly to reflect the relevant changes. discrete is deprecated; users can now provide logprobs directly, and if logprobs are not provided, uqlm falls back to the discrete approach.
To see the implementation of different scenarios, refer to these notebooks on a different branch, which include print statements at various points:
Semantic Entropy
Black Box