https://github.com/asudjianto-xml/SDM-RF/blob/main/sdm_RF.pdf
Author: Agus Sudjianto (H2O.ai & University of North Carolina Charlotte)
Date: September 2025
This paper proposes a computational enhancement to Halperin's Semantic Divergence Metrics (SDM) framework for hallucination detection in Large Language Models. While SDM represents a significant theoretical advance in prompt-aware hallucination detection, its reliance on hierarchical clustering, with O(S² log S) time complexity and O(S²) memory for the distance matrix, may limit practical deployment. We introduce Random Forest-based topic modeling as an alternative clustering approach, reducing complexity to O(S log S) while preserving the framework's theoretical integrity and diagnostic capabilities.
- Clustering Complexity Reduction: From O(S² log S) to O(S log S)
- Memory Efficiency: Eliminates need for full distance matrices
- Scalable Implementation: Enables real-time analysis of large conversational contexts
- Information-Theoretic Validity: Proves Random Forest partitions preserve mutual information properties required by SDM
- Consistency Guarantees: Establishes convergence properties under standard Random Forest assumptions
- Semantic Coherence: Shows that tree leaves in embedding space correspond to semantic topics
- Vectorized Operations: Batch processing of all sentences simultaneously
- Amortized Computation: Tree traversal costs amortized across ensemble calculations
- Sparse Matrix Utilization: Efficient topic distribution computation (see the sketch after this list)
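As a concrete illustration of these optimizations, the sketch below assigns leaves to all sentences in one batched call and stores the resulting topic indicators in a sparse matrix. This is a minimal sketch assuming scikit-learn and SciPy; the function name and shapes are illustrative, not the paper's reference implementation.

```python
# Sketch: batch leaf assignment plus sparse topic-indicator storage.
# Assumes a fitted scikit-learn RandomForestClassifier; names are illustrative.
import numpy as np
from scipy.sparse import csr_matrix

def sparse_topic_indicators(forest, embeddings):
    """One-hot encode leaf assignments for every sentence in a single pass."""
    leaves = forest.apply(embeddings)      # (S, B): all sentences, all trees at once
    S, B = leaves.shape
    n_leaves = int(leaves.max()) + 1       # upper bound on per-tree leaf index
    # Entry (i, b * n_leaves + l) is 1 when sentence i lands in leaf l of tree b.
    rows = np.repeat(np.arange(S), B)
    cols = (np.arange(B) * n_leaves + leaves).ravel()
    data = np.ones(S * B)
    return csr_matrix((data, (rows, cols)), shape=(S, B * n_leaves))
```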
Halperin's SDM framework provides prompt-aware hallucination detection through:
- Two-dimensional analysis: Semantic Instability (S_H) and Semantic Exploration (KL divergence), both illustrated in the sketch after this list
- Joint clustering: Shared topic space for prompts and answers
- Semantic Box: Four-quadrant classification system
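Both axes reduce to divergences between topic distributions over the shared topic space. The toy computation below shows the underlying quantities with SciPy, assuming normalized topic histograms; the direction of the KL divergence is illustrative, not a statement of the paper's convention.

```python
# Toy illustration of SDM's two axes as divergences over a shared topic space.
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

p = np.array([0.6, 0.3, 0.1])  # prompt topic distribution (toy values)
q = np.array([0.2, 0.3, 0.5])  # answer topic distribution (toy values)

js_div = jensenshannon(p, q, base=2) ** 2  # SciPy returns the distance, i.e. the square root
kl_div = entropy(q, p, base=2)             # KL(q || p); direction is illustrative
print(js_div, kl_div)
```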
Input: Prompt embeddings, Answer embeddings
1. Create joint binary classification dataset
2. Train Random Forest with controlled depth
3. Extract leaf assignments as topic clusters
4. Apply original SDM metrics (JS divergence, KL divergence)
5. Compute final S_H score
Algorithm 1: RF-Enhanced SDM (see the sketch after this list)
- Joint dataset creation from prompt/answer embeddings
- Random Forest training with entropy criterion
- Topic assignment via leaf indices
- Standard SDM metric computation
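A minimal sketch of this clustering stage (steps 1-3), assuming sentence embeddings as NumPy arrays and scikit-learn's RandomForestClassifier; all names and the fixed hyperparameters are illustrative placeholders for the S-dependent settings given later.

```python
# Sketch of Algorithm 1's clustering steps; names are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_topic_clusters(prompt_emb, answer_emb):
    # Step 1: joint binary dataset (0 = prompt sentence, 1 = answer sentence).
    X = np.vstack([prompt_emb, answer_emb])
    y = np.concatenate([np.zeros(len(prompt_emb)), np.ones(len(answer_emb))])

    # Step 2: Random Forest with entropy criterion and controlled depth
    # (fixed values here; see the S-dependent hyperparameters below).
    forest = RandomForestClassifier(
        n_estimators=100,
        criterion="entropy",
        max_depth=5,
        min_samples_leaf=5,
        random_state=0,
    ).fit(X, y)

    # Step 3: leaf assignments act as topic clusters for every sentence.
    topics = forest.apply(X)  # shape (n_sentences, n_estimators)

    # Steps 4-5: apply the original SDM metrics (JS/KL, S_H) to `topics`.
    return topics, forest
```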
Algorithm 2: Enhanced Ensemble JS Computation
- Single-pass leaf assignment for all sentences
- Vectorized ensemble computation across paraphrase groups
- Complexity: O(S × B + M × k) vs O(M × k × log k) (S = sentences, B = trees, M = paraphrase groups, k = topics; see the sketch below)
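A hedged sketch of the computation pattern: one batched apply() call per sentence group amortizes tree traversal, and the per-tree JS divergences are averaged over the ensemble. The grouping into two sets (e.g., two paraphrase groups) and all names are assumptions for illustration.

```python
# Sketch of Algorithm 2's pattern: single-pass leaf assignment, then
# JS divergence per tree, averaged across the ensemble.
import numpy as np
from scipy.spatial.distance import jensenshannon

def ensemble_js(forest, group_a, group_b):
    la = forest.apply(group_a)  # (S_a, B) leaf indices in one batched pass
    lb = forest.apply(group_b)  # (S_b, B)
    scores = []
    for t in range(la.shape[1]):  # traversal cost already paid; loop is over counts only
        n = int(max(la[:, t].max(), lb[:, t].max())) + 1
        pa = np.bincount(la[:, t], minlength=n).astype(float)
        pb = np.bincount(lb[:, t], minlength=n).astype(float)
        scores.append(jensenshannon(pa / pa.sum(), pb / pb.sum(), base=2) ** 2)
    return float(np.mean(scores))
```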
For any internal node t, the entropy-criterion impurity decrease satisfies Δ(t) = P(t) · I(Y; Bₜ | t), where P(t) is the fraction of samples reaching t and Bₜ is the branch taken at t. Summing over all internal nodes telescopes to Σₜ Δ(t) = I(Y; Leaf), as spelled out below.
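Written out, the identity is the standard telescoping of entropies over the tree; a sketch in the notation above:

```latex
% Impurity decrease at an internal node t with branch variable B_t:
\Delta(t) = P(t)\Big[H(Y \mid t) - \sum_b P(B_t = b \mid t)\, H(Y \mid t, B_t = b)\Big]
          = P(t)\, I(Y; B_t \mid t)

% Summing over all internal nodes, intermediate conditional entropies cancel:
\sum_t \Delta(t) = H(Y) - \sum_{\ell \in \mathrm{Leaves}} P(\ell)\, H(Y \mid \ell)
                 = I(Y; \mathrm{Leaf})
```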
The Random Forest estimator converges in L₁ under appropriate conditions.
- Tree Depth: max_depth = max(5, ⌊log₂(S/10)⌋)
- Minimum Leaf Size: min_samples_leaf = max(5, S/100)
- Ensemble Size: n_estimators = 100-200 (see the sketch below)
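These settings can be packaged as a small factory; a sketch assuming scikit-learn, with S the number of sentences in the joint dataset:

```python
# Sketch: the recommended hyperparameters as a function of corpus size S.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def make_sdm_forest(S, n_estimators=100):
    return RandomForestClassifier(
        n_estimators=n_estimators,                         # 100-200 recommended
        criterion="entropy",                               # matches the information-theoretic analysis
        max_depth=max(5, int(np.floor(np.log2(S / 10)))),  # max(5, floor(log2(S/10)))
        min_samples_leaf=max(5, S // 100),                 # max(5, S/100)
    )
```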
The enhancement serves as a drop-in replacement for the clustering component while maintaining full compatibility with the original SDM pipeline:
```python
class SDMFramework:
    def __init__(self, clustering_method='hierarchical'):
        self.clustering_method = clustering_method

    def compute_sdm(self, prompt_embeddings, answer_embeddings):
        # Only the clustering backend changes; metrics are computed identically.
        if self.clustering_method == 'random_forest':
            topics = self._rf_cluster(prompt_embeddings, answer_embeddings)
        else:
            topics = self._hierarchical_cluster(prompt_embeddings, answer_embeddings)
        return self._compute_metrics(topics)  # same downstream processing
```

- Scalability: Handles large document collections and multi-turn conversations
- Real-time Deployment: Enables streaming analysis in production systems
- Memory Efficiency: No distance matrix storage requirements
- Preserves SDM Properties: Information-theoretic foundations remain intact
- Maintains Diagnostic Power: Semantic Box classification continues to apply
- Enhanced Interpretability: Tree-based decisions provide interpretable semantic boundaries
- Modular Design: Easy integration with existing SDM implementations
- Production Ready: Addresses computational bottlenecks for deployment
- Broader Adoption: Makes SDM accessible for resource-constrained applications
- Comparative Analysis: Test both clustering approaches on original SDM benchmarks
- Correlation Assessment: Measure correlation between hierarchical and RF-based S_H scores (see the sketch after this list)
- Detection Performance: Evaluate hallucination detection accuracy
- Scaling Validation: Confirm performance benefits in realistic scenarios
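For the correlation assessment in particular, a minimal sketch assuming two aligned arrays of per-example S_H scores; the helper name is illustrative:

```python
# Sketch: agreement between hierarchical and RF-based S_H scores.
from scipy.stats import pearsonr, spearmanr

def score_agreement(s_h_hierarchical, s_h_random_forest):
    rho, _ = spearmanr(s_h_hierarchical, s_h_random_forest)  # rank agreement
    r, _ = pearsonr(s_h_hierarchical, s_h_random_forest)     # linear agreement
    return {"spearman": rho, "pearson": r}
```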
- Discretization Effects: Tree-based partitioning may introduce additional discretization
- Hyperparameter Sensitivity: Performance depends on appropriate depth/leaf size selection
- Embedding Space Assumptions: Assumes axis-aligned splits correspond to semantic boundaries
This enhancement is designed to complement, not replace, the original SDM methodology. Hierarchical clustering remains valuable for:
- Research applications with abundant computational resources
- Scenarios requiring maximum clustering fidelity
- Comparative analysis and method validation
@article{sudjianto2025scaling,
  title={Scaling Semantic Divergence Metrics: A Random Forest Enhancement for Practical Deployment},
  author={Sudjianto, Agus},
  url={https://github.com/asudjianto-xml/SDM-RF/blob/main/sdm_RF.pdf},
  year={2025}
}

- Halperin, I. (2025). Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models. SSRN.
- Breiman, L. (2001). Random Forests. Machine Learning, 45(1):5-32.
- Farquhar, S., et al. (2024). Detecting Hallucinations in Large Language Models Using Semantic Entropy. Nature, 630:625-630.
This work is dedicated to making Halperin's groundbreaking SDM framework accessible to the broader AI community. By removing computational barriers while preserving theoretical rigor, we hope to enable widespread adoption of SDM's sophisticated hallucination detection capabilities, ultimately improving LLM reliability across industries and applications.
The Random Forest enhancement serves as a bridge between academic innovation and practical deployment, ensuring that advanced research insights can benefit real-world AI systems at scale.
Agus Sudjianto
H2O.ai & University of North Carolina Charlotte
[email protected]