https://github.com/asudjianto-xml/SDM-RF/blob/main/sdm_RF.pdf
Author: Agus Sudjianto (H2O.ai & University of North Carolina Charlotte)
Date: September 2025
This paper proposes a computational enhancement to Halperin's Semantic Divergence Metrics (SDM) framework for hallucination detection in Large Language Models. While SDM represents a significant theoretical advance in prompt-aware hallucination detection, its reliance on hierarchical clustering, with O(S² log S) time complexity and O(S²) memory for the distance matrix, may limit practical deployment. We introduce Random Forest-based topic modeling as an alternative clustering approach, reducing complexity to O(S log S) while preserving the framework's theoretical integrity and diagnostic capabilities.
- Clustering Complexity Reduction: From O(S² log S) to O(S log S)
- Memory Efficiency: Eliminates need for full distance matrices
- Scalable Implementation: Enables real-time analysis of large conversational contexts
- Information-Theoretic Validity: Proves Random Forest partitions preserve mutual information properties required by SDM
- Consistency Guarantees: Establishes convergence properties under standard Random Forest assumptions
- Semantic Coherence: Shows that tree leaves in embedding space correspond to semantic topics
- Vectorized Operations: Batch processing of all sentences simultaneously
- Amortized Computation: Tree traversal costs amortized across ensemble calculations
- Sparse Matrix Utilization: Efficient topic distribution computation (see the sketch after this list)
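As a concrete illustration of these optimizations, the sketch below assigns leaves to all sentences in one batched call and stores the resulting topic indicators in a sparse matrix. This is a minimal sketch assuming scikit-learn and SciPy; the function name and shapes are illustrative, not the paper's reference implementation.

```python
# Sketch: batch leaf assignment plus sparse topic-indicator storage.
# Assumes a fitted scikit-learn RandomForestClassifier; names are illustrative.
import numpy as np
from scipy.sparse import csr_matrix

def sparse_topic_indicators(forest, embeddings):
    """One-hot encode leaf assignments for every sentence in a single pass."""
    leaves = forest.apply(embeddings)      # (S, B): all sentences, all trees at once
    S, B = leaves.shape
    n_leaves = int(leaves.max()) + 1       # upper bound on per-tree leaf index
    # Entry (i, b * n_leaves + l) is 1 when sentence i lands in leaf l of tree b.
    rows = np.repeat(np.arange(S), B)
    cols = (np.arange(B) * n_leaves + leaves).ravel()
    data = np.ones(S * B)
    return csr_matrix((data, (rows, cols)), shape=(S, B * n_leaves))
```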
Halperin's SDM framework provides prompt-aware hallucination detection through:
- Two-dimensional analysis: Semantic Instability (S_H) and Semantic Exploration (KL divergence), both illustrated in the sketch after this list
- Joint clustering: Shared topic space for prompts and answers
- Semantic Box: Four-quadrant classification system
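Both axes reduce to divergences between topic distributions over the shared topic space. The toy computation below shows the underlying quantities with SciPy, assuming normalized topic histograms; the direction of the KL divergence is illustrative, not a statement of the paper's convention.

```python
# Toy illustration of SDM's two axes as divergences over a shared topic space.
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

p = np.array([0.6, 0.3, 0.1])  # prompt topic distribution (toy values)
q = np.array([0.2, 0.3, 0.5])  # answer topic distribution (toy values)

js_div = jensenshannon(p, q, base=2) ** 2  # SciPy returns the distance, i.e. the square root
kl_div = entropy(q, p, base=2)             # KL(q || p); direction is illustrative
print(js_div, kl_div)
```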
Input: Prompt embeddings, Answer embeddings
1. Create joint binary classification dataset
2. Train Random Forest with controlled depth
3. Extract leaf assignments as topic clusters
4. Apply original SDM metrics (JS divergence, KL divergence)
5. Compute final S_H score
Algorithm 1: RF-Enhanced SDM (see the sketch after this list)
- Joint dataset creation from prompt/answer embeddings
- Random Forest training with entropy criterion
- Topic assignment via leaf indices
- Standard SDM metric computation
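A minimal sketch of this clustering stage (steps 1-3), assuming sentence embeddings as NumPy arrays and scikit-learn's RandomForestClassifier; all names and the fixed hyperparameters are illustrative placeholders for the S-dependent settings given later.

```python
# Sketch of Algorithm 1's clustering steps; names are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_topic_clusters(prompt_emb, answer_emb):
    # Step 1: joint binary dataset (0 = prompt sentence, 1 = answer sentence).
    X = np.vstack([prompt_emb, answer_emb])
    y = np.concatenate([np.zeros(len(prompt_emb)), np.ones(len(answer_emb))])

    # Step 2: Random Forest with entropy criterion and controlled depth
    # (fixed values here; see the S-dependent hyperparameters below).
    forest = RandomForestClassifier(
        n_estimators=100,
        criterion="entropy",
        max_depth=5,
        min_samples_leaf=5,
        random_state=0,
    ).fit(X, y)

    # Step 3: leaf assignments act as topic clusters for every sentence.
    topics = forest.apply(X)  # shape (n_sentences, n_estimators)

    # Steps 4-5: apply the original SDM metrics (JS/KL, S_H) to `topics`.
    return topics, forest
```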
Algorithm 2: Enhanced Ensemble JS Computation
- Single-pass leaf assignment for all sentences
- Vectorized ensemble computation across paraphrase groups
- Complexity: O(S × B + M × k) vs O(M × k × log k) (S = sentences, B = trees, M = paraphrase groups, k = topics; see the sketch below)
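A hedged sketch of the computation pattern: one batched apply() call per sentence group amortizes tree traversal, and the per-tree JS divergences are averaged over the ensemble. The grouping into two sets (e.g., two paraphrase groups) and all names are assumptions for illustration.

```python
# Sketch of Algorithm 2's pattern: single-pass leaf assignment, then
# JS divergence per tree, averaged across the ensemble.
import numpy as np
from scipy.spatial.distance import jensenshannon

def ensemble_js(forest, group_a, group_b):
    la = forest.apply(group_a)  # (S_a, B) leaf indices in one batched pass
    lb = forest.apply(group_b)  # (S_b, B)
    scores = []
    for t in range(la.shape[1]):  # traversal cost already paid; loop is over counts only
        n = int(max(la[:, t].max(), lb[:, t].max())) + 1
        pa = np.bincount(la[:, t], minlength=n).astype(float)
        pb = np.bincount(lb[:, t], minlength=n).astype(float)
        scores.append(jensenshannon(pa / pa.sum(), pb / pb.sum(), base=2) ** 2)
    return float(np.mean(scores))
```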
For any internal node t, the entropy-criterion impurity decrease satisfies Δ(t) = P(t) · I(Y; Bₜ | t), where P(t) is the fraction of samples reaching t and Bₜ is the branch taken at t. Summing over all internal nodes telescopes to Σₜ Δ(t) = I(Y; Leaf), as spelled out below.
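Written out, the identity is the standard telescoping of entropies over the tree; a sketch in the notation above:

```latex
% Impurity decrease at an internal node t with branch variable B_t:
\Delta(t) = P(t)\Big[H(Y \mid t) - \sum_b P(B_t = b \mid t)\, H(Y \mid t, B_t = b)\Big]
          = P(t)\, I(Y; B_t \mid t)

% Summing over all internal nodes, intermediate conditional entropies cancel:
\sum_t \Delta(t) = H(Y) - \sum_{\ell \in \mathrm{Leaves}} P(\ell)\, H(Y \mid \ell)
                 = I(Y; \mathrm{Leaf})
```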
The Random Forest estimator converges in L₁ under appropriate conditions.
- Tree Depth: max_depth = max(5, ⌊log₂(S/10)⌋)
- Minimum Leaf Size: min_samples_leaf = max(5, S/100)
- Ensemble Size: n_estimators = 100-200 (see the sketch below)
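These settings can be packaged as a small factory; a sketch assuming scikit-learn, with S the number of sentences in the joint dataset:

```python
# Sketch: the recommended hyperparameters as a function of corpus size S.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def make_sdm_forest(S, n_estimators=100):
    return RandomForestClassifier(
        n_estimators=n_estimators,                         # 100-200 recommended
        criterion="entropy",                               # matches the information-theoretic analysis
        max_depth=max(5, int(np.floor(np.log2(S / 10)))),  # max(5, floor(log2(S/10)))
        min_samples_leaf=max(5, S // 100),                 # max(5, S/100)
    )
```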
The enhancement serves as a drop-in replacement for the clustering component while maintaining full compatibility with the original SDM pipeline:
```python
class SDMFramework:
    def __init__(self, clustering_method='hierarchical'):
        self.clustering_method = clustering_method

    def compute_sdm(self, prompt_embeddings, answer_embeddings):
        # Only the clustering backend changes; metrics are computed identically.
        if self.clustering_method == 'random_forest':
            topics = self._rf_cluster(prompt_embeddings, answer_embeddings)
        else:
            topics = self._hierarchical_cluster(prompt_embeddings, answer_embeddings)
        return self._compute_metrics(topics)  # same downstream processing
```

- Scalability: Handles large document collections and multi-turn conversations
- Real-time Deployment: Enables streaming analysis in production systems
- Memory Efficiency: No distance matrix storage requirements
- Preserves SDM Properties: Information-theoretic foundations remain intact
- Maintains Diagnostic Power: Semantic Box classification continues to apply
- Enhanced Interpretability: Tree-based decisions provide interpretable semantic boundaries
- Modular Design: Easy integration with existing SDM implementations
- Production Ready: Addresses computational bottlenecks for deployment
- Broader Adoption: Makes SDM accessible for resource-constrained applications
- Comparative Analysis: Test both clustering approaches on original SDM benchmarks
- Correlation Assessment: Measure correlation between hierarchical and RF-based S_H scores (see the sketch after this list)
- Detection Performance: Evaluate hallucination detection accuracy
- Scaling Validation: Confirm performance benefits in realistic scenarios
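For the correlation assessment in particular, a minimal sketch assuming two aligned arrays of per-example S_H scores; the helper name is illustrative:

```python
# Sketch: agreement between hierarchical and RF-based S_H scores.
from scipy.stats import pearsonr, spearmanr

def score_agreement(s_h_hierarchical, s_h_random_forest):
    rho, _ = spearmanr(s_h_hierarchical, s_h_random_forest)  # rank agreement
    r, _ = pearsonr(s_h_hierarchical, s_h_random_forest)     # linear agreement
    return {"spearman": rho, "pearson": r}
```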
- Discretization Effects: Tree-based partitioning may introduce additional discretization
- Hyperparameter Sensitivity: Performance depends on appropriate depth/leaf size selection
- Embedding Space Assumptions: Assumes axis-aligned splits correspond to semantic boundaries
This enhancement is designed to complement, not replace, the original SDM methodology. Hierarchical clustering remains valuable for:
- Research applications with abundant computational resources
- Scenarios requiring maximum clustering fidelity
- Comparative analysis and method validation
@article{sudjianto2025scaling,
  title={Scaling Semantic Divergence Metrics: A Random Forest Enhancement for Practical Deployment},
  author={Sudjianto, Agus},
  url={https://github.com/asudjianto-xml/SDM-RF/blob/main/sdm_RF.pdf},
  year={2025}
}

- Halperin, I. (2025). Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models. SSRN.
- Breiman, L. (2001). Random Forests. Machine Learning, 45(1):5-32.
- Farquhar, S., et al. (2024). Detecting Hallucinations in Large Language Models Using Semantic Entropy. Nature, 630:625-630.
This work is dedicated to making Halperin's groundbreaking SDM framework accessible to the broader AI community. By removing computational barriers while preserving theoretical rigor, we hope to enable widespread adoption of SDM's sophisticated hallucination detection capabilities, ultimately improving LLM reliability across industries and applications.
The Random Forest enhancement serves as a bridge between academic innovation and practical deployment, ensuring that advanced research insights can benefit real-world AI systems at scale.
Agus Sudjianto
H2O.ai & University of North Carolina Charlotte
[email protected]