Scaling Semantic Divergence Metrics: A Random Forest Enhancement for Practical Deployment

Paper: https://github.com/asudjianto-xml/SDM-RF/blob/main/sdm_RF.pdf

Author: Agus Sudjianto (H2O.ai & University of North Carolina Charlotte)
Date: September 2025

Abstract

This paper proposes a computational enhancement to Halperin's Semantic Divergence Metrics (SDM) framework for hallucination detection in Large Language Models. While SDM represents a significant theoretical advance in prompt-aware hallucination detection, its computational demands through hierarchical clustering (O(S²)) may limit practical deployment. We introduce Random Forest-based topic modeling as an alternative clustering approach, reducing complexity to O(S log S) while preserving the framework's theoretical integrity and diagnostic capabilities.

Key Contributions

1. Computational Enhancement

  • Clustering Complexity Reduction: From O(S²) for hierarchical clustering to O(S log S)
  • Memory Efficiency: Eliminates need for full distance matrices
  • Scalable Implementation: Enables real-time analysis of large conversational contexts

2. Theoretical Foundation

  • Information-Theoretic Validity: Proves Random Forest partitions preserve mutual information properties required by SDM
  • Consistency Guarantees: Establishes convergence properties under standard Random Forest assumptions
  • Semantic Coherence: Shows tree leaves in embedding spaces correspond to semantic topics

3. Enhanced Ensemble Metrics

  • Vectorized Operations: Batch processing of all sentences simultaneously
  • Amortized Computation: Tree traversal costs amortized across ensemble calculations
  • Sparse Matrix Utilization: Efficient topic distribution computation

Methodology Overview

Original SDM Framework

Halperin's SDM framework provides prompt-aware hallucination detection through:

  • Two-dimensional analysis: Semantic Instability (S_H) and Semantic Exploration (KL divergence)
  • Joint clustering: Shared topic space for prompts and answers
  • Semantic Box: Four-quadrant classification system
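The two axes can be computed with standard divergences over a shared topic distribution. A minimal NumPy sketch (the smoothing constant `eps` and the base-2 logarithms are implementation choices, not fixed by the summary above):

```python
import numpy as np

def _normalize(p, eps=1e-12):
    # smooth and renormalize a topic histogram into a probability vector
    p = np.asarray(p, dtype=float) + eps
    return p / p.sum()

def kl_divergence(p, q):
    # Semantic Exploration axis: KL(p || q) in bits
    p, q = _normalize(p), _normalize(q)
    return float(np.sum(p * np.log2(p / q)))

def js_divergence(p, q):
    # Semantic Instability axis: Jensen-Shannon divergence in bits
    p, q = _normalize(p), _normalize(q)
    m = 0.5 * (p + q)
    return 0.5 * float(np.sum(p * np.log2(p / m))) \
         + 0.5 * float(np.sum(q * np.log2(q / m)))
```

Both functions take topic histograms (e.g. counts of sentences per topic cluster) and agree on the same topic indexing, which is what the joint clustering step guarantees.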

Random Forest Enhancement

Input: Prompt embeddings, Answer embeddings
1. Create joint binary classification dataset
2. Train Random Forest with controlled depth
3. Extract leaf assignments as topic clusters
4. Apply original SDM metrics (JS divergence, KL divergence)
5. Compute final S_H score
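Steps 1-3 above can be sketched with scikit-learn's `RandomForestClassifier`. Taking topic IDs from a single tree's leaves is a simplifying assumption for illustration; the paper may aggregate leaf assignments across the ensemble:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_topic_assignments(prompt_emb, answer_emb,
                         n_estimators=100, max_depth=5, seed=0):
    # Step 1: joint binary dataset (0 = prompt sentence, 1 = answer sentence)
    X = np.vstack([prompt_emb, answer_emb])
    y = np.r_[np.zeros(len(prompt_emb)), np.ones(len(answer_emb))]
    # Step 2: controlled-depth forest with entropy criterion
    rf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth,
                                criterion="entropy", random_state=seed).fit(X, y)
    # Step 3: leaf indices as topic clusters; apply() returns one
    # leaf index per tree, and we keep tree 0 as the topic labeling
    # (leaf indices are node IDs, not contiguous, but serve as cluster IDs)
    leaves = rf.apply(X)
    topics = leaves[:, 0]
    return topics[:len(prompt_emb)], topics[len(prompt_emb):]
```

Steps 4-5 then apply the unchanged SDM metrics to the resulting prompt and answer topic assignments.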

Key Algorithms

Algorithm 1: RF-Enhanced SDM

  • Joint dataset creation from prompt/answer embeddings
  • Random Forest training with entropy criterion
  • Topic assignment via leaf indices
  • Standard SDM metric computation

Algorithm 2: Enhanced Ensemble JS Computation

  • Single-pass leaf assignment for all sentences
  • Vectorized ensemble computation across paraphrase groups
  • Complexity: O(S × B + M × k) vs O(M × k × log k)
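A single-pass sketch of this computation, assuming integer leaf assignments per sentence and integer paraphrase-group labels; the pairwise-mean aggregation over groups is an illustrative assumption, not necessarily the paper's exact formulation:

```python
import numpy as np

def ensemble_js(leaf_ids, group_ids, n_topics, eps=1e-12):
    # One pass over sentences: build each paraphrase group's topic
    # histogram via bincount, then average pairwise JS divergences.
    groups = np.unique(group_ids)
    P = np.zeros((len(groups), n_topics))
    for i, g in enumerate(groups):
        P[i] = np.bincount(leaf_ids[group_ids == g], minlength=n_topics)
        P[i] /= P[i].sum()
    js_values = []
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            p, q = P[i] + eps, P[j] + eps
            m = 0.5 * (p + q)
            js_values.append(0.5 * np.sum(p * np.log2(p / m))
                             + 0.5 * np.sum(q * np.log2(q / m)))
    return float(np.mean(js_values)) if js_values else 0.0
```

The leaf-assignment pass is O(S × B) when done via tree traversal, and the divergence aggregation touches only the M × k topic-distribution matrix, matching the stated complexity.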

Theoretical Results

Lemma 1 (Per-node gain is conditional MI)

For any internal node t: Δ(t) = P(t) · I(Y; B_t | t)

Theorem 1 (Tree-level MI decomposition)

Σ_t Δ(t) = I(Y; Leaf)

Theorem 3 (Validation-set consistency)

Random Forest estimator converges in L₁ under appropriate conditions

Implementation Guidelines

Hyperparameter Recommendations

  • Tree Depth: max_depth = max(5, ⌊log₂(S/10)⌋)
  • Minimum Leaf Size: min_samples_leaf = max(5, S/100)
  • Ensemble Size: n_estimators = 100-200
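These recommendations can be expressed as a function of the sentence count S. Reading S/100 as integer division is an assumption here; the text leaves the rounding unspecified:

```python
import math

def rf_hyperparams(S):
    # Depth grows logarithmically with sample count, floored at 5
    depth = max(5, math.floor(math.log2(S / 10))) if S >= 1 else 5
    return {
        "max_depth": depth,
        "min_samples_leaf": max(5, S // 100),
        "n_estimators": 100,  # anywhere in the recommended 100-200 range
    }
```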

Integration Strategy

The enhancement serves as a drop-in replacement for the clustering component while maintaining full compatibility with the original SDM pipeline:

class SDMFramework:
    def __init__(self, clustering_method='hierarchical'):
        self.clustering_method = clustering_method

    def compute_sdm(self, prompt_embeddings, answer_embeddings):
        # Clustering is the only stage that changes; both methods emit
        # topic assignments in the same shared prompt/answer topic space.
        if self.clustering_method == 'random_forest':
            topics = self._rf_cluster(prompt_embeddings, answer_embeddings)
        else:
            topics = self._hierarchical_cluster(prompt_embeddings, answer_embeddings)

        return self._compute_metrics(topics)  # same downstream SDM metrics

Advantages and Benefits

Computational

  • Scalability: Handles large document collections and multi-turn conversations
  • Real-time Deployment: Enables streaming analysis in production systems
  • Memory Efficiency: No distance matrix storage requirements

Theoretical

  • Preserves SDM Properties: Information-theoretic foundations remain intact
  • Maintains Diagnostic Power: Semantic Box classification continues to apply
  • Enhanced Interpretability: Tree-based decisions provide interpretable semantic boundaries

Practical

  • Modular Design: Easy integration with existing SDM implementations
  • Production Ready: Addresses computational bottlenecks for deployment
  • Broader Adoption: Makes SDM accessible for resource-constrained applications

Validation Protocol

  1. Comparative Analysis: Test both clustering approaches on original SDM benchmarks
  2. Correlation Assessment: Measure correlation between hierarchical and RF-based S_H scores
  3. Detection Performance: Evaluate hallucination detection accuracy
  4. Scaling Validation: Confirm performance benefits in realistic scenarios
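Step 2 of this protocol can be sketched as a NumPy-only Spearman correlation between the two score sets (ranks via double argsort; tied ranks are not averaged, an acceptable simplification for continuous S_H scores):

```python
import numpy as np

def rank_correlation(sh_hierarchical, sh_random_forest):
    # Spearman correlation = Pearson correlation of the rank vectors
    r1 = np.argsort(np.argsort(sh_hierarchical)).astype(float)
    r2 = np.argsort(np.argsort(sh_random_forest)).astype(float)
    return float(np.corrcoef(r1, r2)[0, 1])
```

A rank correlation near 1.0 across benchmark prompts would indicate the RF-based clustering preserves the ordering of hallucination scores even if absolute S_H values shift.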

Limitations and Considerations

Trade-offs

  • Discretization Effects: Tree-based partitioning may introduce additional discretization
  • Hyperparameter Sensitivity: Performance depends on appropriate depth/leaf size selection
  • Embedding Space Assumptions: Assumes axis-aligned splits correspond to semantic boundaries

Complementary Nature

This enhancement is designed to complement, not replace, the original SDM methodology. Hierarchical clustering remains valuable for:

  • Research applications with abundant computational resources
  • Scenarios requiring maximum clustering fidelity
  • Comparative analysis and method validation

Citation

@misc{sudjianto2025scaling,
  title={Scaling Semantic Divergence Metrics: A Random Forest Enhancement for Practical Deployment},
  author={Sudjianto, Agus},
  howpublished={\url{https://github.com/asudjianto-xml/SDM-RF/blob/main/sdm_RF.pdf}},
  year={2025}
}

Acknowledgments

This work is dedicated to making Halperin's groundbreaking SDM framework accessible to the broader AI community. By removing computational barriers while preserving theoretical rigor, we hope to enable widespread adoption of SDM's sophisticated hallucination detection capabilities, ultimately improving LLM reliability across industries and applications.

The Random Forest enhancement serves as a bridge between academic innovation and practical deployment, ensuring that advanced research insights can benefit real-world AI systems at scale.

Contact

Agus Sudjianto
H2O.ai & University of North Carolina Charlotte
[email protected]
