Description
I recently found this paper by Binkley et al.
A short extract from this paper follows:
- b – the number of burn-in iterations
- n – the number of samples (random variates)
- si – the sampling interval
If si is large enough, the observations are practically independent. However, too small a value risks unwanted correlation. To summarize the effect of b, n, and si: if any of these settings are too low, then the Gibbs sampler will produce inaccurate or inadequate information; if any of these settings are too high, then the only penalty is wasted computational effort.
Unfortunately, as described in Section 6, support for extracting interval-separated observations is limited in existing LDA tools. For example, Mallet provides this capability but appears to suffer from a local maxima problem
with a footnote linking to http://www.cs.loyola.edu/~binkley/topic_models/additional-images/mallet-fixation/
Does this local maxima problem still exist in current versions of Mallet?
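For context, here is how I read the roles of b, n, and si from the extract, as a minimal sketch. It is not Mallet code and not the paper's implementation: it is a toy Gibbs sampler over a bivariate normal, and the function names, the `rho` parameter, and the choice of `b + n * si` total iterations are my own assumptions for illustration.

```python
import random


def gibbs_bivariate_normal(b, n, si, rho=0.9, seed=0):
    """Toy Gibbs sampler for a standard bivariate normal with correlation rho.

    b  -- burn-in iterations discarded at the start
    n  -- number of samples to keep
    si -- sampling interval: keep one state every si iterations after burn-in
    """
    rng = random.Random(seed)
    sd = (1.0 - rho * rho) ** 0.5  # standard deviation of each conditional
    x, y = 0.0, 0.0
    kept = []
    total_iterations = b + n * si  # assumption: this is how the three settings combine
    for it in range(1, total_iterations + 1):
        x = rng.gauss(rho * y, sd)  # draw x given y
        y = rng.gauss(rho * x, sd)  # draw y given x
        if it > b and (it - b) % si == 0:
            kept.append((x, y))
    return kept


def lag1_autocorrelation(values):
    """Sample lag-1 autocorrelation of a sequence of floats."""
    mean = sum(values) / len(values)
    num = sum((a - mean) * (c - mean) for a, c in zip(values, values[1:]))
    den = sum((v - mean) ** 2 for v in values)
    return num / den


if __name__ == "__main__":
    for si in (1, 20):
        xs = [x for x, _ in gibbs_bivariate_normal(b=100, n=2000, si=si)]
        print(si, round(lag1_autocorrelation(xs), 3))
```

With si = 1 consecutive kept samples are strongly correlated, while a larger si makes them close to independent at the cost of more iterations, which matches the trade-off the paper describes.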
Reference:
Binkley, D., Heinz, D., Lawrie, D., & Overfelt, J. (2014). Understanding LDA in source code analysis. 22nd International Conference on Program Comprehension, ICPC 2014 - Proceedings, 26–36. https://doi.org/10.1145/2597008.2597150