RItools' numerical arithmetic
This is a page for discussion/planning of numerical calculations and cross-checks.
During 2015, on the clusters branch, @markmfredrickson added a number of sparse-matrix-based calculations in the process of building up `cluster()` support. When there was a choice between explicit matrix multiplication and `SparseM::slm()`, we seemed to get better memory performance out of the latter; on this basis, that's our preference.
Over time there's been some churn in the "combined baseline differences" calculation, i.e., the conversion of a series of univariate imbalance measures into a Mahalanobis distance (which gets compared to a chi-square reference distribution). In 2013 we added a fallback from `base::svd` to a PROPACK-based svd; see the discussion under issue #18. As of this writing (2016), @nullsatz has some promising work on the issue25 branch toward doing away with the svd altogether and relying on QR instead. That work may in turn alternate among different QR routines before fully converging, so a robust testing strategy will be important.
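To make the svd-vs-QR question concrete, here is a sketch (simulated data and an illustrative tolerance, not RItools' own) of the two ways of detecting the effective rank of a degenerate covariance matrix:

```r
## Rank detection two ways: via singular values (current approach)
## and via a rank-revealing QR decomposition (the issue25 direction).
set.seed(42)
X <- matrix(rnorm(100 * 4), ncol = 4)
X <- cbind(X, X[, 1] + X[, 2])     # 5th column is a linear combination
S <- cov(X)                        # rank-deficient covariance

tol <- sqrt(.Machine$double.eps)   # illustrative threshold
d <- svd(S)$d
rank_svd <- sum(d > tol * d[1])    # zero out components below tolerance

rank_qr <- qr(S, tol = tol)$rank   # LINPACK-style rank-revealing QR

stopifnot(rank_svd == rank_qr)     # both should agree on rank 4 here
```

A testing strategy that pins down the rank decision (rather than the particular decomposition) would survive the alternation among QR routines mentioned above.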
For covariance matrices that approach singularity, what should our gold standard be? Should our threshold for zeroing out a principal component be smaller or larger than the threshold found inside `MASS::ginv` (which we copied over in the original version)? Should the covariates be standardized before this is decided? (If so, how should that standardization be adapted to categorical variables that may have rare categories?) If you have thoughts on any of this, do share.
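For reference, the thresholding rule inside `MASS::ginv` keeps a singular value only if it exceeds `tol` times the largest singular value, with `tol = sqrt(.Machine$double.eps)` by default. A minimal re-implementation of that criterion (the helper name is ours, for illustration):

```r
## ginv-style pseudo-inverse of a symmetric matrix S, reproducing
## MASS::ginv's zeroing criterion for small singular values.
ginv_like <- function(S, tol = sqrt(.Machine$double.eps)) {
  s <- svd(S)
  keep <- s$d > max(tol * s$d[1], 0)   # MASS::ginv's cutoff rule
  if (!any(keep)) return(array(0, dim(S)[2:1]))
  s$v[, keep, drop = FALSE] %*%
    ((1 / s$d[keep]) * t(s$u[, keep, drop = FALSE]))
}
```

Any alternative threshold we settle on could be compared against this rule on deliberately near-singular test matrices.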
Presently we have regression checks but no explicit checks against other ways of calculating balance-related statistics. (At the time JB and BH wrote the first versions of xBal, there were some cross-checks against direct calculations of "sum statistics." But these don't appear to have made it into the testing battery, at least not as currently configured.)
For stratified or unstratified comparisons without clustering, we should be able to compare adjusted mean differences to `lm`'s fitted regression coefficients. In the stratified case, this will require harmonic stratum weights on the xBal side and a rescaling of the outputs. (The comparison should be to the univariate inferential statistics, as described here.)
That seems to be the main thing to test. Here are some other things to consider testing -- strategies for doing this are TBD; please discuss.
Extensions (of the basic harmonic-weighted calculations):
- stratum weights other than harmonic
- clusters
- ...
Embellishments (of the basic harmonic-weighted calcs):
- pooled sd's
- standard differences
- ...
At present (winter 2016), univariate descriptives are handled separately. Because this may change, and because the descriptives don't get used in the multivariate calc, testing these seems the lesser priority. But we should get around to this too, as we do hear from users who rely on this material. Contributions/suggestions are welcome here.
Assuming the univariate inferentials are in order, we still need to cross-check:
1. our calculation of the univariates' covariance;
2. the Mahalanobis combination of them.
For (1), we might try something based on comparing `lm(Xes ~ z)` to `lm(Xes ~ 1)`, where `Xes` is a matrix of covariates and `z` is the treatment variable. We'd need the score-test covariance, i.e., the covariance of the `z` coefficients under the null model that `z` makes no contribution. If we can get this to work, we might also test the stratified case by comparing `lm(Xes ~ z + strata(foo))` to `lm(Xes ~ strata(foo))`.
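One way the unstratified version of this check might look (a sketch under our own assumptions, not RItools code): fit the null model, use its residual covariance to build the null covariance of the `z` coefficients, and form the quadratic combination.

```r
## Score-test covariance sketch: the z coefficients from lm(Xes ~ z),
## with their covariance estimated under the null model lm(Xes ~ 1).
## All data here are simulated.
set.seed(13)
n <- 200
Xes <- matrix(rnorm(n * 3), ncol = 3,
              dimnames = list(NULL, paste0("x", 1:3)))
z <- rep(0:1, each = n / 2)

fit_alt  <- lm(Xes ~ z)               # alternative model
fit_null <- lm(Xes ~ 1)               # null: z makes no contribution

beta_z <- coef(fit_alt)["z", ]        # one z coefficient per covariate
S_zz <- sum((z - mean(z))^2)

## null covariance of beta_z: residual covariance of Xes, scaled by S_zz
V0 <- crossprod(resid(fit_null)) / (n - 1) / S_zz
chi2 <- drop(beta_z %*% solve(V0) %*% beta_z)   # compare to chisq, df = 3
```

Whether the scaling here matches xBal's conventions exactly is part of what the cross-check would need to establish.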
For (2), we can use `stats::mahalanobis`. This uses `base::solve`, which may or may not be the best approach for near-singular matrices; see the discussion above under Computational Strategies. Pending resolution of that issue, we can test against `stats::mahalanobis` with test data that isn't particularly singular.
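Such a cross-check might look like the following sketch, using a well-conditioned covariance so that `base::solve` is unproblematic; the svd-based version stands in for whatever solve-free implementation we settle on.

```r
## Cross-check against stats::mahalanobis on well-conditioned data.
set.seed(99)
d <- matrix(rnorm(50 * 3), ncol = 3)
S <- cov(d)                           # far from singular here
x <- d[1, ]
center <- colMeans(d)

m_ref <- mahalanobis(x, center, S)    # uses base::solve internally

## svd-based alternative: S^{-1} = V diag(1/d) U' for symmetric S
s <- svd(S)
m_alt <- drop(crossprod(x - center,
                        s$v %*% ((1 / s$d) * (t(s$u) %*% (x - center)))))
stopifnot(all.equal(m_ref, m_alt))
```

Once the near-singular policy is settled, the same harness can be rerun with deliberately degenerate `S` to characterize where the two routes diverge.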