diff --git a/Gemfile.lock b/Gemfile.lock
index 296f6ad..a6de14a 100644
--- a/Gemfile.lock
+++ b/Gemfile.lock
@@ -25,6 +25,7 @@ GEM
     ffi (1.17.2-arm-linux-gnu)
     ffi (1.17.2-arm-linux-musl)
     ffi (1.17.2-arm64-darwin)
+    ffi (1.17.2-x64-mingw-ucrt)
     ffi (1.17.2-x86-linux-gnu)
     ffi (1.17.2-x86-linux-musl)
     ffi (1.17.2-x86_64-darwin)
@@ -40,6 +41,9 @@ GEM
     google-protobuf (4.30.2-arm64-darwin)
       bigdecimal
       rake (>= 13)
+    google-protobuf (4.30.2-x64-mingw-ucrt)
+      bigdecimal
+      rake (>= 13)
     google-protobuf (4.30.2-x86-linux)
       bigdecimal
       rake (>= 13)
@@ -169,6 +173,8 @@ GEM
       google-protobuf (~> 4.30)
     sass-embedded (1.88.0-riscv64-linux-musl)
       google-protobuf (~> 4.30)
+    sass-embedded (1.88.0-x64-mingw-ucrt)
+      google-protobuf (~> 4.30)
     sass-embedded (1.88.0-x86_64-darwin)
       google-protobuf (~> 4.30)
     sass-embedded (1.88.0-x86_64-linux-android)
@@ -202,6 +208,7 @@ PLATFORMS
   riscv64-linux-gnu
   riscv64-linux-musl
   ruby
+  x64-mingw-ucrt
   x86-linux
   x86-linux-gnu
   x86-linux-musl
diff --git a/_posts/2025-08-07-smarter_tree_splits.md b/_posts/2025-08-07-smarter_tree_splits.md
new file mode 100644
index 0000000..96a5cc0
--- /dev/null
+++ b/_posts/2025-08-07-smarter_tree_splits.md
@@ -0,0 +1,143 @@
+---
+title: "Smarter Tree Splits: Understanding Friedman MSE in Regression Trees"
+categories:
+- machine-learning
+- tree-algorithms
+
+tags:
+- decision-trees
+- regression
+- MSE
+- gradient-boosting
+- scikit-learn
+- xgboost
+- lightgbm
+
+author_profile: false
+seo_title: "Friedman MSE vs Classic MSE in Regression Trees"
+seo_description: "Explore the differences between Classic MSE and Friedman MSE in regression trees. Learn why Friedman MSE offers smarter, faster, and more stable tree splits in gradient boosting algorithms."
+excerpt: "Explore the smarter way of splitting nodes in regression trees using Friedman MSE, a computationally efficient and numerically stable alternative to classic variance-based MSE."
+summary: "Understand how Friedman MSE improves split decisions in regression trees. Learn about its mathematical foundation, practical advantages, and role in modern libraries like LightGBM and XGBoost."
+keywords:
+- "Friedman MSE"
+- "Classic MSE"
+- "Decision Trees"
+- "Gradient Boosting"
+- "LightGBM"
+- "XGBoost"
+- "scikit-learn"
+classes: wide
+---
+
+When building regression trees, whether in standalone models or ensembles like Random Forests and Gradient Boosted Trees, the key objective is to decide the best way to split nodes for optimal predictive performance. Traditionally, this has been done using **Mean Squared Error (MSE)** as a split criterion. However, many modern implementations — such as those in **LightGBM**, **XGBoost**, and **scikit-learn’s HistGradientBoostingRegressor** — use a mathematically equivalent but computationally superior alternative: **Friedman MSE**.
+
+This article demystifies the idea behind Friedman MSE, compares it to classic MSE, and explains why it’s increasingly the preferred method in scalable tree-based machine learning.
+
+
+## Classic MSE in Regression Trees
+
+In a traditional regression tree, splits are chosen to minimize the variance of the target variable $y$ in the resulting child nodes. The goal is to maximize the *gain*, or reduction in impurity, achieved by splitting a node.
+
+The gain from a potential split is calculated using:
+
+$$
+\text{Gain} = \text{Var}(y_{\text{parent}}) - \left( \frac{n_{\text{left}}}{n_{\text{parent}}} \cdot \text{Var}(y_{\text{left}}) + \frac{n_{\text{right}}}{n_{\text{parent}}} \cdot \text{Var}(y_{\text{right}}) \right)
+$$
+
+Where:
+
+- $$\text{Var}(y)$$ is the variance of the target variable in a node.
+- $$n$$ denotes the number of samples in each respective node.
+
+This formulation effectively measures how much more "pure" (less variance) the child nodes become after a split. It works well in many cases but has some computational and numerical limitations, especially at scale.
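+
+To make the criterion concrete, here is a minimal sketch of the variance-reduction gain for a single candidate split (NumPy assumed; the toy values are purely illustrative):
+
+```python
+import numpy as np
+
+def classic_gain(y: np.ndarray, left_mask: np.ndarray) -> float:
+    """Variance-reduction gain of splitting node `y` by the boolean `left_mask`."""
+    y_left, y_right = y[left_mask], y[~left_mask]
+    n, n_left, n_right = len(y), len(y_left), len(y_right)
+    return y.var() - (n_left / n) * y_left.var() - (n_right / n) * y_right.var()
+
+# A split that cleanly separates two clusters of target values yields a large gain.
+y = np.array([1.0, 1.2, 0.9, 5.1, 4.8, 5.3])
+print(classic_gain(y, np.array([True, True, True, False, False, False])))
+```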
+
+---
+
+## The Friedman MSE Formulation
+
+Jerome Friedman, while developing **Gradient Boosting Machines (GBMs)**, introduced a smarter way to compute split gain without explicitly calculating variances or means. His formulation is based on sums of target values, which are cheaper and more stable to compute:
+
+$$
+\text{FriedmanMSE} = \frac{\left( \sum y_{\text{left}} \right)^2}{n_{\text{left}}} + \frac{\left( \sum y_{\text{right}} \right)^2}{n_{\text{right}}} - \frac{\left( \sum y_{\text{parent}} \right)^2}{n_{\text{parent}}}
+$$
+
+This method retains the goal of minimizing squared error, but it never subtracts an explicitly computed mean from each observation: the step in the variance calculation that is most prone to floating-point error.
+
+---
+
+## Mathematical Equivalence and Efficiency
+
+Despite looking different, Friedman’s method is algebraically equivalent to minimizing MSE under squared loss. To see this, write the variance of a node in terms of sums: with $$S = \sum y$$ and $$Q = \sum y^2$$,
+
+$$
+\text{Var}(y) = \frac{Q}{n} - \left( \frac{S}{n} \right)^2
+$$
+
+Substituting this into the gain formula and multiplying through by $$n_{\text{parent}}$$, the $$Q$$ terms cancel (since $$Q_{\text{parent}} = Q_{\text{left}} + Q_{\text{right}}$$), leaving
+
+$$
+n_{\text{parent}} \cdot \text{Gain} = \frac{S_{\text{left}}^2}{n_{\text{left}}} + \frac{S_{\text{right}}^2}{n_{\text{right}}} - \frac{S_{\text{parent}}^2}{n_{\text{parent}}} = \text{FriedmanMSE}
+$$
+
+Because $$n_{\text{parent}}$$ is fixed for the node being split, ranking candidate splits by FriedmanMSE is identical to ranking them by variance reduction.
+
+By using only the sum and count of target values, this method:
+
+- Avoids recomputation of sample means for every candidate split.
+- Allows incremental updates of statistics during tree traversal.
+- Greatly speeds up histogram-based methods where feature values are bucketed and pre-aggregated.
+
+This efficiency is a major reason why libraries like **LightGBM** can scan millions of potential splits across thousands of features without breaking a sweat.
+
+---
+
+## Numerical Stability and Practical Robustness
+
+Computing variance requires subtracting the mean from each data point — a step that introduces floating-point rounding errors, especially when target values are large or nearly identical.
+
+Friedman MSE avoids this by working only with sums and counts, both of which are more robust under finite precision arithmetic. As a result, it tends to:
+
+- Be less sensitive to large-magnitude values,
+- Handle outliers more gracefully,
+- Reduce the risk of numerical instability in deep trees.
+
+This becomes particularly important when dealing with real-world datasets that often contain outliers, duplicates, or unscaled values.
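+
+The equivalence derived above is easy to verify numerically, and the check doubles as a demonstration that only sums and counts are needed. A short sketch, again assuming NumPy and the same toy node as before:
+
+```python
+import numpy as np
+
+def friedman_score(sum_left, n_left, sum_right, n_right):
+    """Split score computed from sums and counts only, as in the formula above."""
+    sum_parent, n_parent = sum_left + sum_right, n_left + n_right
+    return sum_left**2 / n_left + sum_right**2 / n_right - sum_parent**2 / n_parent
+
+y_left, y_right = np.array([1.0, 1.2, 0.9]), np.array([5.1, 4.8, 5.3])
+y = np.concatenate([y_left, y_right])
+
+score = friedman_score(y_left.sum(), len(y_left), y_right.sum(), len(y_right))
+# Classic variance reduction; each child holds half the samples, hence the 0.5 weights.
+gain = y.var() - 0.5 * y_left.var() - 0.5 * y_right.var()
+
+print(np.isclose(score, len(y) * gain))  # True: FriedmanMSE = n_parent * Gain
+```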
+
+---
+
+## Comparative Table: Classic MSE vs Friedman MSE
+
+| Feature             | Classic MSE            | Friedman MSE                       |
+|---------------------|------------------------|------------------------------------|
+| Formula             | Variance reduction     | Sums of squares over counts        |
+| Computational Cost  | Moderate               | Low (sums and counts only)         |
+| Outlier Sensitivity | Higher                 | Lower                              |
+| Numerical Stability | Moderate               | High                               |
+| Used In             | Small regression trees | Gradient boosting, histogram trees |
+
+---
+
+## Use Cases and When to Use Each
+
+While both methods aim to minimize prediction error via informative splits, they differ in suitability based on context:
+
+- **Small Datasets with Clean Targets**: Classic MSE works well and is interpretable.
+- **Large Tabular Datasets**: Friedman MSE is more scalable and efficient.
+- **Gradient Boosted Trees**: Friedman’s approach aligns with the boosting objective and is the default in most frameworks.
+- **High Cardinality Features**: Efficient computation via histograms makes Friedman MSE ideal.
+- **Presence of Noise or Outliers**: The robustness of Friedman MSE makes it a better default.
+
+In practice, if you're using libraries like **LightGBM**, **XGBoost**, or **scikit-learn’s HistGradientBoostingRegressor**, you’re already benefiting from Friedman MSE — often without realizing it.
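+
+In scikit-learn you can also opt in explicitly on a plain regression tree. A quick sketch on synthetic data; both criteria are built-in options of `DecisionTreeRegressor`, and `friedman_mse` is likewise the default `criterion` of `GradientBoostingRegressor`:
+
+```python
+from sklearn.datasets import make_regression
+from sklearn.tree import DecisionTreeRegressor
+
+X, y = make_regression(n_samples=200, n_features=5, noise=0.5, random_state=0)
+
+# Fit the same shallow tree under both split criteria and compare in-sample R^2.
+for criterion in ("squared_error", "friedman_mse"):
+    tree = DecisionTreeRegressor(criterion=criterion, max_depth=3, random_state=0)
+    tree.fit(X, y)
+    print(criterion, round(tree.score(X, y), 3))
+```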
+
+---
+
+## Historical Origins and Impact
+
+The method is named after **Jerome Friedman**, the statistician who introduced it while developing the **MART (Multiple Additive Regression Trees)** algorithm, which later evolved into what we know as Gradient Boosting Machines.
+
+By reformulating the split criterion to depend only on aggregate statistics, Friedman laid the foundation for fast, scalable, and robust boosting algorithms. This innovation, though mathematically simple, had a profound impact on how tree-based models are implemented today.
+
+---
+
+## Summary Insights
+
+Friedman MSE exemplifies how a clever mathematical simplification can drive both *speed* and *accuracy* in machine learning systems. While classic MSE is still valid and sometimes preferable in small-scale or academic scenarios, Friedman’s formulation dominates in real-world applications.
+
+By leveraging only sums and counts, it reduces computational overhead, increases numerical stability, and integrates seamlessly with histogram-based algorithms. It’s a powerful example of how understanding the internals of a model — even a minor detail like the split criterion — can help practitioners make more informed choices and build better-performing systems.
+
+---
+
+## References
+
+- Friedman, J. H. (2001). [Greedy Function Approximation: A Gradient Boosting Machine](https://projecteuclid.org/journals/annals-of-statistics/volume-29/issue-5/Greedy-function-approximation--A-gradient-boosting-machine/10.1214/aos/1013203451.full). *Annals of Statistics*.
+- [scikit-learn documentation: Decision Trees](https://scikit-learn.org/stable/modules/tree.html)
+- [LightGBM documentation: Histogram-based algorithms](https://lightgbm.readthedocs.io/)
+- [XGBoost documentation: Tree Booster Parameters](https://xgboost.readthedocs.io/en/stable/)
diff --git a/_posts/2025-08-13-preregistering_structural_equation_modeling .md b/_posts/2025-08-13-preregistering_structural_equation_modeling .md
new file mode 100644
index 0000000..8d52927
--- /dev/null
+++ b/_posts/2025-08-13-preregistering_structural_equation_modeling .md
@@ -0,0 +1,163 @@
+---
+title: >-
+  Preregistering Structural Equation Modeling (SEM) Studies: A Comprehensive
+  Guide
+categories:
+  - Research Methods
+tags:
+  - SEM
+  - Preregistration
+  - Open Science
+  - Reproducibility
+author_profile: false
+seo_title: How to Preregister Structural Equation Modeling (SEM) Studies
+seo_description: >-
+  A comprehensive guide to preregistering SEM studies, including software
+  environments, modeling decisions, fit criteria, and contingency plans for
+  robust and reproducible analysis.
+excerpt: >-
+  Learn how to preregister your SEM study by systematically locking down
+  modeling and analytic decisions to improve scientific transparency and reduce
+  bias.
+summary: >-
+  This guide explains how to preregister structural equation modeling (SEM)
+  studies across seven major decision domains—software, model structure,
+  statistical modeling, estimation, fit assessment, contingency planning, and
+  robustness checks—ensuring confirmatory analyses remain unbiased and
+  reproducible.
+keywords:
+  - structural equation modeling
+  - preregistration
+  - open science
+  - research reproducibility
+  - confirmatory analysis
+classes: wide
+---
+
+Structural Equation Modeling (SEM) is a powerful analytical tool, capable of modeling complex latent structures and causal relationships between variables. From psychology to marketing, SEM is used in diverse fields to test theoretical models with observed data. Yet, the same flexibility that makes SEM attractive also opens the door to excessive researcher degrees of freedom. Without constraints, analysts can tweak specifications post hoc--knowingly or unknowingly--to produce more favorable results.
+
+Preregistration addresses this issue by setting in stone the analysis plan _before_ seeing the data. In the context of frequentist statistics, this step is crucial: data-contingent modeling decisions can invalidate p-values and confidence intervals by increasing the risk of false positives. By formally documenting decisions in advance, researchers commit to a confirmatory route, while still retaining space for transparent exploratory analysis.
+
+This article outlines the major decision domains to address in a preregistered SEM study. Each domain includes actionable recommendations to make your research more reproducible and credible.
+
+--------------------------------------------------------------------------------
+
+# 1\. Locking the Software Environment
+
+A reproducible analysis begins with a stable computational environment. SEM models are sensitive not only to the software used but also to subtle changes across versions of packages, operating systems, and numerical libraries.
+
+Specify the software and exact version you will use--such as `lavaan 0.6-18` in `R 4.4.1`, `Mplus 8.10`, or `semopy 2.x` in Python. Don't stop at the modeling package; include the operating system, any hardware dependencies, and versions of supporting math libraries (e.g., BLAS, OpenBLAS, MKL) that may affect floating-point operations.
+
+Use tools like `renv::snapshot()` in R, `requirements.txt` in Python, or `conda` environments to freeze dependencies. For maximal reproducibility, build a Docker container and share the `Dockerfile` or image.
+
+Also include the exact script or notebook you intend to run during analysis. This provides a literal representation of your analysis plan--helpful not only for others but also for future-you.
+
+# 2\. Defining the Scientific and Structural Model
+
+Before data ever enters the picture, you need a clear theoretical model. This involves both conceptual and graphical representations of expected relationships.
+
+Develop a complete path diagram showing hypothesized relationships between latent constructs and observed variables. Each latent variable should have a definition rooted in prior literature, and every item should be justified in terms of what it captures. This process helps clarify construct validity and prevents arbitrary inclusion of items during analysis.
+
+Declare whether your model is directional (i.e., causal paths), and specify which variables are exogenous. For studies involving group comparisons or longitudinal designs, indicate plans for assessing measurement invariance.
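+
+One way to remove ambiguity is to include the model specification itself in the registration. A minimal sketch in lavaan-style syntax as accepted by `semopy` (the constructs and indicators here are hypothetical placeholders, not recommendations):
+
+```python
+import pandas as pd
+from semopy import Model
+
+# Hypothetical preregistered specification: measurement model first,
+# then the directional structural path declared in advance.
+MODEL_DESC = """
+Support =~ sup1 + sup2 + sup3
+Burnout =~ bo1 + bo2 + bo3 + bo4
+Burnout ~ Support
+"""
+
+def fit_preregistered(data: pd.DataFrame) -> pd.DataFrame:
+    """Fit the frozen specification exactly as registered, with no post hoc paths."""
+    model = Model(MODEL_DESC)
+    model.fit(data)
+    return model.inspect()  # parameter estimates, standard errors, p-values
+```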
+
+To preserve confirmatory integrity, clearly state that no additional paths will be added unless in a predefined exploratory phase. If modifications are planned, make them conditional on specified thresholds or theoretical rationales.
+
+# 3\. Operationalizing the Statistical Model
+
+The theoretical structure must be translated into a formal statistical model. This includes selecting the appropriate SEM framework, specifying assumptions, and handling practical modeling choices.
+
+Indicate the model type: Confirmatory Factor Analysis (CFA), full SEM, latent growth curve models, MIMIC, multilevel SEM, or network SEM. Each requires different identification strategies and introduces different assumptions.
+
+Define your assumptions about variable distributions. For instance, will ordinal items be treated as continuous, or will you use polychoric correlations? Will you allow for non-normality or heavy-tailed distributions?
+
+Declare how residuals are treated--are any error covariances theory-justified? Describe the strategy for missing data, whether Full Information Maximum Likelihood (FIML), multiple imputation, or listwise deletion.
+
+Also fix your identification strategy: marker-variable (loading fixed to 1) or unit-variance scaling (latent variance fixed to 1). Changes to these decisions post hoc can affect parameter estimates, so preregistration helps avoid retrofitting models to the data.
+
+# 4\. Estimation Methods and Robustness Considerations
+
+Choosing an estimator isn't just a technical detail--it affects parameter accuracy, standard errors, and fit indices. Preregister your primary estimation method, such as Maximum Likelihood (ML), Robust ML (MLR), Diagonally Weighted Least Squares (DWLS/WLSMV), Unweighted Least Squares (ULS), or Bayesian estimation.
+
+If you anticipate potential violations of assumptions, specify robustness corrections ahead of time. For instance, include Satorra–Bentler scaled chi-square if using MLR. Document the maximum number of iterations, convergence thresholds, and behavior in case of non-convergence.
+
+For studies involving bootstrapping, state how many samples will be used (e.g., 5,000 BCa resamples), which statistics will be bootstrapped, and how the results will be interpreted.
+
+Robustness checks should be planned--not reactive. They belong in a separate sensitivity analysis tier rather than as an opportunistic fix after primary analyses fail.
+
+# 5\. Measurement Model Decisions and Fit Criteria
+
+A major temptation in SEM is to "tune" the model post hoc to improve fit. Preregistration prevents this by locking in the criteria by which model fit will be judged.
+
+Declare your primary fit indices: CFI, TLI, RMSEA, and SRMR are common. Specify thresholds for both good and acceptable fit (e.g., CFI > 0.95 for good, > 0.90 for acceptable).
+
+State whether any correlated errors or cross-loadings are allowed based on theory. Describe how (or whether) modification indices (MI) will be consulted. If MIs are to be used, define a strict rule--for example, MI > 10 _and_ theoretical justification must both be met.
+
+A best practice is to employ a two-tiered strategy: analyze the confirmatory model as preregistered, and if fit is poor, then conduct a clearly labeled exploratory refinement. Keep the confirmatory and exploratory results separate in interpretation and reporting.
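+
+Freezing these thresholds in code alongside the model removes any wiggle room later. A minimal sketch, assuming `semopy` and its `calc_stats` helper; the threshold values mirror the examples above, and only CFI, TLI, and RMSEA are checked here since SRMR reporting varies by version:
+
+```python
+from semopy import Model, calc_stats
+
+# Preregistered cutoffs as (value, direction); ">=" means larger is better.
+THRESHOLDS = {"CFI": (0.90, ">="), "TLI": (0.90, ">="), "RMSEA": (0.08, "<=")}
+
+def evaluate_fit(model: Model) -> dict:
+    """Compare fitted-model indices against the preregistered criteria."""
+    stats = calc_stats(model).iloc[0]  # one-row table of fit indices
+    results = {}
+    for index, (cutoff, direction) in THRESHOLDS.items():
+        value = float(stats[index])
+        results[index] = value >= cutoff if direction == ">=" else value <= cutoff
+    return results  # e.g., {"CFI": True, "TLI": True, "RMSEA": False}
+```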
+
+--------------------------------------------------------------------------------
+
+# 6\. Predefined Backup Plans and Contingency Responses
+
+Even well-specified models can fail. Convergence issues, inadmissible solutions, or severe model misfit are not uncommon in SEM. Rather than improvising fixes, define contingency plans in advance to preserve the integrity of your confirmatory claims.
+
+Start by specifying a tiered approach to non-convergence. This might involve modifying starting values, switching optimization algorithms (e.g., from `nlminb` to `BFGS` in `lavaan`), or simplifying the model by removing problematic latent variables or paths.
+
+Plan for the appearance of **Heywood cases**, such as negative variance estimates or standardized loadings exceeding 1. You might specify that if a Heywood case appears and is minor (e.g., loading = 1.01), you will retain the solution; but if it exceeds a certain threshold, a predefined reduced model will be estimated instead.
+
+If model fit is poor based on your preregistered criteria, define whether and how you will proceed. For example, you could specify that if RMSEA > 0.10, you will estimate a simplified version of the model that excludes certain weakly identified paths, provided the revision is consistent with theoretical expectations.
+
+Contingencies can also include assumption violations, such as non-normality or outliers. If these are detected using predefined diagnostics (e.g., Mardia's test, Q-Q plots), you may move to robust estimation or data transformation--again, only if such actions were laid out in the preregistration.
+
+Use a decision table mapping common problems to specific, predefined responses. This reduces the need for subjective choices once the data are visible.
+
+# 7\. Multiverse and Sensitivity Analyses
+
+To demonstrate that your results are not fragile, preregister a **multiverse analysis**--a systematic variation of defensible analytical decisions. This goes beyond robustness checks by explicitly modeling the uncertainty introduced by researcher degrees of freedom.
+
+List all plausible alternatives in areas such as:
+
+- Missing data strategy: FIML vs. multiple imputation
+- Treatment of ordinal items: as continuous vs. polychoric-based CFA
+- Scaling method: marker variable vs. unit variance
+- Grouping strategy: multigroup vs. covariate modeling
+- Outlier handling: Winsorizing, robust Mahalanobis distance exclusion
+
+Create a plan to fit all combinations of these decisions and extract a key parameter of interest (e.g., the path coefficient from A → B). Then visualize its distribution across specifications via a **specification curve**.
+
+Include robustness checks such as **leave-one-indicator-out** analysis, where the measurement model is re-estimated repeatedly while omitting one indicator at a time to test for over-reliance on specific items.
+
+The goal here is not to eliminate all variation, but to demonstrate that your conclusions hold across reasonable decision spaces. Automate this process before any real data are analyzed, using scripts or workflows that can be rerun unchanged, as in the skeleton below.
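+
+A minimal skeleton of that automation, where `fit_and_extract` stands in for whatever preprocessing-plus-SEM pipeline you preregister (the decision names and options below are illustrative, not prescriptive):
+
+```python
+from itertools import product
+
+# Defensible alternatives for each preregistered decision point.
+DECISIONS = {
+    "missing": ["fiml", "multiple_imputation"],
+    "ordinals": ["continuous", "polychoric"],
+    "scaling": ["marker", "unit_variance"],
+    "outliers": ["winsorize", "mahalanobis_exclude"],
+}
+
+def fit_and_extract(data, **spec) -> float:
+    """Hypothetical helper: preprocess `data` per `spec`, fit the SEM, and
+    return the preregistered parameter of interest (the A -> B path)."""
+    raise NotImplementedError  # implementation depends on your modeling library
+
+def run_multiverse(data) -> list:
+    """Fit every combination of decisions and collect the key estimate."""
+    results = []
+    for combo in product(*DECISIONS.values()):
+        spec = dict(zip(DECISIONS, combo))
+        results.append({**spec, "estimate": fit_and_extract(data, **spec)})
+    return results  # feed into a specification-curve plot
+```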
+
+# 8\. Additional Preregistration Components
+
+A high-quality preregistration does more than lock analytic decisions--it anticipates all aspects of confirmatory research.
+
+Specify your **sample size** and the method used to determine it. This may involve a Monte Carlo simulation to assess statistical power for detecting your hypothesized effects under specific assumptions about model structure and measurement quality.
+
+State your **primary outcomes** and hypotheses clearly. A good rule of thumb is one primary effect per hypothesis. Secondary effects should be labeled exploratory unless they are also preregistered with equal rigor.
+
+If testing multiple effects, plan for **multiplicity correction**. This could involve controlling the false discovery rate (FDR), using Bonferroni or Holm corrections, or adopting a Bayesian approach with posterior probabilities.
+
+Define your **reporting plan**, including how confirmatory results will be separated from exploratory ones. Make clear which figures, tables, and model variations will be included in the final paper.
+
+Finally, consider uploading the preregistration to a public registry, such as [AsPredicted](https://aspredicted.org), [OSF Registries](https://osf.io/registries), or a journal-specific format if submitting under a Registered Reports model.
+
+# 9\. Final Thoughts on Transparency and Rigor
+
+Preregistration does not limit scientific creativity--it clarifies it. By defining your confirmatory analysis plan in advance, you create a clean boundary between hypothesis testing and hypothesis generation. Readers can trust that the results you present as confirmatory were not achieved through post hoc modifications.
+
+A robust SEM preregistration spans more than just model syntax. It includes your computational setup, theoretical justifications, modeling assumptions, contingency plans, and robustness checks. It acknowledges the complexity of SEM and uses structure to prevent that complexity from becoming a liability.
+
+Think of your preregistration as a recipe. If another researcher followed it precisely, without speaking to you, they should arrive at the same confirmatory results. When this happens, science advances--not just with findings, but with trust.
+
+--------------------------------------------------------------------------------
+
+# Resources and Templates
+
+- [Preregistration Template for SEM Studies (OSF)](https://osf.io/registries)
+- [lavaan Model Syntax Documentation](https://lavaan.ugent.be/tutorial/index.html)
+- [Mplus User's Guide](https://www.statmodel.com/download/usersguide/MplusUserGuideVer_8.pdf)
+- [semopy Documentation](https://semopy.com/)
+- [Specifying and Visualizing Specification Curves](https://journals.sagepub.com/doi/10.1177/2515245919864955)