Commit 75c3f86
committed
new article
1 parent a543616 commit 75c3f86

1 file changed: +143 -0 lines changed
---
title: "Smarter Tree Splits: Understanding Friedman MSE in Regression Trees"
categories:
- machine-learning
- tree-algorithms

tags:
- decision-trees
- regression
- MSE
- gradient-boosting
- scikit-learn
- xgboost
- lightgbm

author_profile: false
seo_title: "Friedman MSE vs Classic MSE in Regression Trees"
seo_description: "Explore the differences between Classic MSE and Friedman MSE in regression trees. Learn why Friedman MSE offers smarter, faster, and more stable tree splits in gradient boosting algorithms."
excerpt: "Explore the smarter way of splitting nodes in regression trees using Friedman MSE, a computationally efficient and numerically stable alternative to classic variance-based MSE."
summary: "Understand how Friedman MSE improves split decisions in regression trees. Learn about its mathematical foundation, practical advantages, and role in modern libraries like LightGBM and XGBoost."
keywords:
- "Friedman MSE"
- "Classic MSE"
- "Decision Trees"
- "Gradient Boosting"
- "LightGBM"
- "XGBoost"
- "scikit-learn"
classes: wide
---
When building regression trees, whether as standalone models or inside ensembles like Random Forests and Gradient Boosted Trees, the key objective is deciding where to split each node for the best predictive performance. Traditionally, this has been done using **Mean Squared Error (MSE)** as the split criterion. However, many modern implementations, such as those in **LightGBM**, **XGBoost**, and **scikit-learn’s HistGradientBoostingRegressor**, use a mathematically equivalent but computationally cheaper alternative: **Friedman MSE**.

This article demystifies the idea behind Friedman MSE, compares it to classic MSE, and explains why it’s increasingly the preferred criterion in scalable tree-based machine learning.

## Classic MSE in Regression Trees

In a traditional regression tree, splits are chosen to minimize the variance of the target variable $y$ in the resulting child nodes. The goal is to maximize the *gain*, or reduction in impurity, achieved by splitting a node.

The gain from a potential split is calculated as:

$$
\text{Gain} = \text{Var}(y_{\text{parent}}) - \left( \frac{n_{\text{left}}}{n_{\text{parent}}} \cdot \text{Var}(y_{\text{left}}) + \frac{n_{\text{right}}}{n_{\text{parent}}} \cdot \text{Var}(y_{\text{right}}) \right)
$$

Where:

- $\text{Var}(y)$ is the variance of the target variable in a node.
- $n$ denotes the number of samples in each respective node.

This formulation measures how much "purer" (lower in variance) the child nodes become after a split. It works well in many cases, but computing variances for every candidate split has computational and numerical drawbacks, especially at scale.
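As a concrete illustration, here is a minimal NumPy sketch of this classic criterion; the function name `variance_gain` and the toy arrays are illustrative only, not any library’s internal implementation:

```python
import numpy as np

def variance_gain(y_left: np.ndarray, y_right: np.ndarray) -> float:
    """Variance-reduction gain of a candidate split (classic MSE criterion)."""
    y_parent = np.concatenate([y_left, y_right])
    n_parent, n_left, n_right = len(y_parent), len(y_left), len(y_right)
    weighted_child_variance = (
        (n_left / n_parent) * y_left.var() + (n_right / n_parent) * y_right.var()
    )
    return y_parent.var() - weighted_child_variance

# Toy example: a clean split between two well-separated groups.
print(variance_gain(np.array([1.0, 2.0, 3.0]), np.array([10.0, 11.0, 12.0])))
```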
---
## The Friedman MSE Formulation

Jerome Friedman, while developing **Gradient Boosting Machines (GBMs)**, introduced a smarter way to compute split gain without explicitly calculating variances or means. His formulation is based on sums of target values, which are cheaper and more stable to compute:

$$
\text{FriedmanMSE} = \frac{\left( \sum y_{\text{left}} \right)^2}{n_{\text{left}}} + \frac{\left( \sum y_{\text{right}} \right)^2}{n_{\text{right}}} - \frac{\left( \sum y_{\text{parent}} \right)^2}{n_{\text{parent}}}
$$

This method retains the goal of minimizing squared error but replaces explicit variances and per-sample mean subtraction, steps that are error-prone in floating-point arithmetic, with a handful of sums and counts per node.
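The same toy example makes this concrete; the sketch below uses an illustrative name, `friedman_style_gain`, and is not the exact routine used by any particular library:

```python
import numpy as np

def friedman_style_gain(y_left: np.ndarray, y_right: np.ndarray) -> float:
    """Split gain computed from sums and counts only (no means, no variances)."""
    s_left, s_right = y_left.sum(), y_right.sum()
    n_left, n_right = len(y_left), len(y_right)
    s_parent, n_parent = s_left + s_right, n_left + n_right
    return s_left**2 / n_left + s_right**2 / n_right - s_parent**2 / n_parent

y_left, y_right = np.array([1.0, 2.0, 3.0]), np.array([10.0, 11.0, 12.0])
y_parent = np.concatenate([y_left, y_right])
n_parent = len(y_parent)

# Classic weighted-variance gain for the same split, for comparison.
variance_gain = (
    y_parent.var()
    - (len(y_left) / n_parent) * y_left.var()
    - (len(y_right) / n_parent) * y_right.var()
)

# The sum-based quantity is exactly n_parent times the variance-based gain,
# so both criteria rank candidate splits identically.
assert np.isclose(friedman_style_gain(y_left, y_right), n_parent * variance_gain)
```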
---
## Mathematical Equivalence and Efficiency

Despite looking different, Friedman’s method is algebraically equivalent to minimizing MSE under squared loss. To see this, note that the variance of a node can be written in terms of sums:

$$
n \cdot \text{Var}(y) = \sum y^2 - \frac{\left( \sum y \right)^2}{n}
$$

When you take the parent-minus-children difference, the $\sum y^2$ terms cancel (every sample appears once in the parent and once in a child), leaving exactly the sum-based expression from the previous section. In fact, that expression equals $n_{\text{parent}}$ times the variance-based gain, and it can also be rewritten as $\frac{n_{\text{left}} \, n_{\text{right}}}{n_{\text{parent}}} \left( \bar{y}_{\text{left}} - \bar{y}_{\text{right}} \right)^2$, the "improvement" score used in Friedman (2001). The two criteria therefore rank every candidate split identically.

By using only the sum and count of target values, this method:

- Avoids recomputation of sample means for every candidate split.
- Allows incremental updates of the statistics as the split point moves across sorted samples.
- Greatly speeds up histogram-based methods, where feature values are bucketed and pre-aggregated.

This efficiency is a major reason why libraries like **LightGBM** can scan millions of potential splits across thousands of features without breaking a sweat.
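The sketch below shows what the incremental bookkeeping looks like for an exact (non-histogram) scan over one feature; histogram-based learners apply the same running sums to pre-binned values so that each node scans a few hundred bins instead of every sample. The function `best_split_by_running_sums` and the synthetic data are illustrative only:

```python
import numpy as np

def best_split_by_running_sums(feature: np.ndarray, y: np.ndarray):
    """Scan all candidate thresholds in one pass using running sums and counts."""
    order = np.argsort(feature)
    x_sorted, y_sorted = feature[order], y[order]

    s_total, n_total = y_sorted.sum(), len(y_sorted)
    s_left, n_left = 0.0, 0
    best_gain, best_threshold = -np.inf, None

    for i in range(n_total - 1):
        # Incrementally move one sample from the right child to the left child.
        s_left += y_sorted[i]
        n_left += 1
        if x_sorted[i] == x_sorted[i + 1]:
            continue  # cannot split between identical feature values
        s_right, n_right = s_total - s_left, n_total - n_left
        gain = s_left**2 / n_left + s_right**2 / n_right - s_total**2 / n_total
        if gain > best_gain:
            best_gain = gain
            best_threshold = 0.5 * (x_sorted[i] + x_sorted[i + 1])
    return best_threshold, best_gain

rng = np.random.default_rng(0)
x = rng.uniform(size=200)
y = np.where(x < 0.4, 1.0, 5.0) + rng.normal(scale=0.1, size=200)
print(best_split_by_running_sums(x, y))  # threshold should land near 0.4
```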
---
## Numerical Stability and Practical Robustness

Computing a variance directly requires subtracting the mean from each data point, a step that introduces floating-point rounding errors, especially when target values are large in magnitude or nearly identical.

Friedman MSE avoids this by working only with sums and counts, which can be accumulated in a single pass over the data. As a result, it tends to:

- Be less sensitive to large-magnitude target values,
- Handle outliers and duplicated values with less accumulated rounding error,
- Reduce the risk of numerical instability in deep trees.

This becomes particularly important when dealing with real-world datasets that often contain outliers, duplicates, or unscaled values.

---
## Comparative Table: Classic MSE vs Friedman MSE

| Feature              | Classic MSE                 | Friedman MSE                         |
|----------------------|-----------------------------|--------------------------------------|
| Formula              | Weighted variance reduction | Squared sums of targets over counts  |
| Computational Cost   | Moderate                    | Low (sums and counts only)           |
| Outlier Sensitivity  | Higher                      | Lower                                |
| Numerical Stability  | Moderate                    | High                                 |
| Used In              | Small regression trees      | Gradient boosting, histogram trees   |

---
## Use Cases and When to Use Each

While both methods aim to minimize prediction error via informative splits, they differ in suitability based on context:

- **Small Datasets with Clean Targets**: Classic MSE works well and is easy to interpret.
- **Large Tabular Datasets**: Friedman MSE is more scalable and efficient.
- **Gradient Boosted Trees**: Friedman’s approach aligns with the boosting objective and is the default in most frameworks.
- **High-Cardinality Features**: Efficient computation via histograms makes Friedman MSE ideal.
- **Presence of Noise or Outliers**: The robustness of Friedman MSE makes it a better default.

In practice, if you’re using libraries like **LightGBM**, **XGBoost**, or **scikit-learn’s HistGradientBoostingRegressor**, you’re already benefiting from Friedman MSE, often without realizing it.
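In scikit-learn you can also opt in explicitly; the sketch below compares the two criteria on synthetic data and assumes a recent scikit-learn version, where both `criterion="squared_error"` and `criterion="friedman_mse"` are accepted:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1_000, n_features=10, noise=0.1, random_state=0)

# A single regression tree grown with the classic variance-reduction criterion...
tree_classic = DecisionTreeRegressor(criterion="squared_error", random_state=0).fit(X, y)

# ...and the same tree grown with Friedman's improvement score instead.
tree_friedman = DecisionTreeRegressor(criterion="friedman_mse", random_state=0).fit(X, y)

# GradientBoostingRegressor already defaults to criterion="friedman_mse".
gbm = GradientBoostingRegressor(random_state=0).fit(X, y)

print(tree_classic.score(X, y), tree_friedman.score(X, y), gbm.score(X, y))
```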
---
## Historical Origins and Impact

The method is named after **Jerome Friedman**, the statistician who introduced it while developing **MART (Multiple Additive Regression Trees)**, the tree-based form of what we now call Gradient Boosting Machines.

By reformulating the split criterion to depend only on aggregate statistics, Friedman laid the foundation for fast, scalable, and robust boosting algorithms. This innovation, though mathematically simple, had a profound impact on how tree-based models are implemented today.

---
## Summary Insights

Friedman MSE exemplifies how a clever mathematical simplification can drive both *speed* and *accuracy* in machine learning systems. While classic MSE is still valid and sometimes preferable in small-scale or academic scenarios, Friedman’s formulation dominates in real-world applications.

By leveraging only sums and counts, it reduces computational overhead, improves numerical stability, and integrates seamlessly with histogram-based algorithms. It’s a powerful example of how understanding the internals of a model, even a detail as small as the split criterion, helps practitioners make more informed choices and build better-performing systems.

---
## References

- Friedman, J. H. (2001). [Greedy Function Approximation: A Gradient Boosting Machine](https://projecteuclid.org/journals/annals-of-statistics/volume-29/issue-5/Greedy-function-approximation--A-gradient-boosting-machine/10.1214/aos/1013203451.full). *Annals of Statistics*, 29(5), 1189–1232.
- [scikit-learn documentation: Decision Trees](https://scikit-learn.org/stable/modules/tree.html)
- [LightGBM documentation: Histogram-based algorithms](https://lightgbm.readthedocs.io/)
- [XGBoost documentation: Tree Booster Parameters](https://xgboost.readthedocs.io/en/stable/)
