fix: fix article

DiogoRibeiro7 · DiogoRibeiro7 · commit 89ae85f57c75 · 2025-06-18T17:49:23.000+01:00
diff --git a/_posts/2025-06-07-why_math_statistics_foundations_data_science.md b/_posts/2025-06-07-why_math_statistics_foundations_data_science.md
@@ -34,66 +34,67 @@ tags:
 title: Why Data Scientists Need Math and Statistics
 ---
 
-It’s tempting to think that mastering a handful of libraries—pandas, Scikit-Learn, TensorFlow—is the fast track to data science success. Yet tools are abstractions built atop deep mathematical and statistical theory. Without understanding **why** an algorithm works—its assumptions, convergence guarantees, or failure modes—practitioners risk producing brittle models and misinterpreting outputs. Libraries accelerate development, but the true power of data science lies in the ability to reason about algorithms at a theoretical level.
+It’s tempting—especially in fast-paced learning environments—to believe that knowing a few libraries like pandas, Scikit-Learn, or TensorFlow is enough to be a data scientist. I’ve seen students and even early-career professionals fall into this trap. But here’s the reality: these tools are just scaffolding. The actual structure is built on mathematics and statistics. Without understanding what’s happening under the hood, you’re not doing science—you’re executing recipes.
 
-## 2. Mathematical Foundations: Linear Algebra and Calculus
+## 1. More Than Code: Why Theory Matters
 
-At the heart of many predictive models are operations on vectors and matrices. Consider a data matrix $\mathbf{X}\in\mathbb{R}^{n\times p}$: understanding its **singular value decomposition**  
-$$
-\mathbf{X} = U\,\Sigma\,V^\top
-$$
-reveals principal directions of variance, which underpin techniques like Principal Component Analysis. Eigenvalues and eigenvectors provide insight into covariance structure, guiding feature extraction and dimensionality reduction.
+In practice, it’s easy to mistake familiarity with libraries for mastery of data science. You might be able to build a random forest or train a neural network—but what happens when things go wrong? What if the model overfits? What if convergence is painfully slow? These questions demand answers that only theory can provide. If you don’t understand the assumptions baked into an algorithm—how it generalizes, when it breaks—you’ll struggle to debug, optimize, or improve it.
 
-Calculus provides the language of change, enabling optimization of complex loss functions. Gradient-based methods update parameters $\theta$ via  
-$$
-\theta_{t+1} = \theta_t - \eta\,\nabla_\theta L(\theta_t),
-$$
-where $\eta$ is the learning rate and $\nabla_\theta L$ the gradient of the loss. Delving into second-order information—the Hessian matrix $H = \nabla^2_\theta L$—explains curvature and motivates algorithms like Newton’s method or quasi-Newton schemes (e.g., BFGS). These concepts illuminate why some problems converge slowly, why learning rates must be tuned, and how saddle points impede optimization.
+A data scientist who understands the mathematics isn’t just pushing buttons. They’re reasoning, experimenting, and innovating. And when things go off-track (as they often do), it’s theory that guides them back.
 
-## 3. Statistical Principles: Inference, Uncertainty, and Validation
+## 2. Linear Algebra and Calculus: The Engines Behind the Algorithms
 
-Data science inevitably grapples with uncertainty. Statistics offers the framework to quantify and manage it. A common task is estimating the mean of a population from a sample of size $n$. The **confidence interval** for a normally distributed estimator $\hat\mu$ with known variance $\sigma^2$ is  
-$$
-\hat\mu \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}},
-$$
-where $z_{\alpha/2}$ corresponds to the desired coverage probability (e.g., $1.96$ for 95%). Hypothesis testing formalizes decision-making: by computing a $p$-value, one assesses the probability of observing data at least as extreme as the sample under a null hypothesis.
+Take linear algebra. It’s not just about matrix multiplication. When I teach Principal Component Analysis (PCA), I start with singular value decomposition. Why? Because understanding how data can be decomposed into principal directions of variance isn't just intellectually satisfying—it directly informs better feature engineering, dimensionality reduction, and model interpretation.
 
-Probability distributions—Bernoulli, Poisson, Gaussian—model data generation processes and inform likelihood-based methods. Maximum likelihood estimation (MLE) chooses parameters $\theta$ to maximize  
-$$
-\mathcal{L}(\theta) = \prod_{i=1}^n p(x_i \mid \theta),
-$$
-and its logarithm simplifies optimization to summing log-likelihoods. Statistical rigor guards against overfitting, data dredging, and false discoveries, ensuring that observed patterns reflect genuine signals rather than random noise.
+Similarly, eigenvalues and eigenvectors aren’t abstract constructs. They reveal the structure of your data’s covariance matrix and help explain why certain transformations work and others don’t. These aren’t bonus concepts; they’re critical tools.
 
-## 4. Theory in Action: Demystifying Algorithms
+Calculus, on the other hand, gives us the language of change. It underpins every optimization routine in machine learning. Gradient descent, for example, is everywhere—from linear regression to deep learning. But if you don’t understand gradients, partial derivatives, or what a Hessian matrix tells you about curvature, then tuning hyperparameters like learning rate becomes guesswork.
 
-Every algorithm embodies mathematical and statistical choices. A **linear regression** model  
-$$
-\hat y = X\beta + \varepsilon
-$$
-assumes that residuals $\varepsilon$ are independent, zero-mean, and homoscedastic. Violations—such as autocorrelation or heteroscedasticity—invalidate inference unless addressed. **Decision trees** rely on information‐theoretic splits, measuring impurity via entropy  
-$$
-H(S) = -\sum_{k} p_k \log p_k,
-$$
-and choosing splits that maximize information gain. **Neural networks** approximate arbitrary functions by composing affine transformations and nonlinear activations, with backpropagation systematically computing gradients via the chain rule.
+Let’s take Newton’s method: it’s beautiful in theory, and incredibly efficient when it works. But without understanding second-order derivatives, you’d never know when and why it might fail.
 
-Understanding these mechanics clarifies why certain models excel on specific data types and fail on others. It empowers practitioners to select or adapt algorithms—pruning trees to prevent overfitting, regularizing regression with an $L_1$ penalty to induce sparsity, or choosing appropriate activation functions to avoid vanishing gradients.
+## 3. Statistics: Measuring Uncertainty with Rigor
 
-## 5. Common Errors from Theoretical Gaps
+Every dataset is finite. Every conclusion you draw is uncertain to some degree. That’s where statistics steps in—not to complicate things, but to quantify your confidence. Whether you're calculating a confidence interval or running a hypothesis test, you're trying to understand how much trust to put in your results.
 
-Ignoring foundational theory leads to familiar pitfalls. Failing to standardize features in gradient‐based models can cause one dimension to dominate updates, slowing convergence. Overlooking multicollinearity in regression inflates variance of coefficient estimates, making interpretation meaningless. Misapplying hypothesis tests without correcting for multiple comparisons increases false positive rates. Blind reliance on automated pipelines may conceal data leakage—where test information inadvertently influences training—resulting in overly optimistic performance estimates.
+Let’s say you're estimating the mean income of a population. You take a sample, compute a mean, and construct a 95% confidence interval. If you don’t understand where that interval comes from, or what assumptions underlie its validity, your conclusions might mislead—even if the math is technically correct.
 
-## 6. Cultivating Analytical Intuition: Learning Strategies
+Maximum Likelihood Estimation (MLE) is another workhorse technique. It's not enough to know how to plug into a library function. Why does the log-likelihood simplify things? Why is it often convex in the parameters? These are the kinds of questions that separate competent modelers from algorithmic operators.
 
-Building fluency in mathematics and statistics need not be daunting. Effective approaches include:
+And then there’s model validation. Cross-validation isn't just a checklist item—it’s your safeguard against overfitting. But its effectiveness depends on your understanding of sampling, bias-variance tradeoff, and variance estimation. I always remind my students: good results don’t mean good models. They might mean your test data leaked into training.
 
-- **Structured Coursework**: Enroll in linear algebra and real analysis to master vector spaces, eigenvalues, and limits.  
-- **Applied Exercises**: Derive gradient descent updates by hand for simple models, then verify them in code.  
-- **Textbook Deep Dives**: Study “Linear Algebra and Its Applications” (Strang) and “Statistical Inference” (Casella & Berger) for rigorous yet accessible treatments.  
-- **Algorithm Implementations**: Recreate k-means clustering, logistic regression, or principal component analysis from first principles to internalize assumptions.  
-- **Peer Discussions**: Teach core concepts—Bayes’ theorem, eigen decomposition—to colleagues or study groups, reinforcing understanding through explanation.
+## 4. Algorithms as Mathematical Objects
 
-These practices foster the intuition that transforms abstract symbols into actionable insights.
+Every algorithm is built on theory—linear regression, decision trees, k-means clustering, support vector machines, neural networks. What changes is the mathematical lens through which we view them.
 
-## 7. Embracing Theory for Sustainable Data Science
+Linear regression isn’t just a line of best fit. It’s an estimator with assumptions: normality of errors, independence, constant variance. If those assumptions are violated, inference becomes unreliable. Decision trees optimize for purity, measured using information gain or Gini index. These are concepts from information theory—not arbitrary choices.
 
-A robust grounding in mathematics and statistics elevates data science from a toolkit of shortcuts to a discipline of informed reasoning. When practitioners grasp the language of vectors, gradients, probabilities, and tests, they become adept at diagnosing model behavior, innovating new methods, and communicating results with credibility. Investing time in these core disciplines yields dividends: faster debugging, more reliable models, and the ability to adapt as algorithms and data evolve. In the evolving landscape of data science, theory remains the constant that empowers us to turn data into dependable knowledge.
+Neural networks, especially deep architectures, apply linear transformations followed by nonlinear activations. But their real power comes from composition: layer after layer, they approximate complex functions. And all of it hinges on the chain rule and gradient-based optimization.
+
+## 5. Common Mistakes from Skipping the Theory
+
+I’ve seen teams spend weeks tuning models that never converge—only to realize their features weren’t standardized. Or overlook multicollinearity in regression and wonder why coefficients fluctuate wildly. These aren’t advanced mistakes; they’re avoidable with the right foundation.
+
+Data leakage is another common trap. If your training and testing processes aren’t truly separated, your model performance will look artificially inflated. A good theoretical foundation teaches you to spot these issues before they blow up in production.
+
+And then there's hypothesis testing. Run a dozen tests without correction, and you’ll almost certainly find a “significant” result—whether it’s real or not. Without understanding false discovery rates or Bonferroni correction, you might be reporting noise as signal.
+
+## 6. How to Build Mathematical Intuition
+
+Theory isn’t something you “know”; it’s something you internalize. That takes time, effort, and exposure.
+
+Here’s what I recommend:
+
+- Take courses in linear algebra, calculus, probability, and real analysis—not just applied data science.
+- Derive things by hand: backpropagation, MLE, entropy formulas. It helps more than you think.
+- Build from scratch: reimplement PCA, logistic regression, or k-means using only NumPy. No shortcuts.
+- Read widely but deeply: Strang’s *Linear Algebra* and Casella & Berger’s *Statistical Inference* remain gold standards.
+- Teach what you learn. Explaining Bayes’ theorem to a colleague will highlight gaps you didn’t know you had.
+- And don’t be afraid to struggle. Learning theory is often non-linear, and plateaus are part of the process.
+
+## 7. Why It All Matters: Sustainability in Practice
+
+It’s easy to get caught up in the latest frameworks or model architectures. But trends change. What lasts is understanding.
+
+Practitioners who invest in theory are the ones who debug models faster, build more robust systems, and adapt when tools or datasets shift. They’re also the ones who communicate results with precision—because they know what their models are really doing.
+
+Data science isn’t just about prediction. It’s about reasoning under uncertainty. And that means we need the mathematical and statistical tools to reason well.