diff --git a/README.md b/README.md index 906892e9..520eb354 100644 --- a/README.md +++ b/README.md @@ -1,91 +1,62 @@ -# [Minimal Mistakes Jekyll theme](https://mmistakes.github.io/minimal-mistakes/) +# Personal Blog on Minimal Mistakes -[![LICENSE](https://img.shields.io/badge/license-MIT-blue.svg)](https://raw.githubusercontent.com/DiogoRibeiro7/DiogoRibeiro7.github.io/master/LICENSE) +[![LICENSE](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE) [![Hosted with GH Pages](https://img.shields.io/badge/Hosted_with-GitHub_Pages-blue?logo=github&logoColor=white)](https://pages.github.com/) [![Made with GH Actions](https://img.shields.io/badge/CI-GitHub_Actions-blue?logo=github-actions&logoColor=white)](https://github.com/features/actions) -[![Jekyll](https://img.shields.io/badge/jekyll-%3E%3D%204.3-blue.svg)](https://jekyllrb.com/) +This repository contains the source code for my website built with the +[Minimal Mistakes](https://mmistakes.github.io/minimal-mistakes/) Jekyll theme. +It also includes a few helper scripts to clean up Markdown front matter and +run tests. -[![Ruby Version](https://img.shields.io/badge/ruby-3.1-blue)](https://www.ruby-lang.org) -[![Ruby gem](https://img.shields.io/gem/v/minimal-mistakes-jekyll.svg)](https://rubygems.org/gems/minimal-mistakes-jekyll) +## Requirements - -Minimal Mistakes is a flexible two-column Jekyll theme, perfect for building personal sites, blogs, and portfolios. As the name implies, styling is purposely minimalistic to be enhanced and customized by you :smile:. - -## Setup - -This repository contains a few helper scripts for processing Markdown posts. -Install the Python dependencies listed in `requirements.txt` with: +The site relies on Ruby, Node and Python tooling. Install dependencies with: ```bash +# Python packages for helper scripts and tests pip install -r requirements.txt -``` -To work with the JavaScript that powers the theme you'll also need Node -dependencies. Install them with: - -```bash +# JavaScript dependencies npm install -``` - -This project uses **npm** for managing JavaScript dependencies and tracks -exact versions in `package-lock.json`. - -Bundled JavaScript is compiled from the source files in `assets/js/`. Run the -following to create `main.min.js` (minified with a banner) or watch for changes: - -```bash -npm run build:js # minify and add banner -npm run watch:js # optional: automatically rebuild on changes -``` - -## CSS linting - -Lint all SCSS files with [Stylelint](https://stylelint.io/): -```bash -npm run lint:css +# Ruby gems for Jekyll +bundle install ``` -## Local development +## Development -Install Ruby gems specified in the `Gemfile` with: +Use the following commands while working on the site: ```bash -bundle install -``` +# start a local server at http://localhost:4000/ +bundle exec jekyll serve -Serve the site locally with: +# rebuild JavaScript when files change +npm run watch:js -```bash -bundle exec jekyll serve +# lint stylesheets +npm run lint:css ``` +## Tests -## Running tests - -Install the Python dependencies and execute: +Front matter utilities are covered by a small `pytest` suite. Run the tests with: ```bash pytest ``` -GitHub Actions already runs these commands automatically during deployments. - -# ToDo - -~~Have a consistency in the font and font sizes (ideally you want to use 2 fonts. One for the header/subtitle and one for the text. 
You can use this kind of website https://fontjoy.com/ which allow you to pair fonts).~~ - -Choose a few main colours for your site (I would suggest black/white/grey but not in solid. You can also use this kind of site: https://coolors.co/palettes/popular/2a4849). -~~Reduce then size of the homepage top image (ideally you want your first articles to be visible on load and not hidden below the fold).~~ +GitHub Actions executes the same tests on every push. -~~Restyle your links (ideally the link should be back with no underline and you add a css style on hover)~~ +## Roadmap -~~Center pagination~~ +Planned improvements are organised as sprints in [ROADMAP.md](ROADMAP.md). +Highlights include: -~~Restyle your article detail page breadcrumbs. You want them to be less visible (I would suggest a light grey colour here)~~ +- refining typography and the colour palette +- restructuring the homepage with card‑style articles +- adding search and dark mode +- optimising performance and accessibility -Right now at the top of the detail page, you have your site breadcrumbs, a title then another title and the font sizes are a bit off and it is hard to understand the role of the second title. I would reorganise this to provide a better understanding to the reader -On the detail page, I would suggest you put the `You may also enjoy` straight at the end of the article. Right now it is after comments and you can lose engagement on your site. -I would suggest you remove your description from the detail page. I think having it on the home page is enough. You can have a smaller introduction if needed with a read more button or link that will take the reader to a full page description of yourself and your skillset. That will allow you to tell more about yourself and why you do what you do -I will create card article with a hover animation (add some shape and background colour and ideally a header image for the card. The graphs you show me last week for example.) +Contributions are welcome! diff --git a/_config.yml b/_config.yml index bcea374e..e6f3a9b1 100644 --- a/_config.yml +++ b/_config.yml @@ -86,7 +86,7 @@ facebook: username : app_id : publisher : -og_image : # Open Graph/Twitter default site image +og_image : /assets/images/data-science.png # For specifying social profiles # - https://developers.google.com/structured-data/customize/social-profiles social: @@ -196,7 +196,7 @@ exclude: - Rakefile - README - tmp - - /docs # ignore Minimal Mistakes /docs + # - /docs # ignore Minimal Mistakes /docs - /test # ignore Minimal Mistakes /test keep_files: - .git diff --git a/_data/navigation.yml b/_data/navigation.yml index 158f0dbb..ab598e32 100644 --- a/_data/navigation.yml +++ b/_data/navigation.yml @@ -19,4 +19,6 @@ main: # - title: "Sitemap" # url: /sitemap/ - title: "Code Download" - url: /code/ \ No newline at end of file + url: /code/ + - title: "Package Docs" + url: /docs/ diff --git a/_includes/archive-single.html b/_includes/archive-single.html index 37b8ec0e..5ebed744 100644 --- a/_includes/archive-single.html +++ b/_includes/archive-single.html @@ -11,12 +11,12 @@ {% endif %}
-
- {% if include.type == "grid" and teaser %} -
- {{ post.title }} -
- {% endif %} +
+{% if teaser %} +
+ {{ post.title }} +
+{% endif %}

{% if post.link %} {{ title }} Permalink diff --git a/_includes/head.html b/_includes/head.html index 2e166797..415a0a7f 100644 --- a/_includes/head.html +++ b/_includes/head.html @@ -20,7 +20,7 @@ - + {% if site.head_scripts %} {% for script in site.head_scripts %} diff --git a/_includes/page__hero.html b/_includes/page__hero.html index add43650..8d1b94ac 100644 --- a/_includes/page__hero.html +++ b/_includes/page__hero.html @@ -23,16 +23,16 @@ > {% if page.header.overlay_color or page.header.overlay_image %}
- +

{% if page.tagline %}

{{ page.tagline | markdownify | remove: "

" | remove: "

" }}

- {% elsif page.header.show_overlay_excerpt != false and page.excerpt %} + {% elsif page.header.show_overlay_excerpt != false and page.excerpt and page.collection != 'posts' %}

{{ page.excerpt | markdownify | remove: "

" | remove: "

" }}

{% endif %} {% include page__meta.html %} diff --git a/_layouts/single.html b/_layouts/single.html index ce76b699..47bca170 100644 --- a/_layouts/single.html +++ b/_layouts/single.html @@ -22,10 +22,12 @@ {% include sidebar.html %}
+ {% unless page.header.overlay_color or page.header.overlay_image %} {% if page.title %}

{% endif %} + {% endunless %} {% if page.subtitle %}

{{ page.subtitle }}

{% endif %} {% if page.excerpt %} @@ -108,4 +110,4 @@
\ No newline at end of file + diff --git a/_pages/code-download.md b/_pages/code-download.md new file mode 100644 index 00000000..9f7c64dc --- /dev/null +++ b/_pages/code-download.md @@ -0,0 +1,29 @@ +--- +layout: single +title: "Code Download" +permalink: /code/ +author_profile: true +--- + +This page provides convenient links to code examples referenced throughout the blog. The full source for this website, including these examples, is available on [GitHub](https://github.com/DiogoRibeiro7/DiogoRibeiro7.github.io). + +## Individual files + +You can download specific files directly from the repository: + +- [Michelson–Morley experiment visualization]({{ '/code/michelson_morley.py' | relative_url }}) +- [Example notebook]({{ '/code/Untitled.ipynb' | relative_url }}) + +## Repository download + +If you would like to obtain the entire collection of examples, clone the repository using Git or grab the automatic ZIP archive: + +```bash +git clone https://github.com/DiogoRibeiro7/DiogoRibeiro7.github.io.git +``` + +Or visit [the GitHub project page](https://github.com/DiogoRibeiro7/DiogoRibeiro7.github.io) and use **Download ZIP**. + +--- + +Feel free to reach out at [dfr@esmad.ipp.pt](mailto:dfr@esmad.ipp.pt) for professional enquiries or [diogo.debastos.ribeiro@gmail.com](mailto:diogo.debastos.ribeiro@gmail.com) for personal matters. My ORCID is [0009-0001-2022-7072](https://orcid.org/0009-0001-2022-7072) and I’m affiliated with **ESMAD - Instituto Politécnico do Porto**. diff --git a/_pages/package-docs.md b/_pages/package-docs.md new file mode 100644 index 00000000..a45cf297 --- /dev/null +++ b/_pages/package-docs.md @@ -0,0 +1,16 @@ +--- +layout: single +title: "Package Documentation" +permalink: /docs/ +author_profile: true +--- + +This page collects links to documentation for the packages I maintain. When available, documentation is hosted online. Otherwise it will be stored directly in this repository. + +## Packages + +- [**frontmatter**]({{ '/docs/frontmatter/' | relative_url }}) + +--- + +Feel free to reach out at [dfr@esmad.ipp.pt](mailto:dfr@esmad.ipp.pt) for professional enquiries or [diogo.debastos.ribeiro@gmail.com](mailto:diogo.debastos.ribeiro@gmail.com) for personal matters. My ORCID is [0009-0001-2022-7072](https://orcid.org/0009-0001-2022-7072) and I’m affiliated with **ESMAD - Instituto Politécnico do Porto**. diff --git a/_posts/2020-11-05-probability_theory_basics.md b/_posts/2020-11-05-probability_theory_basics.md index 4ecfa29e..34251028 100644 --- a/_posts/2020-11-05-probability_theory_basics.md +++ b/_posts/2020-11-05-probability_theory_basics.md @@ -4,7 +4,8 @@ categories: - Statistics classes: wide date: '2020-11-05' -excerpt: An introduction to probability theory concepts every data scientist should know. +excerpt: An introduction to probability theory concepts every data scientist should + know. header: image: /assets/images/data_science_10.jpg og_image: /assets/images/data_science_10.jpg @@ -17,15 +18,17 @@ keywords: - Random variables - Distributions - Data science -seo_description: Learn the core principles of probability theory, from random variables to common distributions, with practical examples for data science. -seo_title: 'Probability Theory Basics for Data Science' +seo_description: Learn the core principles of probability theory, from random variables + to common distributions, with practical examples for data science. 
+seo_title: Probability Theory Basics for Data Science seo_type: article -summary: This post reviews essential probability concepts like random variables, expectation, and common distributions, illustrating how they underpin data science workflows. +summary: This post reviews essential probability concepts like random variables, expectation, + and common distributions, illustrating how they underpin data science workflows. tags: - Probability - Statistics - Data science -title: 'Probability Theory Basics for Data Science' +title: Probability Theory Basics for Data Science --- Probability theory provides the mathematical foundation for modeling uncertainty. By understanding random variables and probability distributions, data scientists can quantify risks and make informed decisions. diff --git a/_posts/2020-11-10-simple_linear_regression_intro.md b/_posts/2020-11-10-simple_linear_regression_intro.md index eb1e5b6b..56fb748f 100644 --- a/_posts/2020-11-10-simple_linear_regression_intro.md +++ b/_posts/2020-11-10-simple_linear_regression_intro.md @@ -4,7 +4,8 @@ categories: - Data Science classes: wide date: '2020-11-10' -excerpt: Understand how simple linear regression models the relationship between two variables using a single predictor. +excerpt: Understand how simple linear regression models the relationship between two + variables using a single predictor. header: image: /assets/images/data_science_11.jpg og_image: /assets/images/data_science_11.jpg @@ -16,15 +17,17 @@ keywords: - Linear regression - Least squares - Data analysis -seo_description: Discover the mechanics of simple linear regression and how to interpret slope and intercept when fitting a straight line to data. -seo_title: 'A Primer on Simple Linear Regression' +seo_description: Discover the mechanics of simple linear regression and how to interpret + slope and intercept when fitting a straight line to data. +seo_title: A Primer on Simple Linear Regression seo_type: article -summary: This article introduces simple linear regression and the least squares method, showing how a single predictor explains variation in a response variable. +summary: This article introduces simple linear regression and the least squares method, + showing how a single predictor explains variation in a response variable. tags: - Regression - Statistics - Data science -title: 'A Primer on Simple Linear Regression' +title: A Primer on Simple Linear Regression --- Simple linear regression is a foundational technique for modeling the relationship between a predictor variable and a response variable. By fitting a straight line, we can quantify how changes in one variable are associated with changes in another. diff --git a/_posts/2020-11-20-bayesian_inference_basics.md b/_posts/2020-11-20-bayesian_inference_basics.md index f1e5057f..74213e87 100644 --- a/_posts/2020-11-20-bayesian_inference_basics.md +++ b/_posts/2020-11-20-bayesian_inference_basics.md @@ -4,7 +4,8 @@ categories: - Statistics classes: wide date: '2020-11-20' -excerpt: Explore the fundamentals of Bayesian inference and how prior beliefs combine with data to form posterior conclusions. +excerpt: Explore the fundamentals of Bayesian inference and how prior beliefs combine + with data to form posterior conclusions. 
header: image: /assets/images/data_science_12.jpg og_image: /assets/images/data_science_12.jpg @@ -17,15 +18,17 @@ keywords: - Priors - Posterior distributions - Data science -seo_description: An overview of Bayesian inference, demonstrating how to update prior beliefs with new evidence to make data-driven decisions. -seo_title: 'Bayesian Inference Explained' +seo_description: An overview of Bayesian inference, demonstrating how to update prior + beliefs with new evidence to make data-driven decisions. +seo_title: Bayesian Inference Explained seo_type: article -summary: Learn how Bayesian inference updates prior beliefs into posterior distributions, providing a flexible framework for reasoning under uncertainty. +summary: Learn how Bayesian inference updates prior beliefs into posterior distributions, + providing a flexible framework for reasoning under uncertainty. tags: - Bayesian - Inference - Statistics -title: 'Bayesian Inference Explained' +title: Bayesian Inference Explained --- Bayesian inference offers a powerful perspective on probability, treating unknown quantities as distributions that update when new evidence appears. diff --git a/_posts/2020-11-25-hypothesis_testing_real_world_applications.md b/_posts/2020-11-25-hypothesis_testing_real_world_applications.md index 94c20492..c8eee2a1 100644 --- a/_posts/2020-11-25-hypothesis_testing_real_world_applications.md +++ b/_posts/2020-11-25-hypothesis_testing_real_world_applications.md @@ -4,7 +4,8 @@ categories: - Statistics classes: wide date: '2020-11-25' -excerpt: See how hypothesis testing helps draw meaningful conclusions from data in practical scenarios. +excerpt: See how hypothesis testing helps draw meaningful conclusions from data in + practical scenarios. header: image: /assets/images/data_science_13.jpg og_image: /assets/images/data_science_13.jpg @@ -17,15 +18,18 @@ keywords: - P-values - Significance - Data science -seo_description: Learn how to apply hypothesis tests in real-world analyses and avoid common pitfalls when interpreting p-values and confidence levels. -seo_title: 'Applying Hypothesis Testing in the Real World' +seo_description: Learn how to apply hypothesis tests in real-world analyses and avoid + common pitfalls when interpreting p-values and confidence levels. +seo_title: Applying Hypothesis Testing in the Real World seo_type: article -summary: This post walks through frequentist hypothesis testing, showing how to formulate null and alternative hypotheses and interpret the results in practical data science tasks. +summary: This post walks through frequentist hypothesis testing, showing how to formulate + null and alternative hypotheses and interpret the results in practical data science + tasks. tags: - Hypothesis testing - Statistics - Experiments -title: 'Applying Hypothesis Testing in the Real World' +title: Applying Hypothesis Testing in the Real World --- Hypothesis testing allows data scientists to objectively assess whether an observed pattern is likely due to chance or reflects a genuine effect. diff --git a/_posts/2020-11-30-data_visualization_best_practices.md b/_posts/2020-11-30-data_visualization_best_practices.md index 6698e128..7caa210f 100644 --- a/_posts/2020-11-30-data_visualization_best_practices.md +++ b/_posts/2020-11-30-data_visualization_best_practices.md @@ -4,7 +4,8 @@ categories: - Data Science classes: wide date: '2020-11-30' -excerpt: Discover best practices for creating clear and compelling data visualizations that communicate insights effectively. 
+excerpt: Discover best practices for creating clear and compelling data visualizations + that communicate insights effectively. header: image: /assets/images/data_science_14.jpg og_image: /assets/images/data_science_14.jpg @@ -17,15 +18,17 @@ keywords: - Charts - Communication - Best practices -seo_description: Guidelines for selecting chart types, choosing colors, and avoiding clutter when visualizing data for stakeholders. -seo_title: 'Data Visualization Best Practices' +seo_description: Guidelines for selecting chart types, choosing colors, and avoiding + clutter when visualizing data for stakeholders. +seo_title: Data Visualization Best Practices seo_type: article -summary: Learn how to design effective visualizations by focusing on clarity, appropriate chart selection, and thoughtful use of color and labels. +summary: Learn how to design effective visualizations by focusing on clarity, appropriate + chart selection, and thoughtful use of color and labels. tags: - Visualization - Data science - Communication -title: 'Data Visualization Best Practices' +title: Data Visualization Best Practices --- Effective data visualization bridges the gap between complex datasets and human understanding. Following proven design principles ensures that your charts highlight the important messages without distractions. diff --git a/_posts/2021-10-05-data_preprocessing_pipelines.md b/_posts/2021-10-05-data_preprocessing_pipelines.md index 8c6e86ed..5ebd8890 100644 --- a/_posts/2021-10-05-data_preprocessing_pipelines.md +++ b/_posts/2021-10-05-data_preprocessing_pipelines.md @@ -4,7 +4,8 @@ categories: - Data Science classes: wide date: '2021-10-05' -excerpt: Learn how to design robust data preprocessing pipelines that prepare raw data for modeling. +excerpt: Learn how to design robust data preprocessing pipelines that prepare raw + data for modeling. header: image: /assets/images/data_science_6.jpg og_image: /assets/images/data_science_6.jpg @@ -17,10 +18,12 @@ keywords: - Pipelines - Data cleaning - Feature engineering -seo_description: Discover best practices for building reusable data preprocessing pipelines that handle missing values, encoding, and feature scaling. +seo_description: Discover best practices for building reusable data preprocessing + pipelines that handle missing values, encoding, and feature scaling. seo_title: Building Data Preprocessing Pipelines for Reliable Models seo_type: article -summary: This post outlines the key steps in constructing data preprocessing pipelines using tools like scikit-learn to ensure consistent model inputs. +summary: This post outlines the key steps in constructing data preprocessing pipelines + using tools like scikit-learn to ensure consistent model inputs. tags: - Data preprocessing - Machine learning diff --git a/_posts/2021-10-15-decision_tree_algorithms.md b/_posts/2021-10-15-decision_tree_algorithms.md index 303a4fdd..994eceec 100644 --- a/_posts/2021-10-15-decision_tree_algorithms.md +++ b/_posts/2021-10-15-decision_tree_algorithms.md @@ -4,7 +4,8 @@ categories: - Machine Learning classes: wide date: '2021-10-15' -excerpt: Understand how decision tree algorithms split data and how pruning improves generalization. +excerpt: Understand how decision tree algorithms split data and how pruning improves + generalization. 
header: image: /assets/images/data_science_7.jpg og_image: /assets/images/data_science_7.jpg @@ -17,10 +18,12 @@ keywords: - Classification - Tree pruning - Machine learning -seo_description: Learn the mechanics of decision tree algorithms, including entropy-based splits and pruning techniques that prevent overfitting. +seo_description: Learn the mechanics of decision tree algorithms, including entropy-based + splits and pruning techniques that prevent overfitting. seo_title: How Decision Trees Work and Why Pruning Matters seo_type: article -summary: This article walks through the basics of decision tree construction and explains common pruning methods to create better models. +summary: This article walks through the basics of decision tree construction and explains + common pruning methods to create better models. tags: - Decision trees - Classification diff --git a/_posts/2021-11-10-model_evaluation_metrics.md b/_posts/2021-11-10-model_evaluation_metrics.md index f8820a24..64172963 100644 --- a/_posts/2021-11-10-model_evaluation_metrics.md +++ b/_posts/2021-11-10-model_evaluation_metrics.md @@ -18,14 +18,16 @@ keywords: - Precision - Recall - Regression metrics -seo_description: A concise overview of essential metrics like precision, recall, F1-score, and RMSE for measuring model performance. +seo_description: A concise overview of essential metrics like precision, recall, F1-score, + and RMSE for measuring model performance. seo_title: Essential Metrics for Evaluating Machine Learning Models seo_type: article -summary: Learn how to interpret common classification and regression metrics to choose the best model for your data. +summary: Learn how to interpret common classification and regression metrics to choose + the best model for your data. tags: - Accuracy - F1-score -- RMSE +- Rmse title: A Guide to Model Evaluation Metrics --- diff --git a/_posts/2022-10-15-time_series_decomposition.md b/_posts/2022-10-15-time_series_decomposition.md index 5c8a0cc5..9859077b 100644 --- a/_posts/2022-10-15-time_series_decomposition.md +++ b/_posts/2022-10-15-time_series_decomposition.md @@ -5,7 +5,8 @@ categories: - Time Series classes: wide date: '2022-10-15' -excerpt: Learn how time series decomposition reveals trend, seasonality, and residual components for clearer forecasting insights. +excerpt: Learn how time series decomposition reveals trend, seasonality, and residual + components for clearer forecasting insights. header: image: /assets/images/data_science_12.jpg og_image: /assets/images/data_science_12.jpg @@ -19,10 +20,12 @@ keywords: - Seasonality - Forecasting - Decomposition -seo_description: Discover how to separate trend and seasonal patterns from a time series using additive or multiplicative decomposition. -seo_title: 'Time Series Decomposition Made Simple' +seo_description: Discover how to separate trend and seasonal patterns from a time + series using additive or multiplicative decomposition. +seo_title: Time Series Decomposition Made Simple seo_type: article -summary: This article explains how decomposing a time series helps isolate long-term trends and recurring seasonal effects so you can model data more effectively. +summary: This article explains how decomposing a time series helps isolate long-term + trends and recurring seasonal effects so you can model data more effectively. 
tags: - Time series - Forecasting diff --git a/_posts/2025-01-31-nonlinear_growth_models_macroeconomics.md b/_posts/2025-01-31-nonlinear_growth_models_macroeconomics.md index 294304f7..0e2edbc4 100644 --- a/_posts/2025-01-31-nonlinear_growth_models_macroeconomics.md +++ b/_posts/2025-01-31-nonlinear_growth_models_macroeconomics.md @@ -96,7 +96,7 @@ $$ \dot{A} = \phi A^\beta L_A $$ -Where \( \beta > 1 \) leads to accelerating technological growth, while \( \beta < 1 \) introduces convergence or stagnation risks. +Where $$ \beta > 1 $$ leads to accelerating technological growth, while $$ \beta < 1 $$ introduces convergence or stagnation risks. --- author_profile: false diff --git a/_posts/2025-02-02-time_series_forecasting_sarima_seasonal_arima_explained.md b/_posts/2025-02-02-time_series_forecasting_sarima_seasonal_arima_explained.md index bf6557aa..05547e8c 100644 --- a/_posts/2025-02-02-time_series_forecasting_sarima_seasonal_arima_explained.md +++ b/_posts/2025-02-02-time_series_forecasting_sarima_seasonal_arima_explained.md @@ -80,9 +80,9 @@ $$ Where: -- \( p \): Number of autoregressive terms -- \( d \): Number of differencing operations -- \( q \): Number of moving average terms +- $$ p $$: Number of autoregressive terms +- $$ d $$: Number of differencing operations +- $$ q $$: Number of moving average terms While ARIMA works well for many datasets, it does not explicitly model **seasonal structure**. For example, monthly sales data may show a 12-month cycle, which ARIMA cannot capture directly. @@ -96,9 +96,9 @@ $$ Where: -- \( p, d, q \): Non-seasonal ARIMA parameters -- \( P, D, Q \): Seasonal AR, differencing, and MA orders -- \( s \): Seasonality period (e.g., 12 for monthly data with yearly seasonality) +- $$ p, d, q $$: Non-seasonal ARIMA parameters +- $$ P, D, Q $$: Seasonal AR, differencing, and MA orders +- $$ s $$: Seasonality period (e.g., 12 for monthly data with yearly seasonality) For example: @@ -128,15 +128,15 @@ $$ \Phi(B^s) \phi(B) (1 - B)^d (1 - B^s)^D y_t = \Theta(B^s) \theta(B) \varepsilon_t $$ -Where \( \varepsilon_t \) is white noise. +Where $$ \varepsilon_t $$ is white noise. ## 5. Parameter Selection: Seasonal and Non-Seasonal -### Step 1: Seasonal Period \( s \) +### Step 1: Seasonal Period $$ s $$ Choose based on frequency (e.g., 12 for monthly). -### Step 2: Differencing \( d \), \( D \) +### Step 2: Differencing $$ d $$, $$ D $$ Use plots and ADF tests to determine. @@ -144,8 +144,8 @@ Use plots and ADF tests to determine. Use ACF and PACF plots to estimate: -- \( p, q \) for non-seasonal -- \( P, Q \) for seasonal +- $$ p, q $$ for non-seasonal +- $$ P, Q $$ for seasonal ### Step 4: Use Auto ARIMA (Python) diff --git a/_posts/2025-05-01-agentbased_models_abm_macroeconomics_mathematical_perspective.md b/_posts/2025-05-01-agentbased_models_abm_macroeconomics_mathematical_perspective.md index 505c4e85..23d10e3c 100644 --- a/_posts/2025-05-01-agentbased_models_abm_macroeconomics_mathematical_perspective.md +++ b/_posts/2025-05-01-agentbased_models_abm_macroeconomics_mathematical_perspective.md @@ -62,20 +62,20 @@ In macroeconomics, ABMs can simulate the evolution of the economy through the in Although agent-based models are primarily computational, they rest on well-defined mathematical components. 
A typical ABM can be formalized as a discrete-time dynamical system: -Let the system state at time \( t \) be denoted as: +Let the system state at time $$ t $$ be denoted as: $$ S_t = \{a_{1,t}, a_{2,t}, ..., a_{N,t}\} $$ -where \( a_{i,t} \) represents the state of agent \( i \) at time \( t \), and \( N \) is the total number of agents. +where $$ a_{i,t} $$ represents the state of agent $$ i $$ at time $$ t $$, and $$ N $$ is the total number of agents. ### 1. **Agent State and Behavior Functions** Each agent has: -- A **state vector** \( a_{i,t} \in \mathbb{R}^k \) representing variables such as wealth, consumption, productivity, etc. -- A **decision function** \( f_i: S_t \rightarrow \mathbb{R}^k \) that determines how the agent updates its state: +- A **state vector** $$ a_{i,t} \in \mathbb{R}^k $$ representing variables such as wealth, consumption, productivity, etc. +- A **decision function** $$ f_i: S_t \rightarrow \mathbb{R}^k $$ that determines how the agent updates its state: $$ a_{i,t+1} = f_i(a_{i,t}, \mathcal{E}_t, \mathcal{I}_{i,t}) @@ -83,8 +83,8 @@ $$ Where: -- \( \mathcal{E}_t \) is the macro environment (e.g., interest rates, inflation) -- \( \mathcal{I}_{i,t} \) is local information accessible to the agent +- $$ \mathcal{E}_t $$ is the macro environment (e.g., interest rates, inflation) +- $$ \mathcal{I}_{i,t} $$ is local information accessible to the agent ### 2. **Interaction Structure** @@ -94,7 +94,7 @@ Agents may interact through a **network topology**, such as: - Small-world or scale-free networks - Spatial lattices -These interactions define information flow and market exchanges. Let \( G = (V, E) \) be a graph with nodes \( V \) representing agents and edges \( E \) representing communication or trade links. +These interactions define information flow and market exchanges. Let $$ G = (V, E) $$ be a graph with nodes $$ V $$ representing agents and edges $$ E $$ representing communication or trade links. ### 3. **Environment and Aggregation** @@ -104,7 +104,7 @@ $$ \mathcal{E}_{t+1} = g(S_t) $$ -Where \( g \) is a function that computes macro variables (e.g., GDP, inflation, aggregate demand) from the microstate \( S_t \). This allows for **micro-to-macro feedback loops**. +Where $$ g $$ is a function that computes macro variables (e.g., GDP, inflation, aggregate demand) from the microstate $$ S_t $$. This allows for **micro-to-macro feedback loops**. ## Key Features of ABMs in Macroeconomics diff --git a/_posts/2025-06-06-exploratory_data_analysis_intro.md b/_posts/2025-06-06-exploratory_data_analysis_intro.md index 3f7f4690..d3b38a7b 100644 --- a/_posts/2025-06-06-exploratory_data_analysis_intro.md +++ b/_posts/2025-06-06-exploratory_data_analysis_intro.md @@ -4,7 +4,8 @@ categories: - Data Science classes: wide date: '2025-06-06' -excerpt: Discover the essential steps of Exploratory Data Analysis (EDA) and how to gain insights from your data before building models. +excerpt: Discover the essential steps of Exploratory Data Analysis (EDA) and how to + gain insights from your data before building models. header: image: /assets/images/data_science_5.jpg og_image: /assets/images/data_science_5.jpg @@ -18,16 +19,19 @@ keywords: - Python - Pandas - Data cleaning -seo_description: Learn the fundamentals of Exploratory Data Analysis using Python, including data cleaning, visualization, and summary statistics. 
-seo_title: "Beginner's Guide to Exploratory Data Analysis (EDA)" +seo_description: Learn the fundamentals of Exploratory Data Analysis using Python, + including data cleaning, visualization, and summary statistics. +seo_title: Beginner's Guide to Exploratory Data Analysis (EDA) seo_type: article -summary: This guide covers the core principles of Exploratory Data Analysis, demonstrating how to inspect, clean, and visualize datasets to uncover patterns and inform subsequent modeling steps. +summary: This guide covers the core principles of Exploratory Data Analysis, demonstrating + how to inspect, clean, and visualize datasets to uncover patterns and inform subsequent + modeling steps. tags: -- EDA +- Eda - Data science - Python - Visualization -title: "Exploratory Data Analysis: A Beginner's Guide" +title: 'Exploratory Data Analysis: A Beginner''s Guide' --- Exploratory Data Analysis (EDA) is the process of examining a dataset to understand its main characteristics before applying more formal statistical modeling or machine learning. By exploring your data upfront, you can identify patterns, spot anomalies, and test assumptions that might otherwise go unnoticed. diff --git a/_posts/2025-06-07-why_math_statistics_foundations_data_science.md b/_posts/2025-06-07-why_math_statistics_foundations_data_science.md index 9c4a84eb..4ce51307 100644 --- a/_posts/2025-06-07-why_math_statistics_foundations_data_science.md +++ b/_posts/2025-06-07-why_math_statistics_foundations_data_science.md @@ -4,7 +4,8 @@ categories: - Data Science classes: wide date: '2025-06-07' -excerpt: Mastering mathematics and statistics is essential for understanding data science algorithms and avoiding common pitfalls when building models. +excerpt: Mastering mathematics and statistics is essential for understanding data + science algorithms and avoiding common pitfalls when building models. header: image: /assets/images/data_science_10.jpg og_image: /assets/images/data_science_10.jpg @@ -17,41 +18,83 @@ keywords: - Statistics fundamentals - Machine learning theory - Algorithms -seo_description: Explore why a solid grasp of math and statistics is crucial for data scientists and how ignoring the underlying theory can lead to faulty models. +seo_description: Explore why a solid grasp of math and statistics is crucial for data + scientists and how ignoring the underlying theory can lead to faulty models. seo_title: 'Math and Statistics: The Bedrock of Data Science' seo_type: article -summary: To excel in data science, you need more than coding skills. This article explains how mathematics and statistics underpin popular algorithms and why understanding them prevents costly mistakes. +summary: To excel in data science, you need more than coding skills. This article + explains how mathematics and statistics underpin popular algorithms and why understanding + them prevents costly mistakes. tags: - Mathematics - Statistics - Machine learning - Data science - Algorithms -title: 'Why Data Scientists Need Math and Statistics' +title: Why Data Scientists Need Math and Statistics --- -A common misconception is that data science is mostly about applying libraries and frameworks. While tools are helpful, they cannot replace a solid understanding of **mathematics** and **statistics**. These disciplines provide the language and theory that power every algorithm behind the scenes. 
+It’s tempting—especially in fast-paced learning environments—to believe that knowing a few libraries like pandas, Scikit-Learn, or TensorFlow is enough to be a data scientist. I’ve seen students and even early-career professionals fall into this trap. But here’s the reality: these tools are just scaffolding. The actual structure is built on mathematics and statistics. Without understanding what’s happening under the hood, you’re not doing science—you’re executing recipes. -## The Role of Mathematics +## 1. More Than Code: Why Theory Matters -At the core of many machine learning algorithms are mathematical concepts such as **linear algebra** and **calculus**. Linear algebra explains how models handle vectors and matrices, enabling operations like matrix decomposition and gradient calculations. Calculus is vital for understanding optimization techniques that drive model training. Without these foundations, it is difficult to grasp how algorithms converge or why they sometimes fail to do so. +In practice, it’s easy to mistake familiarity with libraries for mastery of data science. You might be able to build a random forest or train a neural network—but what happens when things go wrong? What if the model overfits? What if convergence is painfully slow? These questions demand answers that only theory can provide. If you don’t understand the assumptions baked into an algorithm—how it generalizes, when it breaks—you’ll struggle to debug, optimize, or improve it. -## Why Statistics Matters +A data scientist who understands the mathematics isn’t just pushing buttons. They’re reasoning, experimenting, and innovating. And when things go off-track (as they often do), it’s theory that guides them back. -Statistics helps data scientists quantify uncertainty, draw reliable conclusions, and validate models. Techniques like **hypothesis testing**, **confidence intervals**, and **probability distributions** reveal whether observed patterns are significant or simply random noise. Lacking statistical insight can lead to overfitting or underestimating model errors. +## 2. Linear Algebra and Calculus: The Engines Behind the Algorithms -## Understanding Algorithms Beyond Code +Take linear algebra. It’s not just about matrix multiplication. When I teach Principal Component Analysis (PCA), I start with singular value decomposition. Why? Because understanding how data can be decomposed into principal directions of variance isn't just intellectually satisfying—it directly informs better feature engineering, dimensionality reduction, and model interpretation. -Popular algorithms—such as decision trees, regression models, and neural networks—are built on mathematical principles. Knowing the theory behind them clarifies their assumptions and limitations. Blindly applying a model without understanding its mechanics can produce misleading results, especially when the data violates those assumptions. +Similarly, eigenvalues and eigenvectors aren’t abstract constructs. They reveal the structure of your data’s covariance matrix and help explain why certain transformations work and others don’t. These aren’t bonus concepts; they’re critical tools. -## The Pitfalls of Ignoring Theory +Calculus, on the other hand, gives us the language of change. It underpins every optimization routine in machine learning. Gradient descent, for example, is everywhere—from linear regression to deep learning. 
But if you don’t understand gradients, partial derivatives, or what a Hessian matrix tells you about curvature, then tuning hyperparameters like learning rate becomes guesswork. -When the underlying mathematics is ignored, it becomes challenging to debug models, tune hyperparameters, or interpret outcomes. Relying solely on automated tools may produce working code, but it often masks fundamental issues like data leakage, improper scaling, or incorrect loss functions. These mistakes can have severe consequences in real-world applications. +Let’s take Newton’s method: it’s beautiful in theory, and incredibly efficient when it works. But without understanding second-order derivatives, you’d never know when and why it might fail. -## Building a Strong Foundation +## 3. Statistics: Measuring Uncertainty with Rigor -Learning the basics of calculus, linear algebra, and statistics does not require becoming a mathematician. However, dedicating time to these topics builds intuition about how models work. This deeper knowledge empowers data scientists to select appropriate algorithms, customize them for specific problems, and communicate results effectively. +Every dataset is finite. Every conclusion you draw is uncertain to some degree. That’s where statistics steps in—not to complicate things, but to quantify your confidence. Whether you're calculating a confidence interval or running a hypothesis test, you're trying to understand how much trust to put in your results. -## Conclusion +Let’s say you're estimating the mean income of a population. You take a sample, compute a mean, and construct a 95% confidence interval. If you don’t understand where that interval comes from, or what assumptions underlie its validity, your conclusions might mislead—even if the math is technically correct. -Data science thrives on a solid grounding in mathematics and statistics. Understanding the theory behind algorithms not only improves model performance but also safeguards against hidden errors. Investing in these fundamentals is essential for anyone aspiring to be a competent data scientist. +Maximum Likelihood Estimation (MLE) is another workhorse technique. It's not enough to know how to plug into a library function. Why does the log-likelihood simplify things? Why is it often convex in the parameters? These are the kinds of questions that separate competent modelers from algorithmic operators. + +And then there’s model validation. Cross-validation isn't just a checklist item—it’s your safeguard against overfitting. But its effectiveness depends on your understanding of sampling, bias-variance tradeoff, and variance estimation. I always remind my students: good results don’t mean good models. They might mean your test data leaked into training. + +## 4. Algorithms as Mathematical Objects + +Every algorithm is built on theory—linear regression, decision trees, k-means clustering, support vector machines, neural networks. What changes is the mathematical lens through which we view them. + +Linear regression isn’t just a line of best fit. It’s an estimator with assumptions: normality of errors, independence, constant variance. If those assumptions are violated, inference becomes unreliable. Decision trees optimize for purity, measured using information gain or Gini index. These are concepts from information theory—not arbitrary choices. + +Neural networks, especially deep architectures, apply linear transformations followed by nonlinear activations. 
But their real power comes from composition: layer after layer, they approximate complex functions. And all of it hinges on the chain rule and gradient-based optimization. + +## 5. Common Mistakes from Skipping the Theory + +I’ve seen teams spend weeks tuning models that never converge—only to realize their features weren’t standardized. Or overlook multicollinearity in regression and wonder why coefficients fluctuate wildly. These aren’t advanced mistakes; they’re avoidable with the right foundation. + +Data leakage is another common trap. If your training and testing processes aren’t truly separated, your model performance will look artificially inflated. A good theoretical foundation teaches you to spot these issues before they blow up in production. + +And then there's hypothesis testing. Run a dozen tests without correction, and you’ll almost certainly find a “significant” result—whether it’s real or not. Without understanding false discovery rates or Bonferroni correction, you might be reporting noise as signal. + +## 6. How to Build Mathematical Intuition + +Theory isn’t something you “know”; it’s something you internalize. That takes time, effort, and exposure. + +Here’s what I recommend: + +- Take courses in linear algebra, calculus, probability, and real analysis—not just applied data science. +- Derive things by hand: backpropagation, MLE, entropy formulas. It helps more than you think. +- Build from scratch: reimplement PCA, logistic regression, or k-means using only NumPy. No shortcuts. +- Read widely but deeply: Strang’s *Linear Algebra* and Casella & Berger’s *Statistical Inference* remain gold standards. +- Teach what you learn. Explaining Bayes’ theorem to a colleague will highlight gaps you didn’t know you had. +- And don’t be afraid to struggle. Learning theory is often non-linear, and plateaus are part of the process. + +## 7. Why It All Matters: Sustainability in Practice + +It’s easy to get caught up in the latest frameworks or model architectures. But trends change. What lasts is understanding. + +Practitioners who invest in theory are the ones who debug models faster, build more robust systems, and adapt when tools or datasets shift. They’re also the ones who communicate results with precision—because they know what their models are really doing. + +Data science isn’t just about prediction. It’s about reasoning under uncertainty. And that means we need the mathematical and statistical tools to reason well. diff --git a/_posts/2025-06-08-data_visualization_tools.md b/_posts/2025-06-08-data_visualization_tools.md deleted file mode 100644 index bbd77444..00000000 --- a/_posts/2025-06-08-data_visualization_tools.md +++ /dev/null @@ -1,46 +0,0 @@ ---- -author_profile: false -categories: -- Data Science -classes: wide -date: '2025-06-08' -excerpt: Explore top data visualization tools that help analysts turn raw numbers into compelling stories. -header: - image: /assets/images/data_science_11.jpg - og_image: /assets/images/data_science_11.jpg - overlay_image: /assets/images/data_science_11.jpg - show_overlay_excerpt: false - teaser: /assets/images/data_science_11.jpg - twitter_image: /assets/images/data_science_11.jpg -keywords: -- Data visualization tools -- Dashboards -- Charts -- Reporting -seo_description: Learn about popular data visualization tools and how they aid in communicating insights from complex datasets. 
-seo_title: 'Data Visualization Tools for Modern Analysts' -seo_type: article -summary: This article reviews leading data visualization platforms and libraries, highlighting their strengths for EDA and reporting. -tags: -- Visualization -- Dashboards -- Reporting -- Data science -title: 'Data Visualization Tools for Modern Data Science' ---- - -Data visualization bridges the gap between raw numbers and actionable insights. With the right toolset, analysts can transform spreadsheets and databases into engaging charts and dashboards that reveal hidden patterns. - -## 1. Matplotlib and Seaborn - -These Python libraries are the bread and butter of many data scientists. Matplotlib offers low-level control for creating virtually any chart, while Seaborn builds on top of it with sensible defaults and statistical plots. - -## 2. Plotly and Bokeh - -For interactive web-based visualizations, Plotly and Bokeh stand out. They enable dynamic charts that allow users to zoom, hover, and filter, making presentations more engaging and informative. - -## 3. Tableau and Power BI - -When you need to share results with non-technical stakeholders, business intelligence tools like Tableau and Power BI offer drag-and-drop interfaces and polished dashboards. They integrate well with various data sources and support advanced analytics extensions. - -Effective visualization helps convey complex analyses in a format that anyone can understand. Choosing the right tool depends on the audience, data type, and level of interactivity required. diff --git a/_posts/2025-06-09-feature_engineering_time_series.md b/_posts/2025-06-09-feature_engineering_time_series.md index c559a14c..a930720f 100644 --- a/_posts/2025-06-09-feature_engineering_time_series.md +++ b/_posts/2025-06-09-feature_engineering_time_series.md @@ -4,7 +4,8 @@ categories: - Machine Learning classes: wide date: '2025-06-09' -excerpt: Learn specialized feature engineering techniques to make time series data more predictive for machine learning models. +excerpt: Learn specialized feature engineering techniques to make time series data + more predictive for machine learning models. header: image: /assets/images/data_science_12.jpg og_image: /assets/images/data_science_12.jpg @@ -17,30 +18,198 @@ keywords: - Lag variables - Rolling windows - Seasonality -seo_description: Discover practical methods for crafting informative features from time series data, including lags, moving averages, and trend extraction. -seo_title: 'Feature Engineering for Time Series Data' +- Python +seo_description: Discover practical methods for crafting informative features from + time series data, including lags, moving averages, and trend extraction. +seo_title: Feature Engineering for Time Series Data seo_type: article -summary: This post explains how to engineer features such as lagged values, rolling statistics, and seasonal indicators to improve model performance on sequential data. +summary: This post explains how to engineer features such as lagged values, rolling + statistics, and seasonal indicators to improve model performance on sequential data. tags: - Feature engineering - Time series - Machine learning - Forecasting -title: 'Crafting Time Series Features for Better Models' +- Python +title: Crafting Time Series Features for Better Models --- -Time series data contains rich temporal information that standard tabular methods often overlook. Careful feature engineering can reveal trends and cycles that lead to more accurate predictions. 
+Time series data contains rich temporal dynamics—trends, seasonality, cycles, shocks, and evolving variance—that standard tabular features often miss. By crafting features that explicitly encode these patterns, you empower models (from ARIMA to gradient boosting and deep learning) to learn more nuanced signals and deliver significantly more accurate forecasts. This article dives deep into both classic and cutting-edge feature engineering techniques for time series, complete with conceptual explanations and code snippets in Python. ## 1. Lagged Variables -One of the simplest yet most effective techniques is creating lag features. By shifting the series backward in time, you supply the model with previous observations that may influence current values. +### 1.1 Basic Lag Features + +The simplest way to give your model “memory” is to include previous observations as new predictors. For a series $$y_t$$, you might create: + +```python +df['lag_1'] = df['y'].shift(1) +df['lag_7'] = df['y'].shift(7) # weekly lag +``` + +These features let the model learn autoregressive relationships: how yesterday or last week influences today. + +### 1.2 Distributed Lags + +Rather than picking arbitrary lags, you can create a range of lags and let the model pick which matter: + +```python +for lag in range(1, 15): + df[f'lag_{lag}'] = df['y'].shift(lag) +``` + +When used with regularized models (e.g., Lasso or tree-based methods), the model will zero-out irrelevant lags automatically. ## 2. Rolling Statistics -Moving averages and rolling standard deviations smooth the data and highlight short-term changes. They help capture momentum and seasonality without introducing noise. +Rolling (or moving) statistics smooth the data, revealing local trends and variability. + +### 2.1 Moving Averages + +A $k$-period rolling mean captures local trend: + +```python +df['roll_mean_7'] = df['y'].rolling(window=7).mean() +``` + +Experiment with short and long windows (7, 30, 90 days) to capture different granularities. + +### 2.2 Rolling Variance and Other Aggregations + +Volatility often matters as much as level. Rolling standard deviation, quantiles, min/max, and even custom functions can be computed: + +```python +df['roll_std_14'] = df['y'].rolling(14).std() +df['roll_max_30'] = df['y'].rolling(30).max() +df['roll_q25_30'] = df['y'].rolling(30).quantile(0.25) +``` + +These features help the model detect regime changes or anomalous behavior. ## 3. Seasonal Indicators -Adding flags for month, day of week, or other periodic markers enables models to recognize recurring patterns, improving forecasts for sales, web traffic, and more. +Even simple flags for calendar units can boost performance dramatically. + +### 3.1 Calendar Features + +Extract month, day of week, quarter, and more: + +```python +df['month'] = df.index.month +df['dow'] = df.index.dayofweek # 0=Monday, …, 6=Sunday +df['quarter'] = df.index.quarter +``` + +### 3.2 Cyclical Encoding + +Treating these as numeric can introduce artificial discontinuities at boundaries (e.g., December→January). Instead encode them cyclically with sine/cosine: + +```python +df['sin_month'] = np.sin(2*np.pi * df['month']/12) +df['cos_month'] = np.cos(2*np.pi * df['month']/12) +``` + +This preserves the circular nature of time features. + +## 4. Trend and Seasonality Decomposition + +Decompose the series into trend, seasonal, and residual components (e.g., with STL) and use them directly. 
+ +```python +from statsmodels.tsa.seasonal import STL +stl = STL(df['y'], period=365) +res = stl.fit() +df['trend'] = res.trend +df['seasonal'] = res.seasonal +df['resid'] = res.resid +``` + +Feeding the trend and seasonal components separately lets your model focus on each pattern in isolation. + +## 5. Fourier and Spectral Features + +To capture complex periodicities without manually creating many dummies, build Fourier series terms: + +```python +def fourier_terms(series, period, K): + t = np.arange(len(series)) + terms = {} + for k in range(1, K+1): + terms[f'sin_{period}_{k}'] = np.sin(2*np.pi*k*t/period) + terms[f'cos_{period}_{k}'] = np.cos(2*np.pi*k*t/period) + return pd.DataFrame(terms, index=series.index) + +fourier_df = fourier_terms(df['y'], period=365, K=3) +df = pd.concat([df, fourier_df], axis=1) +``` + +This approach succinctly encodes multiple harmonics of yearly seasonality (or whatever period you choose). + +## 6. Date-Time Derived Features + +Beyond basic calendar units, derive: + +* Time since event: days since last promotion, hours since last maintenance. +* Cumulative counts: how many times a threshold was breached to date. +* Time to next event: days until holiday or known scheduled event. + +These features are domain-specific but often highly predictive. + +## 7. Holiday and Event Effects + +Special days often trigger spikes or drops. Incorporate known events: + +```python +import holidays +us_holidays = holidays.US() +df['is_holiday'] = df.index.to_series().apply(lambda d: d in us_holidays).astype(int) +``` + +You can also add “days until next holiday” and “days since last holiday” to capture lead/lag effects. + +## 8. Interaction and Lagged Interaction Features + +Combine time features with lagged values to model varying autocorrelation: + +```python +df['lag1_sin_month'] = df['lag_1'] * df['sin_month'] +``` + +Such interactions can help the model learn that the effect of yesterday’s value depends on the season or trend. + +## 9. Window-Based and Exponential Weighted Features + +Instead of fixed‐window rolling stats, use exponentially weighted moving averages (EWMA) to prioritize recent observations: + +```python +df['ewm_0.3'] = df['y'].ewm(alpha=0.3).mean() +``` + +Experiment with different decay rates to find the optimal memory length. + +## 10. Domain-Specific Signals + +In finance: technical indicators (RSI, MACD); in retail: days since last promotion; in IoT: time since device reboot. Leverage your domain knowledge to craft bespoke features that capture critical drivers. + +## 11. Feature Selection and Validation + +With hundreds of engineered features, guard against overfitting: + +* Correlation analysis: drop highly collinear features. +* Model-based importance: use tree-based methods to rank features. +* Regularization: L1/L2 penalties to zero out irrelevant predictors. +* Cross-validation: time-aware CV (e.g., expanding window) to test generalization. + +## 12. Integrating Engineered Features + +Finally, assemble your features into a single DataFrame (aligning on time index), handle missing values (common after shifts/rolls), and feed into your chosen model: + +```python +df.dropna(inplace=True) +X = df.drop(columns=['y']) +y = df['y'] +``` + +Use pipelines (e.g., scikit-learn’s Pipeline) to keep preprocessing, feature engineering, and modeling reproducible and version-controlled. -Combining these approaches can significantly enhance a time series model's predictive power, especially when paired with algorithms like ARIMA or gradient boosting. 
+By thoughtfully engineering temporal features—from simple lags to spectral and event-driven signals—you unlock hidden structures in your data. Paired with rigorous validation and domain expertise, these techniques can transform raw time series into powerful predictors, elevating model performance across forecasting tasks. diff --git a/_posts/2025-06-10-arima_forecasting_python.md b/_posts/2025-06-10-arima_forecasting_python.md index 44b731b9..ebe174c9 100644 --- a/_posts/2025-06-10-arima_forecasting_python.md +++ b/_posts/2025-06-10-arima_forecasting_python.md @@ -4,7 +4,8 @@ categories: - Statistics classes: wide date: '2025-06-10' -excerpt: A practical introduction to building ARIMA models in Python for reliable time series forecasting. +excerpt: A practical introduction to building ARIMA models in Python for reliable + time series forecasting. header: image: /assets/images/data_science_13.jpg og_image: /assets/images/data_science_13.jpg @@ -13,34 +14,137 @@ header: teaser: /assets/images/data_science_13.jpg twitter_image: /assets/images/data_science_13.jpg keywords: -- ARIMA +- Arima - Time series forecasting - Python - Statsmodels -seo_description: Learn how to fit ARIMA models using Python's statsmodels library, evaluate their performance, and avoid common pitfalls. -seo_title: 'ARIMA Forecasting with Python' +seo_description: Learn how to fit ARIMA models using Python's statsmodels library, + evaluate their performance, and avoid common pitfalls. +seo_title: ARIMA Forecasting with Python seo_type: article -summary: This tutorial walks through the basics of ARIMA modeling, from identifying parameters to validating forecasts on real data. +summary: This tutorial walks through the basics of ARIMA modeling, from identifying + parameters to validating forecasts on real data. tags: -- ARIMA +- Arima - Forecasting - Python - Time series title: 'ARIMA Modeling in Python: A Quick Start Guide' --- -ARIMA models remain a cornerstone of classical time series analysis. Python's `statsmodels` package makes it straightforward to specify, fit, and evaluate these models. +## Forecasting with ARIMA: Context and Rationale -## 1. Identifying the ARIMA Order +Time series forecasting underpins decision-making in domains from finance to supply-chain management. Although modern machine learning methods often make headlines, classical approaches such as ARIMA remain indispensable baselines. An ARIMA model—AutoRegressive Integrated Moving Average—captures three core behaviors: dependence on past observations, differencing to enforce stationarity, and smoothing of past forecast errors. When implemented carefully, ARIMA delivers interpretable forecasts, rigorous confidence intervals, and an established toolkit for diagnostic evaluation. -Plot the autocorrelation (ACF) and partial autocorrelation (PACF) to determine suitable values for the AR (p) and MA (q) terms. Differencing can help stabilize non-stationary series before fitting. +## The ARIMA(p, d, q) Model Formulation -## 2. Fitting the Model +An ARIMA(p, d, q) model can be expressed in operator notation. Let $$L$$ denote the lag operator, so that $$L\,x_t = x_{t-1}$$. Then the model satisfies -With parameters chosen, use `statsmodels.tsa.arima.model.ARIMA` to estimate the coefficients. Review summary statistics to ensure reasonable residual behavior. +$$ +\phi(L)\,(1 - L)^d\,x_t \;=\; \theta(L)\,\varepsilon_t, +$$ -## 3. Forecast Evaluation +where -Evaluate predictions using metrics like mean absolute error (MAE) or root mean squared error (RMSE). 
Cross-validation on rolling windows helps confirm that the model generalizes well. +$$ +\phi(L) \;=\; 1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p, +\quad +\theta(L) \;=\; 1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q, +$$ -While ARIMA is a classical technique, it remains a powerful baseline and a stepping stone toward more complex forecasting methods. +and $$\varepsilon_t$$ is white noise. The integer $$d$$ denotes the number of nonseasonal differences required to achieve stationarity. When $$d=0$$ and $$p,q>0$$, the model reduces to ARMA(p, q). Seasonal extensions augment this with seasonal autoregressive and moving-average polynomials at lag $$s$$. + +## Identifying Model Order via ACF and PACF + +Choosing suitable values for $$p$$, $$d$$, and $$q$$ begins with visualization. The autocorrelation function (ACF) plots $$\mathrm{Corr}(x_t, x_{t-k})$$ against lag $$k$$, while the partial autocorrelation function (PACF) isolates the correlation at lag $$k$$ after removing intermediate effects. A slowly decaying ACF suggests need for differencing; a sharp cutoff in the PACF after lag $$p$$ hints at an AR($$p$$) component, whereas a cutoff in the ACF after lag $$q$$ indicates an MA($$q$$) term. + +In practice, one may: + +- Plot the time series to check for trends or seasonal cycles. +- Apply first or seasonal differencing until the series appears stationary. +- Examine the ACF for significant spikes at lags up to 20 or 30. +- Inspect the PACF for single-lag cutoffs or exponential decay patterns. + +These heuristics guide the initial grid of candidate $$(p,d,q)$$ combinations to evaluate. + +## Stationarity, Differencing, and Seasonal Extensions + +Non-stationary behavior—trends or unit roots—violates ARIMA assumptions. The Augmented Dickey-Fuller (ADF) test offers a statistical check for a unit root and informs the choice of $$d$$. When seasonal patterns recur every $$s$$ observations (for example, $$s=12$$ for monthly data), applying a seasonal difference $$(1 - L^s)$$ yields the SARIMA(p, d, q)(P, D, Q)$$_s$$ model. Seasonal terms capture long-period dependencies that nonseasonal differencing cannot. + +Proper differencing preserves the underlying information while stabilizing variance and autocorrelation structure. Over-differencing should be avoided, as it can inflate model variance and distort forecasts. + +## Fitting ARIMA Models in Python with statsmodels + +Python’s `statsmodels` library exposes ARIMA fitting through the `ARIMA` class in `statsmodels.tsa.arima.model`. A typical workflow follows: + +```python +import pandas as pd +from statsmodels.tsa.arima.model import ARIMA + +# Load a time series (e.g., monthly airline passengers) +series = pd.read_csv('air_passengers.csv', index_col='Month', parse_dates=True) +y = series['Passengers'] + +# Specify and fit a nonseasonal ARIMA(1,1,1) +model = ARIMA(y, order=(1, 1, 1)) +result = model.fit() + +# View summary of estimated coefficients +print(result.summary()) +``` + +## Interpreting the Summary Report + +The `.summary()` report displays parameter estimates, their standard errors, and information criteria such as AIC and BIC, which facilitate model comparison. Lower AIC/BIC suggests a better balance of fit and parsimony. + +## Diagnostic Checking and Residual Analysis + +After fitting, verify that residuals behave like white noise. Key diagnostic checks include: + +- **Plotting standardized residuals** to look for non-random patterns. +- **Examining the residual ACF** to confirm absence of autocorrelation. 
+- **Conducting the Ljung–Box test** for serial correlation up to a chosen lag. +- **Checking normality** of residuals via QQ plots. + +If diagnostics reveal structure in the residuals, revisit the order selection, try alternative differencing, or incorporate seasonal terms. + +## Generating and Visualizing Forecasts + +With a validated model, forecasts and confidence intervals are easily obtained: + +```python +# Forecast next 12 periods +forecast = result.get_forecast(steps=12) +mean_forecast = forecast.predicted_mean +conf_int = forecast.conf_int(alpha=0.05) + +# Plot historical data and forecasts +import matplotlib.pyplot as plt + +plt.figure(figsize=(10, 4)) +plt.plot(y, label='Historical') +plt.plot(mean_forecast, label='Forecast', color='C1') +plt.fill_between(conf_int.index, + conf_int.iloc[:, 0], + conf_int.iloc[:, 1], + color='C1', alpha=0.2) +plt.legend() +plt.title('ARIMA Forecast with 95% CI') +plt.show() +``` + +## Visual Inspection of Forecast Intervals + +Visual inspection of forecast intervals helps gauge uncertainty and makes communicating results to stakeholders straightforward. + +## Forecast Evaluation with Rolling Cross-Validation + +Rather than relying on a single train-test split, employ rolling cross-validation to assess forecast stability. At each fold, fit the model on a growing window and forecast a fixed horizon, then compute error metrics such as mean absolute error (MAE) or root mean squared error (RMSE). Aggregating errors across folds yields robust estimates of out-of-sample performance and guards against overfitting to a particular period. + +## Advanced Topics: SARIMA and Automated Order Selection + +For series with strong seasonality, the Seasonal ARIMA extension (SARIMA) incorporates seasonal AR(P), I(D), and MA(Q) terms at lag _s_. Python users can leverage `pmdarima`’s `auto_arima` to automate differencing tests, grid-search orders, and select the model minimizing AIC. Under the hood, `auto_arima` performs unit-root tests, stepwise order search, and parallelizes fitting for efficiency. While convenient, automated routines should be paired with domain knowledge and diagnostic checks to ensure the chosen model aligns with real-world behavior. + +## Practical Tips and Best Practices + +Successful ARIMA modeling hinges on judicious preprocessing, thorough diagnostics, and transparent communication. Always visualize both the series and residuals. Document the rationale behind differencing and order choices. Compare multiple candidate models using AIC/BIC and cross-validation. Finally, present forecast intervals alongside point predictions to convey uncertainty. By integrating classical rigor with Python’s rich ecosystem, practitioners can deploy ARIMA models that remain reliable baselines and trustworthy forecasting tools for time-series challenges. diff --git a/_posts/2025-06-11-introduction_neural_networks.md b/_posts/2025-06-11-introduction_neural_networks.md index 3ed0d24b..907bde98 100644 --- a/_posts/2025-06-11-introduction_neural_networks.md +++ b/_posts/2025-06-11-introduction_neural_networks.md @@ -4,7 +4,8 @@ categories: - Machine Learning classes: wide date: '2025-06-11' -excerpt: Neural networks power many modern AI applications. This article introduces their basic structure and training process. +excerpt: Neural networks power many modern AI applications. This article introduces + their basic structure and training process. 
header: image: /assets/images/data_science_14.jpg og_image: /assets/images/data_science_14.jpg @@ -17,25 +18,127 @@ keywords: - Deep learning - Backpropagation - Activation functions -seo_description: Get a beginner-friendly overview of neural networks, covering layers, activation functions, and how training works via backpropagation. -seo_title: 'Neural Networks Explained Simply' +seo_description: Get a beginner-friendly overview of neural networks, covering layers, + activation functions, and how training works via backpropagation. +seo_title: Neural Networks Explained Simply seo_type: article -summary: This overview demystifies neural networks by highlighting how layered structures learn complex patterns from data. +summary: This overview demystifies neural networks by highlighting how layered structures + learn complex patterns from data. tags: - Neural networks - Deep learning - Machine learning -title: 'A Gentle Introduction to Neural Networks' +title: A Gentle Introduction to Neural Networks --- -At their core, neural networks consist of layers of interconnected nodes that learn to approximate complex functions. Each layer transforms its inputs through weights and activation functions, gradually building richer representations. +## Architectural Foundations of Neural Networks -## 1. Layers and Activations +At their essence, neural networks are parameterized, differentiable functions that map inputs to outputs by composing a sequence of simple transformations. Inspired by biological neurons, each computational node receives a weighted sum of inputs, applies a nonlinear activation, and passes its result forward. Chaining these nodes into layers allows the network to learn hierarchical representations: early layers extract basic features, while deeper layers combine them into increasingly abstract concepts. -A typical network starts with an input layer, followed by one or more hidden layers, and ends with an output layer. Activation functions like ReLU, sigmoid, or tanh introduce non-linearity, enabling the network to model complicated relationships. +A feed-forward network consists of an input layer that ingests raw features, one or more hidden layers where most of the representation learning occurs, and an output layer that produces predictions. For an input vector $$\mathbf{x} \in \mathbb{R}^n$$, the network’s output $$\mathbf{\hat{y}}$$ is given by the nested composition +$$ +\mathbf{\hat{y}} = f^{(L)}\bigl(W^{(L)} f^{(L-1)}\bigl(\dots f^{(1)}(W^{(1)} \mathbf{x} + \mathbf{b}^{(1)})\dots\bigr) + \mathbf{b}^{(L)}\bigr), +$$ +where $$L$$ denotes the number of layers, each $$W^{(l)}$$ is a weight matrix, $$\mathbf{b}^{(l)}$$ a bias vector, and $$f^{(l)}$$ an activation function. -## 2. Training via Backpropagation +## Layers and Nonlinear Activations -During training, the network makes predictions and measures how far they deviate from the true labels. The backpropagation algorithm computes gradients of the error with respect to each weight, allowing an optimizer such as gradient descent to adjust the network toward better performance. +Stacking linear transformations alone would collapse to a single linear mapping, no matter how many layers you use. The power of neural networks arises from nonlinear activation functions inserted between layers. Common choices include: -Neural networks underpin everything from image recognition to natural language processing. Understanding their basic mechanics is the first step toward exploring the broader world of deep learning. 
+- **Rectified Linear Unit (ReLU):** + $$ + \mathrm{ReLU}(z) = \max(0,\,z). + $$ + Its simplicity and sparsity‐inducing effect often speed up convergence. + +- **Sigmoid:** + $$ + \sigma(z) = \frac{1}{1 + e^{-z}}, + $$ + which squashes inputs into $$(0,1)$$, useful for binary outputs but prone to vanishing gradients. + +- **Hyperbolic Tangent (tanh):** + $$ + \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}, + $$ + mapping to $$(-1,1)$$ and centering activations, but still susceptible to saturation. + +Choosing the right activation often depends on the task and network depth. Modern architectures frequently use variants such as Leaky ReLU or Swish to mitigate dead-neuron issues and improve gradient flow. + +## Mechanics of Forward Propagation + +In a forward pass, each layer computes its pre-activation output +$$ + \mathbf{z}^{(l)} = W^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}, +$$ +and then applies the activation +$$ + \mathbf{a}^{(l)} = f^{(l)}\bigl(\mathbf{z}^{(l)}\bigr). +$$ +Starting with $$\mathbf{a}^{(0)} = \mathbf{x}$$, the network progressively transforms input features into decision‐ready representations. For classification, the final layer often uses a softmax activation +$$ + \mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}, +$$ +ensuring the outputs form a probability distribution over classes. + +## Training Dynamics: Backpropagation and Gradient Descent + +Learning occurs by minimizing a loss function $$ \mathcal{L}(\mathbf{\hat{y}}, \mathbf{y})$$, which quantifies the discrepancy between predictions $$\mathbf{\hat{y}}$$ and true labels $$\mathbf{y}$$. The quintessential example is cross-entropy for classification: +$$ + \mathcal{L} = -\sum_{i} y_i \log(\hat{y}_i). +$$ + +Backpropagation efficiently computes the gradient of the loss with respect to each parameter by applying the chain rule through the network’s layered structure. For a single parameter $$W_{ij}^{(l)}$$, the gradient is +$$ + \frac{\partial \mathcal{L}}{\partial W_{ij}^{(l)}} = \delta_j^{(l)}\,a_i^{(l-1)}, +$$ +where the error term $$\delta^{(l)}$$ is defined recursively as +$$ + \delta^{(L)} = \nabla_{\mathbf{z}^{(L)}}\,\mathcal{L}, + \quad + \delta^{(l)} = \bigl(W^{(l+1)\,T} \delta^{(l+1)}\bigr) \odot f'^{(l)}\bigl(\mathbf{z}^{(l)}\bigr) + \quad\text{for } l < L. +$$ +Here $$\odot$$ denotes element-wise multiplication and $$f'$$ the derivative of the activation. Armed with these gradients, an optimizer such as stochastic gradient descent (SGD) updates parameters: +$$ + W^{(l)} \leftarrow W^{(l)} - \eta\,\frac{\partial \mathcal{L}}{\partial W^{(l)}}, + \quad + \mathbf{b}^{(l)} \leftarrow \mathbf{b}^{(l)} - \eta\,\delta^{(l)}, +$$ +where $$\eta$$ is the learning rate. Variants like Adam, RMSProp, and momentum incorporate adaptive step sizes and velocity terms to accelerate convergence and escape shallow minima. + +## Weight Initialization and Optimization Strategies + +Proper initialization of $$W^{(l)}$$ is critical to avoid vanishing or exploding signals. He initialization $$\bigl\lVert W_{ij} \bigr\rVert \sim \mathcal{N}(0,\,2/n_{\mathrm{in}})$$ suits ReLU activations, whereas Xavier/Glorot initialization $$\mathcal{N}(0,\,1/n_{\mathrm{in}} + 1/n_{\mathrm{out}})$$ balances forward and backward variances for sigmoidal functions. 
Batch normalization further stabilizes training by normalizing each layer’s pre-activations +$$ + \hat{z}_i = \frac{z_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, + \quad + z_i' = \gamma\,\hat{z}_i + \beta, +$$ +where $$\mu_{\mathcal{B}},\sigma_{\mathcal{B}}$$ are batch statistics and $$\gamma,\beta$$ learnable scale and shift parameters. Combining these techniques with well-tuned optimizers yields robust, fast-converging models. + +## Common Network Architectures + +Neural networks take many forms beyond simple feed-forward stacks. Convolutional neural networks (CNNs) apply learnable filters to exploit spatial locality in images; recurrent networks (RNNs) process sequential data by maintaining hidden states across time steps; and transformer architectures leverage attention mechanisms to capture long-range dependencies in text and vision. Each architecture builds on the same layer-and-activation principles, adapting connectivity and parameter sharing to domain-specific patterns. + +## Practical Considerations for Effective Training + +Training neural networks at scale demands careful attention to: + +- **Learning Rate Scheduling:** + Techniques such as exponential decay, cosine annealing, or warm restarts adjust $$\eta$$ over epochs to refine convergence. + +- **Regularization:** + Dropout randomly deactivates neurons during training, weight decay penalizes large parameters, and data augmentation expands datasets with synthetic variations. + +- **Batch Size Selection:** + Larger batches yield smoother gradient estimates but may generalize worse; smaller batches introduce noise that can regularize learning but slow throughput. + +- **Monitoring and Early Stopping:** + Tracking validation loss and accuracy guards against overfitting. Early stopping halts training when performance plateaus, preserving the model at its best epoch. + +By combining these practices, practitioners can navigate the trade-offs inherent in deep learning and harness the full expressive power of neural networks. + +## Looking Ahead: Extensions and Advanced Topics + +Understanding the mechanics of layers, activations, and backpropagation lays the groundwork for exploring advanced deep learning themes: residual connections that alleviate vanishing gradients in very deep models, attention mechanisms that dynamically weight inputs, generative models that synthesize realistic data, and meta-learning algorithms that learn to learn. As architectures evolve and hardware accelerates, mastering these fundamentals ensures that you can adapt to emerging innovations and architect solutions that solve complex real-world problems. diff --git a/_posts/2025-06-12-hyperparameter_tuning_strategies.md b/_posts/2025-06-12-hyperparameter_tuning_strategies.md index b313525f..c85896a0 100644 --- a/_posts/2025-06-12-hyperparameter_tuning_strategies.md +++ b/_posts/2025-06-12-hyperparameter_tuning_strategies.md @@ -4,7 +4,8 @@ categories: - Machine Learning classes: wide date: '2025-06-12' -excerpt: Hyperparameter tuning can drastically improve model performance. Explore common search strategies and tools. +excerpt: Hyperparameter tuning can drastically improve model performance. Explore + common search strategies and tools. 
header: image: /assets/images/data_science_15.jpg og_image: /assets/images/data_science_15.jpg @@ -17,26 +18,92 @@ keywords: - Grid search - Random search - Bayesian optimization -seo_description: Learn when to use grid search, random search, and Bayesian optimization to tune machine learning models effectively. -seo_title: 'Effective Hyperparameter Tuning Methods' +seo_description: Learn when to use grid search, random search, and Bayesian optimization + to tune machine learning models effectively. +seo_title: Effective Hyperparameter Tuning Methods seo_type: article -summary: This guide covers systematic approaches for searching the hyperparameter space, along with libraries that automate the process. +summary: This guide covers systematic approaches for searching the hyperparameter + space, along with libraries that automate the process. tags: - Hyperparameters - Model selection - Optimization - Machine learning -title: 'Hyperparameter Tuning Strategies' +title: Hyperparameter Tuning Strategies --- -Choosing the right hyperparameters can make or break a machine learning model. Because the search space is often large, systematic strategies are essential. +## The Importance of Hyperparameter Optimization -## 1. Grid and Random Search +Hyperparameters—settings that govern the training process and structure of a model—play a pivotal role in determining predictive performance, generalization ability, and computational cost. Examples include learning rates, regularization coefficients, network depths, and kernel parameters. Manually tuning these values by trial and error is laborious and rarely finds the true optimum, especially as models grow in complexity. A systematic approach to hyperparameter search transforms tuning from an art into a reproducible, quantifiable process. By intelligently exploring the search space, practitioners can achieve better model accuracy, faster convergence, and clearer insights into the sensitivity of their algorithms to key parameters. -Grid search exhaustively tests combinations of predefined parameter values. While thorough, it can be expensive. Random search offers a quicker alternative by sampling combinations at random, often finding good solutions faster. +## Grid Search: Exhaustive Exploration -## 2. Bayesian Optimization +Grid search enumerates every possible combination of specified hyperparameter values. If you define a grid over two parameters—for instance, learning rate $$\eta \in \{10^{-3},10^{-2},10^{-1}\}$$ and regularization strength $$\lambda \in \{10^{-4},10^{-3},10^{-2},10^{-1}\}$$—grid search will train and evaluate models at all 12 combinations. This exhaustive approach guarantees that the global optimum within the grid will be discovered, but its computational cost grows exponentially with the number of parameters and resolution of the grid. In low-dimensional spaces or when compute resources abound, grid search delivers reliable baselines and insights into parameter interactions. However, for high-dimensional or continuous domains, its inefficiency mandates alternative strategies. -Bayesian methods build a probabilistic model of the objective function and choose the next parameters to evaluate based on expected improvement. Libraries like Optuna and Hyperopt make this approach accessible. +## Random Search: Efficient Sampling -Automated tools can handle much of the heavy lifting, but understanding the underlying strategies helps you choose the best one for your problem and compute budget. 
+Random search addresses the curse of dimensionality by drawing hyperparameter configurations at random from predefined distributions. Contrary to grid search, it allocates trials uniformly across the search space, which statistically yields better coverage in high-dimensional settings. As shown by Bergstra and Bengio (2012), random search often finds near-optimal configurations with far fewer evaluations than grid search, especially when only a subset of hyperparameters critically influences performance. By sampling learning rates from a log-uniform distribution or selecting dropout rates uniformly between 0 and 0.5, random search streamlines experiments and uncovers promising regions more rapidly. It also adapts naturally to continuous parameters without requiring an arbitrary discretization. + +## Bayesian Optimization: Probabilistic Tuning + +Bayesian optimization constructs a surrogate probabilistic model—commonly a Gaussian process or Tree-structured Parzen Estimator (TPE)—to approximate the relationship between hyperparameters and objective metrics such as validation loss. At each iteration, it uses an acquisition function (e.g., Expected Improvement, Upper Confidence Bound) to balance exploration of untested regions and exploitation of known good areas. The Expected Improvement acquisition can be expressed as + +$$ +\alpha_{\mathrm{EI}}(x) = \mathbb{E}\bigl[\max\bigl(0,\,f(x) - f(x^+)\bigr)\bigr], +$$ + +where $$f(x^+)$$ is the best observed objective so far. This criterion quantifies the expected gain from sampling $$x$$, guiding resource allocation towards configurations with the greatest promise. Popular libraries such as Optuna, Hyperopt, and Scikit-Optimize abstract these concepts into user-friendly interfaces, enabling asynchronous parallel evaluations, pruning of unpromising trials, and automatic logging. + +## Multi-Fidelity Methods: Hyperband and Successive Halving + +While Bayesian methods focus on where to sample, multi-fidelity techniques emphasize how to allocate a fixed computational budget across many configurations. Successive Halving begins by training a large set of candidates for a small number of epochs or on a subset of data, then discards the bottom fraction and promotes the top performers to the next round with increased budget. Hyperband extends this idea by running multiple Successive Halving instances with different initial budgets, ensuring that both many cheap evaluations and fewer expensive ones are considered. By dynamically allocating resources to promising hyperparameters, Hyperband often outperforms fixed-budget strategies, particularly when training time is highly variable across configurations. + +## Evolutionary Algorithms and Metaheuristics + +Evolutionary strategies and other metaheuristic algorithms mimic natural selection to evolve hyperparameter populations over generations. A pool of candidate configurations undergoes mutation (random perturbations), crossover (recombining parameters from two candidates), and selection (retaining the highest-performing individuals). Frameworks like DEAP and TPOT implement genetic programming for both hyperparameter tuning and pipeline optimization. Although these methods can be computationally intensive, they excel at exploring complex, non-convex search landscapes and can adaptively shift search focus based on emerging performance trends. 
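+
+To ground these search strategies in code, the sketch below uses Optuna (with its default TPE sampler) to tune a gradient-boosting classifier; the synthetic dataset, parameter ranges, and trial budget are illustrative placeholders rather than recommendations:
+
+```python
+import optuna
+from sklearn.datasets import make_classification
+from sklearn.ensemble import GradientBoostingClassifier
+from sklearn.model_selection import cross_val_score
+
+# toy data standing in for a real training set
+X, y = make_classification(n_samples=500, n_features=20, random_state=0)
+
+def objective(trial):
+    # sample hyperparameters; the learning rate is drawn on a log scale
+    params = {
+        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 1e-1, log=True),
+        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
+        'max_depth': trial.suggest_int('max_depth', 2, 6),
+    }
+    model = GradientBoostingClassifier(random_state=0, **params)
+    return cross_val_score(model, X, y, cv=3).mean()
+
+study = optuna.create_study(direction='maximize')  # TPE sampler by default
+study.optimize(objective, n_trials=25)
+print(study.best_params)
+```
+
+In practice you would swap the toy data for your own training set and let a pruner terminate clearly losing trials early.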
+ +## Practical Considerations: Parallelization and Early Stopping + +Effective hyperparameter search leverages parallel computing—distributing trials across CPUs, GPUs, or cloud instances to accelerate discovery. Asynchronous execution ensures that faster trials do not wait for slower ones, maximizing cluster utilization. Early stopping mechanisms monitor intermediate metrics (e.g., validation loss) during training and terminate runs that underperform relative to their peers, salvaging resources for more promising experiments. Systems like Ray Tune, KubeTune, and Azure AutoML integrate these capabilities, automatically pruning trials and scaling across distributed environments. + +## Tooling and Frameworks + +A rich ecosystem of tools simplifies hyperparameter optimization: + +- **Scikit-Learn**: Offers `GridSearchCV` and `RandomizedSearchCV` for classical ML models. +- **Optuna**: Provides efficient Bayesian optimization with pruning and multi-objective support. +- **Hyperopt**: Implements Tree-structured Parzen Estimator search with Trials logging. +- **Ray Tune**: Enables scalable, distributed experimentation with support for Hyperband, Bayesian, random, and population-based searches. +- **Google Vizier & SigOpt**: Managed services for large-scale, enterprise-grade tuning. + +Choosing the right framework depends on project scale, desired search strategy, and infrastructure constraints. + +## Best Practices and Guidelines + +To maximize the effectiveness of hyperparameter optimization, consider the following guidelines: + +1. **Define a Realistic Search Space** + Prioritize parameters known to impact performance (e.g., learning rate, regularization) and constrain ranges based on prior experiments or domain knowledge. + +2. **Scale and Transform Appropriately** + Sample continuous parameters on logarithmic scales (e.g., $$\log_{10}$$ for learning rates) and encode categorical choices with one-hot or ordinal representations. + +3. **Allocate Budget Wisely** + Balance the number of trials with the compute time per trial. Favor a larger number of quick, low-fidelity runs early on, then refine with more thorough evaluations. + +4. **Maintain Reproducibility** + Log random seeds, hyperparameter values, code versions, and data splits. Use experiment tracking tools like MLflow, Weights & Biases, or Comet to record outcomes. + +5. **Leverage Warm Starting** + When tuning similar models or datasets, initialize the search using prior results or transfer learning techniques to accelerate convergence. + +6. **Monitor Convergence Trends** + Visualize performance across trials to detect plateaus or drastic improvements, then adjust search ranges or switch strategies if progress stalls. + +By adhering to these principles, practitioners can avoid wasted compute cycles and uncover robust hyperparameter settings efficiently. + +## Future Trends in Hyperparameter Tuning + +Emerging directions in model tuning include automated machine learning (AutoML) pipelines that integrate search strategies with feature engineering and neural architecture search. Meta-learning approaches aim to learn hyperparameter priors from historical experiments, reducing cold-start inefficiencies. Reinforcement learning agents that dynamically adjust hyperparameters during training—so-called “online tuning”—promise to further streamline workflows. 
Finally, advances in hardware acceleration and unified optimization platforms will continue to lower the barrier to large-scale hyperparameter exploration, making systematic tuning accessible to a broader range of practitioners. + +By combining these evolving techniques with sound engineering practices, the next generation of hyperparameter optimization will deliver models that not only perform at the state of the art but also adapt rapidly to new challenges and datasets. diff --git a/_posts/2025-06-13-model_deployment_best_practices.md b/_posts/2025-06-13-model_deployment_best_practices.md index 08a259e5..d388625c 100644 --- a/_posts/2025-06-13-model_deployment_best_practices.md +++ b/_posts/2025-06-13-model_deployment_best_practices.md @@ -4,7 +4,8 @@ categories: - Data Science classes: wide date: '2025-06-13' -excerpt: Deploying machine learning models to production requires planning and robust infrastructure. Here are key practices to ensure success. +excerpt: Deploying machine learning models to production requires planning and robust + infrastructure. Here are key practices to ensure success. header: image: /assets/images/data_science_16.jpg og_image: /assets/images/data_science_16.jpg @@ -14,31 +15,176 @@ header: twitter_image: /assets/images/data_science_16.jpg keywords: - Model deployment -- MLOps +- Mlops - Monitoring - Scalability -seo_description: Understand essential steps for taking models from development to production, including containerization, monitoring, and retraining. -seo_title: 'Best Practices for Model Deployment' +seo_description: Understand essential steps for taking models from development to + production, including containerization, monitoring, and retraining. +seo_title: Best Practices for Model Deployment seo_type: article -summary: This post outlines reliable approaches for serving machine learning models in production environments and keeping them up to date. +summary: This post outlines reliable approaches for serving machine learning models + in production environments and keeping them up to date. tags: - Deployment -- MLOps +- Mlops - Production - Data science title: 'Model Deployment: Best Practices and Tips' --- -A model is only as valuable as its impact in the real world. Deployment bridges the gap between experimental results and practical applications. +## Framing the Deployment Landscape -## 1. Containerization +Taking a machine learning model from the lab to a live environment involves more than just copying files. Production systems demand reliability, security, scalability, and maintainability. Teams must consider how models integrate with existing services, meet compliance requirements, evolve with data, and deliver consistent performance under varying loads. By viewing deployment as a multidisciplinary effort—spanning software engineering, data engineering, and operations—organizations can build robust pipelines that transform experimental artifacts into business-critical services. -Packaging models in containers such as Docker ensures consistent environments across development and production. This reduces dependency issues and simplifies scaling. +## Advanced Containerization and Orchestration -## 2. Monitoring and Logging +### Beyond Basic Dockerization -Once deployed, models must be monitored for performance degradation and data drift. Logging predictions and input data enables debugging and long-term analysis. +Packaging your model and its dependencies into a Docker image is just the beginning. 
To achieve enterprise-grade deployments: -## 3. Continuous Improvement +- **Multi-Stage Builds**: Use multi-stage `Dockerfile`s to keep images lean. Separate the build environment (compilation of native extensions, downloading large artifacts) from the runtime environment to minimize attack surface and startup time. +- **Image Scanning**: Incorporate vulnerability scanners (e.g., Trivy, Clair) into your CI pipeline. Automated scans on each push detect outdated libraries or misconfigurations before images reach production. +- **Immutable Tagging**: Avoid the `latest` tag in production. Instead, tag images with semantic versions or Git commit SHAs to guarantee that each deployment references a fixed, auditable artifact. -Retraining pipelines and automated rollback strategies help keep models accurate as data changes over time. MLOps tools streamline these processes, making deployments more robust. +### Kubernetes and Beyond + +Kubernetes has become the de-facto standard for orchestrating containerized models: + +1. **Helm Charts**: Define reusable, parameterizable templates for deploying model services, config maps, and ingress rules. +2. **Custom Resource Definitions (CRDs)**: Extend Kubernetes to manage ML-specific resources, such as `InferenceService` in KServe or `TFJob` in Kubeflow. +3. **Autoscaling**: Configure the Horizontal Pod Autoscaler (HPA) to scale based on CPU/GPU utilization or custom metrics (e.g., request latency), ensuring optimal resource usage during traffic spikes. +4. **Service Mesh Integration**: Leverage Istio or Linkerd to handle service discovery, circuit breaking, and mutual TLS, offloading networking concerns from your application code. + +By combining these orchestration primitives, teams achieve declarative, self-healing deployments that can withstand node failures and rolling upgrades. + +## Securing Models in Production + +Protecting your model and data is paramount. Consider these layers of defense: + +1. **Network Policies**: Enforce least-privilege communication between pods or microservices. Kubernetes NetworkPolicy objects can restrict which IP ranges or namespaces are allowed to query your inference endpoint. +2. **Authentication & Authorization**: Integrate OAuth/OIDC or mTLS to ensure only authorized clients (applications, users) can access prediction APIs. +3. **Secret Management**: Store credentials, API keys, and certificates in a secure vault (e.g., HashiCorp Vault, AWS Secrets Manager), and mount them as environment variables or encrypted volumes at runtime. +4. **Input Sanitization**: Validate incoming data against schemas (using JSON Schema, Protobuf, or custom validators) to guard against malformed inputs or adversarial payloads. + +Implementing robust security controls mitigates risks—from data exfiltration to model inversion attacks—and ensures compliance with regulations like GDPR or HIPAA. + +## Monitoring, Logging, and Drift Detection + +Reliable monitoring extends beyond uptime checks: + +- **Metrics Collection** + - *Latency & Throughput*: Track P90/P99 response times. + - *Resource Metrics*: GPU/CPU/memory usage. + - *Business KPIs*: Tie model predictions to downstream metrics such as conversion rate or churn. + +- **Drift Detection** + - *Data Drift*: Monitor feature distributions in production against training distributions. Techniques like population stability index (PSI) or KL divergence highlight when input data shifts. 
+ - *Concept Drift*: Continuously evaluate live predictions against delayed ground truth to measure degradation in model accuracy. + +- **Logging Best Practices** + - Centralize logs with systems like ELK or Splunk. + - Log request IDs and correlate them across services for end-to-end tracing. + - Redact sensitive PII but retain hashed identifiers for troubleshooting. + +By proactively alerting on metric anomalies or drift, teams can intervene before business impact escalates. + +## Model Versioning and Governance + +When multiple versions of a model coexist or when regulatory audits demand traceability, governance is key: + +- **Model Registry**: Use tools such as MLflow Model Registry or NVIDIA Clara Deploy to catalog models, their metadata (training data, hyperparameters), and lineage (who approved, when). +- **Approval Workflows**: Codify review processes—data validation checks, performance benchmarks, security scans—so that only vetted models advance to production. +- **Audit Logs**: Maintain tamper-evident records of deployment events, retraining triggers, and rollback actions. This not only aids debugging but also satisfies compliance requirements. + +Establishing a clear governance framework reduces technical debt and aligns ML initiatives with organizational policy. + +## Continuous Improvement Pipelines + +Data changes, user behavior evolves, and model performance inevitably decays. A resilient pipeline should: + +1. **Automated Retraining**: + - Schedule periodic or triggered retraining jobs when drift detectors fire. + - Use data versioning platforms (e.g., DVC, Pachyderm) to snapshot datasets. + +2. **CI/CD for Models**: + - Integrate unit tests, data validation checks, and performance benchmarks into every pull request. + - Employ canary or blue–green strategies for rolling out new models with minimal risk. + +3. **Human-in-the-Loop**: + - Surface low-confidence or high-impact predictions to domain experts for labeling. + - Use active learning to prioritize the most informative samples, maximizing labeling efficiency. + +4. **Rollback Mechanisms**: + - Store the last known “good” model. + - Automate rollback within your orchestration system if key metrics (latency, accuracy) exceed error budgets. + +An end-to-end MLOps platform streamlines these steps, ensuring that models remain reliable and up-to-date without manual overhead. + +## Scaling and Performance Optimization + +High-throughput, low-latency requirements demand careful tuning: + +- **Batch vs. Online Inference**: + - Use batch endpoints for large volumes of data processed asynchronously. + - Opt for low-latency REST/gRPC services for real-time needs. + +- **Hardware Acceleration**: + - Leverage GPUs, TPUs, or inference accelerators (e.g., NVIDIA TensorRT, Intel OpenVINO) and profile your model to choose the optimal device. + +- **Concurrency and Threading**: + - Implement request batching within the service (e.g., NVIDIA Triton Inference Server) to aggregate requests and amortize overhead. + - Tune thread pools and async event loops (e.g., FastAPI with Uvicorn) to maximize CPU utilization. + +- **Caching**: + - For deterministic predictions, cache results based on input hashes to avoid redundant computation. + +Combining these techniques ensures that your deployment meets SLA requirements under varying loads. + +## Cost Management and Infrastructure Choices + +Balancing performance and budget is critical: + +- **Serverless vs. 
Provisioned**: + - Serverless platforms (AWS Lambda, Google Cloud Functions) eliminate server maintenance but may introduce cold-start latency and cost unpredictability. + - Provisioned clusters (EKS, GKE, on-prem) offer predictable pricing and control but require ongoing management. + +- **Spot Instances and Preemptible VMs**: + - For non-critical batch inference or retraining jobs, leverage discounted compute options to reduce spend. + +- **Resource Tagging and Budget Alerts**: + - Tag all ML resources with project, environment, and owner. + - Configure billing alerts to catch cost overruns early. + +By combining financial visibility with dynamic provisioning strategies, organizations can optimize ROI on their ML workloads. + +## Deployment Architectures: Edge, Cloud, and Hybrid + +Different use cases call for different topologies: + +- **Cloud-Native**: Centralized inference in scalable clusters—ideal for web applications with elastic demand. +- **Edge Deployment**: Containerized models running on IoT devices or mobile phones reduce latency and preserve data privacy. +- **Hybrid Models**: A two-tier pattern where lightweight on-device models handle preliminary filtering and route complex cases to cloud APIs for deeper analysis. + +Selecting the right architecture depends on factors such as connectivity, compliance, and latency requirements. + +## Case Study: Kubernetes-Based Model Deployment + +A fintech startup needed to serve fraud-detection predictions at sub-50 ms latency, processing 10,000 TPS at peak. Their solution included: + +1. **Dockerization**: Multi-stage build producing a 200 MB image with Python 3.10, PyTorch, and Triton Inference Server. +2. **Helm Chart**: Parameterized deployment with CPU/GPU node selectors, HPA rules scaling between 3 and 30 pods. +3. **Istio Service Mesh**: mTLS for in-cluster communications and circuit breakers to isolate failing pods. +4. **Prometheus & Grafana**: Custom exporters for inference latency and drift metrics, with Slack alerts on anomalies. +5. **MLflow Registry**: Automated promotion of models passing accuracy thresholds, triggering Helm upgrades via Jenkins pipelines. + +This architecture delivered high throughput, robust security, and an automated retraining loop that retrained models weekly, reducing false positives by 15% over three months. + +## Final Thoughts and Next Steps + +By embracing advanced containerization, strong security practices, comprehensive monitoring, and automated retraining pipelines, teams can operationalize machine learning models that drive real-world impact. As you refine your deployment processes, consider: + +- Investing in an end-to-end MLOps platform to unify tooling. +- Conducting periodic chaos engineering drills to validate rollback and disaster recovery. +- Fostering a culture of collaboration between data scientists, devops, and security teams. + +With these practices in place, model deployment becomes not a one-off project but a sustainable capability for continuous innovation and business value delivery. diff --git a/_posts/2025-06-14-data_ethics_machine_learning.md b/_posts/2025-06-14-data_ethics_machine_learning.md index f1bd7c11..018fc976 100644 --- a/_posts/2025-06-14-data_ethics_machine_learning.md +++ b/_posts/2025-06-14-data_ethics_machine_learning.md @@ -4,7 +4,8 @@ categories: - Data Science classes: wide date: '2025-06-14' -excerpt: Ethical considerations are critical when deploying machine learning systems that affect real people. 
+excerpt: Ethical considerations are critical when deploying machine learning systems + that affect real people. header: image: /assets/images/data_science_17.jpg og_image: /assets/images/data_science_17.jpg @@ -15,28 +16,97 @@ header: keywords: - Data ethics - Bias mitigation -- Responsible AI +- Responsible ai - Transparency -seo_description: Examine the ethical challenges of machine learning, from biased data to algorithmic transparency, and learn best practices for responsible AI. -seo_title: 'Data Ethics in Machine Learning' +seo_description: Examine the ethical challenges of machine learning, from biased data + to algorithmic transparency, and learn best practices for responsible AI. +seo_title: Data Ethics in Machine Learning seo_type: article -summary: This article discusses how to address fairness, accountability, and transparency when building machine learning solutions. +summary: This article discusses how to address fairness, accountability, and transparency + when building machine learning solutions. tags: - Ethics -- Responsible AI +- Responsible ai - Bias - Machine learning -title: 'Why Data Ethics Matters in Machine Learning' +title: Why Data Ethics Matters in Machine Learning --- -Machine learning models influence decisions in finance, healthcare, and beyond. Ignoring their ethical implications can lead to harmful outcomes and loss of trust. +## Context and Ethical Imperatives -## 1. Sources of Bias +Machine learning models now underlie critical decisions in domains as diverse as credit underwriting, medical diagnosis, and criminal justice. When these systems operate without ethical guardrails, they can perpetuate or even amplify societal inequities, undermine public trust, and expose organizations to legal and reputational risk. Addressing ethical considerations from the very beginning of the project lifecycle ensures that models do more than optimize statistical metrics—they contribute positively to the communities they serve. -Bias often enters through historical data that reflects social inequities. Careful data auditing and diverse datasets help reduce unfair outcomes. +## Sources of Bias in Machine Learning -## 2. Transparency and Accountability +Bias often creeps into models through the very data meant to teach them. Historical records may encode discriminatory practices—such as lending patterns that disadvantaged certain neighborhoods—or reflect sampling artifacts that under-represent minority groups. Data collection processes themselves can introduce skew: surveys that omit non-English speakers, sensors that fail under certain lighting conditions, or user engagement logs dominated by a vocal subset of the population. -Model interpretability techniques and transparent documentation allow stakeholders to understand how predictions are made and to challenge them when necessary. +Recognizing these sources requires systematic data auditing. By profiling feature distributions across demographic slices, teams can detect imbalances that might lead to unfair predictions. For example, examining loan approval rates by ZIP code or analyzing false positive rates in medical imaging by patient age and ethnicity reveals patterns that warrant deeper investigation. Only by identifying where and how bias arises can practitioners design interventions to reduce its impact. -By considering ethics from the outset, data scientists can create systems that not only perform well but also align with broader societal values. 
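+
+As a small, hypothetical illustration of such an audit—the columns and figures below are invented for the example—approval rates and false positive rates can be compared across demographic slices in a few lines of pandas:
+
+```python
+import pandas as pd
+
+# hypothetical decision log: one row per scored applicant
+decisions = pd.DataFrame({
+    'group':     ['A', 'A', 'A', 'B', 'B', 'B'],
+    'approved':  [1, 1, 0, 0, 1, 0],   # model decision
+    'defaulted': [0, 1, 0, 1, 1, 0],   # observed outcome
+})
+
+# approval rate per demographic slice
+print(decisions.groupby('group')['approved'].mean())
+
+# false positive rate per slice: approvals among applicants who later defaulted
+fpr = decisions[decisions['defaulted'] == 1].groupby('group')['approved'].mean()
+print(fpr)
+```
+
+Large gaps between slices do not prove unfairness on their own, but they flag exactly the patterns such an audit is meant to surface for deeper investigation.
+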
+## Mitigation Strategies for Unfair Outcomes + +Once bias sources are understood, a toolkit of mitigation strategies becomes available: + +- **Data Augmentation and Resampling** + Generating synthetic examples for under-represented groups or oversampling minority classes balances the training set. Care must be taken to avoid introducing artificial artifacts that distort real-world relationships. + +- **Fair Representation Learning** + Techniques that learn latent features invariant to protected attributes—such as adversarial debiasing—aim to strip sensitive information from the model’s internal representation while preserving predictive power. + +- **Post-Processing Adjustments** + Calibrating decision thresholds separately for different demographic groups can equalize error rates, ensuring that no subgroup bears a disproportionate share of misclassification. + +Each approach has trade-offs in complexity, interpretability, and potential impact on overall accuracy. A staged evaluation, combining quantitative fairness metrics with stakeholder review, guides the selection of appropriate measures. + +## Transparency and Model Interpretability + +Transparency transforms opaque algorithms into systems that stakeholders can inspect and challenge. Interpretability techniques yield human-readable explanations of individual predictions or global model behavior: + +- **Feature Attribution Methods** + Algorithms like SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) quantify how much each input feature contributed to a given decision, enabling auditors to spot implausible drivers or confirm that the model relies on legitimate indicators. + +- **Counterfactual Explanations** + By asking “What minimal changes in input would alter this prediction?”, counterfactual methods provide actionable insights that resonate with end users—such as advising a loan applicant which factors to adjust for approval. + +- **Surrogate Models** + Training simpler, white-box models (e.g., decision trees) to approximate the behavior of a complex neural network offers a global view of decision logic, highlighting key decision rules even if exact fidelity is imperfect. + +Transparent documentation complements these techniques. Model cards or datasheets describe the intended use cases, performance across subgroups, training data provenance, and known limitations. Making this information publicly available cultivates trust among regulators, partners, and the broader community. + +## Accountability through Documentation and Governance + +Assigning clear ownership for ethical outcomes transforms good intentions into concrete action. A governance framework codifies roles, responsibilities, and review processes: + +1. **Ethics Review Board** + A cross-functional committee—comprising data scientists, legal counsel, domain experts, and ethicists—evaluates proposed models against organizational standards and legal requirements before deployment. + +2. **Approval Workflows** + Automated checkpoints in the CI/CD pipeline prevent models from advancing to production until they pass fairness, security, and performance tests. Audit logs record each decision, reviewer identity, and timestamp, ensuring traceability. + +3. **Ongoing Audits** + Periodic post-deployment assessments verify that models continue to meet ethical benchmarks. Drift detectors trigger re-evaluation when data distributions change, and user feedback channels capture real-world concerns that numeric metrics might miss. 
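+
+To make that last point concrete, one lightweight shape an automated audit can take is a distribution-shift check. The sketch below computes a population stability index (PSI) for a single feature; the synthetic data, bucket count, and alert threshold are purely illustrative assumptions:
+
+```python
+import numpy as np
+
+def population_stability_index(expected, actual, bins=10):
+    """PSI between a training-time sample and a recent production sample of one feature."""
+    edges = np.histogram_bin_edges(expected, bins=bins)
+    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
+    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
+    # clip so empty buckets do not produce division by zero or log(0)
+    expected_pct = np.clip(expected_pct, 1e-6, None)
+    actual_pct = np.clip(actual_pct, 1e-6, None)
+    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))
+
+rng = np.random.default_rng(0)
+training_sample = rng.normal(0.0, 1.0, 5_000)    # snapshot stored at training time
+production_sample = rng.normal(0.5, 1.0, 5_000)  # recent live data, shifted for illustration
+
+psi = population_stability_index(training_sample, production_sample)
+print(f'PSI = {psi:.3f}')
+if psi > 0.2:  # a common rule-of-thumb threshold, used here only as an illustrative alert level
+    print('Distribution shift detected: schedule a fairness and performance review')
+```
+
+A value near zero indicates little shift, while larger values suggest the audit should trigger a deeper review or a retraining run.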
+ +By embedding these governance structures into everyday workflows, organizations demonstrate a commitment to responsible AI and create clear escalation paths when ethical dilemmas arise. + +## Integrating Ethics into the ML Lifecycle + +Ethical considerations should permeate every stage of model development: + +- **Problem Definition** + Engage stakeholders—including those likely to bear the brunt of errors—to clarify objectives, define protected attributes, and establish fairness criteria. + +- **Data Engineering** + Instrument pipelines with lineage tracking so data transformations remain transparent. Apply schema validation and anonymization where necessary to protect privacy. + +- **Modeling and Evaluation** + Extend evaluation suites to include fairness metrics (e.g., demographic parity, equalized odds) alongside accuracy and latency. Use cross-validation stratified by demographic groups to ensure robust performance. + +- **Deployment and Monitoring** + Monitor real-time fairness indicators—such as disparate impact ratios—and trigger alerts when metrics stray beyond acceptable bounds. Provide dashboards for both technical teams and non-technical stakeholders to inspect model health. + +This holistic integration reduces the risk that ethical risks will be an afterthought or discovered only once harm has occurred. + +## Cultivating an Ethical AI Culture + +Technical measures alone cannot guarantee ethical outcomes. An organizational culture that values transparency, diversity, and continuous learning is essential. Leadership should champion ethics training, sponsor cross-team hackathons focused on bias detection, and reward contributions to open-source fairness tools. By celebrating successes and honestly confronting failures, data science teams reinforce the message that ethical AI is not merely compliance, but a strategic asset that builds long-term trust with users and regulators alike. + +Embedding ethics into machine learning transforms models from black-box decision engines into accountable, equitable systems. Through careful bias mitigation, transparent interpretability, rigorous governance, and a culture of responsibility, practitioners can harness AI’s potential while safeguarding the values that underpin a fair and just society. diff --git a/_posts/2025-06-15-smote_pitfalls.md b/_posts/2025-06-15-smote_pitfalls.md index d1dd7e48..3fb38686 100644 --- a/_posts/2025-06-15-smote_pitfalls.md +++ b/_posts/2025-06-15-smote_pitfalls.md @@ -4,7 +4,8 @@ categories: - Machine Learning classes: wide date: '2025-06-15' -excerpt: SMOTE generates synthetic samples to rebalance datasets, but using it blindly can create unrealistic data and biased models. +excerpt: SMOTE generates synthetic samples to rebalance datasets, but using it blindly + can create unrealistic data and biased models. header: image: /assets/images/data_science_18.jpg og_image: /assets/images/data_science_18.jpg @@ -13,39 +14,71 @@ header: teaser: /assets/images/data_science_18.jpg twitter_image: /assets/images/data_science_18.jpg keywords: -- SMOTE +- Smote - Oversampling - Imbalanced data - Machine learning pitfalls -seo_description: Understand the drawbacks of applying SMOTE for imbalanced datasets and why improper use may reduce model reliability. +seo_description: Understand the drawbacks of applying SMOTE for imbalanced datasets + and why improper use may reduce model reliability. 
seo_title: 'When SMOTE Backfires: Avoiding the Risks of Synthetic Oversampling' seo_type: article -summary: Synthetic Minority Over-sampling Technique (SMOTE) creates artificial examples to balance classes, but ignoring its assumptions can distort your dataset and harm model performance. +summary: Synthetic Minority Over-sampling Technique (SMOTE) creates artificial examples + to balance classes, but ignoring its assumptions can distort your dataset and harm + model performance. tags: -- SMOTE +- Smote - Class imbalance - Machine learning -title: "Why SMOTE Isn't Always the Answer" +title: Why SMOTE Isn't Always the Answer --- -Synthetic Minority Over-sampling Technique, or **SMOTE**, is a popular approach for handling imbalanced classification problems. By interpolating between existing minority-class instances, it produces new, synthetic samples that appear to boost model performance. +## The Imbalanced Classification Problem -## 1. Distorting the Data Distribution +In many real-world applications—from fraud detection to rare disease diagnosis—datasets exhibit severe class imbalance, where one category (the minority class) is vastly underrepresented. Standard training procedures on such skewed datasets tend to bias models toward predicting the majority class, resulting in poor recall or precision for the minority class. Addressing this imbalance is critical whenever the cost of missing a minority example far outweighs the cost of a false alarm. -SMOTE assumes that minority points can be meaningfully combined to create realistic examples. In many real-world datasets, however, minority observations may form discrete clusters or contain noise. Interpolating across these can introduce unrealistic patterns that do not actually exist in production data. +## How SMOTE Generates Synthetic Samples -## 2. Risk of Overfitting +The Synthetic Minority Over-sampling Technique (SMOTE) tackles class imbalance by creating new, synthetic minority-class instances rather than merely duplicating existing ones. For each minority sample $$x_i$$, SMOTE selects one of its $$k$$ nearest neighbors $$x_{\text{nn}}$$, computes the difference vector, scales it by a random factor $$\lambda \in [0,1]$$, and adds it back to $$x_i$$. Formally: -Adding synthetic samples increases the size of the minority class but does not add truly new information. Models may overfit to these artificial points, learning overly specific boundaries that fail to generalize when faced with genuine data. +$$ +x_{\text{new}} \;=\; x_i \;+\; \lambda \,\bigl(x_{\text{nn}} - x_i\bigr). +$$ -## 3. High-Dimensional Challenges +This interpolation process effectively spreads new points along the line segments joining minority samples, ostensibly enriching the decision regions for the underrepresented class. -In high-dimensional feature spaces, distances become less meaningful. SMOTE relies on nearest neighbors to generate new points, so as dimensionality grows, the synthetic samples may fall in regions that have little relevance to the real-world problem. +## Distorting the Data Distribution -## 4. Consider Alternatives +SMOTE’s assumption that nearby minority samples can be interpolated into realistic examples does not always hold. In domains where minority instances form several well-separated clusters—each corresponding to distinct subpopulations—connecting points across clusters yields synthetic observations that lie in regions devoid of genuine data. 
This distortion can mislead the classifier into learning decision boundaries around artifacts of the oversampling process rather than true patterns. Even within a single cluster, the presence of noise or mislabeled examples means that interpolation may amplify spurious features, embedding them deep within the augmented dataset. -Before defaulting to SMOTE, evaluate simpler techniques such as collecting more minority data, adjusting class weights, or using algorithms designed for imbalanced tasks. Sometimes, strategic undersampling or cost-sensitive learning yields better results without fabricating new observations. +## Risk of Overfitting to Artificial Points -## Conclusion +By bolstering the minority class with synthetic data, SMOTE increases sample counts but fails to contribute new information beyond what is already captured by existing examples. A model trained on the augmented set may lock onto the specific, interpolated directions introduced by SMOTE, fitting overly complex boundaries that separate synthetic points rather than underlying real-world structure. This overfitting manifests as excellent performance on cross-validation folds that include synthetic data, yet degrades sharply when confronted with out-of-sample real data. In effect, the model learns to “recognize” the synthetic signature of SMOTE rather than the authentic signal. -SMOTE can help balance datasets, but it should be applied with caution. Blindly generating synthetic data can mislead your models and mask deeper issues with class imbalance. Always validate whether the new samples make sense for your domain and explore alternative strategies first. +## High-Dimensional Feature Space Challenges + +As the number of features grows, the concept of “nearest neighbor” becomes increasingly unreliable: distances in high-dimensional spaces tend to concentrate, and local neighborhoods lose their discriminative power. When SMOTE selects nearest neighbors under such circumstances, it can create synthetic samples that fall far from any true sample’s manifold. These new points may inhabit regions where the model has no training experience, further exacerbating generalization errors. In domains like text or genomics—where feature vectors can easily exceed thousands of dimensions—naïvely applying SMOTE often does more harm than good. + +## Alternative Approaches to Handling Imbalance + +Before resorting to synthetic augmentation, it is prudent to explore other strategies. When feasible, collecting or labeling additional minority-class data addresses imbalance at its root. Adjusting class weights in the learning algorithm can penalize misclassification of the minority class more heavily, guiding the optimizer without altering the data distribution. Cost-sensitive learning techniques embed imbalance considerations into the loss function itself, while specialized algorithms—such as one-class SVMs or gradient boosting frameworks with built-in imbalance handling—often yield robust minority performance. In cases where data collection is infeasible, strategic undersampling of the majority class or hybrid methods (combining limited SMOTE with selective cleaning of noisy instances) can strike a balance between representation and realism. + +## Guidelines and Best Practices + +When SMOTE emerges as a necessary tool, practitioners should apply it judiciously: + +1. **Cluster-Aware Sampling** + Segment the minority class into coherent clusters before oversampling to avoid bridging unrelated subpopulations. +2. 
+ +## Guidelines and Best Practices + +When SMOTE emerges as a necessary tool, practitioners should apply it judiciously: + +1. **Cluster-Aware Sampling** + Segment the minority class into coherent clusters before oversampling to avoid bridging unrelated subpopulations. +2. **Noise Filtering** + Remove or down-weight samples with anomalous feature values to prevent generating synthetic points around noise. +3. **Dimensionality Reduction** + Project data into a lower-dimensional manifold (e.g., via PCA or autoencoders) where nearest neighbors are more meaningful, perform SMOTE there, and map back to the original space if needed. +4. **Validation on Real Data** + Reserve a hold-out set of authentic minority examples to evaluate model performance, ensuring that gains are not driven by artificial points. +5. **Combine with Ensemble Methods** + Integrate SMOTE within ensemble learning pipelines—such as bagging or boosting—so that each base learner sees a slightly different augmented dataset, reducing the risk of overfitting to any single synthetic pattern. + +Following these practices helps preserve the integrity of the original data distribution while still mitigating class imbalance. + +## Final Thoughts + +SMOTE remains one of the most widely adopted tools for addressing imbalanced classification, thanks to its conceptual simplicity and ease of implementation. Yet, as with any data augmentation method, it carries inherent risks of distortion and overfitting, particularly in noisy or high-dimensional feature spaces. By understanding SMOTE’s underlying assumptions and combining it with noise mitigation, dimensionality reduction, and robust validation, practitioners can harness its benefits without succumbing to its pitfalls. When applied thoughtfully—and complemented by alternative imbalance-handling techniques—SMOTE can form one component of a comprehensive strategy for fair and accurate classification. diff --git a/_sass/minimal-mistakes.scss b/_sass/minimal-mistakes.scss index 3b252e56..644c92b0 100644 --- a/_sass/minimal-mistakes.scss +++ b/_sass/minimal-mistakes.scss @@ -1,40 +1,43 @@ -/*!
+/*!\ * Minimal Mistakes Jekyll Theme 4.24.0 by Michael Rose * Copyright 2013-2020 Michael Rose - mademistakes.com | @mmistakes * Licensed under MIT (https://github.com/mmistakes/minimal-mistakes/blob/master/LICENSE) -*/ + */ /* Variables */ -@import "minimal-mistakes/variables"; +@forward "minimal-mistakes/variables"; /* Mixins and functions */ -@import "minimal-mistakes/vendor/breakpoint/breakpoint"; -@include breakpoint-set("to ems", true); -@import "minimal-mistakes/vendor/magnific-popup/magnific-popup"; // Magnific Popup -@import "minimal-mistakes/vendor/susy/susy"; -@import "minimal-mistakes/mixins"; +@forward "minimal-mistakes/vendor/breakpoint/breakpoint"; +@use "minimal-mistakes/vendor/breakpoint/breakpoint" as breakpoint; +@forward "minimal-mistakes/vendor/magnific-popup/magnific-popup"; +@forward "minimal-mistakes/vendor/susy/susy"; +@forward "minimal-mistakes/mixins"; /* Core CSS */ -@import "minimal-mistakes/reset"; -@import "minimal-mistakes/base"; -@import "minimal-mistakes/forms"; -@import "minimal-mistakes/tables"; -@import "minimal-mistakes/animations"; +@forward "minimal-mistakes/reset"; +@forward "minimal-mistakes/base"; +@forward "minimal-mistakes/forms"; +@forward "minimal-mistakes/tables"; +@forward "minimal-mistakes/animations"; /* Components */ -@import "minimal-mistakes/buttons"; -@import "minimal-mistakes/notices"; -@import "minimal-mistakes/masthead"; -@import "minimal-mistakes/navigation"; -@import "minimal-mistakes/footer"; -@import "minimal-mistakes/search"; -@import "minimal-mistakes/syntax"; +@forward "minimal-mistakes/buttons"; +@forward "minimal-mistakes/notices"; +@forward "minimal-mistakes/masthead"; +@forward "minimal-mistakes/navigation"; +@forward "minimal-mistakes/footer"; +@forward "minimal-mistakes/search"; +@forward "minimal-mistakes/syntax"; /* Utility classes */ -@import "minimal-mistakes/utilities"; +@forward "minimal-mistakes/utilities"; /* Layout specific */ -@import "minimal-mistakes/page"; -@import "minimal-mistakes/archive"; -@import "minimal-mistakes/sidebar"; -@import "minimal-mistakes/print"; +@forward "minimal-mistakes/page"; +@forward "minimal-mistakes/archive"; +@forward "minimal-mistakes/sidebar"; +@forward "minimal-mistakes/print"; + +// Configure Breakpoint after loading all modules +@include breakpoint.breakpoint-set("to ems", true); diff --git a/_sass/minimal-mistakes/_archive.scss b/_sass/minimal-mistakes/_archive.scss index 9f576323..957ce8f9 100644 --- a/_sass/minimal-mistakes/_archive.scss +++ b/_sass/minimal-mistakes/_archive.scss @@ -2,6 +2,10 @@ ARCHIVE ========================================================================== */ +@use "variables" as *; +@use "mixins" as *; +@use "vendor/breakpoint/breakpoint" as *; + .archive { margin-top: 1em; margin-bottom: 2em; diff --git a/_sass/minimal-mistakes/_base.scss b/_sass/minimal-mistakes/_base.scss index 01c8a49f..d8ed5e83 100644 --- a/_sass/minimal-mistakes/_base.scss +++ b/_sass/minimal-mistakes/_base.scss @@ -2,6 +2,10 @@ BASE ELEMENTS ========================================================================== */ +@use "variables" as *; +@use "vendor/breakpoint/breakpoint" as *; +@use "mixins"; + html { /* sticky footer fix */ position: relative; diff --git a/_sass/minimal-mistakes/_buttons.scss b/_sass/minimal-mistakes/_buttons.scss index 9ef60a84..dd7478ac 100644 --- a/_sass/minimal-mistakes/_buttons.scss +++ b/_sass/minimal-mistakes/_buttons.scss @@ -1,3 +1,6 @@ +@use "sass:color"; +@use "variables" as *; +@use "mixins" as *; // Import mixins for color contrast 
helpers /* ========================================================================== BUTTONS ========================================================================== */ @@ -56,7 +59,7 @@ } &:hover { - @include yiq-contrasted(mix(#000, $color, 20%)); + @include yiq-contrasted(color.mix(#000, $color, 20%)); } } } diff --git a/_sass/minimal-mistakes/_footer.scss b/_sass/minimal-mistakes/_footer.scss index c0b0625b..15f4d153 100644 --- a/_sass/minimal-mistakes/_footer.scss +++ b/_sass/minimal-mistakes/_footer.scss @@ -2,6 +2,10 @@ FOOTER ========================================================================== */ +@use "variables" as *; +@use "mixins" as *; +@use "vendor/breakpoint/breakpoint" as *; + .page__footer { @include clearfix; float: left; diff --git a/_sass/minimal-mistakes/_forms.scss b/_sass/minimal-mistakes/_forms.scss index 6c6fdfad..a0d6ac4b 100644 --- a/_sass/minimal-mistakes/_forms.scss +++ b/_sass/minimal-mistakes/_forms.scss @@ -1,4 +1,5 @@ @use 'sass:math'; +@use "variables" as *; /* ========================================================================== Forms ========================================================================== */ diff --git a/_sass/minimal-mistakes/_masthead.scss b/_sass/minimal-mistakes/_masthead.scss index 2dfefcce..c19b95c8 100644 --- a/_sass/minimal-mistakes/_masthead.scss +++ b/_sass/minimal-mistakes/_masthead.scss @@ -2,6 +2,10 @@ MASTHEAD ========================================================================== */ +@use "variables" as *; +@use "mixins" as *; +@use "vendor/breakpoint/breakpoint" as *; + .masthead { position: relative; border-bottom: 1px solid $border-color; diff --git a/_sass/minimal-mistakes/_mixins.scss b/_sass/minimal-mistakes/_mixins.scss index 55ce8eb0..a0953c98 100644 --- a/_sass/minimal-mistakes/_mixins.scss +++ b/_sass/minimal-mistakes/_mixins.scss @@ -1,5 +1,6 @@ @use 'sass:math'; @use "sass:color"; +@use "variables" as *; /* ========================================================================== MIXINS ========================================================================== */ diff --git a/_sass/minimal-mistakes/_navigation.scss b/_sass/minimal-mistakes/_navigation.scss index 24d1b1b5..58333bcf 100644 --- a/_sass/minimal-mistakes/_navigation.scss +++ b/_sass/minimal-mistakes/_navigation.scss @@ -1,3 +1,7 @@ +@use "sass:color"; +@use "variables" as *; +@use "mixins" as *; +@use "vendor/breakpoint/breakpoint" as *; /* ========================================================================== NAVIGATION ========================================================================== */ @@ -17,6 +21,16 @@ animation: $intro-transition; -webkit-animation-delay: 0.3s; animation-delay: 0.3s; + color: $muted-text-color; + + a { + color: inherit; + text-decoration: none; + + &:hover { + color: $text-color; + } + } @include breakpoint($x-large) { max-width: $x-large; @@ -43,6 +57,7 @@ .current { font-weight: bold; + color: $text-color; } } @@ -80,7 +95,7 @@ text-align: center; text-decoration: none; color: $muted-text-color; - border: 1px solid mix(#000, $border-color, 25%); + border: 1px solid color.mix(#000, $border-color, 25%); border-radius: 0; &:hover { @@ -129,7 +144,7 @@ text-align: center; text-decoration: none; color: $muted-text-color; - border: 1px solid mix(#000, $border-color, 25%); + border: 1px solid color.mix(#000, $border-color, 25%); border-radius: $border-radius; &:hover { @@ -384,7 +399,7 @@ &:hover { color: #fff; border-color: $gray; - background-color: mix(white, #000, 20%); + background-color: 
color.mix(white, #000, 20%); &:before, &:after { @@ -396,7 +411,7 @@ /* selected*/ input:checked + label { color: white; - background-color: mix(white, #000, 20%); + background-color: color.mix(white, #000, 20%); &:before, &:after { diff --git a/_sass/minimal-mistakes/_notices.scss b/_sass/minimal-mistakes/_notices.scss index 90570b01..a00ed2ec 100644 --- a/_sass/minimal-mistakes/_notices.scss +++ b/_sass/minimal-mistakes/_notices.scss @@ -1,3 +1,6 @@ +@use "sass:color"; +@use "variables" as *; +@use "mixins" as *; /* ========================================================================== NOTICE TEXT BLOCKS ========================================================================== */ @@ -17,7 +20,7 @@ font-family: $global-font-family; font-size: $type-size-6 !important; text-indent: initial; /* override*/ - background-color: mix($background-color, $notice-color, $notice-background-mix); + background-color: color.mix($background-color, $notice-color, $notice-background-mix); border-radius: $border-radius; box-shadow: 0 1px 1px rgba($notice-color, 0.25); @@ -46,19 +49,19 @@ } a { - color: mix(#000, $notice-color, 10%); + color: color.mix(#000, $notice-color, 10%); &:hover { - color: mix(#000, $notice-color, 50%); + color: color.mix(#000, $notice-color, 50%); } } @at-root #{selector-unify(&, "blockquote")} { - border-left-color: mix(#000, $notice-color, 10%); + border-left-color: color.mix(#000, $notice-color, 10%); } code { - background-color: mix($background-color, $notice-color, $code-notice-background-mix) + background-color: color.mix($background-color, $notice-color, $code-notice-background-mix) } pre code { diff --git a/_sass/minimal-mistakes/_page.scss b/_sass/minimal-mistakes/_page.scss index f8d0a50c..15a3ceb0 100644 --- a/_sass/minimal-mistakes/_page.scss +++ b/_sass/minimal-mistakes/_page.scss @@ -1,3 +1,7 @@ +@use "sass:color"; +@use "variables" as *; +@use "mixins" as *; +@use "vendor/breakpoint/breakpoint" as *; /* ========================================================================== SINGLE PAGE/POST ========================================================================== */ @@ -347,7 +351,7 @@ body { margin-bottom: 8px; padding: 5px 10px; text-decoration: none; - border: 1px solid mix(#000, $border-color, 25%); + border: 1px solid color.mix(#000, $border-color, 25%); border-radius: $border-radius; &:hover { diff --git a/_sass/minimal-mistakes/_reset.scss b/_sass/minimal-mistakes/_reset.scss index 97c1733d..eb6559e4 100644 --- a/_sass/minimal-mistakes/_reset.scss +++ b/_sass/minimal-mistakes/_reset.scss @@ -2,6 +2,10 @@ STYLE RESETS ========================================================================== */ +@use "variables" as *; +@use "vendor/breakpoint/breakpoint" as *; +@use "mixins"; + * { box-sizing: border-box; } html { diff --git a/_sass/minimal-mistakes/_search.scss b/_sass/minimal-mistakes/_search.scss index fa7ee832..d3e30c2b 100644 --- a/_sass/minimal-mistakes/_search.scss +++ b/_sass/minimal-mistakes/_search.scss @@ -1,3 +1,7 @@ +@use "sass:color"; +@use "variables" as *; +@use "mixins" as *; +@use "vendor/breakpoint/breakpoint" as *; /* ========================================================================== SEARCH ========================================================================== */ @@ -21,7 +25,7 @@ transition: 0.2s; &:hover { - color: mix(#000, $primary-color, 25%); + color: color.mix(#000, $primary-color, 25%); } } diff --git a/_sass/minimal-mistakes/_sidebar.scss b/_sass/minimal-mistakes/_sidebar.scss index 02b455b4..a188b138 100644 --- 
a/_sass/minimal-mistakes/_sidebar.scss +++ b/_sass/minimal-mistakes/_sidebar.scss @@ -2,6 +2,10 @@ SIDEBAR ========================================================================== */ +@use "variables" as *; +@use "mixins" as *; +@use "vendor/breakpoint/breakpoint" as *; + /* Default ========================================================================== */ diff --git a/_sass/minimal-mistakes/_syntax.scss b/_sass/minimal-mistakes/_syntax.scss index 72652020..54ce805a 100644 --- a/_sass/minimal-mistakes/_syntax.scss +++ b/_sass/minimal-mistakes/_syntax.scss @@ -2,6 +2,8 @@ Syntax highlighting ========================================================================== */ +@use "variables" as *; + div.highlighter-rouge, figure.highlight { position: relative; diff --git a/_sass/minimal-mistakes/_tables.scss b/_sass/minimal-mistakes/_tables.scss index c270a775..a8032e83 100644 --- a/_sass/minimal-mistakes/_tables.scss +++ b/_sass/minimal-mistakes/_tables.scss @@ -1,3 +1,5 @@ +@use "sass:color"; +@use "variables" as *; /* ========================================================================== TABLES ========================================================================== */ @@ -18,7 +20,7 @@ table { thead { background-color: $border-color; - border-bottom: 2px solid mix(#000, $border-color, 25%); + border-bottom: 2px solid color.mix(#000, $border-color, 25%); } th { @@ -29,7 +31,7 @@ th { td { padding: 0.5em; - border-bottom: 1px solid mix(#000, $border-color, 25%); + border-bottom: 1px solid color.mix(#000, $border-color, 25%); } tr, diff --git a/_sass/minimal-mistakes/_utilities.scss b/_sass/minimal-mistakes/_utilities.scss index 1c127d36..7479cc7e 100644 --- a/_sass/minimal-mistakes/_utilities.scss +++ b/_sass/minimal-mistakes/_utilities.scss @@ -1,3 +1,7 @@ +@use "sass:color"; +@use "variables" as *; +@use "mixins" as *; +@use "vendor/breakpoint/breakpoint" as *; /* ========================================================================== UTILITY CLASSES ========================================================================== */ @@ -182,7 +186,9 @@ body:hover .visually-hidden button { .full { @include breakpoint($large) { - margin-right: -1 * span(2.5 of 12) !important; + // `span()` returns a `calc()` expression which cannot be multiplied. + // Passing a negative value directly avoids the undefined operation error. 
+ margin-right: span(-2.5 of 12) !important; } } @@ -416,7 +422,7 @@ body:hover .visually-hidden button { .navicon, .navicon:before, .navicon:after { - background: mix(#000, $primary-color, 25%); + background: color.mix(#000, $primary-color, 25%); } &.close { @@ -516,12 +522,12 @@ body:hover .visually-hidden button { ========================================================================== */ .footnote { - color: mix(#fff, $gray, 25%); + color: color.mix(#fff, $gray, 25%); text-decoration: none; } .footnotes { - color: mix(#fff, $gray, 25%); + color: color.mix(#fff, $gray, 25%); ol, li, diff --git a/_sass/minimal-mistakes/_variables.scss b/_sass/minimal-mistakes/_variables.scss index 597f6a77..0e5c8e35 100644 --- a/_sass/minimal-mistakes/_variables.scss +++ b/_sass/minimal-mistakes/_variables.scss @@ -1,3 +1,4 @@ +@use "sass:color"; /* ========================================================================== Variables ========================================================================== */ @@ -32,8 +33,8 @@ $calisto: "Calisto MT", serif !default; $garamond: Garamond, serif !default; // Setting the fonts -$global-font-family: "Roboto", Helvetica, Arial, sans-serif; -$header-font-family: "Lora", "Times New Roman", serif; +$global-font-family: 'Inter', system-ui, sans-serif; +$header-font-family: 'Nunito Sans', system-ui, sans-serif; $caption-font-family: "Cardo, serif"; /* type scale */ @@ -59,27 +60,33 @@ $h-size-6: 1em !default; // ~16px ========================================================================== */ $gray: #7a8288 !default; -$dark-gray: mix(#000, $gray, 50%) !default; -$darker-gray: mix(#000, $gray, 60%) !default; -$light-gray: mix(#fff, $gray, 50%) !default; -$lighter-gray: mix(#fff, $gray, 90%) !default; +$dark-gray: color.mix(#000, $gray, 50%) !default; +$darker-gray: color.mix(#000, $gray, 60%) !default; +$light-gray: color.mix(#fff, $gray, 50%) !default; +$lighter-gray: color.mix(#fff, $gray, 90%) !default; $background-color: #fff !default; $code-background-color: #fafafa !default; $code-background-color-dark: $light-gray !default; $text-color: $dark-gray !default; -$muted-text-color: mix(#fff, $text-color, 20%) !default; +$muted-text-color: color.mix(#fff, $text-color, 20%) !default; $border-color: $lighter-gray !default; $form-background-color: $lighter-gray !default; $footer-background-color: $lighter-gray !default; -$primary-color: #6f777d !default; +$color-primary: #2a4849 !default; +$color-secondary: #6c757d !default; +$color-background: #f7f9fa !default; +$color-surface: #ffffff !default; +$color-accent: #ffca00 !default; + +$primary-color: $color-primary !default; $success-color: #3fa63f !default; $warning-color: #d67f05 !default; $danger-color: #ee5f5b !default; $info-color: #3b9cba !default; $focus-color: $primary-color !default; -$active-color: mix(#fff, $primary-color, 80%) !default; +$active-color: color.mix(#fff, $primary-color, 80%) !default; /* YIQ color contrast */ $yiq-contrasted-dark-default: $dark-gray !default; @@ -114,12 +121,12 @@ $youtube-color: #bb0000 !default; $xing-color: #006567 !default; /* links */ -$link-color: mix(#000, $info-color, 20%) !default; -$link-color-hover: mix(#000, $link-color, 25%) !default; -$link-color-visited: mix(#fff, $link-color, 15%) !default; +$link-color: color.mix(#000, $info-color, 20%) !default; +$link-color-hover: color.mix(#000, $link-color, 25%) !default; +$link-color-visited: color.mix(#fff, $link-color, 15%) !default; $masthead-link-color: $primary-color !default; -$masthead-link-color-hover: 
mix(#000, $primary-color, 25%) !default; -$navicon-link-color-hover: mix(#fff, $primary-color, 75%) !default; +$masthead-link-color-hover: color.mix(#000, $primary-color, 25%) !default; +$navicon-link-color-hover: color.mix(#fff, $primary-color, 75%) !default; /* notices */ $notice-background-mix: 80% !default; diff --git a/_sass/minimal-mistakes/skins/_air.scss b/_sass/minimal-mistakes/skins/_air.scss index 0e5360c3..8dff425e 100644 --- a/_sass/minimal-mistakes/skins/_air.scss +++ b/_sass/minimal-mistakes/skins/_air.scss @@ -1,18 +1,21 @@ +@use "sass:color"; +@use "../variables" as *; /* ========================================================================== Air skin ========================================================================== */ /* Colors */ -$background-color: #eeeeee !default; -$text-color: #222831 !default; -$muted-text-color: #393e46 !default; -$primary-color: #0092ca !default; -$border-color: mix(#fff, #393e46, 75%) !default; +$background-color: $color-background !default; +$text-color: #2a2a2a !default; +$muted-text-color: $color-secondary !default; +$primary-color: $color-primary !default; +$accent-color: $color-accent !default; +$border-color: color.mix(#fff, $text-color, 85%) !default; $footer-background-color: $primary-color !default; -$link-color: #393e46 !default; +$link-color: $primary-color !default; $masthead-link-color: $text-color !default; $masthead-link-color-hover: $text-color !default; -$navicon-link-color-hover: mix(#fff, $text-color, 80%) !default; +$navicon-link-color-hover: color.mix(#fff, $text-color, 80%) !default; .page__footer { color: #fff !important; // override diff --git a/_sass/minimal-mistakes/skins/_aqua.scss b/_sass/minimal-mistakes/skins/_aqua.scss index 7c3944e0..b58e4cb4 100644 --- a/_sass/minimal-mistakes/skins/_aqua.scss +++ b/_sass/minimal-mistakes/skins/_aqua.scss @@ -1,13 +1,14 @@ +@use "sass:color"; /* ========================================================================== Aqua skin ========================================================================== */ /* Colors */ $gray : #1976d2 !default; -$dark-gray : mix(#000, $gray, 40%) !default; -$darker-gray : mix(#000, $gray, 60%) !default; -$light-gray : mix(#fff, $gray, 50%) !default; -$lighter-gray : mix(#fff, $gray, 90%) !default; +$dark-gray : color.mix(#000, $gray, 40%) !default; +$darker-gray : color.mix(#000, $gray, 60%) !default; +$light-gray : color.mix(#fff, $gray, 50%) !default; +$lighter-gray : color.mix(#fff, $gray, 90%) !default; $body-color : #fff !default; $background-color : #f0fff0 !default; @@ -24,10 +25,10 @@ $info-color : #03a9f4 !default; /* links */ $link-color : $info-color !default; -$link-color-hover : mix(#000, $link-color, 25%) !default; -$link-color-visited : mix(#fff, $link-color, 25%) !default; +$link-color-hover : color.mix(#000, $link-color, 25%) !default; +$link-color-visited : color.mix(#fff, $link-color, 25%) !default; $masthead-link-color : $primary-color !default; -$masthead-link-color-hover : mix(#000, $primary-color, 25%) !default; +$masthead-link-color-hover : color.mix(#000, $primary-color, 25%) !default; /* notices */ $notice-background-mix: 90% !default; diff --git a/_sass/minimal-mistakes/skins/_contrast.scss b/_sass/minimal-mistakes/skins/_contrast.scss index 38283b8f..81c6755c 100644 --- a/_sass/minimal-mistakes/skins/_contrast.scss +++ b/_sass/minimal-mistakes/skins/_contrast.scss @@ -1,3 +1,4 @@ +@use "sass:color"; /* ========================================================================== Contrast skin 
========================================================================== */ @@ -6,12 +7,12 @@ $text-color: #000 !default; $muted-text-color: $text-color !default; $primary-color: #ff0000 !default; -$border-color: mix(#fff, $text-color, 75%) !default; +$border-color: color.mix(#fff, $text-color, 75%) !default; $footer-background-color: #000 !default; $link-color: #0000ff !default; $masthead-link-color: $text-color !default; $masthead-link-color-hover: $text-color !default; -$navicon-link-color-hover: mix(#fff, $text-color, 80%) !default; +$navicon-link-color-hover: color.mix(#fff, $text-color, 80%) !default; /* contrast syntax highlighting (base16) */ $base00: #000000 !default; diff --git a/_sass/minimal-mistakes/skins/_dark.scss b/_sass/minimal-mistakes/skins/_dark.scss index 38053493..6ca28a2b 100644 --- a/_sass/minimal-mistakes/skins/_dark.scss +++ b/_sass/minimal-mistakes/skins/_dark.scss @@ -1,3 +1,4 @@ +@use "sass:color"; /* ========================================================================== Dark skin ========================================================================== */ @@ -6,17 +7,17 @@ $background-color: #252a34 !default; $text-color: #eaeaea !default; $primary-color: #00adb5 !default; -$border-color: mix(#fff, $background-color, 20%) !default; -$code-background-color: mix(#000, $background-color, 15%) !default; -$code-background-color-dark: mix(#000, $background-color, 20%) !default; -$form-background-color: mix(#000, $background-color, 15%) !default; -$footer-background-color: mix(#000, $background-color, 30%) !default; -$link-color: mix($primary-color, $text-color, 40%) !default; -$link-color-hover: mix(#fff, $link-color, 25%) !default; -$link-color-visited: mix(#000, $link-color, 25%) !default; +$border-color: color.mix(#fff, $background-color, 20%) !default; +$code-background-color: color.mix(#000, $background-color, 15%) !default; +$code-background-color-dark: color.mix(#000, $background-color, 20%) !default; +$form-background-color: color.mix(#000, $background-color, 15%) !default; +$footer-background-color: color.mix(#000, $background-color, 30%) !default; +$link-color: color.mix($primary-color, $text-color, 40%) !default; +$link-color-hover: color.mix(#fff, $link-color, 25%) !default; +$link-color-visited: color.mix(#000, $link-color, 25%) !default; $masthead-link-color: $text-color !default; -$masthead-link-color-hover: mix(#000, $text-color, 20%) !default; -$navicon-link-color-hover: mix(#000, $background-color, 30%) !default; +$masthead-link-color-hover: color.mix(#000, $text-color, 20%) !default; +$navicon-link-color-hover: color.mix(#000, $background-color, 30%) !default; .author__urls.social-icons i, .author__urls.social-icons .svg-inline--fa, diff --git a/_sass/minimal-mistakes/skins/_dirt.scss b/_sass/minimal-mistakes/skins/_dirt.scss index 5090f559..cdc4b0a3 100644 --- a/_sass/minimal-mistakes/skins/_dirt.scss +++ b/_sass/minimal-mistakes/skins/_dirt.scss @@ -1,3 +1,4 @@ +@use "sass:color"; /* ========================================================================== Dirt skin ========================================================================== */ @@ -12,7 +13,7 @@ $footer-background-color: #e9dcbe !default; $link-color: #343434 !default; $masthead-link-color: $text-color !default; $masthead-link-color-hover: $text-color !default; -$navicon-link-color-hover: mix(#fff, $text-color, 80%) !default; +$navicon-link-color-hover: color.mix(#fff, $text-color, 80%) !default; /* dirt syntax highlighting (base16) */ $base00: #231e18 !default; diff --git 
a/_sass/minimal-mistakes/skins/_mint.scss b/_sass/minimal-mistakes/skins/_mint.scss index 28557a3a..94a19620 100644 --- a/_sass/minimal-mistakes/skins/_mint.scss +++ b/_sass/minimal-mistakes/skins/_mint.scss @@ -1,3 +1,4 @@ +@use "sass:color"; /* ========================================================================== Mint skin ========================================================================== */ @@ -7,12 +8,12 @@ $background-color: #f3f6f6 !default; $text-color: #40514e !default; $muted-text-color: #40514e !default; $primary-color: #11999e !default; -$border-color: mix(#fff, #40514e, 75%) !default; +$border-color: color.mix(#fff, #40514e, 75%) !default; $footer-background-color: #30e3ca !default; $link-color: #11999e !default; $masthead-link-color: $text-color !default; $masthead-link-color-hover: $text-color !default; -$navicon-link-color-hover: mix(#fff, $text-color, 80%) !default; +$navicon-link-color-hover: color.mix(#fff, $text-color, 80%) !default; .page__footer { color: #fff !important; // override diff --git a/_sass/minimal-mistakes/skins/_neon.scss b/_sass/minimal-mistakes/skins/_neon.scss index a4f2ef5d..6aee71c6 100644 --- a/_sass/minimal-mistakes/skins/_neon.scss +++ b/_sass/minimal-mistakes/skins/_neon.scss @@ -1,3 +1,4 @@ +@use "sass:color"; /* ========================================================================== Neon skin ========================================================================== */ @@ -6,17 +7,17 @@ $background-color: #141010 !default; $text-color: #fff6fb !default; $primary-color: #f21368 !default; -$border-color: mix(#fff, $background-color, 20%) !default; -$code-background-color: mix(#000, $background-color, 15%) !default; -$code-background-color-dark: mix(#000, $background-color, 20%) !default; -$form-background-color: mix(#000, $background-color, 15%) !default; -$footer-background-color: mix($primary-color, #000, 10%) !default; +$border-color: color.mix(#fff, $background-color, 20%) !default; +$code-background-color: color.mix(#000, $background-color, 15%) !default; +$code-background-color-dark: color.mix(#000, $background-color, 20%) !default; +$form-background-color: color.mix(#000, $background-color, 15%) !default; +$footer-background-color: color.mix($primary-color, #000, 10%) !default; $link-color: $primary-color !default; -$link-color-hover: mix(#fff, $link-color, 25%) !default; -$link-color-visited: mix(#000, $link-color, 25%) !default; +$link-color-hover: color.mix(#fff, $link-color, 25%) !default; +$link-color-visited: color.mix(#000, $link-color, 25%) !default; $masthead-link-color: $text-color !default; -$masthead-link-color-hover: mix(#000, $text-color, 20%) !default; -$navicon-link-color-hover: mix(#000, $background-color, 30%) !default; +$masthead-link-color-hover: color.mix(#000, $text-color, 20%) !default; +$navicon-link-color-hover: color.mix(#000, $background-color, 30%) !default; /* notices */ $notice-background-mix: 90% !default; diff --git a/_sass/minimal-mistakes/skins/_plum.scss b/_sass/minimal-mistakes/skins/_plum.scss index defa69cd..2d72b82c 100644 --- a/_sass/minimal-mistakes/skins/_plum.scss +++ b/_sass/minimal-mistakes/skins/_plum.scss @@ -1,3 +1,4 @@ +@use "sass:color"; /* ========================================================================== Plum skin ========================================================================== */ @@ -6,17 +7,17 @@ $background-color: #521477 !default; $text-color: #fffd86 !default; $primary-color: #c327ab !default; -$border-color: mix(#fff, $background-color, 20%) 
!default; -$code-background-color: mix(#000, $background-color, 15%) !default; -$code-background-color-dark: mix(#000, $background-color, 20%) !default; -$form-background-color: mix(#000, $background-color, 15%) !default; -$footer-background-color: mix(#000, $background-color, 25%) !default; +$border-color: color.mix(#fff, $background-color, 20%) !default; +$code-background-color: color.mix(#000, $background-color, 15%) !default; +$code-background-color-dark: color.mix(#000, $background-color, 20%) !default; +$form-background-color: color.mix(#000, $background-color, 15%) !default; +$footer-background-color: color.mix(#000, $background-color, 25%) !default; $link-color: $primary-color !default; -$link-color-hover: mix(#fff, $link-color, 25%) !default; -$link-color-visited: mix(#000, $link-color, 25%) !default; +$link-color-hover: color.mix(#fff, $link-color, 25%) !default; +$link-color-visited: color.mix(#000, $link-color, 25%) !default; $masthead-link-color: $text-color !default; -$masthead-link-color-hover: mix(#000, $text-color, 20%) !default; -$navicon-link-color-hover: mix(#000, $background-color, 30%) !default; +$masthead-link-color-hover: color.mix(#000, $text-color, 20%) !default; +$navicon-link-color-hover: color.mix(#000, $background-color, 30%) !default; /* notices */ $notice-background-mix: 70% !default; diff --git a/_sass/minimal-mistakes/skins/_sunrise.scss b/_sass/minimal-mistakes/skins/_sunrise.scss index bc259f6d..6f9bb79e 100644 --- a/_sass/minimal-mistakes/skins/_sunrise.scss +++ b/_sass/minimal-mistakes/skins/_sunrise.scss @@ -1,3 +1,4 @@ +@use "sass:color"; /* ========================================================================== Sunrise skin ========================================================================== */ @@ -8,17 +9,17 @@ $background-color: #e8d5b7 !default; $text-color: #000 !default; $muted-text-color: $dark-gray !default; $primary-color: #fc3a52 !default; -$border-color: mix(#000, $background-color, 20%) !default; -$code-background-color: mix(#fff, $background-color, 20%) !default; -$code-background-color-dark: mix(#000, $background-color, 10%) !default; -$form-background-color: mix(#fff, $background-color, 15%) !default; +$border-color: color.mix(#000, $background-color, 20%) !default; +$code-background-color: color.mix(#fff, $background-color, 20%) !default; +$code-background-color-dark: color.mix(#000, $background-color, 10%) !default; +$form-background-color: color.mix(#fff, $background-color, 15%) !default; $footer-background-color: #f9b248 !default; -$link-color: mix(#000, $primary-color, 10%) !default; -$link-color-hover: mix(#fff, $link-color, 25%) !default; -$link-color-visited: mix(#000, $link-color, 25%) !default; +$link-color: color.mix(#000, $primary-color, 10%) !default; +$link-color-hover: color.mix(#fff, $link-color, 25%) !default; +$link-color-visited: color.mix(#000, $link-color, 25%) !default; $masthead-link-color: $text-color !default; -$masthead-link-color-hover: mix(#000, $text-color, 20%) !default; -$navicon-link-color-hover: mix(#000, $background-color, 30%) !default; +$masthead-link-color-hover: color.mix(#000, $text-color, 20%) !default; +$navicon-link-color-hover: color.mix(#000, $background-color, 30%) !default; /* notices */ $notice-background-mix: 75% !default; diff --git a/_sass/minimal-mistakes/vendor/magnific-popup/_magnific-popup.scss b/_sass/minimal-mistakes/vendor/magnific-popup/_magnific-popup.scss index 3573b22e..b362886e 100644 --- a/_sass/minimal-mistakes/vendor/magnific-popup/_magnific-popup.scss +++ 
b/_sass/minimal-mistakes/vendor/magnific-popup/_magnific-popup.scss @@ -1,6 +1,8 @@ /* Magnific Popup CSS */ @use 'sass:math'; -@import "settings"; +// Import theme variables so popup fonts match the site +@use "../../variables" as *; +@use "settings" as *; //////////////////////// // diff --git a/_sass/minimal-mistakes/vendor/magnific-popup/_settings.scss b/_sass/minimal-mistakes/vendor/magnific-popup/_settings.scss index b389a23c..3bfe5948 100644 --- a/_sass/minimal-mistakes/vendor/magnific-popup/_settings.scss +++ b/_sass/minimal-mistakes/vendor/magnific-popup/_settings.scss @@ -1,4 +1,8 @@ @use 'sass:math'; + +// Import theme variables so popup styles use the same fonts and colours. +@use "../../variables" as *; + //////////////////////// // Settings // //////////////////////// diff --git a/assets/css/main.scss b/assets/css/main.scss index 27ef9982..f9055e1a 100644 --- a/assets/css/main.scss +++ b/assets/css/main.scss @@ -5,8 +5,10 @@ search: false @charset "utf-8"; -@import "minimal-mistakes/skins/{{ site.minimal_mistakes_skin | default: 'default' }}"; // skin -@import "minimal-mistakes"; // main partials +@use "sass:color"; +@use "minimal-mistakes/variables" as *; +@use "minimal-mistakes/skins/{{ site.minimal_mistakes_skin | default: 'default' }}" as *; // skin +@use "minimal-mistakes" as *; // main partials /* Remove underline and set color to black */ a { @@ -134,3 +136,16 @@ a:hover { max-height: 40vh; object-fit: cover; } + +/* Card styles for archive items */ +.archive__item.card { + background: $background-color; + border-radius: $border-radius; + padding: 1em; + box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); + transition: transform 0.2s, box-shadow 0.2s; +} +.archive__item.card:hover { + transform: translateY(-4px); + box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2); +} diff --git a/assets/images/Causal-Inference-Hero.png b/assets/images/Causal-Inference-Hero.png new file mode 100644 index 00000000..ab8d7015 Binary files /dev/null and b/assets/images/Causal-Inference-Hero.png differ diff --git a/assets/images/statistics_teaser.jpg b/assets/images/statistics_teaser.jpg new file mode 100644 index 00000000..6a0c8660 Binary files /dev/null and b/assets/images/statistics_teaser.jpg differ diff --git a/docs/frontmatter/index.md b/docs/frontmatter/index.md new file mode 100644 index 00000000..2e692c81 --- /dev/null +++ b/docs/frontmatter/index.md @@ -0,0 +1,8 @@ +--- +layout: single +title: "frontmatter package" +parent: Package Documentation +nav_order: 1 +--- + +Detailed documentation for the **frontmatter** Python package will be provided here. diff --git a/robots.txt b/robots.txt index 406218e4..e67e070b 100644 --- a/robots.txt +++ b/robots.txt @@ -1,5 +1,5 @@ User-agent: * Disallow: -# Sitemap location -Sitemap: https://diogoribeiro7.github.io/sitemap.xml \ No newline at end of file +# Sitemap +Sitemap: https://diogoribeiro7.github.io/sitemap.xml diff --git a/tests/test_fix_date.py b/tests/test_fix_date.py index 33cecd00..b727fd54 100644 --- a/tests/test_fix_date.py +++ b/tests/test_fix_date.py @@ -1,13 +1,14 @@ import os import sys import tempfile -import frontmatter -import pytest - -# Add the project root to sys.path -sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..'))) +# Add the project root to sys.path so local modules can be imported when +# running the tests via the `pytest` entry point (which doesn't prepend the +# working directory to `sys.path`). 
+sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))) +import frontmatter +import pytest import fix_date