Add three new data science posts for 2021

DiogoRibeiro7 · DiogoRibeiro7 · commit 242b07ba951a · 2025-06-12T00:18:24.000+01:00
diff --git a/_posts/2021-10-05-data_preprocessing_pipelines.md b/_posts/2021-10-05-data_preprocessing_pipelines.md
@@ -0,0 +1,47 @@
+---
+author_profile: false
+categories:
+- Data Science
+classes: wide
+date: '2021-10-05'
+excerpt: Learn how to design robust data preprocessing pipelines that prepare raw data for modeling.
+header:
+  image: /assets/images/data_science_6.jpg
+  og_image: /assets/images/data_science_6.jpg
+  overlay_image: /assets/images/data_science_6.jpg
+  show_overlay_excerpt: false
+  teaser: /assets/images/data_science_6.jpg
+  twitter_image: /assets/images/data_science_6.jpg
+keywords:
+- Data preprocessing
+- Pipelines
+- Data cleaning
+- Feature engineering
+seo_description: Discover best practices for building reusable data preprocessing pipelines that handle missing values, encoding, and feature scaling.
+seo_title: Building Data Preprocessing Pipelines for Reliable Models
+seo_type: article
+summary: This post outlines the key steps in constructing data preprocessing pipelines using tools like scikit-learn to ensure consistent model inputs.
+tags:
+- Data preprocessing
+- Machine learning
+- Feature engineering
+title: Designing Effective Data Preprocessing Pipelines
+---
+
+Real-world datasets rarely come perfectly formatted for modeling. A well-designed **data preprocessing pipeline** ensures that you apply the same transformations consistently across training and production environments.
+
+## Handling Missing Values
+
+Start by assessing the extent of missing data. Common strategies include dropping incomplete rows, filling numeric columns with the mean or median, and using the most frequent category for categorical features.
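
A minimal sketch of these imputation strategies, assuming NumPy and scikit-learn's `SimpleImputer` are available. The toy arrays and variable names are hypothetical and exist only for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two numeric columns with missing entries (np.nan).
X_num = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 30.0],
])

# Assess the extent of missing data: share of missing values per column.
missing_share = np.isnan(X_num).mean(axis=0)

# Fill numeric gaps with the column median (less sensitive to outliers than the mean).
median_imputer = SimpleImputer(strategy="median")
X_num_filled = median_imputer.fit_transform(X_num)

# Hypothetical categorical column stored as objects, with one missing entry.
X_cat = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)

# Fill categorical gaps with the most frequent category.
mode_imputer = SimpleImputer(strategy="most_frequent")
X_cat_filled = mode_imputer.fit_transform(X_cat)

print(missing_share)
print(X_num_filled)
print(X_cat_filled.ravel())
```

Dropping incomplete rows is the simpler alternative, but it discards information and is usually reasonable only when few rows are affected.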
+
+## Encoding Categorical Variables
+
+Many machine learning algorithms require numeric inputs. Techniques like **one-hot encoding** or **ordinal encoding** convert categories into numbers. Scikit-learn's `ColumnTransformer` allows you to apply different encoders to different columns in a single pipeline.
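
As an illustrative sketch, the snippet below applies one-hot encoding to one column and ordinal encoding to another through a single `ColumnTransformer`. The column names, categories, and their ordering are assumptions made up for this example, and `sparse_threshold=0` is set only so the result is a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical data: "color" has no natural order, "size" does.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": ["small", "large", "medium", "small"],
})

encoder = ColumnTransformer(
    transformers=[
        # One binary column per color; categories unseen at fit time become all zeros.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["color"]),
        # Explicit category order so that small < medium < large maps to 0 < 1 < 2.
        ("ordinal", OrdinalEncoder(categories=[["small", "medium", "large"]]), ["size"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

encoded = encoder.fit_transform(df)
print(encoded)
```

The first three output columns correspond to the colors in sorted order (blue, green, red) and the last column holds the ordinal code for size.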
+
+## Scaling and Normalization
+
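+Scaling features to a comparable range prevents variables with large magnitudes from dominating models that are sensitive to feature scale, such as distance-based methods and regularized linear models. Standardization (zero mean, unit variance) is typical for linear models, while min-max scaling maps the values seen during training to the range 0 to 1.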
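
Both scalings can be sketched in plain Python. The helper names `standardize` and `min_max_scale` and the sample values are hypothetical; in a scikit-learn workflow, `StandardScaler` and `MinMaxScaler` play the same role. The sketch assumes the input contains at least two distinct values, so the divisors are non-zero.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Shift to zero mean and rescale to unit variance (population std, ddof=0)."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature with a large magnitude.
incomes = [20_000.0, 40_000.0, 60_000.0, 80_000.0]

standardized = standardize(incomes)
rescaled = min_max_scale(incomes)

print(standardized)
print(rescaled)
```

In practice the mean, standard deviation, minimum, and maximum should be computed on the training data only and then reused for new data.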
+
+## Putting It All Together
+
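+Use scikit-learn's `Pipeline` to chain preprocessing steps with your model. Because the fitted pipeline stores the statistics learned during training (imputation values, category lists, scaling parameters), exactly the same transformations are applied when predicting on new data. Fitting the whole pipeline inside cross-validation also ensures that those statistics are learned only from each training fold, which reduces the risk of data leakage and improves reproducibility.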
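
The steps from the previous sections can be combined roughly as follows. This is a sketch under stated assumptions, not a recommended configuration: the dataset, column names, and the choice of `LogisticRegression` as the final estimator are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data with missing values in both columns.
train = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 29],
    "city": ["Lisbon", "Porto", "Lisbon", np.nan, "Porto", "Lisbon"],
})
y_train = [0, 1, 0, 1, 1, 0]

# Numeric branch: median imputation, then standardization.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical branch: most-frequent imputation, then one-hot encoding.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", categorical, ["city"]),
])

# Preprocessing and model are fitted together as one estimator.
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression()),
])
model.fit(train, y_train)

# New data goes through the same fitted transformations before prediction.
new_rows = pd.DataFrame({"age": [40, np.nan], "city": ["Porto", "Faro"]})
predictions = model.predict(new_rows)
print(predictions)
```

Here the missing age in `new_rows` is filled with the median learned from the training data, and the unseen city "Faro" is encoded as all zeros rather than raising an error.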
diff --git a/_posts/2021-10-15-decision_tree_algorithms.md b/_posts/2021-10-15-decision_tree_algorithms.md
@@ -0,0 +1,43 @@
+---
+author_profile: false
+categories:
+- Machine Learning
+classes: wide
+date: '2021-10-15'
+excerpt: Understand how decision tree algorithms split data and how pruning improves generalization.
+header:
+  image: /assets/images/data_science_7.jpg
+  og_image: /assets/images/data_science_7.jpg
+  overlay_image: /assets/images/data_science_7.jpg
+  show_overlay_excerpt: false
+  teaser: /assets/images/data_science_7.jpg
+  twitter_image: /assets/images/data_science_7.jpg
+keywords:
+- Decision trees
+- Classification
+- Tree pruning
+- Machine learning
+seo_description: Learn the mechanics of decision tree algorithms, including entropy-based splits and pruning techniques that prevent overfitting.
+seo_title: How Decision Trees Work and Why Pruning Matters
+seo_type: article
+summary: This article walks through the basics of decision tree construction and explains common pruning methods to create better models.
+tags:
+- Decision trees
+- Classification
+- Overfitting
+title: Demystifying Decision Tree Algorithms
+---
+
+Decision trees are intuitive models that recursively split data into smaller groups based on feature values. Each split aims to maximize homogeneity within branches while separating different classes.
+
+## Choosing the Best Split
+
+Metrics like **Gini impurity** and **entropy** measure how mixed the classes are in each node. The algorithm searches over possible splits and selects the one that yields the largest reduction in impurity.
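
To make the split criterion concrete, here is a small sketch in plain Python. The function names and the toy labels are hypothetical; the impurity of a split is taken as the average of the child impurities weighted by the number of samples in each child.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits of the class distribution."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


def impurity_reduction(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# Hypothetical node with five samples of each class.
parent = ["a"] * 5 + ["b"] * 5

# Candidate split 1: children are still mixed.
mixed_left, mixed_right = ["a"] * 4 + ["b"], ["a"] + ["b"] * 4
# Candidate split 2: children are pure.
pure_left, pure_right = ["a"] * 5, ["b"] * 5

print(gini(parent), entropy(parent))
print(impurity_reduction(parent, mixed_left, mixed_right))
print(impurity_reduction(parent, pure_left, pure_right))
```

The second candidate yields the larger reduction, so a greedy tree builder using this criterion would prefer it.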
+
+## Preventing Overfitting
+
+A tree grown until every leaf is pure often memorizes the training data. **Pruning** removes branches that provide little predictive power, leading to a simpler tree that generalizes better to new samples.
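
One way to try this out, assuming scikit-learn, is minimal cost-complexity pruning through the `ccp_alpha` parameter of `DecisionTreeClassifier`. The synthetic dataset and the value `ccp_alpha=0.01` below are arbitrary choices for illustration; in practice the value would normally be tuned, for example with cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy dataset: flip_y randomly flips about 10% of the labels.
X, y = make_classification(
    n_samples=500,
    n_features=10,
    n_informative=4,
    flip_y=0.1,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unpruned tree: grown until the leaves are pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pruned tree: subtrees whose impurity gain does not justify their size are removed.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print("nodes:", full_tree.tree_.node_count, "vs", pruned_tree.tree_.node_count)
print("train accuracy:", full_tree.score(X_train, y_train), "vs", pruned_tree.score(X_train, y_train))
print("test accuracy:", full_tree.score(X_test, y_test), "vs", pruned_tree.score(X_test, y_test))
```

Comparing training and test accuracy for the two trees shows how much of the unpruned tree's fit comes from memorizing noise.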
+
+## When to Use Decision Trees
+
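+Decision trees can, in principle, handle both numeric and categorical features and need little data preparation; for example, they do not require feature scaling. Some implementations, including scikit-learn's, still expect categorical features to be encoded as numbers. Trees also serve as the building blocks for powerful ensemble methods like random forests and gradient boosting.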
diff --git a/_posts/2021-11-10-model_evaluation_metrics.md b/_posts/2021-11-10-model_evaluation_metrics.md
@@ -0,0 +1,44 @@
+---
+author_profile: false
+categories:
+- Machine Learning
+classes: wide
+date: '2021-11-10'
+excerpt: Explore key metrics for evaluating classification and regression models.
+header:
+  image: /assets/images/data_science_8.jpg
+  og_image: /assets/images/data_science_8.jpg
+  overlay_image: /assets/images/data_science_8.jpg
+  show_overlay_excerpt: false
+  teaser: /assets/images/data_science_8.jpg
+  twitter_image: /assets/images/data_science_8.jpg
+keywords:
+- Model evaluation
+- Accuracy
+- Precision
+- Recall
+- Regression metrics
+seo_description: A concise overview of essential metrics like precision, recall, F1-score, and RMSE for measuring model performance.
+seo_title: Essential Metrics for Evaluating Machine Learning Models
+seo_type: article
+summary: Learn how to interpret common classification and regression metrics to choose the best model for your data.
+tags:
+- Accuracy
+- F1-score
+- RMSE
+title: A Guide to Model Evaluation Metrics
+---
+
+Choosing the right evaluation metric is critical for comparing models and selecting the best one for your problem.
+
+## Classification Metrics
+
+- **Accuracy** measures the fraction of correct predictions. It works well when classes are balanced but can be misleading with imbalanced datasets.
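+- **Precision** is the fraction of predicted positives that are actually positive, and **recall** is the fraction of actual positives that the model finds. Precision drops as false positives increase; recall drops as false negatives increase. The **F1-score** is the harmonic mean of the two, so it is high only when both are high.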
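
The classification metrics above can be computed directly from the confusion counts, as in this plain-Python sketch. The helper names and toy labels are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a single positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn


def safe_divide(numerator, denominator):
    """Assumed convention: a metric with a zero denominator is reported as 0.0."""
    return numerator / denominator if denominator else 0.0


def summarize_classifier(y_true, y_pred, positive=1):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = safe_divide(tp, tp + fp)
    recall = safe_divide(tp, tp + fn)
    return {
        "accuracy": safe_divide(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": safe_divide(2 * precision * recall, precision + recall),
    }


# Hypothetical imbalanced labels: only 2 of 10 samples are positive.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A model that always predicts the negative class.
always_negative = summarize_classifier(y_true, [0] * 10)

# A model that finds one positive and raises one false alarm.
some_model = summarize_classifier(y_true, [1, 0, 0, 0, 0, 0, 0, 0, 0, 1])

print(always_negative)
print(some_model)
```

Both models reach 80% accuracy on this data, but only precision, recall, and F1 reveal that the first one never identifies a positive case.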
+
+## Regression Metrics
+
+- **Mean Absolute Error (MAE)** is the average absolute difference between predictions and true values, expressed in the same units as the target.
+- **Root Mean Squared Error (RMSE)** penalizes larger errors more heavily, making it useful when large deviations are particularly undesirable.
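
A short plain-Python sketch shows how the two regression metrics differ. The function names and numbers are hypothetical; the two sets of predictions are constructed to have the same MAE.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the mean of the squared errors."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical targets and two sets of predictions.
y_true = [10.0, 20.0, 30.0, 40.0]
evenly_off = [15.0, 25.0, 35.0, 45.0]   # every prediction is off by 5
one_big_miss = [10.0, 20.0, 30.0, 60.0]  # three exact, one off by 20

print(mae(y_true, evenly_off), rmse(y_true, evenly_off))
print(mae(y_true, one_big_miss), rmse(y_true, one_big_miss))
```

The MAE is 5 in both cases, while the RMSE doubles from 5 to 10 for the predictions with a single large miss.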
+
+Selecting evaluation metrics that align with business goals will help you make informed decisions about which model to deploy.