Missing Values:
-Mean, Median and Mode Imputation:
df[i] = df[i].fillna(df[i].mean())
df[i] = df[i].fillna(df[i].median())
df[i] = df[i].fillna(df[i].mode()[0])  # mode() returns a Series, so take its first value
+Advantages: Simple
-Disadvantages: Can distort the distribution (shrinks variance) and ignores relationships between features
-Forward and Backward Fill (forward fill replaces a missing value with the last observed non-missing value; backward fill uses the next one):
df[i] = df[i].ffill()  # fillna(method='ffill') is deprecated in recent pandas
df[i] = df[i].bfill()
+Advantages: Simple and Intuitive, Preserves Patterns
-Disadvantages: Assumption of Closeness, Potential Inaccuracy
-Interpolation Techniques (estimate missing values from the surrounding data points):
df[i] = df[i].interpolate(method='linear')
df[i] = df[i].interpolate(method='quadratic')  # requires scipy
+Advantages: Preserves Data Relationships
-Disadvantages: More complex; its assumptions about the data may not always hold
Outliers Detection:
-Z-Score Method (detects outliers based on how far a data point is from the mean, measured in standard deviations):
import numpy as np
from scipy import stats
z_scores = np.abs(stats.zscore(df))
outliers_z = np.where(z_scores > 3)
+Pros: Simple, fast, works well with normally distributed data.
-Cons: Not reliable for skewed or non-normal distributions.
-IQR Method (Interquartile Range; any value below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR is considered an outlier):
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
outliers_iqr = ((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)))
+Pros: Robust to non-normal data, less influenced by extreme values.
-Cons: Doesn’t adapt well to very skewed distributions.
-Isolation Forest (randomly selects features and split values to build isolation trees, then computes each point's average path length; a shorter path means a more likely outlier; sketched below together with LOF):
+Pros: Works well in high dimensions, efficient.
-Cons: Requires choosing the contamination rate in advance.
-Local Outlier Factor (LOF) (a point with significantly lower local density than its neighbors is flagged as an outlier):
+Pros: Works well with clusters of varying density.
-Cons: Sensitive to choice of k (neighbors).
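A minimal sklearn sketch of both detectors (assumes a numeric DataFrame df; contamination=0.05 and n_neighbors=20 are illustrative placeholders):
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
iso = IsolationForest(contamination=0.05, random_state=42)
labels_iso = iso.fit_predict(df)  # -1 = outlier, 1 = inlier
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels_lof = lof.fit_predict(df)  # -1 = outlier; n_neighbors is the "k" above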
Normalization and Standardization:
Normalization (scales values into [0, 1] or [-1, 1]):
MinMaxScaler
+Pros: useful when we don't know the distribution
Standardization (rescales to zero mean and unit variance; not bounded to a range):
StandardScaler
+Pros: useful when the feature distribution is Normal (Gaussian)
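A short sketch of both scalers (the feature matrix X is a hypothetical placeholder):
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
X = np.array([[1.0], [5.0], [10.0]])  # hypothetical feature column
X_minmax = MinMaxScaler().fit_transform(X)  # values in [0, 1]
X_std = StandardScaler().fit_transform(X)   # mean 0, std 1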
Handling Skewed Distribution:
Logarithmic Transformation:
used when the feature is right skewed (positively skewed); requires positive values (log1p handles zeros)
Square Root Transformation:
usually used on moderately right skewed data
Box-Cox Transformation:
suitable for right skewed data, but requires strictly positive values
Yeo-Johnson Transformation:
works for both right and left skewed features,
and also handles zero or negative values, which Box-Cox cannot
Quantile Transformation:
works for both right skewed and left skewed variables
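A hedged sketch of these transforms (x is a hypothetical right-skewed, strictly positive feature):
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer
x = np.random.exponential(size=(1000, 1)) + 0.01  # hypothetical right-skewed feature
x_log = np.log1p(x)
x_boxcox = PowerTransformer(method='box-cox').fit_transform(x)  # needs x > 0
x_yj = PowerTransformer(method='yeo-johnson').fit_transform(x)
x_quant = QuantileTransformer(output_distribution='normal').fit_transform(x)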
Correlation:
Linear relationship between a feature and the target
captures only linear dependence
range: -1 ~ 1
Mutual Information:
Degree of information shared between a feature and the target
also captures non-linear dependence
range: 0 ~ inf
Feature Importance:
Which features the model actually prioritized
also captures non-linear effects
depends on the model
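A minimal sketch computing all three on toy data (the dataset and model are assumed placeholders):
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import mutual_info_regression
X, y = make_regression(n_samples=200, n_features=5, random_state=0)
X = pd.DataFrame(X, columns=[f'f{i}' for i in range(5)])
corr = X.corrwith(pd.Series(y))  # linear correlation, -1..1
mi = mutual_info_regression(X, y)  # mutual information, >= 0
imp = RandomForestRegressor(random_state=0).fit(X, y).feature_importances_  # model-dependent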
Feature Creation:
Domain-specific:
Created based on industry knowledge like business rules.
Data-driven:
Derived by recognizing patterns in data.
Synthetic:
Formed by combining existing features.
Binning:
Convert continuous variables into categories
Polynomial or interaction features (PolynomialFeatures)
Time-based features (lag, rolling mean, trends)
Geographic features (distance, region clusters from lat/lon)
Feature selection using SelectKBest, RFECV, or SHAP values
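A short sketch of binning, polynomial features, and time-based features (column names and data are hypothetical):
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
df = pd.DataFrame({'age': [5, 23, 47, 61], 'income': [10, 40, 80, 55]})
df['age_bin'] = pd.cut(df['age'], bins=[0, 18, 40, 65], labels=['young', 'adult', 'senior'])  # binning
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(df[['age', 'income']])  # interaction features
s = pd.Series([1, 2, 3, 4, 5])  # hypothetical time series
lag1 = s.shift(1)  # lag feature
roll3 = s.rolling(window=3).mean()  # rolling mean feature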
Feature Extraction:
PCA:
Projects correlated features onto a smaller set of uncorrelated axes (principal components)
Aggregation:
Group and take average/sum
Combination:
Addition/Subtraction/Ratio
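A minimal PCA sketch (X is a hypothetical feature matrix; PCA is scale-sensitive, so standardize first):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X = np.random.rand(100, 5)  # hypothetical feature matrix
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # variance captured per component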
Keras:
1. Define model
2. Compile model
3. Fit model
4. Evaluate model
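A minimal end-to-end sketch of the four steps (architecture, data, and hyperparameters are hypothetical placeholders):
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
X = np.random.rand(500, 10)  # hypothetical features
y = np.random.randint(0, 2, size=(500,))  # hypothetical binary target
model = keras.Sequential([  # 1. define
    layers.Input(shape=(10,)),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])  # 2. compile
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)  # 3. fit
loss, acc = model.evaluate(X, y, verbose=0)  # 4. evaluate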
Evaluation:
Train Loss:
How well the model learned the training data
Validation Loss:
How well the model can generalize to new, unseen data
High training error and high test error -> Underfitting
Low training error and higher test error -> Overfitting
Neural Network:
-layer:
input layer: receives the raw features
hidden layer: extracts patterns and relationships from the data
activation function: determines how the neuron output is transformed before passing to the next layer;
helps the model learn non-linear relationships
sigmoid:(0, 1)
Often used in binary classification.
tanh:(-1, 1)
ReLU:max(0,x)
Most common activation in deep networks.
output layer:
Regression → 1 neuron (linear activation)
Binary classification → 1 neuron (sigmoid)
Multi-class classification → N neurons (softmax)
-hyperparameters:
number of epochs: number of full passes over the training data; too many can overfit
batch size: Number of samples processed before updating weights.
very large -> smoother gradients, but can generalize worse
very small -> noisy updates, less stable training
-loss function: Measures how far the model’s predictions are from the true values.
The model tries to minimize this value.
Regression -> MSE or MAE
Binary classification -> Binary Cross-Entropy
Multiclass classification -> Categorical Cross-Entropy
Segmentation -> Dice Loss, Cross-Entropy
Sequence tasks -> CTC Loss
-weight: Each neuron connection has a weight that scales input features.
During training, the optimizer updates these weights to minimize loss.
initializers: set the starting weight values (e.g. glorot_uniform, he_normal)
regularizers: add weight penalties to the loss (e.g. l1, l2)
constraints: restrict weight values during training (e.g. max_norm)
-optimizer: Algorithm that updates weights based on the loss function.
SGD (Stochastic Gradient Descent)
Adam: most widely used; adaptive learning rate.
RMSprop: good for recurrent networks.
-metrics: Used to measure performance of the model (not to train it).
Accuracy → classification
MAE / RMSE → regression
Precision, Recall, F1-score → classification quality
-early stopping: stops training automatically when validation loss stops improving.
→ Prevents overfitting and saves time.
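A one-line Keras sketch (the patience value is an assumed placeholder):
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
# pass it via model.fit(..., callbacks=[early_stop])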
-regularization: Methods to prevent overfitting by adding constraints.
L1 regularization (Lasso) → adds absolute value of weights.
L2 regularization (Ridge) → adds square of weights.
Penalizes large weights to encourage simpler models.
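A short Keras sketch of both penalties (the 0.01 factors are assumed placeholders):
from tensorflow.keras import layers, regularizers
dense_l1 = layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l1(0.01))  # Lasso-style
dense_l2 = layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.01))  # Ridge-style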
-memorization capacity: how much data the model can “memorize”.
Too high → overfitting
-dropout: Randomly turns off (sets to 0) some neurons during training.
Helps prevent overfitting by forcing the network to learn robust features.
-batch normalization: Normalizes activations in each mini-batch to stabilize learning.
Helps speed up training and reduce sensitivity to initialization.
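A short sketch placing Dropout and BatchNormalization in a layer stack (sizes and rate are assumed placeholders):
from tensorflow import keras
from tensorflow.keras import layers
model = keras.Sequential([
    layers.Input(shape=(10,)),
    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),  # normalizes activations per mini-batch
    layers.Dropout(0.3),  # randomly zeroes 30% of units during training
    layers.Dense(1, activation='sigmoid'),
])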
Make Preprocessing Reproducible with Pipelines:
Pipeline / ColumnTransformer from scikit-learn
interpolate, normalize, PowerTransformer
FunctionTransformer / TransformerMixin
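A hedged sketch of a reproducible preprocessing pipeline (column names are hypothetical):
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, PowerTransformer
numeric_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('log', FunctionTransformer(np.log1p)),  # example custom step
    ('power', PowerTransformer(method='yeo-johnson')),  # also standardizes by default
])
preprocess = ColumnTransformer([
    ('num', numeric_pipe, ['age', 'income']),  # hypothetical columns
])
# fit on training data only, then reuse: preprocess.fit_transform(X_train); preprocess.transform(X_test)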
Model Benchmarking & Comparison:
Benchmark neural nets vs.:
RandomForest / XGBoost / LightGBM
Linear or ElasticNet baseline
Better Evaluation & Interpretability:
Residual plots and error histograms
SHAP values for feature importance
Learning curves and validation curves
Cross-validation with confidence intervals
MLOps & Reproducibility:
Use MLflow to track experiments and model versions
Build Streamlit apps for stakeholders