Missing Values:
-Mean, Median and Mode Imputation:
df[i] = df[i].fillna(df[i].mean())
df[i] = df[i].fillna(df[i].median())
df[i] = df[i].fillna(df[i].mode()[0])  # mode() returns a Series, so take its first value
+Advantages: Simple
-Disadvantages: Can distort the distribution (shrinks variance) and ignores relationships between features
-Forward and Backward Fill (forward fill replaces a missing value with the last observed non-missing value; backward fill uses the next one):
df[i] = df[i].ffill()  # fillna(method='ffill') is deprecated in recent pandas
df[i] = df[i].bfill()
+Advantages: Simple and Intuitive, Preserves Patterns
-Disadvantages: Assumption of Closeness, Potential Inaccuracy
-Interpolation Techniques (estimate missing values from the surrounding data points):
df[i] = df[i].interpolate(method='linear')
df[i] = df[i].interpolate(method='quadratic')  # requires scipy
+Advantages: Preserves Data Relationships
-Disadvantages: More complex; its assumptions about the data may not always hold
Outliers Detection:
-Z-Score Method (detects outliers based on how far a data point is from the mean, measured in standard deviations):
import numpy as np
from scipy import stats
z_scores = np.abs(stats.zscore(df))
outliers_z = np.where(z_scores > 3)
+Pros: Simple, fast, works well with normally distributed data.
-Cons: Not reliable for skewed or non-normal distributions.
-IQR Method (Interquartile Range; any value below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR is considered an outlier):
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
outliers_iqr = ((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)))
+Pros: Robust to non-normal data, less influenced by extreme values.
-Cons: Doesn’t adapt well to very skewed distributions.
-Isolation Forest (randomly selects features and split values to build isolation trees, then computes each point's average path length; a shorter path means a more likely outlier; sketched below together with LOF):
+Pros: Works well in high dimensions, efficient.
-Cons: Requires choosing the contamination rate in advance.
-Local Outlier Factor (LOF) (a point with significantly lower local density than its neighbors is flagged as an outlier):
+Pros: Works well with clusters of varying density.
-Cons: Sensitive to choice of k (neighbors).
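A minimal sklearn sketch of both detectors (assumes a numeric DataFrame df; contamination=0.05 and n_neighbors=20 are illustrative placeholders):
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
iso = IsolationForest(contamination=0.05, random_state=42)
labels_iso = iso.fit_predict(df)  # -1 = outlier, 1 = inlier
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels_lof = lof.fit_predict(df)  # -1 = outlier; n_neighbors is the "k" above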
Normalization and Standardization:
Normalization (scales values into [0, 1] or [-1, 1]):
MinMaxScaler
+Pros: useful when we don't know the distribution
Standardization (rescales to zero mean and unit variance; not bounded to a range):
StandardScaler
+Pros: useful when the feature distribution is Normal (Gaussian)
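A short sketch of both scalers (the feature matrix X is a hypothetical placeholder):
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
X = np.array([[1.0], [5.0], [10.0]])  # hypothetical feature column
X_minmax = MinMaxScaler().fit_transform(X)  # values in [0, 1]
X_std = StandardScaler().fit_transform(X)   # mean 0, std 1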
Handling Skewed Distribution:
Logarithmic Transformation:
used when the feature is right skewed (positively skewed); requires positive values (log1p handles zeros)
Square Root Transformation:
usually used on moderately right skewed data
Box-Cox Transformation:
suitable for right skewed data, but requires strictly positive values
Yeo-Johnson Transformation:
works for both right and left skewed features,
and also handles zero or negative values, which Box-Cox cannot
Quantile Transformation:
works for both right skewed and left skewed variables
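A hedged sketch of these transforms (x is a hypothetical right-skewed, strictly positive feature):
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer
x = np.random.exponential(size=(1000, 1)) + 0.01  # hypothetical right-skewed feature
x_log = np.log1p(x)
x_boxcox = PowerTransformer(method='box-cox').fit_transform(x)  # needs x > 0
x_yj = PowerTransformer(method='yeo-johnson').fit_transform(x)
x_quant = QuantileTransformer(output_distribution='normal').fit_transform(x)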
Correlation:
Linear relationship between a feature and the target
captures only linear dependence
range: -1 ~ 1
Mutual Information:
Degree of information shared between a feature and the target
also captures non-linear dependence
range: 0 ~ inf
Feature Importance:
Which features the model actually prioritized
also captures non-linear effects
depends on the model
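A minimal sketch computing all three on toy data (the dataset and model are assumed placeholders):
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import mutual_info_regression
X, y = make_regression(n_samples=200, n_features=5, random_state=0)
X = pd.DataFrame(X, columns=[f'f{i}' for i in range(5)])
corr = X.corrwith(pd.Series(y))  # linear correlation, -1..1
mi = mutual_info_regression(X, y)  # mutual information, >= 0
imp = RandomForestRegressor(random_state=0).fit(X, y).feature_importances_  # model-dependent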
Feature Creation:
Domain-specific:
Created based on industry knowledge like business rules.
Data-driven:
Derived by recognizing patterns in data.
Synthetic:
Formed by combining existing features.
Binning:
Convert continuous variables into categories
Polynomial or interaction features (PolynomialFeatures)
Time-based features (lag, rolling mean, trends)
Geographic features (distance, region clusters from lat/lon)
Feature selection using SelectKBest, RFECV, or SHAP values
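A short sketch of binning, polynomial features, and time-based features (column names and data are hypothetical):
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
df = pd.DataFrame({'age': [5, 23, 47, 61], 'income': [10, 40, 80, 55]})
df['age_bin'] = pd.cut(df['age'], bins=[0, 18, 40, 65], labels=['young', 'adult', 'senior'])  # binning
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(df[['age', 'income']])  # interaction features
s = pd.Series([1, 2, 3, 4, 5])  # hypothetical time series
lag1 = s.shift(1)  # lag feature
roll3 = s.rolling(window=3).mean()  # rolling mean feature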
Feature Extraction:
PCA:
Projects correlated features onto a smaller set of uncorrelated axes (principal components)
Aggregation:
Group and take average/sum
Combination:
Addition/Subtraction/Ratio
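A minimal PCA sketch (X is a hypothetical feature matrix; PCA is scale-sensitive, so standardize first):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X = np.random.rand(100, 5)  # hypothetical feature matrix
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # variance captured per component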
Keras:
1. Define model
2. Compile model
3. Fit model
4. Evaluate model
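A minimal end-to-end sketch of the four steps (architecture, data, and hyperparameters are hypothetical placeholders):
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
X = np.random.rand(500, 10)  # hypothetical features
y = np.random.randint(0, 2, size=(500,))  # hypothetical binary target
model = keras.Sequential([  # 1. define
    layers.Input(shape=(10,)),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])  # 2. compile
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)  # 3. fit
loss, acc = model.evaluate(X, y, verbose=0)  # 4. evaluate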
Evaluation:
Train Loss:
How well the model learned the training data
Validation Loss:
How well the model can generalize to new, unseen data
High training error and high test error -> Underfitting
Low training error and higher test error -> Overfitting
Neural Network:
-layer:
input layer: receives the raw features
hidden layer: extracts patterns and relationships from the data
activation function: determines how the neuron output is transformed before passing to the next layer;
helps the model learn non-linear relationships
sigmoid:(0, 1)
Often used in binary classification.
tanh:(-1, 1)
ReLU:max(0,x)
Most common activation in deep networks.
output layer:
Regression → 1 neuron (linear activation)
Binary classification → 1 neuron (sigmoid)
Multi-class classification → N neurons (softmax)
-hyperparameters:
number of epochs: number of full passes over the training data; too many can overfit
batch size: Number of samples processed before updating weights.
very large -> smoother gradients, but can generalize worse
very small -> noisy updates, less stable training
-loss function: Measures how far the model’s predictions are from the true values.
The model tries to minimize this value.
Regression -> MSE or MAE
Binary classification -> Binary Cross-Entropy
Multiclass classification -> Categorical Cross-Entropy
Segmentation -> Dice Loss, Cross-Entropy
Sequence tasks -> CTC Loss
-weight: Each neuron connection has a weight that scales input features.
During training, the optimizer updates these weights to minimize loss.
initializers: set the starting weight values (e.g. glorot_uniform, he_normal)
regularizers: add weight penalties to the loss (e.g. l1, l2)
constraints: restrict weight values during training (e.g. max_norm)
-optimizer: Algorithm that updates weights based on the loss function.
SGD (Stochastic Gradient Descent)
Adam: most widely used; adaptive learning rate.
RMSprop: good for recurrent networks.
-metrics: Used to measure performance of the model (not to train it).
Accuracy → classification
MAE / RMSE → regression
Precision, Recall, F1-score → classification quality
-early stopping: stops training automatically when validation loss stops improving.
→ Prevents overfitting and saves time.
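A one-line Keras sketch (the patience value is an assumed placeholder):
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
# pass it via model.fit(..., callbacks=[early_stop])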
-regularization: Methods to prevent overfitting by adding constraints.
L1 regularization (Lasso) → adds absolute value of weights.
L2 regularization (Ridge) → adds square of weights.
Penalizes large weights to encourage simpler models.
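A short Keras sketch of both penalties (the 0.01 factors are assumed placeholders):
from tensorflow.keras import layers, regularizers
dense_l1 = layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l1(0.01))  # Lasso-style
dense_l2 = layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.01))  # Ridge-style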
-memorization capacity: how much data the model can “memorize”.
Too high → overfitting
-dropout: Randomly turns off (sets to 0) some neurons during training.
Helps prevent overfitting by forcing the network to learn robust features.
-batch normalization: Normalizes activations in each mini-batch to stabilize learning.
Helps speed up training and reduce sensitivity to initialization.
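A short sketch placing Dropout and BatchNormalization in a layer stack (sizes and rate are assumed placeholders):
from tensorflow import keras
from tensorflow.keras import layers
model = keras.Sequential([
    layers.Input(shape=(10,)),
    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),  # normalizes activations per mini-batch
    layers.Dropout(0.3),  # randomly zeroes 30% of units during training
    layers.Dense(1, activation='sigmoid'),
])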
Make Preprocessing Reproducible with Pipelines:
Pipeline / ColumnTransformer from scikit-learn
interpolate, normalize, PowerTransformer
FunctionTransformer / TransformerMixin
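A hedged sketch of a reproducible preprocessing pipeline (column names are hypothetical):
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, PowerTransformer
numeric_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('log', FunctionTransformer(np.log1p)),  # example custom step
    ('power', PowerTransformer(method='yeo-johnson')),  # also standardizes by default
])
preprocess = ColumnTransformer([
    ('num', numeric_pipe, ['age', 'income']),  # hypothetical columns
])
# fit on training data only, then reuse: preprocess.fit_transform(X_train); preprocess.transform(X_test)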
Model Benchmarking & Comparison:
Benchmark neural nets vs.:
RandomForest / XGBoost / LightGBM
Linear or ElasticNet baseline
Better Evaluation & Interpretability:
Residual plots and error histograms
SHAP values for feature importance
Learning curves and validation curves
Cross-validation with confidence intervals
MLOps & Reproducibility:
Use MLflow to track experiments and model versions
Build Streamlit apps for stakeholders