Before wrapping up, let's discuss some important pitfalls to avoid when working on classification tasks:

1. **Data Leakage**: Always split your data before any preprocessing or feature engineering

   ```python
   # Wrong: preprocessing before the split leaks test-set statistics into training
   X_scaled = preprocessing.scale(X)
   X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y)

   # Correct: split first, then fit the scaler on the training set only
   X_tr, X_te, y_tr, y_te = train_test_split(X, y)
   scaler = preprocessing.StandardScaler()
   X_tr_scaled = scaler.fit_transform(X_tr)
   X_te_scaled = scaler.transform(X_te)  # reuse training statistics
   ```

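   An easy way to enforce this, sketched below with a stand-in `SVC` classifier, is to wrap preprocessing and model in a scikit-learn `Pipeline`, so the scaler is refit on the training portion of every split automatically:

   ```python
   # Sketch: a Pipeline keeps preprocessing inside the training loop,
   # so cross-validation never sees test-fold statistics.
   from sklearn.pipeline import make_pipeline
   from sklearn.preprocessing import StandardScaler
   from sklearn.svm import SVC  # stand-in; substitute your own classifier

   pipe = make_pipeline(StandardScaler(), SVC())
   pipe.fit(X_tr, y_tr)           # scaler fit on training data only
   print(pipe.score(X_te, y_te))  # test data transformed with training statistics
   ```
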
2. **Class Imbalance**: Always check your class distribution

   ```python
   # Using pandas for better visualization
   import pandas as pd

   # Show relative class frequencies
   class_dist = pd.Series(y).value_counts(normalize=True)
   print("Class distribution (%):")
   print(class_dist.mul(100).round(2))

   # Visualize the distribution
   class_dist.plot(kind='bar')
   plt.title('Class Distribution')
   plt.xlabel('Class')
   plt.ylabel('Frequency (%)')
   ```

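   If the check reveals a skewed distribution, one low-effort mitigation (a sketch, assuming a scikit-learn classifier) is to weight classes inversely to their frequency:

   ```python
   # Sketch: reweight classes so minority-class errors cost more during training
   from sklearn.ensemble import RandomForestClassifier

   rf_balanced = RandomForestClassifier(class_weight='balanced', random_state=42)
   rf_balanced.fit(X_tr, y_tr)
   ```
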
3. **Overfitting**: Monitor these warning signs
   - Large gap between training and validation scores
   - Perfect training accuracy (like we saw with RandomForest)
   - Poor generalization to new data

   ```python
   # Use cross-validation for robust estimates
   from sklearn.model_selection import cross_val_score

   scores = cross_val_score(clf, X_tr, y_tr, cv=5)
   print(f"CV Scores: {scores}")
   print(f"Mean: {scores.mean():.3f} (±{scores.std() * 2:.3f})")
   ```

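   To watch the first warning sign directly, `cross_validate` can report training scores alongside validation scores; a small sketch reusing the same `clf`:

   ```python
   # Sketch: a large train/validation gap is the classic overfitting signal
   from sklearn.model_selection import cross_validate

   cv_res = cross_validate(clf, X_tr, y_tr, cv=5, return_train_score=True)
   train_mean = cv_res['train_score'].mean()
   valid_mean = cv_res['test_score'].mean()
   print(f"Train: {train_mean:.3f}  Validation: {valid_mean:.3f}  "
         f"Gap: {train_mean - valid_mean:.3f}")
   ```
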
4. **Memory Management**: For large datasets, consider these approaches

   ```python
   # Use the n_jobs parameter for parallel processing
   rf = RandomForestClassifier(n_jobs=-1)  # Use all available cores

   # Or subsample the training set for each tree to reduce memory use
   rf = RandomForestClassifier(max_samples=0.8)  # Use 80% of samples per tree
   ```

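   When the data doesn't fit in memory at all, an out-of-core learner is another option. Here is a hedged sketch using `SGDClassifier.partial_fit` to train on one chunk at a time:

   ```python
   # Sketch: out-of-core learning, updating the model one chunk at a time
   import numpy as np
   from sklearn.linear_model import SGDClassifier

   sgd = SGDClassifier()
   classes = np.unique(y_tr)  # partial_fit needs the full label set up front
   for X_chunk, y_chunk in zip(np.array_split(X_tr, 10), np.array_split(y_tr, 10)):
       sgd.partial_fit(X_chunk, y_chunk, classes=classes)
   ```
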
5. **Feature Scaling**: Different algorithms have different scaling requirements

   ```python
   # SVMs require scaling; random forests don't
   from sklearn.preprocessing import StandardScaler

   # For SVM
   scaler = StandardScaler()
   X_tr_scaled = scaler.fit_transform(X_tr)
   X_te_scaled = scaler.transform(X_te)

   # Random forests can handle unscaled data
   rf.fit(X_tr, y_tr)  # No scaling needed
   ```

6. **Model Selection Bias**: Don't use the test set for model selection

   ```python
   # Wrong: using the test set for parameter tuning
   for param in parameters:
       clf.set_params(**param)
       score = clf.fit(X_tr, y_tr).score(X_te, y_te)  # Don't do this!

   # Correct: tune with cross-validation on the training set
   from sklearn.model_selection import GridSearchCV

   grid = GridSearchCV(clf, parameters, cv=5)
   grid.fit(X_tr, y_tr)
   # Only use the test set for the final evaluation
   ```

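   Once the search has picked a configuration, the test set is touched exactly once; a sketch continuing from the `grid` object above:

   ```python
   # The held-out test set is used once, after model selection is complete
   print("Best parameters:", grid.best_params_)
   print("Final test accuracy:", grid.score(X_te, y_te))  # scores the best estimator
   ```
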
7. **Model Troubleshooting Tips**

   ```python
   # Check for data issues first
   print("Missing values:", X.isnull().sum().sum())
   print("Infinite values:", np.isinf(X.values).sum())

   # Verify predictions are valid
   y_pred = clf.predict(X_te)
   if len(np.unique(y_pred)) == 1:
       print("Warning: model is predicting a single class!")

   # Check that predicted probabilities are well-formed
   y_prob = clf.predict_proba(X_te)
   if np.any(y_prob > 1.0) or np.any(y_prob < 0.0):
       print("Warning: invalid probability predictions!")
   ```

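   The bounds check above only catches malformed outputs. To probe calibration itself, `calibration_curve` compares predicted probabilities with observed frequencies; a one-vs-rest sketch for a single digit, assuming integer class labels:

   ```python
   # Sketch: calibration check for one class (digit 3) in a one-vs-rest view
   from sklearn.calibration import calibration_curve

   prob_pos = clf.predict_proba(X_te)[:, 3]  # column for digit 3, if labels are 0-9
   frac_pos, mean_pred = calibration_curve(y_te == 3, prob_pos, n_bins=10)
   print("Predicted vs. observed:", list(zip(mean_pred.round(2), frac_pos.round(2))))
   ```
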
8. **Common Error Messages and Solutions**
   - `ValueError: Input contains NaN`: Clean your data before training (see the imputation sketch below)
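     One way to do that cleaning, sketched here with scikit-learn's `SimpleImputer`, is to fill missing values with a per-column statistic learned from the training set:

     ```python
     # Sketch: impute missing values before training; the mean is one common choice
     from sklearn.impute import SimpleImputer

     imputer = SimpleImputer(strategy='mean')
     X_tr_clean = imputer.fit_transform(X_tr)  # fit on training data only
     X_te_clean = imputer.transform(X_te)      # reuse training statistics
     ```
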

In Part 2, we'll explore how neural networks approach the same problem using TensorFlow, introducing deep learning concepts and comparing the two approaches.

[Next: Deep Learning Fundamentals →]({{ site.baseurl }}/blog/2023/02_tensorflow_simple)