
Commit b12d8f6

M.Notter committed: Figure updates
1 parent 61f42e4 commit b12d8f6

6 files changed (+224, -218 lines)

_posts/2023-10-23-01_scikit_simple.md

Lines changed: 80 additions & 74 deletions
@@ -463,100 +463,106 @@ plt.close()

Before wrapping up, let's discuss some important pitfalls to avoid when working on classification tasks:

1. **Data Leakage**: Always split your data before any preprocessing or feature engineering

```python
# Wrong: Preprocessing before split
X_scaled = preprocessing.scale(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y)

# Correct: Split first, then preprocess
X_tr, X_te, y_tr, y_te = train_test_split(X, y)
X_tr_scaled = preprocessing.scale(X_tr)
X_te_scaled = preprocessing.scale(X_te)  # better still: fit one scaler on X_tr and reuse it on X_te (see pitfall 5)
```
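
A `Pipeline` makes this mistake hard to commit in the first place, because the scaler is re-fit on the training portion of every split. A minimal sketch, assuming the `X_tr`/`y_tr` variables from above:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# The scaler is fit on the training data of each split only,
# so no test-set statistics can leak into training
pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_tr, y_tr)
print(pipe.score(X_te, y_te))
```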

2. **Class Imbalance**: Always check your class distribution

```python
# Using pandas for better visualization
import pandas as pd

# Show relative frequencies
class_dist = pd.Series(y).value_counts(normalize=True)
print("Class distribution (%):")
print(class_dist.mul(100).round(2))

# Visualize distribution
class_dist.plot(kind='bar')
plt.title('Class Distribution')
plt.xlabel('Class')
plt.ylabel('Frequency (%)')
```
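
If the distribution turns out to be skewed, one common remedy is to reweight the classes during training. A minimal sketch (the `class_weight` option is available on most scikit-learn classifiers):

```python
from sklearn.ensemble import RandomForestClassifier

# 'balanced' weights each class inversely to its frequency
rf = RandomForestClassifier(class_weight='balanced')
rf.fit(X_tr, y_tr)
```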

3. **Overfitting**: Monitor these warning signs
   - Large gap between training and validation scores
   - Perfect training accuracy (like we saw with RandomForest)
   - Poor generalization to new data

```python
# Use cross-validation for robust estimates
from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, X_tr, y_tr, cv=5)
print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")
```
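
To see the train-validation gap directly, `cross_validate` can return training scores alongside the validation scores. A short sketch using the same `clf`:

```python
from sklearn.model_selection import cross_validate

# return_train_score=True exposes the gap between training and validation
res = cross_validate(clf, X_tr, y_tr, cv=5, return_train_score=True)
gap = res['train_score'].mean() - res['test_score'].mean()
print(f"Train-validation gap: {gap:.3f}")  # a large gap suggests overfitting
```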

4. **Memory Management**: For large datasets, consider these approaches

```python
# Use the n_jobs parameter for parallel processing
rf = RandomForestClassifier(n_jobs=-1)  # Use all available cores

# Or subsample the rows used to build each tree
rf = RandomForestClassifier(max_samples=0.8)  # Use 80% of samples per tree
```
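
If the data doesn't fit into memory at all, estimators with a `partial_fit` method can learn incrementally from chunks. A sketch with `SGDClassifier` (a different model than the ones used in this post, so treat it as an illustration only):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Out-of-core learning: feed the data in chunks
clf = SGDClassifier()
classes = np.unique(y_tr)  # partial_fit needs all classes up front
for X_chunk, y_chunk in zip(np.array_split(X_tr, 10), np.array_split(y_tr, 10)):
    clf.partial_fit(X_chunk, y_chunk, classes=classes)
```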

5. **Feature Scaling**: Different algorithms have different scaling requirements

```python
# SVM requires scaling, Random Forests don't
from sklearn.preprocessing import StandardScaler

# For SVM
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# Random Forests can handle unscaled data
rf.fit(X_tr, y_tr)  # No scaling needed
```

6. **Model Selection Bias**: Don't use the test set for model selection

```python
# Wrong: Using the test set for parameter tuning
for param in parameters:
    clf.set_params(**param)
    score = clf.fit(X_tr, y_tr).score(X_te, y_te)  # Don't do this!

# Correct: Use cross-validation
grid = GridSearchCV(clf, parameters, cv=5)
grid.fit(X_tr, y_tr)
# Only use the test set for final evaluation
```
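
After the search finishes, the tuned model and its parameters live on the `grid` object, and the test set gets touched exactly once:

```python
# Inspect the winning configuration from cross-validation
print(grid.best_params_, grid.best_score_)

# One final, honest evaluation on the held-out test set
print(grid.score(X_te, y_te))
```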

7. **Model Troubleshooting Tips**

```python
# Check for data issues first (assumes X is a pandas DataFrame)
print("Missing values:", X.isnull().sum().sum())
print("Infinite values:", np.isinf(X.values).sum())

# Verify predictions are valid
y_pred = clf.predict(X_te)
if len(np.unique(y_pred)) == 1:
    print("Warning: Model predicting single class!")

# Check that predicted probabilities are valid
y_prob = clf.predict_proba(X_te)
if np.any(y_prob > 1.0) or np.any(y_prob < 0.0):
    print("Warning: Invalid probability predictions!")
```
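
The range check above only catches invalid values; to assess actual calibration (do predicted probabilities match observed frequencies?), scikit-learn offers `calibration_curve`. It works on binary labels, so for our 10-class problem you would check one class at a time, for example class 0:

```python
from sklearn.calibration import calibration_curve

# One-vs-rest calibration check for class 0
prob_true, prob_pred = calibration_curve(y_te == 0, y_prob[:, 0], n_bins=10)
print(np.abs(prob_true - prob_pred).mean())  # rough mean calibration error
```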

8. **Common Error Messages and Solutions**
   - `ValueError: Input contains NaN`: Clean your data before training (see the sketch below)
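
For the `Input contains NaN` error specifically, a common fix is to impute missing values, fitting the imputer on the training split only. A minimal sketch:

```python
from sklearn.impute import SimpleImputer

# Replace NaNs with the per-feature median, learned from the training data
imp = SimpleImputer(strategy='median')
X_tr_clean = imp.fit_transform(X_tr)
X_te_clean = imp.transform(X_te)
```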
@@ -585,4 +591,4 @@ In the next post, we'll tackle the same MNIST classification problem using Tenso

In Part 2, we'll explore how neural networks approach the same problem using TensorFlow, introducing deep learning concepts and comparing the two approaches.

[Next: Deep Learning Fundamentals]({{ site.baseurl }}/blog/2023/02_tensorflow_simple)

_posts/2023-10-23-02_tensorflow_simple.md

Lines changed: 29 additions & 29 deletions
@@ -370,53 +370,53 @@ When starting with TensorFlow and neural networks, watch out for these common issues:
   - Check for missing or invalid values
   - Ensure consistent data types

```python
# Example of proper data preparation
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
```
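
A couple of quick sanity checks (not in the original snippet, just a suggestion) can confirm that the preparation worked:

```python
import numpy as np

# Verify there are no invalid values and the scaling did what we expect
assert not np.isnan(x_train).any(), "NaNs in training data"
print(x_train.dtype, x_train.min(), x_train.max())  # expect: float32 0.0 1.0
```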

2. **Model Architecture**
   - Start simple, add complexity only if needed
   - Match output layer to your task (softmax for classification)
   - Use appropriate layer sizes

```python
# Example of clear, progressive architecture
model = keras.Sequential([
    layers.Conv2D(32, kernel_size=(3, 3), activation='relu'),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation='softmax')  # 10 classes
])
```
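
To double-check the layer sizes, you can build the model with an explicit input shape and print a summary; the shape below assumes the 28x28 grayscale MNIST images used in this post:

```python
# Build with a concrete input shape, then inspect parameters per layer
model.build(input_shape=(None, 28, 28, 1))
model.summary()
```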

3. **Training Issues**
   - Monitor training metrics (loss not decreasing)
   - Watch for overfitting (validation loss increasing)
   - Use appropriate batch sizes

```python
# Add validation monitoring during training
history = model.fit(
    x_train, y_train,
    validation_split=0.1,
    batch_size=128,
    epochs=10
)
```
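
Rather than guessing the right number of epochs, you can let an `EarlyStopping` callback halt training once the validation loss stops improving. A sketch:

```python
from tensorflow import keras

# Stop when val_loss hasn't improved for 3 epochs; keep the best weights
early_stop = keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=3, restore_best_weights=True)

history = model.fit(
    x_train, y_train,
    validation_split=0.1,
    batch_size=128,
    epochs=50,  # upper bound; early stopping usually ends sooner
    callbacks=[early_stop]
)
```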

4. **Memory Management**
   - Clear unnecessary variables
   - Use appropriate data types
   - Watch batch sizes on limited hardware

```python
# Free memory after training
import gc
gc.collect()
keras.backend.clear_session()
```
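
On limited hardware it can also help to feed the data through `tf.data`, so that only one batch at a time is staged on the device. A minimal sketch:

```python
import tensorflow as tf

# Stream the training data in batches of 128
ds = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(128)
model.fit(ds, epochs=10)
```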

## Summary and Next Steps
