Commit c569820

Update 02-regression.md
1 parent 7b4e13f commit c569820

1 file changed

Lines changed: 6 additions & 5 deletions

episodes/02-regression.md

@@ -414,7 +414,7 @@ dataset = load_dataset('penguins')
 dataset = dataset.dropna(subset=['body_mass_g', 'bill_depth_mm', 'species'])

 # Define predictors and target
-X = dataset[['body_mass_g', 'species']]
+X = dataset[['body_mass_g', 'species']] # conventionally, we use capital X when there are multiple predictors
 y = dataset['bill_depth_mm']
 ```

@@ -423,20 +423,21 @@ Since the species column is coded as a string, we need to convert it into a nume
 By default, we drop the first category to avoid multicollinearity; the omitted category then serves as the reference group when interpreting model coefficients.
 ```python
 # One-hot encode species (drop_first avoids multicollinearity)
-X = pd.get_dummies(X, columns=['species'], drop_first=True)
+X_dummies = pd.get_dummies(X, columns=['species'], drop_first=True)
+X_dummies
 ```

 We can then train/fit and evaluate our model as usual.
 ```python
 # Train/test split
-x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
+X_train, X_test, y_train, y_test = train_test_split(X_dummies, y, test_size=0.2, random_state=0)

 # Fit a linear regression model
 model = LinearRegression()
-model.fit(x_train, y_train)
+model.fit(X_train, y_train)

 # Predict and evaluate
-y_pred = model.predict(x_test)
+y_pred = model.predict(X_test)
 rmse = mean_squared_error(y_test, y_pred) ** 0.5  # mean_squared_error returns MSE; take the square root for RMSE
 print(f"RMSE with species as a predictor: {rmse:.2f}")

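The reference-group point made in the second hunk can be checked on a small example. The sketch below is only illustrative: the four-row `toy` frame is hypothetical data made up here, not rows from the penguins dataset, and it assumes `pandas` is installed. It compares a full one-hot encoding with the `drop_first=True` encoding used in the lesson.

```python
import pandas as pd

# Hypothetical toy data standing in for the two predictor columns in the lesson
toy = pd.DataFrame({
    "body_mass_g": [3750, 5000, 3800, 4200],
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
})

# Full one-hot encoding: one indicator column per species.
# The indicators in each row sum to 1, so together with an intercept
# they are perfectly collinear.
full = pd.get_dummies(toy, columns=["species"])

# drop_first=True omits the first category in sorted order ("Adelie"),
# which then acts as the reference group for the remaining indicators.
reduced = pd.get_dummies(toy, columns=["species"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

In the reduced frame an Adelie row has zeros in both remaining indicator columns, so the coefficients fitted for the other species would be read as offsets relative to Adelie.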