In this homework, we’ll work with categorical variables, build our first machine learning models (Decision Trees), and perform hyperparameter tuning.
💡 Your goal is to read through the provided code and make slight extensions. Specifically:
- Add more columns to the dataframe.
- Define new “handcrafted” rules to predict growth.
- Analyze whether the ML model produces unique predictions (i.e., cases where only the ML model is correct compared to other “hand” rules).
- Tune the Decision Tree by optimizing its complexity (i.e., the depth hyperparameter).
Please use the Colab Module 3 for all tasks to ensure you have the same dataframe used for the Modeling part, as covered during the lecture.
HINT: If you want to avoid data truncation in GitHub's UI, try either of the following options:
- Open the notebook in Colab, using the GitHub link to the notebook.
- Clone the repository to a local folder and open the notebook in Jupyter Notebook.
What is the ABSOLUTE CORRELATION VALUE of the most correlated dummy variable <month>_w<week_of_month> with the binary outcome is_positive_growth_30d_future?
From the correlation analysis and modeling, you may have observed that October and November are potentially important seasonal months. In this task, you'll go further by generating dummy variables for both the Month and Week-of-Month (starting from 1). For example, the first week of October should be coded as: 'October_w1'.
Once you've generated these new variables, identify the one with the highest absolute correlation with is_positive_growth_30d_future, and round the result to three decimal places.
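As a small illustration of the naming scheme, the sketch below builds the label for a single date with the (d.day - 1) // 7 + 1 week-of-month formula used in this task. The helper name month_wom_label is hypothetical, and the month name comes from strftime('%B'), which is assumed to run under an English (default) locale.

```python
from datetime import date


def month_wom_label(d: date) -> str:
    """Return a label such as 'October_w1' for the given date."""
    # Days 1-7 -> week 1, days 8-14 -> week 2, ..., days 29-31 -> week 5.
    week_of_month = (d.day - 1) // 7 + 1
    # '%B' is the full month name; English under the default locale (assumption).
    return f"{d.strftime('%B')}_w{week_of_month}"


print(month_wom_label(date(2024, 10, 1)))   # October_w1
print(month_wom_label(date(2024, 11, 14)))  # November_w2
print(month_wom_label(date(2024, 10, 31)))  # October_w5
```

Because days 29 to 31 fall into a fifth week, a month can contribute up to five distinct labels.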
- Use this StackOverflow reference to compute the week of the month using the following formula:
  (d.day - 1) // 7 + 1
- Create a new string variable that combines the month name and week of the month. Example: 'October_w1', 'November_w2', etc.
- Add the new variable (e.g., month_wom) to your set of categorical features. Your updated categorical feature list should include: 'Month', 'Weekday', 'Ticker', 'ticker_type', 'month_wom'.
- Use pandas.get_dummies() to generate dummy variables for all categorical features. This should result in approximately 115 dummy variables, including around 60 for the month_wom feature (12 months × up to 5 weeks).
- Use DataFrame.corr() to compute the correlation between each feature and the target variable is_positive_growth_30d_future.
- Filter the correlation results to include only the dummy variables generated from month_wom.
- Create a new column named abs_corr in the correlation results that stores the absolute value of the correlations.
- Sort the correlation results by abs_corr in descending order.
- Identify and report the highest absolute correlation value among the month_wom dummy variables, rounded to three decimal places.
NOTE: new dummies will be used as features in the next tasks, please leave them in the dataset.
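The steps above can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: the Date column, the two made-up tickers, and the random target are placeholders rather than the course dataframe, so the printed correlation has no meaning beyond showing the mechanics.

```python
import numpy as np
import pandas as pd

TARGET = "is_positive_growth_30d_future"

# Synthetic stand-in for the course dataframe (hypothetical values).
rng = np.random.default_rng(0)
dates = pd.date_range("2023-01-01", "2023-12-31", freq="D")
df = pd.DataFrame({
    "Date": dates,
    "Ticker": np.where(np.arange(len(dates)) % 2 == 0, "AAA", "BBB"),
    TARGET: rng.integers(0, 2, size=len(dates)),
})

# Month name + week of month, e.g. 'October_w1'.
week_of_month = (df["Date"].dt.day - 1) // 7 + 1
df["month_wom"] = df["Date"].dt.month_name() + "_w" + week_of_month.astype(str)

# One 0/1 column per category value; column names are prefixed with the source column.
dummies = pd.get_dummies(df[["Ticker", "month_wom"]], dtype=int)

# Correlation of every dummy with the target, taken from the full correlation matrix.
corr_all = pd.concat([dummies, df[TARGET]], axis=1).corr()[TARGET]

# Keep only the month_wom dummies, then rank them by absolute correlation.
month_wom_corr = corr_all[corr_all.index.str.startswith("month_wom_")].to_frame("corr")
month_wom_corr["abs_corr"] = month_wom_corr["corr"].abs()
month_wom_corr = month_wom_corr.sort_values("abs_corr", ascending=False)

print(month_wom_corr.head())
print(month_wom_corr.index[0], round(month_wom_corr["abs_corr"].iloc[0], 3))
```

In the homework itself, the dummies would be generated for all five categorical features and kept in the dataframe for the later tasks.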
What is the precision score for the best of the NEW predictions (pred3 or pred4), rounded to three decimal places?
In this task, you'll apply insights from the visualized decision tree (clf10) (see Code Snippet 5: 1.4.4 Visualisation) to manually define and evaluate new predictive rules.
- Define two new 'hand' rules based on branches that lead to 'positive' predictions in the tree:
  - pred3_manual_dgs10_5: (DGS10 <= 4) & (DGS5 <= 1)
  - pred4_manual_dgs10_fedfunds: (DGS10 > 4) & (FEDFUNDS <= 4.795)
  Hint: This is not exactly the same condition as in the estimated tree (original: (DGS10 <= 4.825) & (DGS5 <= 0.745); (DGS10 > 4.825) & (FEDFUNDS <= 4.795)), since in that case, there are no true positive predictions for both variables. Consider why this might be the case.
- Extend Code Snippet 3 (Manual "hand rule" predictions):
  - Implement and apply the above two rules (pred3, pred4) to your dataset.
  - Add the resulting predictions as new columns in your dataframe (e.g., new_df).
- Compute precision:
  - For the rule that does make positive predictions on the TEST set, compute its precision score.
  - Use standard precision metrics (TP / (TP + FP)).
  - Round the precision score to three decimal places. Example: If your result is 0.57897, your final answer should be: 0.579.
Hint: This should already be visible in the code output, as the IS_CORRECT and PREDICTIONS sets should automatically include the new columns.
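A minimal sketch of the two rules and the precision calculation, run on a tiny synthetic frame. The numeric values, the split column, and the precision helper are hypothetical placeholders; the course notebook may organise the TEST subset differently.

```python
import pandas as pd

TARGET = "is_positive_growth_30d_future"

# Tiny synthetic frame; the values and the 'split' column are hypothetical.
df = pd.DataFrame({
    "DGS10":    [3.5, 4.2, 4.5, 3.9, 4.6, 0.9],
    "DGS5":     [0.8, 4.0, 4.3, 1.5, 4.4, 0.4],
    "FEDFUNDS": [0.1, 4.5, 5.3, 2.0, 4.7, 0.1],
    TARGET:     [1, 1, 0, 0, 0, 1],
    "split":    ["test"] * 6,
})

# Hand rules from the task, stored as 0/1 prediction columns.
df["pred3_manual_dgs10_5"] = ((df["DGS10"] <= 4) & (df["DGS5"] <= 1)).astype(int)
df["pred4_manual_dgs10_fedfunds"] = (
    (df["DGS10"] > 4) & (df["FEDFUNDS"] <= 4.795)
).astype(int)


def precision(frame, pred_col, target_col=TARGET):
    """TP / (TP + FP); returns None when the rule makes no positive predictions."""
    predicted_positive = frame[frame[pred_col] == 1]
    if predicted_positive.empty:
        return None
    true_positives = (predicted_positive[target_col] == 1).sum()
    return round(true_positives / len(predicted_positive), 3)


test = df[df["split"] == "test"]
for col in ["pred3_manual_dgs10_5", "pred4_manual_dgs10_fedfunds"]:
    print(col, precision(test, col))
```

Returning None for a rule with no positive predictions keeps the undefined 0 / 0 case visible instead of reporting a misleading 0.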
What is the total number of records in the TEST dataset where the new prediction pred5_clf_10 is correct, while all 'hand' rule predictions (pred0 to pred4) are incorrect?
To ensure reproducibility, please include the following parameter in the Decision Tree Classifier:
clf = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
- Initialize a Decision Tree Classifier with a maximum depth of 10 and set random_state=42 for reproducibility.
- Fit the classifier on the combined TRAIN and VALIDATION datasets.
- Use the trained model to predict on the entire dataset (TRAIN + VALIDATION + TEST).
- Store these predictions in a new column named pred5_clf_10 within your main dataframe.
- Hint: When predicting on the entire dataset, it's easy to join the predictions with the full DataFrame, since the number of records and their order remain the same. You will need to define X_all and y_all and apply the same cleaning steps used previously for X_train, y_train, X_test, and y_test. This makes it straightforward to define a new column, for example:
  df['pred5_clf_10'] = <predictions vector from clf10.predict(X_all)>
- Create a new boolean column, only_pred5_is_correct, that is True only when:
  - The prediction from pred5_clf_10 is correct (i.e., matches the true label).
  - All other hand rule predictions (pred0 through pred4) are incorrect.
- Convert the only_pred5_is_correct column from boolean to integer.
- Filter the dataframe for records belonging to the TEST dataset.
- Count how many records in the TEST set have only_pred5_is_correct equal to 1.
- Report this count as your final answer.
- To generalize this for many prediction columns (e.g., pred0 to pred99), define a function that can be applied to an entire dataframe row.
- This function should identify whether a specific prediction (predX) is uniquely correct (correct while all others are incorrect).
- This approach avoids hardcoding conditions for each predictor and scales easily.
- For examples of how to apply functions to rows in pandas, see this helpful resource:
Pandas apply function to every row
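One possible shape for such a row-wise function is sketched below. The helper name is_uniquely_correct and the toy dataframe are assumptions made for illustration; only two hand rules are included to keep the example short, and the TEST filter is omitted.

```python
import pandas as pd

TARGET = "is_positive_growth_30d_future"

# Toy data: two hand rules plus the tree prediction (hypothetical values).
df = pd.DataFrame({
    TARGET:         [1, 0, 1, 1],
    "pred0":        [0, 1, 1, 0],
    "pred1":        [0, 1, 0, 1],
    "pred5_clf_10": [1, 0, 1, 0],
})

# Collect every prediction column once, instead of hardcoding each name.
pred_cols = [c for c in df.columns if c.startswith("pred")]


def is_uniquely_correct(row, pred_col, all_pred_cols, target=TARGET):
    """True when pred_col matches the target and every other prediction misses it."""
    if row[pred_col] != row[target]:
        return False
    return all(row[c] != row[target] for c in all_pred_cols if c != pred_col)


# axis=1 passes one row at a time; args supplies the extra positional arguments.
df["only_pred5_is_correct"] = df.apply(
    is_uniquely_correct, axis=1, args=("pred5_clf_10", pred_cols)
).astype(int)

print(df["only_pred5_is_correct"].tolist())  # [1, 1, 0, 0]
print(df["only_pred5_is_correct"].sum())     # 2
```

The same function can be reused for any other predX column by changing the first entry of args.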
What is the optimal tree depth (from 1 to 20) for a DecisionTreeClassifier?
NOTE: please add random_state=42 to the Decision Tree Classifier initialization (e.g., clf = DecisionTreeClassifier(max_depth=max_depth, random_state=42)) to ensure consistent results.
- Iterate through max_depth values from 1 to 20.
- For each max_depth:
  - Train a Decision Tree Classifier with the current max_depth on the combined TRAIN+VALIDATION dataset.
- Optionally, visualize how the 'head' (top levels) of each fitted tree changes with increasing tree depth. You can use:
  - sklearn.tree.plot_tree() for graphical visualization, or
  - The compact textual approach with the export_text() function. For example:
      from sklearn.tree import export_text
      tree_rules = export_text(model, feature_names=list(X_train), max_depth=3)
      print(tree_rules)
- Calculate the precision score on the TEST dataset for each fitted tree. You may also track precision on the VALIDATION dataset to observe signs of overfitting.
- Identify the optimal max_depth where the precision score on the TEST dataset is highest. This value is your best_max_depth.
- Using best_max_depth, retrain the Decision Tree Classifier on the combined TRAIN+VALIDATION set.
- Predict on the entire dataset (TRAIN + VALIDATION + TEST) and add the predictions as a new column pred6_clf_best in your dataframe new_df.
- Compare the precision score of the tuned tree with previous predictions (pred0 to pred5). You should observe an improvement, ideally achieving precision > 0.58, indicating the tuned tree outperforms earlier models.
- Plot the precision (or accuracy) scores against the max_depth values to detect saturation or overfitting trends.
- Observe the trade-off between model complexity (deeper trees) and generalization capability.
- For more information, consult the scikit-learn Decision Trees documentation.
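The tuning loop described above might look roughly like the sketch below. It runs on synthetic data from make_classification with a random train_test_split, both of which are stand-ins for the course features and its TRAIN/VALIDATION/TEST split, so the selected depth says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the course features (hypothetical).
X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_train_valid, X_test, y_train_valid, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit one tree per depth and record its precision on the TEST part.
precision_by_depth = {}
for max_depth in range(1, 21):
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    clf.fit(X_train_valid, y_train_valid)
    y_pred = clf.predict(X_test)
    # zero_division=0 avoids a warning if a tree predicts no positives.
    precision_by_depth[max_depth] = precision_score(y_test, y_pred, zero_division=0)

# On ties, max() returns the first (shallowest) depth with the top score.
best_max_depth = max(precision_by_depth, key=precision_by_depth.get)
print(best_max_depth, round(precision_by_depth[best_max_depth], 3))

# Retrain with the selected depth and predict on all rows.
clf_best = DecisionTreeClassifier(max_depth=best_max_depth, random_state=42)
clf_best.fit(X_train_valid, y_train_valid)
pred6_clf_best = clf_best.predict(X)
```

The precision_by_depth dictionary can be passed straight to a plotting call to inspect how precision changes with depth.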
Now that you have gained insights from the correlation analysis and Decision Tree results regarding the most influential variables, suggest new indicators you would like to include in the dataset and explain your reasoning.
Alternatively, you may propose a completely different approach based on your intuition, provided it remains relevant to the shared dataset of the largest stocks from India, the EU, and the US. If you choose this route, please also specify the data source.
Form for submitting: https://courses.datatalks.club/sma-zoomcamp-2025/homework/hw03
Leaderboard link: https://courses.datatalks.club/sma-zoomcamp-2025/leaderboard