Module 3 Homework (2025 Cohort)

In this homework, we’ll work with categorical variables, build our first machine learning models (Decision Trees), and perform hyperparameter tuning.

💡 Your goal is to read through the provided code and make slight extensions. Specifically:

  • Add more columns to the dataframe.
  • Define new “handcrafted” rules to predict growth.
  • Analyze whether the ML model produces unique predictions (i.e., cases where only the ML model is correct compared to other “hand” rules).
  • Tune the Decision Tree by optimizing its complexity (i.e., the depth hyperparameter).

Please use the Module 3 Colab notebook for all tasks so that you work with the same dataframe used for the Modeling part covered in the lecture.

HINT: If you want to avoid data truncation in GitHub's UI, try either of the following options:


Question 1: Dummies for Month and Week-of-Month

What is the ABSOLUTE CORRELATION VALUE of the most correlated dummy variable <month>_w<week_of_month> with the binary outcome is_positive_growth_30d_future?

From the correlation analysis and modeling, you may have observed that October and November are potentially important seasonal months. In this task, you'll go further by generating dummy variables for both the Month and Week-of-Month (starting from 1). For example, the first week of October should be coded as: 'October_w1'.

Once you've generated these new variables, identify the one with the highest absolute correlation with is_positive_growth_30d_future, and round the result to three decimal places.

Suggested Steps

  1. Use this StackOverflow reference to compute the week of the month using the following formula:
     (d.day - 1) // 7 + 1

  2. Create a new string variable that combines the month name and week of the month. Example: 'October_w1', 'November_w2', etc.

  3. Add the new variable (e.g., month_wom) to your set of categorical features.

    Your updated categorical feature list should include:

    • 'Month'
    • 'Weekday'
    • 'Ticker'
    • 'ticker_type'
    • 'month_wom'
  4. Use pandas.get_dummies() to generate dummy variables for all categorical features.

    This should result in approximately 115 dummy variables, including around 60 for the month_wom feature (12 months × up to 5 weeks).

  5. Use DataFrame.corr() to compute the correlation between each feature and the target variable is_positive_growth_30d_future.

  6. Filter the correlation results to include only the dummy variables generated from month_wom.

  7. Create a new column named abs_corr in the correlation results that stores the absolute value of the correlations.

  8. Sort the correlation results by abs_corr in descending order.

  9. Identify and report the highest absolute correlation value among the month_wom dummy variables, rounded to three decimal places.

NOTE: The new dummies will be used as features in the next tasks, so please keep them in the dataset.
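The suggested steps can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the Colab code: the toy dataframe, the random target, the Date column name, and the helper column wom are all assumptions made for the example, so the printed correlation has no relation to the real answer.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the homework dataframe (assumption: one row per day of 2023).
rng = np.random.default_rng(42)
dates = pd.date_range("2023-01-01", "2023-12-31", freq="D")
target = "is_positive_growth_30d_future"
df = pd.DataFrame({"Date": dates, target: rng.integers(0, 2, size=len(dates))})

# Week of month starting from 1: days 1-7 -> 1, 8-14 -> 2, ..., 29-31 -> 5.
df["wom"] = (df["Date"].dt.day - 1) // 7 + 1
df["month_wom"] = df["Date"].dt.month_name() + "_w" + df["wom"].astype(str)

# One dummy column per month/week combination, e.g. 'month_wom_October_w1'.
dummies = pd.get_dummies(df["month_wom"], prefix="month_wom", dtype="int32")

# Correlation of every dummy with the target, then rank by absolute value.
corr = (
    pd.concat([dummies, df[[target]]], axis=1)
    .corr()[target]
    .drop(target)
    .to_frame("corr")
)
corr["abs_corr"] = corr["corr"].abs()
corr = corr.sort_values("abs_corr", ascending=False)

print(corr.head(3))
print("Highest absolute correlation:", round(corr["abs_corr"].iloc[0], 3))
```

In the homework itself, get_dummies() is applied to all categorical features at once, so the month_wom columns would need to be filtered out of the full correlation result, for example by their name prefix.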


Question 2: Define New "Hand" Rules on Macro and Technical Indicator Variables

What is the precision score for the best of the NEW predictions (pred3 or pred4), rounded to three decimal places?

In this task, you'll apply insights from the visualized decision tree (clf10) (see Code Snippet 5: 1.4.4 Visualisation) to manually define and evaluate new predictive rules.

  1. Define two new 'hand' rules based on branches that lead to 'positive' predictions in the tree:

    • pred3_manual_dgs10_5:
      (DGS10 <= 4) & (DGS5 <= 1)
    • pred4_manual_dgs10_fedfunds:
      (DGS10 > 4) & (FEDFUNDS <= 4.795)

    Hint: These are not exactly the conditions from the estimated tree (original: (DGS10 <= 4.825) & (DGS5 <= 0.745); (DGS10 > 4.825) & (FEDFUNDS <= 4.795)), because with the original thresholds there are no true positive predictions for either rule. Consider why this might be the case.

  2. Extend Code Snippet 3 (Manual "hand rule" predictions):

    • Implement and apply the above two rules (pred3, pred4) to your dataset.
    • Add the resulting predictions as new columns in your dataframe (e.g., new_df).
  3. Compute precision:

    • For the rule that does make positive predictions on the TEST set, compute its precision score.
    • Use the standard precision formula: TP / (TP + FP).
    • Round the precision score to three decimal places.
      Example: If your result is 0.57897, your final answer should be: 0.579.

Hint: This should already be visible in the code output, as the IS_CORRECT and PREDICTIONS sets should automatically include the new columns.
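As a rough illustration of the two rules and the precision calculation, here is a small sketch on made-up numbers. The toy values, the split column name, and the helper precision_on_test are assumptions for this example only; the Colab notebook already produces the same metric through its IS_CORRECT and PREDICTIONS logic.

```python
import pandas as pd

target = "is_positive_growth_30d_future"

# Made-up macro values; the real ones come from the Colab dataframe.
new_df = pd.DataFrame({
    "DGS10":    [3.5, 4.2, 4.5, 3.9, 4.9, 4.1, 4.3, 4.2],
    "DGS5":     [0.8, 4.0, 4.3, 0.9, 4.6, 3.8, 4.0, 3.9],
    "FEDFUNDS": [0.1, 4.5, 5.3, 0.2, 5.3, 4.6, 4.7, 4.5],
    "split":    ["train", "train", "test", "train", "test", "test", "test", "test"],
    target:     [1, 0, 1, 1, 0, 1, 0, 1],
})

# The two new hand rules, stored as 0/1 prediction columns.
new_df["pred3_manual_dgs10_5"] = (
    (new_df["DGS10"] <= 4) & (new_df["DGS5"] <= 1)
).astype(int)
new_df["pred4_manual_dgs10_fedfunds"] = (
    (new_df["DGS10"] > 4) & (new_df["FEDFUNDS"] <= 4.795)
).astype(int)


def precision_on_test(df, pred_col):
    """Return TP / (TP + FP) on the TEST rows, or None if the rule never predicts 1 there."""
    test = df[df["split"] == "test"]
    tp = ((test[pred_col] == 1) & (test[target] == 1)).sum()
    fp = ((test[pred_col] == 1) & (test[target] == 0)).sum()
    if tp + fp == 0:
        return None
    return round(tp / (tp + fp), 3)


for col in ["pred3_manual_dgs10_5", "pred4_manual_dgs10_fedfunds"]:
    print(col, precision_on_test(new_df, col))
```

Returning None when a rule makes no positive predictions keeps the undefined 0 / 0 case visible instead of silently reporting a precision of zero.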


Question 3: Unique Correct Predictions from a 10-Level Decision Tree Classifier (pred5_clf_10)

What is the total number of records in the TEST dataset where the new prediction pred5_clf_10 is correct, while all 'hand' rule predictions (pred0 to pred4) are incorrect?

To ensure reproducibility, please include the following parameter in the Decision Tree Classifier:

clf = DecisionTreeClassifier(max_depth=max_depth, random_state=42) 

Step 1: Train the Decision Tree and Generate Predictions

  • Initialize a Decision Tree Classifier with a maximum depth of 10 and set random_state=42 for reproducibility.
  • Fit the classifier on the combined TRAIN and VALIDATION datasets.
  • Use the trained model to predict on the entire dataset (TRAIN + VALIDATION + TEST).
  • Store these predictions in a new column named pred5_clf_10 within your main dataframe.
  • Hint: When predicting on the entire dataset, it's easy to join the predictions with the full DataFrame, since the number of records and their order remain the same. You will need to define X_all and y_all and apply the same cleaning steps used previously for X_train, y_train, X_test, and y_test. This makes it straightforward to define a new column, for example:
    df['pred5_clf_10'] = <predictions vector from clf10.predict(X_all)>
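Step 1 might look something like the sketch below. It runs on synthetic data with two invented features (feature_a, feature_b) and an assumed split column; in the homework, X_all must instead be built with the same cleaning steps as X_train and X_test.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

target = "is_positive_growth_30d_future"
features = ["feature_a", "feature_b"]  # hypothetical feature names

# Synthetic stand-in for the cleaned homework dataframe.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
})
df[target] = (df["feature_a"] + 0.5 * rng.normal(size=n) > 0).astype(int)

# Ordered split: first 70% train, next 15% validation, last 15% test.
position = np.arange(n)
df["split"] = np.where(position < 210, "train",
                       np.where(position < 255, "validation", "test"))

# Fit on TRAIN + VALIDATION only.
train_valid = df[df["split"].isin(["train", "validation"])]
clf10 = DecisionTreeClassifier(max_depth=10, random_state=42)
clf10.fit(train_valid[features], train_valid[target])

# Predict on every row; X_all keeps the row order of df, so the
# prediction vector can be assigned directly as a new column.
X_all = df[features]
df["pred5_clf_10"] = clf10.predict(X_all)

print(df.groupby("split")["pred5_clf_10"].mean())
```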

Step 2: Identify Unique Correct Predictions by pred5_clf_10

  • Create a new boolean column, only_pred5_is_correct, that is True only when:
    • The prediction from pred5_clf_10 is correct (i.e., matches the true label).
    • All hand rule predictions (pred0 through pred4) are incorrect.

Step 3: Count Unique Correct Predictions on the TEST Set

  • Convert the only_pred5_is_correct column from boolean to integer.
  • Filter the dataframe for records belonging to the TEST dataset.
  • Count how many records in the TEST set have only_pred5_is_correct equal to 1.
  • Report this count as your final answer.
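Steps 2 and 3 can be sketched as below on a tiny hand-made table. The shortened column names pred0 to pred4 and the split column are assumptions for the example; the real dataframe uses longer prediction column names.

```python
import pandas as pd

target = "is_positive_growth_30d_future"

# Hand-made example: 1 TRAIN row and 4 TEST rows.
new_df = pd.DataFrame({
    "split":        ["train", "test", "test", "test", "test"],
    target:         [1, 1, 0, 1, 0],
    "pred0":        [0, 0, 1, 1, 1],
    "pred1":        [0, 0, 1, 0, 1],
    "pred2":        [0, 0, 1, 0, 1],
    "pred3":        [0, 0, 1, 0, 1],
    "pred4":        [0, 0, 1, 0, 1],
    "pred5_clf_10": [1, 1, 0, 1, 1],
})

hand_cols = [f"pred{i}" for i in range(5)]

# True where the tree matches the label.
pred5_correct = new_df["pred5_clf_10"] == new_df[target]
# True where at least one hand rule matches the label (row-wise comparison).
any_hand_correct = new_df[hand_cols].eq(new_df[target], axis=0).any(axis=1)

# Boolean flag first, then the integer version used for counting.
new_df["only_pred5_is_correct"] = pred5_correct & ~any_hand_correct
new_df["only_pred5_is_correct"] = new_df["only_pred5_is_correct"].astype(int)

test_count = int(new_df.loc[new_df["split"] == "test", "only_pred5_is_correct"].sum())
print("Unique correct predictions on TEST:", test_count)
```

The TRAIN row in this example is also uniquely correct, which shows why the TEST filter has to be applied before counting.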

Advanced (Optional)

  • To generalize this for many prediction columns (e.g., pred0 to pred99), define a function that can be applied to an entire dataframe row.
  • This function should identify whether a specific prediction (predX) is uniquely correct (correct while all others are incorrect).
  • This approach avoids hardcoding conditions for each predictor and scales easily.
  • For examples of how to apply functions to rows in pandas, see this helpful resource:
    Pandas apply function to every row
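One possible shape for such a row-wise function is sketched below. The function name is_uniquely_correct, the target column name, and the toy data are hypothetical; DataFrame.apply with axis=1 is the only library feature it relies on.

```python
import pandas as pd


def is_uniquely_correct(row, pred_col, all_pred_cols, target_col):
    """True if pred_col matches the target while every other prediction misses it."""
    if row[pred_col] != row[target_col]:
        return False
    return all(row[c] != row[target_col] for c in all_pred_cols if c != pred_col)


# Toy data with three prediction columns.
df = pd.DataFrame({
    "target": [1, 0, 1, 0],
    "pred0":  [1, 1, 0, 0],
    "pred1":  [0, 0, 0, 0],
    "pred2":  [0, 1, 1, 1],
})

# Collect the prediction columns once, before any new columns are added.
pred_cols = [c for c in df.columns if c.startswith("pred")]

for col in pred_cols:
    df[f"only_{col}_is_correct"] = df.apply(
        is_uniquely_correct, axis=1, args=(col, pred_cols, "target")
    )

print(df.filter(like="only_").sum())
```

Row-wise apply is slow on large dataframes, so a vectorized comparison may be preferable once the logic is settled.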

Question 4: Hyperparameter tuning for a Decision Tree

What is the optimal tree depth (from 1 to 20) for a DecisionTreeClassifier?

NOTE: Please include random_state=42 in the Decision Tree Classifier initialization (e.g., clf = DecisionTreeClassifier(max_depth=max_depth, random_state=42)) to ensure consistent results.

Instructions:

  • Iterate through max_depth values from 1 to 20.
  • For each max_depth:
    • Train a Decision Tree Classifier with the current max_depth on the combined TRAIN+VALIDATION dataset.
  • Optionally, visualize how the 'head' (top levels) of each fitted tree changes with increasing tree depth. You can use:
    • sklearn.tree.plot_tree() for graphical visualization, or
    • The compact textual approach with export_text() function. For example:
      from sklearn.tree import export_text
      tree_rules = export_text(model, feature_names=list(X_train), max_depth=3)
      print(tree_rules)
  • Calculate the precision score on the TEST dataset for each fitted tree. You may also track precision on the VALIDATION dataset to observe signs of overfitting.
  • Identify the optimal max_depth where the precision score on the TEST dataset is highest. This value is your best_max_depth.
  • Using best_max_depth, retrain the Decision Tree Classifier on the combined TRAIN+VALIDATION set.
  • Predict on the entire dataset (TRAIN + VALIDATION + TEST) and add the predictions as a new column pred6_clf_best in your dataframe new_df.
  • Compare the precision score of the tuned tree with previous predictions (pred0 to pred5). You should observe an improvement, ideally achieving precision > 0.58, indicating the tuned tree outperforms earlier models.
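The tuning loop described above could be organized along these lines. This is a sketch on synthetic data with invented feature names and an assumed split column, so the depth it selects says nothing about the real answer.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_score
from sklearn.tree import DecisionTreeClassifier

target = "is_positive_growth_30d_future"
features = ["feature_a", "feature_b"]  # hypothetical feature names

# Synthetic stand-in for the cleaned homework dataframe.
rng = np.random.default_rng(0)
n = 400
new_df = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
})
new_df[target] = (new_df["feature_a"] + 0.5 * rng.normal(size=n) > 0).astype(int)
new_df["split"] = np.where(np.arange(n) < 340, "train_valid", "test")

train_valid = new_df[new_df["split"] == "train_valid"]
test = new_df[new_df["split"] == "test"]

# Fit one tree per depth and record its precision on the TEST set.
precision_by_depth = {}
for max_depth in range(1, 21):
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    clf.fit(train_valid[features], train_valid[target])
    y_pred = clf.predict(test[features])
    precision_by_depth[max_depth] = precision_score(test[target], y_pred, zero_division=0)

# On ties, max() keeps the first key seen, i.e. the shallowest tree.
best_max_depth = max(precision_by_depth, key=precision_by_depth.get)

# Retrain with the selected depth and predict on every row.
clf_best = DecisionTreeClassifier(max_depth=best_max_depth, random_state=42)
clf_best.fit(train_valid[features], train_valid[target])
new_df["pred6_clf_best"] = clf_best.predict(new_df[features])

print("best_max_depth:", best_max_depth)
print("TEST precision:", round(precision_by_depth[best_max_depth], 3))
```

Keeping the scores in a dictionary keyed by depth makes it easy to plot precision against max_depth later.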

Advanced (Optional)

  • Plot the precision (or accuracy) scores against the max_depth values to detect saturation or overfitting trends.
  • Observe the trade-off between model complexity (deeper trees) and generalization capability.
  • For more information, consult the scikit-learn Decision Trees documentation.

[EXPLORATORY] Question 5: What data is missing?

Now that you have gained insights from the correlation analysis and Decision Tree results regarding the most influential variables, suggest new indicators you would like to include in the dataset and explain your reasoning.

Alternatively, you may propose a completely different approach based on your intuition, provided it remains relevant to the shared dataset of the largest stocks from India, the EU, and the US. If you choose this route, please also specify the data source.


Submitting the solutions

Form for submitting: https://courses.datatalks.club/sma-zoomcamp-2025/homework/hw03


Leaderboard

Leaderboard link: https://courses.datatalks.club/sma-zoomcamp-2025/leaderboard