
Implement configurable options for comps algorithm methodology#449

Merged
wagnerlmichael merged 34 commits into master from
405-persist-all-possible-significant-sales-algorithms-to-the-code
Feb 26, 2026

Conversation

@wagnerlmichael
Member

@wagnerlmichael wagnerlmichael commented Feb 18, 2026

Over the last year, we tested a number of algorithm variations for comps. This PR centralizes them all and allows us to choose which methodology we run through the params.yaml configuration file.

I've done moderate local testing for all four methods and extract_tree_weights runs, producing comps that pass the eyeball test.

The four options:

  • unweighted
  • unweighted_with_error_reduction
  • error_reduction
  • prediction_variance

Closes #405
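As a sketch, selecting a methodology might look something like this in params.yaml (the algorithm key name and values shown are illustrative; num_comps and the enable flag appear elsewhere in this PR, but the actual schema lives in the repo):

```yaml
comp:
  enable: true
  num_comps: 5
  # One of: unweighted, unweighted_with_error_reduction,
  # error_reduction, prediction_variance
  algorithm: error_reduction
```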

@wagnerlmichael wagnerlmichael linked an issue Feb 18, 2026 that may be closed by this pull request
message("First 5 weights:")
print(head(tree_weights, 5))
if (is.matrix(tree_weights)) {
  if (!all(rowSums(tree_weights) %in% c(0, 1))) {
Copy link
Member Author


Added the negation here because, unless I'm reading it incorrectly, I think we had this check backwards?
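The invariant being checked — each row of the weight matrix sums to 1 (or 0 for degenerate rows) — can be expressed in Python like so (an illustrative mirror of the R check, with a floating-point tolerance that an exact %in% membership test lacks):

```python
import numpy as np

def rows_sum_to_unity(tree_weights: np.ndarray, tol: float = 1e-8) -> bool:
    # Mirrors `all(rowSums(tree_weights) %in% c(0, 1))` from the R diff,
    # but allows for floating-point rounding in the row sums.
    row_sums = tree_weights.sum(axis=1)
    ok = np.isclose(row_sums, 1.0, atol=tol) | np.isclose(row_sums, 0.0, atol=tol)
    return bool(np.all(ok))
```

With the negation added in the diff, the error branch fires when this returns False.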

@wagnerlmichael wagnerlmichael changed the title [WIP] Implement 4 config options for comps methodology Implement 4 configurable options for comps algorithm methodology Feb 20, 2026
@wagnerlmichael wagnerlmichael changed the title Implement 4 configurable options for comps algorithm methodology Implement configurable options for comps algorithm methodology Feb 20, 2026
@wagnerlmichael wagnerlmichael marked this pull request as ready for review February 20, 2026 21:01
R/helpers.R Outdated
)

# ---------------------------------------------------------
# unweighted (vector with 1/n_trees for each tree)
Contributor

@Damonamajor Damonamajor Feb 20, 2026


I'm going to standardize the comments to the name: Description format once review is done.

@Damonamajor
Contributor

Damonamajor commented Feb 20, 2026

Also, this comment block ahead of the .R function is outdated:

# Helper function to return weights for comps
# Computes per-tree weights from cumulative leaf node values.

# Basic Steps
# For every observation, map its assigned leaf index in
# each tree to the corresponding leaf value.
# Compute the row-wise cumulative sums of these
# leaf values (stand-in for training data predictions).
# Calculate the absolute prediction error.
# Compute the reduction in error.
# Normalize these improvements so that row-weights sum to 1.
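For orientation, the steps that comment describes can be sketched as a small Python stand-in (array shapes and the zero-prediction baseline for the first tree are my assumptions; this is not the repo's R helper):

```python
import numpy as np

def error_reduction_weights(leaf_values: np.ndarray, actuals: np.ndarray) -> np.ndarray:
    """Per-tree weights from cumulative leaf values, per the steps above."""
    # Row-wise cumulative sums of leaf values stand in for the boosted
    # prediction after each successive tree.
    cum_pred = np.cumsum(leaf_values, axis=1)
    # Absolute prediction error after each tree.
    abs_err = np.abs(cum_pred - actuals[:, None])
    # Error before each tree; the first tree starts from a zero
    # prediction (an assumption for this sketch).
    prev_err = np.hstack([np.abs(actuals)[:, None], abs_err[:, :-1]])
    # Reduction in error per tree; clip trees that made things worse.
    reduction = np.clip(prev_err - abs_err, 0, None)
    # Normalize so each row's weights sum to 1 (rows with no reduction
    # anywhere are left as all zeros).
    totals = reduction.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0
    return reduction / totals
```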

Member

@jeancochrane jeancochrane left a comment


This is great, thanks, you two! Some small comments below, but nothing that I'm super concerned about overall. Once final code changes are in, I'm going to kick off a few test runs to make sure each algorithm works.

@wagnerlmichael are you down to take a stab at extending the Python tests to confirm these new changes work? A few cases I think we should test:

  • Add additional parameterized test cases to test_get_comps to make sure that the weights indexing works as expected when the weights are a vector instead of a matrix
  • Add additional parameterized test cases to test_get_comps_raises_on_invalid_inputs to make sure we properly raise errors in the 1-D case

Pytest parameterized tests can be a bit tricky if you're not familiar with them, so let me know if you want any help figuring out how to do this.
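A parameterized case along these lines might look like the following sketch (the validator and test names here are illustrative stand-ins, not the repo's actual functions or fixtures):

```python
import numpy as np
import pytest

def validate_weights(weights: np.ndarray, n_trees: int) -> None:
    # Stand-in for the dimension check inside get_comps; the real check
    # and its error message live in the repo.
    if weights.ndim == 1:
        if weights.shape[0] != n_trees:
            raise ValueError(f"expected {n_trees} weights, got {weights.shape[0]}")
    elif weights.ndim == 2:
        if weights.shape[1] != n_trees:
            raise ValueError(f"expected {n_trees} columns, got {weights.shape[1]}")
    else:
        raise ValueError(
            f"expected weights of shape (n_trees,) or "
            f"(n_comparisons, n_trees), got {weights.ndim}-D"
        )

@pytest.mark.parametrize(
    "weights, should_raise",
    [
        (np.full(10, 0.1), False),       # vector: one weight per tree
        (np.full((4, 10), 0.1), False),  # matrix: one row per comparison
        (np.ones((2, 2, 10)), True),     # 3-D input should raise
    ],
)
def test_validate_weights(weights, should_raise):
    if should_raise:
        with pytest.raises(ValueError):
            validate_weights(weights, n_trees=10)
    else:
        validate_weights(weights, n_trees=10)
```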

Also, I noticed that the GitHub workflow that runs the Python tests was disabled, so we're not running these tests as part of our CI checks. I just enabled that workflow again, so hopefully the next time you push a commit, GitHub will run your tests; if that doesn't happen, let me know and we can continue troubleshooting.

@wagnerlmichael
Member Author

> This is great, thanks, you two! Some small comments below, but nothing that I'm super concerned about overall. Once final code changes are in, I'm going to kick off a few test runs to make sure each algorithm works.
>
> @wagnerlmichael are you down to take a stab at extending the Python tests to confirm these new changes work? A few cases I think we should test:
>
> • Add additional parameterized test cases to test_get_comps to make sure that the weights indexing works as expected when the weights are a vector instead of a matrix
> • Add additional parameterized test cases to test_get_comps_raises_on_invalid_inputs to make sure we properly raise errors in the 1-D case
>
> Pytest parameterized tests can be a bit tricky if you're not familiar with them, so let me know if you want any help figuring out how to do this.
>
> Also, I noticed that the GitHub workflow that runs the Python tests was disabled, so we're not running these tests as part of our CI checks. I just enabled that workflow again, so hopefully the next time you push a commit, GitHub will run your tests; if that doesn't happen, let me know and we can continue troubleshooting.

Yes, sounds good! I'll add some tests

wagnerlmichael and others added 2 commits February 23, 2026 09:01
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
wagnerlmichael and others added 2 commits February 23, 2026 11:47
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
Member

@jeancochrane jeancochrane left a comment


@wagnerlmichael Tiny little nit I just noticed while testing -- since we're adding a new parameter to params.yaml, we should make sure we save it to the model.metadata output table in the finalize stage:

comp_enable = comp_enable,
comp_num_comps = params$comp$num_comps,

@Damonamajor
Contributor

Damonamajor commented Feb 24, 2026

I added it, do we need documentation for it anywhere?

@jeancochrane
Member

> I added it, do we need documentation for it anywhere?

Great question -- we probably should document the fields in model.metadata in our data catalog, but currently we don't, so no action currently needed.

…o-the-code' of github.com:ccao-data/model-res-avm into 405-persist-all-possible-significant-sales-algorithms-to-the-code

Merge
f"(n_comparisons, n_trees), got {weights.ndim}-D"
)

# Avoid editing the df in-place
Member Author


Would you add more extensive documentation about the reason for adding this here? @jeancochrane

Member

@jeancochrane jeancochrane Feb 25, 2026


Yeah, I think so. I actually don't even understand why we need this. Do we mutate the observation dataframe later on?

Member Author


In the following chunk, when get_comps runs:

    # Test with matrix weights (error_reduction style)
    tree_weights_matrix = np.asarray(
        [np.random.dirichlet(np.ones(num_trees)) for _ in range(num_comparisons)]
    )
    start = time.time()
    get_comps(leaf_nodes, training_leaf_nodes, tree_weights_matrix)
    end = time.time()
    print(f"get_comps (matrix weights) runtime: {end - start}s")

this code, which creates a new column in the observation_df in place within the function, also edits the leaf_nodes data frame outside the scope of the function:

# Chunk the observations so that the script can periodically report progress
observation_df["chunk"] = pd.cut(
    observation_df.index, bins=num_chunks, labels=False
)

such that when we finish the matrix test and move on to the vector test, the leaf_nodes object's column count has increased from 500 to 501, which causes our ValueError tests to catch a dimension mismatch:

  # Test with vector weights (unweighted / prediction_variance style)
  tree_weights_vector = np.random.dirichlet(np.ones(num_trees))
  start = time.time()
  get_comps(leaf_nodes, training_leaf_nodes, tree_weights_vector)
  end = time.time()
  print(f"get_comps (vector weights) runtime: {end - start}s")

Here is a reproducible isolated example that I think replicates the behaviour:

import pandas as pd
import numpy as np

# Create toy dataframes
leaf_nodes = pd.DataFrame(np.random.randint(0, 10, size=[5, 3]))
print("Before:", leaf_nodes.shape)  # (5, 3)

def add_chunk_column(observation_df):
    observation_df["chunk"] = [0, 0, 1, 1, 1]

add_chunk_column(leaf_nodes)
print("After:", leaf_nodes.shape)   # (5, 4)
print(leaf_nodes)
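For reference, a defensive copy along these lines keeps the caller's dataframe untouched (a minimal sketch of the fix behind the "Avoid editing the df in-place" comment in the diff):

```python
import numpy as np
import pandas as pd

leaf_nodes = pd.DataFrame(np.random.randint(0, 10, size=[5, 3]))

def add_chunk_column(observation_df):
    # Copy first so the caller's dataframe is left untouched.
    observation_df = observation_df.copy()
    observation_df["chunk"] = [0, 0, 1, 1, 1]
    return observation_df

chunked = add_chunk_column(leaf_nodes)
print("Original:", leaf_nodes.shape)  # still (5, 3)
print("Copy:", chunked.shape)         # (5, 4)
```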

Does this make sense? I feel like pandas inplace trickiness always gets me

Member


Ahh right! Thanks for the clear explanation. I see now that the mutation happens literally on the next line lol, my bad for missing it 🤦🏻‍♀️ Since we perform the mutation immediately after this copy operation, I don't actually think we need to document the decision any more thoroughly than this.

Member Author


No prob! Got it, sounds good

Member

@jeancochrane jeancochrane left a comment


This is good to go. Nice work you two!

@wagnerlmichael wagnerlmichael merged commit a0a8f81 into master Feb 26, 2026
6 checks passed
@wagnerlmichael wagnerlmichael deleted the 405-persist-all-possible-significant-sales-algorithms-to-the-code branch February 26, 2026 17:47

Development

Successfully merging this pull request may close these issues.

Persist all possible significant sales algorithms to the code

3 participants