Implement configurable options for comps algorithm methodology #449
Conversation
| message("First 5 weights:") | ||
| print(head(tree_weights, 5)) | ||
| if (is.matrix(tree_weights)) { | ||
| if (!all(rowSums(tree_weights) %in% c(0, 1))) { |
Added the negation here because, unless I'm reading this incorrectly, I think we had this backwards?
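For reference, my reading of the intent of that check: raise when any row of a 2-D tree-weight matrix sums to something other than 0 or 1. A rough Python sketch of the same validation (the function and variable names here are illustrative, not the actual helper):

```python
import numpy as np

def validate_tree_weights(tree_weights):
    """Raise if any row of a 2-D weight matrix sums to neither 0 nor 1."""
    tree_weights = np.asarray(tree_weights)
    if tree_weights.ndim == 2:
        row_sums = tree_weights.sum(axis=1)
        # Allow for floating-point error when comparing to 0 and 1
        ok = np.isclose(row_sums, 0) | np.isclose(row_sums, 1)
        if not ok.all():
            raise ValueError("Each row of tree_weights must sum to 0 or 1")

# Valid: each row sums to 1 or 0, so no error is raised
validate_tree_weights(np.array([[0.5, 0.5], [0.0, 0.0]]))
```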
R/helpers.R (Outdated)

```r
)

# ---------------------------------------------------------
# unweighted (vector with 1/n_trees for each tree)
```
I'm going to standardize the comments to the `Name: Description` format once review is done.

And this comment ahead of the R function is outdated.
jeancochrane
left a comment
This is great, thanks, you two! Some small comments below, but nothing I'm super concerned about overall. Once the final code changes are in, I'm going to kick off a few test runs to make sure each algorithm works.
@wagnerlmichael are you down to take a stab at extending the Python tests to confirm these new changes work? A few cases I think we should test:

- Add additional parameterized test cases to `test_get_comps` to make sure that the weights indexing works as expected when the weights are a vector instead of a matrix
- Add additional parameterized test cases to `test_get_comps_raises_on_invalid_inputs` to make sure we properly raise errors in the 1-D case
Pytest parameterized tests can be a bit tricky if you're not familiar with them, so let me know if you want any help figuring out how to do this.
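In case it helps, a parameterized test for the weight-shape cases might look roughly like this. The `get_comps` stub below is a hypothetical stand-in (so the example runs in isolation), and the test names and signatures are assumptions, not the real test module:

```python
import numpy as np
import pytest

# Hypothetical stand-in for the real get_comps, included only so the
# parameterization pattern below is self-contained and runnable
def get_comps(leaf_nodes, training_leaf_nodes, weights):
    weights = np.asarray(weights)
    if weights.ndim not in (1, 2):
        raise ValueError(f"weights must be 1-D or 2-D, got {weights.ndim}-D")
    return weights.ndim

@pytest.mark.parametrize(
    "weights,expected_ndim",
    [
        # Vector weights: unweighted / prediction_variance style
        (np.full(3, 1.0 / 3), 1),
        # Matrix weights: error_reduction style
        (np.full((2, 3), 1.0 / 3), 2),
    ],
)
def test_get_comps_weight_shapes(weights, expected_ndim):
    assert get_comps(None, None, weights) == expected_ndim
```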
Also, I noticed that the GitHub workflow that runs the Python tests was disabled, so we're not running these tests as part of our CI checks. I just enabled that workflow again, so hopefully the next time you push a commit, GitHub will run your tests; if that doesn't happen, let me know and we can continue troubleshooting.
Yes, sounds good! I'll add some tests
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
@wagnerlmichael Tiny little nit I just noticed while testing -- since we're adding a new parameter to params.yaml, we should make sure we save it to the model.metadata output table in the finalize stage:
model-res-avm/pipeline/05-finalize.R
Lines 94 to 95 in 64a3b60
I added it, do we need documentation for it anywhere?
Great question -- we probably should document the fields in
| f"(n_comparisons, n_trees), got {weights.ndim}-D" | ||
| ) | ||
|
|
||
| # Avoid editing the df in-place |
Would you add more extensive documentation about the reason for adding this here? @jeancochrane
Yeah, I think so. I actually don't even understand why we need this. Do we mutate the observation dataframe later on?
In the following chunk, when `get_comps` runs:

```python
# Test with matrix weights (error_reduction style)
tree_weights_matrix = np.asarray(
    [np.random.dirichlet(np.ones(num_trees)) for _ in range(num_comparisons)]
)
start = time.time()
get_comps(leaf_nodes, training_leaf_nodes, tree_weights_matrix)
end = time.time()
print(f"get_comps (matrix weights) runtime: {end - start}s")
```

this code, which creates (in-place) a new column in the `observation_df` within the function, also edits the `leaf_nodes` data frame outside the scope of the function:
```python
# Chunk the observations so that the script can periodically report progress
observation_df["chunk"] = pd.cut(
    observation_df.index, bins=num_chunks, labels=False
)
```

such that when we finish the matrix test and move on to the vector test, the `leaf_nodes` object's column count has increased from 500 to 501, which causes our value error tests to catch a dimension mismatch:
```python
# Test with vector weights (unweighted / prediction_variance style)
tree_weights_vector = np.random.dirichlet(np.ones(num_trees))
start = time.time()
get_comps(leaf_nodes, training_leaf_nodes, tree_weights_vector)
end = time.time()
print(f"get_comps (vector weights) runtime: {end - start}s")
```

Here is a reproducible, isolated example that I think replicates the behaviour:
```python
import pandas as pd
import numpy as np

# Create toy dataframes
leaf_nodes = pd.DataFrame(np.random.randint(0, 10, size=[5, 3]))
print("Before:", leaf_nodes.shape)  # (5, 3)

def add_chunk_column(observation_df):
    observation_df["chunk"] = [0, 0, 1, 1, 1]

add_chunk_column(leaf_nodes)
print("After:", leaf_nodes.shape)  # (5, 4)
print(leaf_nodes)
```

Does this make sense? I feel like pandas in-place trickiness always gets me
Ahh right! Thanks for the clear explanation. I see now that the mutation happens literally on the next line lol, my bad for missing it 🤦🏻♀️ Since we perform the mutation immediately after this copy operation, I don't actually think we need to document the decision any more thoroughly than this.
No prob! Got it, sounds good
jeancochrane
left a comment
This is good to go. Nice work you two!
Over the last year, we tested a number of algorithm variations for comps. This PR centralizes them all and allows us to choose which methodology we run through the `params.yaml` configuration file.

I've done moderate testing locally for all 4 methods and `extract_tree_weights` runs, producing comps that pass the eyeball test.

The four options:

- `unweighted`
- `unweighted_with_error_reduction`
- `error_reduction`
- `prediction_variance`

Closes #405
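For readers skimming later: the methodology is selected via `params.yaml`. The excerpt below sketches the shape of the config, though the key name and nesting here are my guess, not copied from the repo:

```yaml
# Hypothetical excerpt -- key names are illustrative
comp:
  # One of: unweighted, unweighted_with_error_reduction,
  # error_reduction, prediction_variance
  weight_type: error_reduction
```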