Consolidate significant sales qc analysis doc#151
Conversation
| description: | ||
| - "Unweighted" | ||
| - "Unweighted (error-reducing trees only)" | ||
| - "Error reduction" |
There was a problem hiding this comment.
Error Reduction and Error reduction (random) appear to be identical.
There was a problem hiding this comment.
Actually it appears that there are differences for multi-card properties only and appear to be inversed.
billy <- dbGetQuery(
conn,
glue("
select
*
from model.comp
where run_id = '2025-04-25-fancy-free-billy'
")
)
bowen <- dbGetQuery(
conn,
glue("
select
*
from model.comp
where run_id = '2025-03-26-vibrant-bowen'
")
)
data <- rbind(billy, bowen)
comp_cols <- paste0("comp_pin_", 1:5)
diffs <- data %>%
group_by(pin, card) %>%
mutate(
across(all_of(comp_cols),
~n_distinct(.x, na.rm = TRUE) > 1,
.names = "diff_{col}")
) %>%
filter(if_any(starts_with("diff_"), ~ .x)) %>%
ungroup()
There was a problem hiding this comment.
[Suggestion, required] Oops, you're right! I think I just misread the run descriptions for two of these runs. Instead of 2025-03-26-vibrant-bowen, we want to use 2025-02-11-charming-eric for the "Error reduction (semi-random)" run.
| - "Error reduction" | ||
| - "Error reduction (semi-random)" | ||
| - "Prediction variance" | ||
| keep_top_n_comps: 5 |
There was a problem hiding this comment.
Moved these to params even though it wasn't specified since it seems better than hardcoding the first line.
| } | ||
| ) | ||
|
|
||
| comps <- map(data, "comps") %>% |
There was a problem hiding this comment.
I tried filtering outputs by triad, but it resulted in NA values, I'm presuming from the preds_pin preds_card, etc. It keeps in the filtering around line 387.
There was a problem hiding this comment.
[Thought, non-blocking] I think it's expected that we would see null values if we tried to filter the comps dataframe by triad, because it's possible for comps to be in different triads than their target properties. However, all of our targets should be in the reassessment triad, so I would expect that filtering by triad should work for preds_pin and preds_card. That being said, we filter targets by triad on line 385 below, so there's no particular reason to do it here unless we're facing memory problems and want to be even more strict about our memory use.
| {agg_log_char_sql('avg', 'char_land_sf', floor = 1)}, | ||
| stddev(cast(char_yrblt as double)) as agg_stddev_yrblt, | ||
| {agg_log_char_sql('stddev', 'char_bldg_sf', floor = 1)}, | ||
| {agg_log_char_sql('stddev', 'char_land_sf', floor = 1)} |
There was a problem hiding this comment.
Removes all beds and baths from entirity of the report.
| ) | ||
| ) %>% | ||
| # Filter for only cards in the selected tri | ||
| filter(town_get_triad(target_township_code) == triad) |
There was a problem hiding this comment.
for some reason prefixing this with ccao:: doesn't work. Don't know if that has to do with positron, but just flagging it.
There was a problem hiding this comment.
[Question, non-blocking] Hmm, we shouldn't need the ccao:: prefix because we explicitly load the package via a library() call on line 39, but it should still work with the prefix. It seems to work fine when I mess around with it locally. If you're interested in debugging this, can you share the error message you're seeing? I don't think it's a huge deal either way though, because leaving out the prefix here is perfectly acceptable (and is in fact the idiomatic thing to do given that we're explicitly loading the package).
There was a problem hiding this comment.
I ran it again and it works for me too. The only thing I can think of is that I think renv::snapshot may have updated the package and it was between two iterations since I believe there were some modifications in it recently.
| # Take a sample of target properties with sales to plot using plotly. Sample | ||
| # because using all properties makes the plots too large to render | ||
| # Select the same sample for all runs | ||
| sample_pins <- comps_by_pin_sales_agg %>% |
There was a problem hiding this comment.
Use the same sample for each run.
There was a problem hiding this comment.
I checked the card numbers, there are 7million card 1's and just 10,000 other card values. I don't know if we want to have them joined based on card or filter multi-cards out or just leave it as is since it seems like a tiny amount.
There was a problem hiding this comment.
[Thought, non-blocking] That's an interesting point, though I say we leave them in for now for the sake of convenience.
| "Target Sale Price", | ||
| "Avg. Comp Sale Price", | ||
| "Class", | ||
| id_prefix = "sale_price_class" |
There was a problem hiding this comment.
id_prefix is used to separatre the html tags so they don't interact with each other. Don't exactly understand it, but it makes it work.
There was a problem hiding this comment.
[Thought, non-blocking] I think the reason that the IDs are important is that Bootstrap uses element IDs to determine which elements a tab should control, so if you reuse IDs, the tabs won't switch properly. Not 100% confident in that answer, but either way, setting different IDs for different tabsets is a good idea!
jeancochrane
left a comment
There was a problem hiding this comment.
Great work here! A couple small suggestions below, but nothing serious.
| ) | ||
|
|
||
| s3_bucket <- "s3://ccao-model-results-us-east-1" | ||
| run_ids <- params$run_id |
There was a problem hiding this comment.
[Nitpick, optional] Doesn't seem like we're using this variable, so we might as well scrap it for the sake of simplicity:
| run_ids <- params$run_id |
| ) | ||
| ) %>% | ||
| # Filter for only cards in the selected tri | ||
| filter(town_get_triad(target_township_code) == triad) |
There was a problem hiding this comment.
[Question, non-blocking] Hmm, we shouldn't need the ccao:: prefix because we explicitly load the package via a library() call on line 39, but it should still work with the prefix. It seems to work fine when I mess around with it locally. If you're interested in debugging this, can you share the error message you're seeing? I don't think it's a huge deal either way though, because leaving out the prefix here is perfectly acceptable (and is in fact the idiomatic thing to do given that we're explicitly loading the package).
| } | ||
| ) | ||
|
|
||
| comps <- map(data, "comps") %>% |
There was a problem hiding this comment.
[Thought, non-blocking] I think it's expected that we would see null values if we tried to filter the comps dataframe by triad, because it's possible for comps to be in different triads than their target properties. However, all of our targets should be in the reassessment triad, so I would expect that filtering by triad should work for preds_pin and preds_card. That being said, we filter targets by triad on line 385 below, so there's no particular reason to do it here unless we're facing memory problems and want to be even more strict about our memory use.
| # Take a sample of target properties with sales to plot using plotly. Sample | ||
| # because using all properties makes the plots too large to render | ||
| # Select the same sample for all runs | ||
| sample_pins <- comps_by_pin_sales_agg %>% |
There was a problem hiding this comment.
[Thought, non-blocking] That's an interesting point, though I say we leave them in for now for the sake of convenience.
| "Target Sale Price", | ||
| "Avg. Comp Sale Price", | ||
| "Class", | ||
| id_prefix = "sale_price_class" |
There was a problem hiding this comment.
[Thought, non-blocking] I think the reason that the IDs are important is that Bootstrap uses element IDs to determine which elements a tab should control, so if you reuse IDs, the tabs won't switch properly. Not 100% confident in that answer, but either way, setting different IDs for different tabsets is a good idea!
| description: | ||
| - "Unweighted" | ||
| - "Unweighted (error-reducing trees only)" | ||
| - "Error reduction" |
There was a problem hiding this comment.
[Suggestion, required] Oops, you're right! I think I just misread the run descriptions for two of these runs. Instead of 2025-03-26-vibrant-bowen, we want to use 2025-02-11-charming-eric for the "Error reduction (semi-random)" run.
| } | ||
| ``` | ||
|
|
||
| ## Topline Aggregate Stats |
There was a problem hiding this comment.
[Nitpick, required] In these charts, could we use a more descriptive name than description for the header on the column that displays the algorithm that the run used? Even just Algorithm would be clearer in my opinion.
| @@ -0,0 +1,1132 @@ | |||
| --- | |||
| title: "Algorithm-Comparison" | |||
There was a problem hiding this comment.
[Nitpick, required] Let's make it clear up front what kind of algorithm we're comparing:
| title: "Algorithm-Comparison" | |
| title: "Comps Algorithm Comparison" |
| } | ||
| ``` | ||
|
|
||
| ## Topline Aggregate Stats |
There was a problem hiding this comment.
[Suggestion, optional] Since these charts require horizontal scrolling, it would be super handy to freeze the leftmost identifier columns so that we can keep track of which rows we're comparing. For each chart, I think those columns should be:
- Overall
- Algorithm (AKA description)
- By Township
- Triad
- Township
- Algorithm
- By Class
- Class
- Algorithm
This modifies the existing sales_qc doc to compare multiple runs. It is currently set to use 2025 run id's, but that can be modified to use the 2024 runs by uncommenting out the params.
It groups the output table by run_id and creates tabsets for the charts of the characteristics. These are modified to use the same sample of 2000 pins for each run_id.