Skip to content

Consolidate significant sales qc analysis doc#151

Merged
Damonamajor merged 19 commits intomainfrom
146-consolidate-significant-sales-qc-analysis-doc
Dec 5, 2025
Merged

Consolidate significant sales qc analysis doc#151
Damonamajor merged 19 commits intomainfrom
146-consolidate-significant-sales-qc-analysis-doc

Conversation

@Damonamajor
Copy link
Contributor

@Damonamajor Damonamajor commented Nov 26, 2025

This modifies the existing sales_qc doc to compare multiple runs. It is currently set to use 2025 run id's, but that can be modified to use the 2024 runs by uncommenting out the params.

It groups the output table by run_id and creates tabsets for the charts of the characteristics. These are modified to use the same sample of 2000 pins for each run_id.

@Damonamajor Damonamajor linked an issue Nov 26, 2025 that may be closed by this pull request
description:
- "Unweighted"
- "Unweighted (error-reducing trees only)"
- "Error reduction"
Copy link
Contributor Author

@Damonamajor Damonamajor Nov 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error Reduction and Error reduction (random) appear to be identical.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it appears that there are differences for multi-card properties only and appear to be inversed.


billy <- dbGetQuery(
  conn,
  glue("
  select
   *
  from model.comp
  where run_id = '2025-04-25-fancy-free-billy'
  ")
)


bowen <- dbGetQuery(
  conn,
  glue("
  select
   *
  from model.comp
  where run_id = '2025-03-26-vibrant-bowen'
  ")
)

data <- rbind(billy, bowen)

comp_cols <- paste0("comp_pin_", 1:5)

diffs <- data %>%
  group_by(pin, card) %>%
  mutate(
    across(all_of(comp_cols),
           ~n_distinct(.x, na.rm = TRUE) > 1,
           .names = "diff_{col}")
  ) %>%
  filter(if_any(starts_with("diff_"), ~ .x)) %>%
  ungroup()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion, required] Oops, you're right! I think I just misread the run descriptions for two of these runs. Instead of 2025-03-26-vibrant-bowen, we want to use 2025-02-11-charming-eric for the "Error reduction (semi-random)" run.

- "Error reduction"
- "Error reduction (semi-random)"
- "Prediction variance"
keep_top_n_comps: 5
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved these to params even though it wasn't specified since it seems better than hardcoding the first line.

}
)

comps <- map(data, "comps") %>%
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried filtering outputs by triad, but it resulted in NA values, I'm presuming from the preds_pin preds_card, etc. It keeps in the filtering around line 387.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Thought, non-blocking] I think it's expected that we would see null values if we tried to filter the comps dataframe by triad, because it's possible for comps to be in different triads than their target properties. However, all of our targets should be in the reassessment triad, so I would expect that filtering by triad should work for preds_pin and preds_card. That being said, we filter targets by triad on line 385 below, so there's no particular reason to do it here unless we're facing memory problems and want to be even more strict about our memory use.

{agg_log_char_sql('avg', 'char_land_sf', floor = 1)},
stddev(cast(char_yrblt as double)) as agg_stddev_yrblt,
{agg_log_char_sql('stddev', 'char_bldg_sf', floor = 1)},
{agg_log_char_sql('stddev', 'char_land_sf', floor = 1)}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removes all beds and baths from entirity of the report.

)
) %>%
# Filter for only cards in the selected tri
filter(town_get_triad(target_township_code) == triad)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for some reason prefixing this with ccao:: doesn't work. Don't know if that has to do with positron, but just flagging it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Question, non-blocking] Hmm, we shouldn't need the ccao:: prefix because we explicitly load the package via a library() call on line 39, but it should still work with the prefix. It seems to work fine when I mess around with it locally. If you're interested in debugging this, can you share the error message you're seeing? I don't think it's a huge deal either way though, because leaving out the prefix here is perfectly acceptable (and is in fact the idiomatic thing to do given that we're explicitly loading the package).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran it again and it works for me too. The only thing I can think of is that I think renv::snapshot may have updated the package and it was between two iterations since I believe there were some modifications in it recently.

# Take a sample of target properties with sales to plot using plotly. Sample
# because using all properties makes the plots too large to render
# Select the same sample for all runs
sample_pins <- comps_by_pin_sales_agg %>%
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the same sample for each run.

Copy link
Contributor Author

@Damonamajor Damonamajor Dec 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the card numbers, there are 7million card 1's and just 10,000 other card values. I don't know if we want to have them joined based on card or filter multi-cards out or just leave it as is since it seems like a tiny amount.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Thought, non-blocking] That's an interesting point, though I say we leave them in for now for the sake of convenience.

"Target Sale Price",
"Avg. Comp Sale Price",
"Class",
id_prefix = "sale_price_class"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

id_prefix is used to separatre the html tags so they don't interact with each other. Don't exactly understand it, but it makes it work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Thought, non-blocking] I think the reason that the IDs are important is that Bootstrap uses element IDs to determine which elements a tab should control, so if you reuse IDs, the tabs won't switch properly. Not 100% confident in that answer, but either way, setting different IDs for different tabsets is a good idea!

@Damonamajor Damonamajor self-assigned this Nov 26, 2025
@Damonamajor Damonamajor marked this pull request as ready for review December 1, 2025 18:04
Copy link
Member

@jeancochrane jeancochrane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work here! A couple small suggestions below, but nothing serious.

)

s3_bucket <- "s3://ccao-model-results-us-east-1"
run_ids <- params$run_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nitpick, optional] Doesn't seem like we're using this variable, so we might as well scrap it for the sake of simplicity:

Suggested change
run_ids <- params$run_id

)
) %>%
# Filter for only cards in the selected tri
filter(town_get_triad(target_township_code) == triad)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Question, non-blocking] Hmm, we shouldn't need the ccao:: prefix because we explicitly load the package via a library() call on line 39, but it should still work with the prefix. It seems to work fine when I mess around with it locally. If you're interested in debugging this, can you share the error message you're seeing? I don't think it's a huge deal either way though, because leaving out the prefix here is perfectly acceptable (and is in fact the idiomatic thing to do given that we're explicitly loading the package).

}
)

comps <- map(data, "comps") %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Thought, non-blocking] I think it's expected that we would see null values if we tried to filter the comps dataframe by triad, because it's possible for comps to be in different triads than their target properties. However, all of our targets should be in the reassessment triad, so I would expect that filtering by triad should work for preds_pin and preds_card. That being said, we filter targets by triad on line 385 below, so there's no particular reason to do it here unless we're facing memory problems and want to be even more strict about our memory use.

# Take a sample of target properties with sales to plot using plotly. Sample
# because using all properties makes the plots too large to render
# Select the same sample for all runs
sample_pins <- comps_by_pin_sales_agg %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Thought, non-blocking] That's an interesting point, though I say we leave them in for now for the sake of convenience.

"Target Sale Price",
"Avg. Comp Sale Price",
"Class",
id_prefix = "sale_price_class"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Thought, non-blocking] I think the reason that the IDs are important is that Bootstrap uses element IDs to determine which elements a tab should control, so if you reuse IDs, the tabs won't switch properly. Not 100% confident in that answer, but either way, setting different IDs for different tabsets is a good idea!

description:
- "Unweighted"
- "Unweighted (error-reducing trees only)"
- "Error reduction"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion, required] Oops, you're right! I think I just misread the run descriptions for two of these runs. Instead of 2025-03-26-vibrant-bowen, we want to use 2025-02-11-charming-eric for the "Error reduction (semi-random)" run.

}
```

## Topline Aggregate Stats
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nitpick, required] In these charts, could we use a more descriptive name than description for the header on the column that displays the algorithm that the run used? Even just Algorithm would be clearer in my opinion.

@@ -0,0 +1,1132 @@
---
title: "Algorithm-Comparison"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nitpick, required] Let's make it clear up front what kind of algorithm we're comparing:

Suggested change
title: "Algorithm-Comparison"
title: "Comps Algorithm Comparison"

}
```

## Topline Aggregate Stats
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion, optional] Since these charts require horizontal scrolling, it would be super handy to freeze the leftmost identifier columns so that we can keep track of which rows we're comparing. For each chart, I think those columns should be:

  • Overall
    • Algorithm (AKA description)
  • By Township
    • Triad
    • Township
    • Algorithm
  • By Class
    • Class
    • Algorithm

@Damonamajor Damonamajor merged commit 9bb395b into main Dec 5, 2025
1 check passed
@Damonamajor Damonamajor deleted the 146-consolidate-significant-sales-qc-analysis-doc branch December 5, 2025 16:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Consolidate significant sales QC analysis doc

2 participants