Skip to content

Commit 0fba962

Browse files
committed
xg update
1 parent f94dfb5 commit 0fba962

15 files changed

Lines changed: 382 additions & 11 deletions

DESCRIPTION

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
Package: hockeyR
22
Title: Collect and Clean Hockey Stats
3-
Version: 1.0.0
3+
Version: 1.1.0
44
Authors@R:
55
person(given = "Daniel",
66
family = "Morse",
@@ -33,7 +33,9 @@ Imports:
3333
tidyr,
3434
tidyselect,
3535
utils,
36-
zoo
36+
zoo,
37+
stats,
38+
xgboost
3739
Suggests:
3840
ggimage,
3941
ggplot2,

NAMESPACE

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,7 @@
11
# Generated by roxygen2: do not edit by hand
22

33
export("%>%")
4+
export(calculate_xg)
45
export(get_current_rosters)
56
export(get_draft_class)
67
export(get_game_ids)
@@ -19,4 +20,5 @@ export(scrape_day)
1920
export(scrape_game)
2021
export(scrape_season)
2122
importFrom(magrittr,"%>%")
23+
importFrom(stats,predict)
2224
importFrom(utils,type.convert)

NEWS.md

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,14 @@
1-
# hockeyR 1.0.0.1000
1+
# hockeyR 1.1.0
22

3+
## New addition:
4+
* Play-by-play data loaded through `load_pbp` includes new column for expected goals
5+
* Details on & code to create the hockeyR expected goals model can be found [here](https://github.com/danmorse314/hockeyR-models)
6+
* The `scrape_game` function has been adjusted to automatically add expected goals to the output
7+
8+
## New function:
9+
* `calculate_xg` adds expected goals column to pbp data (used inside `scrape_game`, not necessary to use this to get expected goal values)
10+
11+
## Fixes:
312
* Changed the `player_id` column in `get_draft_class` to `prospect_id` - proper NHL `player_id` column only returns with `player_details` set to `TRUE`
413

514
# hockeyR 1.0.0

R/calculate_xg.R

Lines changed: 75 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,75 @@
1+
#' Calculate hockeyR expected goals (xG)
2+
#'
3+
#' @description Uses the hockeyR expected goals model to calculate xG for any pbp data frame generated by hockeyR
4+
#'
5+
#' @param pbp A play-by-play data frame, previously returned by hockeyR::scrape_game
6+
#'
7+
#' @return The original supplied play-by-play data with a column for expected goals appended
8+
#' @export
9+
#'
10+
#' @examples
11+
#' \dontrun{
12+
#' pbp <- load_pbp(2022) |> dplyr::select(-xg)
13+
#' pbp_preds <- calculate_xg(pbp)
14+
#' }
15+
calculate_xg <- function(pbp){
16+
17+
# get model features
18+
model_data <- prepare_xg_data(pbp)
19+
20+
# make 5v5 predictions
21+
preds_5v5 <- stats::predict(
22+
xg_model_5v5,
23+
xgboost::xgb.DMatrix(
24+
data = model_data |>
25+
dplyr::filter(strength_state == "5v5") |>
26+
dplyr::select(dplyr::all_of(xg_model_5v5$feature_names)) |>
27+
data.matrix(),
28+
label = model_data |>
29+
dplyr::filter(strength_state == "5v5") |>
30+
dplyr::select(goal) |>
31+
data.matrix()
32+
)
33+
) |>
34+
dplyr::as_tibble() |>
35+
dplyr::rename(xg = value) |>
36+
dplyr::bind_cols(
37+
dplyr::select(
38+
dplyr::filter(model_data, strength_state == "5v5"),
39+
event_id)
40+
)
41+
42+
# make ST predictions
43+
preds_st <- stats::predict(
44+
xg_model_st,
45+
xgboost::xgb.DMatrix(
46+
data = model_data |>
47+
dplyr::filter(strength_state != "5v5") |>
48+
dplyr::select(dplyr::all_of(xg_model_st$feature_names)) |>
49+
data.matrix(),
50+
label = model_data |>
51+
dplyr::filter(strength_state != "5v5") |>
52+
dplyr::select(goal) |>
53+
data.matrix()
54+
)
55+
) |>
56+
dplyr::as_tibble() |>
57+
dplyr::rename(xg = value) |>
58+
dplyr::bind_cols(
59+
dplyr::select(
60+
dplyr::filter(model_data, strength_state != "5v5"),
61+
event_id)
62+
)
63+
64+
# combine
65+
preds <- dplyr::bind_rows(preds_5v5, preds_st) |>
66+
# attach xg column to original pbp data
67+
dplyr::right_join(pbp, by = "event_id") |>
68+
# fix penalty shots
69+
dplyr::mutate(xg = ifelse(
70+
secondary_type != "Penalty Shot" | is.na(secondary_type), xg, xg_model_ps
71+
)) |>
72+
dplyr::arrange(event_id)
73+
74+
return(preds)
75+
}

R/globals.R

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -34,6 +34,10 @@ utils::globalVariables(
3434
"amateurLeague","amateurTeam","full_name","full_team_name","jersey_number",
3535
"person","picks","prospect","prospectCategory","ranks","roundNumber",
3636
"rounds","shortName","year","duration_seconds","home_abbreviation",
37-
"away_abbreviation","home_id","away_id","prospect_id","nhl_player_id"
37+
"away_abbreviation","home_id","away_id","prospect_id","nhl_player_id",
38+
"away_skaters","empty_net","event_team_skaters","event_zone",
39+
"feature","goal","home_skaters","last_event_team","last_event_type",
40+
"last_event_zone","last_value","last_x","last_y","na","opponent_team_skaters",
41+
"period_type","shot_type","time_since_last","type_value","val","value","xg"
3842
)
3943
)

R/hockeyR-package.R

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
#' @keywords internal
2+
"_PACKAGE"
3+
4+
## usethis namespace: start
5+
#' @importFrom stats predict
6+
## usethis namespace: end
7+
NULL

R/prepare_xg_data.R

Lines changed: 120 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,120 @@
1+
#' Prepare xG data
2+
#'
3+
#' @description Helper function to prepare hockeyR pbp data for xG calculations
4+
#'
5+
#' @param x A play-by-play data frame generated by hockeyR before xG is calculated
6+
#'
7+
#' @return A tibble; pbp data with xG model mutations along with identifiers for game and strength state
8+
#'
9+
#' @examples
10+
#' \dontrun{
11+
#' pbp <- load_pbp(2022) |> dplyr::select(-xg)
12+
#' model_data <- prepare_xg_data(pbp)
13+
#' }
14+
prepare_xg_data <- function(x){
15+
16+
model_df <- x |>
17+
# filter out shootouts
18+
dplyr::filter(period_type != "SHOOTOUT") |>
19+
# remove penalty shots
20+
dplyr::filter(secondary_type != "Penalty Shot" | is.na(secondary_type)) |>
21+
# remove shift change events, which were excluded from model
22+
dplyr::filter(event_type != "CHANGE") |>
23+
# add model feature variables
24+
dplyr::group_by(game_id) |>
25+
dplyr::mutate(
26+
last_event_type = dplyr::lag(event_type),
27+
last_event_team = dplyr::lag(event_team),
28+
time_since_last = game_seconds - dplyr::lag(game_seconds),
29+
last_x = dplyr::lag(x),
30+
last_y = dplyr::lag(y),
31+
distance_from_last = round(sqrt(((y - last_y)^2) + ((x - last_x)^2)),1),
32+
event_zone = dplyr::case_when(
33+
x >= -25 & x <= 25 ~ "NZ",
34+
(x_fixed < -25 & event_team == home_name) |
35+
(x_fixed > 25 & event_team == away_name) ~ "DZ",
36+
(x_fixed > 25 & event_team == home_name) |
37+
(x_fixed < -25 & event_team == away_name) ~ "OZ"
38+
),
39+
last_event_zone = dplyr::lag(event_zone)
40+
) |>
41+
dplyr::ungroup() |>
42+
# filter to only unblocked shots
43+
dplyr::filter(event_type %in% c("SHOT","MISSED_SHOT","GOAL")) |>
44+
# get rid off oddball last_events
45+
# ie "EARLY_INTERMISSION_START"
46+
dplyr::filter(
47+
last_event_type %in% c(
48+
"FACEOFF","GIVEAWAY","TAKEAWAY","BLOCKED_SHOT","HIT",
49+
"MISSED_SHOT","SHOT","STOP","PENALTY","GOAL"
50+
)
51+
) |>
52+
# add more feature variables
53+
dplyr::mutate(
54+
era_2011_2013 = ifelse(
55+
season %in% c("20102011","20112012","20122013"),
56+
1, 0
57+
),
58+
era_2014_2018 = ifelse(
59+
season %in% c("20132014","20142015","20152016","20162017","20172018"),
60+
1, 0
61+
),
62+
era_2019_2021 = ifelse(
63+
season %in% c("20182019","20192020","20202021"),
64+
1, 0
65+
),
66+
era_2022_on = ifelse(
67+
as.numeric(season) > 20202021, 1, 0
68+
),
69+
# these are only for the ST model
70+
event_team_skaters = ifelse(event_team == home_name, home_skaters, away_skaters),
71+
opponent_team_skaters = ifelse(event_team == home_name, away_skaters, home_skaters),
72+
total_skaters_on = event_team_skaters + opponent_team_skaters,
73+
event_team_advantage = event_team_skaters - opponent_team_skaters,
74+
# these are in 5v5 model
75+
rebound = ifelse(last_event_type %in% c("SHOT","MISSED_SHOT","GOAL") & time_since_last <= 2, 1, 0),
76+
rush = ifelse(last_event_zone %in% c("NZ","DZ") & time_since_last <= 4, 1, 0),
77+
cross_ice_event = ifelse(
78+
# indicates goalie had to move from one post to the other
79+
last_event_zone == "OZ" &
80+
((last_y > 3 & y < -3) | (last_y < -3 & y > 3)) &
81+
# need some sort of time frame here to indicate shot was quick after goalie had to move
82+
time_since_last <= 2, 1, 0
83+
),
84+
# fix missing empty net vars
85+
empty_net = ifelse(is.na(empty_net) | empty_net == FALSE, FALSE, TRUE),
86+
shot_type = secondary_type,
87+
goal = ifelse(event_type == "GOAL", 1, 0)
88+
) |>
89+
dplyr::select(season, game_id, event_id, strength_state, shot_distance, shot_angle, empty_net, last_event_type:goal) |>
90+
# one-hot encode some categorical vars
91+
dplyr::mutate(type_value = 1, last_value = 1) |>
92+
tidyr::pivot_wider(names_from = shot_type, values_from = type_value, values_fill = 0) |>
93+
tidyr::pivot_wider(
94+
names_from = last_event_type, values_from = last_value, values_fill = 0, names_prefix = "last_"
95+
) |>
96+
janitor::clean_names() |>
97+
dplyr::select(
98+
-last_event_team, -event_zone, -last_event_zone, -event_team_skaters, -opponent_team_skaters
99+
)
100+
101+
if("na" %in% names(model_df)){
102+
model_df <- dplyr::select(model_df, -na)
103+
}
104+
105+
`%not_in%` <- purrr::negate(`%in%`)
106+
107+
missing_feats <- dplyr::tibble(
108+
feature = xg_model_5v5$feature_names
109+
) |>
110+
dplyr::filter(feature %not_in% names(model_df)) |>
111+
dplyr::mutate(val = 0) |>
112+
tidyr::pivot_wider(names_from = feature, values_from = val)
113+
114+
if(length(missing_feats) > 0){
115+
model_df <- dplyr::bind_cols(model_df, missing_feats)
116+
}
117+
118+
return(model_df)
119+
120+
}

R/scrape_game.R

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,7 @@
77
#' @return A tibble containing event-based play-by-play data for an individual
88
#' NHL game. The resulting data will have columns for:
99
#' \describe{
10+
#' \item{xg}{Numeric expected goal value for unblocked shot events}
1011
#' \item{event}{String defining the event}
1112
#' \item{event_type}{String with alternate event definition; in all caps}
1213
#' \item{secondary_type}{String defining secondary event type}
@@ -63,6 +64,7 @@
6364
#' \item{away_goalie}{String name of away goalie on ice}
6465
#' \item{game_id}{Integer value of assigned game ID}
6566
#' \item{event_idx}{Numeric index for event}
67+
#' \item{event_id}{Numeric id for event -- more specified than event_idx}
6668
#' \item{event_player_1_id}{Integer value of the player ID for the primary event player}
6769
#' \item{event_player_1_link}{String value of the NHL.com player link for the primary event player}
6870
#' \item{event_player_1_season_total}{Integer value for the total events for the primary event player this season}
@@ -863,6 +865,20 @@ scrape_game <- function(game_id){
863865

864866
}
865867

868+
# add event_id
869+
pbp_full <- pbp_full |>
870+
dplyr::mutate(
871+
event_idx = stringr::str_pad(event_idx, width = 4, side = "left", pad = 0),
872+
event_id = as.numeric(paste0(game_id,event_idx)),
873+
secondary_type = ifelse(
874+
stringr::str_detect(dplyr::lead(description), "PS -") &
875+
event_type %in% c("SHOT","MISSED_SHOT","GOAL"),
876+
"Penalty Shot", secondary_type
877+
)
878+
)
879+
# add xg
880+
pbp_full <- calculate_xg(pbp_full)
881+
866882
return(pbp_full)
867883

868884
}

R/sysdata.rda

1.08 MB
Binary file not shown.

README.Rmd

Lines changed: 22 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -70,7 +70,28 @@ All variables available in the raw play-by-play data are included, along with a
7070

7171
The `shot_distance` and `shot_angle` are measured in feet and degrees, respectively. The variables `x_fixed` and `y_fixed` are transformations of the `x` and `y` event coordinates such that the home team is always shooting to the right and the away team is always shooting to the left. For full details on the included variables, see the [`scrape_game`](https://github.com/danmorse314/hockeyR/blob/master/R/scrape_game.R) documentation.
7272

73-
As mentioned above, an easy way to create a shot plot is through the [sportyR](https://github.com/sportsdataverse/sportyR) package. You can also use the included `team_colors_logos` data to add color and team logos to your plots.
73+
#### NEW in hockeyR v1.1.0: Expected Goals
74+
75+
As of `hockeyR` v1.1.0, a new column has been added to the play-by-play data: Expected goals! The `hockeyR` package now includes its own public expected goals model, and every unblocked shot in the play-by-play data now has an `xg` value. A full description of the model, including the code used to construct it and the testing results, can be found in the [hockeyR-models](https://github.com/danmorse314/hockeyR-models) repository. Users can now investigate additional statistics, such as player goals above expectation without having to create their own entire model.
76+
77+
```{r xg-example}
78+
pbp %>%
79+
filter(event_type %in% c("SHOT","MISSED_SHOT","GOAL")) %>%
80+
group_by(player = event_player_1_name, id = event_player_1_id) %>%
81+
summarize(
82+
team = last(event_team_abbr),
83+
goals = sum(event_type == "GOAL"),
84+
xg = round(sum(xg, na.rm = TRUE),1),
85+
gax = goals - xg,
86+
.groups = "drop"
87+
) |>
88+
arrange(-xg) |>
89+
slice(1:10)
90+
```
91+
92+
#### Shot Plots
93+
94+
An easy way to create a shot plot is through the [sportyR](https://github.com/sportsdataverse/sportyR) package. You can also use the included `team_colors_logos` data to add color and team logos to your plots.
7495

7596
```{r shot-plot-example}
7697
# get single game

0 commit comments

Comments
 (0)