---
title: "NASA Openscapes 2i2c JupyterHub\nUsage and Costs"
params:
  year_month: "2024-06"
subtitle: "Monthly report for `r format(lubridate::ym(params$year_month), '%B %Y')`"
format: pdf
---

<!--
To render this document with the Quarto CLI, run:
ym="2024-06" # set the year and month parameter
quarto render aws-usage-report/aws-usage-report.qmd -P year_month:$ym --output "aws-usage-report_$ym.pdf"
-->

## Introduction

A key objective of NASA Openscapes is to minimize “the time to science” for researchers, and cloud infrastructure can help shorten this time. We use a 2i2c-managed JupyterHub ("Hub"), which lets us work in the cloud next to NASA Earthdata in AWS US-West-2. The purpose of the JupyterHub is to provide initial, exploratory experiences accessing NASA Earthdata in the cloud. It is not meant to be a long-term solution to support ongoing science work or software development. For those users who decide that working in the cloud is advantageous and want to move there, we support a migration from the Hub to their own environment through Coiled.io, and are working on other "fledging" pathways.

The main costs of running the JupyterHub come from two sources:

1. Compute, using AWS EC2
2. Storage, using AWS EFS for users' home directories

Compute costs scale up and down as the Hub is used; storage costs, however, are
fixed: we pay for "data at rest", with
[ongoing daily costs per GB](https://aws.amazon.com/efs/pricing/) even while the
Hub is not running.
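
As a rough illustration of how data at rest adds up, the snippet below estimates monthly storage cost at an assumed rate of roughly $0.30 per GB-month (an illustrative placeholder only; see the EFS pricing page linked above for current regional rates):

```{r efs-cost-sketch}
#| echo: true
#| eval: false
# Illustrative only: assumed EFS Standard rate of ~$0.30 per GB-month
efs_gb_month_rate <- 0.30
stored_gb <- c(100, 500, 1000)

data.frame(
  stored_gb = stored_gb,
  est_monthly_cost_usd = stored_gb * efs_gb_month_rate
)
```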

Storing large amounts of data in the cloud can incur significant ongoing costs if not managed well. We are developing [technical strategies and policies](https://nasa-openscapes.github.io/earthdata-cloud-cookbook/policies-admin/data-policies.html) in the Earthdata Cloud Cookbook to reduce storage costs, keeping the Openscapes 2i2c Hub a shared resource for us all to use while also providing reusable strategies for other admins.

This report gives a monthly summary of usage of the Hub and its resources,
tracking metrics on costs and usage of its key components: storage (EFS)
and compute (EC2).

```{r setup}
#| include: false

library(dplyr)
library(kyber)
library(ggplot2)
library(forcats)
library(lubridate)
library(paws)
library(here)
library(glue)
library(patchwork)

knitr::opts_chunk$set(
  echo = FALSE,
  message = FALSE,
  warning = FALSE
)

source(here("R/prometheus-utils.R"))
source(here("R/aws-ce-utils.R"))

# When run interactively (outside `quarto render`), default to reporting on
# the previous month
if (interactive()) {
  params <- list(year_month = format(Sys.Date() %m-% months(1), "%Y-%m"))
}

start_date <- ym(params$year_month)
end_date <- ceiling_date(start_date, unit = "month") - days(1)
reporting_month <- format(start_date, "%B")
reporting_my <- format(start_date, "%B %Y")

cost_explorer <- paws::costexplorer()

theme_set(theme_classic())
```

## Month over month changes

A comparison of monthly costs in the Hub can help us compare usage over time
and identify longer-term patterns. We query the [AWS Cost Explorer API](https://docs.aws.amazon.com/aws-cost-management/latest/APIReference/API_Operations_AWS_Cost_Explorer_Service.html) to explore these costs.
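
Cost Explorer responses come back as nested lists; throughout this report a small helper, `ce_to_df()` (sourced from `R/aws-ce-utils.R`), flattens them into data frames. As a hypothetical sketch of the idea only (the real helper also handles grouped results and multiple metrics, and its internals may differ), an ungrouped monthly response could be flattened like this:

```{r ce-to-df-sketch}
#| echo: true
#| eval: false
# Illustrative sketch: flatten an ungrouped Cost Explorer response into one
# row per time period. The real ce_to_df() lives in R/aws-ce-utils.R.
ce_results_to_df_sketch <- function(res, metric = "UnblendedCost") {
  rows <- lapply(res$ResultsByTime, function(x) {
    data.frame(
      start_date = as.Date(x$TimePeriod$Start),
      end_date   = as.Date(x$TimePeriod$End),
      # Amounts are returned as strings, so convert to numeric
      amount     = as.numeric(x$Total[[metric]]$Amount)
    )
  })
  do.call(rbind, rows)
}
```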

### Total Costs

The following plots show the total monthly cost of all AWS services related to
the Hub, as well as a breakdown of costs by service each month.

```{r total-costs}
# https://www.paws-r-sdk.com/docs/costexplorer_get_cost_and_usage/
# https://docs.aws.amazon.com/aws-cost-management/latest/APIReference/API_GetDimensionValues.html
total_monthly_usage_costs <- cost_explorer$get_cost_and_usage(
  TimePeriod = list(
    Start = ceiling_date(end_date %m-% months(6), unit = "month"),
    End = end_date
  ),
  Granularity = "MONTHLY",
  Filter = list(Dimensions = list(
    Key = "RECORD_TYPE",
    Values = "Usage"
  )),
  Metrics = "UnblendedCost"
) |>
  ce_to_df()

total_monthly_cost <- total_monthly_usage_costs$UnblendedCost[
  total_monthly_usage_costs$start_date == start_date
]

total_monthly_cost_plot <- ggplot(total_monthly_usage_costs) +
  geom_line(aes(x = start_date, y = UnblendedCost)) +
  labs(
    title = glue::glue(
      "The total cost of all AWS Services for running the NASA\n Openscapes 2i2c ",
      "Hub in {reporting_my} was ${round(total_monthly_cost)}"
    ),
    x = "Month",
    y = "Monthly cost ($)"
  )
```

```{r monthly-costs-by-service}
#| fig-height: 7
monthly_costs_by_service <- cost_explorer$get_cost_and_usage(
  TimePeriod = list(
    Start = ceiling_date(end_date %m-% months(6), unit = "month"),
    End = end_date
  ),
  Granularity = "MONTHLY",
  Filter = list(Dimensions = list(
    Key = "RECORD_TYPE",
    Values = "Usage"
  )),
  Metrics = "UnblendedCost",
  GroupBy = list(
    list(
      Type = "DIMENSION",
      Key = "SERVICE"
    )
  )
) |>
  ce_to_df()

monthly_cost_service_summary <- monthly_costs_by_service |>
  ce_categories() |>
  mutate(
    service = fct_reorder(service, UnblendedCost, .fun = mean)
  )

monthly_cost_by_service_plot <- ggplot(
  monthly_cost_service_summary,
  aes(x = start_date, y = UnblendedCost, fill = service)
) +
  geom_col() +
  scale_fill_discrete(type = aws_ce_palette(n_distinct(monthly_cost_service_summary$service))) +
  guides(fill = guide_legend(ncol = 2)) +
  theme(
    legend.position = "bottom",
    legend.title.position = "top",
    legend.text = element_text(size = 8)
  ) +
  labs(
    title = "Monthly cost of AWS Services",
    subtitle = "Largest costs are EC2 compute (blue) and EFS (home directory)\n storage (red)",
    caption = "*The top nine services are shown individually, with any remaining grouped into 'Other'",
    x = "Month",
    y = "Monthly cost ($)",
    fill = "AWS Service"
  )

# Combine plots with patchwork
total_monthly_cost_plot / monthly_cost_by_service_plot
```

### Storage

Managing storage is an effective way to manage long-term costs in the Hub:
data at rest is an ongoing cost, much of which can be avoided by monitoring and
reducing storage of data that is not required.

User home directories are stored on an AWS ["Elastic File System" (EFS)](https://aws.amazon.com/efs/) mount, which is
a relatively expensive option for long-term storage of large files. The
following figure plots the daily total size of data stored in the user home
directories in the Hub over the past six months. The size of the home directories
is directly correlated with the costs for "Amazon Elastic File System" in the
previous chart.
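
Home directory sizes are collected from the Hub's Prometheus server via the `query_prometheus_range()` helper sourced from `R/prometheus-utils.R`. As a hypothetical sketch only (the server URL, function name, and internals below are assumptions, not the real implementation), a range query hits Prometheus's `/api/v1/query_range` endpoint:

```{r prometheus-sketch}
#| echo: true
#| eval: false
library(httr)

# Hypothetical sketch: the real helper is sourced from R/prometheus-utils.R
query_prometheus_range_sketch <- function(query, start_time, end_time, step,
                                          prom_url = "http://localhost:9090") {
  resp <- GET(
    paste0(prom_url, "/api/v1/query_range"),
    query = list(
      query = query,
      start = format(as.POSIXct(start_time), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"),
      end   = format(as.POSIXct(end_time), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"),
      step  = step # seconds between samples
    )
  )
  stop_for_status(resp)
  # Returns a nested list; the real helper reshapes this into a data frame
  content(resp, as = "parsed")
}
```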

```{r monthly-storage}
monthly_size <- query_prometheus_range(
  query = "max(dirsize_total_size_bytes{namespace='prod'})",
  start_time = floor_date(end_date, unit = "months") %m-% months(5),
  end_time = end_date,
  step = 60 * 60 * 24
) |>
  create_range_df(value_name = "size")

monthly_size |>
  ggplot() +
  geom_line(aes(x = date, y = size)) +
  scale_x_datetime(date_breaks = "1 month", date_labels = "%B") +
  labs(
    title = "Total size of user home directories in AWS EFS\nin the main Hub",
    x = "Month",
    y = "Total Size (GB)"
  )
```

## Detailed breakdown for the month of `r reporting_month`

To understand more about usage and costs during the current month,
we can look at daily usage metrics and costs.

### Home directory sizes

The Hub can currently be accessed via two "namespaces": "production"
(or "prod") and "workshop". The production namespace is where participants, such as
NASA mentors and Champions participants, are given medium- to long-term access.
[Access is managed via GitHub](https://nasa-openscapes.github.io/earthdata-cloud-cookbook/policies-admin/add-folks-to-2i2c-github-teams.html) by adding users' GitHub usernames to specific
teams.

The "workshop" namespace is used specifically for large workshops, and access
is granted on the day of the workshop by use of a [shared password](https://nasa-openscapes.github.io/earthdata-cloud-cookbook/policies-admin/leading-workshops.html\#workshop-hub-access-via-shared-password) rather
than GitHub teams. Access is short-term and usually revoked a week after
the workshop, at which point users' home directories are removed.

The following figure shows the total size of home directories by namespace. Note
the different y-axis scales in each panel. The "prod" namespace panel is broken
down by the GitHub team through which users are granted access to the Hub (Long Term Access and NASA Champions 2024).

```{r homedir-size-by-date}
size_by_date <- query_prometheus_range(
  query = "max(dirsize_total_size_bytes) by (directory, namespace)",
  start_time = start_date,
  end_time = end_date,
  step = 60 * 60 * 24
) |>
  create_range_df(value_name = "size") |>
  mutate(
    directory = unsanitize_dir_names(directory)
  )

# list_teams("nasa-openscapes")
# list_teams("nasa-openscapes-workshops")

lt_access_members <- list_team_members(
  team = "LongtermAccess-2i2c",
  org = "nasa-openscapes",
  names_only = TRUE
) |>
  tolower()

champions_members <- list_team_members(
  team = "nasa-champions-2024",
  org = "nasa-openscapes-workshops",
  names_only = TRUE
) |>
  tolower() |>
  setdiff(lt_access_members)

teams <- data.frame(
  team = "NASA Champions 2024",
  user = champions_members
) |>
  bind_rows(
    data.frame(
      team = "Long Term Access",
      user = lt_access_members
    )
  )

# setdiff(champions_members, unique(size_by_date$directory))

size_by_date_by_team <- size_by_date |>
  left_join(
    teams,
    by = join_by(directory == user)
  ) |>
  mutate(
    team = ifelse(namespace == "workshop", "workshop", team),
    directory = fct_reorder(directory, desc(size), .fun = max, .desc = TRUE)
  )

all_dirs_sum_by_date <- size_by_date_by_team |>
  filter(namespace %in% c("prod", "workshop")) |>
  group_by(namespace, date, team) |>
  summarize(total_size_gb = sum(size)) |>
  mutate(
    team = ifelse(is.na(team) & namespace == "prod", "Other", team),
    team = fct_reorder(team, desc(total_size_gb), .fun = max, .desc = TRUE)
  )
```

```{r homedir-size-over-time}
all_dirs_sum_by_date |>
  ggplot(aes(x = date, y = total_size_gb)) +
  geom_area(aes(fill = team)) +
  facet_grid(vars(namespace), scales = "free_y") +
  theme(legend.position = "bottom", legend.title.position = "top") +
  paletteer::scale_fill_paletteer_d(
    "ggpomological::pomological_palette",
    breaks = setdiff(unique(all_dirs_sum_by_date$team), "workshop")
  ) +
  labs(
    title = "Total size of user home directories by access team\nand Hub namespace",
    x = "Date",
    y = "Size (GiB)",
    fill = "GitHub Team (production hub only)"
  )
```

#### Champions cohort

It is also helpful to look more deeply into the Champions cohort to see how
they are using the Hub and how much storage they are using. The following
figure breaks down the home directory size of Champions by user. Usernames
are not displayed, but we can see whether any user is consuming a disproportionate amount
of space. When we see this, we reach out to the user and work with them to reduce their storage, and update the [Cookbook tutorials](https://nasa-openscapes.github.io/earthdata-cloud-cookbook/policies-admin/data-policies.html) as needed.

```{r homedir-size-champions}
size_by_date_by_team |>
  filter(team == "NASA Champions 2024") |>
  ggplot(aes(x = date, y = size, fill = directory)) +
  geom_area() +
  paletteer::scale_fill_paletteer_d("khroma::soil", guide = "none", direction = -1) +
  labs(
    title = "Size of home directories by user for 2024 Champions cohort",
    x = "Date",
    y = "Size (GiB)"
  )
```

### Compute costs and usage

When a user logs into the Hub, they can choose the amount of RAM and number of
CPUs they would like to use, enabling them to scale computing power to
the tasks they are running. More powerful compute resources have higher
[hourly costs](https://aws.amazon.com/ec2/pricing/on-demand/), so it is
important not to choose a powerful instance when it isn't required.

Examining both the usage and the costs of the [EC2 instance types](https://aws.amazon.com/ec2/instance-types/) that users choose
can help us understand users' needs as well as compute costs. This helps us
develop policies and recommendations for Hub compute usage.
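
To make the trade-off concrete, the snippet below compares the two instance families the Hub uses with assumed, approximate on-demand rates (placeholders only; check the EC2 pricing page for current us-west-2 prices):

```{r instance-cost-sketch}
#| echo: true
#| eval: false
# Assumed, approximate on-demand prices for illustration only
instance_prices <- data.frame(
  instance_type = c("r5.xlarge", "r5.4xlarge"),
  vcpus         = c(4, 16),
  hourly_usd    = c(0.252, 1.008)
)

# Cost per vCPU-hour is similar across sizes, so the main lever is matching
# the chosen profile to the work actually being done
transform(instance_prices, usd_per_vcpu_hour = hourly_usd / vcpus)
```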

```{r ec2-costs}
# https://www.paws-r-sdk.com/docs/costexplorer_get_cost_and_usage/
# https://docs.aws.amazon.com/aws-cost-management/latest/APIReference/API_GetDimensionValues.html
# TODO: modify ce_to_df to deal with an arbitrary number of metrics so
# we can do this in one call with `Metrics = list("UnblendedCost", "UsageQuantity")`
# and pass it to ce_to_df() once, rather than joining
ec2_instance_type_costs_usage_res <- cost_explorer$get_cost_and_usage(
  TimePeriod = list(Start = start_date, End = end_date),
  Granularity = "DAILY",
  # A Cost Explorer Expression can hold only one Dimensions block, so multiple
  # dimension filters are combined with `And`
  Filter = list(
    And = list(
      list(Dimensions = list(
        Key = "RECORD_TYPE",
        Values = "Usage"
      )),
      list(Dimensions = list(
        Key = "SERVICE",
        Values = "Amazon Elastic Compute Cloud - Compute"
      ))
    )
  ),
  Metrics = list("UnblendedCost", "UsageQuantity"),
  GroupBy = list(
    list(
      Type = "DIMENSION",
      Key = "SERVICE"
    ),
    list(
      Type = "DIMENSION",
      Key = "INSTANCE_TYPE"
    )
  )
)

# Join costs and usage hours
ec2_instance_type_costs_usage <- ec2_instance_type_costs_usage_res |>
  ce_to_df(metric = "UnblendedCost") |>
  left_join(
    ce_to_df(ec2_instance_type_costs_usage_res, metric = "UsageQuantity"),
    by = c("start_date", "end_date", "service", "instance_type")
  ) |>
  filter(
    instance_type != "NoInstanceType"
  )
```

The following plots show the usage and costs broken down by [instance type](https://aws.amazon.com/ec2/instance-types/). The
compute profiles that users can choose from run on `r5.xlarge` (4 CPUs, 32 GiB memory) or
`r5.4xlarge` (16 CPUs, 128 GiB memory) instances.

Note that during some large workshops, administrators will
choose very large instance types (for example `r5.16xlarge`: 64 CPUs, 512 GiB memory) so they can
provision a small number of nodes with many users per node. This is more
efficient than launching many nodes at once. Other instance types, such as
`m6i.xlarge`, indicate usage of the AWS infrastructure outside of the Hub, mostly
via [Coiled](https://openscapes.org/blog/2023-11-07-coiled-openscapes/).

<!-- TODO: Get workshop dates from the workshop spreadsheet and overlay them on these
charts -->

```{r daily-usage-by-instance}
ec2_usage_data <- ec2_instance_type_costs_usage |>
  mutate(instance_type = fct_reorder(instance_type, UsageQuantity, .fun = sum))

ec2_usage_plot <- ggplot(ec2_usage_data, aes(x = start_date, y = UsageQuantity, fill = instance_type)) +
  geom_col() +
  scale_fill_discrete(type = aws_ce_palette(n_distinct(ec2_usage_data$instance_type))) +
  labs(
    title = "Daily EC2 usage by instance type*",
    x = "Date",
    y = "Usage (hours)",
    fill = "EC2 Instance Type",
    caption = "*Hub resource allocation options up to 3.7 CPUs run on\n 'r5.xlarge' instances, and those with up to 15.6 CPUs\n run on 'r5.4xlarge' instances."
  )
```

```{r daily-cost-by-instance}
ec2_cost_data <- ec2_instance_type_costs_usage |>
  mutate(instance_type = factor(instance_type, levels = levels(ec2_usage_data$instance_type)))

ec2_cost_plot <-
  ggplot(ec2_cost_data, aes(x = start_date, y = UnblendedCost, fill = instance_type)) +
  geom_col() +
  scale_fill_discrete(type = aws_ce_palette(n_distinct(ec2_cost_data$instance_type))) +
  labs(
    title = "Daily EC2 cost by instance type",
    x = "Date",
    y = "Daily Cost ($)",
    fill = "EC2 Instance Type"
  )
```

```{r patchwork-plot-ec2-usage-cost}
ec2_usage_plot / ec2_cost_plot
```

Finally, it is useful to look at the relationship between total compute hours and
total cost by instance type, to see which instance types account for the most
usage and cost, as well as how cost-efficient each one is.

```{r total-usage-vs-cost-by-instance}
ec2_instance_data <- ec2_instance_type_costs_usage |>
  group_by(instance_type) |>
  summarize(
    total_hours = sum(UsageQuantity),
    total_cost = sum(UnblendedCost)
  ) |>
  mutate(instance_type = factor(instance_type, levels = levels(ec2_cost_data$instance_type)))

ggplot(ec2_instance_data, aes(x = total_hours, y = total_cost, colour = instance_type)) +
  geom_point(size = 3) +
  scale_colour_discrete(type = aws_ce_palette(n_distinct(ec2_instance_data$instance_type))) +
  labs(
    title = glue::glue(
      "Total cost vs hours on different EC2 instance types\nin {reporting_my}"
    ),
    x = "Total Hours",
    y = "Total Cost ($)"
  )
```
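
A quick way to read this scatterplot is the implied effective hourly rate per instance type, i.e. total cost divided by total hours. The snippet below is a hypothetical extra check that reuses the `ec2_instance_data` summary from the previous chunk; it is not evaluated as part of the rendered report.

```{r implied-hourly-rate}
#| echo: true
#| eval: false
# Implied effective hourly rate per instance type (total cost / total hours)
ec2_instance_data |>
  mutate(implied_usd_per_hour = total_cost / total_hours) |>
  arrange(desc(implied_usd_per_hour))
```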