Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds scaleUpHealing cron #6390

Closed
wants to merge 28 commits into from
Closed
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
28 commits
Select commit Hold shift + click to select a range
85aed2f
more changes
ZainRizvi Dec 4, 2024
a908ce4
temp changes
ZainRizvi Dec 5, 2024
c3e3b7a
rename columns
ZainRizvi Dec 5, 2024
aabb321
add metrics and don't use sqs queue
Camyll Mar 5, 2025
18fe3a0
set timeout
Camyll Mar 5, 2025
f549fe0
add tests
Camyll Mar 6, 2025
fd8b270
fix tests
Camyll Mar 6, 2025
e0632a4
stash testing changes
Camyll Mar 6, 2025
621c708
working tests for scaleupchron
Camyll Mar 7, 2025
71285d9
Fixing linting errors
jeanschmidt Mar 11, 2025
a4c5895
fix build errorsclear
Camyll Mar 11, 2025
0419e58
complete scaleupchron and lambda tests and add scale-up-chron.json
Camyll Mar 11, 2025
4214cf8
add handling for no response data and move hud query to variable
Camyll Mar 11, 2025
5245d02
Typescript and other small improvements
jeanschmidt Mar 12, 2025
d1a7ce3
Copying permissions I believe are relevant from scaleUp to scaleUpCro…
jeanschmidt Mar 12, 2025
b43e52f
Fixes on terraform plan, permissions from scaleUp, not apply scaleUpC…
jeanschmidt Mar 12, 2025
e6e7e71
Works in production!
jeanschmidt Mar 13, 2025
03b220f
Removed test code
jeanschmidt Mar 13, 2025
79b89ae
[Mobile Benchmark Test] Add os, job_arn, and job_conclusion to artifa…
yangw-dev Mar 12, 2025
4635c2d
add regex filter for cost page (#6393)
wdvr Mar 12, 2025
e2e9454
monsterized failures in grouped view (#6394)
wdvr Mar 12, 2025
c75bc11
[ez][HUD] Fix build artifacts grouping in workflow box (#6395)
clee2000 Mar 12, 2025
c8a0ca7
[Benchmark] Prepare for execuTorch failure handling (#6391)
yangw-dev Mar 13, 2025
4e5c126
Update test matrix for release validations for 2.7 (#6401)
atalman Mar 13, 2025
0935d68
Adds additional tests to getRunnerTypes, simplifies code a bit, adds …
jeanschmidt Mar 13, 2025
084c572
Add ephemeral variants for all runner types, and updates validate_sca…
jeanschmidt Mar 13, 2025
d571353
[Queue Time Historgam] add schema for QueueTimeHistorgam (#6355)
yangw-dev Mar 13, 2025
b1c7c6e
fix test input and make naming consistent
Camyll Mar 13, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -40,7 +40,7 @@ export class Config {
readonly scaleConfigRepo: string;
readonly scaleConfigRepoPath: string;
readonly scaleUpMinQueueTimeMinutes: number;
readonly scaleUpRecordQueueUrl: string | undefined;
readonly scaleUpCronRecordQueueUrl: string | undefined;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeanschmidt do you think we should be consistent in naming with scaleUpChron vs scaleUpCron? If I were trying to understand this code I'd be confused why these are different

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I think we do.

readonly secretsManagerSecretsId: string | undefined;
readonly sSMParamCleanupAgeDays: number;
readonly sSMParamMaxCleanupAllowance: number;
Expand Down Expand Up @@ -96,8 +96,8 @@ export class Config {
this.retryScaleUpRecordDelayS = Number(process.env.RETRY_SCALE_UP_RECORD_DELAY_S || '0');
/* istanbul ignore next */
this.retryScaleUpRecordJitterPct = Number(process.env.RETRY_SCALE_UP_RECORD_JITTER_PCT || '0');
this.retryScaleUpRecordQueueUrl = process.env.RETRY_SCALE_UP_RECORD_QUEUE_URL;
this.scaleUpRecordQueueUrl = process.env.SCALE_UP_RECORD_QUEUE_URL;
this.retryScaleUpRecordQueueUrl = process.env.RETRY_SCALE_UP_CRON_RECORD_QUEUE_URL;
this.scaleUpCronRecordQueueUrl = process.env.SCALE_UP_CRON_HUD_QUERY_URL;
this.scaleUpMinQueueTimeMinutes = process.env.SCALE_UP_MIN_QUEUE_TIME_MINUTES
? Number(process.env.SCALE_UP_MIN_QUEUE_TIME_MINUTES)
: 30;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -913,8 +913,8 @@ export class Metrics {
}

export class ScaleUpMetrics extends Metrics {
constructor() {
super('scaleUp');
constructor(lambdaName: string | undefined = undefined) {
super(lambdaName || 'scaleUp');
}

/* istanbul ignore next */
Expand Down Expand Up @@ -1632,8 +1632,9 @@ export class ScaleDownMetrics extends Metrics {

export class ScaleUpChronMetrics extends ScaleUpMetrics {
constructor() {
super();
super('scaleUpChron');
}

queuedRunnerStats(org: string, runnerType: string, numQueuedJobs: number) {
const dimensions = new Map([
['Org', org],
Expand All @@ -1642,10 +1643,12 @@ export class ScaleUpChronMetrics extends ScaleUpMetrics {
]);
this.addEntry('gh.scaleupchron.queuedRunners', 3, dimensions);
}

queuedRunnerFailure(error: string) {
const dimensions = new Map([['error', error]]);
this.countEntry('gh.scaleupchron.queuedRunners.failure', 1, dimensions);
}

/* istanbul ignore next */
getQueuedJobsEndpointSuccess(ms: number) {
this.countEntry(`gh.calls.total`, 1);
Expand All @@ -1666,17 +1669,20 @@ export class ScaleUpChronMetrics extends ScaleUpMetrics {
this.scaleUpSuccess();
this.countEntry('run.scaleupchron.success');
}

scaleUpInstanceFailureNonRetryable(error: string) {
const dimensions = new Map([['error', error]]);
// should we add more information about this or do we not care since it'll be requeued?
this.countEntry('run.scaleupchron.failure.nonRetryable', 1, dimensions);
}

scaleUpInstanceFailureRetryable(error: string) {
const dimensions = new Map([['error', error]]);

// should we add more information about this or do we not care since it'll be requeued?
this.countEntry('run.scaleupchron.failure.retryable', 1, dimensions);
}

scaleUpInstanceNoOp() {
this.countEntry('run.scaleupchron.noop');
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ import { scaleUp } from './scale-up';

import * as MetricsModule from './metrics';
import { RunnerType } from './runners';
import nock from 'nock';

jest.mock('./runners');
jest.mock('./gh-runners');
Expand Down Expand Up @@ -81,34 +82,36 @@ const runnerTypeInvalid = 'runner_type_invalid';
const baseCfg = {
scaleConfigOrg: 'test_org1',
scaleUpMinQueueTimeMinutes: 30,
scaleUpRecordQueueUrl: 'url',
scaleUpCronRecordQueueUrl: 'url',
} as unknown as Config;

const metrics = new MetricsModule.ScaleUpChronMetrics();
beforeEach(() => {
jest.resetModules();
jest.clearAllMocks();
jest.restoreAllMocks();

nock.disableNetConnect();
});

describe('scaleUpChron', () => {
it('invalid scaleUpRecordQueueUrl', async () => {
it('invalid scaleUpCronRecordQueueUrl', async () => {
const scaleUpChron = jest.requireActual('./scale-up-chron').scaleUpChron;

jest.clearAllMocks();
jest.spyOn(Config, 'Instance', 'get').mockImplementation(
() =>
({
...baseCfg,
scaleUpRecordQueueUrl: null,
scaleUpCronRecordQueueUrl: null,
} as unknown as Config),
);

mocked(getRepo).mockReturnValue({ owner: 'owner', repo: 'repo' });
mocked(getRunnerTypes).mockResolvedValue(new Map([[runnerTypeValid, { is_ephemeral: false } as RunnerType]]));

await expect(scaleUpChron(metrics)).rejects.toThrow(
new Error('scaleUpRecordQueueUrl is not set. Cannot send queued scale up requests'),
new Error('scaleUpCronRecordQueueUrl is not set. Cannot send queued scale up requests'),
);
});

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -11,23 +11,23 @@ export async function scaleUpChron(metrics: ScaleUpChronMetrics): Promise<void>
// 1. Queries for queued runners via HUD
// 2. Polls scale-config to filter the list to ones that are self-hosted by this fleet and
// are ephemeral and nonephemeral
// 3. Sends a SQS request to the scale-up lambda to provision more of those instances
// 3. For each runner queued for longer than the minimum delay, try to scale it up

const scaleConfigRepo = getRepo(Config.Instance.scaleConfigOrg, Config.Instance.scaleConfigRepo);

const validRunnerTypes = await getRunnerTypes(scaleConfigRepo, metrics);

const minAutoScaleupDelayMinutes = Config.Instance.scaleUpMinQueueTimeMinutes;
if (!Config.Instance.scaleUpRecordQueueUrl) {
if (!Config.Instance.scaleUpCronRecordQueueUrl) {
metrics.scaleUpInstanceFailureNonRetryable(
'scaleUpRecordQueueUrl is not set. Cannot send queued scale up requests',
'scaleUpCronRecordQueueUrl is not set. Cannot send queued scale up requests',
);
throw new Error('scaleUpRecordQueueUrl is not set. Cannot send queued scale up requests');
throw new Error('scaleUpCronRecordQueueUrl is not set. Cannot send queued scale up requests');
}
const scaleUpRecordQueueUrl = Config.Instance.scaleUpRecordQueueUrl;
const scaleUpCronRecordQueueUrl = Config.Instance.scaleUpCronRecordQueueUrl;
// Only proactively scale up the jobs that have been queued for longer than normal
// Filter out the queued jobs that are do not correspond to a valid runner type
const queuedJobs = (await getQueuedJobs(metrics, scaleUpRecordQueueUrl))
const queuedJobs = (await getQueuedJobs(metrics, scaleUpCronRecordQueueUrl))
.filter((runner) => {
return (
runner.min_queue_time_minutes >= minAutoScaleupDelayMinutes && runner.org === Config.Instance.scaleConfigOrg
Expand Down Expand Up @@ -76,12 +76,12 @@ interface QueuedJobsForRunner {

export async function getQueuedJobs(
metrics: ScaleUpChronMetrics,
scaleUpRecordQueueUrl: string,
scaleUpCronRecordQueueUrl: string,
): Promise<QueuedJobsForRunner[]> {
// This function queries the HUD for queued runners
// and returns a list of them

const url = scaleUpRecordQueueUrl;
const url = scaleUpCronRecordQueueUrl;

try {
const response = await expBackOff(() => {
Expand All @@ -93,9 +93,25 @@ export async function getQueuedJobs(

// Map the response to the class
if (response && response.data) {
return JSON.parse(response.data) as QueuedJobsForRunner[];
const data = JSON.parse(response.data);
// check if data is valid as QueuedJobsForRunner[]
if (data && Array.isArray(data)) {
return data.filter(
(runner) =>
runner.runner_label &&
runner.org &&
runner.repo &&
typeof runner.num_queued_jobs == 'number' &&
runner.num_queued_jobs > 0 &&
typeof runner.min_queue_time_minutes == 'number' &&
typeof runner.max_queue_time_minutes == 'number',
);
} else {
/* istanbul ignore next */
throw new Error(`Invalid data returned from axios get request with url: ${url} - Not an array`);
}
} else {
throw new Error('No data returned from axios get request with url: ' + url);
throw new Error(`No data returned from axios get request with url: ${url}`);
}
} catch (error) {
metrics.queuedRunnerFailure((error as Error).message);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -47,7 +47,7 @@ resource "aws_lambda_function" "scale_up_chron" {
SCALE_CONFIG_REPO = var.scale_config_repo
SCALE_CONFIG_REPO_PATH = var.scale_config_repo_path
SCALE_UP_MIN_QUEUE_TIME_MINUTES = 30
SCALE_UP_RECORD_QUEUE_URL = var.hud_query_url
SCALE_UP_CRON_HUD_QUERY_URL = var.retry_scale_up_cron_hud_query_url
scale_up_chron_CONFIG = jsonencode(var.idle_config)
SECRETSMANAGER_SECRETS_ID = var.secretsmanager_secrets_id
AWS_REGIONS_TO_VPC_IDS = join(
Expand Down
2 changes: 1 addition & 1 deletion terraform-aws-github-runner/modules/runners/scale-up.tf
Original file line number Diff line number Diff line change
Expand Up @@ -65,7 +65,7 @@ resource "aws_lambda_function" "scale_up" {
REDIS_LOGIN = var.redis_login
RETRY_SCALE_UP_RECORD_DELAY_S = "60"
RETRY_SCALE_UP_RECORD_JITTER_PCT = "0.5"
RETRY_SCALE_UP_RECORD_QUEUE_URL = var.sqs_build_queue_retry.url
RETRY_SCALE_UP_CRON_RECORD_QUEUE_URL = var.sqs_build_queue_retry.url
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeanschmidt, @ZainRizvi suggested we just add the actual URL here instead of using a variable since it own't change. if we leave it as a var, how do we set it?
The URL will be: https://hud.pytorch.org/api/clickhouse/queued_jobs_aggregate?parameters=%5B%5D

RUNNER_EXTRA_LABELS = var.runner_extra_labels
SCALE_CONFIG_ORG = var.scale_config_org
SCALE_CONFIG_REPO = var.scale_config_repo
Expand Down
4 changes: 2 additions & 2 deletions terraform-aws-github-runner/modules/runners/variables.tf
Original file line number Diff line number Diff line change
Expand Up @@ -121,7 +121,7 @@ variable "lambda_timeout_scale_up" {
variable "lambda_timeout_scale_up_chron" {
description = "Time out for the scale up chron lambda in seconds."
type = number
default = 60
default = 900
}

variable "role_permissions_boundary" {
Expand Down Expand Up @@ -319,7 +319,7 @@ variable "min_available_runners" {
type = number
}

variable "hud_query_url" {
variable "retry_scale_up_cron_hud_query_url" {
description = "URL used in scale-up-chron to query HUD for queued jobs"
type = string
default = ""
Expand Down
Loading