Epic
End User Goal
Allow a user to run a question of any size without it timing out.
Overview
Cloud Run is limiting our ability to run questions that take longer than an hour and/or require more powerful hardware. It also locks us into a set of frustrating problems.
Creating a Kueue service backend will:
- Queue questions instead of dropping them if the service is overwhelmed
- Allow us to run questions that take any amount of time (in particular, runs longer than 1 hour)
- Access hardware we can't currently access (e.g. GPUs)
- Access arbitrarily provisioned hardware (CPU, memory, storage etc.)
- Stop pointless question reruns by allowing us to control when we acknowledge question events
- Cancel running questions
- Monitor running questions individually
- Run questions on providers other than Google (i.e. on any Kubernetes cluster)
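To illustrate how a question might map onto Kueue, here is a minimal sketch that builds a Kubernetes Job manifest for one question. It assumes each question runs as a single `batch/v1` Job submitted to a Kueue LocalQueue through the `kueue.x-k8s.io/queue-name` label. The function name, the `question-<uuid>` naming scheme, the queue and image names, and the use of `nvidia.com/gpu` for GPUs are all hypothetical choices made for illustration, not the planned implementation.

```python
import uuid
from typing import Any, Dict


def build_question_job(
    question_uuid: str,
    image: str,
    queue_name: str,
    cpu: str = "1",
    memory: str = "2Gi",
    gpus: int = 0,
    namespace: str = "default",
) -> Dict[str, Any]:
    """Build a (hypothetical) Job manifest for one question, to be admitted by Kueue."""
    # Arbitrary CPU/memory can be requested per question.
    resources: Dict[str, Dict[str, Any]] = {"requests": {"cpu": cpu, "memory": memory}}

    if gpus:
        # Assumes NVIDIA GPUs exposed via the device plugin; extended resources go in `limits`.
        resources["limits"] = {"nvidia.com/gpu": gpus}

    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            # One Job per question, so each question can be monitored or cancelled on its own.
            "name": f"question-{question_uuid}",
            "namespace": namespace,
            # This label tells Kueue which LocalQueue the Job should wait in.
            "labels": {"kueue.x-k8s.io/queue-name": queue_name},
        },
        "spec": {
            # Start suspended; Kueue unsuspends the Job once quota is available,
            # so excess questions wait in the queue instead of being dropped.
            "suspend": True,
            # Don't retry the pod automatically if the question fails.
            "backoffLimit": 0,
            # No `activeDeadlineSeconds` is set, so the Job itself imposes no runtime limit.
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": "question", "image": image, "resources": resources},
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical image and queue names.
    manifest = build_question_job(
        question_uuid=str(uuid.uuid4()),
        image="example-registry/my-service:latest",
        queue_name="octue-questions",
        gpus=1,
    )
    print(manifest["metadata"]["name"])
```

Cancelling a question would then amount to deleting its Job, and monitoring it to watching that Job's status, since each question gets its own Job.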
Contents
- Experiment with Kueue #712
- Investigate whether Kueue can integrate directly with pub/sub (i.e. can we link Kueue up to push subscriptions?) #711
- Link pub/sub to Kubernetes cluster #713
- Dispatch questions to Kueue via cloud function #714
- Run questions via CLI without starting a service #710
- Update SDK with Kueue features #715
- Set up Kueue infrastructure via Terraform #716
- Upgrade service registries #717
- Create new GHA workflow to build and push images #719
- Remove Cloud Run service backend
- Remove old Cloud Run code
- Deprecate `octue create-push-subscription` CLI command and GHA
- Update documentation