KubeRay is a powerful, open-source Kubernetes operator that simplifies the deployment and management of Ray applications on Kubernetes. It offers several key components:
- KubeRay core: The official, fully maintained component of KubeRay, which provides three custom resource definitions: RayCluster, RayJob, and RayService. These resources are designed to help you run a wide range of workloads with ease.
  - RayCluster: KubeRay fully manages the lifecycle of a RayCluster, including cluster creation and deletion, autoscaling, and fault tolerance.
  - RayJob: With RayJob, KubeRay automatically creates a RayCluster and submits a job once the cluster is ready. You can also configure the RayJob to delete the RayCluster automatically after the job finishes.
  - RayService: A RayService consists of two parts: a RayCluster and a Ray Serve deployment graph. RayService offers zero-downtime upgrades for the RayCluster and high availability.
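To make the shape of these resources concrete, here is a minimal sketch of a RayCluster and a RayJob manifest. The `apiVersion` matches KubeRay v1.x, while the names, image tag, replica counts, and resource requests are illustrative assumptions rather than recommendations:

```yaml
# Minimal RayCluster sketch (names, image, and sizes are assumptions).
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-sample
spec:
  rayVersion: "2.9.0"
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
            resources:
              requests:
                cpu: "1"
                memory: 2Gi
  workerGroupSpecs:
    - groupName: small-group
      replicas: 1
      minReplicas: 1
      maxReplicas: 3
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0
---
# Minimal RayJob sketch: KubeRay creates the cluster, runs the entrypoint,
# and (with shutdownAfterJobFinishes) deletes the cluster afterwards.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  entrypoint: python -c "import ray; ray.init(); print(ray.cluster_resources())"
  shutdownAfterJobFinishes: true
  rayClusterSpec:
    rayVersion: "2.9.0"
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
```

Apply either manifest with `kubectl apply -f <file>` once the KubeRay operator is installed in the cluster.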
- KubeRay ecosystem: Optional components that complement KubeRay core.
  - Kubectl Plugin (Beta): Starting with KubeRay v1.3.0, you can use the `kubectl ray` plugin to simplify common workflows when deploying Ray on Kubernetes, which is especially helpful if you aren't familiar with Kubernetes. See kubectl-plugin for more details.
  - KubeRay APIServer (Alpha): Provides a simplified configuration layer for KubeRay resources. Some organizations use the KubeRay APIServer internally to back user interfaces for KubeRay resource management.
  - KubeRay Dashboard (Experimental): Starting with KubeRay v1.4.0, a new dashboard lets users view and manage KubeRay resources. It isn't production-ready yet, but we welcome your feedback.
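As a sketch of the `kubectl ray` workflow, the commands below assume a running Kubernetes cluster with the KubeRay operator installed; the cluster name and script path are hypothetical, and exact subcommands and flags may vary by plugin version, so check `kubectl ray --help`:

```shell
# Create a Ray cluster (my-cluster is a hypothetical name)
kubectl ray create cluster my-cluster

# List KubeRay-managed Ray clusters
kubectl ray get cluster

# Forward the Ray dashboard and client ports to localhost
kubectl ray session my-cluster

# Submit a local script as a Ray job (my-script.py is hypothetical)
kubectl ray job submit --working-dir . -- python my-script.py
```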
Since September 2023, all user-facing KubeRay documentation has been hosted on the Ray documentation site. The KubeRay repository contains only documentation related to the development and maintenance of KubeRay.
KubeRay examples are hosted on the Ray documentation. Examples span a wide range of use cases, including training, LLM online inference, batch inference, and more.
KubeRay integrates with the Kubernetes ecosystem, including observability tools (e.g., Prometheus, Grafana, py-spy), queuing systems (e.g., Volcano, Apache YuniKorn, Kueue), ingress controllers (e.g., Nginx), and more. See KubeRay Ecosystem for more details.
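As one example of these integrations, KubeRay can delegate RayCluster pod scheduling to a batch scheduler such as Volcano through labels. A minimal sketch, assuming the operator's Volcano batch-scheduler support is enabled and that the queue named below already exists:

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-volcano
  labels:
    ray.io/scheduler-name: volcano            # hand pod scheduling to Volcano
    volcano.sh/queue-name: kuberay-test-queue # assumed queue name
spec:
  rayVersion: "2.9.0"
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
```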
- Scaling Ray to 10K Models and Beyond (Workday)
- How Klaviyo built a robust model serving platform with Ray Serve (Klaviyo)
- Evolving Niantic AR Mapping Infrastructures with Ray (Niantic)
- Building a Modern Machine Learning Platform with Ray at Samsara (Samsara)
- Using Ray on Kubernetes with KubeRay at Google Cloud (Google)
- How DoorDash Built an Ensemble Learning Model for Time Series Forecasting with KubeRay (DoorDash)
- AI/ML Models Batch Training at Scale with Open Data Hub (Red Hat)
- Distributed Machine Learning at Instacart (Instacart)
- Unleashing ML Innovation at Spotify with Ray (Spotify)
- Best Practices For Ray Cluster On ACK (Alibaba Cloud)
- Advanced Model Serving Techniques with Ray on Kubernetes | KubeCon 2024 NA (Anyscale + Google)
- Building Scalable AI Infrastructure with KubeRay and Kubernetes | Ray Summit 2024 (Anyscale + Google)
- Ray at Scale: Apple's Approach to Elastic GPU Management | Ray Summit 2024 (Apple)
- Scaling Ray Train to 10K Kubernetes Nodes on GKE | Ray Summit 2024 (Google)
- KubeSecRay: Fortifying Multi-Tenant Ray Clusters on Kubernetes | Ray Summit 2024 (Microsoft)
- Scaling LLM Inference: AWS Inferentia Meets Ray Serve on EKS | Ray Summit 2024 (AWS)
- How Roblox Scaled Machine Learning by Leveraging Ray for Efficient Batch Inference | Ray Summit 2024 (Roblox)
- Airbnb's LLM Evolution: Fine-Tuning with Ray | Ray Summit 2024 (Airbnb)
- Ray @ eBay: Pioneering a Next-Gen AI Platform | Ray Summit 2024 (eBay)
- Spotify Harnesses Ray for Next-Gen AI Infrastructure | Ray Summit 2024 (Spotify)
- Spotify's Approach to Distributed LLM Training with Ray on GKE | Ray Summit 2024 (Spotify)
- Reddit's ML Evolution: Scaling with Ray and KubeRay | Ray Summit 2024 (Reddit)
- IBM's Approach to Building a Cloud-Native AI Platform | Ray Summit 2024 (IBM)
- Exploring Hinge's ML Platform Evolution with Ray | Ray Summit 2024 (Hinge)
- How Rubrik Unlocked AI at Scale with Ray Serve | Ray Summit 2024 (Rubrik)
- Supercharge Your AI Platform with KubeRay | KubeCon 2023 NA (Anyscale + Google)
- Sailing Ray Workloads with KubeRay and Kueue in Kubernetes (Volcano + DaoCloud)
- Serving Large Language Models with KubeRay on TPUs (Google)
Please read our CONTRIBUTING guide before making a pull request, and refer to our DEVELOPMENT guide to build and run tests locally.
Join Ray's Slack workspace and search for the following public channel:
- `#kuberay-questions`: This channel aims to help KubeRay users with their questions. Ray and KubeRay maintainers closely monitor the messages.
KubeRay contributors are welcome to join the bi-weekly KubeRay community meetings.
- Add the Ray/KubeRay Google calendar to your calendar.
If you discover a potential security issue in this project, or think you may have discovered one, please notify KubeRay Security via our Slack channel. Please do not create a public GitHub issue.
This project is licensed under the Apache-2.0 License.