Feat gpu (#585)

acarranoqovery · web-flow · commit 38fbfdcfca11 · 2025-10-15T10:32:57.000+02:00
* feat: add GPU support for karpenter

* fix
diff --git a/website/docs/using-qovery/configuration/application.md b/website/docs/using-qovery/configuration/application.md
@@ -1,5 +1,5 @@
 ---
-last_modified_on: "2025-09-16"
+last_modified_on: "2025-10-14"
 title: "Application"
 description: "Learn how to configure your Application on Qovery"
 ---
@@ -102,6 +102,7 @@ Within this section, you will need to define the resources to be assigned to you
 
 - vCPU: the vCPU assigned to each instance of your application. The default is 500m (0.5 vCPU).
 - RAM: the amount of RAM assigned to each instance of your application. The default is 512MB.
+- GPU: the number of GPUs assigned to each instance of your application. The default is 0. To use GPU instances, the GPU nodepool must be enabled on your cluster (see the [AWS with Karpenter cluster setup documentation][docs.using-qovery.configuration.clusters.aws-with-karpenter]).
 - Number of instances (Application Auto-scaling): select the minimum and the maximum number of instances of your application that can run within your cluster. The number of instances running at an insant t is automatically managed by Kubernetes (Application auto-scaling) and it is based on real-time CPU consumption. When your app goes above 60% of CPU consumption for 5 minutes, your app will be auto-scaled and more instances will be added. It is transparent.
  Qovery runs your application on Kubernetes and relies on [metrics-server](https://github.com/kubernetes-sigs/metrics-server) service to auto-scale your app.
 
@@ -282,6 +283,14 @@ Default is 512MB.
 
 </Alert>
 
+#### GPU
+
+To configure the number of GPUs that your app needs, adjust the setting in the `Resources` section of the application configuration.
+
+<Alert type="info">
+Default is 0. To use GPU instances, the GPU nodepool must be enabled on your cluster.
+</Alert>
+
 Please note that in this section you configure the CPU allocated by the cluster for your application and that cannot consume more than this value. Even if the application is underused and consume less resources, the cluster will still reserve the selected amount of CPU. If your application requires more RAM than requested, it will be killed by the kubernetes scheduler.
 
 #### Auto-scaling
@@ -559,6 +568,7 @@ In the application overview, click on the `3 dots` button and remove the applica
 [docs.using-qovery.configuration.advanced-settings]: /docs/using-qovery/configuration/advanced-settings/
 [docs.using-qovery.configuration.application-health-checks]: /docs/using-qovery/configuration/application-health-checks/
 [docs.using-qovery.configuration.clusters#use-custom-domain-and-wildcard-tls-for-the-whole-cluster-beta]: /docs/using-qovery/configuration/clusters/#use-custom-domain-and-wildcard-tls-for-the-whole-cluster-beta
+[docs.using-qovery.configuration.clusters.aws-with-karpenter]: /docs/using-qovery/configuration/clusters/aws-with-karpenter/
 [docs.using-qovery.configuration.environment-variable#connecting-to-a-database]: /docs/using-qovery/configuration/environment-variable/#connecting-to-a-database
 [docs.using-qovery.configuration.environment-variable#connecting-to-another-application]: /docs/using-qovery/configuration/environment-variable/#connecting-to-another-application
 [docs.using-qovery.configuration.environment-variable]: /docs/using-qovery/configuration/environment-variable/
diff --git a/website/docs/using-qovery/configuration/application.md.erb b/website/docs/using-qovery/configuration/application.md.erb
@@ -90,6 +90,7 @@ Within this section, you will need to define the resources to be assigned to you
 
 - vCPU: the vCPU assigned to each instance of your application. The default is 500m (0.5 vCPU).
 - RAM: the amount of RAM assigned to each instance of your application. The default is 512MB.
+- GPU: the number of GPUs assigned to each instance of your application. The default is 0. To use GPU instances, the GPU nodepool must be enabled on your cluster (see the [AWS with Karpenter cluster setup documentation][docs.using-qovery.configuration.clusters.aws-with-karpenter]).
 - Number of instances (Application Auto-scaling): select the minimum and the maximum number of instances of your application that can run within your cluster. The number of instances running at an insant t is automatically managed by Kubernetes (Application auto-scaling) and it is based on real-time CPU consumption. When your app goes above 60% of CPU consumption for 5 minutes, your app will be auto-scaled and more instances will be added. It is transparent.
  Qovery runs your application on Kubernetes and relies on [metrics-server](https://github.com/kubernetes-sigs/metrics-server) service to auto-scale your app.
 
@@ -270,6 +271,14 @@ Default is 512MB.
 
 </Alert>
 
+#### GPU
+
+To configure the number of GPUs that your app needs, adjust the setting in the `Resources` section of the application configuration.
+
+<Alert type="info">
+Default is 0. To use GPU instances, the GPU nodepool must be enabled on your cluster.
+</Alert>
+
 Please note that in this section you configure the CPU allocated by the cluster for your application and that cannot consume more than this value. Even if the application is underused and consume less resources, the cluster will still reserve the selected amount of CPU. If your application requires more RAM than requested, it will be killed by the kubernetes scheduler.
 
 #### Auto-scaling
diff --git a/website/docs/using-qovery/configuration/clusters/aws-with-karpenter.md b/website/docs/using-qovery/configuration/clusters/aws-with-karpenter.md
@@ -1,5 +1,5 @@
 ---
-last_modified_on: "2025-07-07"
+last_modified_on: "2025-10-14"
 title: "AWS EKS with Karpenter"
 description: "Learn how to configure your AWS Kubernetes clusters with Karpenter on Qovery"
 ---
@@ -44,6 +44,7 @@ To confirm, click `Next`.
 In the `Set Resources` window, select:
 
 * `Karpenter`: Toggle the switch to enable Karpenter on your AWS EKS cluster
+* `Node disk size (GB)`: Specify the disk capacity allocated per worker node, determining the amount of data each node can store. The minimum value is 20GB.
 * `Instance types scopes`: By editing it, you can apply different filters to the node architectures, categories, families, and sizes. On the right, you can view all the instance types that match the applied filters. This means Karpenter will be able to spawn nodes on any of the listed instance types.
   * `Architectures`: by default both `AMD64` and `ARM64` architectures are selected.
   * `Default build architecture`: by default `AMD64`. If you build your application with the Qovery CI, your application will be built using this architecture by default.
@@ -55,6 +56,9 @@ In the `Set Resources` window, select:
   <img src="/img/configuration/clusters/spot_usage.png" alt="Spot usage" />
 </p>
 
+* `Enable GPU Nodepool configuration`: Enable this option to run GPU workloads on your cluster. It creates a dedicated nodepool for GPU instances, and you can then select the GPU instance types to use on this nodepool. To enable spot instances, toggle the spot instance flag.
+
+
 <Alert type="warning">
 Instance type selection from your Qovery Console has direct consequences on your cloud provider’s bill. While Qovery allows you to switch to a different instance type whenever you want, it is your sole responsibility to keep an eye on your infrastructure costs, especially when you want to upsize.
 
@@ -334,12 +338,14 @@ Qovery deploys two node pools by default:
 - **Stable node pool**: Used for single instances and internal Qovery applications. For example, any containerized databases or application having the number of minimum instances set to 1, will be deployed on this nodepool. On this nodepool the consolidation is deactivated by default.
 - **Default node pool**: Designed to handle general workloads and serves as the foundation for deploying most applications.
 
-Qovery allows you to modify the resources allocated to your cluster:
+An additional GPU node pool is present if you enabled the GPU node pool configuration when creating your cluster. This option can also be enabled afterwards.
 
-##### Shared settings for both nodepools:
-- **Instance types**: Define the list of instance types that can be used.
-- **Spot instances**: Enable or disable spot instances.
-- **Node disk size (GB)**: Specify the disk capacity allocated per worker node, determining the amount of data each node can store.
+##### Settings for nodepools:
+- **Instance types**: Define the list of instance types that can be used. (Shared by the Stable and Default nodepools)
+- **Spot instances**: Enable or disable spot instances. (Shared by all three nodepools)
+- **Node disk size (GB)**: Specify the disk capacity allocated per worker node, determining the amount of data each node can store. (Shared by the Stable and Default nodepools)
+- **Consolidation schedule** *(Stable nodepool only)*: Optimizes resource usage by consolidating workloads onto fewer nodes. This feature is not available for the default nodepool, as consolidation can happen at any time. We recommend enabling this option; otherwise, nodes will never be consolidated, leading to unnecessary infrastructure costs.
+- **Node pool limits**: Configure CPU and memory limits to ensure nodes stay within defined resource constraints, preventing excessive costs.
 
 <Alert type="warning">
 Instance type selection from your Qovery Console has direct consequences on your cloud provider’s bill. While Qovery allows you to switch to a different instance type whenever you want, it is your sole responsibility to keep an eye on your infrastructure costs, especially when you want to upsize.
@@ -348,9 +354,6 @@ For more information on the instance types provided by each cloud provider and t
 
 </Alert>
 
-##### Nodepool specific settings:
-- **Consolidation schedule** *(Stable nodepool only)*: Optimizes resource usage by consolidating workloads onto fewer nodes. This feature is not available for the default nodepool, as consolidation can happen at any time. We recommend enabling this option; otherwise, nodes will never be consolidated, leading to unnecessary infrastructure costs.
-- **Node pool limits**: Configure CPU and memory limits to ensure nodes stay within defined resource constraints, preventing excessive costs.
 
 #### Mirroring registry
 
diff --git a/website/docs/using-qovery/configuration/clusters/aws-with-karpenter.md.erb b/website/docs/using-qovery/configuration/clusters/aws-with-karpenter.md.erb
@@ -41,6 +41,7 @@ To confirm, click `Next`.
 In the `Set Resources` window, select:
 
 * `Karpenter`: Toggle the switch to enable Karpenter on your AWS EKS cluster
+* `Node disk size (GB)`: Specify the disk capacity allocated per worker node, determining the amount of data each node can store. The minimum value is 20GB.
 * `Instance types scopes`: By editing it, you can apply different filters to the node architectures, categories, families, and sizes. On the right, you can view all the instance types that match the applied filters. This means Karpenter will be able to spawn nodes on any of the listed instance types.
   * `Architectures`: by default both `AMD64` and `ARM64` architectures are selected.
   * `Default build architecture`: by default `AMD64`. If you build your application with the Qovery CI, your application will be built using this architecture by default.
@@ -52,6 +53,9 @@ In the `Set Resources` window, select:
   <img src="/img/configuration/clusters/spot_usage.png" alt="Spot usage" />
 </p>
 
+* `Enable GPU Nodepool configuration`: Enable this option to run GPU workloads on your cluster. It creates a dedicated nodepool for GPU instances, and you can then select the GPU instance types to use on this nodepool. To enable spot instances, toggle the spot instance flag.
+
+
 <Alert type="warning">
 Instance type selection from your Qovery Console has direct consequences on your cloud provider’s bill. While Qovery allows you to switch to a different instance type whenever you want, it is your sole responsibility to keep an eye on your infrastructure costs, especially when you want to upsize.
 
@@ -331,12 +335,14 @@ Qovery deploys two node pools by default:
 - **Stable node pool**: Used for single instances and internal Qovery applications. For example, any containerized databases or application having the number of minimum instances set to 1, will be deployed on this nodepool. On this nodepool the consolidation is deactivated by default.
 - **Default node pool**: Designed to handle general workloads and serves as the foundation for deploying most applications.
 
-Qovery allows you to modify the resources allocated to your cluster:
+An additional GPU node pool is present if you enabled the GPU node pool configuration when creating your cluster. This option can also be enabled afterwards.
 
-##### Shared settings for both nodepools:
-- **Instance types**: Define the list of instance types that can be used.
-- **Spot instances**: Enable or disable spot instances.
-- **Node disk size (GB)**: Specify the disk capacity allocated per worker node, determining the amount of data each node can store.
+##### Settings for nodepools:
+- **Instance types**: Define the list of instance types that can be used. (Shared by the Stable and Default nodepools)
+- **Spot instances**: Enable or disable spot instances. (Shared by all three nodepools)
+- **Node disk size (GB)**: Specify the disk capacity allocated per worker node, determining the amount of data each node can store. (Shared by the Stable and Default nodepools)
+- **Consolidation schedule** *(Stable nodepool only)*: Optimizes resource usage by consolidating workloads onto fewer nodes. This feature is not available for the default nodepool, as consolidation can happen at any time. We recommend enabling this option; otherwise, nodes will never be consolidated, leading to unnecessary infrastructure costs.
+- **Node pool limits**: Configure CPU and memory limits to ensure nodes stay within defined resource constraints, preventing excessive costs.
 
 <Alert type="warning">
 Instance type selection from your Qovery Console has direct consequences on your cloud provider’s bill. While Qovery allows you to switch to a different instance type whenever you want, it is your sole responsibility to keep an eye on your infrastructure costs, especially when you want to upsize.
@@ -345,9 +351,6 @@ For more information on the instance types provided by each cloud provider and t
 
 </Alert>
 
-##### Nodepool specific settings:
-- **Consolidation schedule** *(Stable nodepool only)*: Optimizes resource usage by consolidating workloads onto fewer nodes. This feature is not available for the default nodepool, as consolidation can happen at any time. We recommend enabling this option; otherwise, nodes will never be consolidated, leading to unnecessary infrastructure costs.
-- **Node pool limits**: Configure CPU and memory limits to ensure nodes stay within defined resource constraints, preventing excessive costs.
 
 #### Mirroring registry