feat: add image for raycluster workspace #16

Draft: wants to merge 1 commit into main from feat_ray_workspace

Conversation

bincherry

For https://github.com/matrixorigin/Neolink.AI/issues/120

Head node image for the Ray workspace.

start.sh was modified; testing is needed to verify whether the functionality of the existing images is affected, especially the supervisord configuration file and the startup logs.
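
A rough verification sketch for that check, assuming a head pod of the example cluster below is already running; the pod-name placeholder and the supervisord log path are assumptions, not something defined by this PR:

# Find the head pod (ray.io/cluster and ray.io/node-type are standard KubeRay labels)
kubectl get pods -n user-z1mzdnmc -l ray.io/cluster=workspace-hello,ray.io/node-type=head
# Check that the supervisord-managed programs are up (program names depend on the image's config)
kubectl exec -n user-z1mzdnmc <head-pod> -- supervisorctl status
# Inspect the startup log; the path is an assumption, adjust it to the image's supervisord configuration
kubectl exec -n user-z1mzdnmc <head-pod> -- tail -n 100 /var/log/supervisor/supervisord.log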


bincherry commented Oct 17, 2024

How to use the workspace image:

  1. Rename https://github.com/matrixorigin/mlops-images/blob/91bb5bb66ad1c0a66aa09fc497d01289ae4224f9/common/online-files/start.sh to start-workspace.sh (this filename is hard-coded in the workspace image) and host it at http://sharefile.neolink.com/file/start-workspace.sh. The separate filename is temporary, for testing; once testing passes it can replace the default start.sh (see the fetch-and-verify sketch after this list).
  2. RayCluster changes:
  • Add the annotation ray.io/overwrite-container-cmd: "true" to the RayCluster; it is required for the built-in supervisord to start.
  • The head node uses the custom image.
  • Worker nodes use the community image.
  • Worker nodes must set the container args manually (a consequence of ray.io/overwrite-container-cmd).
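
A minimal fetch-and-verify sketch for step 1, assuming the usual raw.githubusercontent.com form of the pinned file URL; the upload to sharefile.neolink.com itself is not shown:

# Fetch the script at the pinned commit and save it under the hard-coded name
curl -fsSL https://raw.githubusercontent.com/matrixorigin/mlops-images/91bb5bb66ad1c0a66aa09fc497d01289ae4224f9/common/online-files/start.sh -o start-workspace.sh
# After uploading it to the file server, confirm the hosted copy matches the repo version
curl -fsSL http://sharefile.neolink.com/file/start-workspace.sh | diff - start-workspace.sh \
  && echo "hosted start-workspace.sh matches the repo version"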

Example

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  annotations:
    ray.io/overwrite-container-cmd: "true"
  name: workspace-hello
  namespace: user-z1mzdnmc
spec:
  # The version of Ray you are using. Make sure all Ray containers are running this version of Ray.
  rayVersion: '2.35.0'
  # If `enableInTreeAutoscaling` is true, the Autoscaler sidecar will be added to the Ray head pod.
  # Ray Autoscaler integration is Beta with KubeRay >= 0.3.0 and Ray >= 2.0.0.
  enableInTreeAutoscaling: true
  # `autoscalerOptions` is an OPTIONAL field specifying configuration overrides for the Ray Autoscaler.
  # The example configuration shown below represents the DEFAULT values.
  # (You may delete autoscalerOptions if the defaults are suitable.)
  autoscalerOptions:
    # `upscalingMode` is "Default" or "Aggressive."
    # Conservative: Upscaling is rate-limited; the number of pending worker pods is at most the size of the Ray cluster.
    # Default: Upscaling is not rate-limited.
    # Aggressive: An alias for Default; upscaling is not rate-limited.
    upscalingMode: Default
    # `idleTimeoutSeconds` is the number of seconds to wait before scaling down a worker pod which is not using Ray resources.
    idleTimeoutSeconds: 10
    # `image` optionally overrides the Autoscaler's container image. The Autoscaler uses the same image as the Ray container by default.
    image: ghcr.io/bincherry/ray:2.35.0-py310-cpu
    # `imagePullPolicy` optionally overrides the Autoscaler container's default image pull policy (IfNotPresent).
    imagePullPolicy: IfNotPresent
    # Optionally specify the Autoscaler container's securityContext.
    securityContext: {}
    env: []
    envFrom: []
    # resources specifies optional resource request and limit overrides for the Autoscaler container.
    # The default Autoscaler resource limits and requests should be sufficient for production use-cases.
    # However, for large Ray clusters, we recommend monitoring container resource usage to determine if overriding the defaults is required.
    resources:
      limits:
        cpu: "500m"
        memory: "1Gi"
      requests:
        cpu: "100m"
        memory: "128Mi"
  # Ray head pod template
  headGroupSpec:
    # The `rayStartParams` are used to configure the `ray start` command.
    # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
    # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
    rayStartParams:
      # Setting "num-cpus: 0" to avoid any Ray actors or tasks being scheduled on the Ray head Pod.
      num-cpus: "0"
      # Use `resources` to optionally specify custom resource annotations for the Ray node.
      # The value of `resources` is a string-integer mapping.
      # Currently, `resources` must be provided in the specific format demonstrated below:
      # resources: '"{\"Custom1\": 1, \"Custom2\": 5}"'
    # Pod template
    template:
      metadata:
        annotations:
          sidecar.istio.io/inject: "true"
      spec:
        containers:
        # The Ray head container
        - name: workspace-hello
          image: ghcr.io/bincherry/ray:workspace-2.35.0-python3.10-ubuntu22.04 # custom image for the head node
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:
              cpu: "2"
              memory: "4Gi"
            requests:
              cpu: "0.1"
              memory: "128Mi"
  workerGroupSpecs:
  # the Pod replicas in this group typed worker
  - replicas: 2
    minReplicas: 2
    maxReplicas: 2
    # logical group name; here it is cpu2
    groupName: cpu2
    # If worker pods need to be added, Ray Autoscaler can increment the `replicas`.
    # If worker pods need to be removed, Ray Autoscaler decrements the replicas, and populates the `workersToDelete` list.
    # KubeRay operator will remove Pods from the list until the desired number of replicas is satisfied.
    #scaleStrategy:
    #  workersToDelete:
    #  - raycluster-complete-worker-small-group-bdtwh
    #  - raycluster-complete-worker-small-group-hv457
    #  - raycluster-complete-worker-small-group-k8tj7
    rayStartParams:
      num-cpus: "2"
    # Pod template
    template:
      spec:
        containers:
        - name: ray-worker
          image: ghcr.io/bincherry/ray:2.35.0-py310-cpu # community image for the worker nodes; mind the CUDA version
          args:
          - bash
          - -c
          - ulimit -n 65536; bash -lc "$KUBERAY_GEN_RAY_START_CMD" # worker nodes must set the container args manually
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:
              cpu: "1"
              memory: "1Gi"
            requests:
              cpu: "0.1"
              memory: "128Mi"

bincherry force-pushed the feat_ray_workspace branch 2 times, most recently from 72c5e24 to b1a00da on October 24, 2024 at 05:26