Specify container commands for Ray head/worker Pods

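You can execute commands on the head/worker Pods at two timings:

  • (1) Before ray start: for example, you can set environment variables that ray start will use.

  • (2) After ray start (RayCluster is ready): for example, you can launch a Ray Serve deployment once the RayCluster is ready.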

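Current KubeRay operator behavior for container commands

  • The current behavior for container commands is not finalized and may change in the future.

  • See the code for more details.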

Timing 1: Before ray start

Currently, for timing (1), we can set the container's command and args in the RayCluster specification to reach this goal.

# https://github.com/ray-project/kuberay/ray-operator/config/samples/ray-cluster.head-command.yaml
    rayStartParams:
        ...
    #pod template
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.5.0
          resources:
            ...
          ports:
            ...
          # `command` and `args` will become a part of `spec.containers.0.args` in the head Pod.
          command: ["echo 123"]
          args: ["456"]
  • Ray Head Pod

    • spec.containers.0.command is hardcoded to ["/bin/bash", "-lc", "--"].

    • spec.containers.0.args contains two parts:

      • (Part 1) User-specified command: a string that concatenates headGroupSpec.template.spec.containers.0.command and headGroupSpec.template.spec.containers.0.args from the RayCluster.

      • (Part 2) ray start command: the command is generated from the rayStartParams defined in the RayCluster. It looks like ulimit -n 65536; ray start ...

      • To summarize, spec.containers.0.args will be $(user-specified command) && $(ray start command).

  • Example

    # Prerequisite: There is a KubeRay operator in the Kubernetes cluster.
    
    # Download `ray-cluster.head-command.yaml`
    curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.0.0-rc.0/ray-operator/config/samples/ray-cluster.head-command.yaml
    
    # Create a RayCluster
    kubectl apply -f ray-cluster.head-command.yaml
    
    # Check ${RAYCLUSTER_HEAD_POD}
    kubectl get pod -l ray.io/node-type=head
    
    # Check `spec.containers.0.command` and `spec.containers.0.args`.
    kubectl describe pod ${RAYCLUSTER_HEAD_POD}
    
    # Command:
    #   /bin/bash
    #   -lc
    #   --
    # Args:
    #    echo 123  456  && ulimit -n 65536; ray start --head  --dashboard-host=0.0.0.0  --num-cpus=1  --block  --metrics-export-port=8080  --memory=2147483648
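
The composition described above can be reproduced locally with a short shell sketch. It is not the operator's implementation: `ray_start_command` is a hypothetical placeholder for the generated `ulimit -n 65536; ray start ...` command so that the sketch runs without Ray, spacing differs slightly from the real output, and the inner shell uses `-c` instead of `-lc` to avoid login-profile output.

```shell
#!/bin/bash
# Sketch only: mimics how the final args string is assembled.

# Values taken from the sample RayCluster above.
user_command="echo 123"
user_args="456"

# Hypothetical placeholder for the generated ray start command (assumption).
ray_start_command="echo ray-start-placeholder"

# Part 1 (user-specified command) and Part 2 (ray start) are joined with `&&`.
container_args="${user_command} ${user_args} && ${ray_start_command}"
echo "args: ${container_args}"

# The head Pod runs: /bin/bash -lc -- "<args>". Here -c is used instead of -lc.
output="$(bash -c -- "${container_args}")"
echo "${output}"
```

Because the two parts are joined with `&&`, a user-specified command that exits with a non-zero status would prevent the second part from running at all.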
    

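Timing 2: After ray start (RayCluster is ready)

We have two solutions for executing commands after the RayCluster is ready. The main difference between them is that, with Solution 1, users can check the logs via kubectl logs.

Solution 1: Container command (Recommended)

As mentioned in the section "Timing 1: Before ray start", the user-specified command is executed before the ray start command. Hence, we can execute ray_cluster_resources.sh in the background by updating headGroupSpec.template.spec.containers.0.command in ray-cluster.head-command.yaml.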

# https://github.com/ray-project/kuberay/ray-operator/config/samples/ray-cluster.head-command.yaml
# Parentheses around the command are required.
command: ["(/home/ray/samples/ray_cluster_resources.sh&)"]
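
The need for parentheses follows from how the user-specified command is joined to ray start with `&&`. The sketch below is an illustration under assumptions, not KubeRay code: `sleep 0` stands in for the background script and `echo next` stands in for ray start.

```shell
#!/bin/bash
# Sketch: why the backgrounded command is wrapped in parentheses.

# Without parentheses the composed string is `cmd & && next`,
# which bash rejects as a syntax error, so nothing runs.
if bash -c -- "sleep 0 & && echo next" 2>/dev/null; then
  without_parens="ok"
else
  without_parens="syntax error"
fi

# With parentheses, the subshell starts the command in the background and
# returns exit status 0 right away, so the command after `&&` runs immediately.
with_parens="$(bash -c -- "(sleep 0 &) && echo next")"

echo "without parentheses: ${without_parens}"
echo "with parentheses: ${with_parens}"
```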

# ray_cluster_resources.sh
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-example
data:
  ray_cluster_resources.sh: |
    #!/bin/bash

    # wait for ray cluster to finish initialization
    while true; do
        ray health-check 2>/dev/null
        if [ "$?" = "0" ]; then
            break
        else
            echo "INFO: waiting for ray head to start"
            sleep 1
        fi
    done

    # Print the resources in the ray cluster after the cluster is ready.
    python -c "import ray; ray.init(); print(ray.cluster_resources())"

    echo "INFO: Print Ray cluster resources"
  • Example

    # (1) Update `command` to ["(/home/ray/samples/ray_cluster_resources.sh&)"]
    # (2) Comment out `postStart` and `args`.
    kubectl apply -f ray-cluster.head-command.yaml
    
    # Check ${RAYCLUSTER_HEAD_POD}
    kubectl get pod -l ray.io/node-type=head
    
    # Check the logs
    kubectl logs ${RAYCLUSTER_HEAD_POD}
    
    # INFO: waiting for ray head to start
    # .
    # . => Cluster initialization
    # .
    # 2023-02-16 18:44:43,724 INFO worker.py:1231 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
    # 2023-02-16 18:44:43,724 INFO worker.py:1352 -- Connecting to existing Ray cluster at address: 10.244.0.26:6379...
    # 2023-02-16 18:44:43,735 INFO worker.py:1535 -- Connected to Ray cluster. View the dashboard at http://10.244.0.26:8265
    # {'object_store_memory': 539679129.0, 'node:10.244.0.26': 1.0, 'CPU': 1.0, 'memory': 2147483648.0}
    # INFO: Print Ray cluster resources
    

Solution 2: postStart hook

# https://github.com/ray-project/kuberay/ray-operator/config/samples/ray-cluster.head-command.yaml
lifecycle:
  postStart:
    exec:
      command: ["/bin/sh","-c","/home/ray/samples/ray_cluster_resources.sh"]
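  • We execute the script ray_cluster_resources.sh via the postStart hook. According to the Kubernetes documentation on container lifecycle hooks, there is no guarantee that the hook will execute before the container ENTRYPOINT. Hence, ray_cluster_resources.sh needs to wait for the RayCluster to finish initialization.

  • Example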

    kubectl apply -f ray-cluster.head-command.yaml
    
    # Check ${RAYCLUSTER_HEAD_POD}
    kubectl get pod -l ray.io/node-type=head
    
    # Forward the port of Dashboard
    kubectl port-forward --address 0.0.0.0 ${RAYCLUSTER_HEAD_POD} 8265:8265
    
    # Open the browser and check the Dashboard (${YOUR_IP}:8265/#/job).
    # You should see a SUCCEEDED job with the following Entrypoint:
    #
    # `python -c "import ray; ray.init(); print(ray.cluster_resources())"`
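
Since both solutions rely on the script itself to wait for the cluster, one possible variation is to bound the wait so the script gives up if the head never becomes healthy. The sketch below is an assumption-laden illustration: `wait_until_ready` is a hypothetical helper (not part of Ray or KubeRay), and `true` and `false` stand in for `ray health-check` so that it runs without Ray.

```shell
#!/bin/bash
# Sketch: a readiness wait loop with an upper bound on attempts.
# wait_until_ready is a hypothetical helper, not part of Ray or KubeRay.
wait_until_ready() {
  max_attempts="$1"
  shift
  attempt=1
  while [ "${attempt}" -le "${max_attempts}" ]; do
    # "$@" is the readiness check, for example `ray health-check` in the head Pod.
    if "$@" 2>/dev/null; then
      return 0
    fi
    echo "INFO: waiting for ray head to start (attempt ${attempt}/${max_attempts})"
    attempt=$((attempt + 1))
    sleep 1
  done
  # The check never succeeded within the allowed number of attempts.
  return 1
}

# In the head Pod this could be called as: wait_until_ready 120 ray health-check
# (120 is an arbitrary example value.)
wait_until_ready 3 true && ready_status="ready"
wait_until_ready 2 false || timeout_status="timed out"
echo "${ready_status}, ${timeout_status}"
```

The attempt limit is a design choice: a larger value tolerates slow cluster startup, while a smaller one surfaces a failed startup sooner.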