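GPU Support#

GPUs are critical for many machine learning applications. Ray natively supports GPU as a pre-defined resource type and allows tasks and actors to specify their GPU resource requirements.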

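Starting Ray Nodes with GPUs#

By default, Ray sets the number of GPU resources on a node to the number of physical GPUs that Ray auto-detects. If needed, you can override this value, for example with the num_gpus argument of ray.init or the --num-gpus flag of ray start, both of which appear elsewhere on this page.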

Note

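Nothing prevents you from specifying a num_gpus value larger than the true number of GPUs on the machine, because Ray resources are logical. In that case, Ray acts as if the machine has the number of GPUs you specified when it schedules tasks and actors that require GPUs. Trouble only occurs if those tasks and actors attempt to use GPUs that don't exist.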

Tip

You can set the CUDA_VISIBLE_DEVICES environment variable before starting a Ray node to limit the GPUs that are visible to Ray. For example, CUDA_VISIBLE_DEVICES=1,3 ray start --head --num-gpus=2 lets Ray see only devices 1 and 3.
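To make the effect of this tip concrete, here is a minimal sketch in plain Python (no Ray required) of how a process launched with CUDA_VISIBLE_DEVICES=1,3 would map its local device indices to physical ones. The helper name visible_devices is hypothetical and is not part of Ray or CUDA; it only models the common convention that local device index i refers to the i-th entry of the variable.

```python
def visible_devices(env):
    """Return the physical GPU IDs a process may use, in local-index order.

    Hypothetical helper for illustration; not a Ray or CUDA API.
    Returns None when the variable is unset (no restriction applied).
    """
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    # An empty string hides every GPU from the process.
    return [item.strip() for item in value.split(",") if item.strip()]


# Environment as it would look after `CUDA_VISIBLE_DEVICES=1,3 ray start ...`.
env = {"CUDA_VISIBLE_DEVICES": "1,3"}
devices = visible_devices(env)
for local_index, physical_id in enumerate(devices):
    print(f"local device {local_index} -> physical GPU {physical_id}")
```

Under these assumptions, the two GPUs that Ray counts are physical devices 1 and 3, even though code inside the process addresses them as devices 0 and 1.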

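Using GPUs in Tasks and Actors#

If a task or actor requires GPUs, you can specify the corresponding resource requirements (for example, @ray.remote(num_gpus=1)). Ray then schedules the task or actor to a node that has enough free GPU resources and assigns GPUs to it by setting the CUDA_VISIBLE_DEVICES environment variable before running the task or actor code.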

import os
import ray

ray.init(num_gpus=2)


@ray.remote(num_gpus=1)
class GPUActor:
    def ping(self):
        print("ray.get_gpu_ids(): {}".format(ray.get_gpu_ids()))
        print("CUDA_VISIBLE_DEVICES: {}".format(os.environ["CUDA_VISIBLE_DEVICES"]))


@ray.remote(num_gpus=1)
def use_gpu():
    print("ray.get_gpu_ids(): {}".format(ray.get_gpu_ids()))
    print("CUDA_VISIBLE_DEVICES: {}".format(os.environ["CUDA_VISIBLE_DEVICES"]))


gpu_actor = GPUActor.remote()
ray.get(gpu_actor.ping.remote())
# The actor uses the first GPU so the task will use the second one.
ray.get(use_gpu.remote())
# (GPUActor pid=52420) ray.get_gpu_ids(): [0]
# (GPUActor pid=52420) CUDA_VISIBLE_DEVICES: 0
# (use_gpu pid=51830) ray.get_gpu_ids(): [1]
# (use_gpu pid=51830) CUDA_VISIBLE_DEVICES: 1

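Inside a task or actor, ray.get_gpu_ids() returns a list of the GPU IDs that are available to that task or actor. Typically, you don't need to call ray.get_gpu_ids(), because Ray automatically sets the CUDA_VISIBLE_DEVICES environment variable, which most machine learning frameworks respect when assigning GPUs.

Note: The use_gpu function defined above doesn't actually use any GPUs. Ray schedules it on a node that has at least one GPU and reserves one GPU for it while it executes, but it is up to the function itself to make use of the GPU. This is typically done through an external library such as TensorFlow. The following example actually uses a GPU. For it to work, you need to install the GPU version of TensorFlow.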
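The assignment described above can be pictured with a toy model. The sketch below is not Ray's scheduler; ToyNode and its methods are hypothetical names used only to show the idea of reserving whole GPU IDs and exposing them through CUDA_VISIBLE_DEVICES, mirroring the actor-then-task output shown earlier.

```python
class ToyNode:
    """Toy stand-in for a node that hands out whole GPUs (not Ray's code)."""

    def __init__(self, num_gpus):
        self.free_gpu_ids = list(range(num_gpus))

    def assign(self, num_gpus):
        """Reserve `num_gpus` GPUs and return (ids, env) for the worker."""
        if num_gpus > len(self.free_gpu_ids):
            raise RuntimeError("not enough free GPUs on this node")
        ids = self.free_gpu_ids[:num_gpus]
        self.free_gpu_ids = self.free_gpu_ids[num_gpus:]
        env = {"CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in ids)}
        return ids, env

    def release(self, ids):
        """Return GPUs to the free pool once the task or actor is done."""
        self.free_gpu_ids = sorted(self.free_gpu_ids + ids)


node = ToyNode(num_gpus=2)
actor_ids, actor_env = node.assign(1)  # the actor takes the first GPU
task_ids, task_env = node.assign(1)    # the task gets the remaining one
print(actor_ids, actor_env)
print(task_ids, task_env)
```

In this model a third request would fail until one of the reservations is released, which loosely corresponds to a Ray task waiting for free GPU resources.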

@ray.remote(num_gpus=1)
def use_gpu():
    import tensorflow as tf

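    # Create a TensorFlow session. TensorFlow will restrict itself to use the
    # GPUs specified by the CUDA_VISIBLE_DEVICES environment variable.
    # tf.compat.v1.Session also works on TensorFlow 2.x, where tf.Session
    # no longer exists.
    tf.compat.v1.Session()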

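Note: The person implementing use_gpu can ignore ray.get_gpu_ids() and use all of the GPUs on the machine. Ray does not prevent this, and it can lead to too many tasks or actors using the same GPU at the same time. However, Ray does automatically set the CUDA_VISIBLE_DEVICES environment variable, which restricts the GPUs used by most deep learning frameworks, assuming the user doesn't override it.

Fractional GPUs#

Ray supports fractional resource requirements, so multiple tasks and actors can share the same GPU.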

ray.init(num_cpus=4, num_gpus=1)


@ray.remote(num_gpus=0.25)
def f():
    import time

    time.sleep(1)


# The four tasks created here can execute concurrently
# and share the same GPU.
ray.get([f.remote() for _ in range(4)])

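Note: It is the user's responsibility to make sure that individual tasks don't use more than their share of the GPU memory. TensorFlow can be configured to limit its memory usage.

When Ray assigns a node's GPUs to tasks or actors with fractional resource requirements, it packs one GPU before moving on to the next one, to avoid fragmentation.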
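One way to reason about a task's share is to derive a memory budget from its fractional request. The sketch below is plain arithmetic under stated assumptions: the helper name gpu_memory_budget_mib and the 16384 MiB total are made up for illustration, and Ray itself does not compute or enforce such a budget. The resulting number is the kind of value you would pass to a framework's own memory-limit setting.

```python
def gpu_memory_budget_mib(total_mib, gpu_fraction):
    """Memory a task should stay under, given its fractional GPU request.

    Hypothetical helper: Ray does not compute or enforce this limit.
    """
    if not 0 < gpu_fraction <= 1:
        raise ValueError("gpu_fraction must be in (0, 1]")
    return int(total_mib * gpu_fraction)


# Assumed example: a 16384 MiB GPU shared by tasks requesting num_gpus=0.25.
budget = gpu_memory_budget_mib(total_mib=16384, gpu_fraction=0.25)
print(budget)  # 4096
```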

ray.init(num_gpus=3)


@ray.remote(num_gpus=0.5)
class FractionalGPUActor:
    def ping(self):
        print("ray.get_gpu_ids(): {}".format(ray.get_gpu_ids()))


fractional_gpu_actors = [FractionalGPUActor.remote() for _ in range(3)]
# Ray will try to pack GPUs if possible.
[ray.get(fractional_gpu_actors[i].ping.remote()) for i in range(3)]
# (FractionalGPUActor pid=57417) ray.get_gpu_ids(): [0]
# (FractionalGPUActor pid=57416) ray.get_gpu_ids(): [0]
# (FractionalGPUActor pid=57418) ray.get_gpu_ids(): [1]
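The packing behaviour can be sketched with a simple first-fit model. This is only an illustration under assumptions, not Ray's actual allocation code: pack_fractional is a hypothetical function, and exact fractions are used to sidestep floating-point drift. It reproduces the GPU IDs printed in the example above.

```python
from fractions import Fraction


def pack_fractional(num_gpus, requests):
    """Assign each fractional request to the first GPU with enough room.

    Toy first-fit model for illustration only; not Ray's implementation.
    Returns one GPU ID per request, in request order.
    """
    free = [Fraction(1) for _ in range(num_gpus)]
    assignments = []
    for request in requests:
        need = Fraction(str(request))  # exact arithmetic, avoids float drift
        for gpu_id, capacity in enumerate(free):
            if capacity >= need:
                free[gpu_id] -= need
                assignments.append(gpu_id)
                break
        else:
            raise RuntimeError(f"no GPU has {need} capacity left")
    return assignments


# Three actors requesting 0.5 GPU each on a node with 3 GPUs.
print(pack_fractional(3, [0.5, 0.5, 0.5]))  # [0, 0, 1]
```

Filling GPU 0 completely before touching GPU 1 leaves GPU 2 entirely free, so a later request for a whole GPU can still be satisfied.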

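Workers not Releasing GPU Resources#

Currently, when a worker executes a task that uses a GPU (for example, through TensorFlow), the task may allocate memory on the GPU and may not release it when the task finishes executing. This can cause problems the next time a task tries to use the same GPU. To address this, Ray disables worker process reuse between GPU tasks by default, so GPU resources are released when the task's process exits. Because this adds overhead to GPU task scheduling, you can re-enable worker reuse by setting max_calls=0 in the ray.remote decorator.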

# By default, ray will not reuse workers for GPU tasks to prevent
# GPU resource leakage.
@ray.remote(num_gpus=1)
def leak_gpus():
    import tensorflow as tf

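    # This task will allocate memory on the GPU and then never release it.
    # tf.compat.v1.Session also works on TensorFlow 2.x, where tf.Session
    # no longer exists.
    tf.compat.v1.Session()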
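To see why disabling reuse adds overhead, the following toy count (plain Python, not Ray internals) estimates how many worker processes are started when tasks of one remote function run one after another. It assumes that a worker exits after max_calls executions and that 0 means no limit. The function name workers_started is hypothetical.

```python
import math


def workers_started(num_tasks, max_calls):
    """Number of worker processes needed to run tasks sequentially.

    Toy model: each worker exits after `max_calls` tasks; 0 means the
    worker is reused indefinitely.
    """
    if num_tasks == 0:
        return 0
    if max_calls == 0:
        return 1
    return math.ceil(num_tasks / max_calls)


print(workers_started(10, max_calls=1))  # 10: a fresh process per GPU task
print(workers_started(10, max_calls=0))  # 1: the same worker is reused
```

The trade-off is that a reused worker keeps whatever GPU memory earlier tasks left allocated, so max_calls=0 is only appropriate when tasks clean up after themselves.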

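Accelerator Types#

Ray supports resource-specific accelerator types. The accelerator_type option can be used to force a task or actor to run on a node with a specific type of accelerator. Under the hood, the accelerator type option is implemented as a custom resource requirement of "accelerator_type:<type>": 0.001. This forces the task or actor to be placed on a node with that particular accelerator type available. It also lets the multi-node-type autoscaler know that there is demand for that type of resource, which may trigger the launch of new nodes that provide the accelerator.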

from ray.util.accelerators import NVIDIA_TESLA_V100

@ray.remote(num_gpus=1, accelerator_type=NVIDIA_TESLA_V100)
def train(data):
    return "This function was run on a node with a Tesla V100 GPU"

ray.get(train.remote(1))
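The custom-resource encoding described above can be sketched without Ray. The helpers with_accelerator_type and node_can_host are hypothetical, and the node resource dictionaries are made-up examples. The string "V100" is assumed to be the value behind NVIDIA_TESLA_V100, and the amount of 1 that each node advertises is likewise an assumption.

```python
def with_accelerator_type(resources, accelerator_type):
    """Return a copy of `resources` plus the accelerator-type demand.

    Sketch of the encoding described in the text; not Ray's own code.
    """
    demand = dict(resources)
    demand[f"accelerator_type:{accelerator_type}"] = 0.001
    return demand


def node_can_host(node_resources, demand):
    """True if the node offers at least the requested amount of each resource."""
    return all(
        node_resources.get(name, 0) >= amount for name, amount in demand.items()
    )


# "V100" is assumed here to be the value behind NVIDIA_TESLA_V100.
demand = with_accelerator_type({"GPU": 1}, "V100")
v100_node = {"GPU": 4, "accelerator_type:V100": 1}  # made-up node resources
t4_node = {"GPU": 4, "accelerator_type:T4": 1}      # made-up node resources
print(node_can_host(v100_node, demand))  # True
print(node_can_host(t4_node, demand))    # False
```

Because the requested amount is tiny, the accelerator-type resource acts as a placement constraint rather than a meaningful capacity limit.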

See ray.util.accelerators for the available accelerator types. The accelerator types that are currently auto-detected include:

Nvidia GPU

AWS Neuron Cores

AWS Neuron Core Accelerator (Experimental)#

As with Nvidia GPUs, Ray auto-detects AWS Neuron Cores by default. You can specify resources={"neuron_cores": some_number} in the resource requirements of a task or actor to assign Neuron Cores to it.

Note

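Ray supports heterogeneous clusters of GPUs and Neuron Cores, but it does not allow a single task or actor to specify resource requirements for both num_gpus and neuron_cores at the same time.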

import ray
import os
from ray.util.accelerators import AWS_NEURON_CORE

# On trn1.2xlarge instance, there will be 2 neuron cores.
ray.init(resources={"neuron_cores": 2})


@ray.remote(resources={"neuron_cores": 1})
class NeuronCoreActor:
    def info(self):
        ids = ray.get_runtime_context().get_resource_ids()
        print("neuron_core_ids: {}".format(ids["neuron_cores"]))
        print(f"NEURON_RT_VISIBLE_CORES: {os.environ['NEURON_RT_VISIBLE_CORES']}")


@ray.remote(resources={"neuron_cores": 1}, accelerator_type=AWS_NEURON_CORE)
def use_neuron_core_task():
    ids = ray.get_runtime_context().get_resource_ids()
    print("neuron_core_ids: {}".format(ids["neuron_cores"]))
    print(f"NEURON_RT_VISIBLE_CORES: {os.environ['NEURON_RT_VISIBLE_CORES']}")


neuron_core_actor = NeuronCoreActor.remote()
ray.get(neuron_core_actor.info.remote())
ray.get(use_neuron_core_task.remote())
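The restriction in the note above can be expressed as a small pre-flight check. This is a hypothetical helper written for illustration; Ray's own validation logic and error message may differ.

```python
def check_accelerator_request(num_gpus=0, resources=None):
    """Reject requests that ask for GPUs and Neuron Cores at the same time.

    Hypothetical pre-flight check; Ray's own validation may differ.
    """
    resources = resources or {}
    if num_gpus and resources.get("neuron_cores"):
        raise ValueError(
            "specify either num_gpus or neuron_cores for a task or actor, not both"
        )


check_accelerator_request(num_gpus=1)                     # accepted
check_accelerator_request(resources={"neuron_cores": 1})  # accepted
try:
    check_accelerator_request(num_gpus=1, resources={"neuron_cores": 1})
except ValueError as error:
    print(error)
```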

Ray 2.7.2
