Actor Fault Tolerance

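An actor can fail if the actor process dies, or if the owner of the actor dies. The owner of an actor is the worker that created it by calling ActorClass.remote(). Detached actors have no owner process; they are cleaned up when the Ray cluster is destroyed.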

Actor process failure

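Ray can automatically restart an actor after its process fails. This behavior is controlled by max_restarts, which sets the maximum number of times the actor can be restarted. The default value of max_restarts is 0, meaning the actor is not restarted. If it is set to -1, the actor is restarted an unlimited number of times. When an actor is restarted, its state is recreated by rerunning its constructor. Once the specified number of restarts is used up, subsequent actor methods raise a RayActorError.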

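By default, actor tasks execute with at-most-once semantics (max_task_retries=0 in the @ray.remote decorator). This means that if an actor task is submitted to an unreachable actor, Ray reports the error as a RayActorError, a Python-level exception that is raised when ray.get is called. Note that this exception can be raised even if the task did in fact execute successfully. For example, this happens if the actor dies immediately after executing the task.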

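Ray also offers at-least-once execution semantics for actor tasks (max_task_retries=-1 or max_task_retries > 0). This means that if an actor task is submitted to an unreachable actor, the system automatically retries the task. With this option, the system raises a RayActorError to the application only when one of the following occurs: (1) the actor's max_restarts limit has been exceeded and the actor can no longer be restarted, or (2) the max_task_retries limit has been exceeded for this particular task. Note that if the actor is restarting when a task is submitted, this counts as one retry. The retry limit can be made unlimited by setting max_task_retries = -1.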

You can experiment with this behavior by running the following code.

import os
import ray

ray.init()

# This actor kills itself after executing 10 tasks.
@ray.remote(max_restarts=4, max_task_retries=-1)
class Actor:
    def __init__(self):
        self.counter = 0

    def increment_and_possibly_fail(self):
        # Exit after every 10 tasks.
        if self.counter == 10:
            os._exit(0)
        self.counter += 1
        return self.counter

actor = Actor.remote()

# The actor will be reconstructed up to 4 times, so we can execute up to 50
# tasks successfully. The actor is reconstructed by rerunning its constructor.
# Methods that were executing when the actor died will be retried and will not
# raise a `RayActorError`. Retried methods may execute twice, once on the
# failed actor and a second time on the restarted actor.
for _ in range(50):
    counter = ray.get(actor.increment_and_possibly_fail.remote())
    print(counter)  # Prints the sequence 1-10 5 times.

# After the actor has been restarted 4 times, all subsequent methods will
# raise a `RayActorError`.
for _ in range(10):
    try:
        counter = ray.get(actor.increment_and_possibly_fail.remote())
        print(counter)  # Unreachable.
    except ray.exceptions.RayActorError:
        print("FAILURE")  # Prints 10 times.
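
To see where the figures 50 and 10 come from without starting a cluster, the restart budget can be modeled in plain Python. The sketch below is not Ray code: SimulatedActor, ActorDied, and run are hypothetical names, and the model assumes unlimited task retries (as with max_task_retries=-1), so a task that hits a crash is simply rerun on the fresh instance.

```python
class ActorDied(Exception):
    """Stand-in for ray.exceptions.RayActorError in this model."""


class SimulatedActor:
    """Plain-Python model of the example actor; no Ray involved."""

    def __init__(self, max_restarts, tasks_per_life=10):
        self.restarts_left = max_restarts
        self.tasks_per_life = tasks_per_life
        self.dead = False
        self.counter = 0  # the state the constructor sets up

    def call(self):
        """Model one increment_and_possibly_fail call with unlimited retries."""
        if self.dead:
            raise ActorDied("restart budget exhausted")
        if self.counter == self.tasks_per_life:
            # The process exits here. It is either restarted or gone for good.
            if self.restarts_left == 0:
                self.dead = True
                raise ActorDied("restart budget exhausted")
            self.restarts_left -= 1
            self.counter = 0  # constructor reruns, previous state is lost
            # The retried task now continues on the fresh instance.
        self.counter += 1
        return self.counter


def run(max_restarts, attempts):
    """Submit `attempts` calls and collect results and failure count."""
    actor = SimulatedActor(max_restarts)
    results, failures = [], 0
    for _ in range(attempts):
        try:
            results.append(actor.call())
        except ActorDied:
            failures += 1
    return results, failures


results, failures = run(max_restarts=4, attempts=60)
print(len(results), failures)  # 50 10
```

With 4 restarts the actor has 5 lifetimes of 10 tasks each, which gives the 50 successful calls; every call after that fails because no restart is left.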

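For at-least-once actors, the system still guarantees execution order according to the initial submission order. For example, any task submitted after a failed actor task does not execute on the actor until the failed task has been retried successfully. The system does not attempt to re-execute tasks that completed successfully before the failure (unless max_task_retries is nonzero and the task is needed for object reconstruction).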

Note

For async or threaded actors, tasks may execute out of order. After an actor restarts, the system retries only incomplete tasks. Previously completed tasks are not re-executed.

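At-least-once execution is best suited to read-only actors or actors with ephemeral state that does not need to be rebuilt after a failure. For actors with critical state, the application is responsible for recovering that state, for example by checkpointing periodically and restoring from the checkpoint when the actor restarts.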

Actor checkpointing

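max_restarts automatically restarts a crashed actor, but it does not automatically restore the actor's application-level state. Instead, you should checkpoint the actor's state manually and restore it when the actor restarts.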

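For actors that are restarted manually, the actor's creator should manage the checkpoint, and on failure restart the actor and restore its state by hand. This is the recommended approach if you want the creator to decide when the actor is restarted, or if the creator is coordinating actor checkpoints with other execution: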

import os
import sys
import ray
import json
import tempfile
import shutil


@ray.remote(num_cpus=1)
class Worker:
    def __init__(self):
        self.state = {"num_tasks_executed": 0}

    def execute_task(self, crash=False):
        if crash:
            sys.exit(1)

        # Execute the task
        # ...

        # Update the internal state
        self.state["num_tasks_executed"] = self.state["num_tasks_executed"] + 1

    def checkpoint(self):
        return self.state

    def restore(self, state):
        self.state = state


class Controller:
    def __init__(self):
        self.worker = Worker.remote()
        self.worker_state = ray.get(self.worker.checkpoint.remote())

    def execute_task_with_fault_tolerance(self):
        i = 0
        while True:
            i = i + 1
            try:
                ray.get(self.worker.execute_task.remote(crash=(i % 2 == 1)))
                # Checkpoint the latest worker state
                self.worker_state = ray.get(self.worker.checkpoint.remote())
                return
            except ray.exceptions.RayActorError:
                print("Actor crashes, restarting...")
                # Restart the actor and restore the state
                self.worker = Worker.remote()
                ray.get(self.worker.restore.remote(self.worker_state))


controller = Controller()
controller.execute_task_with_fault_tolerance()
controller.execute_task_with_fault_tolerance()
assert ray.get(controller.worker.checkpoint.remote())["num_tasks_executed"] == 2

Alternatively, if you use Ray's automatic actor restart, the actor can checkpoint itself manually and restore from the checkpoint in its constructor:

@ray.remote(max_restarts=-1, max_task_retries=-1)
class ImmortalActor:
    def __init__(self, checkpoint_file):
        self.checkpoint_file = checkpoint_file

        if os.path.exists(self.checkpoint_file):
            # Restore from a checkpoint
            with open(self.checkpoint_file, "r") as f:
                self.state = json.load(f)
        else:
            self.state = {}

    def update(self, key, value):
        import random

        if random.randrange(10) < 5:
            sys.exit(1)

        self.state[key] = value

        # Checkpoint the latest state
        with open(self.checkpoint_file, "w") as f:
            json.dump(self.state, f)

    def get(self, key):
        return self.state[key]


checkpoint_dir = tempfile.mkdtemp()
actor = ImmortalActor.remote(os.path.join(checkpoint_dir, "checkpoint.json"))
ray.get(actor.update.remote("1", 1))
ray.get(actor.update.remote("2", 2))
assert ray.get(actor.get.remote("1")) == 1
shutil.rmtree(checkpoint_dir)

Note

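If the checkpoint is saved to external storage, make sure it is accessible to the entire cluster, because the actor may be restarted on a different node. For example, save the checkpoint to cloud storage (such as S3) or to a shared directory (such as one mounted over NFS).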
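
An actor can also die while it is in the middle of writing a checkpoint, which would leave a truncated file for the restarted constructor to read. A minimal sketch of one way to guard against that, assuming a local POSIX filesystem: write to a temporary file in the same directory and then rename it over the real checkpoint with os.replace. The helper names save_checkpoint and load_checkpoint are hypothetical and are not part of Ray; rename behavior on NFS or object stores may differ and should be checked separately.

```python
import json
import os
import shutil
import tempfile


def save_checkpoint(path, state):
    """Write `state` as JSON so a reader never sees a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the final rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # swap the new checkpoint into place
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or an empty dict if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {}
    with open(path, "r") as f:
        return json.load(f)


checkpoint_dir = tempfile.mkdtemp()
checkpoint_path = os.path.join(checkpoint_dir, "checkpoint.json")

initial = load_checkpoint(checkpoint_path)  # nothing saved yet
save_checkpoint(checkpoint_path, {"1": 1})
save_checkpoint(checkpoint_path, {"1": 1, "2": 2})  # overwrites the first one
restored = load_checkpoint(checkpoint_path)
leftovers = [name for name in os.listdir(checkpoint_dir) if name.endswith(".tmp")]
print(initial, restored, leftovers)

shutil.rmtree(checkpoint_dir)
```

In the ImmortalActor example, these helpers could stand in for the open/json.dump and open/json.load calls in update and the constructor.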

Actor creator failure

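For non-detached actors, the owner of an actor is the worker that created it, that is, the worker that called ActorClass.remote(). As with objects, the actor fate-shares with its owner: if the owner dies, the actor dies too. Ray does not automatically recover an actor whose owner has died, even if the actor has a nonzero max_restarts.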

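Because detached actors have no owner, Ray still restarts them even if their original creator dies. Detached actors continue to be restarted automatically until the maximum number of restarts is reached, the actor is destroyed, or the Ray cluster is destroyed.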

You can try out this behavior with the following code.

import ray
import os
import signal
ray.init()

@ray.remote(max_restarts=-1)
class Actor:
    def ping(self):
        return "hello"

@ray.remote
class Parent:
    def generate_actors(self):
        self.child = Actor.remote()
        self.detached_actor = Actor.options(name="actor", lifetime="detached").remote()
        return self.child, self.detached_actor, os.getpid()

parent = Parent.remote()
actor, detached_actor, pid = ray.get(parent.generate_actors.remote())

os.kill(pid, signal.SIGKILL)

try:
    print("actor.ping:", ray.get(actor.ping.remote()))
except ray.exceptions.RayActorError as e:
    print("Failed to submit actor call", e)
# Failed to submit actor call The actor died unexpectedly before finishing this task.
# 	class_name: Actor
# 	actor_id: 56f541b178ff78470f79c3b601000000
# 	namespace: ea8b3596-7426-4aa8-98cc-9f77161c4d5f
# The actor is dead because because all references to the actor were removed.

try:
    print("detached_actor.ping:", ray.get(detached_actor.ping.remote()))
except ray.exceptions.RayActorError as e:
    print("Failed to submit detached actor call", e)
# detached_actor.ping: hello

Force-killing a misbehaving actor

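Sometimes application-level code causes an actor to hang or leak resources. In these cases, Ray lets you recover from the failure by terminating the actor manually. You can do this by calling ray.kill on any handle to the actor. Note that it does not need to be the original handle to the actor.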

If max_restarts is set, you can also allow Ray to restart the actor automatically by passing no_restart=False to ray.kill.

Ray 2.7.2
