Best practices in production
This section helps you:
- Understand best practices when operating Serve in production
- Learn more about managing Serve with the Serve CLI
- Configure your HTTP requests when querying Serve
CLI best practices
This section summarizes the best practices for deploying to production using the Serve CLI:
- Use serve run to manually test and improve your Serve application locally (see the sketch after this list).
- Use serve build to create a Serve config file for your Serve application.
- For development, put your Serve application's code in a remote repository and manually configure the working_dir or py_modules fields in your Serve config file's runtime_env to point to that repository.
- For production, put your Serve application's code in a custom Docker image instead of a runtime_env. See this tutorial to learn how to create custom Docker images and deploy them on KubeRay.
- Use serve status to track your Serve application's health and deployment progress. See the monitoring guide for more info.
- Use serve config to check the latest config that your Serve application received. This config is its goal state. See the monitoring guide for more info.
- Make lightweight configuration updates (for example, num_replicas or user_config changes) by modifying your Serve config file and redeploying it with serve deploy.
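As a concrete starting point, both serve run and serve build take an import path to the bound application in your code. The following is a minimal sketch of such an application; the module name app.py, the Greeter deployment, and its response are illustrative assumptions, not part of this guide:

# app.py -- a minimal, illustrative Serve application
from ray import serve
from starlette.requests import Request


@serve.deployment(num_replicas=2)  # A lightweight setting you can later change via the config file
class Greeter:
    async def __call__(self, request: Request) -> str:
        return "Hello from Serve!"


# Entry point for the CLI: `serve run app:app` tests it locally, and
# `serve build app:app -o config.yaml` generates a Serve config file for it.
app = Greeter.bind()

After you deploy the generated config file with serve deploy, serve status and serve config report on the running application.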
Best practices for HTTP requests
Most examples in these docs use straightforward GET or POST requests with Python's requests library, such as:
import requests
response = requests.get("http://localhost:8000/")
result = response.text
This pattern is useful for prototyping, but it isn't sufficient for production. In production, HTTP requests should use:
- Retries: Requests may occasionally fail due to transient issues (e.g., slow network, node failure, power outage, spike in traffic, etc.). Retry failed requests a handful of times to account for these issues.
- Exponential backoff: To avoid bombarding the Serve application with retries during a transient error, apply an exponential backoff on failure. Each retry should wait exponentially longer than the previous one before running. For example, the first retry may wait 0.1s after a failure, and subsequent retries wait 0.4s (4 x 0.1), 1.6s, 6.4s, 25.6s, etc. after the failure.
- Timeouts: Add a timeout to each retry to prevent requests from hanging. The timeout should be longer than the application's latency to give your application enough time to process requests. Additionally, set an end-to-end timeout in the Serve application, so slow requests don't bottleneck replicas.
import requests
from requests.adapters import HTTPAdapter, Retry

session = requests.Session()
retries = Retry(
    total=5,  # Retry up to 5 times in total
    backoff_factor=1,  # Wait exponentially longer between retries
    status_forcelist=[500, 501, 502, 503, 504],  # Retry on server errors
)
session.mount("http://", HTTPAdapter(max_retries=retries))

response = session.get("http://localhost:8000/", timeout=10)  # Per-request timeout in seconds
result = response.text
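The same retry-enabled session also applies to POST requests. A short usage sketch, where the JSON payload and its schema are illustrative assumptions:

# Reuse the session so retries, backoff, and timeouts apply to POST requests too
response = session.post(
    "http://localhost:8000/",
    json={"prompt": "hello"},  # Example payload; match your application's schema
    timeout=10,
)
response.raise_for_status()  # Surface errors that persisted through all retries
result = response.text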