ray.data.Dataset.map_batches
ray.data.Dataset.map_batches#
- Dataset.map_batches(fn: Union[Callable[[Union[pyarrow.Table, pandas.DataFrame, Dict[str, numpy.ndarray]]], Union[pyarrow.Table, pandas.DataFrame, Dict[str, numpy.ndarray]]], Callable[[Union[pyarrow.Table, pandas.DataFrame, Dict[str, numpy.ndarray]]], Iterator[Union[pyarrow.Table, pandas.DataFrame, Dict[str, numpy.ndarray]]]], _CallableClassProtocol], *, batch_size: Union[int, None, Literal['default']] = 'default', compute: Optional[ray.data._internal.compute.ComputeStrategy] = None, batch_format: Optional[str] = 'default', zero_copy_batch: bool = False, fn_args: Optional[Iterable[Any]] = None, fn_kwargs: Optional[Dict[str, Any]] = None, fn_constructor_args: Optional[Iterable[Any]] = None, fn_constructor_kwargs: Optional[Dict[str, Any]] = None, num_cpus: Optional[float] = None, num_gpus: Optional[float] = None, **ray_remote_args) Dataset[source]#
Apply the given function to batches of data.
This method is useful for preprocessing data and performing inference. To learn more, see Transforming batches.
You can use either Ray Tasks or Ray Actors to perform the transformation. By default, Ray Data uses Tasks. To use Actors, see Transforming batches with actors.
Tip
If
fndoesn’t mutate its input, setzero_copy_batch=Trueto improve performance and decrease memory utilization.Examples
Call
map_batches()to transform your data.from typing import Dict import numpy as np import ray def add_dog_years(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: batch["age_in_dog_years"] = 7 * batch["age"] return batch ds = ( ray.data.from_items([ {"name": "Luna", "age": 4}, {"name": "Rory", "age": 14}, {"name": "Scout", "age": 9}, ]) .map_batches(add_dog_years) ) ds.show()
{'name': 'Luna', 'age': 4, 'age_in_dog_years': 28} {'name': 'Rory', 'age': 14, 'age_in_dog_years': 98} {'name': 'Scout', 'age': 9, 'age_in_dog_years': 63}If your function returns large objects, yield outputs in chunks.
from typing import Dict import ray import numpy as np def map_fn_with_large_output(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]: for i in range(3): yield {"large_output": np.ones((100, 1000))} ds = ( ray.data.from_items([1]) .map_batches(map_fn_with_large_output) )
You can also use
map_batches()to perform offline inference. To learn more, see End-to-end: Offline Batch Inference.- Parameters
fn – The function or generator to apply to a record batch, or a class type that can be instantiated to create such a callable. Callable classes are only supported for the actor compute strategy. Note
fnmust be pickle-able.batch_size – The desired number of rows in each batch, or
Noneto use entire blocks as batches (blocks may contain different numbers of rows). The actual size of the batch provided tofnmay be smaller thanbatch_sizeifbatch_sizedoesn’t evenly divide the block(s) sent to a given map task. Default batch_size is 4096 with “default”.compute – Either “tasks” (default) to use Ray Tasks or an
ActorPoolStrategyto use an autoscaling actor pool.batch_format – If
"default"or"numpy", batches areDict[str, numpy.ndarray]. If"pandas", batches arepandas.DataFrame.zero_copy_batch – Whether
fnshould be provided zero-copy, read-only batches. If this isTrueand no copy is required for thebatch_formatconversion, the batch is a zero-copy, read-only view on data in Ray’s object store, which can decrease memory utilization and improve performance. If this isFalse, the batch is writable, which requires an extra copy to guarantee. Iffnmutates its input, this needs to beFalsein order to avoid “assignment destination is read-only” or “buffer source array is read-only” errors. Default isFalse.fn_args – Positional arguments to pass to
fnafter the first argument. These arguments are top-level arguments to the underlying Ray task.fn_kwargs – Keyword arguments to pass to
fn. These arguments are top-level arguments to the underlying Ray task.fn_constructor_args – Positional arguments to pass to
fn’s constructor. You can only provide this iffnis a callable class. These arguments are top-level arguments in the underlying Ray actor construction task.fn_constructor_kwargs – Keyword arguments to pass to
fn’s constructor. This can only be provided iffnis a callable class. These arguments are top-level arguments in the underlying Ray actor construction task.num_cpus – The number of CPUs to reserve for each parallel map worker.
num_gpus – The number of GPUs to reserve for each parallel map worker. For example, specify
num_gpus=1to request 1 GPU for each parallel map worker.ray_remote_args – Additional resource requirements to request from ray for each map worker.
Note
The size of the batches provided to
fnmight be smaller than the specifiedbatch_sizeifbatch_sizedoesn’t evenly divide the block(s) sent to a given map task.See also
iter_batches()Call this function to iterate over batches of data.
flat_map()Call this method to create new records from existing ones. Unlike
map(), a function passed toflat_map()can return multiple records.map()Call this method to transform one record at time.