ray.data.preprocessors.Concatenator
- class ray.data.preprocessors.Concatenator(output_column_name: str = 'concat_out', include: Optional[List[str]] = None, exclude: Optional[Union[str, List[str]]] = None, dtype: Optional[numpy.dtype] = None, raise_if_missing: bool = False)
Bases: ray.data.preprocessor.Preprocessor

Combine numeric columns into a column of type TensorDtype.

This preprocessor concatenates numeric columns and stores the result in a new column. The new column contains TensorArrayElement objects of shape \((m,)\), where \(m\) is the number of columns concatenated. The \(m\) concatenated columns are dropped after concatenation.

Examples
>>> import numpy as np
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import Concatenator

Concatenator combines numeric columns into a column of TensorDtype.

>>> df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9]})
>>> ds = ray.data.from_pandas(df)
>>> concatenator = Concatenator()
>>> concatenator.fit_transform(ds).to_pandas()
   concat_out
0  [0.0, 0.5]
1  [3.0, 0.2]
2  [1.0, 0.9]

By default, the created column is called "concat_out", but you can specify a different name.

>>> concatenator = Concatenator(output_column_name="tensor")
>>> concatenator.fit_transform(ds).to_pandas()
       tensor
0  [0.0, 0.5]
1  [3.0, 0.2]
2  [1.0, 0.9]

Sometimes, you might not want to concatenate all of the columns in your dataset. In this case, you can exclude columns with the exclude parameter.

>>> df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9], "Y": ["blue", "orange", "blue"]})
>>> ds = ray.data.from_pandas(df)
>>> concatenator = Concatenator(exclude=["Y"])
>>> concatenator.fit_transform(ds).to_pandas()
        Y  concat_out
0    blue  [0.0, 0.5]
1  orange  [3.0, 0.2]
2    blue  [1.0, 0.9]

Alternatively, you can specify which columns to concatenate with the include parameter.

>>> concatenator = Concatenator(include=["X0", "X1"])
>>> concatenator.fit_transform(ds).to_pandas()
        Y  concat_out
0    blue  [0.0, 0.5]
1  orange  [3.0, 0.2]
2    blue  [1.0, 0.9]

Note that if a column is in both include and exclude, the column is excluded.

>>> concatenator = Concatenator(include=["X0", "X1", "Y"], exclude=["Y"])
>>> concatenator.fit_transform(ds).to_pandas()
        Y  concat_out
0    blue  [0.0, 0.5]
1  orange  [3.0, 0.2]
2    blue  [1.0, 0.9]

By default, the concatenated tensor has a dtype common to the input columns. However, you can also explicitly set the dtype with the dtype parameter.

>>> concatenator = Concatenator(include=["X0", "X1"], dtype=np.float32)
>>> concatenator.fit_transform(ds)
Dataset(num_blocks=1, num_rows=3, schema={Y: object, concat_out: TensorDtype(shape=(2,), dtype=float32)})
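
The shape and dtype behavior in the examples above can be illustrated with a minimal NumPy-only sketch. This is not Ray's implementation. It only mimics the row-wise concatenation on the same X0 and X1 values, under the assumption that the default dtype follows NumPy's usual type promotion. The variable names are made up for illustration.

```python
import numpy as np

# Column data from the examples above: X0 holds integers, X1 holds floats.
x0 = np.array([0, 3, 1])
x1 = np.array([0.5, 0.2, 0.9])

# Place the m = 2 columns side by side so each row becomes a vector of shape (m,).
rows = np.column_stack([x0, x1])

# Assumption: with no explicit dtype, integer and float inputs are promoted
# to a common floating-point dtype by NumPy's promotion rules.
common_dtype = np.result_type(x0, x1)

# An explicit dtype, analogous to passing dtype=np.float32, is a plain cast here.
rows32 = rows.astype(np.float32)

print(rows.shape)    # (3, 2): three rows, each of shape (2,)
print(common_dtype)  # float64
print(rows32.dtype)  # float32
```

Each row of `rows` corresponds to one entry of the `concat_out` column in the examples, such as `[3.0, 0.2]` for the second row.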
- Parameters
output_column_name – The desired name for the new column. Defaults to "concat_out".
include – A list of columns to concatenate. If None, all columns are concatenated.
exclude – A column name or a list of columns to exclude from concatenation. If a column is in both include and exclude, the column is excluded from concatenation.
dtype – The dtype to convert the output tensors to. If unspecified, the dtype is determined by standard coercion rules.
raise_if_missing – If True, an error is raised if any of the columns in include or exclude don’t exist. Defaults to False.
- Raises
ValueError – if raise_if_missing is True and a column in include or exclude doesn’t exist in the dataset.
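
The interaction between include, exclude, and raise_if_missing described above can be sketched in plain Python. The helper `select_columns` is hypothetical and is not part of Ray. It assumes that the dataset's column order is preserved and that missing names are silently skipped when raise_if_missing is False. The real class may differ on both points.

```python
from typing import List, Optional, Sequence, Union


def select_columns(
    available: Sequence[str],
    include: Optional[List[str]] = None,
    exclude: Optional[Union[str, List[str]]] = None,
    raise_if_missing: bool = False,
) -> List[str]:
    """Hypothetical helper: return the columns that would be concatenated."""
    # Normalize exclude to a list, since the signature accepts a str or a list.
    if exclude is None:
        exclude = []
    elif isinstance(exclude, str):
        exclude = [exclude]

    # Only validate the requested names when asked to.
    if raise_if_missing:
        missing = [c for c in (include or []) + exclude if c not in available]
        if missing:
            raise ValueError(f"Missing columns: {missing}")

    # include=None means "start from every column".
    if include is None:
        candidates = list(available)
    else:
        candidates = [c for c in available if c in include]

    # exclude wins when a column appears in both lists.
    return [c for c in candidates if c not in exclude]


columns = ["X0", "X1", "Y"]
print(select_columns(columns, exclude=["Y"]))
print(select_columns(columns, include=["X0", "X1", "Y"], exclude="Y"))
```

Both calls select X0 and X1, matching the documented rule that exclude takes precedence over include.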
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
deserialize(serialized) – Load the original preprocessor serialized via self.serialize().
fit(ds) – Fit this Preprocessor to the Dataset.
fit_transform(ds) – Fit this Preprocessor to the Dataset and then transform the Dataset.
preferred_batch_format() – Batch format hint for upstream producers to try yielding best block format.
serialize() – Return this preprocessor serialized as a string.
transform(ds) – Transform the given dataset.
transform_batch(data) – Transform a single batch of data.
transform_stats() – Return Dataset stats for the most recent transform call, if any.