ray.data.preprocessors.Concatenator

class ray.data.preprocessors.Concatenator(output_column_name: str = 'concat_out', include: Optional[List[str]] = None, exclude: Optional[Union[str, List[str]]] = None, dtype: Optional[numpy.dtype] = None, raise_if_missing: bool = False)

Bases: ray.data.preprocessor.Preprocessor

Combine numeric columns into a column of type TensorDtype.

This preprocessor concatenates numeric columns and stores the result in a new column. The new column contains TensorArrayElement objects of shape \((m,)\), where \(m\) is the number of columns concatenated. The \(m\) concatenated columns are dropped after concatenation.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import Concatenator

Concatenator combines numeric columns into a column of TensorDtype.

>>> df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9]})
>>> ds = ray.data.from_pandas(df)  
>>> concatenator = Concatenator()
>>> concatenator.fit_transform(ds).to_pandas()  
   concat_out
0  [0.0, 0.5]
1  [3.0, 0.2]
2  [1.0, 0.9]
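To make the output shape concrete, the row-wise stacking can be approximated in plain pandas and NumPy. This is only an illustrative sketch and not how Ray implements the preprocessor; the variable name stacked is chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9]})

# Place the m = 2 numeric columns side by side. Each row of the result is
# one vector of shape (m,), analogous to one element of the tensor column.
stacked = df[["X0", "X1"]].to_numpy()

print(stacked.shape)     # (3, 2): three rows, each a vector of shape (2,)
print(stacked[1].shape)  # (2,)
print(stacked.dtype)     # float64, because int64 and float64 promote to float64
```

Each row of stacked corresponds to one entry of concat_out in the output above.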

By default, the created column is called "concat_out", but you can specify a different name.

>>> concatenator = Concatenator(output_column_name="tensor")
>>> concatenator.fit_transform(ds).to_pandas()  
       tensor
0  [0.0, 0.5]
1  [3.0, 0.2]
2  [1.0, 0.9]

Sometimes, you might not want to concatenate all of the columns in your dataset. In this case, you can exclude columns with the exclude parameter.

>>> df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9], "Y": ["blue", "orange", "blue"]})
>>> ds = ray.data.from_pandas(df)  
>>> concatenator = Concatenator(exclude=["Y"])
>>> concatenator.fit_transform(ds).to_pandas()  
        Y  concat_out
0    blue  [0.0, 0.5]
1  orange  [3.0, 0.2]
2    blue  [1.0, 0.9]

Alternatively, you can specify which columns to concatenate with the include parameter.

>>> concatenator = Concatenator(include=["X0", "X1"])
>>> concatenator.fit_transform(ds).to_pandas()  
        Y  concat_out
0    blue  [0.0, 0.5]
1  orange  [3.0, 0.2]
2    blue  [1.0, 0.9]

Note that if a column is in both include and exclude, the column is excluded.

>>> concatenator = Concatenator(include=["X0", "X1", "Y"], exclude=["Y"])
>>> concatenator.fit_transform(ds).to_pandas()  
        Y  concat_out
0    blue  [0.0, 0.5]
1  orange  [3.0, 0.2]
2    blue  [1.0, 0.9]
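The precedence rule between include and exclude can be sketched as a small selection function. The helper select_columns is hypothetical and written only to illustrate the documented behavior; it is not part of Ray, and the column ordering it produces is an assumption.

```python
from typing import List, Optional, Union


def select_columns(
    columns: List[str],
    include: Optional[List[str]] = None,
    exclude: Optional[Union[str, List[str]]] = None,
) -> List[str]:
    """Hypothetical helper: pick the columns that would be concatenated."""
    # The signature allows exclude to be a single column name.
    if isinstance(exclude, str):
        exclude = [exclude]
    excluded = set(exclude or [])
    # With include=None, every column is a candidate.
    candidates = columns if include is None else [c for c in include if c in columns]
    # exclude wins whenever a column appears in both lists.
    return [c for c in candidates if c not in excluded]


columns = ["X0", "X1", "Y"]
print(select_columns(columns, include=["X0", "X1", "Y"], exclude=["Y"]))
print(select_columns(columns, exclude="Y"))
```

Both calls select X0 and X1, which matches the examples above where Y is left out of concat_out.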

By default, the concatenated tensor uses a dtype common to the input columns. However, you can also set the dtype explicitly with the dtype parameter.

>>> concatenator = Concatenator(include=["X0", "X1"], dtype=np.float32)
>>> concatenator.fit_transform(ds)  
Dataset(num_blocks=1, num_rows=3, schema={Y: object, concat_out: TensorDtype(shape=(2,), dtype=float32)})
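The difference between the default dtype and an explicit one can be illustrated with NumPy alone. This sketch assumes that the common dtype follows NumPy's type promotion rules, which is consistent with the float outputs shown earlier but is not confirmed by this page.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9]})

# Assumption: the common dtype is what NumPy promotion gives for the inputs.
common = np.result_type(*df.dtypes)
print(common)  # float64 (int64 and float64 promote to float64)

# Default conversion uses the promoted dtype.
default_block = df.to_numpy()

# An explicit dtype overrides promotion, similar in spirit to dtype=np.float32.
explicit_block = df.to_numpy(dtype=np.float32)

print(default_block.dtype, explicit_block.dtype)
```

Casting to float32 halves the memory per element compared with float64, at the cost of precision.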

Parameters

output_column_name – The desired name for the new column. Defaults to "concat_out".
include – A list of columns to concatenate. If None, all columns are concatenated.
exclude – A column name or a list of columns to exclude from concatenation. If a column is in both include and exclude, the column is excluded from concatenation.
dtype – The dtype to convert the output tensors to. If unspecified, the dtype is determined by standard coercion rules.
raise_if_missing – If True, an error is raised if any of the columns in include or exclude don’t exist. Defaults to False.

Raises

ValueError – if raise_if_missing is True and a column in include or exclude doesn’t exist in the dataset.
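The effect of raise_if_missing can be sketched as a simple validation step. The function check_columns below is hypothetical and only mirrors the documented contract; Ray's actual check and error message may differ.

```python
from typing import List, Optional


def check_columns(
    dataset_columns: List[str],
    include: Optional[List[str]] = None,
    exclude: Optional[List[str]] = None,
    raise_if_missing: bool = False,
) -> List[str]:
    """Hypothetical helper: report requested columns absent from the dataset."""
    requested = list(include or []) + list(exclude or [])
    missing = [c for c in requested if c not in dataset_columns]
    # Only raise when the caller opted in; otherwise missing names are tolerated.
    if missing and raise_if_missing:
        raise ValueError(f"Missing columns: {missing}")
    return missing


# Default behavior: the unknown column "Z" is reported but does not raise.
print(check_columns(["X0", "X1"], include=["X0", "Z"]))

# With raise_if_missing=True the same request raises ValueError.
try:
    check_columns(["X0", "X1"], include=["X0", "Z"], raise_if_missing=True)
except ValueError as err:
    print(err)
```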

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

`deserialize`(serialized)	Load the original preprocessor serialized via `self.serialize()`.
`fit`(ds)	Fit this Preprocessor to the Dataset.
`fit_transform`(ds)	Fit this Preprocessor to the Dataset and then transform the Dataset.
`preferred_batch_format`()	Batch format hint that lets upstream producers try to yield the best block format.
`serialize`()	Return this preprocessor serialized as a string.
`transform`(ds)	Transform the given dataset.
`transform_batch`(data)	Transform a single batch of data.
`transform_stats`()	Return Dataset stats for the most recent transform call, if any.

Ray 2.7.2
