ray.data.preprocessors.Categorizer

class ray.data.preprocessors.Categorizer(columns: List[str], dtypes: Optional[Dict[str, pandas.core.dtypes.dtypes.CategoricalDtype]] = None)

Bases: ray.data.preprocessor.Preprocessor

Convert columns to pd.CategoricalDtype.

Use this preprocessor with frameworks that have built-in support for pd.CategoricalDtype, such as LightGBM.

Warning

If you don’t specify dtypes, fit this preprocessor before splitting your dataset into train and test splits. This ensures categories are consistent across splits.
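The reason behind this warning can be sketched with plain pandas, independent of Ray. The split below (first three rows for training, last row for testing) is a hypothetical one chosen for illustration; it is not how Ray splits datasets. When categories are inferred separately per split, the same value can end up with a different category set and a different integer code, whereas a dtype derived from the full data keeps both splits consistent.

```python
import pandas as pd

df = pd.DataFrame({"level": ["L4", "L5", "L3", "L4"]})

# Hypothetical split for illustration: first three rows train, last row test.
train, test = df.iloc[:3], df.iloc[3:]

# Inferring categories independently on each split gives different dtypes.
train_inferred = train["level"].astype("category")
test_inferred = test["level"].astype("category")
print(train_inferred.cat.categories.tolist())  # ['L3', 'L4', 'L5']
print(test_inferred.cat.categories.tolist())   # ['L4']

# The same value "L4" is encoded with a different integer code in each split.
train_code_l4 = train_inferred.cat.codes.iloc[0]  # 1
test_code_l4 = test_inferred.cat.codes.iloc[0]    # 0

# Deriving one dtype from the full data keeps the splits consistent.
shared = pd.CategoricalDtype(sorted(df["level"].unique()))
train_shared = train["level"].astype(shared)
test_shared = test["level"].astype(shared)
print(train_shared.dtype == test_shared.dtype)  # True
```

Fitting the preprocessor on the full dataset before splitting, or passing explicit dtypes, plays the role of the shared dtype in this sketch.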

Examples

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import Categorizer
>>>
>>> df = pd.DataFrame(
... {
...     "sex": ["male", "female", "male", "female"],
...     "level": ["L4", "L5", "L3", "L4"],
... })
>>> ds = ray.data.from_pandas(df)  
>>> categorizer = Categorizer(columns=["sex", "level"])
>>> categorizer.fit_transform(ds).schema().types  
[CategoricalDtype(categories=['female', 'male'], ordered=False), CategoricalDtype(categories=['L3', 'L4', 'L5'], ordered=False)]

If you know the categories in advance, you can specify the categories with the dtypes parameter.

>>> categorizer = Categorizer(
...     columns=["sex", "level"],
...     dtypes={"level": pd.CategoricalDtype(["L3", "L4", "L5", "L6"], ordered=True)},
... )
>>> categorizer.fit_transform(ds).schema().types  
[CategoricalDtype(categories=['female', 'male'], ordered=False), CategoricalDtype(categories=['L3', 'L4', 'L5', 'L6'], ordered=True)]

Parameters

columns – The columns to convert to pd.CategoricalDtype.
dtypes – An optional dictionary that maps column names to pd.CategoricalDtype objects. If you don’t include a column in dtypes, the preprocessor infers that column’s categories when it is fit.
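As a rough illustration of what a dtypes entry describes, the following sketch builds such a dictionary and applies one of its dtypes with plain pandas. It shows pandas behavior only and assumes nothing about how Categorizer applies the dtype internally; the value "L7" is a made-up example of a value outside the declared categories.

```python
import pandas as pd

# A dictionary in the shape the dtypes parameter expects:
# column name -> pd.CategoricalDtype.
level_dtype = pd.CategoricalDtype(["L3", "L4", "L5", "L6"], ordered=True)
dtypes = {"level": level_dtype}

# "L7" is a hypothetical value that is not among the declared categories.
raw = pd.Series(["L4", "L3", "L5", "L7"])
converted = raw.astype(dtypes["level"])

# In pandas, values outside the declared categories become missing (NaN).
print(converted.isna().tolist())  # [False, False, False, True]

# ordered=True enables comparisons that follow the category order.
above_l3 = (converted > "L3").tolist()
print(above_l3)  # [True, False, True, False]
```

Declaring a category such as "L6" that does not yet occur in the data is allowed, which is useful when you know the full set of possible values in advance.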

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

`deserialize`(serialized)	Load the original preprocessor serialized via `self.serialize()`.
`fit`(ds)	Fit this Preprocessor to the Dataset.
`fit_transform`(ds)	Fit this Preprocessor to the Dataset and then transform the Dataset.
`preferred_batch_format`()	Batch format hint that lets upstream producers try to yield the best block format.
`serialize`()	Return this preprocessor serialized as a string.
`transform`(ds)	Transform the given dataset.
`transform_batch`(data)	Transform a single batch of data.
`transform_stats`()	Return Dataset stats for the most recent transform call, if any.

Ray 2.7.2
