ray.data.preprocessors.Categorizer
- class ray.data.preprocessors.Categorizer(columns: List[str], dtypes: Optional[Dict[str, pandas.core.dtypes.dtypes.CategoricalDtype]] = None)
Bases: ray.data.preprocessor.Preprocessor
Convert columns to pd.CategoricalDtype.
Use this preprocessor with frameworks that have built-in support for pd.CategoricalDtype, like LightGBM.
Warning
If you don’t specify dtypes, fit this preprocessor before splitting your dataset into train and test splits. This ensures categories are consistent across splits.
Examples
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import Categorizer
>>>
>>> df = pd.DataFrame(
...     {
...         "sex": ["male", "female", "male", "female"],
...         "level": ["L4", "L5", "L3", "L4"],
...     })
>>> ds = ray.data.from_pandas(df)
>>> categorizer = Categorizer(columns=["sex", "level"])
>>> categorizer.fit_transform(ds).schema().types
[CategoricalDtype(categories=['female', 'male'], ordered=False), CategoricalDtype(categories=['L3', 'L4', 'L5'], ordered=False)]
If you know the categories in advance, you can specify them with the dtypes parameter.
>>> categorizer = Categorizer(
...     columns=["sex", "level"],
...     dtypes={"level": pd.CategoricalDtype(["L3", "L4", "L5", "L6"], ordered=True)},
... )
>>> categorizer.fit_transform(ds).schema().types
[CategoricalDtype(categories=['female', 'male'], ordered=False), CategoricalDtype(categories=['L3', 'L4', 'L5', 'L6'], ordered=True)]
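The warning above about fitting before splitting can be illustrated with pandas alone. The following is a minimal sketch that does not call Ray or Categorizer; the split boundaries and variable names are made up for illustration. It shows that inferring categories separately on each split yields different category sets, while a dtype built from the full data keeps them aligned.

```python
import pandas as pd

df = pd.DataFrame({"level": ["L4", "L5", "L3", "L4"]})

# Hypothetical split: first two rows for training, last two for testing.
train, test = df.iloc[:2], df.iloc[2:]

# Inferring categories per split: each split only sees its own values.
train_inferred = train["level"].astype("category")
test_inferred = test["level"].astype("category")
mismatch = list(train_inferred.cat.categories) != list(test_inferred.cat.categories)

# Building one dtype from the full column before splitting keeps both
# splits on the same category set, so integer codes mean the same thing.
shared = pd.CategoricalDtype(sorted(df["level"].unique()))
train_shared = train["level"].astype(shared)
test_shared = test["level"].astype(shared)
consistent = train_shared.dtype == test_shared.dtype

print(mismatch, consistent)
print(train_shared.cat.codes.tolist(), test_shared.cat.codes.tolist())
```

With per-split inference, the same string can map to a different integer code in each split, which is what fitting on the whole dataset (or passing dtypes) avoids.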
- Parameters
columns – The columns to convert to pd.CategoricalDtype.
dtypes – An optional dictionary that maps columns to pd.CategoricalDtype objects. If you don’t include a column in dtypes, the categories are inferred.
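When building a pd.CategoricalDtype to pass in dtypes, it may help to check how pandas itself treats the declared categories. The sketch below uses only pandas; the sample values (including the undeclared "L7") are invented for illustration, and that Categorizer applies the dtype in exactly this way is an assumption, not something this page states.

```python
import pandas as pd

# Same dtype as in the example above: "L6" is declared but never observed.
level_dtype = pd.CategoricalDtype(["L3", "L4", "L5", "L6"], ordered=True)

# "L7" is a hypothetical value that is NOT among the declared categories.
levels = pd.Series(["L4", "L5", "L3", "L7"]).astype(level_dtype)

# Declared-but-unseen categories are kept in the dtype.
categories = list(levels.cat.categories)

# Values outside the declared categories become missing (code -1).
codes = levels.cat.codes.tolist()
missing = levels.isna().tolist()

# ordered=True enables comparisons against a declared category.
above_l3 = (levels > "L3").tolist()

print(categories, codes, missing, above_l3)
```

If pandas behaviour carries over, a fixed dtype makes the category set stable across datasets, at the cost of silently turning undeclared values into missing values.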
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
deserialize(serialized) – Load the original preprocessor serialized via self.serialize().
fit(ds) – Fit this Preprocessor to the Dataset.
fit_transform(ds) – Fit this Preprocessor to the Dataset and then transform the Dataset.
preferred_batch_format() – Batch format hint for upstream producers to try yielding best block format.
serialize() – Return this preprocessor serialized as a string.
transform(ds) – Transform the given dataset.
transform_batch(data) – Transform a single batch of data.
transform_stats() – Return Dataset stats for the most recent transform call, if any.