.. _air-preprocessors: 使用预处理器 =================== Data preprocessing is a common technique for transforming raw data into features for a machine learning model. In general, you may want to apply the same preprocessing logic to your offline training data and online inference data. This page covers *preprocessors*, which are a higher level API on top of existing Ray Data operations like ``map_batches``, targeted towards tabular and structured data use cases. If you are working with tabular data, you should use Ray Data preprocessors. However, the recommended way to perform preprocessing for unstructured data is to :ref:`use existing Ray Data operations ` instead of preprocessors. .. https://docs.google.com/drawings/d/1ZIbsXv5vvwTVIEr2aooKxuYJ_VL7-8VMNlRinAiPaTI/edit .. image:: images/preprocessors.svg Overview -------- The :class:`Preprocessor ` class has four public methods: #. :meth:`fit() `: Compute state information about a :class:`Dataset ` (for example, the mean or standard deviation of a column) and save it to the :class:`Preprocessor `. This information is used to perform :meth:`transform() `, and the method is typically called on a training dataset. #. :meth:`transform() `: Apply a transformation to a :class:`Dataset `. If the :class:`Preprocessor ` is stateful, then :meth:`fit() ` must be called first. This method is typically called on training, validation, and test datasets. #. :meth:`transform_batch() `: Apply a transformation to a single :class:`batch ` of data. This method is typically called on online or offline inference data. #. :meth:`fit_transform() `: Syntactic sugar for calling both :meth:`fit() ` and :meth:`transform() ` on a :class:`Dataset `. To show these methods in action, walk through a basic example. First, set up two simple Ray ``Dataset``\s. .. literalinclude:: doc_code/preprocessors.py :language: python :start-after: __preprocessor_setup_start__ :end-before: __preprocessor_setup_end__ Next, ``fit`` the ``Preprocessor`` on one ``Dataset``, and then ``transform`` both ``Dataset``\s with this fitted information. .. literalinclude:: doc_code/preprocessors.py :language: python :start-after: __preprocessor_fit_transform_start__ :end-before: __preprocessor_fit_transform_end__ Finally, call ``transform_batch`` on a single batch of data. .. literalinclude:: doc_code/preprocessors.py :language: python :start-after: __preprocessor_transform_batch_start__ :end-before: __preprocessor_transform_batch_end__ The most common way of using a preprocessor is by using it on a :ref:`Ray Data dataset `, which is then passed to a Ray Train :ref:`Trainer `. See also: * Ray Train's data preprocessing and ingest section for :ref:`PyTorch ` * Ray Train's data preprocessing and ingest section for :ref:`LightGBM/XGBoost ` Types of preprocessors ---------------------- Built-in preprocessors ~~~~~~~~~~~~~~~~~~~~~~ Ray Data provides a handful of preprocessors out of the box. **Generic preprocessors** .. autosummary:: :nosignatures: ray.data.preprocessors.Concatenator ray.data.preprocessor.Preprocessor ray.data.preprocessors.SimpleImputer **Categorical encoders** .. autosummary:: :nosignatures: ray.data.preprocessors.Categorizer ray.data.preprocessors.LabelEncoder ray.data.preprocessors.MultiHotEncoder ray.data.preprocessors.OneHotEncoder ray.data.preprocessors.OrdinalEncoder **Feature scalers** .. autosummary:: :nosignatures: ray.data.preprocessors.MaxAbsScaler ray.data.preprocessors.MinMaxScaler ray.data.preprocessors.Normalizer ray.data.preprocessors.PowerTransformer ray.data.preprocessors.RobustScaler ray.data.preprocessors.StandardScaler **Utilities** .. autosummary:: :nosignatures: ray.data.Dataset.train_test_split Which preprocessor should you use? ---------------------------------- The type of preprocessor you use depends on what your data looks like. This section provides tips on handling common data formats. Categorical data ~~~~~~~~~~~~~~~~ Most models expect numerical inputs. To represent your categorical data in a way your model can understand, encode categories using one of the preprocessors described below. .. list-table:: :header-rows: 1 * - Categorical Data Type - Example - Preprocessor * - Labels - ``"cat"``, ``"dog"``, ``"airplane"`` - :class:`~ray.data.preprocessors.LabelEncoder` * - Ordered categories - ``"bs"``, ``"md"``, ``"phd"`` - :class:`~ray.data.preprocessors.OrdinalEncoder` * - Unordered categories - ``"red"``, ``"green"``, ``"blue"`` - :class:`~ray.data.preprocessors.OneHotEncoder` * - Lists of categories - ``("sci-fi", "action")``, ``("action", "comedy", "animated")`` - :class:`~ray.data.preprocessors.MultiHotEncoder` .. note:: If you're using LightGBM, you don't need to encode your categorical data. Instead, use :class:`~ray.data.preprocessors.Categorizer` to convert your data to `pandas.CategoricalDtype`. Numerical data ~~~~~~~~~~~~~~ To ensure your models behaves properly, normalize your numerical data. Reference the table below to determine which preprocessor to use. .. list-table:: :header-rows: 1 * - Data Property - Preprocessor * - Your data is approximately normal - :class:`~ray.data.preprocessors.StandardScaler` * - Your data is sparse - :class:`~ray.data.preprocessors.MaxAbsScaler` * - Your data contains many outliers - :class:`~ray.data.preprocessors.RobustScaler` * - Your data isn't normal, but you need it to be - :class:`~ray.data.preprocessors.PowerTransformer` * - You need unit-norm rows - :class:`~ray.data.preprocessors.Normalizer` * - You aren't sure what your data looks like - :class:`~ray.data.preprocessors.MinMaxScaler` .. warning:: These preprocessors operate on numeric columns. If your dataset contains columns of type :class:`~ray.air.util.tensor_extensions.pandas.TensorDtype`, you may need to :ref:`implement a custom preprocessor `. Additionally, if your model expects a tensor or ``ndarray``, create a tensor using :class:`~ray.data.preprocessors.Concatenator`. .. tip:: Built-in feature scalers like :class:`~ray.data.preprocessors.StandardScaler` don't work on :class:`~ray.air.util.tensor_extensions.pandas.TensorDtype` columns, so apply :class:`~ray.data.preprocessors.Concatenator` after feature scaling. .. literalinclude:: doc_code/preprocessors.py :language: python :start-after: __concatenate_start__ :end-before: __concatenate_end__ Filling in missing values ~~~~~~~~~~~~~~~~~~~~~~~~~ If your dataset contains missing values, replace them with :class:`~ray.data.preprocessors.SimpleImputer`. .. literalinclude:: doc_code/preprocessors.py :language: python :start-after: __simple_imputer_start__ :end-before: __simple_imputer_end__ Chaining preprocessors ~~~~~~~~~~~~~~~~~~~~~~ If you need to apply more than one preprocessor, simply apply them in sequence on your dataset. .. literalinclude:: doc_code/preprocessors.py :language: python :start-after: __chain_start__ :end-before: __chain_end__ .. _air-custom-preprocessors: Implementing custom preprocessors ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If you want to implement a custom preprocessor that needs to be fit, extend the :class:`~ray.data.preprocessor.Preprocessor` base class. .. literalinclude:: doc_code/preprocessors.py :language: python :start-after: __custom_stateful_start__ :end-before: __custom_stateful_end__ If your preprocessor doesn't need to be fit, use :meth:`map_batches() ` to directly transform your dataset. For more details, see :ref:`Transforming Data `.