.. _train-monitoring-and-logging:

Monitoring and Logging Metrics
==============================

Ray Train provides an API for reporting intermediate results and checkpoints from the training function (run on distributed workers) up to the ``Trainer`` (where your Python script is executed) by calling ``train.report(metrics)``. The results are collected from the distributed workers and passed to the driver to be logged and displayed.

.. warning::

    Only the results from the rank 0 worker are used. However, to ensure consistency, ``train.report()`` has to be called on each worker. If you want to aggregate results from multiple workers, see :ref:`train-aggregating-results`.

The primary use case for reporting is metrics (accuracy, loss, etc.) at the end of each training epoch.

.. tab-set::

    .. tab-item:: PyTorch

        .. code-block:: python

            from ray import train

            def train_func():
                ...
                for i in range(num_epochs):
                    result = model.train(...)
                    train.report({"result": result})

    .. tab-item:: PyTorch Lightning

        In PyTorch Lightning, we use a callback to call ``train.report()``.

        .. code-block:: python

            from ray import train
            import pytorch_lightning as pl
            from pytorch_lightning.callbacks import Callback

            class MyRayTrainReportCallback(Callback):
                def on_train_epoch_end(self, trainer, pl_module):
                    metrics = trainer.callback_metrics
                    metrics = {k: v.item() for k, v in metrics.items()}
                    train.report(metrics=metrics)

            def train_func_per_worker():
                ...
                trainer = pl.Trainer(
                    # ...
                    callbacks=[MyRayTrainReportCallback()]
                )
                trainer.fit()

.. _train-aggregating-results:

How to obtain and aggregate results from different workers?
-----------------------------------------------------------

In real applications, you may want to calculate optimization metrics besides accuracy and loss: recall, precision, Fbeta, etc. You may also want to collect metrics from multiple workers. While Ray Train currently only reports metrics from the rank 0 worker, you can use third-party libraries or the distributed primitives of your machine learning framework to report metrics from multiple workers.

.. tab-set::

    .. tab-item:: Native PyTorch

        Ray Train natively supports `TorchMetrics <https://torchmetrics.readthedocs.io/en/latest/>`_, which provides a collection of machine learning metrics for distributed, scalable PyTorch models.

        Here is an example of reporting both the aggregated R2 score and mean train and validation loss from all workers.

        .. literalinclude:: ../doc_code/torchmetrics_example.py
            :language: python
            :start-after: __start__
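
        For the gist without opening the linked example, below is a minimal sketch of the same idea. It assumes a ``TorchTrainer`` has already set up the distributed process group; the ``model``, ``dataloader``, ``loss_fn``, and ``num_epochs`` names are hypothetical placeholders, and the metric choice (mean squared error) is only illustrative.

        .. code-block:: python

            import ray.train.torch
            import torchmetrics
            from ray import train

            def train_func_per_worker():
                device = ray.train.torch.get_device()
                # TorchMetrics keeps per-worker state and synchronizes it across
                # the process group when ``compute()`` is called.
                mse = torchmetrics.MeanSquaredError().to(device)

                for epoch in range(num_epochs):
                    for X, y in dataloader:
                        preds = model(X.to(device))
                        loss = loss_fn(preds, y.to(device))
                        # ... backward pass and optimizer step ...
                        mse(preds, y.to(device))  # accumulate per-worker metric state

                    # ``compute()`` aggregates the metric over all workers, so every
                    # worker reports the same value; only rank 0's report is kept.
                    epoch_mse = mse.compute().item()
                    train.report({"mse": epoch_mse})
                    mse.reset()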