ray.data.from_spark

ray.data.from_spark(df: pyspark.sql.DataFrame, *, parallelism: Optional[int] = None) → ray.data.dataset.MaterializedDataset

Create a Dataset from a Spark DataFrame.

Parameters

df – A Spark DataFrame, which must be created by RayDP (Spark-on-Ray).
parallelism – The amount of parallelism to use for the dataset. If not provided, it defaults to the number of partitions of the original Spark DataFrame.

Returns

A MaterializedDataset holding rows read from the DataFrame.
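
Example

The following is a minimal sketch of how the call might be used, not a definitive recipe. It assumes that `ray`, `raydp`, and `pyspark` are installed and that a local Ray cluster can be started. The helper name `spark_to_ray_row_count`, the application name, and the executor sizing are hypothetical values chosen for illustration. The third-party imports are deferred into the function body so the sketch can be loaded without those optional dependencies present.

```python
from typing import Optional


def spark_to_ray_row_count(num_rows: int = 1000,
                           parallelism: Optional[int] = None) -> int:
    """Build a small Spark DataFrame with RayDP, convert it, and count its rows.

    Hypothetical helper for illustration only; the resource settings below
    are assumptions, not recommendations.
    """
    # Deferred imports: ray, raydp and pyspark are optional dependencies.
    import ray
    import raydp

    # Start (or reuse) a local Ray cluster before launching Spark on it.
    ray.init(ignore_reinit_error=True)

    # from_spark requires a DataFrame created by RayDP (Spark-on-Ray).
    spark = raydp.init_spark(
        app_name="from_spark_example",  # assumed name
        num_executors=2,                # assumed sizing
        executor_cores=1,
        executor_memory="1GB",
    )
    try:
        # A single-column DataFrame spread over 4 Spark partitions.
        df = spark.range(num_rows).repartition(4)

        # With parallelism=None, the partition count of df is used.
        ds = ray.data.from_spark(df, parallelism=parallelism)

        # Consume the dataset while the Spark session is still alive.
        return ds.count()
    finally:
        raydp.stop_spark()
```

Calling `spark_to_ray_row_count()` leaves `parallelism` unset, so the dataset follows the DataFrame's partitioning; passing an integer forwards that value to `from_spark` instead. The row count is computed before `raydp.stop_spark()` as a precaution, so that the data is read while the Spark executors that produced it are still running.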

PublicAPI: This API is stable across Ray releases.