Dataset API#

Constructor#

Dataset(plan, epoch[, lazy, logical_plan])

A Dataset is a distributed data collection for data loading and processing.
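
Datasets are not usually constructed by calling this constructor directly; they are created through the `ray.data` read and `from_*` APIs. A minimal sketch, assuming Ray 2.x with the Data extras installed:

```python
import ray

# Create a Dataset from in-memory items; file-based sources use ray.data.read_*
# (e.g. ray.data.read_parquet("s3://bucket/path")).
ds = ray.data.from_items([{"value": i} for i in range(100)])
print(ds)
```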

Basic Transformations#

Dataset.map(fn, *[, compute, ...])

Apply the given function to each row of this dataset.

Dataset.map_batches(fn, *[, batch_size, ...])

Apply the given function to batches of data.

Dataset.flat_map(fn, *[, compute, ...])

Apply the given function to each row and then flatten results.

Dataset.filter(fn, *[, compute])

Keep only the rows that satisfy the given predicate.

Dataset.add_column(col, fn, *[, compute])

Add the given column to the dataset.

Dataset.drop_columns(cols, *[, compute])

Drop one or more columns from the dataset.

Dataset.select_columns(cols, *[, compute])

Select one or more columns from the dataset.

Dataset.random_sample(fraction, *[, seed])

Return a new Dataset containing a random fraction of the rows.

Dataset.limit(limit)

Truncate the dataset to the first limit rows.
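
A minimal sketch chaining several of these transformations; the column names and lambdas are illustrative assumptions, and `batch_format="pandas"` is passed explicitly so the batch type is unambiguous:

```python
import ray

ds = ray.data.from_items([{"value": i} for i in range(100)])

ds = ds.map(lambda row: {"value": row["value"] * 2})           # per-row transform
ds = ds.map_batches(lambda df: df + 1, batch_format="pandas")  # vectorized, pandas batches
ds = ds.filter(lambda row: row["value"] % 3 == 0)              # keep matching rows
ds = ds.add_column("half", lambda df: df["value"] // 2)        # fn receives a pandas batch

print(ds.select_columns(["half"]).limit(5).take_all())
```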

Sorting, Shuffling, Repartitioning#

Dataset.sort([key, descending])

Sort the dataset by the specified key column or key function.

Dataset.random_shuffle(*[, seed, num_blocks])

Randomly shuffle the rows of this Dataset.

Dataset.randomize_block_order(*[, seed])

Randomly shuffle the blocks of this Dataset.

Dataset.repartition(num_blocks, *[, shuffle])

Repartition the Dataset into exactly this number of blocks.
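
A short sketch of reordering a small Dataset; the "value" column is an illustrative assumption:

```python
import ray

ds = ray.data.from_items([{"value": i} for i in range(8)])

shuffled = ds.random_shuffle(seed=42)              # row-level shuffle
ordered = shuffled.sort("value", descending=True)  # largest value first
rebalanced = ordered.repartition(4)                # exactly 4 blocks
print(rebalanced.num_blocks(), rebalanced.take(3))
```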

Splitting and Merging Datasets#

Dataset.split(n, *[, equal, locality_hints])

Materialize and split the dataset into n disjoint pieces.

Dataset.split_at_indices(indices)

Materialize and split the dataset at the given indices (like np.split).

Dataset.split_proportionately(proportions)

Materialize and split the dataset using proportions.

Dataset.streaming_split(n, *[, equal, ...])

Return n DataIterators that can be used to read disjoint subsets of the dataset in parallel.

Dataset.train_test_split(test_size, *[, ...])

Materialize and split the dataset into train and test subsets.

Dataset.union(*other)

Materialize and concatenate Datasets across rows.

Dataset.zip(other)

Materialize and zip the columns of this dataset with the columns of another.
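
A sketch of the splitting and merging APIs on a toy Dataset; sizes and indices are chosen for illustration:

```python
import ray

ds = ray.data.from_items([{"value": i} for i in range(10)])

train, test = ds.train_test_split(test_size=0.25)  # 75% / 25% row split
left, right = ds.split_at_indices([5])             # rows [0, 5) and [5, 10)
combined = left.union(right)                       # concatenate the pieces back together
print(train.count(), test.count(), combined.count())
```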

Grouped and Global Aggregations#

Dataset.groupby(key)

Group rows of a Dataset according to a column.

Dataset.unique(column)

List the unique elements in a given column.

Dataset.aggregate(*aggs)

Aggregate values using one or more functions.

Dataset.sum([on, ignore_nulls])

Compute the sum of one or more columns.

Dataset.min([on, ignore_nulls])

Return the minimum of one or more columns.

Dataset.max([on, ignore_nulls])

Return the maximum of one or more columns.

Dataset.mean([on, ignore_nulls])

Compute the mean of one or more columns.

Dataset.std([on, ddof, ignore_nulls])

Compute the standard deviation of one or more columns.
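
A sketch of global and grouped aggregations; the "group"/"value" schema is an assumption made for illustration:

```python
import ray

ds = ray.data.from_items([{"group": i % 2, "value": i} for i in range(10)])

print(ds.sum("value"), ds.mean("value"), ds.std("value"))  # global aggregations
print(ds.unique("group"))                                  # distinct values in one column
print(ds.groupby("group").sum("value").take_all())         # per-group sums
```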

Consuming Data#

Dataset.show([limit])

Print up to the given number of rows from the Dataset.

Dataset.take([limit])

Return up to limit rows from the Dataset.

Dataset.take_batch([batch_size, batch_format])

Return up to batch_size rows from the Dataset in a batch.

Dataset.take_all([limit])

Return all of the rows in this Dataset.

Dataset.iterator()

Return a DataIterator over this dataset.

Dataset.iter_rows(*[, prefetch_blocks])

Return an iterator over the rows in this dataset.

Dataset.iter_batches(*[, prefetch_batches, ...])

Return an iterator over batches of data.

Dataset.iter_torch_batches(*[, ...])

Return an iterator over batches of data represented as Torch tensors.

Dataset.iter_tf_batches(*[, ...])

Return an iterator over batches of data represented as TensorFlow tensors.
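
A sketch of the consumption APIs on a small Dataset; the batch sizes are illustrative, and the Torch and TensorFlow iterators follow the same pattern as `iter_batches`:

```python
import ray

ds = ray.data.from_items([{"value": i} for i in range(32)])

ds.show(3)                               # print a few rows
rows = ds.take(5)                        # small list of row dicts
batch = ds.take_batch(batch_size=8)      # a single batch

for b in ds.iter_batches(batch_size=8, batch_format="numpy"):
    ...  # each b is a dict of NumPy arrays, e.g. {"value": array([...])}
```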

I/O and Conversion#

Dataset.write_parquet(path, *[, filesystem, ...])

Write the Dataset to Parquet files under the provided path.

Dataset.write_json(path, *[, filesystem, ...])

Write the Dataset to JSON and JSONL files.

Dataset.write_csv(path, *[, filesystem, ...])

Write the Dataset to CSV files.

Dataset.write_numpy(path, *, column[, ...])

Write a column of the Dataset to .npy files.

Dataset.write_tfrecords(path, *[, ...])

Write the Dataset to TFRecord files.

Dataset.write_webdataset(path, *[, ...])

Write the Dataset to WebDataset files.

Dataset.write_mongo(uri, database, collection)

Write the Dataset to a MongoDB database.

Dataset.write_datasource(datasource, *[, ...])

Write the Dataset to a custom Datasource.

Dataset.to_torch(*[, label_column, ...])

Return a Torch IterableDataset over this Dataset.

Dataset.to_tf(feature_columns, label_columns, *)

Return a TensorFlow Dataset over this Dataset.

Dataset.to_dask([meta, verify_meta])

Convert this Dataset into a Dask DataFrame.

Dataset.to_mars()

Convert this Dataset into a Mars DataFrame.

Dataset.to_modin()

Convert this Dataset into a Modin DataFrame.

Dataset.to_spark(spark)

Convert this Dataset into a Spark DataFrame.

Dataset.to_pandas([limit])

Convert this Dataset to a single pandas DataFrame.

Dataset.to_pandas_refs()

Convert this Dataset into a distributed set of pandas DataFrames.

Dataset.to_numpy_refs(*[, column])

Convert this Dataset into a distributed set of NumPy ndarrays or dictionaries of NumPy ndarrays.

Dataset.to_arrow_refs()

Convert this Dataset into a distributed set of PyArrow tables.

Dataset.to_random_access_dataset(key[, ...])

Convert this dataset into a distributed RandomAccessDataset (EXPERIMENTAL).
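
A sketch of writing files and converting to another library; the output paths are placeholders:

```python
import ray

ds = ray.data.from_items([{"value": i} for i in range(100)])

ds.write_parquet("/tmp/ds_parquet")  # one or more Parquet files under this directory
ds.write_csv("/tmp/ds_csv")

pdf = ds.to_pandas()                 # materializes the whole Dataset into one DataFrame
print(pdf.head())
```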

Inspecting Metadata#

Dataset.count()

Count the number of records in the dataset.

Dataset.columns([fetch_if_missing])

Return the columns of this Dataset.

Dataset.schema([fetch_if_missing])

Return the schema of the dataset.

Dataset.num_blocks()

Return the number of blocks of this dataset.

Dataset.size_bytes()

Return the in-memory size of the dataset.

Dataset.input_files()

Return the list of input files for the dataset.

Dataset.stats()

Return a string containing execution timing information.

Dataset.get_internal_block_refs()

Get a list of references to the underlying blocks of this dataset.
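
A sketch of the metadata accessors on a small in-memory Dataset:

```python
import ray

ds = ray.data.from_items([{"value": i} for i in range(100)])

print(ds.count())                        # number of rows
print(ds.schema())                       # column names and types
print(ds.columns())                      # just the column names
print(ds.num_blocks(), ds.size_bytes())  # physical layout and in-memory size
print(ds.stats())                        # execution timing summary
```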

Execution#

Dataset.materialize()

Execute and materialize this dataset into object store memory.

ActorPoolStrategy([legacy_min_size, ...])

Specify the compute strategy for a Dataset transform.
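
A sketch of running a transform on an actor pool and then materializing the result; the `AddOne` class and the pool bounds are illustrative assumptions:

```python
import ray
from ray.data import ActorPoolStrategy


class AddOne:
    # Stateful callable executed inside pooled actors.
    def __call__(self, batch):
        batch["value"] = batch["value"] + 1
        return batch


ds = ray.data.from_items([{"value": i} for i in range(100)])
ds = ds.map_batches(
    AddOne,
    compute=ActorPoolStrategy(min_size=1, max_size=2),
    batch_format="numpy",
)
materialized = ds.materialize()  # execute and pin the blocks in object store memory
print(materialized.count())
```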

Serialization#

Dataset.has_serializable_lineage()

Whether this Dataset's lineage can be serialized for storage and later deserialized, possibly on a different cluster.

Dataset.serialize_lineage()

Serialize this Dataset's lineage (not the actual data or the existing data futures) to bytes that can be stored and later deserialized, possibly on a different cluster.

Dataset.deserialize_lineage(serialized_ds)

Deserialize the provided lineage-serialized Dataset.
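
A sketch of round-tripping a Dataset's lineage; this only applies to datasets whose lineage is serializable (for example, ones created with the `ray.data.read_*` or `range` APIs), hence the guard:

```python
import ray
from ray.data import Dataset

ds = ray.data.range(100)
if ds.has_serializable_lineage():
    blob = ds.serialize_lineage()                 # bytes describing how to recreate ds
    restored = Dataset.deserialize_lineage(blob)  # rebuilds the Dataset from its lineage
    print(restored.count())
```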

Internals#

block.Block

A single block of data; internally, a Dataset is a list of references to blocks.

block.BlockExecStats()

Execution stats for this block.

block.BlockMetadata(num_rows, size_bytes, ...)

Metadata about the block.

block.BlockAccessor()

Provides accessor methods for a specific block.