Dataset API#

Constructor#

Dataset(plan, epoch[, lazy, logical_plan])

A Dataset is a distributed data collection for data loading and processing.
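
Datasets are not usually constructed by calling this constructor directly; they are created through the `ray.data` read and `from_*` APIs. A minimal sketch, assuming Ray 2.x with the Data extras installed:

```python
import ray

# Create a Dataset from in-memory items; file-based sources use ray.data.read_*
# (e.g. ray.data.read_parquet("s3://bucket/path")).
ds = ray.data.from_items([{"value": i} for i in range(100)])
print(ds)
```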

Basic Transformations#

Dataset.map(fn, *[, compute, ...])

Apply the given function to each row of this dataset.

Dataset.map_batches(fn, *[, batch_size, ...])

Apply the given function to batches of data.

Dataset.flat_map(fn, *[, compute, ...])

Apply the given function to each row and then flatten results.

Dataset.filter(fn, *[, compute])

Keep only the rows that satisfy the given predicate.

Dataset.add_column(col, fn, *[, compute])

Add the given column to the dataset.

Dataset.drop_columns(cols, *[, compute])

Drop one or more columns from the dataset.

Dataset.select_columns(cols, *[, compute])

Select one or more columns from the dataset.

Dataset.random_sample(fraction, *[, seed])

Return a new Dataset containing a random fraction of the rows.

Dataset.limit(limit)

Truncate the dataset to the first limit rows.
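
A minimal sketch chaining several of these transformations; the column names and lambdas are illustrative assumptions, and `batch_format="pandas"` is passed explicitly so the batch type is unambiguous:

```python
import ray

ds = ray.data.from_items([{"value": i} for i in range(100)])

ds = ds.map(lambda row: {"value": row["value"] * 2})           # per-row transform
ds = ds.map_batches(lambda df: df + 1, batch_format="pandas")  # vectorized, pandas batches
ds = ds.filter(lambda row: row["value"] % 3 == 0)              # keep matching rows
ds = ds.add_column("half", lambda df: df["value"] // 2)        # fn receives a pandas batch

print(ds.select_columns(["half"]).limit(5).take_all())
```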

Sorting, Shuffling, Repartitioning#

Dataset.sort([key, descending])

Sort the dataset by the specified key column or key function.

Dataset.random_shuffle(*[, seed, num_blocks])

Randomly shuffle the rows of this Dataset.

Dataset.randomize_block_order(*[, seed])

Randomly shuffle the blocks of this Dataset.

Dataset.repartition(num_blocks, *[, shuffle])

Repartition the Dataset into exactly this number of blocks.
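
A short sketch of reordering a small Dataset; the "value" column is an illustrative assumption:

```python
import ray

ds = ray.data.from_items([{"value": i} for i in range(8)])

shuffled = ds.random_shuffle(seed=42)              # row-level shuffle
ordered = shuffled.sort("value", descending=True)  # largest value first
rebalanced = ordered.repartition(4)                # exactly 4 blocks
print(rebalanced.num_blocks(), rebalanced.take(3))
```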

Splitting and Merging Datasets#

Dataset.split(n, *[, equal, locality_hints])

Materialize and split the dataset into n disjoint pieces.

Dataset.split_at_indices(indices)

Materialize and split the dataset at the given indices (like np.split).

Dataset.split_proportionately(proportions)

Materialize and split the dataset using proportions.

Dataset.streaming_split(n, *[, equal, ...])

Return n DataIterators that can be used to read disjoint subsets of the dataset in parallel.

Dataset.train_test_split(test_size, *[, ...])

Materialize and split the dataset into train and test subsets.

Dataset.union(*other)

Materialize and concatenate Datasets across rows.

Dataset.zip(other)

Materialize and zip the columns of this dataset with the columns of another.
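
A sketch of the splitting and merging APIs on a toy Dataset; sizes and indices are chosen for illustration:

```python
import ray

ds = ray.data.from_items([{"value": i} for i in range(10)])

train, test = ds.train_test_split(test_size=0.25)  # 75% / 25% row split
left, right = ds.split_at_indices([5])             # rows [0, 5) and [5, 10)
combined = left.union(right)                       # concatenate the pieces back together
print(train.count(), test.count(), combined.count())
```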

Grouped and Global Aggregations#

Dataset.groupby(key)

Group rows of a Dataset according to a column.

Dataset.unique(column)

List the unique elements in a given column.

Dataset.aggregate(*aggs)

Aggregate values using one or more functions.

Dataset.sum([on, ignore_nulls])

Compute the sum of one or more columns.

Dataset.min([on, ignore_nulls])

Return the minimum of one or more columns.

Dataset.max([on, ignore_nulls])

Return the maximum of one or more columns.

Dataset.mean([on, ignore_nulls])

Compute the mean of one or more columns.

Dataset.std([on, ddof, ignore_nulls])

Compute the standard deviation of one or more columns.
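
A sketch of global and grouped aggregations; the "group"/"value" schema is an assumption made for illustration:

```python
import ray

ds = ray.data.from_items([{"group": i % 2, "value": i} for i in range(10)])

print(ds.sum("value"), ds.mean("value"), ds.std("value"))  # global aggregations
print(ds.unique("group"))                                  # distinct values in one column
print(ds.groupby("group").sum("value").take_all())         # per-group sums
```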

Consuming Data#

Dataset.show([limit])

Print up to the given number of rows from the Dataset.

Dataset.take([limit])

Return up to limit rows from the Dataset.

Dataset.take_batch([batch_size, batch_format])

Return up to batch_size rows from the Dataset in a batch.

Dataset.take_all([limit])

Return all of the rows in this Dataset.

Dataset.iterator()

Return a DataIterator over this dataset.

Dataset.iter_rows(*[, prefetch_blocks])

Return an iterator over the rows in this dataset.

Dataset.iter_batches(*[, prefetch_batches, ...])

Return an iterator over batches of data.

Dataset.iter_torch_batches(*[, ...])

Return an iterator over batches of data represented as Torch tensors.

Dataset.iter_tf_batches(*[, ...])

Return an iterator over batches of data represented as TensorFlow tensors.
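
A sketch of the consumption APIs on a small Dataset; the batch sizes are illustrative, and the Torch and TensorFlow iterators follow the same pattern as `iter_batches`:

```python
import ray

ds = ray.data.from_items([{"value": i} for i in range(32)])

ds.show(3)                               # print a few rows
rows = ds.take(5)                        # small list of row dicts
batch = ds.take_batch(batch_size=8)      # a single batch

for b in ds.iter_batches(batch_size=8, batch_format="numpy"):
    ...  # each b is a dict of NumPy arrays, e.g. {"value": array([...])}
```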

I/O and Conversion#

Dataset.write_parquet(path, *[, filesystem, ...])

Write the Dataset to Parquet files under the provided path.

Dataset.write_json(path, *[, filesystem, ...])

Write the Dataset to JSON and JSONL files.

Dataset.write_csv(path, *[, filesystem, ...])

Write the Dataset to CSV files.

Dataset.write_numpy(path, *, column[, ...])

Write a column of the Dataset to .npy files.

Dataset.write_tfrecords(path, *[, ...])

Write the Dataset to TFRecord files.

Dataset.write_webdataset(path, *[, ...])

Write the Dataset to WebDataset files.

Dataset.write_mongo(uri, database, collection)

Write the Dataset to a MongoDB database.

Dataset.write_datasource(datasource, *[, ...])

Write the Dataset to a custom Datasource.

Dataset.to_torch(*[, label_column, ...])

Return a Torch IterableDataset over this Dataset.

Dataset.to_tf(feature_columns, label_columns, *)

Return a TensorFlow Dataset over this Dataset.

Dataset.to_dask([meta, verify_meta])

Convert this Dataset into a Dask DataFrame.

Dataset.to_mars()

Convert this Dataset into a Mars DataFrame.

Dataset.to_modin()

Convert this Dataset into a Modin DataFrame.

Dataset.to_spark(spark)

Convert this Dataset into a Spark DataFrame.

Dataset.to_pandas([limit])

Convert this Dataset to a single pandas DataFrame.

Dataset.to_pandas_refs()

Convert this Dataset into a distributed set of pandas DataFrames.

Dataset.to_numpy_refs(*[, column])

Convert this Dataset into a distributed set of NumPy ndarrays or dictionaries of NumPy ndarrays.

Dataset.to_arrow_refs()

Convert this Dataset into a distributed set of PyArrow tables.

Dataset.to_random_access_dataset(key[, ...])

Convert this dataset into a distributed RandomAccessDataset (EXPERIMENTAL).
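
A sketch of writing files and converting to another library; the output paths are placeholders:

```python
import ray

ds = ray.data.from_items([{"value": i} for i in range(100)])

ds.write_parquet("/tmp/ds_parquet")  # one or more Parquet files under this directory
ds.write_csv("/tmp/ds_csv")

pdf = ds.to_pandas()                 # materializes the whole Dataset into one DataFrame
print(pdf.head())
```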

Inspecting Metadata#

Dataset.count()

Count the number of records in the dataset.

Dataset.columns([fetch_if_missing])

Return the columns of this Dataset.

Dataset.schema([fetch_if_missing])

Return the schema of the dataset.

Dataset.num_blocks()

Return the number of blocks of this dataset.

Dataset.size_bytes()

Return the in-memory size of the dataset.

Dataset.input_files()

Return the list of input files for the dataset.

Dataset.stats()

Return a string containing execution timing information.

Dataset.get_internal_block_refs()

Get a list of references to the underlying blocks of this dataset.
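
A sketch of the metadata accessors on a small in-memory Dataset:

```python
import ray

ds = ray.data.from_items([{"value": i} for i in range(100)])

print(ds.count())                        # number of rows
print(ds.schema())                       # column names and types
print(ds.columns())                      # just the column names
print(ds.num_blocks(), ds.size_bytes())  # physical layout and in-memory size
print(ds.stats())                        # execution timing summary
```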

Execution#

Dataset.materialize()

Execute and materialize this dataset into object store memory.

ActorPoolStrategy([legacy_min_size, ...])

Specify the compute strategy for a Dataset transform.
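
A sketch of running a transform on an actor pool and then materializing the result; the `AddOne` class and the pool bounds are illustrative assumptions:

```python
import ray
from ray.data import ActorPoolStrategy


class AddOne:
    # Stateful callable executed inside pooled actors.
    def __call__(self, batch):
        batch["value"] = batch["value"] + 1
        return batch


ds = ray.data.from_items([{"value": i} for i in range(100)])
ds = ds.map_batches(
    AddOne,
    compute=ActorPoolStrategy(min_size=1, max_size=2),
    batch_format="numpy",
)
materialized = ds.materialize()  # execute and pin the blocks in object store memory
print(materialized.count())
```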

Serialization#

Dataset.has_serializable_lineage()

Whether this Dataset's lineage can be serialized for storage and later deserialized, possibly on a different cluster.

Dataset.serialize_lineage()

Serialize this Dataset's lineage (not the actual data or the existing data futures) to bytes that can be stored and later deserialized, possibly on a different cluster.

Dataset.deserialize_lineage(serialized_ds)

Deserialize the provided lineage-serialized Dataset.
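
A sketch of round-tripping a Dataset's lineage; this only applies to datasets whose lineage is serializable (for example, ones created with the `ray.data.read_*` or `range` APIs), hence the guard:

```python
import ray
from ray.data import Dataset

ds = ray.data.range(100)
if ds.has_serializable_lineage():
    blob = ds.serialize_lineage()                 # bytes describing how to recreate ds
    restored = Dataset.deserialize_lineage(blob)  # rebuilds the Dataset from its lineage
    print(restored.count())
```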

Internals#

block.Block

A single block of data; internally, a Dataset is a list of references to blocks.

block.BlockExecStats()

Execution stats for this block.

block.BlockMetadata(num_rows, size_bytes, ...)

Metadata about the block.

block.BlockAccessor()

Provides accessor methods for a specific block.