Input/Output#

Synthetic Data#

range(n, *[, parallelism])

Creates a Dataset from a range of integers [0..n).

range_tensor(n, *[, shape, parallelism])

Creates a Dataset of tensors of the provided shape from the range [0...n].
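
For example, a minimal sketch of both constructors (assuming a local Ray installation; the exact row layout can vary slightly across Ray versions):

import ray

# 10,000 integer rows covering [0, 10000)
ds = ray.data.range(10_000)
print(ds.count())  # 10000

# Rows holding 2x2 tensors built from the same integer range
tensor_ds = ray.data.range_tensor(1_000, shape=(2, 2))
print(tensor_ds.schema())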

Python Objects#

from_items(items, *[, parallelism])

Create a Dataset from a list of local Python objects.
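
For example, a short sketch turning a small in-memory list into a Dataset (the dict keys here are arbitrary placeholders):

import ray

items = [{"name": "a", "value": 1}, {"name": "b", "value": 2}]
ds = ray.data.from_items(items)
print(ds.take(2))  # the original items back as rows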

Parquet#

read_parquet(paths, *[, filesystem, ...])

Creates a Dataset from parquet files.

read_parquet_bulk(paths, *[, filesystem, ...])

Create a Dataset from parquet files without reading metadata.

Dataset.write_parquet(path, *[, filesystem, ...])

Writes the Dataset to parquet files under the provided path.
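
A round-trip sketch using a local directory (the /tmp path is only an example; S3/GCS URIs and an explicit filesystem argument work the same way):

import pandas as pd
import ray

ds = ray.data.from_pandas(pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]}))
ds.write_parquet("/tmp/parquet_demo")              # one Parquet file per block
back = ray.data.read_parquet("/tmp/parquet_demo")  # supports column pruning via columns=[...]
print(back.count())  # 3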

CSV#

read_csv(paths, *[, filesystem, ...])

Creates a Dataset from CSV files.

Dataset.write_csv(path, *[, filesystem, ...])

Writes the Dataset to CSV files.
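
A sketch of reading and writing CSV (the S3 path is a placeholder; single files, directories, and lists of paths are all accepted):

import ray

ds = ray.data.read_csv("s3://my-bucket/data.csv")
print(ds.schema())

ds.write_csv("/tmp/csv_out")  # one CSV file per block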

JSON#

read_json(paths, *[, filesystem, ...])

Creates a Dataset from JSON and JSONL files.

Dataset.write_json(path, *[, filesystem, ...])

Writes the Dataset to JSON and JSONL files.
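
A sketch with placeholder paths (both .json and newline-delimited .jsonl inputs are supported):

import ray

ds = ray.data.read_json("s3://my-bucket/logs/")
ds.write_json("/tmp/json_out")  # one output file per block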

Text#

read_text(paths, *[, encoding, ...])

Create a Dataset from lines stored in text files.
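
For example (the path is a placeholder for any local or remote text file):

import ray

# Each line of the file becomes one row; empty lines are dropped by default.
ds = ray.data.read_text("/tmp/notes.txt", encoding="utf-8")
print(ds.count())  # number of lines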

Images#

read_images(paths, *[, filesystem, ...])

Creates a Dataset from image files.

Dataset.write_images(path, column[, ...])

Writes the Dataset to images.
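
A sketch with placeholder paths; recent Ray releases expose the decoded images under an "image" column, which is the column name assumed in the write call:

import ray

ds = ray.data.read_images("s3://my-bucket/images/")
print(ds.schema())

ds.write_images("/tmp/images_out", column="image")  # one image file per row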

Binary#

read_binary_files(paths, *[, include_paths, ...])

Create a Dataset from binary files of arbitrary contents.
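
For example (placeholder directory):

import ray

# include_paths=True adds a "path" column next to the raw bytes of each file.
ds = ray.data.read_binary_files("/tmp/blobs/", include_paths=True)
print(ds.schema())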

TFRecords#

read_tfrecords(paths, *[, filesystem, ...])

Create a Dataset from TFRecord files that contain tf.train.Example messages.

Dataset.write_tfrecords(path, *[, ...])

Write the Dataset to TFRecord files.
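
A sketch with placeholder paths (the TFRecord connectors need the TensorFlow-related dependencies installed):

import ray

ds = ray.data.read_tfrecords("s3://my-bucket/records/")  # parses tf.train.Example messages
ds.write_tfrecords("/tmp/tfrecords_out")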

Pandas#

from_pandas(dfs)

Create a Dataset from a list of pandas dataframes.

from_pandas_refs(dfs)

Create a Dataset from a list of Ray object references to pandas dataframes.

Dataset.to_pandas([limit])

Convert this Dataset to a single pandas DataFrame.

Dataset.to_pandas_refs()

Converts this Dataset into a distributed set of Pandas dataframes.
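
A minimal round-trip sketch (from_pandas also accepts a list of DataFrames; to_pandas materializes the whole Dataset on the caller):

import pandas as pd
import ray

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
ds = ray.data.from_pandas(df)
out = ds.to_pandas()
print(out)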

NumPy#

read_numpy(paths, *[, filesystem, ...])

Create a Dataset from NumPy (.npy) files.

from_numpy(ndarrays)

Creates a Dataset from a list of NumPy ndarrays.

from_numpy_refs(ndarrays)

Creates a Dataset from a list of Ray object references to NumPy ndarrays.

Dataset.write_numpy(path, *, column[, ...])

Writes a column of the Dataset to .npy files.

Dataset.to_numpy_refs(*[, column])

Converts this Dataset into a distributed set of NumPy ndarrays or dictionary of NumPy ndarrays.
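
A sketch of the ndarray constructors and converters; the "data" column name passed to write_numpy is what from_numpy produces in recent Ray releases and may differ in older ones:

import numpy as np
import ray

arr = np.arange(12).reshape(3, 4)
ds = ray.data.from_numpy(arr)          # one tensor column holding the rows of arr
ds.write_numpy("/tmp/numpy_out", column="data")
refs = ds.to_numpy_refs()              # Ray object refs resolving to ndarrays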

Arrow#

from_arrow(tables)

Create a Dataset from a list of PyArrow tables.

from_arrow_refs(tables)

Create a Dataset from a list of Ray object references to PyArrow tables.

Dataset.to_arrow_refs()

Convert this Dataset into a distributed set of PyArrow tables.
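
For example (from_arrow also accepts a list of tables; to_arrow_refs returns one object reference per block):

import pyarrow as pa
import ray

table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})
ds = ray.data.from_arrow(table)
refs = ds.to_arrow_refs()
print(ds.count())  # 3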

MongoDB#

read_mongo(uri, database, collection, *[, ...])

Create a Dataset from a MongoDB database.

Dataset.write_mongo(uri, database, collection)

Writes the Dataset to a MongoDB database.
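
A sketch with placeholder connection details (the MongoDB connector additionally requires the pymongo/pymongoarrow packages):

import ray

ds = ray.data.read_mongo(
    uri="mongodb://localhost:27017",
    database="my_db",
    collection="my_collection",
)
ds.write_mongo(
    uri="mongodb://localhost:27017",
    database="my_db",
    collection="my_collection_copy",
)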

SQL Databases#

read_sql(sql, connection_factory, *[, ...])

Read from a database that provides a Python DB API2-compliant connector.
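
For example, sqlite3 is used below only because it ships with Python; the table name is a placeholder:

import sqlite3

import ray

ds = ray.data.read_sql(
    "SELECT * FROM my_table",
    lambda: sqlite3.connect("example.db"),  # the connection factory returns a DB API2 connection
)
print(ds.schema())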

Dask#

from_dask(df)

Create a Dataset from a Dask DataFrame.

Dataset.to_dask([meta, verify_meta])

Convert this Dataset into a Dask DataFrame.
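
A sketch assuming Dask is installed alongside Ray:

import dask.dataframe as dd
import pandas as pd
import ray

ddf = dd.from_pandas(pd.DataFrame({"a": range(10)}), npartitions=2)
ds = ray.data.from_dask(ddf)
back = ds.to_dask()  # a Dask DataFrame backed by the Dataset's blocks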

Spark#

from_spark(df, *[, parallelism])

Create a Dataset from a Spark DataFrame.

Dataset.to_spark(spark)

Convert this Dataset into a Spark DataFrame.

Modin#

from_modin(df)

Create a Dataset from a Modin DataFrame.

Dataset.to_modin()

Convert this Dataset into a Modin DataFrame.

Mars#

from_mars(df)

Create a Dataset from a Mars DataFrame.

Dataset.to_mars()

Convert this Dataset into a Mars DataFrame.

Torch#

from_torch(dataset)

Create a Dataset from a Torch Dataset.
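
A sketch using a torchvision dataset as the source (this downloads MNIST locally; any map-style torch.utils.data.Dataset works, and the data is materialized into Ray's object store):

import ray
import torchvision

torch_ds = torchvision.datasets.MNIST("/tmp/mnist", download=True)
ds = ray.data.from_torch(torch_ds)
print(ds.take(1))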

Hugging Face#

from_huggingface(dataset)

Create a MaterializedDataset from a Hugging Face Datasets Dataset or a Dataset from a Hugging Face Datasets IterableDataset.
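
For example ("tweet_eval"/"emotion" is just a small public dataset used for illustration):

import ray
from datasets import load_dataset

hf_ds = load_dataset("tweet_eval", "emotion", split="train")  # a datasets.Dataset (single split)
ds = ray.data.from_huggingface(hf_ds)
print(ds.count())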

TensorFlow#

from_tf(dataset)

Create a Dataset from a TensorFlow Dataset.
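
A sketch with a tiny in-memory tf.data.Dataset (the source dataset is materialized, so this is intended for data that fits in memory):

import ray
import tensorflow as tf

tf_ds = tf.data.Dataset.from_tensor_slices({"x": [1.0, 2.0, 3.0]})
ds = ray.data.from_tf(tf_ds)
print(ds.take(3))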

WebDataset#

read_webdataset(paths, *[, filesystem, ...])

Create a Dataset from WebDataset files.
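
For example (placeholder path to a WebDataset-formatted .tar shard; lists of shards are also accepted):

import ray

ds = ray.data.read_webdataset("s3://my-bucket/shards/shard-000000.tar")
print(ds.schema())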

Datasource API#

read_datasource(datasource, *[, ...])

Read a stream from a custom Datasource.

Dataset.write_datasource(datasource, *[, ...])

Writes the Dataset to a custom Datasource.

Datasource()

Interface for defining a custom Dataset datasource.

ReadTask(read_fn, metadata)

A function used to read blocks from the Dataset.

datasource.Reader()

A bound read operation for a Datasource.

Partitioning API#

datasource.Partitioning(style[, base_dir, ...])

Partition scheme used to describe path-based partitions.

datasource.PartitionStyle(value)

Supported dataset partition styles.

datasource.PathPartitionParser(partitioning)

Partition parser for path-based partition formats.

datasource.PathPartitionFilter(...)

Partition filter for path-based partition formats.

datasource.FileExtensionFilter(file_extensions)

A file-extension-based path filter that filters out files that don't end with the provided extension(s).
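
For example, a sketch reading a hive-partitioned layout such as .../year=2022/month=09/sales.csv, where the partition keys become regular columns (paths are placeholders):

import ray
from ray.data.datasource import Partitioning

ds = ray.data.read_csv(
    "s3://my-bucket/partitioned/",
    partitioning=Partitioning("hive"),
)
print(ds.schema())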

MetadataProvider API#

datasource.FileMetadataProvider()

Abstract callable that provides metadata for the files of a single dataset block.

datasource.BaseFileMetadataProvider()

Abstract callable that provides metadata for FileBasedDatasource implementations that reuse the base prepare_read() method.

datasource.ParquetMetadataProvider()

Abstract callable that provides block metadata for Arrow Parquet file fragments.

datasource.DefaultFileMetadataProvider()

Default metadata provider for FileBasedDatasource implementations that reuse the base prepare_read method.

datasource.DefaultParquetMetadataProvider()

The default file metadata provider for ParquetDatasource.

datasource.FastFileMetadataProvider()

Fast Metadata provider for FileBasedDatasource implementations.

BlockWritePathProvider API#

datasource.BlockWritePathProvider()

Abstract callable that provides concrete output paths when writing dataset blocks.

datasource.DefaultBlockWritePathProvider()

Default block write path provider implementation that writes each dataset block out to a file of the form: {base_path}/{dataset_uuid}_{task_index}_{block_index}.{file_format}