ray.data.Dataset.write_json
- Dataset.write_json(path: str, *, filesystem: Optional[pyarrow.fs.FileSystem] = None, try_create_dir: bool = True, arrow_open_stream_args: Optional[Dict[str, Any]] = None, block_path_provider: ray.data.datasource.file_based_datasource.BlockWritePathProvider = <ray.data.datasource.file_based_datasource.DefaultBlockWritePathProvider object>, pandas_json_args_fn: Callable[[], Dict[str, Any]] = <function Dataset.<lambda>>, ray_remote_args: Dict[str, Any] = None, **pandas_json_args) -> None
Writes the Dataset to JSON and JSONL files.
The number of files is determined by the number of blocks in the dataset. To control the number of blocks, call repartition(). This method is only supported for datasets with records that are convertible to pandas DataFrames.
By default, the format of the output files is {uuid}_{block_idx}.json, where uuid is a unique id for the dataset. To modify this behavior, implement a custom BlockWritePathProvider and pass it in as the block_path_provider argument.
Note
This operation will trigger execution of the lazy transformations performed on this dataset.
Examples
Write the dataset as a JSON file to a local directory.
>>> import ray
>>> import pandas as pd
>>> ds = ray.data.from_pandas([pd.DataFrame({"one": [1], "two": ["a"]})])
>>> ds.write_json("local:///tmp/data")
Write the dataset as JSONL files to a local directory.
>>> ds = ray.data.read_json("s3://anonymous@ray-example-data/train.jsonl")
>>> ds.write_json("local:///tmp/data")
Time complexity: O(dataset size / parallelism)
- Parameters
path – The path to the destination root directory where the JSON files are written.
filesystem – The pyarrow filesystem implementation to write to. These filesystems are specified in the pyarrow docs. Specify this if you need to provide specific configurations to the filesystem. By default, the filesystem is automatically selected based on the scheme of the paths. For example, if the path begins with s3://, the S3FileSystem is used.
try_create_dir – If True, attempts to create all directories in the destination path. Does nothing if all directories already exist. Defaults to True.
arrow_open_stream_args – kwargs passed to pyarrow.fs.FileSystem.open_output_stream, which is used when opening the file to write to.
block_path_provider – A BlockWritePathProvider implementation specifying the filename structure for each output JSON file. By default, the format of the output files is {uuid}_{block_idx}.json, where uuid is a unique id for the dataset.
pandas_json_args_fn – Callable that returns a dictionary of write arguments that are provided to pandas.DataFrame.to_json() when writing each block to a file. Overrides any duplicate keys from pandas_json_args. Use this parameter instead of pandas_json_args if any of your write arguments can't be pickled, or if you'd like to lazily resolve the write arguments for each dataset block.
ray_remote_args – kwargs passed to remote() in the write tasks.
pandas_json_args – These args are passed to pandas.DataFrame.to_json(), which is used under the hood to write out each Dataset block. These are dict(orient="records", lines=True) by default.
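Because write_json forwards pandas_json_args to pandas.DataFrame.to_json(), the default dict(orient="records", lines=True) produces one JSON object per line (JSONL). A pandas-only sketch of what that output format looks like for a single block:

```python
import json

import pandas as pd

df = pd.DataFrame({"one": [1, 2], "two": ["a", "b"]})

# The same arguments write_json passes to to_json() by default:
text = df.to_json(orient="records", lines=True)

# Each non-empty line is a standalone JSON record.
records = [json.loads(line) for line in text.splitlines() if line]
```

Passing other to_json() keyword arguments through **pandas_json_args (for example, a different orient) changes the layout of every output file in the same way.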