# Scaling your Gradio app with Ray Serve

In this guide, we show you how to scale up your [Gradio](https://gradio.app/) application using Ray Serve. The internal architecture of your Gradio app stays intact (no changes); we simply wrap the app in a Ray Serve deployment and scale it to access more resources.

## Dependencies

To follow this tutorial, you need Ray Serve and Gradio. If you haven't already, install them by running:

```console
$ pip install "ray[serve]"
$ pip install gradio==3.19
```

For this tutorial, we use Gradio apps that run text summarization and generation models and use [HuggingFace's Pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines) to access these models. **Note that you can substitute this Gradio app for any Gradio app of your own!**

First, let's install the transformers module.

```console
$ pip install transformers
```

## Quickstart: Deploy your Gradio app with Ray Serve

This section shows you an easy way to deploy your app onto Ray Serve. First, create a new Python file named `demo.py`. Second, import `GradioServer` from Ray Serve to deploy your Gradio app later, `gradio`, and `transformers.pipeline` to load text summarization models.

```{literalinclude} ../doc_code/gradio-integration.py
:start-after: __doc_import_begin__
:end-before: __doc_import_end__
```

Then, write a builder function that constructs the Gradio app `io`. This application takes in text and uses the [T5 Small](https://huggingface.co/t5-small) text summarization model, loaded using [HuggingFace's Pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines), to summarize that text.

:::{note}
Remember, you can substitute in your own Gradio app here if you want to try scaling it up!
:::

```{literalinclude} ../doc_code/gradio-integration.py
:start-after: __doc_gradio_app_begin__
:end-before: __doc_gradio_app_end__
```

### Deploying Gradio Server

To deploy your Gradio app onto Ray Serve, you need to wrap it in a Serve [deployment](serve-key-concepts-deployment). `GradioServer` acts as that wrapper: it serves your Gradio app remotely on Ray Serve so that the app can process and respond to HTTP requests. By wrapping your application in `GradioServer`, you can increase the number of CPUs and/or GPUs available to it.

:::{note}
Currently, there is no support for routing requests properly to multiple replicas of `GradioServer`, so we recommend having only a single replica.
:::

:::{note}
`GradioServer` is simply `GradioIngress` wrapped in a Serve deployment. You can use `GradioServer` for the simple wrap-and-deploy use case, but as you will see in the next section, you can use `GradioIngress` to define your own Gradio Server for more customized use cases.
:::

:::{note}
Ray can't pickle the Gradio app object itself. Instead, pass a builder function that constructs the Gradio interface.
:::

Whether you use the Gradio app `io` constructed by the builder function above or provide your own application (of type `Interface`, `Blocks`, `Parallel`, etc.), wrap it in your Gradio Server by passing its builder function as input. Serve uses the builder function to construct your Gradio app on the Ray cluster.

```{literalinclude} ../doc_code/gradio-integration.py
:start-after: __doc_app_begin__
:end-before: __doc_app_end__
```

Finally, deploy your Gradio Server! Run the following in your terminal:

```console
$ serve run demo:app
```

Now you can access your Gradio app at `http://localhost:8000`!
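Putting the pieces together, here's a minimal sketch of what `demo.py` might look like. It follows the structure described above, but the exact contents of the `doc_code` file this guide includes may differ; in particular, the helper names (`gradio_summarizer_builder`, `summarize`), the textbox labels, and the CPU count are illustrative:

```python
# demo.py -- a minimal sketch of the quickstart; details are illustrative.
import gradio as gr
from transformers import pipeline

from ray.serve.gradio_integrations import GradioServer


def gradio_summarizer_builder():
    # Load the T5 Small summarization model with HuggingFace's Pipelines.
    summarizer = pipeline("summarization", model="t5-small")

    def summarize(text: str) -> str:
        # The pipeline returns a list of dicts; extract the summary text.
        return summarizer(text)[0]["summary_text"]

    # Return the Gradio app; Serve calls this builder on the Ray cluster.
    return gr.Interface(
        fn=summarize,
        inputs=gr.Textbox(lines=10, label="Text to summarize"),
        outputs=gr.Textbox(label="Summary"),
    )


# Wrap the builder function in the GradioServer deployment, optionally
# requesting more resources for it (for example, 2 CPUs).
app = GradioServer.options(ray_actor_options={"num_cpus": 2}).bind(
    gradio_summarizer_builder
)
```

Here, `app` is the bound application that `serve run demo:app` looks up and runs.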
This is what the running app should look like:

![Gradio Result](https://raw.githubusercontent.com/ray-project/images/master/docs/serve/gradio_result.png)

See the [Production Guide](serve-in-production) for more information on how to deploy your app in production.

## Parallelizing models with Ray Serve

You can run multiple models in parallel with Ray Serve by utilizing the [deployment graph](serve-deployment-graphs) in Ray Serve.

### Original Approach

Suppose you want to run the following program.

1. Take two text generation models, [`gpt2`](https://huggingface.co/gpt2) and [`distilgpt2`](https://huggingface.co/distilgpt2).
2. Run the two models on the same input text, such that the generated text has a minimum length of 20 and a maximum length of 100.
3. Display the outputs of both models using Gradio.

This is how you would build it normally:

```{literalinclude} ../doc_code/gradio-original.py
:start-after: __doc_code_begin__
:end-before: __doc_code_end__
```

And launch the Gradio app:

```python
demo.launch()
```

### Parallelize using Ray Serve

With Ray Serve, we can parallelize the two text generation models by wrapping each model in a separate Ray Serve [deployment](serve-key-concepts-deployment). Deployments are defined by decorating a Python class or function with `@serve.deployment`, and they usually wrap the models that you want to deploy on Ray Serve to handle incoming requests.

Let's walk through a few steps to achieve parallelism. First, let's import our dependencies. Note that we need to import `GradioIngress` instead of `GradioServer` like before, since we're now building a customized `MyGradioServer` that can run models in parallel.

```{literalinclude} ../doc_code/gradio-integration-parallel.py
:start-after: __doc_import_begin__
:end-before: __doc_import_end__
```

Then, let's wrap our `gpt2` and `distilgpt2` models in Serve deployments, named `TextGenerationModel`.

```{literalinclude} ../doc_code/gradio-integration-parallel.py
:start-after: __doc_models_begin__
:end-before: __doc_models_end__
```

Next, instead of simply wrapping our Gradio app in a `GradioServer` deployment, we can build our own `MyGradioServer` that reroutes the Gradio app so that it runs the `TextGenerationModel` deployments.

```{literalinclude} ../doc_code/gradio-integration-parallel.py
:start-after: __doc_gradio_server_begin__
:end-before: __doc_gradio_server_end__
```

Lastly, we link everything together:

```{literalinclude} ../doc_code/gradio-integration-parallel.py
:start-after: __doc_app_begin__
:end-before: __doc_app_end__
```

:::{note}
This will bind your two text generation models (wrapped in Serve deployments) to `MyGradioServer._d1` and `MyGradioServer._d2`, forming a [deployment graph](serve-deployment-graphs). Thus, we have built our Gradio Interface `io` such that it calls `MyGradioServer.fanout()`, which simply sends requests to your two text generation models that are deployed on Ray Serve.
:::

Now, you can run your scalable app, and the two text generation models will run in parallel on Ray Serve. Run your Gradio app with the following command:

```console
$ serve run demo:app
```

Access your Gradio app at `http://localhost:8000`, and you should see the following interactive interface:

![Gradio Result](https://raw.githubusercontent.com/ray-project/images/master/docs/serve/gradio_result_parallel.png)

See the [Production Guide](serve-in-production) for more information on how to deploy your app in production.
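For reference, here's a condensed sketch of what the parallel `demo.py` might look like, assembled from the steps above. The class and attribute names (`TextGenerationModel`, `MyGradioServer`, `_d1`, `_d2`, `fanout`) follow the walkthrough, but the rest (labels, generation arguments, how handle results are resolved) is illustrative and may differ from the `doc_code` file or need adjusting for your Ray version:

```python
# demo.py (parallel version) -- a condensed sketch, not the exact doc_code file.
import asyncio

import gradio as gr
from transformers import pipeline

import ray
from ray import serve
from ray.serve.gradio_integrations import GradioIngress


@serve.deployment
class TextGenerationModel:
    def __init__(self, model_name: str):
        # Each deployment wraps one HuggingFace text-generation pipeline.
        self.generator = pipeline("text-generation", model=model_name)

    def __call__(self, text: str) -> str:
        generated = self.generator(text, min_length=20, max_length=100)
        return generated[0]["generated_text"]


@serve.deployment
class MyGradioServer(GradioIngress):
    def __init__(self, downstream_model_1, downstream_model_2):
        # Handles to the two text generation deployments bound below.
        self._d1 = downstream_model_1
        self._d2 = downstream_model_2

        # Build the Gradio app around self.fanout, which fans each request
        # out to both models.
        super().__init__(
            lambda: gr.Interface(
                fn=self.fanout,
                inputs=gr.Textbox(label="Input prompt"),
                outputs=[gr.Textbox(label="gpt2"), gr.Textbox(label="distilgpt2")],
            )
        )

    async def fanout(self, text: str):
        # Send the prompt to both deployments concurrently and wait for both.
        # (On newer Ray versions, awaiting handle.remote() can return results
        # directly, in which case the ray.get call below isn't needed.)
        refs = await asyncio.gather(self._d1.remote(text), self._d2.remote(text))
        result1, result2 = ray.get(refs)
        return result1, result2


app = MyGradioServer.bind(
    TextGenerationModel.bind("gpt2"),
    TextGenerationModel.bind("distilgpt2"),
)
```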