This page was generated from examples/triton_gpt2/README.ipynb.

Pretrained GPT2 Model Deployment Example

In this notebook, we will run an example of text generation using a GPT2 model exported from HuggingFace and deployed with Seldon’s pre-packaged Triton server. The example also covers converting the model to ONNX format. The example below implements the greedy approach to next-token prediction. More info: https://huggingface.co/transformers/model_doc/gpt2.html?highlight=gpt2

After we have the model deployed to Kubernetes, we will run a simple load test to evaluate its inference performance.
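
Greedy decoding, mentioned above, is the simplest generation strategy: at each step the model’s output distribution is collapsed to its single most likely token. A conceptual sketch (the full client-side loop appears later in this notebook):

[ ]:
import numpy as np

# Conceptual sketch of greedy decoding: `logits` has shape
# [batch, sequence_length, vocab_size], as returned by the model.
def greedy_next_token(logits: np.ndarray) -> int:
    # argmax over the vocabulary at the last sequence position
    return int(logits[0, -1].argmax())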

Steps:

  1. Download the pretrained GPT2 model from HuggingFace

  2. Convert the model to ONNX

  3. Store it in a MinIO bucket

  4. Set up Seldon Core in your Kubernetes cluster

  5. Deploy the ONNX model with Seldon’s pre-packaged Triton server

  6. Interact with the model: run a greedy algorithm example (generate a sentence completion)

  7. Run a load test using vegeta

  8. Clean up

Basic requirements

  • Helm v3.0.0+

  • A Kubernetes cluster running v1.13 or above (Minikube / Docker for Windows work well if given enough RAM)

  • kubectl v1.14+

  • Python 3.6+

[ ]:
%%writefile requirements.txt
transformers==4.5.1
torch==1.8.1
tokenizers<0.11,>=0.10.1
tensorflow==2.4.1
tf2onnx
[ ]:
!pip install --trusted-host=pypi.python.org --trusted-host=pypi.org --trusted-host=files.pythonhosted.org -r requirements.txt

Export the pre-trained HuggingFace TFGPT2LMHeadModel and save it locally

[ ]:
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# from_pt=True loads the PyTorch checkpoint into the TensorFlow model;
# GPT2 has no dedicated padding token, so pad with the EOS token
model = TFGPT2LMHeadModel.from_pretrained(
    "gpt2", from_pt=True, pad_token_id=tokenizer.eos_token_id
)
# saved_model=True also writes a TensorFlow SavedModel, which tf2onnx consumes
model.save_pretrained("./tfgpt2model", saved_model=True)

Convert the TensorFlow saved model to ONNX

[ ]:
!python -m tf2onnx.convert --saved-model ./tfgpt2model/saved_model/1 --opset 11  --output model.onnx
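
Optionally, sanity-check the exported graph locally before uploading it. A minimal sketch, assuming the onnxruntime package is installed (it is not part of requirements.txt above):

[ ]:
import numpy as np
import onnxruntime as ort

# load the exported graph and run a dummy batch through it
sess = ort.InferenceSession("model.onnx")
dummy = {
    "input_ids": np.array([[464, 3290]], dtype=np.int32),
    "attention_mask": np.ones((1, 2), dtype=np.int32),
}
outputs = sess.run(None, dummy)
# one of the outputs is the logits tensor, whose last
# dimension is the GPT2 vocabulary size (50257)
print([o.shape for o in outputs])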

Copy your model to a local MinIO

Set up MinIO

Use the provided notebook to install MinIO in your cluster and configure the mc CLI tool. Instructions are also available online.

– Note: you can use your preferred remote storage server (Google Cloud, AWS, etc.) instead.
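
If mc is not configured yet, pointing it at MinIO looks roughly like this (assuming the default minioadmin credentials and MinIO reachable on localhost:9000, e.g. via a port-forward; adjust to your setup):

[ ]:
!mc alias set minio-seldon http://localhost:9000 minioadmin minioadmin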

Create a Bucket and store your model

[1]:
!mc mb minio-seldon/onnx-gpt2 -p
!mc cp ./model.onnx minio-seldon/onnx-gpt2
Bucket created successfully `minio-seldon/onnx-gpt2`.
./model.onnx:  622.37 MiB / 622.37 MiB ┃▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓┃ 127.01 MiB/s 4s
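
You can double-check the upload with a listing (output will vary):

[ ]:
!mc ls minio-seldon/onnx-gpt2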

Run Seldon in your Kubernetes cluster

Follow the Seldon-Core Setup notebook to set up a cluster with Ambassador Ingress or Istio and install Seldon Core.
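
For reference, with Istio already installed, the Seldon Core install is roughly the following (a sketch; the setup notebook is the authoritative source):

[ ]:
!kubectl create namespace seldon-system
!helm install seldon-core seldon-core-operator --repo https://storage.googleapis.com/seldon-charts --set istio.enabled=true --namespace seldon-system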

Deploy your model with Seldon’s pre-packaged Triton server

[2]:
%%writefile secret.yaml

apiVersion: v1
kind: Secret
metadata:
  name: seldon-init-container-secret
type: Opaque
stringData:
  RCLONE_CONFIG_S3_TYPE: s3
  RCLONE_CONFIG_S3_PROVIDER: minio
  RCLONE_CONFIG_S3_ENV_AUTH: "false"
  RCLONE_CONFIG_S3_ACCESS_KEY_ID: minioadmin
  RCLONE_CONFIG_S3_SECRET_ACCESS_KEY: minioadmin
  RCLONE_CONFIG_S3_ENDPOINT: http://minio.minio-system.svc.cluster.local:9000

Overwriting secret.yaml
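
The keys in this secret follow rclone’s RCLONE_CONFIG_<REMOTE>_<OPTION> environment-variable convention: they define a remote named s3 backed by the in-cluster MinIO, which is what the s3:// scheme in the modelUri below resolves against. Seldon’s storage initializer uses these variables when it downloads the model.
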
[3]:
%%writefile gpt2-deploy.yaml
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: gpt2
spec:
  predictors:
  - graph:
      implementation: TRITON_SERVER
      logger:
        mode: all
      modelUri: s3://onnx-gpt2
      envSecretRefName: seldon-init-container-secret
      name: gpt2
      type: MODEL
    name: default
    replicas: 1
  protocol: kfserving
Overwriting gpt2-deploy.yaml
[4]:
!kubectl apply -f secret.yaml -n default
!kubectl apply -f gpt2-deploy.yaml -n default
secret/seldon-init-container-secret configured
seldondeployment.machinelearning.seldon.io/gpt2 created
[5]:
!kubectl rollout status deploy/$(kubectl get deploy -l seldon-deployment-id=gpt2 -o jsonpath='{.items[0].metadata.name}')
deployment "gpt2-default-0-gpt2" successfully rolled out

Interact with the model: get the model metadata (a “test” request to make sure our model is available and loaded correctly)

[1]:
!curl -v http://localhost:80/seldon/default/gpt2/v2/models/gpt2
*   Trying 127.0.0.1:80...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 80 (#0)
> GET /seldon/default/gpt2/v2/models/gpt2 HTTP/1.1
> Host: localhost
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< access-control-allow-headers: Accept, Accept-Encoding, Authorization, Content-Length, Content-Type, X-CSRF-Token
< access-control-allow-methods: GET,OPTIONS
< access-control-allow-origin: *
< content-type: application/json
< seldon-puid: 97dd6a08-3520-48f2-af34-7c32e5967c38
< x-content-type-options: nosniff
< date: Fri, 21 May 2021 14:26:22 GMT
< content-length: 336
< x-envoy-upstream-service-time: 3
< server: istio-envoy
<
* Connection #0 to host localhost left intact
{"name":"gpt2","versions":["1"],"platform":"onnxruntime_onnx","inputs":[{"name":"input_ids","datatype":"INT32","shape":[-1,-1]},{"name":"attention_mask","datatype":"INT32","shape":[-1,-1]}],"outputs":[{"name":"past_key_values","datatype":"FP32","shape":[12,2,-1,12,-1,64]},{"name":"logits","datatype":"FP32","shape":[-1,-1,50257]}]}

Run a prediction test: generate a sentence completion using the GPT2 model (greedy approach)

[2]:
import json

import numpy as np
import requests
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
input_text = "I enjoy working in Seldon"
count = 0
max_gen_len = 10
gen_sentence = input_text
while count < max_gen_len:
    input_ids = tokenizer.encode(gen_sentence, return_tensors="tf")
    shape = input_ids.shape.as_list()
    payload = {
        "inputs": [
            {
                "name": "input_ids",
                "datatype": "INT32",
                "shape": shape,
                "data": input_ids.numpy().tolist(),
            },
            {
                "name": "attention_mask",
                "datatype": "INT32",
                "shape": shape,
                "data": np.ones(shape, dtype=np.int32).tolist(),
            },
        ]
    }

    ret = requests.post(
        "http://localhost:80/seldon/default/gpt2/v2/models/gpt2/infer", json=payload
    )

    try:
        res = ret.json()
    except ValueError:
        # non-JSON response (e.g. a transient gateway error): retry this step
        continue

    # extract logits (the second output tensor, per the model metadata above)
    logits = np.array(res["outputs"][1]["data"])
    logits = logits.reshape(res["outputs"][1]["shape"])

    # greedily take the highest-probability token following the last input token
    next_token = logits.argmax(axis=2)[0]
    next_token_str = tokenizer.decode(
        next_token[-1:], skip_special_tokens=True, clean_up_tokenization_spaces=True
    ).strip()
    gen_sentence += " " + next_token_str
    count += 1

print(f"Input: {input_text}\nOutput: {gen_sentence}")
Input: I enjoy working in Seldon
Output: I enjoy working in Seldon 's office , and I 'm glad to see that
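
Note the stray spaces around ’s and the comma: they appear because each generated token is decoded separately and the pieces are joined with " ". Decoding the accumulated token ids with a single tokenizer.decode call at the end would let GPT2’s byte-pair tokenizer restore natural spacing.
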

Run Load Test / Performance Test using vegeta

Install vegeta; for more details, take a look at the official vegeta documentation.

[8]:
!wget https://github.com/tsenart/vegeta/releases/download/v12.8.3/vegeta-12.8.3-linux-amd64.tar.gz
!tar -zxvf vegeta-12.8.3-linux-amd64.tar.gz
!chmod +x vegeta

Generate a vegeta target file containing a POST command with the payload in the required structure

[9]:
import base64
import json

import numpy as np
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
input_text = "I enjoy working in Seldon"
input_ids = tokenizer.encode(input_text, return_tensors="tf")
shape = input_ids.shape.as_list()
payload = {
    "inputs": [
        {
            "name": "input_ids",
            "datatype": "INT32",
            "shape": shape,
            "data": input_ids.numpy().tolist(),
        },
        {
            "name": "attention_mask",
            "datatype": "INT32",
            "shape": shape,
            "data": np.ones(shape, dtype=np.int32).tolist(),
        },
    ]
}

cmd = {
    "method": "POST",
    "header": {"Content-Type": ["application/json"]},
    "url": "http://localhost:80/seldon/default/gpt2/v2/models/gpt2/infer",
    "body": base64.b64encode(bytes(json.dumps(payload), "utf-8")).decode("utf-8"),
}

with open("vegeta_target.json", mode="w") as file:
    json.dump(cmd, file)
    file.write("\n\n")
[10]:
!vegeta attack -targets=vegeta_target.json -rate=1 -duration=60s -format=json | vegeta report -type=text
Requests      [total, rate, throughput]         60, 1.02, 1.01
Duration      [total, attack, wait]             59.198s, 59s, 197.751ms
Latencies     [min, mean, 50, 90, 95, 99, max]  179.123ms, 280.177ms, 214.79ms, 325.753ms, 457.825ms, 1.936s, 2.009s
Bytes In      [total, mean]                     475783920, 7929732.00
Bytes Out     [total, mean]                     13140, 219.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:60
Error Set:

Clean-up

[11]:
!kubectl delete -f gpt2-deploy.yaml -n default
seldondeployment.machinelearning.seldon.io "gpt2" deleted
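
If the secret was created just for this example, remove it as well:

[ ]:
!kubectl delete -f secret.yaml -n default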