Engineering

Nov 21, 2023

Backend.AI Model Service Hands-on: Running GPT-NeoX

  • Kyujin Cho

    Senior Software Engineer

Backend.AI version 23.09 has been officially released to the public. We covered Model Service, a key feature in version 23.09, in our previous article, "Sneak Peek: Backend.AI Model Service preview". Since then, we have added a variety of new features, including GUI support and authentication token history management. This tutorial walks you through them to make the Backend.AI Model Service easy to understand.

In this post, we will show you how to use the Backend.AI Model Service to run the GPT-NeoX model on top of Triton Inference Server. Triton Inference Server is an open source model inference framework from NVIDIA that makes it easy to serve models over HTTP and gRPC[1]. It supports NVIDIA's own TensorRT, FasterTransformer, and TensorRT-LLM backends, as well as PyTorch, TensorFlow, vLLM, and many others.

Create a Model VFolder

  1. Navigate to the Data & Folders tab. Click the "New Folder" button to open the VFolder creation dialog.
  2. Create a new model folder. It does not matter how you name the folder, but make sure to set the "Usage" at the bottom to "Model". Once you have specified all the values, click the "Create" button at the bottom. Your model VFolder has now been created.

FasterTransformer Format Model Conversion

  1. Navigate to the "Sessions" tab. Click the "Start" button to open the session creation dialog.
  2. Select ngc-pytorch for "Running Environment" and 23.07 for "Version". Once you have made your selections, click the arrow icon in the lower right corner.
  3. The next window lets you select the VFolder to mount in the session. To load the model, select the VFolder you just created under the "Model storage folder to mount" section. Once you have made your selections, click the arrow icon in the lower right corner.
  4. The next window lets you specify the amount of resources to be used by the session. You should allocate at least 16 CPU cores and 128 GB of RAM to ensure smooth model conversion. Once you have made your selections, click the arrow icon in the lower right corner.
  5. After confirming that all settings have been applied correctly, click the "Start" button below to start the session.
  6. Once the session is created, a popup will appear to select an app, as shown below. Click the "Console" app to access the terminal environment.
  7. Run the following shell script to download the GPT-NeoX 20B model and convert it to the FasterTransformer format. Note that where the script mentions <VFolder name>, you must replace it with the name of the model VFolder you created.
cd /home/work/<VFolder name>
pip install -U transformers bitsandbytes
git clone https://github.com/NVIDIA/FasterTransformer
git clone https://huggingface.co/EleutherAI/gpt-neox-20b
cd gpt-neox-20b
git lfs install
git lfs pull

The GPT-NeoX 20B model requires at least 40GB of VRAM to run. If the physical GPU you are using has less VRAM than this and you need to split the model across multiple GPUs, adjust the number in the -i_g parameter to match the number of GPUs you are using.

cd /home/work/<VFolder name>
mkdir -p triton-deploy/gpt-neox-20b-ft
python ~/<VFolder name>/FasterTransformer/examples/pytorch/gptneox/utils/huggingface_gptneox_convert.py \
  -i /home/work/<VFolder name>/gpt-neox-20b \
  -o /home/work/<VFolder name>/triton-deploy/gpt-neox-20b-ft \
  -i_g 1 \
  -m_n GPT-NeoX-20B
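The 40 GB figure is consistent with simple arithmetic: roughly 20 billion parameters stored in fp16 (2 bytes each) occupy about 40 GB for the weights alone. The sketch below uses hypothetical helper names to estimate that footprint and a lower bound for the value to pass to -i_g. It ignores activation and cache memory, and it assumes that the GPU count should evenly divide the number of attention heads (64 is assumed here for GPT-NeoX 20B), so treat the result as a starting point rather than a guarantee.

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the model weights in GB (fp16 uses 2 bytes per parameter)."""
    return num_params * bytes_per_param / 1e9


def min_gpus(required_gb: float, vram_per_gpu_gb: float, num_heads: int = 64) -> int:
    """Smallest GPU count that fits the weights and evenly divides the attention heads.

    num_heads=64 is an assumption for GPT-NeoX 20B; adjust it for other models.
    """
    n = max(1, math.ceil(required_gb / vram_per_gpu_gb))
    # Move up to the next count that splits the heads evenly across GPUs.
    while n <= num_heads and num_heads % n != 0:
        n += 1
    if n > num_heads:
        raise ValueError("model does not fit on any valid number of GPUs")
    return n


if __name__ == "__main__":
    needed = weight_memory_gb(20e9)          # about 40 GB in fp16
    print(f"weights: {needed:.0f} GB")
    for vram in (80, 40, 24, 16):
        print(f"{vram} GB per GPU -> -i_g {min_gpus(needed, vram)}")
```

With 24 GB cards, for example, this suggests splitting the model across at least 2 GPUs.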

  8. If you have followed every step through step 7, you should have the following folders under the VFolder.
work@main1[PRRLCIqu-session]:~/GPT-NeoX-Triton-FT$ ls -al
total 62
drwxr-xr-x  5 work work 11776 Oct 12 12:14 .
drwxr-xr-x  9 work work  4096 Oct 12 12:29 ..
drwxr-xr-x 14 work work 12800 Oct 12 11:24 FasterTransformer
drwxr-xr-x  3 work work 16896 Oct 12 10:18 gpt-neox-20b
drwxr-xr-x  3 work work 11776 Oct 12 11:56 triton-deploy

Now it's time to add the configuration file for Triton Inference Server. Create the file triton-deploy/gpt-neox-20b-ft/config.pbtxt and add the following contents.

If you set the value of the -i_g parameter to anything other than 1 in step 7, you must modify the value of tensor_para_size in the settings below to match the value of -i_g.

name: "gpt-neox-20b-ft"
backend: "fastertransformer"
default_model_filename: "gpt-neox-20b-ft"
max_batch_size: 1024

model_transaction_policy {
  decoupled: False
}

input [
  {
    name: "input_ids"
    data_type: TYPE_UINT32
    dims: [ -1 ]
  },
  {
    name: "start_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "end_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "input_lengths"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "request_output_len"
    data_type: TYPE_UINT32
    dims: [ -1 ]
  },
  {
    name: "runtime_top_k"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_search_diversity_rate"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "len_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "is_return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_width"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "bad_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
  },
  {
    name: "stop_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
  },
  {
    name: "prompt_learning_task_name_ids"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "top_p_decay"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "top_p_min"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "top_p_reset_ids"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  }
]

output [
  {
    name: "output_ids"
    data_type: TYPE_UINT32
    dims: [ -1, -1 ]
  },
  {
    name: "sequence_length"
    data_type: TYPE_UINT32
    dims: [ -1 ]
  },
  {
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]

parameters {
  key: "tensor_para_size"
  value: { string_value: "1" }
}
parameters {
  key: "pipeline_para_size"
  value: { string_value: "1" }
}
parameters {
  key: "data_type"
  value: { string_value: "fp16" }
}
parameters {
  key: "model_type"
  value: { string_value: "GPT-NeoX" }
}
parameters {
  key: "model_checkpoint_path"
  value: { string_value: "/models/triton-deploy/gpt-neox-20b-ft/1-gpu" }
}
parameters {
  key: "enable_custom_all_reduce"
  value: { string_value: "0" }
}
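If you converted the model with a different -i_g value, editing the configuration by hand is error-prone. The following is a minimal sketch of a hypothetical helper (set_tensor_parallel is not part of Triton or Backend.AI) that rewrites tensor_para_size in the text of config.pbtxt. It also assumes that the converter wrote its output into a directory named <N>-gpu, as the 1-gpu suffix of model_checkpoint_path suggests; verify the actual directory name in your VFolder before relying on it.

```python
import re


def set_tensor_parallel(config_text: str, n_gpus: int) -> str:
    """Return config.pbtxt text with tensor_para_size (and the N-gpu path suffix) updated."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")

    # Update the string_value that belongs to the tensor_para_size parameter only.
    pattern = r'(key:\s*"tensor_para_size"\s*value:\s*\{\s*string_value:\s*")\d+(")'
    text, count = re.subn(pattern, rf"\g<1>{n_gpus}\g<2>", config_text)
    if count != 1:
        raise ValueError("tensor_para_size parameter not found exactly once")

    # Assumption: the checkpoint directory is named "<N>-gpu" after the -i_g value.
    text = re.sub(r'/\d+-gpu"', f'/{n_gpus}-gpu"', text)
    return text


if __name__ == "__main__":
    sample = (
        'parameters { key: "tensor_para_size" value: { string_value: "1" } }\n'
        'parameters { key: "model_checkpoint_path" value: '
        '{ string_value: "/models/triton-deploy/gpt-neox-20b-ft/1-gpu" } }\n'
    )
    print(set_tensor_parallel(sample, 2))
```

Working on the text with a narrow regular expression keeps the rest of the file byte-for-byte identical, which makes the change easy to review.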
  9. Finally, add the Backend.AI Model Service definition file to the root of the VFolder, named model-definition.yaml (model-definition.yml is also acceptable). Let's take a closer look at the model definition file for running Triton Inference Server.
models:
  - name: "GPT-NeoX"
    model_path: "/models/triton-deploy"
...

This is where you specify the model name and the path to the model.

The name and path you set here can be accessed by the model server process as the BACKEND_MODEL_NAME and BACKEND_MODEL_PATH environment variables, respectively.
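As an illustration, a model server process written in Python could pick these values up as sketched below. Only the two environment variable names come from the text above; the helper name and the injectable env argument are hypothetical conveniences for testing.

```python
import os
from typing import Mapping, Optional, Tuple


def model_settings(env: Optional[Mapping[str, str]] = None) -> Tuple[Optional[str], Optional[str]]:
    """Return (model name, model path) as exposed to the model server process."""
    env = os.environ if env is None else env
    # Both values are None when the process is not started by the Model Service.
    return env.get("BACKEND_MODEL_NAME"), env.get("BACKEND_MODEL_PATH")


if __name__ == "__main__":
    name, path = model_settings()
    print(f"model name: {name!r}, model path: {path!r}")
```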

...
    service:
      start_command:
        - tritonserver
        - --model-repository=/models/triton-deploy
        - --disable-auto-complete-config
        - --log-verbose
        - "1"
...

This is the part that defines the command line syntax for starting the Model Server process.

...
      port: 8000
...

This is where you specify the port that the model server process exposes for API communication. Unless configured otherwise, Triton Inference Server serves its HTTP API on port 8000, so that is the port to enter in the model definition file.

...
      health_check:
        path: /v2/health/ready
        max_retries: 3
        max_wait_time: 5
        expected_status_code: 200

This is where you enable and configure the Health Check feature. When it is enabled, Backend.AI continuously sends HTTP GET requests to the path and verifies that the response carries the HTTP status code given in expected_status_code (optional, defaults to 200). If the model server does not respond, or responds with a different status code, Backend.AI considers the session unhealthy and excludes it from the service. An excluded session is not terminated automatically; the Model Service administrator must take action manually, for example by checking the container logs.

The Health Check feature can be disabled by omitting the health_check section entirely. In that case, Backend.AI does not check the health of the model server and always assumes it is healthy.

    • max_wait_time: the API response timeout, given as a number of seconds.
    • max_retries: the number of times the request is retried before the model server is judged to be unhealthy.
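To make the interaction between these settings concrete, here is a minimal sketch of one health-check round. It is not Backend.AI's actual implementation; the function names are hypothetical, and it assumes that a round succeeds as soon as one attempt returns the expected status code within max_wait_time seconds.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_probe(url: str, timeout: float) -> Optional[int]:
    """Send one HTTP GET and return the status code, or None if there was no response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 503.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return None


def is_healthy(
    url: str,
    max_retries: int = 3,
    max_wait_time: float = 5.0,
    expected_status_code: int = 200,
    probe: Callable[[str, float], Optional[int]] = http_probe,
) -> bool:
    """Return True if any of max_retries attempts yields the expected status code."""
    for _ in range(max_retries):
        if probe(url, max_wait_time) == expected_status_code:
            return True
    return False


if __name__ == "__main__":
    # Assumes a Triton server listening locally on port 8000.
    print(is_healthy("http://127.0.0.1:8000/v2/health/ready"))
```

Passing the probe as a parameter keeps the retry logic testable without a running server.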
The finished model definition file looks like this.

models:
  - name: "GPT-NeoX"
    model_path: "/models/triton-deploy"
    service:
      start_command:
        - tritonserver
        - --model-repository=/models/triton-deploy
        - --disable-auto-complete-config
        - --log-verbose
        - "1"
      port: 8000
      health_check:
        path: /v2/health/ready
        max_retries: 3
        max_wait_time: 5

More information about model definition files can be found in the Backend.AI WebUI documentation.

Now you're all set to run the Model Service.

Create a Model Service

  1. Navigate to the "Model Serving" tab. Click the "Start Service" button to open the Create Model Service window. Let's take a look at each section in a little more detail.
    • Service name: This is where you specify the name of the Model Service. The name of the Model Service can be used as a subdomain of the Model Service Endpoint (coming soon).
    • Resource Group: This is the field to select the resource group where the Inference Session for the Model Service will be created.
    • Open your app to the outside world: When this feature is enabled, all API requests to the model server must be accompanied by an authentication header before they can be made. For more information about Model Service authentication, see the Backend.AI WebUI documentation.
    • Desired number of routes: A field to specify the number of inference sessions the Model Server process runs in. Setting this value to a number greater than 1 creates multiple identical sessions and enables the round-robin load balancer feature, which distributes API requests evenly among these sessions. This value can be modified at any time after Model Service creation.
    • A panel that specifies the amount of resources for the inference session.

The GPT-NeoX 20B model requires a minimum of 40 GB of VRAM to run. How fGPU units map to VRAM depends on the configuration of your Backend.AI installation, so consult your Backend.AI administrator for details. Once you have set all the values correctly, click the "OK" button to create the Model Service.
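The round-robin distribution mentioned under "Desired number of routes" can be pictured with the short sketch below. It only illustrates the general technique; the class name and the route labels are hypothetical and do not reflect Backend.AI's internal load balancer.

```python
from itertools import cycle
from typing import Iterable


class RoundRobinRouter:
    """Hand out routes in a fixed rotating order so each gets an equal share of requests."""

    def __init__(self, routes: Iterable[str]) -> None:
        routes = list(routes)
        if not routes:
            raise ValueError("at least one route is required")
        self._routes = cycle(routes)

    def next_route(self) -> str:
        """Return the route that should serve the next request."""
        return next(self._routes)


if __name__ == "__main__":
    router = RoundRobinRouter(["session-1", "session-2", "session-3"])
    for request_id in range(5):
        print(request_id, "->", router.next_route())
```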

  2. The Model Service has been created. While the model server process in the inference session is not yet ready, the status remains "PROVISIONING". Click the "INFERENCE" section of the "Sessions" tab and you will see that an inference session has been created for the Model Service you created in step 1. Model Service administrators can click the clipboard icon in the "Control" row to view logs related to the model server process in the inference session.
  3. When the model server process is running normally, the status of the route at the bottom and the status at the top both change to "HEALTHY", and the address for accessing the Model Service appears under "Service Endpoints". You can now reach the Triton Inference Server running in the inference session through that address.
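Once the endpoint is available, a client can call Triton's HTTP inference API. The sketch below builds a request body for the gpt-neox-20b-ft model using the input names from config.pbtxt. The endpoint URL and the token IDs are placeholders, the function names are hypothetical, and authentication headers are omitted. The model expects token IDs rather than text, so tokenization has to happen on the client side.

```python
import json
import urllib.request
from typing import Any, Dict, List

MODEL_NAME = "gpt-neox-20b-ft"  # the name declared in config.pbtxt


def build_infer_request(token_ids: List[int], output_len: int) -> Dict[str, Any]:
    """Build an inference request body for a single prompt (batch size 1)."""
    n = len(token_ids)
    return {
        "inputs": [
            {"name": "input_ids", "shape": [1, n], "datatype": "UINT32", "data": list(token_ids)},
            {"name": "input_lengths", "shape": [1, 1], "datatype": "UINT32", "data": [n]},
            {"name": "request_output_len", "shape": [1, 1], "datatype": "UINT32", "data": [output_len]},
        ]
    }


def infer(endpoint: str, token_ids: List[int], output_len: int, timeout: float = 60.0) -> Dict[str, Any]:
    """POST the request to the service endpoint and return the decoded JSON response."""
    url = f"{endpoint.rstrip('/')}/v2/models/{MODEL_NAME}/infer"
    body = json.dumps(build_infer_request(token_ids, output_len)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Placeholder token IDs; real IDs come from the GPT-NeoX tokenizer.
    print(json.dumps(build_infer_request([101, 102, 103], output_len=32), indent=2))
```

The generated token IDs come back in the output_ids tensor and need to be decoded with the same tokenizer.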

Conclusion

In this article, you've learned how to start serving LLM models using the Backend.AI Model Service. The Model Service feature is available in Backend.AI's Cloud Beta. Start serving your own models today!

1: Not supported by Backend.AI Model Service

This post is automatically translated from Korean
