Jul 11, 2024

Engineering

Model Variant: Easily Serving Various Model Services

Jihyun Kang
Senior Software Engineer

Introduction

Imagine you need to train an AI model for research purposes and produce results. Your job is simply to wait until the model has correctly learned the data you gave it. Building a service that actually uses AI is more complicated. Everything becomes a concern, from how to apply various models to the system to what criteria to use for scaling under load. We can't carelessly modify a production environment with live users to find answers to these questions. An accident while scaling the production environment up or down could have serious consequences, and recovery takes time; we can't expect the same patience from consumers of our service that we would from researchers waiting for model training to finish. Besides the engineering difficulties, there are cost challenges. Serving models obviously costs money, and users are already incurring expenses while training models, because resources are being consumed.

However, there's no need to worry. Many well-made models already exist, and in many cases it's sufficient to take one of them and serve it. As those interested in our solution may already know, Backend.AI already supports the features you need when serving models. You can scale services up or down according to traffic and serve various models tailored to users' preferences.

But the Backend.AI team doesn't stop here. We have enhanced the model service, which has been available since Backend.AI version 23.09, to make serving various models easier. This post introduces the features that make this possible.

We already explained the model service in detail when the 23.09 update was released, so we'll skip that explanation here. If you're unfamiliar with Backend.AI's model service, we recommend reading the following post first: Backend.AI Model Service Preview

Existing Method

Requirement	Existing Method	Model Variant
Writing a model definition file (model-definition.yaml)	Required	Not required
Uploading the model definition file to the model folder	Required	Not required
Model metadata	Required	Partially required (some can be received automatically)

Previously, in addition to the model metadata needed to run the model, the Backend.AI model service required a model definition file (model-definition.yaml) that specifies, in a fixed format, the commands to execute when serving the model. The workflow was as follows: write the model definition file, upload it to a model-type folder so that it can be read, and mount that model folder when starting the model service. An API server then starts automatically; it forwards the end user's input to the model according to the model definition file and sends back the response. However, this method had the disadvantage that you had to access the file every time the model definition needed to change. It was also cumbersome because the model path was fixed in the model definition file, so a different file had to be written each time the model changed.

Model Variant, the feature introduced here, lets you serve a model immediately using only the model metadata, without a model definition file, by entering a few configuration values or none at all. It supports three methods: command, vLLM, and NIM (NVIDIA Inference Microservice). The sections below show how to serve a model with each method and how to verify that the service is running.

A model service needs the metadata of the model to be served. Download the model you want to serve from Hugging Face, where model metadata is easy to access. This example uses the Llama-2-7b-hf and Calm3-22b-chat models from Hugging Face. For how to upload model metadata to the model folder, refer to the "Preparing Model Storage" section in the previous post.
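As a rough illustration of the download step, the following sketch fetches a model repository with the huggingface_hub package's snapshot_download function. The repository id, the target directory, and the helper names are assumptions made for this example and do not come from the original post; gated models such as Llama 2 also require an access token from an account that has been granted access.

```python
from pathlib import Path


def local_model_dir(repo_id: str, root: str) -> Path:
    """Return a directory under `root` named after the last part of the repo id."""
    return Path(root) / repo_id.split("/")[-1]


def download_model(repo_id: str, root: str, token=None) -> Path:
    """Download every file of a Hugging Face repository into a local folder."""
    # Imported lazily so that local_model_dir() works without the package installed.
    from huggingface_hub import snapshot_download

    target = local_model_dir(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target), token=token)
    return target


# Example (hypothetical repo id and path; needs network access and, for gated
# models, a valid token):
# download_model("meta-llama/Llama-2-7b-hf", "./models", token="hf_...")
```

The downloaded folder can then be uploaded to the model folder as described in the previous post.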

Automatically Serving Model from Built Image (Command Method)

In the command method, the first of the three, the command that the model definition file would otherwise specify for serving the model is included in the execution image itself. Specify the command to execute in the CMD environment variable, build the image, and then run it without any further input when serving the model. The command method does not support the health check that verifies whether the service is running properly. It is therefore better suited to quickly setting up and checking a prototype service than to running a large-scale service. The execution method is as follows:

On the start screen, select Llama-2-7b-hf in the Model Storage To Mount item to mount the model folder that contains the metadata of the model to be served, and select Predefined Image Command in the Inference Runtime Variant item.

Activate the Open To Public switch if you want the model service to be accessible without a separate token.

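[Figure: Model service start screen, mounting the model metadata and selecting CMD]

Select the environment to serve. Here, we use vllm:0.5.0 and allocate 4 CPU cores, 16 GiB of memory, and 10 fGPU of NVIDIA CUDA GPU as resources.

[Figure: Model service start screen, selecting the execution environment and allocating resources]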

Finally, select the cluster size and click the start button. Here, the cluster size is set to single node, single container.

[Figure: Model service start screen, selecting the cluster size and starting]

If the service has been successfully launched, the service status will change to HEALTHY and the endpoint address will appear.

[Figure: Model service detail screen]

Verifying the Service

If the service has been launched normally, check the service model name with the cURL command:

curl https://cmd-model-service.asia03.app.backend.ai/v1/models \
-H "Content-Type: application/json"

[Figure: Checking the model name]
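The same check can be scripted. The sketch below uses only the Python standard library and assumes that the endpoint returns the usual OpenAI-compatible model list, that is, a JSON object whose data array holds one entry per model with an id field. The function names are made up for this example.

```python
import json
import urllib.request


def extract_model_ids(payload: dict) -> list[str]:
    """Return the `id` of every entry in an OpenAI-style model list."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str, timeout: float = 10.0) -> list[str]:
    """GET {base_url}/v1/models and return the names of the served models."""
    url = base_url.rstrip("/") + "/v1/models"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return extract_model_ids(json.load(response))


# Example (requires a running service):
# list_models("https://cmd-model-service.asia03.app.backend.ai")
```

Keeping the parsing in its own function makes it easy to check against a saved response without contacting the service.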

Now, let's send input to the service with the cURL command and check the response. For a model service run with CMD, the model name is already defined in the image, so use the model name you just checked as the value of the model key in the request:

curl https://cmd-model-service.asia03.app.backend.ai/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "image-model",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0}'

[Figure: Model service request result screen]
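For readers who prefer Python to cURL, the request above could be sketched as follows with the standard library. It assumes an OpenAI-compatible completions response, where the generated text is found under choices[0].text. The helper names are illustrative only.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=7, temperature=0.0):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    body = {
        "model": model,  # must match the name reported by /v1/models
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_completion_text(response_payload: dict) -> str:
    """Return the text of the first choice in a completions response."""
    return response_payload["choices"][0]["text"]


# Example (requires a running service):
# req = build_completion_request(
#     "https://cmd-model-service.asia03.app.backend.ai", "image-model", "San Francisco is a")
# with urllib.request.urlopen(req) as resp:
#     print(first_completion_text(json.load(resp)))
```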

Serving Models in vLLM Mode

The vLLM mode is similar to the command method introduced earlier, but the options passed to vLLM at startup can be set through environment variables. The execution method is as follows:

How to Run

On the start screen, mount the model folder for the model service to be served and select vLLM in the Inference Runtime Variant item.

[Figure: Model service start screen, mounting the model metadata and selecting vLLM]

Select the environment to serve. As with the command method explained earlier, select vllm:0.5.0. You could allocate the same resources as before, but this time we'll allocate 16 CPU cores, 64 GiB of memory, and 10 fGPU of NVIDIA CUDA GPU.

[Figure: Model service start screen, selecting the execution environment and allocating resources]

Finally, select the cluster size and enter the environment variable BACKEND_MODEL_NAME. This value corresponds to the --served-model-name option in vLLM and becomes the model value the user specifies when sending a request to the service. Once the input is complete, click the START button to create the service.

Likewise, if the service has been successfully launched, the service status will change to HEALTHY, and the endpoint address where the service is launched will appear.

[Figure: Model service detail screen]

Verifying the Service

Let's send input to the service with the cURL command and check the response. For the model value, enter the BACKEND_MODEL_NAME value you set earlier.

curl https://vllm-calm3-22b-chat.asia03.app.backend.ai/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vllm-model",
"prompt": "初めて会う日本人ビジネスマンに渡す最高の挨拶は何でしょうか？",
"max_tokens":  200,
"temperature": 0
}'

[Figure: Model service request result screen]
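To make the link between BACKEND_MODEL_NAME and the model field of the request more concrete, here is a conceptual sketch of how an environment variable could be turned into vLLM server arguments. It is not Backend.AI's actual implementation: the function name and the /models path are assumptions for this example. The --model and --served-model-name flags belong to vLLM's OpenAI-compatible API server.

```python
def build_vllm_command(env: dict, model_path: str = "/models") -> list[str]:
    """Translate service environment variables into a vLLM server command line."""
    command = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_path,  # hypothetical mount point of the model folder
    ]
    served_name = env.get("BACKEND_MODEL_NAME")
    if served_name:
        # Clients must send this name in the "model" field of their requests.
        command += ["--served-model-name", served_name]
    return command


# Example:
# build_vllm_command({"BACKEND_MODEL_NAME": "vllm-model"})
```

With BACKEND_MODEL_NAME set to vllm-model, requests that use "model": "vllm-model", as in the cURL call above, would be routed to the served model.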

Serving Models in NIM Mode

To run NIM, you need an API key issued from an account that can access NGC's NIM model registry. For how to obtain the key, refer to the following: NVIDIA Docs Hub : How to get NGC API Key

The NIM (NVIDIA Inference Microservice) mode is also similar to the command mode, but it must be run with an image that has an NVIDIA model server with NIM support built in. The NGC API key is also needed when the model is loaded. Assuming everything is ready, let's start the model service.

How to Run

On the start screen, select an empty model-type folder in which to cache the metadata received from NIM, and select NIM in the Inference Runtime Variant item.

[Figure: Model service start screen, mounting the model folder and selecting NIM]

Select the environment to serve. Here, we use ngc-nim:1.0.0-llama3.8b and allocate 8 CPU cores, 32 GiB of memory, and 15 fGPU of NVIDIA CUDA GPU as resources.

[Figure: Model service start screen, selecting the execution environment and allocating resources]

Finally, select the cluster size and set the environment variable HF_HOME to the default path /models. Then add the environment variable NGC_API_KEY and enter the issued key as its value. Once the input is complete, click the CREATE button to create the service.

[Figure: Model service start screen, selecting the cluster size, entering environment variables, and starting]

When using NIM, the first run may take some time because the model metadata is downloaded from the repository. You can follow the progress in the container logs of the service's routing session on the session page.
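Besides watching the logs, a script can simply wait until the endpoint starts answering. The following is a minimal sketch of such a polling loop. The function names, the timeout values, and the choice of /v1/models as the probe target are assumptions for this example, not part of Backend.AI.

```python
import time
import urllib.request
from typing import Callable


def models_endpoint_responds(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        url = base_url.rstrip("/") + "/v1/models"
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError, timeouts, refused connections
        return False


def wait_until_ready(probe: Callable[[], bool], timeout: float = 600.0,
                     interval: float = 10.0, sleep=time.sleep,
                     clock=time.monotonic) -> bool:
    """Call `probe` every `interval` seconds until it succeeds or `timeout` passes."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Example (requires a running service):
# wait_until_ready(lambda: models_endpoint_responds(
#     "https://nim-model-service.asia03.app.backend.ai"))
```

The sleep and clock parameters exist only so the loop can be exercised without real waiting.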

As with the command and vLLM modes, if the service has been successfully launched, the service status will change to HEALTHY. Using the endpoint address where the service is launched, send a request as follows and check the response.

Verifying the Service

from openai import OpenAI

# Point the OpenAI client at the model service endpoint.
client = OpenAI(
    base_url="https://nim-model-service.asia03.app.backend.ai/v1",
    api_key="$YOUR_NGC_API_KEY",  # placeholder: replace with your NGC API key
)

# Request a streamed chat completion from the served model.
completion = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[
        {"role": "user", "content": "Hello! How are you?"},
        {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
        {"role": "user", "content": "Can you write me a song?"},
    ],
    temperature=0.5,
    top_p=1,
    max_tokens=1024,
    stream=True,
)

# Print each piece of the response as it arrives; skip chunks without content.
for chunk in completion:
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

[Figure: Model service request result screen]

Conclusion

The Model Variant feature will be of great help to researchers and companies that want to provide real services with already trained models. Built on a powerful resource management system and support for various AI accelerators, including NVIDIA GPU, AMD ROCm, TPU, Graphcore IPU, Furiosa Warboy, Rebellions ATOM, and Hyperaccel LPU, Backend.AI now provides an integrated environment that goes beyond training models and makes it easy to deploy services. Try serving your desired AI model anytime with Backend.AI!
