Engineering
Sneak Peek: Backend.AI Model Service Preview
Kyujin Cho
Senior Software Engineer
May 30, 2023
Introduction
As super-sized AI models flood the market, the challenge is no longer just developing the models but also delivering them to users "well" and "efficiently." Before Large Language Models (LLMs), the computational demand of AI models was concentrated in training rather than inference, because running inference with an already-trained model required far less hardware than training it. Model deployers could often get enough inference power from the NPU of an end user's device, such as a smartphone. With the advent of LLMs, however, the tables have turned.
Take Meta's [OPT-175B](https://github.com/facebookresearch/metaseq) as an example: as its name implies, OPT-175B has 175 billion parameters and requires roughly 320 GB or more of GPU memory just to load them onto GPUs for inference. That is a huge jump from the roughly 4 GB that pre-LLM image-processing models used to require.
With this change in AI model behavior, efficiently managing service resources has become paramount to keeping your service running reliably. In this article, we'll preview Backend.AI's upcoming model service feature, Backend.AI Model Service, and show you how it will allow you to efficiently run your AI model from training to serving with a single infrastructure.
Backend.AI Model Service
Backend.AI Model Service is a model serving system that runs on top of the existing Backend.AI solution. It takes Backend.AI's proven container management technology and its container app delivery system, AppProxy¹, to the next level, enabling both AI training and model serving on a single infrastructure: no additional components need to be installed, only an upgrade of the existing Backend.AI deployment. It also supports auto-scaling, which automatically grows and shrinks the pool of inference sessions based on per-session GPU utilization, the number of API calls, or the time of day, so the AI resources used for inference can be managed effectively.
Inference Sessions
Inference sessions in Backend.AI are conceptually the same as existing training sessions. You can reuse an execution environment you have already been using for training, or deploy a dedicated execution environment just for inference. Inference sessions are volatile and stateless, so you can terminate one at any time if it is misbehaving. In that case, Backend.AI tries to restore the original state by creating a new inference session, while forwarding inference requests to the other live inference sessions to minimize downtime for the inference service.
Model storage
Models to be served through Backend.AI are managed as "model storage" units. Model storage consists of model files, code for model services, and model definition files.
Model definition file
The model definition file is where you define the information needed to run a model in the Backend.AI Model Service. It contains information about the model, the ports exposed by the model service, and the series of tasks that must be executed to start the model service. If the model service exposes a health-check endpoint that reports its own status, Backend.AI can use that information to take actions such as excluding unhealthy sessions from the service.
models:
- name: "KoAlpaca-5.8B-model"
  model_path: "/models/KoAlpaca-5.8B"
  service:
    pre_start_actions:
    - action: run_command
      args:
        command: ["pip3", "install", "-r", "/models/requirements.txt"]
    start_command:
      - uvicorn
      - --app-dir
      - /models
      - chatbot-api:app
      - --port
      - "8000"
      - --host
      - "0.0.0.0"
    port: 8000
    health_check:
      path: /health
      max_retries: 10
The example above is a model definition file containing the set of steps needed to run the KoAlpaca 5.8B model as a model service.
Tutorial: Model Service with Backend.AI Model Service
In this tutorial, we'll use Backend.AI to actually serve the KoAlpaca 5.8B model quantized to 8 bits.
Write the API server code
Write a simple API server to serve the model.
import os
from typing import Any, List

from fastapi import FastAPI, Response
from fastapi.responses import RedirectResponse, StreamingResponse, JSONResponse
from fastapi.staticfiles import StaticFiles
import numpy as np
from pydantic import BaseModel
import torch
from transformers import pipeline, AutoModelForCausalLM
import uvicorn

URL = "localhost:8000"
# Model path provided through the BACKEND_MODEL_PATH environment variable.
KOALPACA_MODEL = os.environ["BACKEND_MODEL_PATH"]

torch.set_printoptions(precision=6)

app = FastAPI()

# Load the model with 8-bit quantization to reduce GPU memory usage.
model = AutoModelForCausalLM.from_pretrained(
    KOALPACA_MODEL,
    device_map="auto",
    load_in_8bit=True,
)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=KOALPACA_MODEL,
)


class Message(BaseModel):
    role: str
    content: str


class ChatRequest(BaseModel):
    messages: List[Message]


# Fixed Korean base prompts ("맥락" = context, "명령어" = instruction) that describe
# KoAlpaca/ChatKoAlpaca and instruct the bot to answer briefly and politely.
BASE_CONTEXTS = [
    Message(role="맥락", content="KoAlpaca(코알파카)는 EleutherAI에서 개발한 Polyglot-ko 라는 한국어 모델을 기반으로, 자연어 처리 연구자 Beomi가 개발한 모델입니다."),
    Message(role="맥락", content="ChatKoAlpaca(챗코알파카)는 KoAlpaca를 채팅형으로 만든 것입니다."),
    Message(role="명령어", content="친절한 AI 챗봇인 ChatKoAlpaca 로서 답변을 합니다."),
    Message(role="명령어", content="인사에는 짧고 간단한 친절한 인사로 답하고, 아래 대화에 간단하고 짧게 답해주세요."),
]


def preprocess_messages(messages: List[Message]) -> List[Message]:
    ...


def flatten_messages(messages: List[Message]) -> str:
    ...


def postprocess(answer: List[Any]) -> str:
    ...


@app.post("/api/chat")
async def chat(req: ChatRequest) -> StreamingResponse:
    # Flatten the conversation into a single prompt and run text generation.
    messages = preprocess_messages(req.messages)
    conversation_history = flatten_messages(messages)
    ans = pipe(
        conversation_history,
        do_sample=True,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        return_full_text=False,
        eos_token_id=2,
    )
    msg = postprocess(ans)

    async def iterator():
        yield msg.strip().encode("utf-8")

    return StreamingResponse(iterator())


@app.get("/health")
async def health() -> Response:
    # Health-check endpoint referenced by the model definition file.
    return JSONResponse(content={"healthy": True})


@app.exception_handler(404)
async def custom_404_handler(_, __):
    return RedirectResponse("/404.html")


# Serve the static chatbot UI next to the API.
app.mount(
    "/",
    StaticFiles(directory=os.path.join(KOALPACA_MODEL, "..", "chatbot-ui"), html=True),
    name="html",
)
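The bodies of `preprocess_messages`, `flatten_messages`, and `postprocess` are omitted above. Purely as an illustration of what they are responsible for (this is not the original implementation), a minimal sketch could look like this:

```python
# Hypothetical sketch of the omitted helpers -- not the original implementation.
def preprocess_messages(messages: List[Message]) -> List[Message]:
    # Prepend the fixed base prompts to the user's conversation.
    return [*BASE_CONTEXTS, *messages]


def flatten_messages(messages: List[Message]) -> str:
    # Collapse the message list into a single "role: content" prompt string.
    return "\n".join(f"{m.role}: {m.content}" for m in messages)


def postprocess(answer: List[Any]) -> str:
    # A text-generation pipeline returns a list of dicts with a "generated_text" key.
    return answer[0]["generated_text"]
```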
Create a model definition file
Create a model definition file for your API server.
models:
- name: "KoAlpaca-5.8B-model"
  model_path: "/models/KoAlpaca-Polyglot-5.8B"
  service:
    pre_start_actions:
    - action: run_command
      args:
        command: ["pip3", "install", "-r", "/models/requirements.txt"]
    start_command:
      - uvicorn
      - --app-dir
      - /models
      - chatbot-api:app
      - --port
      - "8000"
      - --host
      - "0.0.0.0"
    port: 8000
    health_check:
      path: /health
      max_retries: 10
In an inference session of the model service, the model storage is always mounted under the `/models` path.
Prepare model storage
Add the model API server code you wrote, the model definition file, and the KoAlpaca model to your model storage.
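Assuming the file names used in this tutorial (the exact names are an assumption here and are up to you, as long as they match the paths referenced in the model definition file), the contents of the model storage might look roughly like this:

```
(model storage root, mounted at /models)
├── KoAlpaca-Polyglot-5.8B/   # model weights and tokenizer files
├── chatbot-api.py            # the FastAPI server written above
├── chatbot-ui/               # static files for the chatbot front end
├── model-definition.yaml     # the model definition file
└── requirements.txt          # dependencies installed by pre_start_actions
```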
Create a model service
With both the model files and the model definition file ready, you can now start the Backend.AI Model Service. A model service is created with the `backend.ai service create` command in the Backend.AI CLI. The arguments accepted by `service create` are almost identical to those of `backend.ai session create`: after the image to use, you pass the ID of the model storage and the number of inference sessions to create initially.
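Based on that description, the invocation takes roughly the following shape; the values below are placeholders, and the exact syntax for your release is available via `backend.ai service create --help`:

```
backend.ai service create <image-to-use> <model-storage-id> <initial-session-count>
```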
Using `backend.ai service info`, you can check the status of the model service and the inference sessions belonging to it. You can see that one inference session has been successfully created.
Use the Inference API
You can use the `backend.ai service get-endpoint` command to see the inference endpoint of a created model service. The endpoint keeps a unique, unchanging value from the time the model service is created until it is removed. If multiple inference sessions belong to a model service, AppProxy distributes requests across them.
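Given the API server written earlier, a request to the endpoint returned by `get-endpoint` might look like the sketch below. The endpoint URL is a placeholder and the role string is only illustrative; the request body simply follows the `ChatRequest` schema defined in the server code.

```python
# Minimal client sketch; replace ENDPOINT with the value printed by
# `backend.ai service get-endpoint`.
import requests

ENDPOINT = "https://<your-model-service-endpoint>"  # placeholder

response = requests.post(
    f"{ENDPOINT}/api/chat",
    json={"messages": [{"role": "질문", "content": "안녕하세요, 자기소개를 해주세요."}]},
)
print(response.content.decode("utf-8"))
```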
Restricting access to the Inference API
If you want to restrict who can access the inference API, you can enable authentication for it by starting the model service without the `--public` option. Authentication tokens can be issued with the `backend.ai service generate-token` command.
Scaling inference sessions
The `backend.ai service scale` command allows you to change the number of inference sessions belonging to a model service.
Closing thoughts
So far, we've looked at Backend.AI Model Service and walked through actually deploying a model with it. Backend.AI Model Service is targeted for general availability in Backend.AI 23.03. We're working hard to make the Model Service feature publicly available soon, so stay tuned.
---
This post is automatically translated from Korean
Footnotes
1. Available from Backend.AI Enterprise.