Tag : Artificial Intelligence General
Fine-tuning a Domain-Adaptive Language Model with FastTrack
By Yonggeun Kwon

Introduction
This article explains how to train and evaluate a language model specialized in supply chain and trade-related domains using Backend.AI's MLOps platform, FastTrack. For this language model, we used the gemma-2-2b-it model as the base model, which was continually pretrained with supply chain and trade domain datasets. To train a model specialized in the Question Answering task, domain datasets collected and processed from the web were converted into a Q/A task format, consisting of trainable questions and answers, depending on the use case.
Developing AI involves stages such as data preprocessing, training, validation, deployment, and inference. Using Lablup's FastTrack, each of these stages can be configured into a single pipeline, allowing for easy customization, such as skipping specific stages or adjusting resource allocation per stage according to the pipeline configuration.
Concept of Domain Adaptation
Before diving into model training, a process called Domain Adaptation is necessary. To briefly explain for those unfamiliar, Domain Adaptation refers to the process of refining a pretrained model to make it suitable for a specific domain. Most general-purpose language models we encounter today are not designed to possess expertise in specific fields. These models are typically trained using datasets from general domains to predict the next token effectively and then fine-tuned to fit overall usage directions.
However, when creating a model for use in a specialized domain, training with general datasets is insufficient. For instance, a model trained on a general domain can understand contexts like "This movie was amazing," but it may struggle with sentences in the legal domain, such as "The court ordered the seizure of the debtor's assets," because it has never learned the terms and expressions specific to that domain. Similarly, a general-purpose model alone may not handle a domain-specific Q/A task well. To handle a Q/A task properly, the language model must be fine-tuned on domain-specific datasets formatted for the Q/A task. This fine-tuning allows the model to better understand the nuances of the task and respond effectively to domain-specific questions from users.
This article focuses on the process of developing a model specialized in Supply Chain Management (SCM) and trade domains. As shown in the above image, there is a significant difference between general domain terms like "movie" or "travel" and SCM-specific terms like "air waybill" or "payment manager." To bridge this gap, our goal today is to adjust the model using datasets from SCM and trade domains to enhance the model's understanding of these domains and accurately capture the context.
In summary, Domain Adaptation is essentially a process of overcoming the gaps between different domains, enabling the model to perform better in new contexts.
Train model from scratch vs DAPT
So, why not train the model from scratch using datasets from the specific domain? While this is possible, it comes with several limitations. Training a model from scratch with domain-specific datasets requires extensive data and training because the model lacks both general domain knowledge and domain-specific expertise. Collecting datasets for general domain deep learning is already challenging, but gathering high-quality, domain-specific data is even more difficult. Even if data is collected, preprocessing it to fit model training can be time-consuming and costly. Therefore, training a model from scratch is more suitable for companies with abundant domain-specific data and resources.

What if you want to develop a domain-adaptive model but don't have access to vast datasets or sufficient resources? In such cases, Domain-Adaptive Pre-Training (DAPT) can be an effective approach. DAPT involves continual pretraining of a model that has already been extensively trained on general domains with domain-specific datasets to develop a specialized model. Since this method builds upon a model that already possesses knowledge of general domains, it requires relatively less cost and fewer datasets compared to training a model from scratch.
Development environment Setup
- Package Installation
```
pip install bitsandbytes==0.43.2
pip install deepspeed==0.14.4
pip install transformers==4.43.3
pip install accelerate==0.33.0
pip install flash-attn==1.0.5
pip install xforms==0.1.0
pip install datasets==2.20.0
pip install wandb
pip install evaluate==0.4.2
pip install vertexai==1.60.0
pip install peft==0.12.0
pip install tokenizers==0.19.1
pip install sentencepiece==0.2.0
pip install trl==0.9.6
```
- Import Modules
```python
import os
import json
from datasets import load_from_disk, Dataset, load_dataset
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, Gemma2ForCausalLM, BitsAndBytesConfig, pipeline, TrainingArguments
from peft import LoraConfig, get_peft_model
import transformers
from trl import SFTTrainer
from dotenv import load_dotenv
import wandb
from huggingface_hub import login
```
Dataset preparation
The preparation of datasets should vary depending on the purpose of fine-tuning. In this article, since our goal is to train a model that can effectively respond to questions related to the trade domain, we decided to use datasets that we collected ourselves through web crawling. The datasets are categorized into three types: trade certification exam datasets, trade term-definition datasets, and trade lecture script datasets.
- Trade Certification Exam Data Set
```
질문: 다음 중 우리나라 대외무역법의 성격에 대한 설명으로 거리가 먼 것을 고르시오.
1. 우리나라에서 성립되고 이행되는 대외무역행위는 기본적으로 대외무역법을 적용한다.
2. 타 법에서 명시적으로 대외무역법의 적용을 배제하면 당해 법은 특별법으로서 대외무역법보다 우선 적용된다.
3. 대외무역법은 국내법으로서 국민의 국내 경제생활에 적용되는 법률이기 때문에 외국인이 국내에서 행하는 무역행위는 그 적용 대상이 아니다.
4. 관계 행정기관의 장은 해당 법률에 의한 물품의 수출·수입 요령 그 시행일 전에 지식경제부 장관이 통합하여 공고할 수 있도록 제출하여야 한다.
정답: 대외무역법은 국내법으로서 국민의 국내 경제생활에 적용되는 법률이기 때문에 외국인이 국내에서 행하는 무역행위는 그 적용 대상이 아니다.
질문: ...
```
- Trade Terms Definition Data Set
{ "term": "(계약 등을) 완전 무효화하다, 백지화하다, (처음부터) 없었던 것으로 하다(Rescind)", "description": "계약을 파기, 무효화, 철회, 취소하는 것; 그렇지 않았음에도 불구하고 계약을 시작부터 무효인 것으로 선언하고 종결짓는 것." }
- Trade Lecture Script Dataset
예전에는 전자상거래 셀러가 엑셀에다가 입력을 해서 수출신고 데이터를 업로드 해서 생성을 했잖아요 그리고 대량으로 전송하는 셀러는 api를 통해서 신고를 했습니다 그런데 그 수출신고 정보의 원천정보를 뭐냐면 쇼핑몰에서 제공하는 판매 주문정보입니다 그래서 그 쇼핑몰에 직접 저희가 연계를 해서 판매 주문 정보를 가져올 수 있게끔 새 서비스를 만들었어요 그래서 API 연계된 쇼핑몰들이 있는데 그게 현재 5개가 연결되어 있는데 쇼피 쇼피파이 라자다 라쿠텐 q10이 있고요 아마존하고 위치도 연계 예정에 있습니다 그래서 셀러는 ...
To create a model suitable for Q/A tasks, the datasets need to be converted into a question-and-answer format. The first dataset, the trade certification exam dataset, and the second dataset, the trade term-definition dataset, can be converted using simple code. However, upon examining the third dataset, the trade lecture script dataset, it appears challenging to directly convert the conversational data. In this case, an approach can be employed that uses large language models (LLMs) to extract Q/A pairs from the conversational scripts. The prompts and their results are as follows.
Prompt
```
summary: {summary}
Instruction: 다음 summary를 기반으로 질문과 해당 답변을 추출하세요. 출력 형식은 JSON으로, 각 질문과 답변을 객체로 묶어 'qa_pairs'라는 배열 안에 넣어야 합니다. 'questions' key에는 요약에서 도출된 질문 목록이 포함되어야 하며, 'answers' key에는 각 질문에 대한 해당 답변 목록이 포함되어야 합니다:
출력 예시: {{"qa_pairs": [{{"question": "Question1", "answer": "Answer1"}},{{"question": "Question2", "answer": "Answer2"}},{{"question": "Question3", "answer": "Answer3"}},]...}}
Output:
```
Output
{"qa_pairs": [ {"question": "전자상거래 무역 플랫폼은 어떤 서비스를 제공하나요?", "answer": "판매 주문 관리, 창고 관리, 수출입 통관, 배송 관리 등 전자상거래 업무를 위한 다양한 서비스를 제공합니다"}, {"question": "쇼핑몰과의 API 연계를 통해 무엇을 간소화할 수 있나요?", "answer": "판매 주문 정보를 자동으로 가져와 수출신고를 간소화할 수 있습니다"}, {"question": "창고 관리 기능은 어떤 것을 관리할 수 있나요?", "answer": "자가 창고, 위촉 창고, 해외 풀필먼트 센터 등을 관리할 수 있습니다"}, {"question": "수출입 통관 서비스는 어떤 기능을 제공하나요?", "answer": "미국 세관과 연동된 수출입 통관 서비스를 통해 미국 수출 시 수입 신고를 간편하게 처리할 수 있습니다"}, {"question": "배송 관리 기능은 어떤 서비스를 지원하나요?", "answer": "우체국 EMS, DHL 등 개별 배송과 함께 해외 현지 라스트 마일 배송까지 지원합니다"} ] }
It looks like we are now ready to convert each dataset into a Q/A dataset using simple code. Below is the code that demonstrates how to convert each dataset into Q/A format.
```python
import os
import json
import re
from datasets import Dataset, concatenate_datasets, load_from_disk

def replace_dot_number(text):
    result = re.sub(r'\.(\d+)\.', r'. \1.', text)
    return result

def read_json(path):
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)

def write_json(data, path):
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False)

def dataset_maker(data: list) -> Dataset:
    return Dataset.from_list(data)

def save_dataset(dataset, save_path):
    dataset.save_to_disk(save_path)

def exam_qa_formatter():
    data = []
    root = 'dataset/exam_data'
    for file in sorted(os.listdir(root)):
        file_path = os.path.join(root, file)
        content = read_json(file_path)['fixed_text']
        question_list = content.split('질문:')[1:]
        for question in question_list:
            try:
                question_and_options = replace_dot_number(question.split('정답:')[0]).strip()
                answer = question.split('정답:')[1].strip()
                data.append({"context": replace_dot_number(question), "question": question_and_options, "answer": answer})
            except Exception as e:
                pass
    return data

def description_to_term_formattter(kor_term, eng_term, description):
    context = f"{kor_term}: {description}"
    question = f"설명: '{description}' 이 설명에 해당하는 무역 용어는 무엇인가요?"
    answer = kor_term if eng_term is None else f"{kor_term}, {eng_term}"
    return context, question, answer

def term_to_description(kor_term, eng_term, description):
    context = f"{kor_term}: {description}"
    question = f"'{kor_term}({eng_term})' 이라는 무역 용어는 어떤 의미인가요?" if eng_term is not None else f"'{kor_term}' 이라는 무역 용어는 어떤 의미인가요?"
    answer = description
    return context, question, answer

def term_qa_formatter():
    data = []
    root = 'dataset/term_data'
    for file in os.listdir(root):
        file_path = os.path.join(root, file)
        term_set = read_json(file_path)
        if file == 'terms_data_2.json':
            term_set = [item for sublist in term_set for item in sublist]
        for pair in term_set:
            eng_term = pair.get('eng_term', None)
            if 'term' in pair.keys():
                kor_term = pair['term']
            else:
                kor_term = pair['kor_term']
            description = pair['description']
            context_1, question_1, answer_1 = description_to_term_formattter(kor_term, eng_term, description)
            context_2, question_2, answer_2 = term_to_description(kor_term, eng_term, description)
            data_1 = {"context": context_1, "question": question_1, "answer": answer_1}
            data_2 = {"context": context_2, "question": question_2, "answer": answer_2}
            data.append(data_1)
            data.append(data_2)
    return data

def transcript_qa_formatter():
    data = []
    root = 'dataset/transcript_data/success'
    for file in sorted(os.listdir(root)):
        file_path = os.path.join(root, file)
        for line in open(file_path):
            line = json.loads(line)
            context = line['context']
            output = line['json_output']
            qa_pairs = json.loads(output)['qa_pairs']
            for pair in qa_pairs:
                question = pair['question']
                answer = pair['answer']
                if type(answer) == list:
                    answer = answer[0]
                data.append({"context": context, "question": question, "answer": answer})
    return data
```
```
###### Term dataset
{'context': 'APEC 경제위원회(Economic Committee (EC)): 개별위원회나 실무그룹이 추진하기 어려운 여러분야에 걸친 이슈에 대한 분석적 연구작업을 수행하기 위해 결성된 APEC 기구,',
 'question': "설명: '개별위원회나 실무그룹이 추진하기 어려운 여러분야에 걸친 이슈에 대한 분석적 연구작업을 수행하기 위해 결성된 APEC 기구,' 이 설명에 해당하는 무역 용어는 무엇인가요?",
 'answer': 'APEC 경제위원회(Economic Committee (EC))'}

###### Transcript dataset
{'context': '수입 신고는 일반적으로 입항 후에 하는 것이 원칙이며, 보세 구역에서 5부 10장을 작성하여 신고합니다',
 'question': '수입 신고는 언제 하는 것이 원칙인가요?',
 'answer': '수입 신고는 일반적으로 입항 후에 하는 것이 원칙입니다.'}

###### Exam dataset
{'context': ' 다음 중 우리나라 대외무역법의 성격에 대한 설명으로 거리가 먼 것을 고르시오. 1. 우리나라에서 성립되고 이행되는 대외무역행위는 기본적으로 대외무역법을 적용한다. 2. 타 법에서 명시적으로 대외무역법의 적용을 배제하면 당해 법은 특별법으로서 대외무역법보다 우선 적용된다. 3. 대외무역법은 국내법으로서 국민의 국내 경제생활에 적용되는 법률이기 때문에 외국인이 국내에서 행하는 무역행위는 그 적용 대상이 아니다. 4. 관계 행정기관의 장은 해당 법률에 의한 물품의 수출·수입 요령 그 시행일 전에 지식경제부 장관이 통합하여 공고할 수 있도록 제출하여야 한다.정답: 대외무역법은 국내법으로서 국민의 국내 경제생활에 적용되는 법률이기 때문에 외국인이 국내에서 행하는 무역행위는 그 적용 대상이 아니다.',
 'question': '다음 중 우리나라 대외무역법의 성격에 대한 설명으로 거리가 먼 것을 고르시오. 1. 우리나라에서 성립되고 이행되는 대외무역행위는 기본적으로 대외무역법을 적용한다. 2. 타 법에서 명시적으로 대외무역법의 적용을 배제하면 당해 법은 특별법으로서 대외무역법보다 우선 적용된다. 3. 대외무역법은 국내법으로서 국민의 국내 경제생활에 적용되는 법률이기 때문에 외국인이 국내에서 행하는 무역행위는 그 적용 대상이 아니다. 4. 관계 행정기관의 장은 해당 법률에 의한 물품의 수출·수입 요령 그 시행일 전에 지식경제부 장관이 통합하여 공고할 수 있도록 제출하여야 한다.',
 'answer': '대외무역법은 국내법으로서 국민의 국내 경제생활에 적용되는 법률이기 때문에 외국인이 국내에서 행하는 무역행위는 그 적용 대상이 아니다.'}
```
```
# Exam dataset
Dataset({
    features: ['context', 'question', 'answer'],
    num_rows: 1430
})

# Term dataset
Dataset({
    features: ['context', 'question', 'answer'],
    num_rows: 15678
})

# Transcript dataset
Dataset({
    features: ['context', 'question', 'answer'],
    num_rows: 8885
})

# Concatenated dataset
Dataset({
    features: ['context', 'question', 'answer'],
    num_rows: 25993
})
```
The combined dataset (training dataset) with the Q/A format is as above. About 26,000 Q/A pairs are expected to be used for training.
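The three formatter outputs above can be merged into that single training set with the helpers defined earlier; here is a minimal sketch (the save path and the 10% test split are illustrative assumptions, not values stated in the article):

```python
# Minimal sketch: build, concatenate, and split the three Q/A datasets.
# Uses the formatter functions and helpers defined above; the save path and
# the 0.1 test ratio are illustrative assumptions.
from datasets import concatenate_datasets

exam_ds = dataset_maker(exam_qa_formatter())
term_ds = dataset_maker(term_qa_formatter())
transcript_ds = dataset_maker(transcript_qa_formatter())

combined = concatenate_datasets([exam_ds, term_ds, transcript_ds])
save_dataset(combined, 'dataset/combined_qa')          # illustrative path

data = combined.train_test_split(test_size=0.1)        # later used as data['train'] / data['test']
```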
Now, the dataset for fine-tuning is ready. Let’s check how this dataset is actually fed into the model.
```
<bos><start_of_turn>user
Write a hello world program<end_of_turn>
<start_of_turn>model
```
On the Hugging Face website, the model card for gemma-2-2b-it documents the chat template and the prompt format the model expects. In other words, to ask gemma a question, you need to build the prompt in a format the model can understand.
Each turn begins with the <start_of_turn> marker and ends with the <end_of_turn> marker, and the speaker of each turn is specified as either user or model. Therefore, when asking the model a question, the prompt should follow this format.
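For reference, the tokenizer that ships with gemma-2-2b-it can render this template for you through Hugging Face's apply_chat_template. The sketch below is illustrative only: the question string is a placeholder, and downloading the gated gemma-2-2b-it tokenizer may require a Hugging Face access token.

```python
# Minimal sketch: let the tokenizer render the Gemma chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")  # may require an HF access token
messages = [{"role": "user", "content": "다음 질문에 대답해주세요: ..."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# Produces roughly:
# <bos><start_of_turn>user
# 다음 질문에 대답해주세요: ...<end_of_turn>
# <start_of_turn>model
```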
This article trains the model on the Q/A dataset, so the approach is to teach the model "for this type of question, respond in this way." Following the chat template above, the formatting function can be written as below. Note that the model may try to keep generating content beyond the end-of-turn delimiter; to make it produce only the answer and then end its turn, an <eos> token is appended even though the official chat template does not include one explicitly.

```python
def formatting_func(example):
    prompt_list = []
    for i in range(len(example['question'])):
        prompt_list.append("""<bos><start_of_turn>user
다음 질문에 대답해주세요:
{}<end_of_turn>
<start_of_turn>model
{}<end_of_turn><eos>""".format(example['question'][i], example['answer'][i]))
    return prompt_list
```
```
<start_of_turn>user
다음 질문에 대답해주세요:
'(관세)감축률(Reduction Rate)' 이라는 무역 용어는 어떤 의미인가요?<end_of_turn>
<start_of_turn>model
관세를 감축하는 정도를 말함. 예를 들어 200%p에 관세감축률이 50%를 적용하면 감축 후 관세는 100%p가 됨. 극단적인 경우로 관세감축률이 100%이면 모든 관세는 감축 후에는 0%p가 됨.<end_of_turn>
```

In actual training, examples like the one above are used as input. Now, the dataset preparation for training is complete.

Training

The training code is very simple. We use SFTTrainer, and as the base model we use the gemma-2-2b-it model, which has been continually pretrained on SCM and trade datasets.

```python
model_id = "google/gemma-2-2b-it"
output_dir = 'QA_finetune/gemma-2-2b-it-lora128'

tokenizer = AutoTokenizer.from_pretrained(model_id, token=access_token)

model = AutoModelForCausalLM.from_pretrained(
    # "google/gemma-2-2b-it",
    "yonggeun/gemma-2-2b-it-lora128-merged",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    token=access_token,
    attn_implementation="eager",
    cache_dir="./models/models",
)

def formatting_func(example):
    prompt_list = []
    for i in range(len(example['question'])):
        prompt_list.append("""<bos><start_of_turn>user
다음 질문에 대답해주세요:
{}<end_of_turn>
<start_of_turn>model
{}<end_of_turn><eos>""".format(example['question'][i], example['answer'][i]))
    return prompt_list

def train(data):
    valid_set = data["test"]
    valid_set.save_to_disk('QA_finetune/valid_set/gemma-2-2b-it-lora128')

    lora_config = LoraConfig(
        r=256,
        lora_alpha=32,
        lora_dropout=0.05,
        bias="none",
        target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
        task_type="CAUSAL_LM",
    )

    training_args = TrainingArguments(
        per_device_train_batch_size=2,
        warmup_steps=2,
        logging_steps=1,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        save_steps=100,
        fp16=False,
        bf16=True,
        output_dir=output_dir,
        push_to_hub=True,
        report_to="wandb",
    )

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=data['train'],
        args=training_args,
        formatting_func=formatting_func,
        peft_config=lora_config,
        max_seq_length=max_length,
        packing=False,
    )

    model.config.use_cache = False

    print("Training...")
    trainer.train()
    print("Training done!")
```
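Before moving on to evaluation, it can be reassuring to spot-check the fine-tuned model on a single question. The sketch below is illustrative only: it reuses the model and tokenizer objects from the script above, and the sample question and generation parameters are assumptions rather than values from the article.

```python
# Minimal sketch: spot-check the fine-tuned model on one illustrative question.
# `model` and `tokenizer` are the objects from the training script above.
question = "'선하증권(Bill of Lading)' 이라는 무역 용어는 어떤 의미인가요?"  # illustrative question
prompt = ("<start_of_turn>user\n"
          f"다음 질문에 대답해주세요:\n{question}<end_of_turn>\n"
          "<start_of_turn>model\n")  # the tokenizer adds <bos> automatically

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(answer)
```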
Evaluation
Once training completes successfully, it is essential to evaluate the model's performance. Because this article evaluates Question Answering performance in a specific domain, it calls for different metrics from those typically used to benchmark general-purpose models. Here, the model was evaluated using SemScore and Truthfulness.
SemScore: An evaluation method based on the semantic textual similarity between the target response and the model's response (SemScore). A rough sketch of this computation follows below.
Evaluating Truthfulness: This method measures truthfulness on a scale of 1 to 5 by providing the model's response and the target answer to an LLM. (Truthfulness)
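To make the SemScore metric concrete, below is a minimal sketch of a SemScore-style computation. It assumes the sentence-transformers library and the all-mpnet-base-v2 encoder used in the original SemScore paper; it is not necessarily the exact setup behind the numbers reported later in this article.

```python
# Minimal sketch of a SemScore-style metric: cosine similarity between sentence
# embeddings of the target answers and the model's answers (encoder choice is an assumption).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def semscore(target_answers, model_answers):
    target_emb = encoder.encode(target_answers, convert_to_tensor=True)
    model_emb = encoder.encode(model_answers, convert_to_tensor=True)
    # Similarity of each (target, model) pair, averaged over the evaluation set.
    pairwise = util.cos_sim(target_emb, model_emb).diagonal()
    return pairwise.mean().item()
```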
FastTrack pipeline
Now, let’s create a pipeline in FastTrack that will be used for model training. A pipeline is a unit of work used in FastTrack. Each pipeline can be represented as a collection of tasks, which are the smallest executable units. Multiple tasks within a single pipeline can have dependencies on each other, and their sequential execution is ensured based on these dependencies.
Create Pipeline
In the image above, find the blue '+' button to create a new pipeline.
When creating a pipeline, you can choose the pipeline’s name and description, the location of the data repository to be used, and the environment variables that will be commonly applied in the pipeline. After entering the necessary information, click the "Save" button at the bottom to create the pipeline.
Drag and create task
Once a new pipeline is created, you can add a new task to the task template. Click on the "Custom Task" and drag it into the workspace below to create a new task.
Enter information
When creating a task, enter the information required for its execution, as shown above. Write a clear task name and description, and choose whether the task runs on a single node or multiple nodes. Since the training in this article runs on a single node, we select single node.
Next, write the command. This is what the session actually executes when the task runs. Make sure the directory of the mounted V-folder is specified correctly so that the script runs without errors. Most of the packages required for training are already installed in the session, but if additional packages are needed or there are version conflicts, you may have to install or reinstall them. In that case, list the required packages in a requirements.txt file, install them first, and then run the remaining scripts.
Resource configuration
Next are the settings for the session, resources, and V-folder.
Although the code in this article is written based on PyTorch, you can also choose other environments like TensorFlow, Triton server, etc.
One of the advantages of FastTrack is its ability to utilize resources as efficiently as possible. Even within a single resource group, resources can be divided among multiple sessions, maximizing the resource utilization rate.
For dataset preparation, GPU computation is not required, so it is fine not to allocate GPU resources. This lets you run that step with minimal resources and hand the GPUs to other sessions in the meantime, preventing them from sitting idle. Furthermore, when parallel model training is needed, for example when 10 GPUs are available and each training run requires 5 GPUs, two models can be trained at the same time. This approach reduces wasted resources and shortens training time.
Select the V-folder where the prepared dataset and training code are correctly located.
Duplicate or delete task
By clicking the meatball menu icon (⋯) at the top right corner of the task block, you can duplicate or delete the created task.
In FastTrack, you can define the execution order of tasks by adding dependencies between them. A task can also be set to run only after several other tasks have completed; in that case, it will not start until all of its dependencies have finished. The completed example is shown above. In this article, we proceed in the order of dataset preparation, fine-tuning, and evaluation.
If each task is defined correctly, click "Run" to execute the pipeline.
On the left side of the FastTrack screen, you can see the pipelines you created. By clicking on them, you can monitor the currently running tasks and previously executed tasks in the pipeline task session.
Monitoring jobs
You can monitor the tasks through a screen like the one above. Each task proceeds in the specified order; once a previous task is completed, resources are allocated to start the session for the next task, and when the task is done, the session is terminated. There is also an option to skip tasks if needed. For example, in the image above, you can see that the fine-tuning task is running after skipping the dataset preparation task.
Skipped tasks are shown in pink, running tasks are in light blue, and tasks scheduled to run are in yellow.
Log checking
By clicking the blue button next to each task's name, highlighted with a red square, you can check the logs of each task. This allows you to directly monitor the training progress. The logs appear the same as they would in a terminal, as shown in the screen above, allowing you to verify that the training is progressing correctly.
Once the pipeline execution is successfully completed, you can check the results. In this document, the evaluation results are plotted and saved as /home/work/XaaS/train/QA_finetune/truthfulness_result.png.
(Backend.AI's V-folder has a default directory structure of /home/work/~.)
After training is complete, the result image is generated at the specified path.
Result checking
As shown above, you can confirm that the pipeline ran successfully from the status indicator to the left of each task name.
Result
Now, let’s compare the results of the fine-tuned model with the base gemma-2-2b-it model.
- SemScore (semantic textual similarity between the target response and the model response; 1.00 is the best)

| Base Model | Trained Model |
|------------|---------------|
| 0.62       | 0.77          |
The SemScore of the trained model increased from 0.62 to 0.77. This indicates that the fine-tuned model generates responses that are semantically much closer to the target answers, making its output more consistent and reliable in this domain.
- Truthfulness: The trained model shows more high-score cases and fewer low-score cases. Low scores (1 and 2 points) decreased from 1,111 to 777, while high scores (4 and 5 points) increased from 108 to 376. This indicates that the model has become better at producing domain information that is closer to the truth, showing that the training was effective.
Truthfulness result
Conclusion
In this article, we built a pipeline to train a model specialized in a specific domain using FastTrack, the MLOps platform of Backend.AI.
Even though we used only some of FastTrack's features, it allowed us to manage resources flexibly, configure tasks freely, shorten training time, and raise resource utilization. We were also able to train models stably in independent execution environments and, through the execution information of pipeline jobs, track resource usage and execution counts for each pipeline during training.

Beyond what was covered in this article, FastTrack supports a variety of additional features such as scheduling and parallel model training. For more information about these features, refer to the blog posts by Kang Ji-hyun and Kang Jung-seok linked below.
26 September 2024
Model Variant: Easily Serving Various Model Services
By Jihyun Kang

Introduction
Imagine a scenario where you need to train an AI for research purposes and produce results. Our job would simply be to wait for the AI to correctly learn the data we've taught it. However, if we assume we're creating a service that 'utilizes' AI, things get more complicated. Every factor becomes a concern, from how to apply various models to the system to what criteria to use for scaling under load conditions. We can't carelessly modify the production environment where users exist to get answers to these concerns. If an accident occurs while expanding or reducing the production environment, terrible things could happen. If something terrible does happen, we'll need time to recover from it, and we can't expect the same patience from consumers using our service as we would from researchers waiting for model training. Besides engineering difficulties, there are also cost challenges. Obviously, there's a cost to serving models, and users are incurring expenses even at the moment of training models as resources are being consumed.
However, there's no need to worry. Many well-made models already exist in the world, and in many cases, it's sufficient for us to take these models and serve them. As those interested in our solution may already know, Backend.AI already supports various features you need when serving models. It's possible to increase or decrease services according to traffic, and to serve various models tailored to users' preferences.
But the Backend.AI team doesn't stop here. We have enhanced the model service provided from Backend.AI version 23.09 and improved it to easily serve various models. Through this post, we'll explore how to serve various models easily and conveniently.
This post introduces features that allow you to serve various types of models more conveniently. Since we've already given an explanation about model service when releasing the 23.09 version update, we'll skip the detailed explanation. If you're unfamiliar with Backend.AI's model service, we recommend reading the following post first: Backend.AI Model Service Preview
Existing Method
| Requirement | Existing Method | Model Variant |
|-------------|-----------------|---------------|
| Writing model definition file (model-definition.yaml) | O | X |
| Uploading model definition file to model folder | O | X |
| Model metadata needed | O | △ (Some can be received automatically) |
Backend.AI's model service required a model definition file (model-definition.yaml) containing, in a set format, the commands to execute when serving the model, in addition to the model metadata needed to run it. The service was executed in the following order: write the model definition file, upload it to a model-type folder so it can be read, and mount that model folder when starting the model service. An API server would then be launched that automatically forwards the end user's input to the model according to the model definition file and returns the response. However, this method had the disadvantage of having to access the file every time the model definition file needed to be modified, and because the model path was fixed in the model definition file, it was cumbersome to write a different model definition file each time the model changed.
The Model Variant introduced this time is a feature that allows you to serve models immediately by inputting a few configuration values or without any input at all, using only model metadata without a model definition file. Model Variant supports command, vLLM, and NIM (NVIDIA Inference Microservice) methods. The methods of serving and verifying model service execution are as follows.
Basically, model service requires metadata of the model to be served. Download the model you want to serve from Hugging Face, where you can easily access model metadata. In this example, we used the Llama-2-7b-hf model and Calm3-22b-chat model from Hugging Face. For how to upload model metadata to the model folder, refer to the "Preparing Model Storage" section in the previous post.
Automatically Serving Model from Built Image (Command Method)
The first method, the command method, bakes into the execution image the command that the model definition file would otherwise contain for serving the model. The command to execute is specified in the image's CMD, the image is built, and it then runs as-is when the model is actually served, without any further input. The command method does not support a health check that verifies whether the service is running properly, so it is better suited for quickly standing up and inspecting a prototype service than for large-scale serving. The execution method is as follows:
- On the start screen, select `Llama-2-7b-hf` in the Model Storage To Mount item to mount the model folder containing the model metadata for the model service to be served, and select Predefined Image Command in the Inference Runtime Variant item.
- Activate the Open To Public switch button if you want to provide a model service that is accessible without a separate token.
- Select the environment to serve. Here, we use `vllm:0.5.0` and allocate 4 CPU cores, 16 GiB of memory, and 10 fGPU of NVIDIA CUDA GPU as resources.
- Finally, select the cluster size and click the start button. The cluster size is set to single node, single container.

If the service has been successfully launched, the service status will change to `HEALTHY` and the endpoint address will appear.

Verifying the Service

If the service has been launched normally, check the model name served by the service with the `cURL` command:

```
curl https://cmd-model-service.asia03.app.backend.ai/v1/models \
  -H "Content-Type: application/json"
```

Now, let's send input to the service with the `cURL` command and check the response. For model services run with CMD, the model name is already defined in the image, so after checking the model name, you must pass that name as the value of the `model` key when sending a request.

```
curl https://cmd-model-service.asia03.app.backend.ai/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "image-model",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'
```
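If you prefer Python to cURL, the same completion request can be sent with the requests library; the endpoint, model name, and payload below simply mirror the cURL example above.

```python
# Minimal sketch: the same /v1/completions request as the cURL example, sent from Python.
import requests

resp = requests.post(
    "https://cmd-model-service.asia03.app.backend.ai/v1/completions",
    headers={"Content-Type": "application/json"},
    json={
        "model": "image-model",          # the model name reported by /v1/models
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0,
    },
    timeout=60,
)
print(resp.json())
```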
Serving Models in vLLM Mode
The vLLM mode is similar to the command method introduced earlier, but various options entered when running vLLM can be written as environment variables. The execution method is as follows:
How to Run
- On the start screen, mount the model folder for the model service to be served and select `vLLM` in the Inference Runtime Variant item.
- Select the environment to serve. As with the command method explained earlier, select `vllm:0.5.0`; although you could allocate the same resources, this time we'll allocate 16 CPU cores, 64 GiB of memory, and 10 fGPU of NVIDIA CUDA GPU.
- Finally, select the cluster size and enter the environment variable `BACKEND_MODEL_NAME`. This value corresponds to vLLM's `--model-name` option and becomes the model value the user specifies when sending a request to the service. Once the input is complete, click the `START` button to create the service.

Likewise, if the service has been successfully launched, the service status will change to `HEALTHY`, and the endpoint address where the service is launched will appear.

Verifying the Service

Let's send input to the service with the `cURL` command and check the response. This time, pass the `BACKEND_MODEL_NAME` value you set earlier as the `model` value:

```
curl https://vllm-calm3-22b-chat.asia03.app.backend.ai/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm-model",
    "prompt": "初めて会う日本人ビジネスマンに渡す最高の挨拶は何でしょうか?",
    "max_tokens": 200,
    "temperature": 0
  }'
```
Serving Models in NIM Mode
To run NIM, you need an API key issued from an account that can access NGC's NIM model registry. For how to obtain the key value, please refer to the following content: NVIDIA Docs Hub : How to get NGC API Key
The NIM (NVIDIA Inference Microservice) mode is also similar to the command mode, but it must be run with an image that has NVIDIA's NIM-supporting model server built-in. Also, when loading the model, the NGC API key value is needed. Assuming everything is ready, let's start the model service.
How to Run
- On the start screen, select an empty model type folder to cache the metadata that will be received from NIM, and select `NIM` in the Inference Runtime Variant item.
- Select the environment to serve. Here, we use `ngc-nim:1.0.0-llama3.8b` and allocate 8 CPU cores, 32 GiB of memory, and 15 fGPU of NVIDIA CUDA GPU as resources.
- Finally, select the cluster size, enter the default path `/models` as the value of the environment variable `HF_HOME`, then add `NGC_API_KEY` and input the issued key value. Once the input is complete, click the `CREATE` button to create the service.

When using NIM, the first run may take some time because the model metadata is fetched from the repository. You can check the progress by viewing the container logs of the routing session on the session page.

Like the command and vLLM modes, once the service has been successfully launched, the service status will change to `HEALTHY`. Using the endpoint address where the service is launched, send a request as follows and check the response.

Verifying the Service

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://nim-model-service.asia03.app.backend.ai/v1",
    api_key="$YOUR_NGC_API_KEY"
)

completion = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[
        {"role": "user", "content": "Hello! How are you?"},
        {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
        {"role": "user", "content": "Can you write me a song?"}
    ],
    temperature=0.5,
    top_p=1,
    max_tokens=1024,
    stream=True
)

for chunk in completion:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```
Conclusion
The Model Variant feature will be of great help to researchers and companies aiming to provide actual services with already trained models. Based on a powerful resource management system and support for various AI accelerators such as NVIDIA GPU, AMD ROCm, TPU, Graphcore IPU, Furiosa Warboy, Rebellions ATOM, Hyperaccel LPU, etc., Backend.AI now provides an integrated environment that can easily deploy services beyond simply training models. Try serving your desired AI model anytime with Backend.AI!
11 July 2024
Learning from Nature Again: Neuromorphic Computing and Deep Learning
By Jeongkyu Shin

This article originally appeared in Engineers and Scientists for Change, May 2022.
The field of artificial neural networks has been gaining serious attention for nearly a decade now. In that short time, along with the advancements in deep learning, the field has been solving countless problems at an astonishing pace. It is considered the most promising approach for achieving artificial intelligence.
At the cutting edge, news about hyperscale deep learning models and their implementations have been garnering attention. In April 2022, news about NVIDIA's new H100 GPU flooded the headlines. AMD's high-performance computing GPUs like the MI series, along with Intel's new Ponte Vecchio GPU, are expected to drastically improve AI and blockchain mining acceleration, creating a new battleground for hyperscale AI.
Amidst the buzz around hyperscale AI, there was one piece of news that didn't receive much public interest: Intel's announcement of the Loihi 2 chip in October last year[1]. This news comes with a fascinating history and technical background. While AI training and service acceleration chips are proliferating, let's explore the science behind this intriguing tech news.
Can we code intelligence with deep learning?
Since 2013, when deep learning took off with matrix computation acceleration using GPUs, the field has begun exploring various possibilities empowered by the scale of computation. Starting with the AlphaGo shock in 2016, deep learning has gradually expanded its scope beyond research into practical applications. The transformer model architecture[2], proposed in 2017 and widely adopted from 2018, introduced the concepts of attention and self-attention, greatly improving the process of deep learning models creating their own memory structures. The transformer architecture has since been used in a wide range of deep learning models, particularly excelling in the language processing and image processing domains, where data is abundant. Transformers have enabled deep learning models to solve various problems that previously seemed intractable.
This seemingly omnipotent model has led the trend of scaling up deep learning models since 2018. The size of a deep learning model is determined by the number of its parameters, which are the connection information between the perceptrons that make up the model, corresponding to the synaptic connections of actual neurons. More connections allow the deep learning model to differentiate and judge more complex inputs. As deep learning models become more complex and massive, the number of parameters grows exponentially. Until 2019, the number of parameters in deep learning models increased roughly 3 to 5 times each year, but since 2019, it has been increasing more than tenfold annually. The massive deep learning models that have emerged in the past 2-3 years are sometimes referred to as "hyperscale AI". Well-known hyperscale deep learning models in the language processing domain include OpenAI's GPT-3 and Google's LaMDA. For these huge models, such as GPT-3, the system cost for training the model (the cost of a single training run on the cloud without purchasing equipment) is estimated to be at least around 5 billion KRW (approximately 4 million USD)[3].
Hyperscale models are solving problems that were previously difficult or impossible to solve. They discover new gravitational lenses[4] and unwarp the distortions caused by lenses[5] to unravel the mysteries of the universe. They predict protein folding structures in much shorter times and at lower costs than previous methods[6] and discover new drugs[7]. They even solve problems like StarCraft II strategic simulations[8], which require understanding flows over long periods of time.
As these models solve various problems, naturally, some questions arise. Is this approach of pouring in massive amounts of resources to create deep learning models sustainable? And can we code "intelligence" using this method?
To answer these two questions, let's quickly understand deep neural networks and today's topic, neuromorphic computing.
Deep Neural Networks: Origins and Differences
Deep learning is actually an abbreviation. The full term is Deep Neural Network (DNN), or more elaborately, an artificial neural network with deep layers. The theory of artificial neural networks has its roots in mathematically imitating the electrical properties of neurons, together with the plasticity[e1] that strengthens or weakens the connections between neurons during information processing, and then simplifying the result. The artificial neural network model therefore consists of the perceptron[9], an extreme simplification of how activation propagates across the connections between neurons; an activation function, which reduces the firing of a neuron to a function of its input signal with no time dependency; and weights, parameters representing the strength of the connections between neurons.
Although artificial neural network theory has its roots in the characteristics of actual neural networks, there is a fundamental difference: the presence or absence of dynamics that determine behavior over time. In real neural networks, various outcomes are determined by the dynamics between neurons. Neurons have their own dynamic characteristics when stimulated from the outside, and they have plasticity that physically strengthens or weakens accordingly. For example, neurons that are continuously used and connected together when making a certain decision are activated at similar times when receiving an input signal. We can observe that the axons corresponding to the connections between neurons activated at "temporally" similar times become physically thicker. In contrast, general artificial neural networks simulate plasticity using backpropagation theory instead of dynamics. Backpropagation theory is a method to simplify the calculation of strengthening the weights of connections between perceptrons that were used to make a correct decision. In artificial neural networks, the process of processing input information is instantaneous using the weights between perceptrons. Since the process of input information leading to output information is not calculated as a function of time, there is no dynamic element.
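As a one-line summary of what backpropagation computes, each connection weight is simply nudged against the gradient of the network's error; this is the standard textbook update rule, shown here for illustration:

```latex
w_{ij} \;\leftarrow\; w_{ij} \;-\; \eta\, \frac{\partial E}{\partial w_{ij}}
```

Here $w_{ij}$ is the weight between two perceptrons, $E$ is the error on a training example, and $\eta$ is the learning rate. No time variable appears anywhere in this update, which is exactly the contrast with biological plasticity drawn above.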
There are various other differences besides dynamics. These differences are usually the result of introducing assumptions that are impossible in biological neural networks in order to overcome the limitations of artificial neural network theory in the 1990s. One example is the use of ReLU[e2] as an activation function. Ordinary neurons have thresholds and output limits; infinite activation values are physically impossible, so early mathematical models also used activation functions that capture these thresholds and saturation limits. However, as artificial neural networks grew deeper, researchers found that training stopped progressing.[e3] The ReLU activation function, although physically implausible, can produce unboundedly large activation values mathematically[e4]. Introducing ReLU into deep artificial neural networks made training possible again, and the difference from biological neural networks grew.
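To make the activation-function contrast concrete, compare a classic bounded activation (the sigmoid) with ReLU; these are textbook definitions added for illustration:

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}, \quad \sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr) \le \tfrac{1}{4}
\qquad \text{vs.} \qquad
\mathrm{ReLU}(x) = \max(0, x), \quad \mathrm{ReLU}'(x) = 1 \ \text{for } x > 0
```

Because the sigmoid's gradient never exceeds 1/4, multiplying it across many layers shrinks the training signal toward zero (the vanishing gradient of endnote [e3]); ReLU's gradient stays at 1 for positive inputs, so deep stacks keep learning, at the cost of activations that are no longer bounded the way a biological neuron's output is.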
Artificial neural networks that do not need to consider dynamics can be transformed into a sequence of matrix operations, enabling incredibly fast computation. However, they have become very different from the neural networks seen in biology. So, are deep learning models and the neurological processes occurring in our brains now on completely different foundations?
Learning from Nature Again: Dynamics of Neural Networks
Individual neurons communicate signals in various ways. Some are electrical signals, and some are chemical signals. The electrical signal characteristics within single neurons were interpreted and formulated very early[10] and became the theoretical basis for perceptrons. The problem was that the formula was too complex to calculate dynamics without simplification. Later, various mathematical models were proposed to approximate the electrical responses over time with reduced computational burden, and various single neuron simulators based on these models have been released. A representative simulator is NEURON[11].
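To give a sense of why the formulation was "too complex to calculate without simplification," the classic description of a single neuron's membrane voltage, the Hodgkin–Huxley model, is shown below for illustration (the early formulation referenced in [10] is of this kind):

```latex
C_m \frac{dV}{dt} = -\,\bar{g}_{\mathrm{Na}}\, m^3 h\,(V - E_{\mathrm{Na}})
\;-\; \bar{g}_{\mathrm{K}}\, n^4 (V - E_{\mathrm{K}})
\;-\; g_L\,(V - E_L) \;+\; I(t)
```

The gating variables $m$, $h$, and $n$ each follow their own voltage-dependent differential equation, so even a single neuron requires integrating a system of coupled nonlinear ODEs over time; this is what simulators such as NEURON do.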
As mentioned earlier, simulating dynamics requires an enormous amount of computation. At some point, we are entering an era of abundant computational power. What would happen if we connected these single neuron simulations based on the overflowing computational power?
There are two approaches, algorithmic and hardware, to overcoming the computational cost of dynamics and building dynamics-based artificial neural networks. The algorithmic approach is the spiking neural network (SNN), which introduces the spike-based plasticity found in real neurons to build a dynamics-based artificial neural network. The hardware approach is neuromorphic computing, which has been in full swing since 2012: it implements an artificial neural network by creating physical devices that correspond to neurons. General-purpose computation is still too slow for the enormous amount of calculation involved in simulating dynamics, so the idea is to build dedicated devices that either realize the mathematical properties of neurons at the circuit level or hardcode the computations. Recently the two terms have converged, and implementing an SNN at the device level is now often simply called neuromorphic computing. Both approaches attempt to simulate the dynamic characteristics that traditional artificial neural network theory left out, in order to discover new phenomena or new possibilities for deep learning.
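The spiking models used in SNNs typically sit between the full Hodgkin–Huxley description and the static perceptron. A common choice, shown here as a generic illustration rather than a description of any particular chip, is the leaky integrate-and-fire neuron:

```latex
\tau_m \frac{dV}{dt} = -(V - V_{\mathrm{rest}}) + R\, I(t),
\qquad V \ge V_{\mathrm{th}} \;\Rightarrow\; \text{emit a spike and reset } V \leftarrow V_{\mathrm{reset}}
```

The membrane potential $V$ integrates incoming current over time and leaks back toward rest, and information is carried by the timing of the emitted spikes, which is exactly the time-dependent behavior that conventional DNNs discard.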
One of the companies making strides in the field of neuromorphic computing is Intel. In the fall of 2017, Intel unveiled the Loihi[e5] chip, a research neuromorphic chip containing approximately 130,000 neurons and 130 million synapses. They ported existing DNN-based algorithms onto the Loihi chip, performed various comparative tests[12], and interestingly, showed that similar results to DNN can be obtained using SNN as well.
Intel then connected multiple Loihi chips to create massive SNN systems. Nahuku implemented 4.1 billion synapses, and the Pohoiki Springs neuromorphic supercomputer[13] implemented approximately 101 million neurons and 100 billion synapses based on 768 Loihi chips. In this process, Intel developed a software stack to implement SNN on Loihi. As a result, last fall, along with Loihi 2, they released the Lava open-source software framework[14] for developing neuromorphic applications.
It was expected that DNN and SNN would show similar results. From a physics perspective, the process of artificial neural networks inferring various problems is ultimately defining an ultra-high-dimensional discontinuous state space based on information and projecting new information onto that space. Both DNN and SNN have the characteristic of being able to define ultra-high-dimensional discontinuous state spaces. Through evolution, biology has physically created the characteristic of adapting to information, and humanity has invented artificial neural network theory through biomimetics and developed deep learning.
Always Finding Answers, as Always
So far, we have learned that networks that imitate neurons at the dynamics level can also obtain results similar to what we expected from deep learning. Then, a question arises: if the results are similar, is there a need to use SNN and neuromorphic computing? The examples introduced today are just a tiny fraction of various attempts. Research is ongoing on how SNN and neuromorphic computing produce different results from existing approaches. There are also results showing that SNN performs better, especially in robotics and sensors, and studies suggesting that reflecting dynamic characteristics would be more powerful for inferring causality. There are even attempts to simulate the chemical signals occurring at synapses[15]. This is because, in addition to the connection structure of neural networks, there may be elements in the individual components that make up neural networks that evoke the emergence of intelligence that we are not yet aware of. However, this may not be a sufficient answer to why SNN is used.
Let's ask the two questions posed at the beginning of the article again. Is this approach of pouring in massive amounts of resources to create deep learning models sustainable? And can we code "intelligence" using this method? Could neuromorphic computing be the answer? It could be, or it might not be.
The reason why DNN and SNN each show high performance and results is ultimately because there is an information optimization theory that we do not yet know at the foundation of both implementations. If we come to understand that, we may be able to implement AI in a different way. It could be one path to finding an answer to the first question: "Is this approach of pouring in massive amounts of resources to create deep learning models sustainable?" Neuromorphic computing and SNN allow us to examine this problem from a new perspective.
And it could also be the answer to the second question. We always carry a question in our hearts: 'Who are we?' The approach of neuromorphic computing and SNN is the most easily understandable method when we physically approach this fundamental philosophical question. Because it explains using a system that we already know (although we don't yet know its framework).
Various fields, including neuromorphic computing, are challenging to answer the above two questions. One of them is quantum computing. Next time, let's take the opportunity to read an article together about quantum computing and deep learning.
References
- [1] https://www.anandtech.com/show/16960/intel-loihi-2-intel-4nm-4
- [2] https://arxiv.org/abs/1706.03762
- [3] https://lambdalabs.com/blog/demystifying-gpt-3
- [4] https://iopscience.iop.org/article/10.3847/1538-4357/abd62b
- [5] https://academic.oup.com/mnras/article-abstract/504/2/1825/6219095
- [6] https://www.nature.com/articles/s41586-021-03819-2
- [7] https://www.frontiersin.org/articles/10.3389/frai.2020.00065/full
- [8] https://www.deepmind.com/blog/alphastar-mastering-the-real-time-strategy-game-starcraft-ii
- [9] https://doi.apa.org/doi/10.1037/h0042519
- [10] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1392413
- [11] https://neuron.yale.edu/neuron
- [12] https://ieeexplore.ieee.org/document/8259423
- [13] https://arxiv.org/abs/2004.12691
- [14] https://www.intel.com/content/www/us/en/newsroom/news/intel-unveils-neuromorphic-loihi-2-lava-software.html
- [15] https://www.ibm.com/blogs/research/2016/12/the-brains-architecture-efficiency-on-a-chip/
Endnotes
- [e1] Plasticity is the ability to adapt and change one's characteristics in response to changes in the external environment or stimuli.
- [e2] It stands for Rectified Linear Unit. It is an activation function that takes the shape of y=x for values greater than 0, which means y can continue to increase as x increases.
- [e3] This is known as the Vanishing Gradient problem.
- [e4] Neurons cannot produce outputs beyond the physical limits of the cell, no matter how much larger the input they receive. It's like not being able to pass an unlimited current through a wire. ReLU is a function where the output increases linearly and indefinitely in proportion to the input.
- [e5] Intel is using various Hawaiian place names as codenames for their neuromorphic chips and systems.
27 June 2024