Tag : LLM
2024 Winter Intern in Lablup
By Yonggeun Kwon

Overview
In May 2024, I came across a Facebook post from Lablup announcing their internship recruitment. I applied through the Typeform link provided. I still vividly remember the struggle of writing in English for the Typeform (even though it stated that either Korean or English was acceptable, I felt I had to write in English). Since Lablup develops Backend.AI, an open-source software platform, I thought technical development skills would be essential. However, as my studies had been focused on AI research, I confidently applied to the research team.
The interview was conducted in English with three interviewers: Sergey and Eunjin from the research team, and Kyujin from the DevOps team. I did my best to explain the projects in my portfolio in English, and before I knew it, 40 minutes had flown by. Although I felt I couldn’t fully convey my prepared points due to my limited English conversation skills, I unexpectedly received an acceptance email three days later.
That’s how I began my internship in July with three other interns. Lablup was my first experience working at a company, so I tried to dress as neatly as possible for my first day. I wore a long-sleeved shirt and slacks despite the summer heat. When I arrived at the office just before 10 a.m., I was surprised to find Jong-eun, dressed casually in shorts and sneakers, as the only person there. He explained, "There are no assigned seats here, so feel free to sit wherever you’d like," and mentioned that others were working remotely. During the 10 a.m. all-hands meeting, I got to meet the team through Jong-eun’s screen. The atmosphere was quite different from the corporate image I had in mind, which initially caught me off guard. However, through self-introductions and coffee chats, I gradually adapted to Lablup's unique environment.
Work
Developing a Domain-Adaptive Language Model
After onboarding (installing Backend.AI), I joined the research team to begin my first task: developing a language model specialized for the trade domain.
The general workflow is depicted in the diagram above. The goal was to train a language model to extract keywords and summaries from trade-related email conversations. These outputs would then help customers generate quotations. Detailed information about the model development and evaluation process can be found on the Lablup blog.
To develop this language model, the process involved several steps: [Dataset Collection and Preprocessing] → [Model Training] → [Evaluation]. Throughout this process, I encountered many questions and faced initial challenges working in the Backend.AI environment. Since the internship was only two months long, every day felt precious. If I got stuck on something, it could cost me an entire day, so I quickly learned the importance of asking for help without hesitation.
One of the best things about Lablup was its asynchronous communication style. Whenever I had a question, I could post it on the relevant channel. While it felt intimidating at first, knowing everyone could see my question, it was far better than wasting time being stuck. The team members were always kind in their responses, which helped me save time and work on other tasks while waiting for answers. This open-source culture of collaboration was truly a strength of Lablup.
Evaluating the Agent Helmsman with LLM
As mentioned earlier, Lablup develops Backend.AI, a software platform with numerous CLI-based commands. For new users, memorizing all the commands can be challenging. While there is a WebUI, even this can feel unfamiliar for first-time users.
Helmsman is an LLM-based agent designed to address this issue. Users can provide natural language instructions (e.g., "Create a session" or "I want to train the Llama 7b model") through a chat interface. Helmsman then either provides the appropriate CLI command or executes it directly. This allows users to achieve their goals through simple chat instructions, making the platform much more accessible.
However, Helmsman didn’t always provide accurate responses. Since it relied on Backend.AI’s CLI documentation and commands, outdated or incorrect documentation could lead to errors. Additionally, as an LLM-based agent, it was susceptible to hallucinations. To evaluate Helmsman’s performance and identify issues in the documentation, I developed the following LLM-based evaluation system after discussions with Sergey and Eunjin.
1. LLM1: Instruction Generator
The first step is performed by the Instruction Generator LLM. This LLM analyzes Backend.AI's documentation to generate user instructions that are likely to be used in Backend.AI.
For example, it might generate an instruction like, "Create a session with 4 CPU cores and 16GB of memory."
2. LLM2: CLI Command Generator
The second step is performed by the CLI Command Generator LLM. This LLM generates CLI commands based on the instruction created by the first LLM, along with the CLI documentation and a few-shot examples.
For instance, it might generate the following command:
`backend.ai session create --name cli-test-session --resources cpu=4 --resources mem=16g --resources cuda.shares=2 …`
3. Executing and Logging CLI Commands
The generated CLI command is executed, and the results are logged as follows:
- Success/Failure Status: Checks whether the command executed as intended.
- Error Logs: Records error messages if the command fails.
- Reference Documentation: Logs the documentation used to generate the command.
Through this process, users can verify the results of command execution and, if issues arise, obtain additional information to troubleshoot the problem.
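To make the three steps above concrete, here is a minimal sketch of such an evaluation loop. It is not the actual Helmsman evaluation code: the prompts, the evaluator model name, and the `doc_chunks` input are illustrative assumptions, and any OpenAI-compatible endpoint could stand in for the two LLMs.

```python
# Minimal sketch of the documentation -> instruction -> CLI command -> execution loop.
# Prompts, model name, and inputs are illustrative assumptions, not Helmsman's actual code.
import json
import subprocess
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint works here

def generate_instruction(doc_chunk: str) -> str:
    """LLM1: turn a chunk of Backend.AI documentation into a plausible user instruction."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed evaluator model
        messages=[
            {"role": "system",
             "content": "Read the documentation and write one realistic user instruction for Backend.AI."},
            {"role": "user", "content": doc_chunk},
        ],
    )
    return resp.choices[0].message.content.strip()

def generate_cli_command(instruction: str, cli_docs: str, few_shots: str = "") -> str:
    """LLM2: produce a single CLI command from the instruction, CLI docs, and few-shot examples."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Generate exactly one Backend.AI CLI command. Output only the command."},
            {"role": "user",
             "content": f"{cli_docs}\n\n{few_shots}\n\nInstruction: {instruction}"},
        ],
    )
    return resp.choices[0].message.content.strip()

def execute_and_log(command: str, doc_chunk: str) -> dict:
    """Step 3: run the generated command and record success/failure, error log, and reference doc."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return {
        "command": command,
        "success": result.returncode == 0,
        "error_log": result.stderr if result.returncode != 0 else "",
        "reference_doc": doc_chunk,
    }

# doc_chunks: documentation sections to evaluate (assumed to be prepared elsewhere).
doc_chunks = ["backend.ai session create -- creates a compute session ..."]
logs = [execute_and_log(generate_cli_command(generate_instruction(c), cli_docs=c), c)
        for c in doc_chunks]
print(json.dumps(logs, indent=2, ensure_ascii=False))
```

Categorizing the collected `error_log` entries afterwards is what yields the instruction/documentation/hallucination breakdown discussed in the result analysis below.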
Result Analysis
Errors can be categorized into three types:
- Instruction Errors: Caused by insufficient user input, which can often be resolved through user-agent interaction.
- Documentation Errors: Rare but critical, as they result in repeated failures. These require corrections in the documentation.
- Hallucination Errors: Mitigated by adding few-shot examples. However, prompt length becomes a concern, so matching few-shots to specific commands can help.
Due to time constraints, the tasks following the result analysis were left as future work.
Reflection
What started as a two-month internship was extended to six months. Looking back, I realize that two months might have been too short to fully experience and contribute to Lablup, given the depth of Backend.AI and the remote-friendly culture.
However, as mentioned earlier, Lablup encourages a culture of asking questions. By taking advantage of this, even two months can be enough for meaningful growth.
Beyond work, I also had the chance to participate in events like PyCon and the Lablup Conference. Notably, I even had the opportunity to be a speaker at the Lablup Conference. Though I felt there was much to improve in my presentation, it was an invaluable experience.
Lablup is a company filled with unique individuals. Everyone’s character stood out so much that I often thought of myself as the most ordinary person there. Yet, I found that their distinctiveness also came with extraordinary talents. I laughed a lot, learned a lot, and received a lot of help. This made work enjoyable, and I never dreaded going to the office.
My time at Lablup was not only a period of professional growth but also a time of personal enrichment through meaningful interactions and experiences. I hope to carry forward the values I learned here and continue to grow.
27 January 2025
Developing KIMCHI: An AI Image Generation Model Reflecting Korean Culture
By Yonggeun Kwon

Introduction
Lablup's research team is working on various examples utilizing Backend.AI. We develop AI agents that automatically generate CLI commands inside Backend.AI to match the user's instructions, and we also create simple demos for various exhibitions. For example, 'VisuTale', a service that creates images and fairy tales from images provided by users, is the work of our research team. VisuTale has been used at various exhibitions as an example of AI services that can be developed using Backend.AI, and it has been a hit with the audience.
During a recent internal study, we realized that AI-powered image generation models often fail to capture specific cultural contexts or languages. This is because global AI models are usually trained on English text or on generic, general-purpose data. When a region or culture is mentioned in a prompt, its unique characteristics are often not represented. When we examined different models, we found that when generating images from Korean text prompts, Korean identity was not fully represented in the results.
For example, popular models such as Stable Diffusion v1.5 and DALL-E 3 demonstrate issues with not interpreting Korean text prompts correctly, or not fully reflecting Korean identity in the generated images.
- With the Korean prompt "Draw me a Korean Ramyeon", Stable Diffusion v1.5 generated an image of a building instead of a bowl of ramyeon noodles, indicating that the model was not handling the Korean input correctly.
- With the same prompt, DALL-E 3 did generate a noodle dish, but the image looked more like Japanese ramen than Korean ramyeon and did not capture the Korean vibe.
Typing the prompt in English isn't the answer either - sometimes there is no one-to-one Korean-to-English translation for a word, or the translator misses the context and doesn't produce the right wording. For example, if you type the dish "Gochujang-jinmichaeboekum" into a translator, it comes back as "stir-fried vegetables with red pepper paste", which isn't an accurate translation of the dish.
As a solution to this problem, KIMCHI (Korean dIffusion Model adapting Culture and HIstory), an image generation model that fully reflects Korean culture and identity, was born.
Making 'KIMCHI'
Attached Image Source: Wikipedia & Lotte Group
Lablup participates in various overseas exhibitions and conferences such as CES, MWC, GTC, and SC, so there are many opportunities to showcase various demos. In this situation, we believe that 'KIMCHI' can be used as a good example of a culture-specific AI model. If Lablup's 'KIMCHI' is shown to generate customized images that reflect cultural characteristics in real time, it will attract attention overseas and effectively communicate the brand's differentiation. Furthermore, KIMCHI can also be used as marketing materials at various exhibitions and provide differentiated content.
'KIMCHI' was developed with two goals: (1) to understand Korean prompts directly and produce accurate images without the need for translation, and (2) to generate images that reflect Korean culture and identity. If we achieve both goals, the model should be able to generate images of Korean ramyeon noodles, or of Lotte World Tower standing in a Korean city center, in response to prompts like "Draw me ramyeon noodles in Korea" or "Draw me Lotte World Tower in Korea".
In this article, we'll describe the development of the 'KIMCHI' model, its dataset preprocessing framework, and its limitations.
Preparing Dataset
Datasets collected from AIHub, such as the Korean Food dataset, the Korean Historical Buildings and Landmarks dataset, and an external-knowledge-based multimodal Q&A dataset with Korean images, were used to train the 'KIMCHI' model.
Processing the Data
1. VLM (Vision Language Model) Processing: First, we feed the prepared image, along with a prompt, into a VLM called LLaVA. A VLM is a multimodal AI model that can understand images and text together, meaning it analyzes the information in the image and translates it into human-understandable text. For example, it can look at a photo and generate an English caption such as "A plate of food with a red sauce and a green vegetable."
2. LLM (Large Language Model) Refinement: The English caption is then passed to LLaMA, a language model. LLaMA is responsible for replacing the existing English caption with a more detailed and accurate Korean description. If the original English caption was "A plate of food with a red sauce and a green vegetable." and contained only visual information about the image, LLaMA turns it into a Korean sentence that reads, "Kimchi, a traditional Korean dish made with chunggat tossed in red sauce on a white plate."
Through these two processes, Korean captions are generated that best describe the existing image. These captions are then used to train the image generation model.
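As a rough illustration of this two-stage flow, the sketch below chains a LLaVA-style VLM and a LLaMA-style LLM through the Hugging Face `transformers` pipelines. The checkpoint names, prompts, and file name are assumptions for illustration and may differ from the actual pipeline used for KIMCHI.

```python
# Illustrative VLM -> LLM captioning sketch; checkpoint names and prompts are assumptions.
from PIL import Image
from transformers import pipeline

# Step 1: the VLM produces an English caption describing the image.
vlm = pipeline("image-to-text", model="llava-hf/llava-1.5-7b-hf")
image = Image.open("gat_kimchi.jpg")  # hypothetical training image
vlm_prompt = "USER: <image>\nDescribe this dish in one sentence. ASSISTANT:"
english_caption = vlm(image, prompt=vlm_prompt,
                      generate_kwargs={"max_new_tokens": 80})[0]["generated_text"]
english_caption = english_caption.split("ASSISTANT:")[-1].strip()  # drop the echoed prompt

# Step 2: the LLM rewrites the English caption as a richer Korean description.
llm = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")
refine_prompt = (
    "Rewrite the following English caption as a detailed Korean description, "
    "adding the correct Korean name of the dish if you can identify it:\n"
    f"{english_caption}\nKorean description:"
)
korean_caption = llm(refine_prompt, max_new_tokens=128, return_full_text=False)[0]["generated_text"]

print(korean_caption)  # later used as the text prompt for diffusion training
```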
Training a model
The Korean descriptions generated by this process were used as text prompts to train the diffusion model.
In the image above, we obtained the Korean description of Gat-Kimchi (a Korean dish that is a variant of kimchi): "a traditional Korean dish made of chunggat tossed in red sauce on a white plate". Since it would be very expensive to fully fine-tune the diffusion model, we applied the Low-Rank Adaptation (LoRA) technique to improve training efficiency, and used the text encoder of the Korean-CLIP model to encode Korean prompts. Korean-CLIP is a CLIP model further trained on Korean-English parallel data. Through this process, we were able to fine-tune the base model and develop 'KIMCHI', which handles Korean prompts and generates images that naturally reflect Korean culture.
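In code, the setup looks roughly like the sketch below: LoRA adapters are attached to the Stable Diffusion v1.5 UNet via `peft`, and the prompt embeddings come from a Korean CLIP text encoder instead of the original English one. The Korean-CLIP checkpoint name, LoRA rank, and training-loop details are assumptions for illustration, not the exact KIMCHI training code.

```python
# Broad-strokes LoRA fine-tuning sketch for SD 1.5 with a Korean CLIP text encoder.
# Checkpoint names and hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from peft import LoraConfig
from transformers import AutoTokenizer, CLIPTextModel

base = "runwayml/stable-diffusion-v1-5"
unet = UNet2DConditionModel.from_pretrained(base, subfolder="unet")
vae = AutoencoderKL.from_pretrained(base, subfolder="vae")
noise_scheduler = DDPMScheduler.from_pretrained(base, subfolder="scheduler")

# Hypothetical Korean CLIP checkpoint; its hidden size must match the UNet's
# cross-attention dimension (768 for SD 1.5).
ko_clip_id = "some-org/korean-clip-vit-large-patch14"
tokenizer = AutoTokenizer.from_pretrained(ko_clip_id)
text_encoder = CLIPTextModel.from_pretrained(ko_clip_id)

# Attach LoRA adapters only to the UNet attention projections; freeze everything else.
unet.add_adapter(LoraConfig(r=8, lora_alpha=16,
                            target_modules=["to_q", "to_k", "to_v", "to_out.0"]))
vae.requires_grad_(False)
text_encoder.requires_grad_(False)

optimizer = torch.optim.AdamW(
    [p for p in unet.parameters() if p.requires_grad], lr=1e-4)

def training_step(pixel_values: torch.Tensor, korean_captions: list[str]) -> float:
    # Encode images into latents and Korean captions into text embeddings.
    latents = vae.encode(pixel_values).latent_dist.sample() * vae.config.scaling_factor
    tokens = tokenizer(korean_captions, padding="max_length", truncation=True,
                       max_length=tokenizer.model_max_length, return_tensors="pt")
    text_emb = text_encoder(tokens.input_ids).last_hidden_state

    # Standard diffusion objective: predict the noise that was added to the latents.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                              (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_emb).sample

    loss = F.mse_loss(noise_pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Since only the LoRA parameters are trainable, the memory and compute cost is a small fraction of full fine-tuning, which is what makes this approach practical here.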
Evaluating experiments and performance
Image generation experiments
Experiment (1) - Representing cultural elements
The comparison experiment above shows the results of DALL-E 3, ChatGPT-4o, Stable Diffusion 2.1 (using English translation prompts), and the KIMCHI model generating images based on the same Korean prompts alongside real images.
While the other models also produced plausible results, KIMCHI captures the texture and ingredients of food, the details of traditional architecture, and more, without looking artificial. This means that KIMCHI has learned to accurately understand Korean text and generate natural Korean images without the need for translation.
Experiment(2): Preserving the ability to create generic images
The image above shows the results of validating whether KIMCHI retains the generation ability of the pretrained model after training. KIMCHI was trained using Stable Diffusion 1.5 as its base model, so we compared the two side by side.
Evaluating performance
Result(1): Representing cultural elements
After the experiment, we evaluated the image generation results by surveying 53 evaluators. For each model's generated images, they selected one of five statements for each of the above items, ranging from strongly agree to strongly disagree. We then scaled the evaluators' responses from 1 to 5 and measured the average value for each item.
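The aggregation itself is simple; below is a small sketch of the Likert-scale averaging described above (the column names and file layout are assumptions for illustration).

```python
# Map the five agreement levels to 1-5 and average per model and evaluation item.
# Column names and file layout are assumptions for illustration.
import pandas as pd

likert = {"Strongly disagree": 1, "Disagree": 2, "Neutral": 3,
          "Agree": 4, "Strongly agree": 5}

# One row per (evaluator, model, item, answer).
responses = pd.read_csv("survey_responses.csv")
responses["score"] = responses["answer"].map(likert)

mean_scores = responses.groupby(["model", "item"])["score"].mean().unstack()
print(mean_scores.round(2))
```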
The table above tabulates the experimental results for Experiment (1). KIMCHI outperformed all other models in Text-Image Alignment, Reality, and Domain-Specific Suitability. This means that KIMCHI is good at generating images that are close to the real image, reflecting the Korean prompt, and it is also good at reflecting cultural factors.
Result(2): Preserving the ability to create generic images
The table above shows the results of Experiment (2) through human evaluation. Compared to the Stable Diffusion 1.5 base model, KIMCHI performs slightly worse in Text-Image Alignment and Error Detection, indicating that its ability to generate general images decreased somewhat as a result of specializing in a specific domain. However, we found that it preserved its Reality performance well.
Limitations and conclusions
Limitations
KIMCHI is a fine-tuned model utilizing Stable Diffusion v1.5 as a base model. Stable Diffusion v1.5 was released in October 2022, which is more than two years old. Therefore, even with fine-tuning, image performance may not be as good as the latest version of the model. Also, while it can be trained to understand Korean prompts, it has limitations in that its text-image alignment ability is dependent on the ability of the base model.
Conclusions
KIMCHI is a model that directly understands Korean prompts and generates images that reflect cultural nuances, challenging the limitations of existing image generation models. Although there are still many limitations, this research has the following contributions.
- Build a culturally contextualized dataset preprocessing pipeline with VLM and LLM
- Generate images with Korean prompts without translation
- Generate more realistic and culturally contextualized images
If we build on these contributions and collect images from a specific culture, we can leverage KIMCHI's dataset preprocessing pipeline to build a dataset for fine-tuning a diffusion model, which allows us to develop a generation model specific to that language and culture.
While this research started by generating images that accurately reflect Korean culture, it can be extended to AI technologies that understand various cultures. We look forward to further research in the future so that culture-specific models such as KIMCHI can contribute to the world's diverse cultures, and to AI technologies that understand and respect diverse cultures.
26 December 2024
Model Variant: Easily Serving Various Model Services
By Jihyun Kang

Introduction
Imagine a scenario where you need to train an AI for research purposes and produce results. Our job would simply be to wait for the AI to correctly learn the data we've taught it. However, if we assume we're creating a service that 'utilizes' AI, things get more complicated. Every factor becomes a concern, from how to apply various models to the system to what criteria to use for scaling under load conditions. We can't carelessly modify the production environment where users exist to get answers to these concerns. If an accident occurs while expanding or reducing the production environment, terrible things could happen. If something terrible does happen, we'll need time to recover from it, and we can't expect the same patience from consumers using our service as we would from researchers waiting for model training. Besides engineering difficulties, there are also cost challenges. Obviously, there's a cost to serving models, and users are incurring expenses even at the moment of training models as resources are being consumed.
However, there's no need to worry. Many well-made models already exist in the world, and in many cases, it's sufficient for us to take these models and serve them. As those interested in our solution may already know, Backend.AI already supports various features you need when serving models. It's possible to increase or decrease services according to traffic, and to serve various models tailored to users' preferences.
But the Backend.AI team doesn't stop here. We have enhanced the model service that has been available since Backend.AI version 23.09, making it easy to serve a variety of models. Through this post, we'll explore how to serve various models easily and conveniently.
This post introduces features that allow you to serve various types of models more conveniently. Since we've already given an explanation about model service when releasing the 23.09 version update, we'll skip the detailed explanation. If you're unfamiliar with Backend.AI's model service, we recommend reading the following post first: Backend.AI Model Service Preview
Existing Method
| Requirement | Existing Method | Model Variant |
|-------------|-----------------|---------------|
| Writing model definition file (model-definition.yaml) | O | X |
| Uploading model definition file to model folder | O | X |
| Model metadata needed | O | △ (some can be received automatically) |
Backend.AI model service required a model definition file (model-definition.yaml) that contains commands to be executed when serving the model in a certain format, in addition to the model metadata needed to run. The service execution order was as follows: Write the model definition file, upload it to the model type folder so it can be read, and when starting the model service, mount the model folder. Then, an API server that automatically transfers the end user's input to the model according to the model definition file and sends back the response value would be executed. However, this method had the disadvantage of having to access the file every time the model definition file needed to be modified. Also, it was cumbersome to write different model definition files each time the model changed because the model path was already set in the model definition file.
The Model Variant introduced this time is a feature that allows you to serve models immediately by inputting a few configuration values or without any input at all, using only model metadata without a model definition file. Model Variant supports command, vLLM, and NIM (NVIDIA Inference Microservice) methods. The methods of serving and verifying model service execution are as follows.
Basically, model service requires metadata of the model to be served. Download the model you want to serve from Hugging Face, where you can easily access model metadata. In this example, we used the Llama-2-7b-hf model and Calm3-22b-chat model from Hugging Face. For how to upload model metadata to the model folder, refer to the "Preparing Model Storage" section in the previous post.
Automatically Serving Model from Built Image (Command Method)
The command method, introduced first, includes the command for serving the model (the part that would otherwise go into the model definition file) directly in the execution image. After specifying the command to execute in the CMD environment variable, you build the image and run it as-is when serving the model, without any other input. The command method does not support a health check that verifies whether the service is running properly, so it is better suited for quickly standing up a prototype service than for running large-scale services. The execution method is as follows:
- On the start screen, select `Llama-2-7b-hf` in the Model Storage To Mount item to mount the model folder containing the model metadata for the model service to be served, and select Predefined Image Command in the Inference Runtime Variant item. Activate the Open To Public switch button if you want to provide a model service accessible without a separate token.
- Select the environment to serve. Here, we use `vllm:0.5.0` and allocate CPU 4 cores, Memory 16 GiB, and NVIDIA CUDA GPU 10 fGPU as resources.
- Finally, select the cluster size and click the start button. The cluster size is set to single node, single container.
If the service has been successfully launched, the service status will change to `HEALTHY` and the endpoint address will appear.

Verifying the Service

If the service has been launched normally, check the service model name with the `cURL` command:

```
curl https://cmd-model-service.asia03.app.backend.ai/v1/models \
  -H "Content-Type: application/json"
```
Now, let's send input to the service with the `cURL` command and check the response. For model services run with CMD, the model name is already defined in the image, so after checking the model name, you must enter it as the value of the `model` key when sending a request.

```
curl https://cmd-model-service.asia03.app.backend.ai/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "image-model",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'
```
Serving Models in vLLM Mode
The vLLM mode is similar to the command method introduced earlier, but various options entered when running vLLM can be written as environment variables. The execution method is as follows:
How to Run
- On the start screen, mount the model folder for the model service to be served and select `vLLM` in the Inference Runtime Variant item.
- Select the environment to serve. As with the command method explained earlier, select `vllm:0.5.0`, and (although you could set the resources the same) this time we'll allocate CPU 16 cores, Memory 64 GiB, and NVIDIA CUDA GPU 10 fGPU.
- Finally, select the cluster size and enter the environment variable `BACKEND_MODEL_NAME`. This value corresponds to the `--model-name` option in vLLM and becomes the model value specified by the user when sending a request to the service.
Once the input is complete, click the START button to create the service. Likewise, if the service has been successfully launched, the service status will change to `HEALTHY`, and the endpoint address where the service is launched will appear.

Verifying the Service

Let's send input to the service with the `cURL` command and check the response value. This time, enter the `BACKEND_MODEL_NAME` value you set earlier as the model value.

```
curl https://vllm-calm3-22b-chat.asia03.app.backend.ai/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm-model",
    "prompt": "初めて会う日本人ビジネスマンに渡す最高の挨拶は何でしょうか?",
    "max_tokens": 200,
    "temperature": 0
  }'
```
Serving Models in NIM Mode
To run NIM, you need an API key issued from an account that can access NGC's NIM model registry. For how to obtain the key value, please refer to the following content: NVIDIA Docs Hub : How to get NGC API Key
The NIM (NVIDIA Inference Microservice) mode is also similar to the command mode, but it must be run with an image that has NVIDIA's NIM-supporting model server built-in. Also, when loading the model, the NGC API key value is needed. Assuming everything is ready, let's start the model service.
How to Run
- On the start screen, select an empty model type folder to cache the metadata to be received from NIM, and select `NIM` in the Inference Runtime Variant item.
- Select the environment to serve. Here, we use `ngc-nim:1.0.0-llama3.8b` and allocate CPU 8 cores, Memory 32 GiB, and NVIDIA CUDA GPU 15 fGPU as resources.
- Finally, select the cluster size and enter the default path `/models` for the environment variable `HF_HOME`. Then enter `NGC_API_KEY` and input the issued key value. Once the input is complete, click the CREATE button to create the service.
When using NIM, it may take some time for the first execution as it receives model metadata from the repository. You can check the progress by viewing the container logs for the routing session in service on the session page.
Like the command and vLLM modes, if the service has been successfully launched, the service status will change to `HEALTHY`.

Verifying the Service

Let's send a request to the endpoint address where the service is launched, as follows, and check the response value.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://nim-model-service.asia03.app.backend.ai/v1",
    api_key="$YOUR_NGC_API_KEY"
)

completion = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[
        {"role": "user", "content": "Hello! How are you?"},
        {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
        {"role": "user", "content": "Can you write me a song?"}
    ],
    temperature=0.5,
    top_p=1,
    max_tokens=1024,
    stream=True
)

for chunk in completion:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```
Conclusion
The Model Variant feature will be of great help to researchers and companies aiming to provide actual services with already trained models. Based on a powerful resource management system and support for various AI accelerators such as NVIDIA GPU, AMD ROCm, TPU, Graphcore IPU, Furiosa Warboy, Rebellions ATOM, Hyperaccel LPU, etc., Backend.AI now provides an integrated environment that can easily deploy services beyond simply training models. Try serving your desired AI model anytime with Backend.AI!
11 July 2024
[Featured articles] Scale entanglement
By Jeongkyu Shin

This article originally appeared in Crossroads, May 2023. [^Editor's note]
This article is translated from Korean.
The original order of the essay is 2023 > 2015 > 2020 > 2017 > 2018 > 2019 > 2021 > 2022 > 2023. While the emotional flow follows that order, the essay has been re-edited in chronological order for the reader's understanding.

After 2023, March 14th may no longer be known as Pi Day, but rather as Chatbot Day.
It was the day when all the language models that had been in storage were simultaneously unleashed into the world. Starting with Google's release of PaLM fine-tuning + generative models on Vertex AI, followed by OpenAI's announcement of GPT-4, Microsoft officially confirming that Bing was already using GPT-4, and Anthropic's formal release of Claude bot, all within a span of 12 hours.
That morning, after reviewing the GPT-4 tech report released by OpenAI, I left a post on Facebook about the technical aspects that impressed me.[1] A comment was left: "I feel both joy and pain seeing things I wondered if I would ever see in my lifetime becoming reality..." I replied, "Now, no one is interested in the Turing test anymore. Within a year, it has become a question of 'Isn't it obvious that it will pass without a wow point?'"
When actually facing the keyboard, the task of organizing knowledge seems to have already left human hands. I tap into human stories to find the meaning of recording.
To avoid sounding like an alien language, let me briefly touch upon the aspects of artificial neural networks necessary to understand this essay.
A program that simulates the connections between neurons is called an artificial neural network. Neurons are grouped into layers, and the network is designed by overlapping these layers and creating connections between neurons in adjacent layers. Deep learning is a term used when there are many layers in an artificial neural network. The results of various artificial neural networks are called deep learning models, or simply AI models to sound more impressive.
The connection strengths between neurons are called parameters[2]. The number of connection strengths is called the number of parameters. As the number of parameters increases, more memory is required, so it is said that the model becomes larger. Training a model involves connecting artificial neurons and adjusting the connection strengths between them so that the output for the input data takes the desired form. After training, an artificial neural network can mimic an extremely high-dimensional, discontinuous state space.
Those are the basic terms. So, when should we start recalling the story?
2015
I founded Lablup Inc. The name was chosen as a pun, meaning both "lab eul up" and "lab | up". (eul is a Korean object-marking particle; | is the Linux pipe operator.)
People who had been suffering through their Ph.D. programs came together with the goal of creating a research automation platform in the field of computational science to alleviate the hardships of others. We believed that instead of clumsily running clusters by placing workload managers[4] on bare metal[3], we needed a computational environment that guarantees reproducibility and portability. The start was courageous, but there was no demand for a research platform. Within two months of founding the company, we learned the hard way that universities and research institutions alike, though they readily purchase equipment, are stingy when it comes to spending money on software. Universities lacked funds, but had an abundance of graduate students who would figure things out on their own when instructed. The industry did not yet have a large-scale demand for scientific computing. At the same time, we painfully learned that Ph.D. holders like us, who had only been in school, needed to go through a process of re-socialization to communicate like normal people.
It wasn't just our lack of re-socialization that was the problem. It was the content of what we were saying. Stories about science advancing based on technology or stories about accelerating innovation based on computation were considered science fiction topics wherever we went. We were getting exhausted. Yet, as they say, people only see what they can see. The potential of deep learning models was clearly emerging. In fact, various movements had started to prevent deep learning technologies and their outcomes from being subordinated to capital. One of the representative organizations was OpenAI[5]. Such changes seemed to be evidence that we were heading in the right direction. It felt like the direction would become a bit clearer after just one more year.
On the verge of our first anniversary, deep learning garnered significant social interest. Around the end of 2015, TensorFlow[6] was released to the world. As the first lecture material for the coding platform prototype, we translated the entire TensorFlow manual and uploaded it. When AlphaGo had its match in March 2016[7], our service crashed for the first time due to the influx of people. If it weren't for that match at that time, perhaps there wouldn't be a sequel to this story. Thankfully, because of that, the company survived.
We needed a demo of a research platform performing massive-scale computations. Language models were a topic that required enormous computational resources and could be immediately tackled as a hobby since my Ph.D. program. In 2016, we released a chatbot created by placing a language model on the platform we were developing. It attracted a lot of attention.[8][9] The chatbot quickly became an in-house project. However, after just over a year, the chatbot project was shelved.
2017
In the fall of that year, Lablup decided to halt the development of language models, which had been a side project, and focus solely on Backend.AI, an AI cluster management platform.
The shock of the Google Assistant demo I saw in Krakow, Poland, where I was invited by Google, was significant. The topic of that meeting, which was attended by just over ten people, clearly showed that language models had now become part of the arms race for resources, and without large-scale investment, it would be impossible to keep up with the changes thereafter. Over dinner with Donghyun Kwak, who attended the Google Developer Summit with me, and Sanghun Lee, who visited Krakow to attend a conference held at the same time and place, we discussed, "It seems like the future I saw in the field of physics will start here as well."
Nearly eighty years ago, the Manhattan Project demonstrated powerfully to all of humanity, through nuclear weapons, that technology becomes power. Physics was no longer an object of romance, but an object of investment. The era of physicists making a living, which began that way, led to changes in the fields of massive science, connecting to the space program and particle physics. On the way back to the accommodation that night, I sent a message to the team members: "Let's not develop language models anymore. From now on, we'll lack the funds to keep up."
History always repeats itself. Then it's not difficult to predict what changes will occur in the future. It's just a matter of timing. We shared the opinion that 'the inflection point will probably be in 2020.'[10] By then, wouldn't it be possible to achieve profitability? It became the company's goal. Language models were advancing beyond LSTM-based machine translation. The AlphaGo shock became the source of countless jokes people made about AI. A tremendous number of "AI companies" were born. However, most of them became coin companies or metaverse companies two years later.
2018
The transformer[11] architecture began to be applied to various parts of all kinds of language models. In terms of letting us know 'what' to focus on, transformers solved many aspects of model context memory, maintenance, and emphasis. They could be used for encoding to put information into the state space or for decoding to extract information from the state space. Google introduced BERT[12], while OpenAI introduced GPT. Both were transformer-based language models, but they differed in their focus on the encoder and decoder, respectively. BERT focused on the encoder part, while GPT implemented an architecture that creates memory for causal relationships by linking the output to the input through the decoder, which is structurally different from BERT. BERT and GPT, and the later T5, no longer used labeled corpora. Using transformers, language models could be created by first training the model to learn the structure of language from the corpus itself and then fine-tuning it. It was still far from the end-to-end training philosophy of AI model development, which involves no human intervention in the data. However, the concept of data acquisition began to change from that point on. Quantity is more important than labeling. It was the beginning of general-purpose language models.
BERT seemed to be able to replace most of the existing language models at an incredibly fast pace. Its overwhelming performance raised expectations for making significant improvements in various language tasks such as document creation, chatbots, and document analysis. However, as of 2018, BERT was a model so large that it was unimaginable to train. It seemed like no one outside of Google would be able to create a model of that size. But that moment was fleeting. Facebook quickly released RoBERTa[13], which was based on the BERT paper but increased in size. It was a symbolic action that simultaneously announced that TPUs were not necessarily required and that anyone with capital could participate in this race.
The first bottleneck in increasing model size using GPUs appeared in GPU memory. Models could no longer fit on a single GPU device or be trained in time using a single device. Depending on the case, it became common to split models across multiple GPUs or distribute training across multiple computing nodes. Horovod and Distributed TensorFlow began to shine.
Technology continues to advance, and the cost per unit of computational resources continues to decrease. If this progress continues, the popularization of AI will eventually take place, and at that point, price competitiveness, which is the same in all markets, will become the most important factor. "The era of price competitiveness will come for AI as well," I wrote, praying not to go bankrupt until then.
- BERT, 340 million parameters
- GPT, 110 million parameters
2019
For several years while creating a distributed processing and distributed training platform, I occasionally wondered, 'Could we be creating a platform with no demand?' After 2019, I no longer had such thoughts. From 2020, there was no time for such thoughts.
As soon as the new year began, OpenAI announced GPT-2[14]. The GPT model, which focused on the decoding process of extracting information from the topological space, demonstrated remarkably stable text generation capabilities. GPT-2 became the basic code that anyone could use to create a language model. Along with PyTorch, Horovod, and Distributed TensorFlow, the difficulty of code access was rapidly decreasing. Google's XLNet and T5 (Text-To-Text Transfer Transformer) language models in 2019 seemed to have crossed the river of model sizes that were thought to be impossible for humans to surpass (by spending capital). Google strongly appealed that training T5 required enormous computational resources at the TPU level, emphasizing that it would require hundreds of NVIDIA's V100 units, which could be purchased with effort in the market. (A single V100 unit cost around 12K USD at the time.) Like BERT, T5 only released the paper and did not release the model. Google had the painful experience in 2018, when they released the BERT paper but did not immediately release the model alongside it (because training was not yet complete), and in that gap, Facebook preemptively released RoBERTa, which they had trained by increasing the size of the same model. Considering that they still did not release it, Google must have been confident that it would be difficult for anyone outside Google to reproduce the training of that model.
At the end of 2019, we ended our long wandering and moved into our own office. The era of massive deep learning models was coming, and to prepare for it, we needed to collaborate with more people. The size of language models was increasing tenfold every year. As I carried the not-so-many belongings in boxes, I questioned myself. At this rate, in three years, the model size would increase a thousand-fold, but were we ready to handle a thousand times the workload?
- RoBERTa, 350 million parameters
- Transformer ELMo, 460 million parameters
- GPT-2, 1.5 billion parameters
- T5, 11 billion parameters
2020
Two months had passed since we moved to the new office. The winter was long.
By the end of February, we were unable to complete the interior of the new office we had moved into. The wall finishing materials for the office, which were supposed to arrive from China at the end of February, never made it. The company's entire roadmap changed. All business trips to the United States were canceled. The office with unfinished interiors remained unfinished and guarded the empty space for the next two years.
COVID-19 not only tore apart the company's future but also separated people. My first child started elementary school lying diagonally on the floor, watching the "tiger teacher" (a Korean expression for a strict teacher) on the EBS educational broadcast TV channel. I rolled around next to the child rolling on the floor. How busy had I been? Even while raising my children, it wasn't until I was trapped in the same house due to the coronavirus that I felt the reality of being a father. I wondered how long that sad yet strangely stabilizing time would last.
That year, OpenAI released GPT-3. The theoretical foundation was not significantly different from GPT-2. However, one thing was vastly different: the size. It was a massive model with 175 billion parameters. Not only for model training but simply loading it onto a GPU was expected to occupy an entire NVIDIA DGX-2 supercomputing node. Unlike GPT-2, this time, neither the code for the language model nor the trained language model was released. Wow. Non-disclosure in the field of deep learning. Something was changing.
There was a movement against the supremacy of model size. Does performance increase accordingly as the size of deep learning models grows? A debate began between researchers at Google and Meta. One side argued yes, while the other argued no, engaging in a word battle in the form of papers. However, this debate, which lasted from 2019 to 2021, did not last long. As the size of language models increased, interesting phenomena were discovered. Deep learning models had scaling laws[15]. Something happened around 100 billion parameters. Regardless of the model structure, at some point beyond 100 billion parameters, language models began to do unexpected things beyond simply generating coherent speech. The phenomenon called in-context learning allowed models to learn various knowledge on the spot without model training and derive logical conclusions. It was the beginning of the race surrounding large language models (LLMs).
While language models were simultaneously immersed in the debate and discoveries surrounding the size problem, the introduction of deep learning in medical applications began at a tremendous pace. DeepMind's AlphaFold2 achieved high-accuracy structure prediction without Monte Carlo simulations, using only predictions. It reduced the required computation, which was a major challenge in the field of proteomics, to nearly one-thousandth of the previous level. From microscopic stages such as predicting coronavirus mutations, filtering vaccine candidates among synthetic substances, and predicting new synthetic structures to predicting transmission routes and estimating the number of infected individuals, the application of AI models expanded to various fields. Everyone started leaping without looking. The snowball of resource scale rolled at a tremendous speed.
In the second half of the year, discussions about the scale of computational resources to increase model training speed took place. It was different from the existing competition to secure deep learning computational resources for research goals. Scale gave rise to operational and optimization demands, and thanks to that, we were able to achieve the 'profitability from 2020' that we had anticipated in 2017. The demand for platforms increased, but at the same time, we were forced to work from home, and most communication became text-based. Although many people joined us later, some of them were destined to become colleagues who had never met each other in person until the workshop in early 2023.
The scale of compute resources was growing, and all eyes were on GPUs, but as the models got bigger and the number of GPUs increased, the GPUs became less of a bottleneck. The biggest challenge was data storage. In training, you need to feed data to hundreds of machines. The absolute speed of storage hasn't kept up with the growth in the number of GPUs. Most of the problems we had to solve in 2020 came from storage bottlenecks.
Another kind of change, slower but deeper than the deep learning race, has been happening: it's become second nature to people of all generations to treat online relationships as normal relationships. And then there comes a moment when you wonder, "Does it really make a difference to me if the person on the other end of the line is human or not, as long as they speak good enough?"
- T-NLG, 17 billion parameters
- GPT-3, 175 billion parameters
- Gshard, 600 billion parameters
2021
The race for developing large language models that followed T5 and GPT-3 was growing in fascination. The simplest way to find out if performance improvement continues as the size increases is to make it even bigger. Various theories emerged as to why large language models produce peculiar results, but the answer was still unclear. A hypothesis emerged that when the state space is sufficiently large, a kind of phase transition occurs in the process of handling information. One of the candidate explanations for why the transformer structure handles these tasks well is that the transformer structure is a special case of graph neural networks (GNNs).[16] Graph neural networks, which gained attention since 2018, are neural networks that learn the relationships of objects and are known to be very powerful in processing semantics or taxonomies.
Microsoft's DeepSpeed framework[17] began to be widely used for distributed model training. DeepSpeed's ZeRO optimizer focuses on reducing GPU memory usage by distributing workloads across various hardware from CPUs to GPUs and partitioning model states. Open-source language models also emerged. OpenAI was no longer releasing models and was selling exclusive rights to use models. Due to the reduced accessibility, various language models appeared, but they could not meet the high expectations as they did not match the scale of large language models.
The scale of GPUs handled by users easily began to exceed triple digits. Various massive-scale tests tailored to the actual workloads run by institutions became necessary. We started running language models again for system testing purposes, which we had sent to the realm of hobbies at the end of 2017. The largest Portuguese language model in the world was born on our platform and was briefly introduced in passing during the keynote at the NVIDIA GTC conference. In the same conference, a tutorial session titled "Fine-tuning BERT in 60 seconds" was held. BERT was no longer a massive model but a subject of practice.
As the model size rapidly grew, the problems to be solved also changed. When models had to be split and loaded onto multiple GPUs, communication between GPUs became extremely important. GPUs not only shared memory access within a node but also increasingly communicated across multiple nodes. It became common to attach one InfiniBand, which transmits 200 Gb per second, to each GPU to create a GPU network.
Amidst the complex and hectic changes, a thought occurred to me. The process of large language models learning 'language' is based on unclassified corpora. What does the large language model 'learn' in that process? Although corpora are used for the purpose of learning the structure of language, language cannot be separated from information. In fact, don't language models that have not been explicitly taught knowledge readily answer questions? Language itself is a protocol for humans to convey information to each other. The conversation process involves computing answers to data transmitted through the protocol and responding with data again. Then, is our perception of having developed an 'AI that converses well' really about developing an AI model that creates language well, or have we created something beyond that?
The following year was going to be the first year of services that are only possible with AI, not services that have been improved with AI. However, no one was yet thinking about providing the outputs of large language models as services. That was a task for someone in the future.
- GPT-J, 6 billion parameters
- LaMDA, 160 billion parameters
- PanGU-alpha, 200 billion parameters
- Gopher, 280 billion parameters
- Pathways, 530 billion parameters
- Switch-C, 1.6 trillion parameters
- Wudao 2, 1.75 trillion parameters
2022
The COVID-19 endemic was creating tremendous aftereffects. Numerous IT companies that had grown due to the special circumstances of the coronavirus and many companies that had tried to shift their offline operations online were dumbfounded by the demand for the metaverse, which suddenly disappeared like a mirage. The field of deep learning had not yet generated significant revenue sources. Numerous companies began downsizing their AI teams. Many researchers came out.
It wasn't that there was no need for technological advancements in AI. The massive scale underlying AI development overwhelmed all other advancements. In the era of big science, equipment was the biggest expense, just as it had always been. It was the result of three years in which innovation had come from scale. The singularity occurring in large language models began to be regarded as a kind of emergence.[18] Small-scale studies were no longer attractive. Deep learning researchers were anxious. It wasn't the diminished interest that was the problem. A spoonful of mild despair over what research could be done with just a few GPUs was more of an issue.
Nevertheless, there were several innovations that emerged from the beginning of the year. In addition to training with well-defined data, there was a model tuning method where humans actually evaluate the answers and assign higher weights to better answers. The RLHF (Reinforcement Learning from Human Feedback) method, which applied reinforcement learning to language model training by involving humans in the middle, showed with InstructGPT in 2022 a tremendous improvement in the performance of language models of the same size. Numerous models began to apply RLHF. If there were scaling laws for model size, there would be no reason not to apply them. In March, µ-Parametrization,[19] which could tremendously reduce the cost of model training, was introduced. The conclusion of the study, which showed that it was possible to predict the hyperparameters of a large model in advance using a small model, relatively reduced the effort of parameter search when creating massive models. This research became the basis for GPT-4 training.
As a result of the U.S.-China trade conflict, the United States banned NVIDIA's export of AI training GPUs to China. Within a few days, China announced a large language model trained solely with its own semiconductors[20]. Soon after, NVIDIA slightly changed the name of the same GPU with its GPU networking features removed and resumed exporting. The interest in the AI service sector continued to grow due to models like DALL-E2 and Stable Diffusion, and the market for generative AI models, such as images, began to fluctuate.
In late November, OpenAI opened a chatbot service to the public. It was based on GPT-3.5, an improved version of GPT-3. An interesting point was that instead of creating a language model that excels at programming by training programming code on a human language model, it was created by training human language on a model trained with programming language data. It became clear how training with programming code influenced the logical structure training of large language models. In early December, the service named ChatGPT[21] sparked interest in large language models based on its tremendous accessibility open to the entire public.
Towards the end of the year, my acquaintances in the AI field who seemed to be on the verge of losing their jobs were bewildered by the support that suddenly improved in real-time. The movement of companies that had been downsizing their AI organizations, riding the wave of endemic layoffs, came to a halt. Leaders who had been pressuring to downsize research organizations and evaluate outcomes just a week before were now shouting for AI. The requirements for model service frameworks suddenly began to change. The goal of large language models became commercialization. Models had become so large that there was no longer any meaning in distinguishing between computational resources for training and services. AI model training and services, which originally belonged to different domains, suddenly merged into one.
Bigger problems await. Large language models consume an enormous amount of power. GPUs consume a tremendous amount of power. Although their power-to-performance ratio is tremendously better compared to CPUs, their absolute power consumption is too high. A node with 8 NVIDIA A100 GPUs[22] consumes about 7 kW, and a node with 8 H100 GPUs, the highest-performing GPUs as of 2023, consumes about 12 kW[23]. Since 2019, it has become no joke to say that you have to construct a building first to install the equipment. After experiencing power issues at a supercomputing cluster located in Brazil in 2021, we ported the entire platform to Arm-based systems. It was in anticipation of power issues becoming a problem a few years later. In the case of Microsoft, they shared their experience of building a GPU center right next to a hydroelectric power plant, taking into account the electricity costs[24].
Weekends diminished. There was too much to do. There was no time. It wasn't just us.
Now, no one had time.
- Flan-T5, 110 billion parameters
- GLM-130B, 130 billion parameters
- OPT-175B, 175 billion parameters
- BLOOM, 176 billion parameters
- PaLM, 540 billion parameters
2023
On February 8th, Microsoft and Google made announcements about their large language model-based services at a 17-hour interval. Microsoft announced plans to introduce GPT models into its search engine Bing, its Office suite, and Windows 11. Google introduced Bard based on LaMDA. Baidu released Ernie Bot. The two companies promoted the future rather than tangible services. Tools that couldn't be tried out relatively failed to attract interest.
The "era of AI price competitiveness" that I thought would come someday had arrived. However, the initial expense itself was too high. Services like ChatGPT and Bard consume service costs that are too expensive to be explained by economic logic.[25] It corresponds to a future that has been brought forward too quickly by competition. It was after everyone had firsthand experience of that future. The problem was that expectations had skyrocketed.
The suddenly approaching large language model services are creating another bottleneck. Services that perform inference based on CPUs have been affected by the significant reduction in memory bandwidth per CPU core. This is because the number of cores per CPU has rapidly increased. Services that perform inference based on GPUs lack both the capacity and speed of GPU memory where models are loaded. The bottleneck of memory, which was expected to come someday, suddenly became a direct problem due to the commercialization of large language model services. It was an anticipated bottleneck since 2021. CPU and GPU developers such as Intel, AMD, and NVIDIA had prepared for this situation in advance. From the end of 2022 to early 2023, they introduced various hardware such as Intel's Xeon Max, AMD's MI200, and NVIDIA GraceHopper.
If an AI model is extremely large, computational power becomes relatively less important. When NVIDIA A100 was first announced, it was released with a 40 GB model, but a year later, they released an 80 GB memory model. Whether it was the training process or the inference process, the size was too large to load and unload models from memory. Additionally, the process of "inferring" large language models brought about a reversal in thinking about GPUs and NPUs. Unlike the training process, which requires constantly updating weight matrices, the inference process operates by flowing input data through a fixed model structure loaded into memory and observing the results. Therefore, the proportion of computation is tremendously reduced, and the speed of memory becomes extremely important. In the second half of 2022, NVIDIA announced the H100 with 80 GB of memory capacity. However, less than half a year later, when hardly anyone had received the actual H100, they introduced the H100 NVL with 188 GB capacity.[26]
Meta introduced LLaMA[27], a language model that, with some effort, could be run on a personal server. Despite coming with all sorts of license restrictions, LLaMA spread through illegally leaked copies, and Alpaca, a LLaMA model fine-tuned at Stanford, showed that (relatively) small models could still achieve significant performance. Since then, language models free of license issues have been released one after another[28], fueling the potential of open language models while raising a new question: how many parameters are actually enough? If a model is too small, emergent abilities do not appear and it cannot serve as a multi-modal model; if it is too large, it costs too much to actually operate.
How large can large language models grow? Signs of preparation for even larger models are everywhere. Microsoft's DeepSpeed, the framework most commonly used for distributed model training, added ZeRO-Infinity[29] in 2021, an extension that can train models with 1 to 10 trillion parameters by spilling over onto NVMe SSDs. Models of that size, however, are practically impossible to serve. In practice, the approach is to cap the model at a size that can be served and fine-tune within that range. Technologies like ZeRO were developed to train ultra-large models, but they are now widely applied because they also make fine-tuning possible with very modest resources.
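As a concrete illustration, the snippet below sketches the kind of ZeRO stage-3 configuration that enables NVMe offload in DeepSpeed. The field names follow the DeepSpeed documentation as I understand it; the paths, batch size, and precision settings are placeholders, not recommendations:

```python
# Sketch of a DeepSpeed ZeRO-3 configuration with NVMe offload (ZeRO-Infinity).
# Values are illustrative placeholders; consult the DeepSpeed docs before use.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # partition params, grads, optimizer state
        "offload_optimizer": {                   # spill optimizer state to NVMe
            "device": "nvme",
            "nvme_path": "/local_nvme",
        },
        "offload_param": {                       # spill parameters to NVMe
            "device": "nvme",
            "nvme_path": "/local_nvme",
        },
    },
}
# Such a dictionary is typically handed to deepspeed.initialize() together with
# the model and optimizer; the same mechanism is what makes low-resource
# fine-tuning of mid-sized models practical.
```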
- PaLM-E, 560 billion parameters
- Pythia, 12 billion parameters
- LLaMA, 65 billion parameters
And numerous other models with ~12 billion parameters
Various attempts are being made on the many 12-120 billion parameter models that are 'good at talking.' LLaMA unintentionally put foundation models in the hands of individuals, and many people realized that the level of "a model that talks well," enough to satisfy ordinary users, had been reached long ago. Individuals and organizations with some computing knowledge and a willingness to spend money gained the courage to fine-tune language models in all sorts of ways.
At the same time, it is becoming clear that models that go beyond merely talking well require computational resources on an entirely different level. For about half a year, newly released large language models have stayed under roughly 600 billion parameters. Perhaps scaling further no longer yields enough of a return; perhaps current hardware and costs form a technical barrier; or perhaps sizes are deliberately being kept below the point at which commercialization becomes impossible.
Backend.AI, which started in 2015 as an open-source project running on 4 GPUs, handles several thousand GPUs in 2023 and will soon reach ten thousand. Every environment, including ours, has changed enormously. The more you dig into the problems, the more new ones come up, like potatoes pulled up on a single stem. Living among the many problems tangled up with the size of large language models, I sometimes wonder where this will all end.
On nights full of thoughts, it occurs to me that, like the Turing test that quietly slipped out of our attention, we may all have passed a certain point without realizing it. It feels as though we have either solved a problem that needed solving, or solved a problem that should not have been solved yet. Complex emotions come and go: excitement turning into dizziness, expectation turning into depression.
- [^Editor's note] Crossroads is a science web journal launched by the Asia Pacific Center for Theoretical Physics under the theme "Science, Future, and Humanity," presenting a scientific vision of the future through various genres of science writing, including features, essays, columns, and novels.
- [1] Facebook Post
- [2] There are various parameters besides the connections between neurons, but since they account for a relatively small share of the model size, the description here is greatly simplified for convenience.
- [3] Physical computers in their raw form, not virtual machines. In the cloud, it is common to run virtual machines on top of a hypervisor or to manage workloads with containers, in order to reduce management overhead and allocate resources flexibly. Due to cost, this has not yet become common at small research institutes and universities.
- [4] Job scheduler: software that helps manage and execute batch jobs. Slurm and similar tools are commonly used.
- [5] https://www.openai.com (2015). Since 2020, OpenAI has not released implementations, and since 2023 it has provided only tech reports instead of papers. As of 2023, opinions vary on whether OpenAI is still an AI development organization that pursues openness.
- [6] https://www.tensorflow.org, Google (2015)
- [7] "AlphaGo - The Movie": for those who could not feel the atmosphere at the time, see the documentary (2018).
- [8] J. Shin, "Creating AI Chatbot with Python 3 and TensorFlow", PyCon APAC 2016 (Korean) / (English) (2016). There are several recordings of talks on the same topic, since I had the opportunity to present it in multiple countries, but these two are the first.
- [9] J. Shin, "Android Dreaming of Electric Sheep: Implementing a Chatbot Emotion Model Using Python, NLTK, and TensorFlow", PyCon KR 2017 (2017)
- [10] A record of an interview at the Google Startup Campus remains on YouTube.
- [11] "Transformer (machine learning model)"
- [12] J. Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv:1810.04805 (2018)
- [13] Y. Liu et al., "RoBERTa: A Robustly Optimized BERT Pretraining Approach", arXiv:1907.11692 (2019)
- [14] A. Radford et al., "Language Models are Unsupervised Multitask Learners" (2019)
- [15] J. Kaplan et al., "Scaling Laws for Neural Language Models", arXiv:2001.08361 (2020)
- [16] C. K. Joshi, "Transformers are Graph Neural Networks", The Gradient (2020)
- [17] Microsoft, "DeepSpeed: Extreme Speed and Scale for DL Training and Inference" (2019)
- [18] J. Wei et al., "Emergent Abilities of Large Language Models", arXiv:2206.07682 (2022)
- [19] E. Hu, G. Yang, J. Gao, "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer", arXiv:2203.03466 (2022)
- [20] A. Zeng et al., "GLM-130B: An Open Bilingual Pre-trained Model", arXiv:2210.02414 (2022)
- [21] OpenAI, "Introducing ChatGPT" (2023)
- [22] If a computer installed in a data center cabinet called a rack is considered a node, a single node with 8 A100 GPUs typically occupies 6 to 8 slots in a rack, and a rack has around 40 slots in total.
- [23] The power of a single floor in a typical university building is around 100 kW.
- [24] "NVIDIA Teams With Microsoft to Build Massive Cloud AI Computer" (2022)
- [25] According to my personal estimate, in the case of ChatGPT, the cost based on GPT-3.5 is over $42 per month. Refer to the link for the calculation process. Facebook Post
- [26] "NVIDIA H100 NVL for High-End AI Inference Launched" (2023)
- [27] H. Touvron et al., "LLaMA: Open and Efficient Foundation Language Models", arXiv:2302.13971 (2023)
- [28] Representative examples include Databricks' Dolly 2.0 (2023), which fine-tunes EleutherAI's Pythia-12B model on its own instruction data.
- [29] S. Rajbhandari et al., "ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme-Scale Deep Learning", arXiv:2104.07857 (2021). To train a model with 1 trillion parameters held entirely in GPU memory, without memory offload, would require 320 NVIDIA A100 (80 GB) GPUs.
25 March 2024