Tag : LLM

  • Model Variant: Easily Serving Various Model Services

    By Jihyun Kang

    Introduction

    Imagine a scenario where you need to train an AI for research purposes and produce results. In that case, our job is simply to wait for the AI to learn the data we have fed it correctly. But if we assume we are building a service that 'utilizes' AI, things get more complicated. Everything becomes a concern, from how to apply various models to the system to what criteria to use for scaling under load. We can't carelessly modify a production environment where real users exist just to answer these questions. If an accident occurs while scaling the production environment up or down, the consequences can be severe, and recovering from them takes time; we can't expect the same patience from the consumers of our service that we expect from researchers waiting for a model to finish training. Beyond the engineering difficulties, there are also cost challenges: serving a model obviously costs money, and users incur expenses even while a model is training, because resources are consumed the whole time.

    However, there's no need to worry. Many well-made models already exist in the world, and in many cases, it's sufficient for us to take these models and serve them. As those interested in our solution may already know, Backend.AI already supports various features you need when serving models. It's possible to increase or decrease services according to traffic, and to serve various models tailored to users' preferences.

    But the Backend.AI team doesn't stop here. We have enhanced the model service that has been available since Backend.AI 23.09 so that various models can be served more easily. In this post, we'll explore how to serve various models easily and conveniently.

    This post introduces features that allow you to serve various types of models more conveniently. Since we've already given an explanation about model service when releasing the 23.09 version update, we'll skip the detailed explanation. If you're unfamiliar with Backend.AI's model service, we recommend reading the following post first: Backend.AI Model Service Preview

    Existing Method

    | Requirement | Existing Method | Model Variant |
    |-------------|-----------------|---------------|
    | Writing a model definition file (model-definition.yaml) | O | X |
    | Uploading the model definition file to the model folder | O | X |
    | Model metadata needed | O | △ (some can be retrieved automatically) |

    Until now, Backend.AI's model service required, in addition to the model metadata needed to run, a model definition file (model-definition.yaml) describing, in a prescribed format, the commands to execute when serving the model. The service was launched in the following order: write the model definition file, upload it to a model-type folder so that it can be read, and mount that model folder when starting the model service. An API server would then run that automatically forwards the end user's input to the model according to the model definition file and returns the model's response. However, this approach had the disadvantage that the file had to be accessed every time the model definition needed to be modified. It was also cumbersome to write a different model definition file whenever the model changed, because the model path was fixed inside the model definition file.

    The Model Variant introduced this time is a feature that allows you to serve models immediately by inputting a few configuration values or without any input at all, using only model metadata without a model definition file. Model Variant supports command, vLLM, and NIM (NVIDIA Inference Microservice) methods. The methods of serving and verifying model service execution are as follows.

    Basically, a model service requires the metadata of the model to be served. Download the model you want to serve from Hugging Face, where model metadata is easy to access. In this example, we use the Llama-2-7b-hf and Calm3-22b-chat models from Hugging Face. For how to upload model metadata to the model folder, refer to the "Preparing Model Storage" section in the previous post.
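    For reference, one way to fetch the model metadata and weights locally before uploading them to the model folder is the huggingface_hub package. The snippet below is only a minimal sketch; the local directory and token are placeholders, and the gated Llama-2-7b-hf repository requires an approved Hugging Face access token.

    from huggingface_hub import snapshot_download

    # Download the Llama-2-7b-hf repository (weights, tokenizer, config) to a local
    # directory, then upload its contents into a Backend.AI model-type folder.
    snapshot_download(
        repo_id="meta-llama/Llama-2-7b-hf",
        local_dir="./llama-2-7b-hf",  # placeholder local path
        token="hf_xxx",               # placeholder Hugging Face access token
    )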

    Automatically Serving Model from Built Image (Command Method)

    The first method, the command method, bakes into the execution image the command that would otherwise be written in the model definition file to serve the model. The command to execute is specified in the image's CMD, so once the image is built it can be run immediately, without any additional input, when actually serving the model. The command method does not support a health check, which verifies whether the service is running properly, so it is better suited to quickly standing up and checking a prototype service than to large-scale serving. The execution steps are as follows:

    1. On the start screen, select Llama-2-7b-hf in the Model Storage To Mount item to mount the model folder containing the model metadata corresponding to the model service to be served, and select Predefined Image Command in the Inference Runtime Variant item.

    Activate the Open To Public switch button if you want to provide model service accessible without a separate token.

    [Image: Model service start screen - mounting model metadata and selecting CMD]

    2. Select the environment to serve. Here, we use vllm:0.5.0 and allocate 4 CPU cores, 16 GiB of memory, and 10 fGPU of NVIDIA CUDA GPU as resources.

    [Image: Model service start screen - selecting the execution environment and allocating resources]

    3. Finally, select the cluster size and click the start button. The cluster size is set to single node, single container.

    [Image: Model service start screen - selecting the cluster size and starting]

    If the service has been successfully launched, the service status will change to HEALTHY and the endpoint address will appear.

    [Image: Model service detail screen]

    Verifying the Service

    If the service has been launched normally, check the service model name with the cURL command:

    curl https://cmd-model-service.asia03.app.backend.ai/v1/models \
    -H "Content-Type: application/json"
    

    [Image: Checking the model name]
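    If you prefer to script this check, the same /v1/models endpoint can be queried from Python. This is a minimal sketch assuming the requests package and the OpenAI-compatible response format ({"object": "list", "data": [{"id": ...}]}) that vLLM-based services expose:

    import requests

    # Query the OpenAI-compatible /v1/models endpoint of the example service
    # and print the names of the served models.
    resp = requests.get(
        "https://cmd-model-service.asia03.app.backend.ai/v1/models",
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()
    for model in resp.json()["data"]:
        print(model["id"])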

    Now, let's send input to the service with the cURL command and check the response:

    For model services run in command mode, the model name is already defined in the image, so once you have checked it, you must pass that model name as the value of the model key when sending a request.

    curl https://cmd-model-service.asia03.app.backend.ai/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "image-model",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0}'
    

    [Image: Model service request result screen]

    Serving Models in vLLM Mode

    The vLLM mode is similar to the command method introduced earlier, but various options entered when running vLLM can be written as environment variables. The execution method is as follows:

    How to Run

    1. On the start screen, mount the model folder for the model service to be served and select vLLM in the Inference Runtime Variant item.

    [Image: Model service start screen - mounting model metadata and selecting vLLM]

    2. Select the environment to serve. As with the command method explained earlier, select vllm:0.5.0. You could allocate the same resources as before, but this time we allocate 16 CPU cores, 64 GiB of memory, and 10 fGPU of NVIDIA CUDA GPU.

    [Image: Model service start screen - selecting the execution environment and allocating resources]

    3. Finally, select the cluster size and enter the environment variable BACKEND_MODEL_NAME. This value corresponds to vLLM's --model-name option and becomes the model value that users specify when sending requests to the service. Once the input is complete, click the START button to create the service.

    [Image: Model service start screen - selecting the execution environment and allocating resources]

    Likewise, if the service has been successfully launched, the service status will change to HEALTHY, and the endpoint address where the service is launched will appear.

    [Image: Model service detail screen]

    Verifying the Service

    Let's send input to the service with the cURL command and check the response. This time, enter the BACKEND_MODEL_NAME value you set earlier as the model value.

    curl https://vllm-calm3-22b-chat.asia03.app.backend.ai/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "vllm-model",
    "prompt": "初めて会う日本人ビジネスマンに渡す最高の挨拶は何でしょうか?",
    "max_tokens":  200,
    "temperature": 0
    }'
    

    [Image: Model service request result screen]
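    Because the vLLM runtime exposes an OpenAI-compatible API, the same request can also be made with the openai Python client. The following is a minimal sketch assuming the example endpoint above and a model name of vllm-model set via BACKEND_MODEL_NAME; if the service is not open to the public, supply the issued token in place of the placeholder API key.

    from openai import OpenAI

    # Call the vLLM-based model service through its OpenAI-compatible API.
    client = OpenAI(
        base_url="https://vllm-calm3-22b-chat.asia03.app.backend.ai/v1",
        api_key="token-placeholder",  # replace with your service token if required
    )

    completion = client.completions.create(
        model="vllm-model",  # the value set in BACKEND_MODEL_NAME
        prompt="初めて会う日本人ビジネスマンに渡す最高の挨拶は何でしょうか?",
        max_tokens=200,
        temperature=0,
    )

    print(completion.choices[0].text)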

    Serving Models in NIM Mode

    To run NIM, you need an API key issued from an account that can access NGC's NIM model registry. For how to obtain the key value, please refer to the following content: NVIDIA Docs Hub : How to get NGC API Key

    The NIM (NVIDIA Inference Microservice) mode is also similar to the command mode, but it must be run with an image that has NVIDIA's NIM-supporting model server built-in. Also, when loading the model, the NGC API key value is needed. Assuming everything is ready, let's start the model service.

    How to Run

    1. On the start screen, select an empty model type folder to cache the metadata to be received from NIM, and select NIM in the Inference Runtime Variant item.

    [Image: Model service start screen - mounting the model folder and selecting NIM]

    2. Select the environment to serve. Here, we use ngc-nim:1.0.0-llama3.8b and allocate 8 CPU cores, 32 GiB of memory, and 15 fGPU of NVIDIA CUDA GPU as resources.

    [Image: Model service start screen - selecting the execution environment and allocating resources]

    3. Finally, select the cluster size and set the environment variable HF_HOME to the default path /models. Then add NGC_API_KEY and enter the issued key value. Once the input is complete, click the CREATE button to create the service.

    [Image: Model service start screen - selecting the cluster size, entering environment variables, and starting]

    When using NIM, the first launch may take some time because model metadata is downloaded from the repository. You can check the progress on the session page by viewing the container logs of the routing session that backs the service.

    [Image: Routing session corresponding to the model service]

    [Image: Container log screen showing data being received from NIM]

    As with the command and vLLM modes, if the service launches successfully, its status will change to HEALTHY. Let's send a request to the endpoint address of the launched service as follows and check the response.

    Verifying the Service

    from openai import OpenAI

    client = OpenAI(
        base_url="https://nim-model-service.asia03.app.backend.ai/v1",
        api_key="$YOUR_NGC_API_KEY",
    )

    completion = client.chat.completions.create(
        model="meta/llama3-8b-instruct",
        messages=[
            {
                "role": "user",
                "content": "Hello! How are you?"
            },
            {
                "role": "assistant",
                "content": "Hi! I am quite well, how can I help you today?"
            },
            {
                "role": "user",
                "content": "Can you write me a song?"
            }
        ],
        temperature=0.5,
        top_p=1,
        max_tokens=1024,
        stream=True
    )

    for chunk in completion:
        if chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="")
    

    [Image: Model service request result screen]

    Conclusion

    The Model Variant feature will be of great help to researchers and companies aiming to provide actual services with already trained models. Based on a powerful resource management system and support for various AI accelerators such as NVIDIA GPU, AMD ROCm, TPU, Graphcore IPU, Furiosa Warboy, Rebellions ATOM, Hyperaccel LPU, etc., Backend.AI now provides an integrated environment that can easily deploy services beyond simply training models. Try serving your desired AI model anytime with Backend.AI!

    11 July 2024

  • [Featured articles] Scale entanglement

    By Jeongkyu Shin

    This article originally appeared in Crossroads, May 2023. [^Editor's note]

    This article is translated from Korean.

    The original order of the essay is 2023 > 2015 > 2020 > 2017 > 2018 > 2019 > 2021 > 2022 > 2023.
    While the emotional flow follows that order, the essay has been re-edited in chronological order for the reader's understanding.

    After 2023, March 14th may no longer be known as Pi Day, but rather as Chatbot Day.

    It was the day when all the language models that had been in storage were simultaneously unleashed into the world. Starting with Google's release of PaLM fine-tuning + generative models on Vertex AI, followed by OpenAI's announcement of GPT-4, Microsoft officially confirming that Bing was already using GPT-4, and Anthropic's formal release of Claude bot, all within a span of 12 hours.

    That morning, after reviewing the GPT-4 tech report released by OpenAI, I left a post on Facebook about the technical aspects that impressed me.[1] A comment was left: "I feel both joy and pain seeing things I wondered if I would ever see in my lifetime becoming reality..." I replied, "Now, no one is interested in the Turing test anymore. Within a year, it has become a question of 'Isn't it obvious that it will pass without a wow point?'"

    When actually facing the keyboard, the task of organizing knowledge seems to have already left human hands. I tap into human stories to find the meaning of recording.


    To avoid sounding like an alien language, let me briefly touch upon the aspects of artificial neural networks necessary to understand this essay.

    A program that simulates the connections between neurons is called an artificial neural network. Neurons are grouped into layers, and the network is designed by overlapping these layers and creating connections between neurons in adjacent layers. Deep learning is a term used when there are many layers in an artificial neural network. The results of various artificial neural networks are called deep learning models, or simply AI models to sound more impressive.

    The connection strengths between neurons are called parameters[2]. The number of connection strengths is called the number of parameters. As the number of parameters increases, more memory is required, so it is said that the model becomes larger. Training a model involves connecting artificial neurons and adjusting the connection strengths between them so that the output for the input data takes the desired form. After training, an artificial neural network can mimic an extremely high-dimensional, discontinuous state space.
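    As a toy illustration, counting only the weights and biases of a small fully connected network (the layer sizes below are arbitrary) shows how quickly the parameter count, and with it the memory footprint, grows:

    # Toy illustration: parameter count and rough memory footprint of a small
    # fully connected network (weight matrix plus bias vector per layer).
    layer_sizes = [1024, 4096, 4096, 1024]          # neurons per layer (arbitrary)

    params = sum(n_in * n_out + n_out               # weights + biases
                 for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

    bytes_per_param = 2                             # 16-bit floating point
    print(f"{params:,} parameters ≈ {params * bytes_per_param / 2**20:.1f} MiB")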

    Those are the basic terms. So, when should we start recalling the story?


    2015

    I founded Lablup Inc. The name was chosen as a pun, meaning both "lab eul up" and "lab | up". (eul is a Korean particle that links the noun to the verb that follows; | is the Linux pipe operator.)

    People who had been suffering through their Ph.D. programs came together with the goal of creating a research automation platform for computational science, to alleviate the hardships of those who came after. We believed that instead of clumsily running clusters by placing workload managers[4] on bare metal[3], we needed a computational environment that guarantees reproducibility and portability. The start was courageous, but there was no demand for a research platform. Within two months of founding the company, we learned the hard way that universities and research institutions, however readily they purchase equipment, are stingy when it comes to spending money on software. Universities lacked funds but had an abundance of graduate students who would figure things out on their own when told to. The industry did not yet have large-scale demand for scientific computing. At the same time, we painfully learned that Ph.D. holders like us, who had only ever been in school, needed to go through a process of re-socialization before we could communicate like normal people.

    It wasn't just our lack of re-socialization that was the problem. It was the content of what we were saying. Stories about science advancing based on technology or stories about accelerating innovation based on computation were considered science fiction topics wherever we went. We were getting exhausted. Yet, as they say, people only see what they can see. The potential of deep learning models was clearly emerging. In fact, various movements had started to prevent deep learning technologies and their outcomes from being subordinated to capital. One of the representative organizations was OpenAI[5]. Such changes seemed to be evidence that we were heading in the right direction. It felt like the direction would become a bit clearer after just one more year.

    On the verge of our first anniversary, deep learning garnered significant social interest. Around the end of 2015, TensorFlow[6] was released to the world. As the first lecture material for the coding platform prototype, we translated the entire TensorFlow manual and uploaded it. When AlphaGo had its match in March 2016[7], our service crashed for the first time due to the influx of people. If it weren't for that match at that time, perhaps there wouldn't be a sequel to this story. Thankfully, because of that, the company survived.

    We needed a demo of a research platform performing massive-scale computations. Language models were a topic that required enormous computational resources and could be immediately tackled as a hobby since my Ph.D. program. In 2016, we released a chatbot created by placing a language model on the platform we were developing. It attracted a lot of attention.[8][9] The chatbot quickly became an in-house project. However, after just over a year, the chatbot project was shelved.

    2017

    In the fall of that year, Lablup decided to halt the development of language models, which had been a side project, and focus solely on Backend.AI, an AI cluster management platform.

    The shock of the Google Assistant demo I saw in Krakow, Poland, where I was invited by Google, was significant. The topic of that meeting, which was attended by just over ten people, clearly showed that language models had now become part of the arms race for resources, and without large-scale investment, it would be impossible to keep up with the changes thereafter. Over dinner with Donghyun Kwak, who attended the Google Developer Summit with me, and Sanghun Lee, who visited Krakow to attend a conference held at the same time and place, we discussed, "It seems like the future I saw in the field of physics will start here as well."

    Nearly eighty years ago, the Manhattan Project demonstrated forcefully to all of humanity, through nuclear weapons, that technology is power. Physics was no longer an object of romance but an object of investment. The era of physicists making a living, which began that way, led to changes across big science, connecting to the space program and particle physics. On the way back to the accommodation that night, I sent a message to the team members: "Let's not develop language models anymore. From now on, we won't have the funds to keep up."

    History always repeats itself. Then it's not difficult to predict what changes will occur in the future. It's just a matter of timing. We shared the opinion that 'the inflection point will probably be in 2020.'[10] By then, wouldn't it be possible to achieve profitability? It became the company's goal. Language models were advancing beyond LSTM-based machine translation. The AlphaGo shock became the source of countless jokes people made about AI. A tremendous number of "AI companies" were born. However, most of them became coin companies or metaverse companies two years later.

    2018

    The transformer[11] architecture began to be applied to all kinds of language models. By telling the model 'what' to focus on, transformers solved many of the problems of remembering, maintaining, and emphasizing context. They could be used for encoding, to put information into the state space, or for decoding, to extract information from it. Google introduced BERT[12], while OpenAI introduced GPT. Both were transformer-based language models, but they differed in their focus on the encoder and the decoder, respectively. BERT concentrated on the encoder part, while GPT implemented a structurally different architecture that builds a memory of causal relationships by linking the output back to the input through the decoder. BERT and GPT, and later T5, no longer relied on labeled corpora. With transformers, a language model could be created by first training it to learn the structure of language from the corpus itself and then fine-tuning it. This was still far from the end-to-end philosophy of AI model development, in which humans do not intervene in the data at all. But from that point on, the concept of data acquisition began to change: quantity mattered more than labeling. It was the beginning of general-purpose language models.

    BERT seemed to be able to replace most of the existing language models at an incredibly fast pace. Its overwhelming performance raised expectations for making significant improvements in various language tasks such as document creation, chatbots, and document analysis. However, as of 2018, BERT was a model so large that it was unimaginable to train. It seemed like no one outside of Google would be able to create a model of that size. But that moment was fleeting. Facebook quickly released RoBERTa[13], which was based on the BERT paper but increased in size. It was a symbolic action that simultaneously announced that TPUs were not necessarily required and that anyone with capital could participate in this race.

    The first bottleneck in increasing model size using GPUs appeared in GPU memory. Models could no longer fit on a single GPU device or be trained in time using a single device. Depending on the case, it became common to split models across multiple GPUs or distribute training across multiple computing nodes. Horovod and Distributed TensorFlow began to shine.

    Technology continues to advance, and the cost per unit of computational resources continues to decrease. If this progress continues, the popularization of AI will eventually take place, and at that point, price competitiveness, which is the same in all markets, will become the most important factor. "The era of price competitiveness will come for AI as well," I wrote, praying not to go bankrupt until then.

    • BERT, 340 million parameters
    • GPT, 110 million parameters

    2019

    For several years while creating a distributed processing and distributed training platform, I occasionally wondered, 'Could we be creating a platform with no demand?' After 2019, I no longer had such thoughts. From 2020, there was no time for such thoughts.

    As soon as the new year began, OpenAI announced GPT-2[14]. The GPT model, which focused on the decoding process of extracting information from the topological space, demonstrated remarkably stable text generation capabilities. GPT-2 became the base code that anyone could use to create a language model. Along with PyTorch, Horovod, and Distributed TensorFlow, the barrier to accessing such code was dropping rapidly. Google's XLNet and T5 (Text-To-Text Transfer Transformer) language models in 2019 seemed to have crossed, by spending capital, the river of model sizes that had been thought impossible to cross. Google stressed that training T5 required enormous computational resources at the TPU level, emphasizing that it would take hundreds of NVIDIA V100 units, which could at least be purchased on the market with some effort. (A single V100 cost around 12K USD at the time.) Like BERT, T5 was released only as a paper, without the model. Google had the painful experience in 2018 of releasing the BERT paper without immediately releasing the model alongside it (because training was not yet complete), and in that gap Facebook preemptively released RoBERTa, which they had trained by scaling up the same model. Considering that Google still did not release T5, they must have been confident that it would be difficult for anyone outside Google to reproduce its training.

    At the end of 2019, we ended our long wandering and moved into our own office. The era of massive deep learning models was coming, and to prepare for it, we needed to collaborate with more people. The size of language models was increasing tenfold every year. As I carried the not-so-many belongings in boxes, I questioned myself. At this rate, in three years, the model size would increase a thousand-fold, but were we ready to handle a thousand times the workload?

    • RoBERTa, 350 million parameters
    • Transfer ELMo, 460 million parameters
    • GPT-2, 1.5 billion parameters
    • T5, 11 billion parameters

    2020

    Two months had passed since we moved to the new office. The winter was long.

    By the end of February, we were unable to complete the interior of the new office we had moved into. The wall finishing materials for the office, which were supposed to arrive from China at the end of February, never made it. The company's entire roadmap changed. All business trips to the United States were canceled. The office with unfinished interiors remained unfinished and guarded the empty space for the next two years.

    COVID-19 not only tore apart the company's future but also separated people. My first child started elementary school sprawled diagonally on the floor, watching the 'tiger' teacher (a Korean expression for a strict teacher) on the EBS educational broadcasting channel. I rolled around on the floor next to my rolling child. How busy had I been? Even while raising my children, it wasn't until the coronavirus trapped us in the same house that the reality of being a father sank in. I wondered how long that sad yet strangely stabilizing time would last.

    That year, OpenAI released GPT-3. The theoretical foundation was not significantly different from GPT-2. However, one thing was vastly different: the size. It was a massive model with 175 billion parameters. Not only for model training but simply loading it onto a GPU was expected to occupy an entire NVIDIA DGX-2 supercomputing node. Unlike GPT-2, this time, neither the code for the language model nor the trained language model was released. Wow. Non-disclosure in the field of deep learning. Something was changing.

    There was a movement against the supremacy of model size. Does performance increase accordingly as the size of deep learning models grows? A debate began between researchers at Google and Meta. One side argued yes, while the other argued no, engaging in a word battle in the form of papers. However, this debate, which lasted from 2019 to 2021, did not last long. As the size of language models increased, interesting phenomena were discovered. Deep learning models had scaling laws[15]. Something happened around 100 billion parameters. Regardless of the model structure, at some point beyond 100 billion parameters, language models began to do unexpected things beyond simply generating coherent speech. The phenomenon called in-context learning allowed models to learn various knowledge on the spot without model training and derive logical conclusions. It was the beginning of the race surrounding large language models (LLMs).

    While language models were simultaneously immersed in the debate and discoveries surrounding the size problem, the introduction of deep learning in medical applications began at a tremendous pace. DeepMind's AlphaFold2 achieved high-accuracy structure prediction without Monte Carlo simulations, using only predictions. It reduced the required computation, which was a major challenge in the field of proteomics, to nearly one-thousandth of the previous level. From microscopic stages such as predicting coronavirus mutations, filtering vaccine candidates among synthetic substances, and predicting new synthetic structures to predicting transmission routes and estimating the number of infected individuals, the application of AI models expanded to various fields. Everyone started leaping without looking. The snowball of resource scale rolled at a tremendous speed.

    In the second half of the year, discussions about the scale of computational resources to increase model training speed took place. It was different from the existing competition to secure deep learning computational resources for research goals. Scale gave rise to operational and optimization demands, and thanks to that, we were able to achieve the 'profitability from 2020' that we had anticipated in 2017. The demand for platforms increased, but at the same time, we were forced to work from home, and most communication became text-based. Although many people joined us later, some of them were destined to become colleagues who had never met each other in person until the workshop in early 2023.

    The scale of compute resources was growing, and all eyes were on GPUs, but as the models got bigger and the number of GPUs increased, the GPUs became less of a bottleneck. The biggest challenge was data storage. In training, you need to feed data to hundreds of machines. The absolute speed of storage hasn't kept up with the growth in the number of GPUs. Most of the problems we had to solve in 2020 came from storage bottlenecks.

    Another kind of change, slower but deeper than the deep learning race, has been happening: it's become second nature to people of all generations to treat online relationships as normal relationships. And then there comes a moment when you wonder, "Does it really make a difference to me if the person on the other end of the line is human or not, as long as they speak good enough?"

    • T-NLG, 17 billion parameters
    • GPT-3, 175 billion parameters
    • Gshard, 600 billion parameters

    2021

    The race for developing large language models that followed T5 and GPT-3 was growing in fascination. The simplest way to find out if performance improvement continues as the size increases is to make it even bigger. Various theories emerged as to why large language models produce peculiar results, but the answer was still unclear. A hypothesis emerged that when the state space is sufficiently large, a kind of phase transition occurs in the process of handling information. One of the candidate explanations for why the transformer structure handles these tasks well is that the transformer structure is a special case of graph neural networks (GNNs).[16] Graph neural networks, which gained attention since 2018, are neural networks that learn the relationships of objects and are known to be very powerful in processing semantics or taxonomies.

    Microsoft's DeepSpeed framework[17] for distributed model training began to be widely used. DeepSpeed's ZeRO optimizer focuses on reducing GPU memory usage by distributing workloads across various hardware, from CPUs to GPUs, and by partitioning model states. Open-source language models also emerged, since OpenAI was no longer releasing its models and was instead selling exclusive rights to use them. With accessibility reduced, various language models appeared, but they could not meet the high expectations because they did not match the scale of the large language models.

    The scale of GPUs handled by users easily began to exceed triple digits. Various massive-scale tests tailored to the actual workloads run by institutions became necessary. We started running language models again for system testing purposes, which we had sent to the realm of hobbies at the end of 2017. The largest Portuguese language model in the world was born on our platform and was briefly introduced in passing during the keynote at the NVIDIA GTC conference. In the same conference, a tutorial session titled "Fine-tuning BERT in 60 seconds" was held. BERT was no longer a massive model but a subject of practice.

    As the model size rapidly grew, the problems to be solved also changed. When models had to be split and loaded onto multiple GPUs, communication between GPUs became extremely important. GPUs not only shared memory access within a node but also increasingly communicated across multiple nodes. It became common to attach one InfiniBand, which transmits 200 Gb per second, to each GPU to create a GPU network.

    Amidst the complex and hectic changes, a thought occurred to me. The process of large language models learning 'language' is based on unclassified corpora. What does the large language model 'learn' in that process? Although corpora are used for the purpose of learning the structure of language, language cannot be separated from information. In fact, don't language models that have not been explicitly taught knowledge readily answer questions? Language itself is a protocol for humans to convey information to each other. The conversation process involves computing answers to data transmitted through the protocol and responding with data again. Then, is our perception of having developed an 'AI that converses well' really about developing an AI model that creates language well, or have we created something beyond that?

    The following year was going to be the first year of services that are only possible with AI, not services that have been improved with AI. However, no one was yet thinking about providing the outputs of large language models as services. That was a task for someone in the future.

    • GPT-J, 6 billion parameters
    • LaMDA, 160 billion parameters
    • PanGU-alpha, 200 billion parameters
    • Gopher, 280 billion parameters
    • Pathways, 530 billion parameters
    • Switch-C, 1.6 trillion parameters
    • Wudao 2, 1.75 trillion parameters

    2022

    The COVID-19 endemic was creating tremendous aftereffects. Numerous IT companies that had grown due to the special circumstances of the coronavirus and many companies that had tried to shift their offline operations online were dumbfounded by the demand for the metaverse, which suddenly disappeared like a mirage. The field of deep learning had not yet generated significant revenue sources. Numerous companies began downsizing their AI teams. Many researchers came out.

    It wasn't that technological advances in AI were no longer needed. Rather, the massive scale underlying AI development overwhelmed every other advance. As in the era of big science, equipment was once again the most expensive item. This was the result of three years in which innovation had come from scale. The singular behavior appearing in large language models began to be regarded as a kind of emergence.[18] Small-scale studies were no longer attractive. Deep learning researchers grew anxious. The problem was not so much the diminished interest as a spoonful of mild despair over what research could be done with just a few GPUs.

    Nevertheless, several innovations emerged from the beginning of the year. In addition to training with well-defined data, a model tuning method appeared in which humans actually evaluate the answers and give higher weight to the better ones. RLHF (Reinforcement Learning from Human Feedback), which puts humans in the loop to apply reinforcement learning to language model training, showed in InstructGPT in 2022 a tremendous improvement in the performance of language models of the same size. Numerous models began to apply RLHF; given the scaling laws governing model size, there was no reason not to. In March, µ-Parametrization[19], which could dramatically reduce the cost of model training, was introduced. Its conclusion, that the hyperparameters of a large model can be predicted in advance using a small model, substantially reduced the effort of hyperparameter search when building massive models. This research became the basis for GPT-4 training.

    As a result of the U.S.-China trade conflict, the United States banned NVIDIA's export of AI training GPUs to China. Within a few days, China announced a large language model trained solely with its own semiconductors[20]. Soon after, NVIDIA slightly changed the name of the same GPU with its GPU networking features removed and resumed exporting. The interest in the AI service sector continued to grow due to models like DALL-E2 and Stable Diffusion, and the market for generative AI models, such as images, began to fluctuate.

    In late November, OpenAI opened a chatbot service to the public. It was based on GPT-3.5, an improved version of GPT-3. An interesting point was that instead of creating a language model that excels at programming by training programming code on a human language model, it was created by training human language on a model trained with programming language data. It became clear how training with programming code influenced the logical structure training of large language models. In early December, the service named ChatGPT[21] sparked interest in large language models based on its tremendous accessibility open to the entire public.

    Towards the end of the year, my acquaintances in the AI field who seemed to be on the verge of losing their jobs were bewildered by the support that suddenly improved in real-time. The movement of companies that had been downsizing their AI organizations, riding the wave of endemic layoffs, came to a halt. Leaders who had been pressuring to downsize research organizations and evaluate outcomes just a week before were now shouting for AI. The requirements for model service frameworks suddenly began to change. The goal of large language models became commercialization. Models had become so large that there was no longer any meaning in distinguishing between computational resources for training and services. AI model training and services, which originally belonged to different domains, suddenly merged into one.

    Bigger problems await. Large language models consume an enormous amount of power. GPUs consume a tremendous amount of power. Although their power-to-performance ratio is tremendously better compared to CPUs, their absolute power consumption is too high. A node with 8 NVIDIA A100 GPUs[22] consumes about 7 kW, and a node with 8 H100 GPUs, the highest-performing GPUs as of 2023, consumes about 12 kW[23]. Since 2019, it has become no joke to say that you have to construct a building first to install the equipment. After experiencing power issues at a supercomputing cluster located in Brazil in 2021, we ported the entire platform to Arm-based systems. It was in anticipation of power issues becoming a problem a few years later. In the case of Microsoft, they shared their experience of building a GPU center right next to a hydroelectric power plant, taking into account the electricity costs[24].

    Weekends diminished. There was too much to do. There was no time. It wasn't just us.

    Now, no one had time.

    • Flan-T5, 110 billion parameters
    • GLM-130B, 130 billion parameters
    • OPT-175B, 175 billion parameters
    • BLOOM, 176 billion parameters
    • PaLM, 540 billion parameters

    2023

    On February 8th, Microsoft and Google made announcements about their large language model-based services at a 17-hour interval. Microsoft announced plans to introduce GPT models into its search engine Bing, its Office suite, and Windows 11. Google introduced Bard based on LaMDA. Baidu released Ernie Bot. The two companies promoted the future rather than tangible services. Tools that couldn't be tried out relatively failed to attract interest.

    The "era of AI price competitiveness" that I had thought would come someday had arrived. However, the upfront expense itself was too high. Services like ChatGPT and Bard incur serving costs too high to be justified by economic logic.[25] It is a future that competition has pulled forward too quickly, and it arrived after everyone had already experienced that future firsthand. The problem was that expectations had skyrocketed.

    The suddenly approaching large language model services are creating another bottleneck. Services that perform inference based on CPUs have been affected by the significant reduction in memory bandwidth per CPU core. This is because the number of cores per CPU has rapidly increased. Services that perform inference based on GPUs lack both the capacity and speed of GPU memory where models are loaded. The bottleneck of memory, which was expected to come someday, suddenly became a direct problem due to the commercialization of large language model services. It was an anticipated bottleneck since 2021. CPU and GPU developers such as Intel, AMD, and NVIDIA had prepared for this situation in advance. From the end of 2022 to early 2023, they introduced various hardware such as Intel's Xeon Max, AMD's MI200, and NVIDIA GraceHopper.

    If an AI model is extremely large, computational power becomes relatively less important. When NVIDIA A100 was first announced, it was released with a 40 GB model, but a year later, they released an 80 GB memory model. Whether it was the training process or the inference process, the size was too large to load and unload models from memory. Additionally, the process of "inferring" large language models brought about a reversal in thinking about GPUs and NPUs. Unlike the training process, which requires constantly updating weight matrices, the inference process operates by flowing input data through a fixed model structure loaded into memory and observing the results. Therefore, the proportion of computation is tremendously reduced, and the speed of memory becomes extremely important. In the second half of 2022, NVIDIA announced the H100 with 80 GB of memory capacity. However, less than half a year later, when hardly anyone had received the actual H100, they introduced the H100 NVL with 188 GB capacity.[26]
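    As a rough back-of-the-envelope illustration (counting only the weights and ignoring activations and the KV cache), simply holding a GPT-3-scale model in 16-bit precision already spills across several of the largest GPUs:

    # Back-of-the-envelope: memory needed just to hold the weights in 16-bit precision.
    params = 175e9                  # a GPT-3-scale model, 175 billion parameters
    bytes_per_param = 2             # fp16 / bf16
    weight_bytes = params * bytes_per_param

    gpu_memory_gib = 80             # e.g. a single 80 GB A100 or H100
    print(f"weights alone: {weight_bytes / 2**30:.0f} GiB "
          f"≈ {weight_bytes / (gpu_memory_gib * 2**30):.1f} × 80 GB GPUs")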

    Meta introduced LLaMA[27], a language model that could, with some effort, be run on personal servers. Despite being saddled with all sorts of license restrictions, LLaMA spread through illegally leaked copies, and Alpaca, Stanford's fine-tuned LLaMA, showed that significant performance could be achieved even with (relatively) small models. Since then, various language models free of license issues have been released one after another[28], fueling the potential of open language models while raising a new question: what parameter size is good enough? If the model is small, emergent abilities do not appear and it cannot serve as a multi-modal model; if the model is large, it costs too much money to actually operate.

    How large can large language models grow? Signs of preparation for even larger models can be seen everywhere. Microsoft's DeepSpeed framework, which is most commonly used for distributed model training, added ZeRO Infinity[29] in 2021, an extension that can train models with 1 trillion to 10 trillion parameters by utilizing NVMe SSDs. However, models with such a large number of parameters are practically impossible to serve. In practice, the approach is to set a limit on the model size that can be served and fine-tune within that range. Technologies like ZeRO were developed to train ultra-large-scale models, but they are being widely applied as they enable fine-tuning with very few resources.

    • PaLM-e, 560 billion parameters
    • Pythia, 12 billion parameters
    • LLaMA, 6.5 billion parameters

    And numerous other models with ~12 billion parameters


    Various attempts are being made on numerous 12-120 billion parameter models that are 'good at talking.' LLaMA unintentionally spread foundation models that individuals could experience. Many people realized that the level of "models that are good at talking" that can satisfy ordinary people had been achieved long ago. Individuals or organizations with some computer knowledge and the ability to spend money have gained the courage to attempt fine-tuning language models in various ways.

    At the same time, it is becoming known that the computational resource requirements of models beyond just being good at talking are on a different level. For about half a year, the size of newly emerging large language models has been maintained at less than 600 billion parameters. It could be that further size expansion does not yield enough results, or it could be a technical barrier created by the current hardware and costs. Or, it could be a movement to keep the size below a certain level because that size is in a range where commercialization is impossible.

    Backend.AI, which started as an open-source project with 4 GPUs in 2015, handles several thousand GPUs in 2023 and will soon reach ten thousand. All environments, including ours, have changed tremendously. The more you dig into problems, the more problems keep emerging like potato stems. While living and solving numerous problems entangled with the size of large language models, I sometimes wonder where the end of this problem will reach.

    On nights full of thoughts, I sometimes think that, like the Turing test that unknowingly drifted away from our attention, we may have all passed a certain point without realizing it. It seems like we have either solved a problem that needed to be solved or solved a problem that shouldn't have been solved yet. Complex emotions of excitement turning into dizziness and expectations turning into depression come and go.


    • [^Editor's note] Crossroads is a science web journal launched by the Asia Pacific Center for Theoretical Physics, aimed at 'Science, Future, and Humanity' to showcase a scientific vision of the future through various genres of scientific writing, including science features, essays, columns and novels.
    • [1] Facebook Post
    • [2] There are various parameters besides the connections between neurons, but since their number is relatively small compared to the model size, this has been greatly simplified here for convenience.
    • [3] Physical computers in their raw form, not virtual machines. In the cloud, it is common to run virtual machines on top of a hypervisor or manage them based on containers to reduce management overhead and flexibly manage resources. Due to cost issues, it has not yet been popularized in small research institutes and universities.
    • [4] Job Scheduler. Software that helps manage and execute processes. Slurm and others are commonly used.
    • [5] https://www.openai.com (2015). Since 2020, OpenAI has not released implementations, and since 2023, they have only provided tech reports instead of papers. As of 2023, there are various opinions on whether OpenAI is still an AI development organization that pursues openness.
    • [6] https://www.tensorflow.org, Google (2015)
    • [7] "AlphaGo - The Movie" For those who couldn't feel the atmosphere at the time, refer to the documentary (2018)
    • [8] J. Shin "Creating AI chatbot with Python 3 and TensorFlow" PyCon APAC 2016 (Korean) / (English) (2016) Although there are various presentation videos on the same topic as I had the opportunity to introduce it in several countries, these two are the first presentations.
    • [9] J. Shin "Android Dreaming of Electric Sheep: Implementing Chatbot Emotion Model Using Python, NLTK, and TensorFlow" PyCon KR 2017 (2017)
    • [10] A record of an interview at the Google Startup Campus remains on YouTube.
    • [11] "Transformer (machine learning model)"
    • [12] J. Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" Arxiv:1810.04805 (2018)
    • [13] Y. Liu et al., "RoBERTa: A Robustly Optimized BERT Pretraining Approach" Arxiv:1907.11692 (2019)
    • [14] A. Radford et al., "Language Models are Unsupervised Multitask Learners", (2019)
    • [15] J. Kaplan et al., "Scaling laws for neural language models" Arxiv:2001.08361 (2020)
    • [16] C K. Joshi, "Transformers are Graph Neural Networks", The Gradient (2020)
    • [17] Microsoft, "DeepSpeed: Extreme Speed and Scale for DL Training and Inference", (2019)
    • [18] J. Wei et al., "Emergent abilities of large language models" Arxiv:2206.07682 (2022)
    • [19] E. Hu, G. Yang, J. Gao, "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer" Arxiv:2203.03466 (2022)
    • [20] A Zeng et al., "GLM-130B: An Open Bilingual Pre-trained Model" Arxiv:2210.02414 (2022)
    • [21] OpenAI, "Introducing ChatGPT" (2023)
    • [22] If a computer installed in a data center cabinet called a rack is considered a node, a single node with 8 A100 GPUs typically occupies 6 to 8 slots in a rack, and a rack can accommodate around 40 nodes.
    • [23] The power of a single floor in a typical university building is around 100 kW.
    • [24] "NVIDIA Teams With Microsoft to Build Massive Cloud AI Computer" (2022)
    • [25] According to my personal estimate, in the case of ChatGPT, the cost based on GPT-3.5 is over $42 per month. Refer to the link for the calculation process. Facebook Post
    • [26] "NVIDIA H100 NVL for High-End AI Inference Launched" (2023)
    • [27] H Touvron et al., "LLaMA: Open and Efficient Foundation Language Models" Arxiv:2302.13971 (2023)
    • [28] Representative examples include Dolly 2 (2023), which combines EleutherAI's Pythia-12B model with its own data.
    • [29] S. Rajbhandari et al., "ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme-Scale Deep Learning", Arxiv:2104.07857 (2021) To load a model with 1 trillion parameters onto GPUs for training without memory offload, 320 NVIDIA A100 (80 GB) GPUs are required.

    25 March 2024
