Culture
Lablup with PyCon Korea 2024: lambda submit: Starbucks if submit == "duck" else None
By Jinho Heo
Hello, I'm Jinho Heo, a Technical Writer at Lablup.
Lablup participated in PyCon Korea 2024, held at the Suwon Convention Center on October 26th and 27th, as a platinum sponsor.
Lablup's founding philosophy is strongly tied to open source. It's not a stretch to say that open source is in Lablup's blood. Among the many open source communities out there, Lablup has a particularly strong relationship with Python. Lablup actively contributes to open source projects such as aiohttp, which is built on asyncio, and also contributes to Python itself. Our deep ties extend beyond Python to PyCon: members of Lablup have delivered presentations on diverse topics at PyCons around the world, and Lablup has sponsored PyCon Korea five times.
This year's PyCon was a big one for Lablup, because CEO Jeongkyu Shin and CTO Joongi Kim delivered the keynotes on both days. Jeongkyu Shin gave a talk titled 'Python, PyCon, the Dinosaur Age, and the Planet of Chickens'. The title is confusing at first glance, but he likened Python to dinosaurs, the first evolutionary victors, in two respects: its rapid growth into the world's number one language, and its status as a key language in the era of massive AI, which creates innovation through massive computation. This leads to the 'chicken', the most consumed meat in the modern world, which he compared to Python's accessibility and universality that make it easy for anyone to learn and use. Shin's presentation was well received, entertaining the audience with its witty title, his verbal skills, and a variety of AI-generated illustrations.
Joongi Kim, the CTO, delivered a presentation entitled "10 Years of PyCon and Me." Across four chapters, he reflected on a decade of his PyCon presentations around the world, covering the development of asyncio, the company's growth, managing code at scale, and motivating fellow Python enthusiasts. His clear explanation of the challenges developers face, or will face, drew considerable applause from the audience.
Kyujin Cho, Senior Software Engineer | Sergey Leksikov, Researcher | Joongi Kim, Chief Technology Officer
In addition, Kyujin Cho, Senior Software Engineer, Sergey Leksikov, Researcher, and Joongi Kim, CTO, each presented a session. Kyujin presented 'Automated Python Web Framework API Schema Creator: The long way around', where he shared the trial and error he went through to automatically generate API documentation with aiohttp. Sergey presented 'Automating CLI commands execution with LLM and LangGraph: A new frontier in Python automation', where he talked about how complex CLIs can be transformed into user-friendly tools using LLM and LangGraph frameworks to improve user experience and operational efficiency. Both talks were well attended and received great interest from the audience. Joongi presented 'Engineering Python for enterprise delivery', where he shared his experience in developing packaging and installers for delivering Python applications.
👨🏫 Jinho: We'd love to hear your PyCon session recaps.
👨💻 Sergey: Although my presentation was in English, the PyCon Korea organizers offered a real-time translation service, enabling me to deliver my talk smoothly. The highlight of the session was the interaction with eager audience members who approached me with questions about my presentation topic afterwards.
'AI Score Reader' event (Drawing a duck and swan)
Lablup organized an 'AI Score Reader' event at PyCon Korea 2024. Attendees could join the event by scanning a QR code on their mobile devices, tablets, or laptops and were invited to draw "ducks and swans" for a chance to win prizes. Every participant had the opportunity to receive a Lablup Folding Pouch, while the grand prize for the best drawing was a Starbucks gift card.
The stickers Lablup carries to every exhibition
Why ducks and swans? They are the characters on the stickers we bring to every exhibition. While they appear to glide smoothly across the water, they are paddling hard beneath the surface. This imagery serves as a metaphor for our desire to have our customers enjoy seamless AI services up front while Backend.AI handles the complex processes in the background.
The concept of the event is straightforward. If you've ever used generative AI tools such as Stable Diffusion or DALL-E for image creation, you know that generative AI is highly sensitive to its input. The output may vary with each attempt, even with identical inputs, or change drastically with minor input modifications. The craft of instructing generative AI to yield specific results is known as 'prompt engineering.' All participants at our booth had the opportunity to try some prompt engineering themselves.
We asked Kyujin Cho, a senior software engineer who was responsible for developing the backend of our “AI Score Reader” event, to give us an overview.
👨🏫 Jinho: Can you explain how the "AI Score Reader" event page was created?
👨💻 Kyujin: The "AI Score Reader" event page's backend was built on three microservices: the Web Application Server (WAS), the image generation pipeline, and the image similarity pipeline. The WAS includes a user database and an API to manage image generation requests. These microservices were all deployed on Backend.AI using the 'Model Service' feature. Similar to the Visutale demo by Sergey from our research team, this serves as a testament to the versatility of Backend.AI in developing AI services.
👨🏫 Jinho: Specifically, what process leads to the similarity between a user's drawing and a given image?
👨💻 Kyujin: When a user submits a prompt through a page accessed by scanning a QR code to generate an image, the Web Application Server (WAS) forwards the text to an image generation service and retrieves the created image. The WAS then sends this image to the similarity discriminator, receives a similarity score as a percentage, and delivers both the score and the generated image to the user.
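Based on Kyujin's description, here is a minimal sketch of what such a WAS handler might look like with aiohttp. The endpoint path and the internal URLs of the two pipeline services are hypothetical placeholders, not the actual implementation.

```python
from aiohttp import web, ClientSession

# Hypothetical internal endpoints for the two pipeline microservices,
# each deployed as a Backend.AI Model Service (URLs are assumptions).
IMAGE_GEN_URL = "http://image-gen.internal/generate"
SIMILARITY_URL = "http://similarity.internal/score"

routes = web.RouteTableDef()

@routes.post("/api/submissions")
async def submit_prompt(request: web.Request) -> web.Response:
    payload = await request.json()
    prompt = payload["prompt"]

    async with ClientSession() as session:
        # 1. Forward the prompt to the image generation pipeline.
        async with session.post(IMAGE_GEN_URL, json={"prompt": prompt}) as resp:
            image = await resp.json()               # e.g. {"image_url": ...}

        # 2. Ask the similarity pipeline to compare the generated image
        #    against the reference duck/swan drawing.
        async with session.post(SIMILARITY_URL, json=image) as resp:
            score = (await resp.json())["similarity"]   # percentage

    # 3. Return both the generated image and its score to the participant.
    return web.json_response({"image": image, "similarity": score})

app = web.Application()
app.add_routes(routes)

if __name__ == "__main__":
    web.run_app(app, port=8080)
```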
👨🏫 Jinho: What should we aim for in our drawings to get a high score?
👨💻 Kyujin: Honestly, I have no idea what criteria the image similarity pipeline uses to determine similarity. Perhaps asking the AI directly would be a better way to get an answer.
'Massive' engagement
Our members perceived it as a "minor event." However, it took less than an hour for this perception to be completely overturned.
Our booth started to fill up, and the space in front of it became a scene of people staring hungrily at the leaderboard, eager to see where they stood and to defend their top placement.
There were even reports of people sitting in the lounges, staring at their phones, drawing swans.
Let's take a look at the stats.
On the 26th and 27th of October, the event saw the participation of 428 individuals. Throughout this period, we received 11,639 image creation requests, with the most surprising contribution from a single participant who submitted over 1,000 images, a detail we'll delve into later.
Unintended Side Effects
Seated at the booth and watching the leaderboard, I noticed two familiar nicknames: "cloudshin" and "achimnol," which belong to CEO Jeongkyu Shin and CTO Joongi Kim, respectively. They were on a streak, achieving similarity scores above 90%, and kept drawing ducks non-stop, even on their way to lunch.
DevRel Lead WooYoung Yoo attempted to stop them, yet it is said that their enthusiasm was difficult to diminish. (P.S.1: Of course, we did forcefully erase their data when finalizing the scores).
(P.S.2: In fact, the other participants performed so well that they effortlessly exceeded those two's scores anyway, resulting in a dominant podium presence...)
In addition to these internal side effects, there were also external issues. While we were running the booth on Day 1, a developer screamed in the distance.
"I believe someone is executing a macro."
We discovered that duplicate submission requests with the same prompt were being sent every 1-5 seconds: a macro was exploiting the fact that generative AI yields varied results even for identical prompts. However, addressing the issue right away was difficult, because the backend developer was scheduled to present at PyCon on the second day. The situation was deteriorating.
"I can't submit." "The button doesn't work." "I get a gray blank space instead of an image."
An attendee opened our event website on their laptop, opened the developer tools, and pointed out that no requests were getting through. We found the developer, tucked away in a corner preparing for his talk on stage, and discovered that the earlier macro was bombarding our NVIDIA H100 with concurrent requests.
Day 1 concluded with a trickle of image submissions. While tidying up the booth and reflecting on the day, we recognized the need to avert this issue on Day 2.
The poor developer pulled an all-nighter ahead of Day 2, putting off his presentation prep to add a couple of new features to prevent a repeat of the incident.
First, added same-prompt submission protection
We added the ability to validate submitted prompts on the server side. If a prompt had been submitted previously, the response was changed to return the originally generated image and its similarity score rather than generating a new image.
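A minimal sketch of the idea, assuming an in-memory cache keyed by a normalized prompt (the actual service presumably persisted results in its database, so names and helpers here are illustrative only):

```python
import hashlib

# Illustrative in-memory cache; the real service would persist results
# in its user database instead.
_results_by_prompt: dict[str, dict] = {}

def prompt_key(prompt: str) -> str:
    # Treat prompts differing only in case or extra whitespace as identical.
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

async def generate_and_score(prompt: str) -> dict:
    # Placeholder for the image-generation + similarity calls
    # sketched in the WAS handler above.
    raise NotImplementedError

async def handle_submission(prompt: str) -> dict:
    key = prompt_key(prompt)
    if key in _results_by_prompt:
        # Same prompt seen before: return the original image and score
        # instead of spending GPU time on a new image.
        return _results_by_prompt[key]
    result = await generate_and_score(prompt)
    _results_by_prompt[key] = result
    return result
```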
Second, added a captcha for image submissions
To deter macros, we implemented a captcha for submitting responses. Although this may have been slightly inconvenient for participants, the introduction of the captcha successfully prevented random macros on the second day of the event.
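Server-side verification of the captcha token might look roughly like this; the article does not say which captcha service was used, so the reCAPTCHA-style verify endpoint and field names below are an assumption, not the actual setup.

```python
import os
from aiohttp import ClientSession

# Assumes a reCAPTCHA-style "siteverify" endpoint; swap in whichever
# provider was actually used.
VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"
CAPTCHA_SECRET = os.environ["CAPTCHA_SECRET"]

async def captcha_ok(token: str) -> bool:
    """Return True only if the provider confirms the captcha token."""
    async with ClientSession() as session:
        async with session.post(
            VERIFY_URL,
            data={"secret": CAPTCHA_SECRET, "response": token},
        ) as resp:
            result = await resp.json()
    return bool(result.get("success"))

# In the submission handler, requests whose captcha fails are rejected
# before any image-generation request ever reaches the GPU.
```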
Other minor improvements
The similarity scores we received had quite a few decimal places, but we truncated them to one decimal place in the GUI for readability. However, as the competition heated up, participants began showing up with identical scores down to the first decimal place, so we patched the GUI to display the second decimal place to reduce confusion.
Thanks to Kyujin Cho, Senior Software Engineer, and Soojin Kim, Frontend Engineer, who worked tirelessly behind the scenes to make the event a success 🙂 .
Wrapping up PyCon Korea 2024
Numerous Python enthusiasts attended PyCon Korea 2024, and Lablup engaged with many of them. Lablup is committed to ongoing contributions and growth within the open source community. Echoing Chef Jeong Ji-sun's words from the hit show "Culinary Class Wars": "Opening up recipes leads to more recipes. When many people contribute ideas, it creates a larger tapestry."
Here at Lablup, we will continue to think big and never lose sight of the open source spirit that has always been the cornerstone of our foundation.
20 November 2024
Learning from Nature Again: Neuromorphic Computing and Deep Learning
By Jeongkyu Shin
This article originally appeared in Engineers and Scientists for Change, May 2022.
The field of artificial neural networks has been gaining serious attention for nearly a decade now. In that short time, along with the advancements in deep learning, the field has been solving countless problems at an astonishing pace. It is considered the most promising approach for achieving artificial intelligence.
At the cutting edge, news about hyperscale deep learning models and their implementations has been garnering attention. In April 2022, news about NVIDIA's new H100 GPU flooded the headlines. AMD's high-performance computing GPUs like the MI series, along with Intel's new Ponte Vecchio GPU, are expected to drastically improve AI and blockchain mining acceleration, creating a new battleground for hyperscale AI.
Amidst the buzz around hyperscale AI, there was one piece of news that didn't receive much public interest: Intel's announcement of the Loihi 2 chip in October last year[1]. This news comes with a fascinating history and technical background. While AI training and service acceleration chips are proliferating, let's explore the science behind this intriguing tech news.
Can we code intelligence with deep learning?
Since 2013, when deep learning took off with matrix computation acceleration using GPUs, the field has begun exploring various possibilities empowered by the scale of computation. Starting with the AlphaGo shock in 2016, deep learning has gradually expanded its scope beyond research into practical applications. The transformer model architecture[2], proposed in 2017 and widely adopted from 2018, introduced the concepts of attention and self-attention, greatly improving the process of deep learning models creating their own memory structures. The transformer architecture has since been used in a wide range of deep learning models, particularly excelling in the language processing and image processing domains, where data is abundant. Transformers have enabled deep learning models to solve various problems that previously seemed intractable.
This seemingly omnipotent model has led the trend of scaling up deep learning models since 2018. The size of a deep learning model is determined by the number of its parameters, which are the connection information between the perceptrons that make up the model, corresponding to the synaptic connections of actual neurons. More connections allow the deep learning model to differentiate and judge more complex inputs. As deep learning models become more complex and massive, the number of parameters grows exponentially. Until 2019, the number of parameters in deep learning models increased roughly 3 to 5 times each year, but since 2019, it has been increasing more than tenfold annually. The massive deep learning models that have emerged in the past 2-3 years are sometimes referred to as "hyperscale AI". Well-known hyperscale deep learning models in the language processing domain include OpenAI's GPT-3 and Google's LaMDA. For these huge models, such as GPT-3, the system cost for training the model (the cost of a single training run on the cloud without purchasing equipment) is estimated to be at least around 5 billion KRW (approximately 4 million USD)[3].
Hyperscale models are solving problems that were previously difficult or impossible to solve. They discover new gravitational lenses[4] and unwarp the distortions caused by lenses[5] to unravel the mysteries of the universe. They predict protein folding structures in much shorter times and at lower costs than previous methods[6] and discover new drugs[7]. They even solve problems like StarCraft II strategic simulations[8], which require understanding flows over long periods of time.
As these models solve various problems, naturally, some questions arise. Is this approach of pouring in massive amounts of resources to create deep learning models sustainable? And can we code "intelligence" using this method?
To answer these two questions, let's quickly understand deep neural networks and today's topic, neuromorphic computing.
Deep Neural Networks: Origins and Differences
Deep learning is actually an abbreviation. The full term is Deep Neural Network (DNN), or more elaborately, Artificial Neural Network with deep layers. The theory of artificial neural networks has its roots in mathematically imitating the electrical properties of neurons. It began by mathematically imitating the electrical properties of neurons along with the plasticity[e1] of strengthening or weakening the connections between neurons during information processing, and then simplifying it. The artificial neural network model is a mathematical model that consists of a perceptron[9], which is an extreme simplification of the activation process based on the connections between neurons; an activation function that simplifies the firing process of neurons into a function of signal input to neurons, excluding time dependency; and weights as parameters representing the strength of connections between neurons.
Although artificial neural network theory has its roots in the characteristics of actual neural networks, there is a fundamental difference: the presence or absence of dynamics that determine behavior over time. In real neural networks, various outcomes are determined by the dynamics between neurons. Neurons have their own dynamic characteristics when stimulated from the outside, and they have plasticity that physically strengthens or weakens accordingly. For example, neurons that are continuously used and connected together when making a certain decision are activated at similar times when receiving an input signal. We can observe that the axons corresponding to the connections between neurons activated at "temporally" similar times become physically thicker. In contrast, general artificial neural networks simulate plasticity using backpropagation theory instead of dynamics. Backpropagation theory is a method to simplify the calculation of strengthening the weights of connections between perceptrons that were used to make a correct decision. In artificial neural networks, the process of processing input information is instantaneous using the weights between perceptrons. Since the process of input information leading to output information is not calculated as a function of time, there is no dynamic element.
There are various other differences besides dynamics. These differences are usually the result of introducing assumptions that are impossible in biological neural networks in order to overcome the limitations of artificial neural network theory in the 1990s. One example is the use of ReLU[e2] as an activation function. Ordinary neurons have thresholds and output limits; infinite activation values are physically impossible. So early mathematical models also used activation functions that represent thresholds and saturation well. However, as artificial neural networks became deeper, researchers discovered that training stopped progressing.[e3] The ReLU activation function, although physically implausible, can produce unboundedly large activations mathematically[e4]. By introducing ReLU into deep artificial neural networks, training deep networks became possible, and the difference from biological neural networks grew.
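As a simple illustration of that difference, the sketch below (a generic numpy example, not tied to any specific model) compares a saturating activation (sigmoid) with ReLU: the sigmoid's output and gradient flatten out for large inputs, while ReLU keeps growing linearly.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

x = np.array([0.5, 2.0, 10.0, 100.0])

print(sigmoid(x))                      # saturates toward 1.0: ~[0.62, 0.88, 0.99995, 1.0]
print(sigmoid(x) * (1 - sigmoid(x)))   # gradient shrinks toward 0 (vanishing gradient)
print(relu(x))                         # unbounded: [0.5, 2.0, 10.0, 100.0]
```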
Artificial neural networks that do not need to consider dynamics can be transformed into a sequence of matrix operations, enabling incredibly fast computation. However, they have become very different from the neural networks seen in biology. So, are deep learning models and the neurological processes occurring in our brains now on completely different foundations?
Learning from Nature Again: Dynamics of Neural Networks
Individual neurons communicate signals in various ways. Some are electrical signals, and some are chemical signals. The electrical signal characteristics within single neurons were interpreted and formulated very early[10] and became the theoretical basis for perceptrons. The problem was that the formula was too complex to calculate dynamics without simplification. Later, various mathematical models were proposed to approximate the electrical responses over time with reduced computational burden, and various single neuron simulators based on these models have been released. A representative simulator is NEURON[11].
As mentioned earlier, simulating dynamics requires an enormous amount of computation. Meanwhile, we have entered an era of abundant computational power. What would happen if we used that overflowing computational power to connect these single-neuron simulations into networks?
There are two approaches, one algorithmic and one hardware-based, to overcome the enormous computational cost and create dynamics-based artificial neural networks. The algorithmic approach is the spiking neural network (SNN), which introduces the spike-based plasticity found in actual neurons. The hardware approach is neuromorphic computing, which has been in full swing since 2012: it implements an artificial neural network by creating physical objects corresponding to neurons. Computers are still too slow to handle the enormous amount of computation involved in simulating dynamics with general-purpose hardware, so the idea is to build dedicated devices that either realize the mathematical properties of neurons at the circuit level or hardcode the computations. Recently, the two approaches have converged, and implementing an SNN at the device level is now commonly referred to as neuromorphic computing. Both are attempts to simulate the dynamic characteristics that traditional artificial neural network theory left out, in order to discover new phenomena or possibilities for deep learning.
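As a toy illustration of what "dynamics" means here, the sketch below simulates a single leaky integrate-and-fire neuron, one of the simplest textbook spiking neuron models: the membrane potential evolves over time and emits a spike when it crosses a threshold. This is a generic example, not Loihi's or NEURON's implementation, and the constants are arbitrary.

```python
import numpy as np

# Leaky integrate-and-fire: tau * dV/dt = -(V - V_rest) + R * I(t)
tau, v_rest, v_reset, v_thresh, r = 20.0, -65.0, -70.0, -50.0, 10.0  # ms, mV, mV, mV, MOhm
dt, steps = 0.1, 5000                        # 0.1 ms step, 500 ms total

v = v_rest
spikes = []
for step in range(steps):
    t = step * dt
    i_ext = 2.0 if 100.0 <= t < 400.0 else 0.0      # nA, constant input pulse
    dv = (-(v - v_rest) + r * i_ext) / tau
    v += dv * dt
    if v >= v_thresh:                        # threshold crossing -> spike
        spikes.append(t)
        v = v_reset                          # reset after firing

print(f"{len(spikes)} spikes, first at {spikes[0]:.1f} ms" if spikes else "no spikes")
```

Unlike a perceptron's instantaneous weighted sum, the output here is a spike train whose timing depends on the input's history; that time dependence is exactly what standard artificial neural networks discard.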
One of the companies making strides in the field of neuromorphic computing is Intel. In the fall of 2017, Intel unveiled the Loihi[e5] chip, a research neuromorphic chip containing approximately 130,000 neurons and 130 million synapses. They ported existing DNN-based algorithms onto the Loihi chip, performed various comparative tests[12], and interestingly, showed that similar results to DNN can be obtained using SNN as well.
Intel then connected multiple Loihi chips to create massive SNN systems. Nahuku implemented 4.1 billion synapses, and the Pohoiki Springs neuromorphic supercomputer[13] implemented approximately 101 million neurons and 100 billion synapses based on 768 Loihi chips. In this process, Intel developed a software stack to implement SNN on Loihi. As a result, last fall, along with Loihi 2, they released the Lava open-source software framework[14] for developing neuromorphic applications.
It was expected that DNN and SNN would show similar results. From a physics perspective, the process of artificial neural networks inferring various problems is ultimately defining an ultra-high-dimensional discontinuous state space based on information and projecting new information onto that space. Both DNN and SNN have the characteristic of being able to define ultra-high-dimensional discontinuous state spaces. Through evolution, biology has physically created the characteristic of adapting to information, and humanity has invented artificial neural network theory through biomimetics and developed deep learning.
Always Finding Answers, as Always
So far, we have learned that networks that imitate neurons at the dynamics level can also obtain results similar to what we expected from deep learning. Then, a question arises: if the results are similar, is there a need to use SNN and neuromorphic computing? The examples introduced today are just a tiny fraction of various attempts. Research is ongoing on how SNN and neuromorphic computing produce different results from existing approaches. There are also results showing that SNN performs better, especially in robotics and sensors, and studies suggesting that reflecting dynamic characteristics would be more powerful for inferring causality. There are even attempts to simulate the chemical signals occurring at synapses[15]. This is because, in addition to the connection structure of neural networks, there may be elements in the individual components that make up neural networks that evoke the emergence of intelligence that we are not yet aware of. However, this may not be a sufficient answer to why SNN is used.
Let's ask the two questions posed at the beginning of the article again. Is this approach of pouring in massive amounts of resources to create deep learning models sustainable? And can we code "intelligence" using this method? Could neuromorphic computing be the answer? It could be, or it might not be.
The reason why DNN and SNN each show high performance and results is ultimately because there is an information optimization theory that we do not yet know at the foundation of both implementations. If we come to understand that, we may be able to implement AI in a different way. It could be one path to finding an answer to the first question: "Is this approach of pouring in massive amounts of resources to create deep learning models sustainable?" Neuromorphic computing and SNN allow us to examine this problem from a new perspective.
And it could also be the answer to the second question. We always carry a question in our hearts: 'Who are we?' Neuromorphic computing and SNNs offer the most easily understandable approach when we tackle this fundamental philosophical question physically, because they explain it using a system we already know, even though we do not yet understand its underlying framework.
Various fields, including neuromorphic computing, are taking on the challenge of answering these two questions. One of them is quantum computing. Next time, let's take the opportunity to read about quantum computing and deep learning together.
References
- [1] https://www.anandtech.com/show/16960/intel-loihi-2-intel-4nm-4
- [2] https://arxiv.org/abs/1706.03762
- [3] https://lambdalabs.com/blog/demystifying-gpt-3
- [4] https://iopscience.iop.org/article/10.3847/1538-4357/abd62b
- [5] https://academic.oup.com/mnras/article-abstract/504/2/1825/6219095
- [6] https://www.nature.com/articles/s41586-021-03819-2
- [7] https://www.frontiersin.org/articles/10.3389/frai.2020.00065/full
- [8] https://www.deepmind.com/blog/alphastar-mastering-the-real-time-strategy-game-starcraft-ii
- [9] https://doi.apa.org/doi/10.1037/h0042519
- [10] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1392413
- [11] https://neuron.yale.edu/neuron
- [12] https://ieeexplore.ieee.org/document/8259423
- [13] https://arxiv.org/abs/2004.12691
- [14] https://www.intel.com/content/www/us/en/newsroom/news/intel-unveils-neuromorphic-loihi-2-lava-software.html
- [15] https://www.ibm.com/blogs/research/2016/12/the-brains-architecture-efficiency-on-a-chip/
Endnotes
- [e1] Plasticity is the ability to adapt and change one's characteristics in response to changes in the external environment or stimuli.
- [e2] It stands for Rectified Linear Unit. It is an activation function that takes the shape of y=x for values greater than 0, which means y can continue to increase as x increases.
- [e3] This is known as the Vanishing Gradient problem.
- [e4] Neurons cannot produce outputs beyond the physical limits of the cell, no matter how much larger the input they receive. It's like not being able to pass an unlimited current through a wire. ReLU is a function where the output increases linearly and indefinitely in proportion to the input.
- [e5] Intel is using various Hawaiian place names as codenames for their neuromorphic chips and systems.
27 June 2024
[Featured articles] Scale entanglement
By Jeongkyu Shin
This article originally appeared in Crossroads, May 2023. [^Editor's note]
This article is translated from Korean.
The original order of the essay is 2023 > 2015 > 2020 > 2017 > 2018 > 2019 > 2021 > 2022 > 2023.
While the emotional flow follows that order, the essay has been re-edited in chronological order for the reader's understanding.
After 2023, March 14th may no longer be known as Pi Day, but rather as Chatbot Day.
It was the day when all the language models that had been in storage were simultaneously unleashed into the world. Starting with Google's release of PaLM fine-tuning + generative models on Vertex AI, followed by OpenAI's announcement of GPT-4, Microsoft officially confirming that Bing was already using GPT-4, and Anthropic's formal release of Claude bot, all within a span of 12 hours.
That morning, after reviewing the GPT-4 tech report released by OpenAI, I left a post on Facebook about the technical aspects that impressed me.[1] A comment was left: "I feel both joy and pain seeing things I wondered if I would ever see in my lifetime becoming reality..." I replied, "Now, no one is interested in the Turing test anymore. Within a year, it has become a question of 'Isn't it obvious that it will pass without a wow point?'"
When actually facing the keyboard, the task of organizing knowledge seems to have already left human hands. I tap into human stories to find the meaning of recording.
To avoid sounding like an alien language, let me briefly touch upon the aspects of artificial neural networks necessary to understand this essay.
A program that simulates the connections between neurons is called an artificial neural network. Neurons are grouped into layers, and the network is designed by overlapping these layers and creating connections between neurons in adjacent layers. Deep learning is a term used when there are many layers in an artificial neural network. The results of various artificial neural networks are called deep learning models, or simply AI models to sound more impressive.
The connection strengths between neurons are called parameters[2]. The number of connection strengths is called the number of parameters. As the number of parameters increases, more memory is required, so it is said that the model becomes larger. Training a model involves connecting artificial neurons and adjusting the connection strengths between them so that the output for the input data takes the desired form. After training, an artificial neural network can mimic an extremely high-dimensional, discontinuous state space.
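To make "number of parameters" concrete, here is a tiny sketch that counts the weights and biases of a small fully connected network and estimates the memory needed just to store them; the layer sizes are arbitrary.

```python
# Count parameters of a small fully connected network.
# Each layer holds a weight matrix (in x out) plus a bias vector (out).
layer_sizes = [784, 512, 256, 10]           # arbitrary example sizes

params = sum(
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
)
print(f"{params:,} parameters")             # 535,818

# Rough memory footprint just to store the weights:
print(f"{params * 4 / 1e6:.1f} MB at float32")
print(f"{params * 2 / 1e6:.1f} MB at float16")
```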
Those are the basic terms. So, when should we start recalling the story?
2015
I founded Lablup Inc. The name was chosen as a pun, meaning both "lab eul up" and "lab | up". (eul is a Korean object particle, and | is the Linux pipe operator.)
People who had been suffering through their Ph.D. programs came together with the goal of creating a research automation platform in the field of computational science to alleviate the hardships of others. We believed that instead of clumsily running clusters by placing workload managers[4] on bare metal[3], we needed a computational environment that guarantees reproducibility and portability. The start was courageous, but there was no demand for a research platform. Within two months of founding the company, we learned the hard way that both universities and research institutions, however easily they purchase equipment, are stingy when it comes to spending money on software. Universities lacked funds, but had an abundance of graduate students who would figure things out on their own when instructed. The industry did not yet have a large-scale demand for scientific computing. At the same time, we painfully learned that Ph.D. holders like us, who had only ever been in school, needed to go through a process of re-socialization to communicate like normal people.
It wasn't just our lack of re-socialization that was the problem. It was the content of what we were saying. Stories about science advancing based on technology or stories about accelerating innovation based on computation were considered science fiction topics wherever we went. We were getting exhausted. Yet, as they say, people only see what they can see. The potential of deep learning models was clearly emerging. In fact, various movements had started to prevent deep learning technologies and their outcomes from being subordinated to capital. One of the representative organizations was OpenAI[5]. Such changes seemed to be evidence that we were heading in the right direction. It felt like the direction would become a bit clearer after just one more year.
On the verge of our first anniversary, deep learning garnered significant social interest. Around the end of 2015, TensorFlow[6] was released to the world. As the first lecture material for the coding platform prototype, we translated the entire TensorFlow manual and uploaded it. When AlphaGo had its match in March 2016[7], our service crashed for the first time due to the influx of people. If it weren't for that match at that time, perhaps there wouldn't be a sequel to this story. Thankfully, because of that, the company survived.
We needed a demo of a research platform performing massive-scale computations. Language models were a topic that required enormous computational resources, and one I could tackle immediately, since they had been a hobby of mine since my Ph.D. program. In 2016, we released a chatbot created by placing a language model on the platform we were developing. It attracted a lot of attention.[8][9] The chatbot quickly became an in-house project. However, after just over a year, the chatbot project was shelved.
2017
In the fall of that year, Lablup decided to halt the development of language models, which had been a side project, and focus solely on Backend.AI, an AI cluster management platform.
The shock of the Google Assistant demo I saw in Krakow, Poland, where I was invited by Google, was significant. The topic of that meeting, which was attended by just over ten people, clearly showed that language models had now become part of the arms race for resources, and without large-scale investment, it would be impossible to keep up with the changes thereafter. Over dinner with Donghyun Kwak, who attended the Google Developer Summit with me, and Sanghun Lee, who visited Krakow to attend a conference held at the same time and place, we discussed, "It seems like the future I saw in the field of physics will start here as well."
Nearly eighty years ago, the Manhattan Project demonstrated to all of humanity, through nuclear weapons, that technology is power. Physics was no longer an object of romance, but an object of investment. The era of physicists making a living, which began that way, led to the age of big science, connecting to the space program and particle physics. On the way back to the accommodation that night, I sent a message to the team members: "Let's not develop language models anymore. From now on, we'll lack the funds to keep up."
History always repeats itself. Then it's not difficult to predict what changes will occur in the future. It's just a matter of timing. We shared the opinion that 'the inflection point will probably be in 2020.'[10] By then, wouldn't it be possible to achieve profitability? It became the company's goal. Language models were advancing beyond LSTM-based machine translation. The AlphaGo shock became the source of countless jokes people made about AI. A tremendous number of "AI companies" were born. However, most of them became coin companies or metaverse companies two years later.
2018
The transformer[11] architecture began to be applied to various parts of all kinds of language models. In terms of letting us know 'what' to focus on, transformers solved many aspects of model context memory, maintenance, and emphasis. They could be used for encoding to put information into the state space or for decoding to extract information from the state space. Google introduced BERT[12], while OpenAI introduced GPT. Both were transformer-based language models, but they differed in their focus on the encoder and decoder, respectively. BERT focused on the encoder part, while GPT implemented an architecture that creates memory for causal relationships by linking the output to the input through the decoder, which is structurally different from BERT. BERT and GPT, and the later T5, no longer used labeled corpora. Using transformers, language models could be created by first training the model to learn the structure of language from the corpus itself and then fine-tuning it. It was still far from the end-to-end training philosophy of AI model development, which involves no human intervention in the data. However, the concept of data acquisition began to change from that point on. Quantity is more important than labeling. It was the beginning of general-purpose language models.
BERT seemed to be able to replace most of the existing language models at an incredibly fast pace. Its overwhelming performance raised expectations for making significant improvements in various language tasks such as document creation, chatbots, and document analysis. However, as of 2018, BERT was a model so large that it was unimaginable to train. It seemed like no one outside of Google would be able to create a model of that size. But that moment was fleeting. Facebook quickly released RoBERTa[13], which was based on the BERT paper but increased in size. It was a symbolic action that simultaneously announced that TPUs were not necessarily required and that anyone with capital could participate in this race.
The first bottleneck in increasing model size using GPUs appeared in GPU memory. Models could no longer fit on a single GPU device or be trained in time using a single device. Depending on the case, it became common to split models across multiple GPUs or distribute training across multiple computing nodes. Horovod and Distributed TensorFlow began to shine.
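A rough back-of-envelope sketch of why that happens: with mixed-precision Adam training, a common rule of thumb is around 16 bytes of GPU memory per parameter (fp16 weights and gradients plus fp32 master weights and two optimizer moments), before even counting activations. The numbers below are illustrative, not measurements.

```python
def training_memory_gb(n_params: float, bytes_per_param: int = 16) -> float:
    """Rule-of-thumb memory for mixed-precision Adam training:
    ~16 bytes/parameter (2 fp16 weights + 2 fp16 grads + 4 fp32 master
    + 4+4 fp32 Adam moments), ignoring activations and overhead."""
    return n_params * bytes_per_param / 1e9

for name, n in [("BERT-large", 340e6), ("GPT-2", 1.5e9), ("T5-11B", 11e9)]:
    print(f"{name:>10}: ~{training_memory_gb(n):,.0f} GB")
# BERT-large: ~5 GB   -> fits on a single 2019-era GPU
# GPT-2:      ~24 GB  -> already tight on a 32 GB V100
# T5-11B:     ~176 GB -> cannot fit on one device; must be split
```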
Technology continues to advance, and the cost per unit of computational resources continues to decrease. If this progress continues, the popularization of AI will eventually take place, and at that point, price competitiveness, which is the same in all markets, will become the most important factor. "The era of price competitiveness will come for AI as well," I wrote, praying not to go bankrupt until then.
- BERT, 340 million parameters
- GPT, 110 million parameters
2019
For several years while creating a distributed processing and distributed training platform, I occasionally wondered, 'Could we be creating a platform with no demand?' After 2019, I no longer had such thoughts. From 2020, there was no time for such thoughts.
As soon as the new year began, OpenAI announced GPT-2[14]. The GPT model, which focused on the decoding process of extracting information from the topological space, demonstrated remarkably stable text generation capabilities. GPT-2 became the basic code that anyone could use to create a language model. Along with PyTorch, Horovod, and Distributed TensorFlow, the difficulty of accessing such code was rapidly decreasing. Google's XLNet and T5 (Text-To-Text Transfer Transformer) language models in 2019 seemed to have crossed the river of model sizes that had been thought impossible to surpass (by spending capital). Google strongly emphasized that training T5 required enormous computational resources at the TPU level, noting that it would take hundreds of NVIDIA V100 units, which could be purchased with effort on the market. (A single V100 cost around 12K USD at the time.) Like BERT, T5 was released only as a paper, without the trained model. Google had had the painful experience in 2018 of releasing the BERT paper without immediately releasing the model alongside it (because training was not yet complete), and in that gap Facebook preemptively released RoBERTa, which they had trained by scaling up the same model. Considering that Google still did not release T5's model, they must have been confident that it would be difficult for anyone outside Google to reproduce its training.
At the end of 2019, we ended our long wandering and moved into our own office. The era of massive deep learning models was coming, and to prepare for it, we needed to collaborate with more people. The size of language models was increasing tenfold every year. As I carried the not-so-many belongings in boxes, I questioned myself. At this rate, in three years, the model size would increase a thousand-fold, but were we ready to handle a thousand times the workload?
- RoBERTa, 350 million parameters
- Transformer ELMo, 460 million parameters
- GPT-2, 1.5 billion parameters
- T5, 11 billion parameters
2020
Two months had passed since we moved to the new office. The winter was long.
By the end of February, we were unable to complete the interior of the new office we had moved into. The wall finishing materials for the office, which were supposed to arrive from China at the end of February, never made it. The company's entire roadmap changed. All business trips to the United States were canceled. The office with unfinished interiors remained unfinished and guarded the empty space for the next two years.
COVID-19 not only tore apart the company's future but also separated people. My first child started elementary school lying diagonally on the floor, watching the "tiger teacher" (a Korean expression for a strict teacher) on the EBS educational broadcast channel. I rolled around next to the child rolling on the floor. How busy had I been? Even while raising my children, it wasn't until I was trapped in the same house by the coronavirus that I felt the reality of being a father. I wondered how long that sad yet strangely stabilizing time would last.
That year, OpenAI released GPT-3. The theoretical foundation was not significantly different from GPT-2. However, one thing was vastly different: the size. It was a massive model with 175 billion parameters. Not only for model training but simply loading it onto a GPU was expected to occupy an entire NVIDIA DGX-2 supercomputing node. Unlike GPT-2, this time, neither the code for the language model nor the trained language model was released. Wow. Non-disclosure in the field of deep learning. Something was changing.
There was a movement against the supremacy of model size. Does performance increase accordingly as the size of deep learning models grows? A debate began between researchers at Google and Meta. One side argued yes, while the other argued no, engaging in a word battle in the form of papers. However, this debate, which lasted from 2019 to 2021, did not last long. As the size of language models increased, interesting phenomena were discovered. Deep learning models had scaling laws[15]. Something happened around 100 billion parameters. Regardless of the model structure, at some point beyond 100 billion parameters, language models began to do unexpected things beyond simply generating coherent speech. The phenomenon called in-context learning allowed models to learn various knowledge on the spot without model training and derive logical conclusions. It was the beginning of the race surrounding large language models (LLMs).
While language models were simultaneously immersed in the debate and discoveries surrounding the size problem, the introduction of deep learning in medical applications began at a tremendous pace. DeepMind's AlphaFold2 achieved high-accuracy structure prediction without Monte Carlo simulations, using only predictions. It reduced the required computation, which was a major challenge in the field of proteomics, to nearly one-thousandth of the previous level. From microscopic stages such as predicting coronavirus mutations, filtering vaccine candidates among synthetic substances, and predicting new synthetic structures to predicting transmission routes and estimating the number of infected individuals, the application of AI models expanded to various fields. Everyone started leaping without looking. The snowball of resource scale rolled at a tremendous speed.
In the second half of the year, discussions about the scale of computational resources to increase model training speed took place. It was different from the existing competition to secure deep learning computational resources for research goals. Scale gave rise to operational and optimization demands, and thanks to that, we were able to achieve the 'profitability from 2020' that we had anticipated in 2017. The demand for platforms increased, but at the same time, we were forced to work from home, and most communication became text-based. Although many people joined us later, some of them were destined to become colleagues who had never met each other in person until the workshop in early 2023.
The scale of compute resources was growing, and all eyes were on GPUs, but as the models got bigger and the number of GPUs increased, the GPUs became less of a bottleneck. The biggest challenge was data storage. In training, you need to feed data to hundreds of machines. The absolute speed of storage hasn't kept up with the growth in the number of GPUs. Most of the problems we had to solve in 2020 came from storage bottlenecks.
Another kind of change, slower but deeper than the deep learning race, has been happening: it's become second nature to people of all generations to treat online relationships as normal relationships. And then there comes a moment when you wonder, "Does it really make a difference to me if the person on the other end of the line is human or not, as long as they speak good enough?"
- T-NLG, 17 billion parameters
- GPT-3, 175 billion parameters
- Gshard, 600 billion parameters
2021
The race to develop large language models that followed T5 and GPT-3 grew ever more fascinating. The simplest way to find out whether performance keeps improving as size increases is to make the model even bigger. Various theories emerged as to why large language models produce such peculiar results, but the answer was still unclear. One hypothesis was that when the state space is sufficiently large, a kind of phase transition occurs in the process of handling information. One of the candidate explanations for why the transformer structure handles these tasks so well is that the transformer structure is a special case of graph neural networks (GNNs).[16] Graph neural networks, which have gained attention since 2018, are neural networks that learn the relationships between objects and are known to be very powerful at processing semantics and taxonomies.
Microsoft's DeepSpeed framework[17] began to be widely used for distributed model training. DeepSpeed's ZeRO optimizer focuses on reducing GPU memory usage by distributing workloads across various hardware from CPUs to GPUs and partitioning model states. Open-source language models also emerged. OpenAI was no longer releasing models and was instead selling exclusive rights to use them. Because of this reduced accessibility, various language models appeared, but they could not meet the high expectations, as they did not match the scale of large language models.
The scale of GPUs handled by users easily began to exceed triple digits. Various massive-scale tests tailored to the actual workloads run by institutions became necessary. We started running language models again for system testing purposes, which we had sent to the realm of hobbies at the end of 2017. The largest Portuguese language model in the world was born on our platform and was briefly introduced in passing during the keynote at the NVIDIA GTC conference. In the same conference, a tutorial session titled "Fine-tuning BERT in 60 seconds" was held. BERT was no longer a massive model but a subject of practice.
As model sizes rapidly grew, the problems to be solved also changed. When models had to be split and loaded across multiple GPUs, communication between GPUs became extremely important. GPUs not only shared memory access within a node but also increasingly communicated across multiple nodes. It became common to attach one InfiniBand adapter, transmitting 200 Gb per second, to each GPU to form a GPU network.
Amidst the complex and hectic changes, a thought occurred to me. The process of large language models learning 'language' is based on unclassified corpora. What does the large language model 'learn' in that process? Although corpora are used for the purpose of learning the structure of language, language cannot be separated from information. In fact, don't language models that have not been explicitly taught knowledge readily answer questions? Language itself is a protocol for humans to convey information to each other. The conversation process involves computing answers to data transmitted through the protocol and responding with data again. Then, is our perception of having developed an 'AI that converses well' really about developing an AI model that creates language well, or have we created something beyond that?
The following year was going to be the first year of services that are only possible with AI, not services that have been improved with AI. However, no one was yet thinking about providing the outputs of large language models as services. That was a task for someone in the future.
- GPT-J, 6 billion parameters
- LaMDA, 160 billion parameters
- PanGU-alpha, 200 billion parameters
- Gopher, 280 billion parameters
- Pathways, 530 billion parameters
- Switch-C, 1.6 trillion parameters
- Wudao 2, 1.75 trillion parameters
2022
The shift of COVID-19 toward an endemic phase was creating tremendous aftereffects. Numerous IT companies that had grown during the special circumstances of the pandemic, and many companies that had tried to move their offline operations online, were dumbfounded as demand for the metaverse suddenly vanished like a mirage. The field of deep learning had not yet generated significant revenue sources. Numerous companies began downsizing their AI teams. Many researchers found themselves back on the market.
It wasn't that there was no need for technological advancements in AI. The massive scale underlying AI development overwhelmed all other advancements. As in the era of big science, the equipment was the most expensive part, just as it had always been. This was the result of three years in which innovation came from scale. The singularity occurring in large language models began to be regarded as a kind of emergence.[18] Small-scale studies were no longer attractive. Deep learning researchers were anxious. It wasn't the diminished interest that was the problem; a spoonful of mild despair over what research could be done with just a few GPUs was more of an issue.
Nevertheless, several innovations emerged from the beginning of the year. In addition to training with well-defined data, there was a model tuning method in which humans evaluate the answers and assign higher weights to better ones. RLHF (Reinforcement Learning from Human Feedback), which applied reinforcement learning to language model training by putting humans in the loop, showed tremendous improvement in the performance of same-sized language models with InstructGPT in 2022. Numerous models began to apply RLHF. If scaling laws held for model size, there was no reason not to apply them. In March, µ-Parametrization[19], which could tremendously reduce the cost of model training, was introduced. The study's conclusion, that the hyperparameters of a large model can be predicted in advance using a small model, relatively reduced the effort of hyperparameter search when creating massive models. This research became the basis for GPT-4 training.
As a result of the U.S.-China trade conflict, the United States banned NVIDIA's export of AI training GPUs to China. Within a few days, China announced a large language model trained solely with its own semiconductors[20]. Soon after, NVIDIA slightly changed the name of the same GPU with its GPU networking features removed and resumed exporting. The interest in the AI service sector continued to grow due to models like DALL-E2 and Stable Diffusion, and the market for generative AI models, such as images, began to fluctuate.
In late November, OpenAI opened a chatbot service to the public, based on GPT-3.5, an improved version of GPT-3. An interesting point was that instead of making a language model good at programming by training programming code on top of a human-language model, the model was created by training human language on top of a model trained with programming language data. This made it clear how training on programming code influences a large language model's ability to learn logical structure. In early December, the service, named ChatGPT[21], sparked interest in large language models thanks to its tremendous accessibility, open to the entire public.
Towards the end of the year, my acquaintances in the AI field who seemed to be on the verge of losing their jobs were bewildered by the support that suddenly improved in real-time. The movement of companies that had been downsizing their AI organizations, riding the wave of endemic layoffs, came to a halt. Leaders who had been pressuring to downsize research organizations and evaluate outcomes just a week before were now shouting for AI. The requirements for model service frameworks suddenly began to change. The goal of large language models became commercialization. Models had become so large that there was no longer any meaning in distinguishing between computational resources for training and services. AI model training and services, which originally belonged to different domains, suddenly merged into one.
Bigger problems await. Large language models consume an enormous amount of power. GPUs consume a tremendous amount of power. Although their power-to-performance ratio is tremendously better compared to CPUs, their absolute power consumption is too high. A node with 8 NVIDIA A100 GPUs[22] consumes about 7 kW, and a node with 8 H100 GPUs, the highest-performing GPUs as of 2023, consumes about 12 kW[23]. Since 2019, it has become no joke to say that you have to construct a building first to install the equipment. After experiencing power issues at a supercomputing cluster located in Brazil in 2021, we ported the entire platform to Arm-based systems. It was in anticipation of power issues becoming a problem a few years later. In the case of Microsoft, they shared their experience of building a GPU center right next to a hydroelectric power plant, taking into account the electricity costs[24].
Weekends diminished. There was too much to do. There was no time. It wasn't just us.
Now, no one had time.
- Flan-T5, 110 billion parameters
- GLM-130B, 130 billion parameters
- OPT-175B, 175 billion parameters
- BLOOM, 176 billion parameters
- PaLM, 540 billion parameters
2023
On February 8th, Microsoft and Google made announcements about their large language model-based services 17 hours apart. Microsoft announced plans to introduce GPT models into its search engine Bing, its Office suite, and Windows 11. Google introduced Bard, based on LaMDA. Baidu released Ernie Bot. These companies promoted the future rather than tangible services, and tools that couldn't be tried out attracted comparatively little interest.
The "era of AI price competitiveness" that I thought would come someday had arrived. However, the upfront costs were still far too high. Services like ChatGPT and Bard incur serving costs too expensive to be justified by economic logic.[25] They correspond to a future pulled forward too quickly by competition, after everyone had already had a firsthand taste of it. The problem was that expectations had skyrocketed.
The suddenly approaching large language model services are creating another bottleneck. Services that perform inference on CPUs have been affected by the significant reduction in memory bandwidth per CPU core, because the number of cores per CPU has rapidly increased. Services that perform inference on GPUs lack both the capacity and speed of the GPU memory where models are loaded. The memory bottleneck, expected to arrive someday, suddenly became an immediate problem due to the commercialization of large language model services. It had been an anticipated bottleneck since 2021, and CPU and GPU developers such as Intel, AMD, and NVIDIA had prepared for this situation in advance. From the end of 2022 to early 2023, they introduced various hardware such as Intel's Xeon Max, AMD's MI200, and NVIDIA Grace Hopper.
If an AI model is extremely large, computational power becomes relatively less important. When NVIDIA A100 was first announced, it was released with a 40 GB model, but a year later, they released an 80 GB memory model. Whether it was the training process or the inference process, the size was too large to load and unload models from memory. Additionally, the process of "inferring" large language models brought about a reversal in thinking about GPUs and NPUs. Unlike the training process, which requires constantly updating weight matrices, the inference process operates by flowing input data through a fixed model structure loaded into memory and observing the results. Therefore, the proportion of computation is tremendously reduced, and the speed of memory becomes extremely important. In the second half of 2022, NVIDIA announced the H100 with 80 GB of memory capacity. However, less than half a year later, when hardly anyone had received the actual H100, they introduced the H100 NVL with 188 GB capacity.[26]
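A back-of-envelope sketch of why inference becomes memory-bound: generating each token requires streaming essentially all of the weights through the GPU once, so single-stream token throughput is roughly memory bandwidth divided by model size. The figures below are illustrative assumptions, not benchmarks.

```python
def tokens_per_second(n_params: float, bytes_per_param: int, bandwidth_gb_s: float) -> float:
    """Upper bound on autoregressive decoding speed for one stream:
    every generated token reads all weights from memory once."""
    model_bytes = n_params * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes

# Illustrative figures: a 175B-parameter model in fp16 against a GPU
# with ~3 TB/s of HBM bandwidth (roughly the H100 class).
print(f"model size: {175e9 * 2 / 1e9:.0f} GB in fp16")                     # 350 GB
print(f"~{tokens_per_second(175e9, 2, 3000):.1f} tokens/s upper bound per stream")
# Compute barely matters here; the weights simply cannot be read from
# memory any faster, which is why inference hardware races on memory
# capacity and bandwidth rather than raw FLOPS.
```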
Meta introduced LLaMA[27], a language model that could be run on personal servers with some effort. Despite being saddled with all sorts of license restrictions, LLaMA spread through illegally leaked copies, and Alpaca, fine-tuned from LLaMA by Stanford, showed that even (relatively) small models could achieve significant performance. Since then, various language models without license issues have been continuously released[28], fueling the potential of open language models while raising new questions about what parameter size is actually sufficient. If the model is too small, emergent abilities do not appear and it cannot be used as a multi-modal model; if the model is too large, it costs too much money to actually operate.
How large can large language models grow? Signs of preparation for even larger models can be seen everywhere. Microsoft's DeepSpeed, the framework most commonly used for distributed model training, added ZeRO-Infinity[29] in 2021, an extension that can train models with 1 to 10 trillion parameters by utilizing NVMe SSDs. However, models with that many parameters are practically impossible to serve. In practice, the approach is to set a limit on the model size that can be served and to fine-tune within that range. Technologies like ZeRO were developed to train ultra-large models, but they are now widely applied because they enable fine-tuning with very few resources.
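As a rough illustration of what such offloading looks like in practice, here is a minimal sketch of a DeepSpeed ZeRO stage-3 configuration that pushes parameters and optimizer state out to NVMe. The model, batch size, learning rate, and nvme_path below are placeholders, and an actual run additionally needs a GPU node with NVMe storage and async I/O support; this is not a recommended production setup.

```python
import torch
import deepspeed  # assumes the deepspeed package is installed

# Stand-in model; in reality this would be a large transformer.
model = torch.nn.Linear(4096, 4096)

# Illustrative ZeRO stage-3 config with NVMe offload (ZeRO-Infinity style).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-5}},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
}

# DeepSpeed wraps the model and manages partitioning/offloading internally.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```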
- PaLM-e, 560 billion parameters
- Pythia, 12 billion parameters
- LLaMA, 6.5 billion parameters
And numerous other models with ~12 billion parameters
Various attempts are being made with the many 12-120 billion parameter models that are 'good at talking.' LLaMA unintentionally spread foundation models that individuals could experience firsthand. Many people realized that "models that are good at talking," at a level that satisfies ordinary people, had been achieved long ago. Individuals and organizations with some computing knowledge and money to spend have gained the courage to try fine-tuning language models in various ways.
At the same time, it is becoming clear that models which go beyond merely being good at talking require computational resources on an entirely different level. For about half a year, newly released large language models have stayed under roughly 600 billion parameters. It may be that further scaling does not yield enough benefit, or it may be a technical barrier set by current hardware and costs. Or it may be a deliberate move to keep models below a certain size, because beyond it commercialization becomes impossible.
Backend.AI, which started as an open-source project with 4 GPUs in 2015, handles several thousand GPUs in 2023 and will soon reach ten thousand. Everything around us, including our own environment, has changed tremendously. The deeper you dig into the problems, the more new problems keep turning up, like potatoes on a vine. Living amid and untangling the many problems tied to the size of large language models, I sometimes wonder where the end of this problem will be.
On nights full of thoughts, I sometimes think that, like the Turing test that quietly slipped out of our attention, we may have all passed a certain point without realizing it. It feels as if we have either solved a problem that needed solving, or solved one that should not have been solved yet. Complex emotions, excitement turning into dizziness and expectation turning into depression, come and go.
- [^Editor's note] Crossroads is a science web journal launched by the Asia Pacific Center for Theoretical Physics, aimed at 'Science, Future, and Humanity' to showcase a scientific vision of the future through various genres of scientific writing, including science features, essays, columns and novels.
- [1] Facebook Post
- [2] There are various parameters besides the connections between neurons, but for convenience, it has been tremendously simplified as the model size becomes relatively small.
- [3] Physical computers in their raw form, not virtual machines. In the cloud, it is common to run virtual machines on top of a hypervisor or manage them based on containers to reduce management overhead and flexibly manage resources. Due to cost issues, it has not yet been popularized in small research institutes and universities.
- [4] Job Scheduler. Software that helps manage and execute processes. Slurm and others are commonly used.
- [5] https://www.openai.com (2015). Since 2020, OpenAI has not released implementations, and since 2023, they have only provided tech reports instead of papers. As of 2023, there are various opinions on whether OpenAI is still an AI development organization that pursues openness.
- [6] https://www.tensorflow.org, Google (2015)
- [7] "AlphaGo - The Movie" For those who couldn't feel the atmosphere at the time, refer to the documentary (2018)
- [8] J. Shin "Creating AI chatbot with Python 3 and TensorFlow" PyCon APAC 2016 (Korean) / (English) (2016) Although there are various presentation videos on the same topic as I had the opportunity to introduce it in several countries, these two are the first presentations.
- [9] J. Shin "Android Dreaming of Electric Sheep: Implementing Chatbot Emotion Model Using Python, NLTK, and TensorFlow" PyCon KR 2017 (2017)
- [10] A record of an interview at the Google Startup Campus remains on YouTube.
- [11] "Transformer (machine learning model)"
- [12] J. Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" Arxiv:1810.04805 (2018)
- [13] Y. Liu et al., "RoBERTa: A Robustly Optimized BERT Pretraining Approach" Arxiv:1907.11692 (2019)
- [14] A. Radford et al., "Language Models are Unsupervised Multitask Learners", (2019)
- [15] J. Kaplan et al., "Scaling laws for neural language models" Arxiv:2001.08361 (2020)
- [16] C. K. Joshi, "Transformers are Graph Neural Networks", The Gradient (2020)
- [17] Microsoft, "DeepSpeed: Extreme Speed and Scale for DL Training and Inference", (2019)
- [18] J. Wei et al., "Emergent abilities of large language models" Arxiv:2206.07682 (2022)
- [19] E. Hu, G. Yang, J. Gao, "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer" Arxiv:2203.03466 (2022)
- [20] A. Zeng et al., "GLM-130B: An Open Bilingual Pre-trained Model" Arxiv:2210.02414 (2022)
- [21] OpenAI, "Introducing ChatGPT" (2023)
- [22] If a computer installed in a data-center cabinet called a rack is considered a node, a single node with 8 A100 GPUs typically occupies 6 to 8 slots, and a rack provides around 40 such slots in total.
- [23] The power of a single floor in a typical university building is around 100 kW.
- [24] "NVIDIA Teams With Microsoft to Build Massive Cloud AI Computer" (2022)
- [25] According to my personal estimate, in the case of ChatGPT, the cost based on GPT-3.5 is over $42 per month. Refer to the link for the calculation process. Facebook Post
- [26] "NVIDIA H100 NVL for High-End AI Inference Launched" (2023)
- [27] H. Touvron et al., "LLaMA: Open and Efficient Foundation Language Models" Arxiv:2302.13971 (2023)
- [28] Representative examples include Dolly 2 (2023), which combines EleutherAI's Pythia-12B model with its own data.
- [29] S. Rajbhandari et al., "ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme-Scale Deep Learning", Arxiv:2104.07857 (2021). Loading a model with 1 trillion parameters onto GPUs for training without memory offload would require 320 NVIDIA A100 (80 GB) GPUs.
25 March 2024
2023 Winter Intern in Lablup
By Byeongjo Kim
Overview
I applied to the Open Source Contribution Academy (hereinafter referred to as the Contribution Academy) hosted by OpenUp, and worked as a fall intern at Lablup Inc. (hereinafter referred to as Lablup) from November to December for 8 weeks. Afterwards, I extended it for an additional 8 weeks from January to February, working a total of 16 weeks.
Below, I have written about my experiences working at Lablup, my first company as a developer after being discharged from the military.
Motivation for Applying
Even before the Contribution Academy, I was interested in Lablup, and coincidentally, I had an opportunity to contribute through the Contribution Academy.
During the Contribution Academy period, I worked on resolving issues and refactoring the webui of Backend.AI.
While participating in the Contribution Academy, I felt a lot of affection, interest, and enjoyment towards Backend.AI, and I began to think that I wanted to continue contributing after the program ended.
Lablup happened to provide an opportunity to work there in conjunction with the Contribution Academy, so I applied without hesitation.
Onboarding
For the first 3 weeks of the internship, I underwent an onboarding process.
I went through implementing a RealTime Chat, setting up the Backend.AI environment, and then the Pebble Seminar in that order.
RealTime Chat
This was the first assignment to become familiar with the core side of Backend.AI's code. I implemented a real-time chat app using Python, utilizing the aiohttp, aioredis, and asyncio libraries.
Since there was no requirement to store the chat contents, I used Redis, an in-memory database.
I made it so that when a user enters a chat room they subscribe to that room's channel, and when a user types a message it is published to the subscribers, that is, the other users in the same room.
RealTime Chat in action
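For readers curious what such a relay looks like, here is a minimal sketch (not the actual onboarding code) of the pattern described above: each WebSocket connection subscribes to its room's Redis channel, and every incoming message is published to that channel so all subscribers in the room receive it. The route, channel names, and aioredis usage are illustrative.

```python
import asyncio

import aioredis
from aiohttp import web, WSMsgType


async def chat_handler(request: web.Request) -> web.WebSocketResponse:
    room = request.match_info["room"]          # e.g. /chat/general
    ws = web.WebSocketResponse()
    await ws.prepare(request)

    redis = request.app["redis"]
    pubsub = redis.pubsub()
    await pubsub.subscribe(f"chat:{room}")     # entering a room == subscribing

    async def relay() -> None:
        # Forward every message published to this room to this user's socket.
        async for msg in pubsub.listen():
            if msg["type"] == "message":
                await ws.send_str(msg["data"].decode())

    relay_task = asyncio.create_task(relay())
    try:
        async for msg in ws:                   # messages typed by this user
            if msg.type == WSMsgType.TEXT:
                await redis.publish(f"chat:{room}", msg.data)
    finally:
        relay_task.cancel()
        await pubsub.unsubscribe(f"chat:{room}")
    return ws


async def init_app() -> web.Application:
    app = web.Application()
    app["redis"] = aioredis.from_url("redis://localhost:6379")
    app.add_routes([web.get("/chat/{room}", chat_handler)])
    return app


if __name__ == "__main__":
    web.run_app(init_app())
```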
While I was able to handle Python at a basic level from preparing for coding tests, I had no experience using libraries like aiohttp, asyncio, and aioredis, so it took me some time to understand and grasp the concepts.
However, this assignment helped me a lot in understanding the core side of Backend.AI's code, and it was good to be able to study new libraries.
Setting up the Backend.AI environment
Since I had already installed Backend.AI during the Contribution Academy, setting up the environment during the internship period wasn't too difficult.
However, I was well aware that installing Backend.AI is not easy, as I had encountered many errors and failures while trying to install it during the Contribution Academy, and the other person doing the internship with me also had a lot of difficulties during the installation process.
Since I had already experienced those failures and knew the solutions, I was able to help, and we were able to install it quickly and move on to other tasks.
While setting up the environment, I also configured a virtual machine and VPN, and set up the environment on a virtual machine as well, so that I could work even if there were problems on my local machine. After setting up the configuration on the virtual machine, I mainly used the local for development during the subsequent work, and the virtual machine as a test server. The company's VM Farm, which allows for easy management and configuration of virtual machines, made it great for setting up development and testing environments.
Pebble Seminar
After completing the RealTime Chat and setting up the Backend.AI environment, I prepared a short seminar based on understanding the structure and code of Backend.AI. I was tasked with presenting on GraphQL and Relay, which are used in the Backend.AI WebUI.
While I had experience with GraphQL, I felt that my knowledge was lacking for presenting in front of others, and Relay was a new library to me, so I was quite worried about preparing for the Pebble Seminar and read through many documents to prepare. First, I familiarized myself with the concepts by reading the official documentation for GraphQL and Relay, and then analyzed the Backend.AI code one by one to understand how they were applied and functioning in Backend.AI.
Pebble Seminar preparation materials
By analyzing the code while preparing for the Pebble Seminar, I naturally came to understand the code running in the WebUI, and this greatly helped me in finding and resolving issues during the subsequent work.
Resolving Backend.AI issues and implementing features
After completing the onboarding, I finally joined the frontend team and started resolving Backend.AI issues and implementing features. I had a coffee chat with the frontend lead to define the categories of work for this internship period:
- Creating a Table Column Setting component
- Researching E2E Testing
- Daily tasks
During the 8-week internship period from November to December, I created a total of 19 Pull Requests, of which 18 were merged, and 1 is still under review. Since I had experience finding and assigning issues during the Contribution Academy, I had less difficulty with it, and because I enjoyed resolving issues, I was able to resolve more issues than others.
Feature Addition PRs
1. Implementing Table Columns Setting
https://github.com/lablup/backend.ai-webui/pull/2071
This was one of the issues I aimed to work on during the internship period. It was the only component that I conceived and implemented from scratch during the fall internship, rather than refactoring an existing component. Before implementing this feature, I thought it was a simple task that I could finish quickly, but things turned out differently from my expectations.
First, I realized that I had been thinking too simplistically about creating new components. Although I had always thought about which props a component should receive before building it, this issue taught me that when creating a new component I should invest more time and effort in designing for extensibility, and that I should pay more attention to how other sites design and implement similar features.
Table Columns Setting
2. Adding service endpoint and owner columns to the Model Serving page table
https://github.com/lablup/backend.ai-webui/pull/2047
Previously, when creating a model service, users had to go to the detail page to check the endpoint, which is a frequently used feature. So there was a request to add the endpoint to the table column. Additionally, since the admin account can see services of users in the same group, there was a suggestion to have a column showing the service owner. Since the GraphQL fields for retrieving this data were already implemented, I added the fields to the query to fetch the endpoint and service owner data, and then added columns to the table to display the data. The owner column is only shown for admin accounts.
Implementation view. Screen for admin account (left) and user account (right)
3. Disabling log button for sessions in CANCELLED state
https://github.com/lablup/backend.ai-webui/pull/2045
The CANCELLED state means that the container has never been created or failed to be created. Previously, the log button was enabled even for sessions in the CANCELLED state, and if a user clicked the log button, the agent could not find the container information, resulting in a 500 error. In this PR, I made it so that the log button is disabled for sessions in the CANCELLED state, preventing users from clicking it.
Session in TERMINATED state (session 1) and CANCELLED state (session 2)
4. Testing and creating a custom hook for dark mode
https://github.com/lablup/backend.ai-webui/pull/2120
Before implementing dark mode, I found components with hardcoded colors and implemented a custom hook named useThemeMode for applying dark mode. When creating the custom hook, I tried to use the useLocalStorageState hook from ahooks, but contrary to my expectations that it would automatically manage states with the same key value, I found that they operated independently. To handle states with the same key value automatically updating when the value changes, I added a custom hook named useLocalStorageGlobalState, and then used that to create the useThemeMode custom hook for setting the dark mode.
Bug fix PR
1. Allowing signup without invitation token
https://github.com/lablup/backend.ai-webui/pull/2046
In the config.toml, when the allowSignupWithoutConfirmation option is set to true, users can sign up without an invitation token. However, when a user clicked the sign up button, an error occurred because the token value was undefined. In this PR, I modified it so that if allowSignupWithoutConfirmation is true, the token variable is not used. Additionally, previously users could modify other input values while the core was processing the data after clicking the sign up button, and the previous data remained when the dialog was closed and reopened. In this PR, I made it so that other input values cannot be entered while data is being processed, and the previous input values are cleared when the dialog is closed.
2. Displaying the correct screen for the selected sub-tab on the user management page
https://github.com/lablup/backend.ai-webui/pull/2055
On the user management page, there are sub-tabs for displaying active users and deactivated users. However, if a user navigated to another page and then returned to the user management page, even though the sub-tab was set to inactive, the screen displayed the list of active users, causing confusion. This PR resolved that issue by remembering the current sub-tab when navigating to another page, and displaying the appropriate screen for that sub-tab when returning to the user management page.
Before (left) and after (right) the fix
Extending the internship
As I resolved issues, the 8-week period flew by, and it was time to wrap up the fall internship.
Working at Lablup, my first job after being discharged from the military, was an important period for me. During the internship I was able to see my strengths and weaknesses, what skills I still needed to build, and how other developers work. The two months felt very short, and because I had enjoyed the work so much, I wanted to keep going. So I told the lead I wanted to extend the internship, and we agreed to extend it for another 8 weeks, until February. During the fall internship I had thought a lot about my weaknesses but had not found my strengths, so I started the extension with these 3 personal goals:
- Find my strengths during this period
- Read the documentation whenever I have time
- Work even harder, leaving no regrets
Resolving issues and implementing features during the extended period
The work during the extended period did not differ much from before. Without the onboarding process and installation, I could focus more on resolving issues.
Feature Addition PRs
1. Refactoring ErrorLogList
https://github.com/lablup/backend.ai-webui/pull/2131
I refactored the ErrorLog List, which was previously implemented using Lit elements, to React. This feature was the most satisfying issue for me, as I personally use it frequently after the refactoring.
Before (left) and after (right) refactoring
During the refactoring, new Search and Error filter features were added.
Added Search feature (left) and Filter feature (right)
2. Modal drag functionality
https://github.com/lablup/backend.ai-webui/pull/2179
I used the React-draggable library to add functionality for modals to be dragged. By adding the Draggable prop to a modal, it can be applied to any modal that requires dragging.
Draggable Modal
By clicking the icon on the left side of the modal title and moving the mouse, the modal can be moved to the desired position on the screen.
Currently, it is applied to the modal for viewing user information on the user management page and the modal for changing user settings, where you can check it.
While it is not being used in many places yet, I think this PR will be useful as more components and features are added.
Bug fix PR
1. Modifying Vfolder invitation permissions
https://github.com/lablup/backend.ai-webui/pull/2143
There was an issue where the user permissions for group vfolders were not being updated. When trying to modify the permissions, the items were not displayed or selectable properly in the select. Previously, the items were being displayed using option tags, but I changed it to use mwc-list-item to display the items and modified the overflow option to resolve this issue.
Before (left) and after (right) the PR
2. ResourceGroupSelect extending outside the card
https://github.com/lablup/backend.ai-webui/pull/2166
There was an issue where the ResourceGroupSelect value would be displayed outside the card if it was too large.
Symptoms of the issue
To resolve this issue, I set the max-width CSS on the Select component so that it cannot exceed the width of the card.
Additionally, in this PR, I added a Search feature to the Select component, for which I used the useControllableValue hook from ahooks. The useControllableValue hook helps manage props from either the parent or itself. While it was a simple PR, it took me longer than expected since it was my first time using useControllableValue. I was able to resolve this issue with the help of the lead and another intern.
3. Key pair list not showing when clicking the generate & manage key pair button on the summary page
https://github.com/lablup/backend.ai-webui/pull/2194
On the summary page, there are buttons for "Generate New Key Pair" and "Manage Key Pairs." However, when clicking these buttons, instead of showing the key pair list, it simply navigated to the user management page, displaying the user list.
"Generate New Key Pair" and "Manage Key Pairs" buttons on the summary page
When clicking the "Generate New Key Pair" button (left) and when clicking "Manage Key Pairs" (right)
While this issue was not critical, I resolved it because I had experienced a lot of confusion when I first used Backend.AI and didn't fully understand the key pair feature.
After resolving this issue, I could confirm that the key pair list was displayed on the screen as intended.
After resolving the issue, when clicking the "Generate New Key Pair" button (left) and when clicking "Manage Key Pairs" (right)
Completing the Internship
Thanks to the Contribution Academy, which I started on a friend's recommendation after being discharged from the military, I was able to contribute at Lablup for an extended period. Since I had no previous internship or project experience at other companies, this was a very important period for me as I started over after my discharge. It was great to discover my strengths and weaknesses, the skills I lack, and the culture of an open-source company at Lablup. How many companies make you want to go to work every day, with a horizontal structure, a free atmosphere, a pleasant working environment, and good equipment? Although I worked at Lablup for only 4 months, I felt like going to work every day, and I felt that at Lablup I could keep doing interesting, worthwhile work for a long time. Over those 4 months I also grew fond of Backend.AI, the product Lablup builds, and I plan to attend the conference Lablup hosts every year whenever possible to follow its advances and technologies.
Lablup Office
This post was also published on the author's personal blog. https://gee05053.tistory.com/32 This post is automatically translated from Korean
11 March 2024
2023 Summer Intern in Lablup
By Dongjin Park
Overview
I applied to CUop, a collaboration between universities specializing in science and technology, and worked as an intern at Lablup for 8 weeks.
I wrote about my experiences through onboarding, developing Backend.AI, and attending PyCon.
Motivation for applying
I first learned about Lablup through a PyCon presentation session I stumbled upon. I could tell that the company had a lot of technically deep and passionate members. I applied to Lablup because I was interested in Python and asynchronous programming.
Onboarding
During the first two weeks, we conducted an onboarding process.
We went through the Realtime Web Chat implementation, Backend.AI development environment, and code base seminar.
Realtime Web Chat
This was an assignment to get familiar with Python asyncio: developing a real-time chat app using aiohttp, an asynchronous web framework, and Redis, an in-memory database. The task also included setting up docker compose to build Python and Redis together. For more information, see the GitHub README.
To broadcast messages, we used Redis pub/sub, which simply delivers messages without storing them; since we had no requirement to store messages, it was a good fit. We also registered the Redis pub/sub reader as a task with asyncio.create_task() so that it runs on the event loop.
Realtime Web Chat launch screen
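As a small, hypothetical illustration of that pattern (the channel name and app keys are made up, not the assignment's actual code), the pub/sub reader can be started with asyncio.create_task() from an aiohttp cleanup context so that it lives for the whole application:

```python
import asyncio
import contextlib

import aioredis
from aiohttp import web


async def pubsub_reader(app: web.Application) -> None:
    pubsub = app["redis"].pubsub()
    await pubsub.subscribe("chat:lobby")  # hypothetical room channel
    async for msg in pubsub.listen():
        if msg["type"] == "message":
            # Fan the message out to every connected WebSocket.
            for ws in app["sockets"]:
                await ws.send_str(msg["data"].decode())


async def background_pubsub(app: web.Application):
    # cleanup_ctx: code before `yield` runs at startup, code after it at shutdown.
    task = asyncio.create_task(pubsub_reader(app))
    yield
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task


app = web.Application()
app["redis"] = aioredis.from_url("redis://localhost:6379")
app["sockets"] = []  # WebSocketResponse objects added by the chat handler
app.cleanup_ctx.append(background_pubsub)

if __name__ == "__main__":
    web.run_app(app)
```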
Through this assignment I came to understand the basic behavior of asyncio, and whenever I hit a difficult part I could ask questions and work through it. I think Lablup offers a great environment for interns and junior developers to grow, because you can freely ask questions through Microsoft Teams.
Build the Backend.AI development environment
I installed Backend.AI on a VM Farm and a local VM and tried to run it myself. I read the official documentation, but the process was not smooth. I encountered various errors, which I solved by sharing with my fellow interns and asking questions on Teams.
💡 A VM farm is an environment where virtual machines are managed and run. Lablup has its own VM Farm.
This was my first experience developing on a VM Farm with VSCode connected via SSH. Developing Backend.AI means running multiple processes (Manager, Agent, Storage Proxy, Web Server, etc.) and Docker containers, which drains a laptop battery quickly. With the VM Farm, all you need is an SSH connection, so development stays lightweight; in fact, I was able to work for long stretches on the VM Farm while out of the office and unable to charge my laptop.
Code Base Seminar
After looking at the code, focusing on the difficult parts of Backend.AI, I prepared a seminar based on my understanding. I was in charge of presenting the Manager, Agent, and GraphQL parts.
Since Backend.AI is open source, the official documentation is well written. I studied the architecture of Backend.AI by reading the official documentation to understand the overall structure and by asking employees directly when I had questions. Since the topic of my presentation was session/kernel creation and scheduling control in the Backend.AI Manager, I studied the manager code and analyzed the logs of the manager process.
A sequence diagram I drew in preparation for the seminar presentation
Analyzing the logs, I found a bug that caused the session state to change from Preparing back to Pulling. It was rewarding to analyze the logs one by one. It was difficult to analyze the logs in order because of the asynchronous code base, but drawing a call graph and a sequence diagram was very helpful.
Develop Backend.AI
After onboarding, I started working on Backend.AI. I worked on GitHub issues and volunteered to help or found and resolved issues myself.
I created 9 pull requests in the Backend.AI repository and 2 pull requests in the Backend.AI-WebUI repository, and they were all merged!
I chose high-priority issues that I was confident in addressing. I wanted to make as many contributions as I could in the short time frame of two months.
First PR
https://github.com/lablup/backend.ai/pull/1395
I created a PR to fix a bug I found while preparing for a seminar. It was an easy PR to fix the API parameter. However, it was a good experience to learn about branch name convention, commit convention, and news fragment, and to experience the CI (Continuous Integration) process with GitHub Actions and to do some Git-related shoveling beforehand.
💡 A news fragment is a one-sentence Markdown description of what the branch created by the PR is trying to do. You want to keep it simple and clear so that if you see this PR again in the future, you'll know what it was trying to do.
PRs for vfolder
When I heard that Teams had an open issue for an intern, I jumped at the chance to apply. I had to learn a new concept called vfolder, but I knew it would be important to get to know the product.
PR (1)
https://github.com/lablup/backend.ai/pull/1397
Only admins can create a project-type vfolder, and it should be possible to create one regardless of the max_vfolder_count in the keypair resource policy. However, if the number of user-type vfolders exceeded max_vfolder_count, a project-type vfolder could not be created. At first I was confused by the terminology, but by analyzing the code and asking questions I was able to make sense of it.
PR (2)
https://github.com/lablup/backend.ai/pull/1400
Fixed new bugs discovered while addressing PR (1).
PR (3)
https://github.com/lablup/backend.ai/pull/1417
I found an issue related to PR (1). DB migration and GraphQL were new to me, but I wanted to try them, so I volunteered. I used a DB migration tool called Alembic, studied GraphQL schema concepts, and modified the query and mutation code to keep backward compatibility. I tried using cURL to test the modified code, but GraphQL requests are much longer than REST API calls, which was cumbersome. I wrote the test code while asking questions of the interns and employees who were familiar with GraphQL, and ended up writing a small Python CLI to test the modified query and mutation more easily.
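To give a flavor of what I mean by a small test CLI, here is a hedged sketch, not the actual script: the endpoint URL and the query below are made-up placeholders rather than Backend.AI's real GraphQL API, and real requests would also need authentication headers. Posting the query as JSON with aiohttp is much less painful than assembling the equivalent cURL command.

```python
import asyncio
import json
import sys

import aiohttp

# Hypothetical query; the real schema and field names will differ.
QUERY = """
query ($id: UUID!) {
  vfolder(id: $id) { name usage_mode }
}
"""


async def main(vfolder_id: str) -> None:
    async with aiohttp.ClientSession() as sess:
        async with sess.post(
            "http://127.0.0.1:8081/admin/graphql",  # placeholder endpoint
            json={"query": QUERY, "variables": {"id": vfolder_id}},
        ) as resp:
            print(json.dumps(await resp.json(), indent=2))


if __name__ == "__main__":
    asyncio.run(main(sys.argv[1]))
```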
WSProxy related PRs
I volunteered for an issue posted in Teams. There was a bug in the WebUI where a session could not be deleted if the wsproxy address of the resource group was invalid. I also wanted to get some experience with WebUI development.
PR (1)
https://github.com/lablup/backend.ai/pull/1423
I read through the WebUI code to troubleshoot the issue, but I couldn't quite grasp the concept of wsproxy. I learned that wsproxy has v1 and v2, but it was not easy to understand the difference between them, so I asked the employees. The main difference is the path of the traffic: in v1 communication with the container goes through the manager, while v2 can talk to the container directly without going through the manager, which is faster. Once I understood what wsproxy does and the difference between v1 and v2, the code became much easier to follow, and I found that many people didn't know the difference either; questions that seem too easy to ask may never have been asked inside an organization.
PR (2)
https://github.com/lablup/backend.ai-webui/pull/1819
I also modified the WebUI code to fix the issue. To fix the JavaScript code, I learned about callback functions, promise objects, and async/await. I handled errors so that they would not affect other logic, and extracted duplicated code into functions.
PR (3)
https://github.com/lablup/backend.ai-webui/pull/1833
Since the WebUI needs to stay backwards compatible with Backend.AI 22.09, our CEO's review pointed out that it should also handle HTTP status 404, so I made it handle both 404 and 500.
PR (4)
https://github.com/lablup/backend.ai/pull/1466
However, after the code was merged, a bug occurred: when setting up v1 wsproxy, the return value for wsproxy-version disappeared. This happened because I modified core code without handling all the branches. I fixed it quickly, but it was a simple mistake, and I realized I should write test code to prevent such mistakes.
PRs related to Manager
https://github.com/lablup/backend.ai/pull/1444
In preparation for the seminar, I came across an issue with manager that I had been studying. With my internship coming to an end, I thought I could contribute by resolving the issue on the code I knew best.
This PR was heavily reshaped by code reviews. Initially, I designed the scheduler health check and the scheduler trigger as a single API; after the reviews, I split them into separate APIs to separate responsibilities. We originally stored health information only for the schedule function, but to get a complete picture of the scheduler's health we also stored it for the prepare and scale_services functions, the three functions that run periodically on the scheduler's global timer, and built the trigger API on top of that. We also changed the design to store the scheduler's health per manager ID, because there may be multiple manager processes.
The storage for the scheduler state was also reviewed. Initially, I followed the existing manager state API, which stores manager state in etcd, and did the same for the scheduler state. etcd is good for configuration data that must be kept consistent, but it is slow to write; Redis is volatile but performs well under many reads and writes. Since the scheduler state is read and written periodically and does not need strong consistency, we switched to storing it in Redis.
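A rough sketch of the idea (the key layout, fields, and aioredis usage here are illustrative, not Backend.AI's actual implementation): each periodic scheduler function writes a heartbeat into a Redis hash keyed by the manager ID, and a health/trigger API reads the hash back to get the full picture.

```python
import json
import time

import aioredis


async def mark_scheduler_run(redis: aioredis.Redis, manager_id: str, func: str) -> None:
    # func is one of the periodic scheduler functions,
    # e.g. "schedule", "prepare", or "scale_services".
    await redis.hset(
        f"manager.{manager_id}.scheduler",   # hypothetical key layout
        func,
        json.dumps({"last_run": time.time(), "ok": True}),
    )


async def read_scheduler_health(redis: aioredis.Redis, manager_id: str) -> dict:
    # Returns e.g. {"schedule": {...}, "prepare": {...}, "scale_services": {...}}.
    raw = await redis.hgetall(f"manager.{manager_id}.scheduler")
    return {k.decode(): json.loads(v) for k, v in raw.items()}
```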
Agent-related PRs
https://github.com/lablup/backend.ai/pull/1472
Now that I had a good understanding of the Manager part of Backend.AI, I wanted to understand another important component: the Agent. I came across an issue about the Agent, so I took a look.
While Backend.AI was running, there was a bug where the internal state of the Agent did not match the state of the actual working container. As a result, when creating a session, an InsufficientResource Error was thrown during the resource allocation process, even though there were actually enough resources. When the error occurred, we needed to improve the logging to understand what went wrong during the resource allocation process.
It took a while to figure out the resource allocation process. The concurrency issues were difficult, and it took a lot of Q&A with the CTO to get a general idea of the flow and what to log.
A few weeks after my internship ended, the CTO merged it with more than 10 additional commits adding refactorings and test code. What impressed me was that he wrote test code to reproduce the error. To reproduce it myself I had to go through a complex manual process (see the PR), which ate a lot of development time; I could see the difference in productivity right there. Of course, I had thought about writing test code, but the implementation seemed too complicated and I worried my internship would end before I finished it. In the future I shouldn't be intimidated by writing tests; I should just try it and learn as I go. The refactorings also focused on readability: the functions in the part I modified had been too long and hard to read, but afterwards they were shorter and the logging was cleaner. I realized I shouldn't stop at making something work, but should try to write good code.
Attending PyCon
On August 12 and 13, Lablup had a booth at PyCon. Companies that sponsor PyCon are given the opportunity to run a booth. Although I was an intern, I wanted to take part in the booth activities and listen to some of the talks, and since the company had some PyCon tickets left over, I was able to participate.
At the Lablup booth, we ran an event challenging attendees to get Llama 2 to print a 10-line pyramid with a prompt. It wasn't as easy as it sounds; the key was explaining the task in a way Llama 2 could understand. Two lucky people who submitted correct answers were drawn to win a Nintendo Switch and a Logitech mouse. My role at the booth was to guide PyCon attendees to the event, and since many employees were at PyCon, I was free to attend any talks I wanted to hear. As an open-source company, Lablup encourages contributing to open source and participating in conferences; in fact, four of our members gave talks at this PyCon, which shows how much we value it.
Lablup Inc. booth
During the RustPython session, a tool called ruff was introduced as a replacement for the Python lint tools flake8 and isort. Because ruff is written in Rust, it is around 100x faster than flake8. At Backend.AI we had been using flake8 and isort for linting, but after reviewing ruff our CTO adopted it for the Backend.AI project right there on the stairs of COEX. I was struck by how fluent he is with the whole process around coding, applying a new tool to the project in a short time and even updating the official documentation. It made me want to become that kind of proficient developer someday. After PyCon, I read the updated documentation, applied ruff to my Backend.AI development environment, and experienced the dramatically faster linting myself. If I hadn't attended PyCon, I wouldn't have caught up with such a great tool so quickly, and I hope to keep attending developer conferences.
Group photo with members of Lablup Inc.
End of internship
During my internship, I tried to get as much experience as possible, and I wanted to contribute a lot. In the end, I was able to experience a lot because I tried to contribute a lot. I was quick to volunteer for issues that came up in Teams, so I was able to understand the core components of Backend.AI: vfolder, wsproxy, web-ui, manager, and agent. I also learned new concepts like DB Migration, GraphQL, and etcd. Although it was a bit physically demanding to attend the conference from morning to evening on the weekend, it was fun to listen to more than 10 presentation sessions, get inspired, and meet various people through booth activities.
During my internship, I actively asked questions about anything I didn't understand, which helped me solve issues quickly. I think I was able to ask so many questions because of the horizontal culture and because there were many people kind enough to answer them. I'd like to take this opportunity to thank the members for their support.
I was able to experience a variety of things, including asynchronous programming experience, GitHub collaboration, presenting English seminars, and attending conferences. I feel like I've grown a lot as a developer through this program. I recommend the Lablup internship to anyone who is thirsty for growth.
This post is automatically translated from Korean
22 November 2023
2023 Lablup DevOps Summer Retrospect
By Gyubong Lee
In this post, I'll share my experience as a developer at Lablup over the past 9 months.
Table of Contents
- Motivation to apply
- From Intern to DevOps!
- rraft-py Development
- Open Source Contribution Academy Regional Sprint Backend.AI Mentoring
- Attending various conferences
- 2023 Open Source Contribution Academy
- Presenting at PyCon
- Conclusion
Motivation to apply
Even before I joined Lablup, I knew that I wanted to have a career where I could continue to help others through the programs I develop, whether as a hobby or during work hours.
Open source was particularly appealing to me because it meant that not only could my code help others, but that they could freely modify and utilize it if they wanted to.
One of the things I realized after working on my own project, Arvis, for my graduation project, is that it's not really easy to keep a project going simply because it's something I love to do, as it keeps growing in size. I tried to plan and execute the project carefully from the beginning, but in the end, I realized that I underestimated the time and effort required to maintain the project.
In that regard, Lablup, which actively encourages and supports open source-related activities and even develops core parts of its source code as open source, was the company of my dreams.
From Intern to DevOps!
The last three weeks of my internship at OSSCA Lablup were spent studying and researching distributed systems, specifically implementing the Raft algorithm. Although my job title changed from intern to DevOps, I still felt I was building on what I had learned during the internship, including Raft, to solve the issues I had worked on then.
I've been involved in a variety of other activities that I'll mention below, but my main work at the company so far has been writing rraft-py, a Python binding of a Raft algorithm implementation intended to replace the existing distributed lock structure, and thinking about how to integrate it with Backend.AI.
rraft-py Development
rraft-py is a Python binding of tikv/raft-rs, and you can read more about it in the GitHub README / Wiki. I'll also be presenting some technical details on this topic in my PyCon KR 2023 talk next month, if you're interested. For now, I'm going to focus on my experience as a Lablup developer, leaving aside the technical details of what I learned while developing rraft-py.
I had to think a lot about rraft-py because it was not just about fixing an issue in Backend.AI, but also about creating a separate project and integrating that project with Backend.AI.
Overall, there were several milestones in the project, and after each one I felt I could move forward with a little more stability. Each brought a real sense of accomplishment, but there were also many times I was frustrated to realize later that code I had written earlier didn't work the way I intended. Lablup gave me the time for that kind of shoveling, and I think I've gotten to where I am today because of things I learned that I might otherwise have dismissed as wasted digging.
Results of running the rraft-py example code
There's still a long way to go to integrate rraft-py into Backend.AI, but the bottom line is that it's great to have the experience of thinking for yourself and making your own decisions as you continue to evolve your project, and for developers who like this kind of experience, Lablup could be one of the best options out there.
Open Source Contribution Academy Regional Sprints Backend.AI Mentoring
While rraft-py development was my main focus, as it required more time than I had anticipated, I also had the opportunity to work on a variety of other projects.
One of the most memorable experiences was participating in the 1st Daegu Open Source Contribution Academy Regional Sprint as a Backend.AI mentor.
In fact, I participated as a mentor without a deep understanding of Backend.AI, and to make matters worse, the sprint period was only 2 days, so I was worried about many things.
To make sure the mentees would each learn at least one thing and go home as satisfied as possible, I had to think hard about how to explain Backend.AI to people who didn't know it at all, and how to set up the development environment on different platforms (I usually develop only on macOS with Docker Desktop, but some of the mentees were on Windows, so I had to do some shoveling of my own while building their development environments). There was a lot to think about and prepare.
In the end, I learned far more than I expected precisely because these processes were unfamiliar to me, and the mentees followed along better than I had anticipated, so I think it was a meaningful time in which everyone got at least one PR out.
The 1st Daegu Open Source Contribution Academy Regional Sprint
Participation in various conferences
We had the opportunity to participate in various conferences and exhibitions such as AI Expo, AWS Summit, and Next Rise. It was great to learn how to explain Backend.AI to different types of people, and it was also interesting to see the different technologies of other companies.
AI EXPO KOREA 2023
2023 Open Source Contribution Academy
As a company with an open-source culture, Lablup actively participates in the Open Source Contribution Academy every year. I took part again this year; since Lablup encourages involvement in projects other than its own Backend.AI, I worked on GlueSQL as a mentee.
I think this culture of freedom is very attractive to developers with a strong desire to grow.
(In addition to myself, there are two other people involved in other projects in the 2023 Contribution Academy).
Presenting at PyCon
Based on my experience in developing rraft-py at my company, I was also given the opportunity to present at 2023 PyCon KR.
Personally, I'm a bit nervous because it's my first time presenting in public, but I'm doing my best to prepare. For anyone who is interested, I look forward to sharing not only the presentation materials but also the source code and work history on GitHub.
Conclusion
Lablup is a company with a strong open source culture, encouraging participation in various open source and community-related events such as the Open Source Contribution Academy (https://www.oss.kr/contribution_academy) and PyCon, and giving developers the opportunity to take initiative in their work.
I hope to continue to participate, learn, grow, and contribute to open source activities of various nature at Lablup.
This post is automatically translated from Korean
18 July 2023
Learning from history
By Sanghyeon Seo
ChatGPT and the other giant language models that have gained traction over the last half-decade didn't just fall out of the sky. We've seen it many times throughout history: the cumulative development of a technology reaches an inflection point and rapidly transforms society. And sometimes the path to that inflection point is strikingly similar for technologies that evolved in very different times and contexts.
Mankind has long wanted to fly - it's no coincidence that Civilization 6 has "Dream of Flying" as its theme song.
How did the Wright brothers realize their dream? In 1899, Wilbur Wright wrote to the Smithsonian Institution asking for everything known about airplanes. After three months of reviewing the material he received, Wilbur concluded that not much was known about flight beyond the fact that it was an unsolved problem. Plausible theories had turned out to be false and improbable theories had turned out to be true, he wrote, so he could not believe anything he had not seen for himself.
What Wilbur wanted to know from his literature review was this: what do we need to know about flying? What of it is already known? What are the remaining problems that need to be solved? Surprisingly, Wilbur was able to answer all of these questions from his literature review, something his competitors were unable to do.
In a 1901 lecture, Wilbur summarized his conclusion: "There are three problems with flying. You need wings to make the airplane float, you need an engine to make the airplane go, and you need a way to control the airplane."[^1]
[^1]: Some Aeronautical Experiments (Wilbur Wright, 1901).
Wilbur saw that the wing problem and the engine problem had been solved to a reasonable extent, so what remained was the control problem. To solve the control problem he needed an airplane, and to build an airplane he needed to solve the control problem. Wilbur concluded that he could break the circle by first solving the problem of controlling a glider.
To test gliders he needed high hills and strong winds, and for the experimenters' safety the hills had to be sand dunes. In 1900, Wilbur requested data from the Weather Bureau to find the windiest places in the United States. The staff at the Kitty Hawk weather station wrote back that the beach next to the station was unobstructed and would be suitable for the experiments.[^2]
[^2]: Letter from J. J. Dosher, Weather Bureau, to Wilbur Wright, August 16, 1900.
The 1901 experiments were a disappointment: the wings didn't generate enough lift. The Wright brothers had used Otto Lilienthal's data to calculate the wing area, and they grew suspicious of its accuracy.
After analyzing their experimental data, they concluded that John Smeaton's value of the proportionality constant, which had been used without question for over 100 years, including by Otto Lilienthal, was incorrect.
To analyze wing lift systematically, without time-consuming and laborious glider experiments, the Wright brothers built a wind tunnel. Their analysis showed that Otto Lilienthal's data was correct except for the wrong value of the proportionality constant, but that the wing shape Lilienthal had used was inefficient.
The 1902 glider that resulted from this analysis had a larger area, to offset the revised value of Smeaton's constant, and a flatter shape to increase efficiency. (They changed the camber of the wing from 1/12 to 1/24.) The new glider flew very well.
That's how the Wright brothers were able to make their historic first flight at Kitty Hawk in 1903.
Humans have long wanted to talk to machines - countless science fiction novels and movies bear witness.
To create AI, OpenAI had to solve three problems: computing infrastructure, models, and data. You can think of the computing infrastructure as the engine, the models as the wings, and the data as the controls.
To manage the compute infrastructure, OpenAI used Kubernetes, but it wasn't something they could just grab and go. When they hit 2,500 nodes, they ran into problems with the Linux kernel ARP cache overflowing,[^3] and when they hit 7,500 nodes, they had to fix a bug in Kubernetes to enable anti-affinity.[^4]
[^3]: Scaling Kubernetes to 2,500 nodes, January 18, 2018.
[^4]: Scaling Kubernetes to 7,500 nodes, January 25, 2021.
(Advertisement: Lablup's Backend.AI has already been used in practice on large clusters for AI training and inference, solving a number of scaling problems and implementing its own scheduling algorithms to support features like affinity and anti-affinity.)
The scaling law plays the same role for AI that the lift equation played for the airplane. Just as the lift equation expresses the lift of a wing in terms of the wing area, the lift coefficient, and a proportionality constant, the scaling law expresses the loss of an AI model in terms of the model size, the data size, and a power-law exponent.
Just as the Wright brothers discovered that John Smeaton's proportionality constant was 0.003 rather than 0.005, the power-law exponent of the scaling law was initially thought to be 0.73,[^5] but was later found to be 0.50.[^6] The incorrect value had been calculated because the learning rate was not adjusted to the size of the data.
[^5]: Scaling Laws for Neural Language Models, January 23, 2020.
[^6]: Training Compute-Optimal Large Language Models, March 29, 2022.
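To make the analogy concrete, here is one common way to write the two relationships; the notation is the generic textbook form rather than the exact formulas of the cited papers.

```latex
% Lift equation in the form the Wrights used: k is Smeaton's coefficient,
% S the wing area, V the velocity, and C_L the lift coefficient.
L_{\text{lift}} = k \, S \, V^{2} \, C_L

% Parametric scaling law for language-model loss (Chinchilla-style form):
% N is the number of parameters, D the number of training tokens.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

% Compute-optimal model size as a power law in the compute budget C;
% the exponent a is the value discussed above (~0.73 initially, ~0.50 revised).
N_{\mathrm{opt}} \propto C^{\,a}
```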
OpenAI knew that controlling the model would be an important problem, so even before training its first GPT it was already working on reinforcement learning from human preferences,[^7] which it first applied to controlling robots, reminiscent of the Wright brothers drawing on bird flight for control and applying it first to gliders rather than airplanes.
[^7]: Deep reinforcement learning from human preferences, June 12, 2017.
To apply this research to language models, human preference data was collected, resulting in InstructGPT.[^8] It is hard to know exactly what has happened since, because OpenAI has not published its research since GPT-4, but other work shows that models can learn not only from explicit feedback but also from implicit feedback, such as users retrying a generation or continuing versus abandoning a conversation.[^9] If so, OpenAI could create a positive feedback loop in which better models attract more users, and more users provide the feedback that further improves the models.
[^8]: Training language models to follow instructions with human feedback, March 4, 2022.
[^9]: Rewarding Chatbots for Real-World Engagement with Millions of Users, March 10, 2023.
In this article, we've compared how humans learned to fly with how we are teaching machines to talk, and we've seen how similar the patterns of technological evolution can be throughout history.
What other examples will we see as AI technology advances and we work to make it accessible to more people? Can Lablup and Backend.AI help accelerate that process, allowing people to experiment and realize what we've learned from history more quickly? We're in the middle of this inflection point.
This post is automatically translated from Korean
12 July 2023
Developer Advocate in Lablup
By Jongmin Kim
The content of this post is based on my own personal experience, which is not necessarily true in all cases, and opinions may vary.
What is DevRel?
Image by Henning Westerkamp from Pixabay
As the name implies, DevRel's primary purpose is to build relationships with developers. This includes not only the developers who write the code, but also the planners, designers, and everyone else involved in creating services and products. The specific role of DevRel varies a lot depending on the organization and the industry, but the objectives usually converge on promotion (of the product or technology) or recruiting.
The primary roles of a DevRel are:
- Promoting the product and evangelizing its technology
- Building, running, and supporting communities
- Gathering user input and feeding it back into product development
- Providing a variety of resources to help users better use the product
A Developer Advocate can be seen as a role within DevRel that focuses a little more on technical evangelism and providing resources.
Why DevRel?
To make their products and services known, companies conduct marketing and public relations activities through various channels and events. These activities are called PR, Public Relations, which literally means building a relationship with the public. For companies that develop or operate IT products, the main consumers are engineers (developers), and products aimed at engineers often need more detailed, technical information, which requires different resources and strategies than general PR. Hence DR: Developer Relations, as the name suggests.
The community is arguably the most important driver of an IT ecosystem, especially for open source products like Backend.AI. Moreover, when new technologies of a similar kind come along, the size of the community is often a factor in choosing and adopting them, in addition to performance or maturity. One of the important roles of DevRel is to help and engage with these communities.
A great analogy I like to use to explain the importance of community is the relationship between an artist and their fan club. An artist's primary role is to create music and perform, but it's through their fanbase that viral and secondary creations are born, and it's an important driver of their career and the entertainment business. Nowadays, entertainment companies are well aware of the importance of fandom and support various activities.
Image created by Midjourney
Similarly, the DevRel role in an IT organisation is to support and communicate with the various communities and ensure that the product and organisation don't lose focus.
Developer Advocate in Lablup
Lablup has long been active in the Korean developer community through Wooyoung Yoo, and our CEO, Jeongkyu Shin, is also active in various developer communities, so for an organization of our size we have already done a lot of DevRel work. Backend.AI is a great piece of software and an important tool, especially in this era of large-scale AI. However, as a tool for AI training and inference it requires a lot of preparation and resources to set up, which can be daunting, especially if you're new to AI (which I am 😅).
Since Backend.AI does not yet have a large user base, we have been running community activities such as the Open Source Contribution Academy (https://www.oss.kr/contribution_academy) around the theme of open-source culture. For our own community to grow, information needs to be produced and shared among users, not just transmitted one way from creators to users. My main task is to keep creating and improving the kinds of resources and exchange opportunities that make this possible and that anyone can easily use. I will continue striving to communicate with you in many ways and on many occasions.
This post is automatically translated from Korean
24 March 2023
Recap of my Lablup Internship, Summer 2022
By Sion Kang
Introduction
My first encounter with 'Lablup' was in the summer of 2019. At that time, I attended a GDG Seoul 'Everyone's Toy Story' event because someone I knew was presenting there. I had the opportunity to sit in on a presentation by Lablup about their GPU virtualization tools for machine learning, and that piqued my interest. At that time, my fascination with machine learning was at its peak, and the technical depth of their presentation was impressive—it was my introduction to a company pioneering in this field.
Subsequently, I reconnected with the company at the 42 Seoul open source hackathon, an event focused on building a product with a designated open-source project within a limited timeframe, where I joined the Backend.AI team. Although three years had passed since that first presentation at GDG Seoul, the impression it left was so strong that the company's name was instantly familiar. Throughout the contest, the guidance provided by Jeongkyu Shin (CEO of Lablup) was instrumental, contributing significantly to our second-place finish.
In May 2022, I was working on a school-based internship at a company called SATREC INITIATIVE. As my internship was drawing to a close, I came across an announcement for Lablup's summer internship on Facebook. Recalling the valuable mentorship from Mr. Shin during a competition, and with a keen interest in the developer community and open source, I chose Lablup as my next destination.
At that time, I was working on a project named 42 World, which was a significant period of learning and growth for me. In my interview with Lablup, I described my experiences with the 42 World project, particularly the challenges I faced while adopting a monorepo. Interestingly, Lablup had run into similar monorepo issues in Backend.AI, so we could share a sense of empathy as developers during the conversation.
After being accepted into the Lablup internship, I started my internship with four other interns. I joined the company about a week later than the others to give myself time to finish my existing internship and relocate. My first week was dedicated to orientation, familiarizing myself with Backend.AI, and acclimating to the company's culture. The onboarding documentation was comprehensive, contributing to a welcoming environment for newcomers to quickly adapt. A significant portion of my orientation involved installing and configuring Backend.AI. Since I began a week after the other interns, they were able to assist me, which made the orientation process relatively smooth.
Getting to work
During the second week, I began tackling the actual tasks. I opted to collaborate with the DevOps, Frontend, and Research teams. Each chapter leader provided me with a 'good-first-issue' to address. My choice was the DevOps team. My initial task involved refactoring the 'run' command, which initiates a session and runs the specified code. The goal was to integrate 'start', to launch the session, and 'exec', to execute the code, thereby minimizing redundant code.
I faced challenges with the first issue assigned to me because, regardless of the implementation difficulty, it required a thorough understanding of Backend.AI's repository structure. I realized that knowing why an issue exists is important, but so is knowing exactly what Backend.AI is trying to accomplish and how the issue should be solved to serve that goal; only then can I solve it correctly.
After addressing the initial issue, I took on the task of testing a feature in development known as vfolder clone. Until then, my work had been focused solely on DevOps, so this was my first experience with Backend.ai-webui, a Frontend chapter project. My involvement wasn't limited to testing the vfolder clone; I also ran it myself, identified areas for improvement and bugs, and documented them as issues. This made me feel somewhat guilty, as it seemed I was generating work for other teams, but the Frontend chapter's encouragement to contribute eased my concerns. This experience reaffirmed my understanding that open-source contributions extend beyond code; there are many ways to contribute.
Improving CI/CD
As I've always been interested in CI/CD, I was intrigued by the GitHub Actions workflows used in Backend.AI. At the time, Backend.AI could bypass CI with the `skip:ci` label, but I noticed that neither the `skip:ci` nor the `skip:changelog` label took effect if a PR was labeled after creation, which required an extra commit. Since external contributors lack label permissions and Backend.AI is open source, resolving this seemed crucial. After exploring GitHub Actions, I discovered a trigger for labeling events, which allowed me to address the issue. My proactive approach of finding and fixing problems on my own, rather than waiting for assigned tasks, was highly appreciated by the company.
This experience deepened my interest in actions and led to further enhancements. Noticing that assignees were frequently omitted, I proposed automating assignment with the auto-auth-assign action, which I had used before. Next, I considered automating labeling itself. Although I was new to the labeler action, I ran multiple tests in a test repository before applying it to Backend.AI, successfully automating the labeling of the various components integrated into the monorepo. In the process, I saw the benefit of carrying labels over from issues to PRs. Unable to find an existing action for this, I created the auto-label-in-issue action myself, learning the GitHub API and action scripting along the way.
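As an illustration of the kind of action scripting involved, here is a minimal TypeScript sketch of how an action might copy labels from a linked issue onto its pull request, using the @actions/core and @actions/github toolkit packages. The input name, the issue-matching logic, and the overall flow are assumptions for illustration only, not the actual auto-label-in-issue implementation.

```typescript
// Hypothetical sketch: copy labels from a linked issue onto the current PR.
import * as core from '@actions/core';
import * as github from '@actions/github';

async function run(): Promise<void> {
  try {
    const token = core.getInput('github-token', { required: true });
    const octokit = github.getOctokit(token);
    const { owner, repo } = github.context.repo;

    const prNumber = github.context.payload.pull_request?.number;
    if (prNumber === undefined) {
      core.info('Not a pull request event; nothing to do.');
      return;
    }

    // Naive assumption for illustration: the PR body references an issue like "Fixes #123".
    const body: string = github.context.payload.pull_request?.body ?? '';
    const match = body.match(/#(\d+)/);
    if (!match) {
      core.info('No linked issue found in the PR body.');
      return;
    }
    const issueNumber = Number(match[1]);

    // Read the labels on the linked issue and apply them to the PR.
    const { data: labels } = await octokit.rest.issues.listLabelsOnIssue({
      owner, repo, issue_number: issueNumber,
    });
    const names = labels.map((l) => l.name);
    if (names.length > 0) {
      await octokit.rest.issues.addLabels({
        owner, repo, issue_number: prNumber, labels: names,
      });
      core.info(`Copied labels from #${issueNumber}: ${names.join(', ')}`);
    }
  } catch (err) {
    core.setFailed(err instanceof Error ? err.message : String(err));
  }
}

run();
```

Finishing my Internship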
This internship has been a significant learning experience for me. Although it's my second internship, it's my first within an IT company, as my previous employer wasn't in this sector. What draws me to Lablup is its commitment to open-source products and the active contributions to the community. The company's horizontal structure allowed me to freely share my thoughts, making me wonder if a company could really operate so collaboratively. One of the greatest aspects of Lablup is the autonomy it offers, allowing you to pursue what you're passionate about rather than what you're obliged to do.
Having developed a strong understanding of the project, I was disappointed when the internship neared its conclusion. Fortunately, Lablup offered me the opportunity to extend my internship, which I accepted, and continued to work on action issues. As there are few developers specializing in this area, I recently presented on the subject at GDG Daejeon. This led to the amusing nickname "Action Mask" from my colleagues.
I carried my internship experience over to the Open Source Contribution Academy, where I performed well, and I believe my time at the company laid the foundation for that.
19 December 2022
A memoir of two years as a software engineer at Lablup
By Jihyun Kang
Hello! 🙇🏻♀️
I am Jihyun Kang, a senior web frontend developer at Lablup.
Lablup is an 8-year-old startup with the motto "Make AI Accessible for anyone, anytime, anywhere!". I've been working here for about 3 years now: 3 months as an intern, then 2 years as a full-time employee after receiving an offer. I'd like to share my thoughts on my growth and experiences at Lablup.
Table of Contents
- First Impressions
- Tech stacks I've learned & soft skills
- Other things that I had achieved
- What I want to achieve
- Wrap up
- Extra: We don't just do work.
First impressions and the onboarding period
After switching my major from Sculpture 🗿⛏ → Computer Science 💻, I wandered through a long, dark tunnel and wasted some time. (In Korean, we call this 'endless shoveling', which describes doing meaningless work over and over.) After that period, I found Lablup. I walked through the door, and there was a new world in front of me.
There are three things I still remember from my first day's orientation.
- Respect is mandatory regardless of title; address people by name, not by job grade (except when introducing them to outsiders).
- We simply declare our vacations; there is no need to ask for permission.
- Work hours are 10 AM to 7 PM, with work-from-home and flexible schedules available.
Most importantly, I was free from the usual labels that follow you around at many companies, such as GPA, grades in my major, age, and school. Compared to my experience working as a contractor in the public sector, the freedom here and the emphasis on the work itself were quite intriguing. I was also eager to learn how work is distributed and carried out, and thankfully I gained clarity on that within a month.
There's a work-life balance game question that's been making the rounds for a while.
For me, Lablup was definitely the former. Every day was thrilling: new technical terms, wondering what on earth was being said in the daily all-hands meetings, weird errors that appeared only in my local environment when I tried to install the project... 🤯 And the pressure to write issues and commits and share everything in English made me cry (?) in my first month.
But I couldn't let it get me down, because like any intern, I didn't want to be a burden, even if I wasn't going to be a huge help to the company. Faced with a week of shoveling, I realized I needed to write everything down so I wouldn't have to ask the same questions over and over, and I started jotting down the terms and context I learned, even if it didn't amount to a proper TIL (Today I Learned).
Below are some captures from ✍️ Survival of the Fittest that I've been updating throughout my internship and beyond.
I also read all the threads that came up that I thought might be helpful, even if I didn't understand them at the time, and organized the ones I thought were worth keeping. Then I finally started writing PRs for my first issues, asking for code reviews, getting issue-centric comments, and slowly getting used to the development culture at Lablup. As I grew more confident, I even had the audacity to volunteer for issues tied to customer requests.
Ultimately, Lablup works as a hybrid of top-down and bottom-up rather than a strict hierarchy, and that unity lets members act as a single team. Even with a degree of freedom unusual by Korean standards, such as self-directed commuting and flexible work arrangements, we achieved more than you would expect from a team of this size, because each member took responsibility and actively pursued our shared goals.
The tech stack and soft skills I learned
Lablup is where I started as an intern, and it's also my first job as a full-time software engineer. It's safe to say that I've learned almost everything I need to know in my professional life at Lablup. I've been working on one of Lablup's flagship products, Backend.AI, using the following technology stack, which has helped me gain a better understanding of the product.
Web Component
Web Components are a web standard, and modern browsers now support them natively. When the browser renders a web component, it attaches a separate DOM subtree called the shadow DOM, which keeps each component's styles from unintentionally affecting the others and lets them be applied independently. To access it from JavaScript, the shadow root must be created in 'open' mode, and it is then exposed through the element's shadowRoot property.
It's similar to React's virtual DOM, but it serves a different purpose: the virtual DOM is for rendering optimization, while the shadow DOM is for applying styles and logic independently per component. (To read more: ShadowDOM vs VirtualDOM)
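To make this concrete, here is a minimal sketch of a hand-written web component with an open shadow root; the element name and markup are hypothetical, not taken from Backend.ai-webui.

```typescript
// Minimal sketch of a custom element with an open shadow root (hypothetical example).
class GreetingCard extends HTMLElement {
  constructor() {
    super();
    // 'open' mode makes the shadow tree reachable later via element.shadowRoot.
    const shadow = this.attachShadow({ mode: 'open' });
    // Styles declared here stay scoped to this component only.
    shadow.innerHTML = `
      <style>p { color: teal; }</style>
      <p>Hello from inside the shadow DOM!</p>
    `;
  }
}
customElements.define('greeting-card', GreetingCard);

// Because the shadow root is open, outside code can still reach into it.
const card = document.createElement('greeting-card');
document.body.appendChild(card);
console.log(card.shadowRoot?.querySelector('p')?.textContent);
```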
Lit (Lit-Element)
Lit is a library that makes it quick and easy to build web components. It abstracts away declaring a component, attaching it to the document, and setting up its shadow DOM. It exposes all of this in an OOP style familiar to software engineers (class-based components) and provides lifecycle callbacks, for example during render, immediately after the first render, or when a re-render is explicitly requested, which helps you implement components more intuitively. (To read more: Lit Documentation (v2))
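As a rough sketch of what that looks like in practice, here is a hypothetical Lit 2 component written in TypeScript; the element name and property are made up for illustration.

```typescript
// Hypothetical Lit 2 component; Lit sets up the shadow root and rendering for us.
import { LitElement, html, css } from 'lit';
import { customElement, property } from 'lit/decorators.js';

@customElement('hello-badge')
export class HelloBadge extends LitElement {
  // Scoped styles, applied only inside this component's shadow DOM.
  static styles = css`
    span { padding: 4px 8px; border-radius: 4px; background: #eef; }
  `;

  // Reactive property: changing it triggers a re-render automatically.
  @property() name = 'Backend.AI';

  render() {
    return html`<span>Hello, ${this.name}!</span>`;
  }

  // Lifecycle hook called once, right after the first render.
  firstUpdated() {
    console.log('hello-badge rendered for the first time');
  }
}
```

Using the element is then just a matter of writing `<hello-badge name="PyCon"></hello-badge>` in the page.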
Modern JavaScript & TypeScript
Since ES6, JavaScript supports Array methods such as map, filter, and reduce, in addition to iteration constructs like forEach, for-of, and while. API calls can be handled asynchronously with the native fetch function instead of an external library such as jQuery. With TypeScript, you can also declare the expected shape of fetched data and call or respond to functions accordingly, and the compiler raises an explicit error if the wrong type is accessed.
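For example, here is a small TypeScript sketch combining typed data fetching with map/filter/reduce; the SessionInfo shape and the URL are hypothetical, not part of the actual Backend.AI client.

```typescript
// Hypothetical response shape; the compiler flags access to fields not declared here.
interface SessionInfo {
  id: string;
  status: 'RUNNING' | 'TERMINATED';
  cpuHours: number;
}

// Native fetch with async/await, no external HTTP library needed.
async function fetchSessions(url: string): Promise<SessionInfo[]> {
  const resp = await fetch(url);
  if (!resp.ok) throw new Error(`Request failed: ${resp.status}`);
  return (await resp.json()) as SessionInfo[];
}

// Array helpers instead of manual loops: keep running sessions, sum their CPU hours.
function totalRunningCpuHours(sessions: SessionInfo[]): number {
  return sessions
    .filter((s) => s.status === 'RUNNING')
    .reduce((sum, s) => sum + s.cpuHours, 0);
}

fetchSessions('https://example.com/api/sessions')
  .then((sessions) => console.log(totalRunningCpuHours(sessions)));
```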
Of course, I was also able to contribute to the backend (Backend.AI Core) as well as the frontend, dipping my toes into Python's asynchronous library, asyncio. Below is a presentation on an issue that helped me a lot in understanding the structure of Backend.AI while contributing to the backend.
https://www.youtube.com/watch?v=itCEkuO2DtE
Lablup stands out as a development organization in the open-source community, with a strong team of seasoned open-source experts, which contributes to a forward-thinking and open-minded development culture. As noted above, work often progresses from the bottom up. In such an environment, individuals are encouraged to offer ideas regardless of their position, or even to organize internal seminars to discuss and validate their proposals before formal submission.
Below are three of the soft skills I've learned on the job that have helped me a lot.
- Ask questions before the end of the day, but also mention what you have figured out so far.
- Dive in, do some shoveling, and document the experience, even if it doesn't seem directly relevant at first.
- Share what you learn as often as possible.
What else I accomplished
I didn't just do development; the various experiences described below helped me grow outside of development as well. Some might ask, "Aren't you too busy to focus on development? Aren't you becoming too much of a generalist? What's your specialty?" But I believe that learning how a company works outside of development, and how to work with other departments, is a fundamental skill for a developer.
- Contributed to Backend.AI Good Software (GS) Certification
- Wrote Backend.AI documentation contribution guide
- Published and applied responsive layouts to our company main page and featured product pages
- Created a Backend.AI tutorial video
- Mentored at the Open Source Contribution Academy in 2021
- Planned and executed the company's annual meeting in July 2021.
- (junior) developer → promoted to senior developer in May 2022! ✨
- Infcon 2022 Talk Concert: joined the Junior Developers' Bamboo Forest event as a panelist
What I want to accomplish more in the future
Below is a list of things I'd like to see Lablup do in the future, if not right away.
- Building a dedicated design system
- Combining web components with React for MSA services in Backend.AI
- Apply Google Analytics 4 (GA4) to the company site and major product sites
- Create a category for each topic on our tech blog
- Apply BDD or test environments to our primary repository
Remarks
At first, I didn't know what to write because I had been running nonstop, but as I slowly wrote things down, I realized that a lot of what I've accomplished was possible because I started at Lablup. In particular, I realized that none of the above would have been possible without my colleagues, who were patient and understanding while I found my pace. I'd like to take this opportunity to thank all the members of Lablup once again.
I also promise myself to be a developer who can continue to contribute to Lablup.
Extra: We don't just do work.
Joining Lablup in the middle of the COVID pandemic meant missing out on numerous workshops and trips such as GTC and Google I/O. Even so, this relatively short time has given me a diverse range of experiences beyond just work. The recent workcation I took stands out in my memory. To dispel any notion that Lablup is a "work-only startup," I'm recounting several of those experiences here 😎.
Things we've done
- Gangneung Workshop (2020.11)
- Yangyang Workcation (2022.07)
- Culture Day (2022.08)
Ongoing
- Jeju Island Workcation (2022.10)
Future plans
- Google I/O 2023?
29 September 2022