Developing KIMCHI: An AI Image Generation Model Reflecting Korean Culture
By Yonggeun Kwon
Introduction
Lablup's research team is working on various examples utilizing Backend.AI. We develop AI Agents that automatically generate CLI commands inside Backend.AI to match the user's instructions, and we also create simple demos for various exhibitions. For example, 'VisuTale', a service that creates images and fairy tales from images provided by users, is the work of our research team. VisuTale has been used at various exhibitions as an example of the AI services that can be developed using Backend.AI, and it has been a hit with audiences.
During a recent internal study, we realized that AI-powered image generation models often fail to capture specific cultural contexts or languages. This is because global AI models are often trained mainly on English data or on generic, general-purpose data. When a region or culture is named in a prompt, its unique characteristics are often not represented. When we examined different models, we found that when Korean text prompts were used to generate images, the Korean identity was not fully represented in the results.
For example, popular models such as Stable Diffusion v1.5 and DALL-E 3 demonstrate issues with not interpreting Korean text prompts correctly, or not fully reflecting Korean identity in the generated images.
- Stable Diffusion v1.5 generated an image of a building instead of ramyeon noodles when given the Korean prompt “Draw me a Korean Ramyeon”, indicating that the model was not handling the Korean input correctly.
- DALL-E 3 generated an image of a noodle dish from the same prompt, but the result looked more like Japanese ramen than Korean ramyeon and did not capture the Korean vibe.
Typing the prompt in English isn't the answer either - sometimes there are no one-to-one words that translate from Korean to English, or the translator doesn't understand the context and doesn't come up with the right translation. For example, if you type the food “Gochujang-jinmichaeboekum” into the translator, it will come up with “stir-fried vegetables with red pepper paste”, which isn't an accurate translation of the food.
As a solution to this problem, KIMCHI (Korean dIffusion Model adapting Culture and HIstory), an image generation model that fully reflects Korean culture and identity, was born.
Making 'KIMCHI'
Attached Image Source: Wikipedia & Lotte Group
Lablup participates in various overseas exhibitions and conferences such as CES, MWC, GTC, and SC, so there are many opportunities to showcase various demos. In this situation, we believe that 'KIMCHI' can be used as a good example of a culture-specific AI model. If Lablup's 'KIMCHI' is shown to generate customized images that reflect cultural characteristics in real time, it will attract attention overseas and effectively communicate the brand's differentiation. Furthermore, KIMCHI can also be used as marketing materials at various exhibitions and provide differentiated content.
'KIMCHI' was developed with two goals:
1. To understand Korean prompts directly and produce accurate images without the need for translation.
2. To generate images that reflect Korean culture and identity.
If we can achieve both of these goals, we should be able to generate images of Korean ramyeon noodles and of Lotte World Tower standing in a Korean city center in response to the prompts “Draw me ramyeon noodles in Korea” or “Draw me Lotte World Tower in Korea”.
In this article, we'll describe how the 'KIMCHI' model was developed, including its dataset preprocessing pipeline, and discuss its limitations.
Preparing Dataset
The 'KIMCHI' model was trained on datasets collected from AIHub, such as the Korean Food Dataset, the Korean Historical Building and Landmark Dataset, and the External knowledge base multimodal Q&A data with Korean images.
Processing the Data
1. VLM (Vision Language Model) Processing: First, we feed each prepared image, together with a prompt, into a VLM called LLaVA. A VLM is a multimodal AI model that can understand images and text together, meaning it analyzes the information in the image and translates it into human-understandable text. For example, it can look at a certain photo and generate an English caption that says, “A plate of food with a red sauce and a green vegetable.”
2. LLM (Large Language Model) Refinement: The English captions are then passed to LLaMA, a language model. LLaMA is responsible for replacing the existing English caption with a more detailed and accurate Korean description. If the original English caption was “A plate of food with a red sauce and a green vegetable.” and contained only visual information about the image, LLaMA would turn it into a Korean sentence that reads, “Kimchi, a traditional Korean dish made with chunggat tossed in red sauce on a white plate.”
Through these two processes, Korean captions are generated that best describe the existing image. These captions are then used to train the image generation model.
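The two stages above can be sketched as a small pipeline. This is a minimal illustration rather than the team's actual code: `describe_image` and `refine_to_korean` are hypothetical callables standing in for the LLaVA and LLaMA inference calls, whose real interfaces are not described in this article.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class CaptionedImage:
    image_path: str
    english_caption: str  # stage 1 output (VLM)
    korean_caption: str   # stage 2 output (LLM), later used as the training prompt


def build_captions(
    image_paths: Iterable[str],
    describe_image: Callable[[str], str],    # hypothetical wrapper around the VLM
    refine_to_korean: Callable[[str], str],  # hypothetical wrapper around the LLM
) -> List[CaptionedImage]:
    """Run every image through the VLM, then refine its caption with the LLM."""
    results = []
    for path in image_paths:
        english = describe_image(path)      # stage 1: image -> English caption
        korean = refine_to_korean(english)  # stage 2: English -> detailed Korean caption
        results.append(CaptionedImage(path, english, korean))
    return results
```

Passing the models in as plain callables keeps the sketch independent of any particular inference library, so either stage could be swapped for a different model.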
Training a model
The Korean descriptions generated from this process were used as text prompts to train the diffusion model.
In the image above, we obtained the Korean description of Gat-Kimchi (a Korean food which is a variant of kimchi): “a traditional Korean dish made of chunggat tossed in red sauce on a white plate”. Since it would be very expensive to fully fine-tune the diffusion model, we applied the Low-Rank Adaptation (LoRA) technique to improve training efficiency, and used the text encoder of the Korean-CLIP model to encode Korean prompts. Korean-CLIP is a CLIP model that has been additionally trained on Korean-English parallel data. Through this process, we were able to further train the model and develop 'KIMCHI', which understands Korean prompts and generates images that naturally reflect Korean culture.
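To illustrate why LoRA is cheaper than full fine-tuning, here is a minimal sketch of a LoRA-adapted linear layer in plain Python. It is not the KIMCHI training code; the class name, rank, scaling factor, and layer sizes are assumptions chosen for illustration. The pretrained weight W stays frozen and only two small matrices A and B are trained, so the effective weight is W + (alpha / rank) * B A.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


class LoRALinear:
    """Frozen weight W (out x in) plus a trainable low-rank update B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = random.Random(seed)
        out_features, in_features = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weight, never updated
        # A starts with small random values, B starts at zero, so the adapted
        # layer initially behaves exactly like the pretrained layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_features)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_features)]
        self.scale = alpha / rank

    def forward(self, x):
        """x is a batch of row vectors with shape (batch x in)."""
        base = matmul(x, transpose(self.weight))
        update = matmul(matmul(x, transpose(self.A)), transpose(self.B))
        return [[b + self.scale * u for b, u in zip(b_row, u_row)]
                for b_row, u_row in zip(base, update)]

    def trainable_parameters(self):
        """Only A and B are trained: rank * (in + out) values."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

For a hypothetical 768 x 768 projection with rank 4, this leaves 4 * (768 + 768) = 6,144 trainable values instead of 768 * 768 = 589,824, which is the kind of saving that makes adapting a large diffusion model affordable.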
Evaluating experiments and performance
Image generation experiments
Experiment(1): Representing cultural elements
The comparison experiment above shows the results of DALL-E 3, ChatGPT-4o, Stable Diffusion 2.1 (using English translation prompts), and the KIMCHI model generating images based on the same Korean prompts alongside real images.
While other models performed well, KIMCHI captures the texture and ingredients of food, details of traditional architecture, and more without looking artificial. This means that KIMCHI is trained to accurately understand Korean text and generate natural Korean images without the need for translation.
Experiment(2): Preserving the ability to create generic images
The image above shows the results of a validation to see whether KIMCHI retains the pretrained model's ability to generate generic images after training. KIMCHI was trained using Stable Diffusion 1.5 as its base model, so we compared the two.
Evaluating performance
Result(1): Representing cultural elements
After the experiment, we evaluated the image generation results by surveying 53 evaluators, who selected one of five statements for each of the above items for each model's generated images, ranging from strongly agree to strongly disagree. We then scaled the evaluators' responses from 1 to 5, and measured the average value for each item.
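The scoring described above amounts to mapping each selected statement to a number and averaging per item. A minimal sketch follows; the three intermediate statement labels and the sample answers in the usage note are assumptions for illustration, not the actual survey wording or data.

```python
from statistics import mean

# Assumed five-point scale; only the two endpoints are named in the article.
LIKERT_SCALE = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}


def mean_scores(responses):
    """Map {item: [selected statement, ...]} to {item: average score}."""
    return {
        item: mean(LIKERT_SCALE[answer] for answer in answers)
        for item, answers in responses.items()
    }
```

For example, `mean_scores({"Reality": ["agree", "strongly agree", "neutral"]})` averages the scores 4, 5, and 3 for that item.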
The table above tabulates the experimental results for Experiment (1). KIMCHI outperformed all other models in Text-Image Alignment, Reality, and Domain-Specific Suitability. This means that KIMCHI is good at generating images that are close to the real image, reflecting the Korean prompt, and it is also good at reflecting cultural factors.
Result(2): Preserving the ability to create generic images
The table above shows the results of Experiment(2) through human evaluation. Compared to the Stable Diffusion 1.5 base model, KIMCHI performs slightly worse in Text-Image Alignment and Error Detection, indicating that its ability to generate general images decreased somewhat as a result of specializing in a specific domain. However, we found that it preserved its Reality performance well.
Limitations and conclusions
Limitations
KIMCHI is a fine-tuned model that uses Stable Diffusion v1.5 as its base model. Stable Diffusion v1.5 was released in October 2022, making it more than two years old. Therefore, even with fine-tuning, image quality may not match that of the latest models. Also, while KIMCHI can be trained to understand Korean prompts, its text-image alignment ability is limited by that of the base model.
Conclusions
KIMCHI is a model that directly understands Korean prompts and generates images that reflect cultural nuances, challenging the limitations of existing image generation models. Although there are still many limitations, this research makes the following contributions.
- Build a culturally contextualized dataset preprocessing pipeline with VLM and LLM
- Generate images with Korean prompts without translation
- Generate more realistic and culturally contextualized images
If we build on these contributions and collect images from a specific culture, we can reuse KIMCHI's dataset preprocessing pipeline to build a dataset for fine-tuning a diffusion model, which will allow us to develop a generation model specific to that language and culture.
While this research started by generating images that accurately reflect Korean culture, it can be extended to AI technologies that understand various cultures. We look forward to further research in the future so that culture-specific models such as KIMCHI can contribute to the world's diverse cultures, and to AI technologies that understand and respect diverse cultures.
26 December 2024
Lablup with PyCon Korea 2024: lambda submit: Starbucks if submit == "duck" else None
By Jinho Heo
Hello, I'm Jinho Heo, a Technical Writer at Lablup.
Lablup participated in PyCon Korea 2024 as a platinum sponsor from October 26th to October 27th, at the Suwon Convention Center.
Lablup's founding philosophy is strongly tied to open source. It's not a stretch to say that open source is in Lablup's blood. There are many open source projects, but Lablup has a particularly strong relationship with Python. Lablup is an active contributor to open source projects such as aiohttp, which is built on asyncio, and also contributes to Python itself. Our deep ties go beyond Python to PyCon. Members of Lablup have delivered presentations on diverse topics at PyCons around the world, and Lablup has been a sponsor partner of PyCon Korea five times.
This year's PyCon was a big one for Lablup, because Lablup's CEO Jeongkyu Shin and CTO Joongi Kim were keynote speakers on both days of PyCon. Jeongkyu Shin gave a talk on 'Python, PyCon, the Dinosaur Age, and the Planet of Chickens'. The title is confusing at first glance, but he likened two qualities of Python to dinosaurs, the first evolutionary victors: Python's rapid growth and ascension to the number one language, and its status as a key language in the era of massive AI, which creates innovations through massive computation. This leads to 'chicken', the most consumed meat in the modern world, which he used to describe Python's accessibility and universality, making it easy for anyone to learn and utilize. Shin's presentation was well received, as he entertained the audience with his witty titles, verbal skills, and various AI-generated illustrations.
Joongi Kim, the CTO, delivered a presentation entitled "10 Years of PyCon and Me." Spanning four chapters, he reflected on his PyCon presentations worldwide, beginning with asyncio's development, the company's expansion, managing code scaling, and motivating fellow Python enthusiasts. His clear explanation of the challenges developers encounter or may encounter garnered considerable applause from the audience.
Kyujin Cho, Senior Software Engineer | Sergey Leksikov, Researcher | Joongi Kim, Chief Technology Officer
In addition, Kyujin Cho, Senior Software Engineer, Sergey Leksikov, Researcher, and Joongi Kim, CTO, each presented a session. Kyujin presented 'Automated Python Web Framework API Schema Creator: The long way around', where he shared with the audience the series of trial-and-error struggles he went through to overcome the challenges of automatically generating API documentation using aiohttp. Sergey presented 'Automating CLI commands execution with LLM and LangGraph: A new frontier in Python automation', where he talked about how complex CLIs can be transformed into user-friendly tools using LLM and LangGraph frameworks to improve user experience and operational efficiency. Both talks were well attended and received great interest from the audience. Joongi presented 'Engineering Python for enterprise delivery', where he shared his experience in developing packaging and installers for Python app delivery.
👨🏫 Jinho: We'd love to hear your PyCon session recaps.
👨💻 Sergey: Although my presentation was in English, the PyCon Korea organizers offered a real-time translation service, enabling me to deliver my talk smoothly. The highlight of the session was the interaction with eager audience members who approached me with questions about my presentation topic afterwards.
'AI Score Reader' event (Drawing a duck and swan)
Lablup organized an 'AI Score Reader' event at PyCon Korea 2024. Attendees could join the event by scanning a QR code on their mobile devices, tablets, or laptops and were invited to draw "ducks and swans" for a chance to win prizes. Every participant had the opportunity to receive a Lablup Folding Pouch, while the grand prize for the best drawing was a Starbucks gift card.
The stickers Lablup carries to every exhibition
Why ducks and why swans? They are the ones in stickers we always bring to exhibitions. While they appear to glide smoothly across the water, they are paddling hard beneath the surface. This imagery serves as a metaphor for our desire to have our customers enjoy seamless AI services on the front end while Backend.AI handles the complex processes in the background.
The concept of the event is straightforward. If you've ever used generative AI tools such as Stable Diffusion or Dall-E for image creation, you're aware that the structure of generative AI is highly sensitive to its input. The output may vary with each attempt, even with identical inputs, or change drastically with minor input modifications. This method of instructing generative AI to yield specific results is known as 'prompt engineering.' All participants at our booth had the opportunity to engage in some prompt engineering themselves.
We asked Kyujin Cho, a senior software engineer who was responsible for developing the backend of our “AI Score Reader” event, to give us an overview.
👨🏫 Jinho: Can you explain how the “AI Score Reader” event page was created?
👨💻 Kyujin: The "AI Score Reader" event page's backend was built on three microservices: the Web Application Server (WAS), the image generation pipeline, and the image similarity pipeline. The WAS includes a user database and an API to manage image generation requests. These microservices were all deployed on Backend.AI using the 'Model Service' feature. Similar to the VisuTale demo by Sergey from our research team, this serves as a testament to the versatility of Backend.AI in developing AI services.
👨🏫 Jinho: Specifically, what process leads to the similarity between a user's drawing and a given image?
👨💻 Kyujin: When a user submits a prompt through a page accessed by scanning a QR code to generate an image, the Web Application Server (WAS) forwards the text to an image generation service and retrieves the created image. The WAS then sends this image to the similarity discriminator, receives a similarity score as a percentage, and delivers both the score and the generated image to the user.
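The request flow Kyujin describes can be sketched in a few lines. This is only an illustration of the sequence of calls, not the event's actual backend: `generate_image` and `score_similarity` are hypothetical callables standing in for the two pipeline services, and the result type is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SubmissionResult:
    image: bytes
    similarity_percent: float


def handle_submission(
    prompt: str,
    generate_image: Callable[[str], bytes],      # stands in for the image generation service
    score_similarity: Callable[[bytes], float],  # stands in for the similarity pipeline
) -> SubmissionResult:
    """WAS-side flow: prompt -> generated image -> similarity score -> response."""
    image = generate_image(prompt)
    similarity = score_similarity(image)
    # Both the score and the generated image go back to the user.
    return SubmissionResult(image=image, similarity_percent=similarity)
```

Keeping the WAS as a thin coordinator like this means each pipeline can be deployed and scaled as its own service.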
👨🏫 Jinho: What criteria should we draw by to get a high score?
👨💻 Kyujin: I have no idea how the image similarity pipeline determines similarity. Perhaps asking the AI directly would be a better way to get an answer.
'Massive' engagement
Our members perceived it as a "minor event." However, it took less than an hour for this perception to be completely overturned.
Our booth started to fill up, and the front of the booth became a scene of people staring hungrily at the leaderboard, eager to see where they stood and defend their top placement.
There were even reports of people sitting in lounges staring at their phones to draw swans.
Let's take a look at the stats.
On the 26th and 27th of October, the event saw the participation of 428 individuals. Throughout this period, we received 11,639 image creation requests, with the most surprising contribution from a single participant who submitted over 1,000 images, a detail we'll delve into later.
Unintended Side Effects Raised
Seated in the booth and observing the leaderboard, I noticed two familiar nicknames: "cloudshin" and "achimnol," which belong to CEO Shin Jeongkyu and CTO Kim Joongi, respectively. They were on a streak, achieving similarity scores above 90%. They kept drawing ducks non-stop, even while proceeding to lunch.
DevRel Lead WooYoung Yoo attempted to stop them, yet it is said that their enthusiasm was difficult to diminish. (P.S.1: Of course, we did forcefully erase their data when finalizing the scores).
(P.S.2: Indeed, the participants performed so well that they effortlessly exceeded the scores of the latter two, resulting in a dominant podium presence...)
In addition to the internal side effects, there were also external issues. While we were running the booth on Day 1, a developer screamed in the distance.
"I believe someone is executing a macro."
We discovered that duplicate submission requests with the same prompt were being sent every 1-5 seconds, by a macro exploiting the AI's tendency to yield varied results from identical prompts. However, addressing the issue was challenging at the moment of detection, because the backend developer was slated to present at PyCon on the second day. The situation was deteriorating.
“I can't submit” “The button doesn't work” “I get a gray blank space instead of an image”
An attendee opened our event website on their laptop, opened the developer tools, and pointed out that no requests were going through. We tracked down our developer, who was tucked away in a corner preparing for his presentation, and discovered that the earlier macro was bombarding our NVIDIA H100 with concurrent requests.
Day 1 concluded with a trickle of image submissions. While tidying up the booth and reflecting on the day, we recognized the need to avert this issue on Day 2.
Our poor developer pulled an all-nighter to prepare for Day 2, putting his presentation prep on hold and adding a couple of new features to prevent a repeat of the incident.
First, added same-prompt submission protection
We added server-side validation of submitted prompts. If a prompt has been submitted previously, the server now returns the initially generated image and its similarity score rather than creating a new image.
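A minimal sketch of this kind of same-prompt protection is shown below. It is an illustration under stated assumptions, not the event's actual code: the class name is hypothetical, prompts are matched by exact string, and the cache is shared across all users, since the article does not say how duplicates were keyed.

```python
class PromptCache:
    """Return the first result for a prompt instead of regenerating it."""

    def __init__(self, generate_and_score):
        # generate_and_score is a hypothetical callable: prompt -> (image, score)
        self._generate_and_score = generate_and_score
        self._results = {}

    def submit(self, prompt):
        if prompt not in self._results:
            # First time this prompt is seen: do the expensive GPU work once.
            self._results[prompt] = self._generate_and_score(prompt)
        # Repeats get the originally generated image and score back.
        return self._results[prompt]
```

With a cache like this, resubmitting an identical prompt no longer gives a macro a fresh roll of the dice, and it no longer costs any GPU time.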
Second, added captcha when submitting images
To deter macros, we implemented a captcha for submitting responses. Although this may have been slightly inconvenient for participants, the introduction of the captcha successfully prevented random macros on the second day of the event.
Other minor improvements
We were receiving data with quite a few decimal places to calculate participants' scores, but we were truncating them in the GUI to the first decimal place for user convenience. However, as the competition heated up, participants started ending up with scores that were identical to the first decimal place, so we patched the GUI to show the second decimal place to reduce user confusion.
Thanks to Kyujin Cho, Senior Engineer, and Soojin Kim, Frontend Engineer, who worked tirelessly behind the scenes to make the event a success 🙂 .
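The effect of the display change is easy to see with a small example. The scores below are made up for illustration, and the helper name is an assumption; the event's GUI code is not shown in this article.

```python
from decimal import Decimal, ROUND_DOWN


def truncate_score(score: str, places: int) -> str:
    """Truncate (not round) a score string to the given number of decimal places."""
    quantum = Decimal(1).scaleb(-places)  # places=2 -> Decimal("0.01")
    return str(Decimal(score).quantize(quantum, rounding=ROUND_DOWN))


# Two hypothetical scores that look tied with one decimal place
for raw in ("93.4721", "93.4288"):
    print(truncate_score(raw, 1), truncate_score(raw, 2))
```

Both hypothetical scores display as 93.4 with one decimal place, but as 93.47 and 93.42 with two, which is enough to separate them on a leaderboard.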
Wrapping up PyCon Korea 2024
Numerous Python enthusiasts attended PyCon Korea 2024, where Lablup also engaged with many attendees. Lablup is committed to ongoing contributions and growth within the open source community. Echoing Chef Jeong Ji-sun's words from the hit show “Culinary Class War,” “Opening up recipes leads to more recipes. With many contributing ideas, it creates a larger tapestry.”
Here at Lablup, we will continue to think big and never lose sight of the open source spirit that has always been the cornerstone of our foundation.
20 November 2024
Uncharted AI: The Age of AI
By Lablup
This article is a summary of Jeongkyu Shin's keynote speech on September 24, 2024 at lab | up > /conf/4.
On September 24, 2024, Lablup's 4th conference, lab | up > /conf/4, was held. The event was attended by a variety of external speakers as well as Lablup employees, and the keynote address was given by Lablup's CEO, Jeongkyu Shin.
Photo by 'iT dongA'
This article will cover the advancements in the AI era as introduced by Jeongkyu Shin in his keynote speech, the future trajectory of Lablup, updates on the current products, and some of our new product releases.
Uncharted Waters
The title of this keynote, "Uncharted AI - The Age of AI," draws inspiration from the classic game "Uncharted Waters," fondly remembered by many. However, Uncharted Waters is not merely a game; it reflects a significant chapter in the real history of our world.
During the Age of Discovery, beginning in the 15th century, numerous explorers ventured across the oceans in pursuit of spices, such as the now commonplace "pepper." I was not alive at the time to witness it firsthand, so I experienced it through the game. We may not consider spices so valuable today, but numerous adventurers risked their lives in their pursuit.
Uncharted AI
Like so many people who risked their lives across the ocean in search of spices back then, we're in a new era of artificial intelligence (AI), and we're risking our lives and working with a diverse set of partners to advance AI. The necessity of this effort lies in its commitment to accessibility. If I could harvest pepper in my backyard, I wouldn't have to cross the ocean. At the dawn of a new era, this difference in access creates a skills gap for some and a challenge for others. For Lablup, the skills gap introduced by emerging technologies has catalyzed the dawn of a new era.
At Lablup, our motto has been clear since our founding in 2015. We've made it our core mission to Make AI Accessible, making technology more accessible and lowering barriers. Our goal was to reduce the barriers to AI accessibility by making the technology itself comprehensible and user-friendly, not merely available as an API.
As the field of AI advances, the challenge of scaling emerges. As AI technology expands, the data it processes grows and computation intensifies, moving from single-node to multi-node setups and from tens to hundreds of thousands of GPUs. Simultaneously, AI is becoming more compact, operating on devices in the palm of your hand, such as Samsung's Galaxy AI and Apple Intelligence, as well as on IoT sensors like thermometers.
Simultaneously, we are witnessing efforts to operate AI with greater power and more resources, as well as a surge in endeavors to run AI with less power and fewer resources. If we consider the traditional spectrum of AI, it is expanding both upwards (larger) and downwards (smaller), with the technology needed to shift the scale in either direction being entirely distinct.
Back in 2015, we were able to construct models using just a GeForce GTX 970. However, workloads have expanded so rapidly that for the past four or five years, their growth has surpassed the pace of semiconductor performance improvement described by Moore's Law. Consequently, the focus has shifted from enhancing a single chip's performance to combining several chips and utilizing them in parallel.
Make AI "Scalable"
Over the past four years, the distributed computing paradigm in AI has undergone significant evolution. We have moved beyond parallel processing to witness a variety of computations occurring concurrently. Diverse tasks like data processing, model training, and service provisioning are now integrated. Simultaneous demands for heterogeneous computational resources have emerged, encompassing databases, training, data processing, fleet management, RAS, and others that align more closely with the service stack.
Accelerators such as GPUs have become essential for modern computing. We no longer use CPUs and GPUs separately; instead, we must integrate them more closely. The driving force behind this integration is the universal need for GPUs, which leads to bottlenecks that are both physical (such as power, network, and data) and non-physical (including hardware instability, platform management, and software issues). At Lablup, our goal is to eliminate these obstacles to scaling.
This year at Lablup, we've set a new objective: Make AI Scalable. Our aim is to expand AI workloads across the full range, from accelerators to individual nodes to hyperscale environments. This goal builds upon our initial mission of “Making AI Accessible,” as we eliminate obstacles to scaling, incorporate elements that facilitate scaling, and persist in dismantling barriers to accessing AI technology.
Through the years, the company's dedication to making AI both accessible and scalable has resulted in numerous innovations. As a result, the number of enterprise GPUs running on Backend.AI has grown to nearly 13,000, with some sites managing more than 1,500 GPUs. Additionally, the number of teams (customers) utilizing our products has increased to over 100. In varied sectors such as cloud services, AI accelerator testbeds, and autonomous driving, Backend.AI has established itself as a crucial infrastructure component for AI.
This massive scale significantly increased the technical challenge. We've had to develop technologies that span the entire spectrum, from single servers to thousands of clusters. We had to “take away everything that is blocking the scaling, and add everything for the scaling.” We would like to use this opportunity to share our recent innovations, the ongoing developments, and the future we are striving to create.
Open Source
Lablup is a company that is deeply involved in the open source ecosystem. We are developing and releasing various projects such as Backend.AI, Callosum, aiodocker, aiomonitor (aiotools), Raftify, and many more. Open source is in our DNA. Our experience running the open source software we create, publish, or contribute to across various on-premises environments is one of our significant competitive edges. Backend.AI's support for on-premises environments, compatibility with cloud environments, and more are all capabilities that we've gained from our open source experience.
Backend.AI CLI Installer: Easy installation experience with TUI
The Backend.AI CLI Installer is an open-source initiative designed to enhance the accessibility of Backend.AI. It features a text-based user interface (TUI) for simplified installation, automates the package-based installation process, and includes meta settings for streamlined automatic setup.
bndev: Easily build your own AI infrastructure
For enthusiasts who enjoy tinkering and hacking beyond mere package-based installations, we have introduced a development tool named bndev. This tool simplifies the process of constructing and maintaining intricate Backend.AI development environments. The concept behind bndev is to empower everyone to own and maintain their personal AI infrastructure.
Backend.AI Core
Backend.AI conducts major version releases biannually, in March and September. The release of version 24.03 took place in March 2024, and the upcoming release of version 24.09 is imminent. Significant updates to Backend.AI Core are expected to influence future releases. Allow me to introduce these changes for you.
Key Updates
- Support for NVIDIA NGC (NVIDIA GPU Cloud) NIM (NVIDIA Inference Microservices): Key NGC features, like license-based container image loading, are compatible with Backend.AI.
- Expanded support for new accelerators including Intel Gaudi2, Rebellions ATOM+, and Furiosa RNGD: Backend.AI allows you to flexibly choose the best AI accelerator to match the characteristics of your workload.
- General availability of Backend.AI model store, browser, and serving: A comprehensive solution that integrates the essential features of MLOps, simplifying the process for customers to find AI models and deploy them seamlessly into their workflows.
- Enhanced Task Scheduling: The new Priority Scheduler enables the independent prioritization of tasks, ensuring that tasks of high importance are addressed swiftly and dependably.
- Agent Selector Concept: The Agent Selector is responsible for determining which nodes the scheduler actually runs the selected tasks on. This part is now easily customizable as a standalone plugin. You can use it to distribute jobs based on different criteria, such as power usage or temperature of each node. We expect this to be a great help in optimizing the operation of your infrastructure by balancing the load across nodes, increasing power efficiency, and more.
- Our own Docker network plugin: Expanded support for GPUDirect Storage for large-scale data processing, minimizing bottlenecks in moving data within a single node.
- Cilium-based networking stack for inter-container communication: The implementation has enhanced large-scale distributed training, resulting in a 30% increase in network performance compared to the previous setup.
- OpenID Connect (OIDC)-based federated authentication scheme: Access various infrastructure services, such as Backend.AI and others, using a single account to significantly streamline account management.
- Expanded support for enterprise environments: It works with a variety of private container registries, including GitLab, GitHub Enterprise, AWS ECR, and more, and makes it easy to set up hybrid configurations that span both on-premises legacy resources and the cloud.
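To make the Agent Selector idea from the list above more concrete, here is a small sketch of choosing a node by a pluggable criterion such as temperature or power usage. It does not use Backend.AI's actual plugin interface, which is not described in this article; `NodeStatus`, `select_agent`, and all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class NodeStatus:
    name: str
    power_watts: float
    temperature_c: float
    free_gpus: int


def select_agent(
    nodes: Sequence[NodeStatus],
    required_gpus: int,
    criterion: Callable[[NodeStatus], float],
) -> Optional[NodeStatus]:
    """Pick the node with the lowest criterion value among those with enough free GPUs."""
    candidates = [node for node in nodes if node.free_gpus >= required_gpus]
    if not candidates:
        return None  # nothing can host the task right now
    return min(candidates, key=criterion)


nodes = [
    NodeStatus("node-a", power_watts=3200.0, temperature_c=71.0, free_gpus=4),
    NodeStatus("node-b", power_watts=2100.0, temperature_c=64.0, free_gpus=2),
    NodeStatus("node-c", power_watts=1800.0, temperature_c=58.0, free_gpus=1),
]
coolest = select_agent(nodes, required_gpus=2, criterion=lambda n: n.temperature_c)
```

Because the criterion is just a function, swapping a temperature-based policy for a power-based one is a one-line change, which is the kind of flexibility a standalone selector plugin is meant to offer.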
Leveraging these updates, Backend.AI is broadening its scope as a cutting-edge AI infrastructure, serving both high-performance computing (HPC) and enterprise needs. Further enhancements will accompany the launch of Backend.AI 24.09.
Next-gen Sokovan
We continue to develop the next-generation Sokovan, scheduled for release early next year. Here is a brief overview of what to expect from Next-gen Sokovan.
- Dual-engine architecture supporting Kubernetes: In addition to the current proprietary cluster management system, it will function as a native Kubernetes service. This includes managing accelerators through the Kubernetes Operator Proxy. We will seamlessly integrate NVIDIA and AMD device plugins, Intel GPU plugins, among others, to uphold industry standards.
- Database load balancing with Raftify during high-availability (HA) config: Minimize bottlenecks for metadata services and ensure reliable operation in clusters of tens of thousands of units.
- Enhanced automatic scaling for serving large language models: API metrics such as request patterns and latency, along with resource usage, are analyzed for optimal scaling.
- Strengthening the project unit: Projects will be able to manage datasets, models, pipelines, and more collectively. The objective is to facilitate fine-grained role-based access control (RBAC) to accommodate diverse collaborative scenarios.
- Enhanced management capabilities for enterprise customers: You'll have integrated logging and monitoring, as well as audit log tracking for regulatory compliance.
All of these changes are being made with one goal in mind: to accelerate our customers' AI projects. With support for new AI accelerators and connections to other Kubernetes-based solutions, our team is looking forward to further maturing the Backend.AI Core and MLOps features. Stay tuned for the next Sokovan's journey as he takes on a broader role.
Backend.AI WebUI
In the near future, the Backend.AI WebUI will be getting a new look. From a user's perspective, the user interface is probably the most important factor that determines the first impression of Backend.AI. We have always recognized the importance of the WebUI and have been innovating on it. We launched ML Desktop last year and GenAI Desktop earlier this year to test different user experiences, and we recently brought a user-friendly UI to our products with Neo Session Launcher.
Introducing WebUI Neo, the third evolution of WebUI. Designed in close collaboration with Vice Versa Design Studio with the goal of delivering a rich user experience, this new design language keeps the user in mind from start to finish. To coincide with the relaunch of Backend.AI, we've redesigned the entire UI/UX to give it a sleeker, more futuristic look and feel.
WebUI Neo was designed with the concepts of “reducing cognitive load” and “maintaining consistency in visual metaphors.” In terms of reducing cognitive load, we wanted to minimize the amount of complex information users had to type or search for. For example, when setting up large-scale experiments, we limited the amount of information available in a step by exposing information sequentially, rather than presenting dozens of options at once.
In terms of “maintaining consistency in visual metaphors,” we've organized UI/UX elements, from screen composition to icons to colors, into similar design patterns for similar concepts, such as experiments, models, and data sets. This way, users can reuse what they've learned once without having to relearn how to use similar features. WebUI Neo will be applied across both Backend.AI Core and Enterprise.
In recognition of this innovation, WebUI Neo was awarded the Excellence Award, which is only given to four consortia, at the Seoul Design Foundation and Seoul Metropolitan Government's Industrial Design Development Support Project for Small and Medium-sized Enterprises.
WebUI Neo will not be included in the Backend.AI 24.09 update right away, but is still being developed and tested with the goal of a general release later this year. We're also finalizing the move from Web Components, which is the codebase used since the first version of WebUI, to React. WebUI Neo is more than just a repackaging of past features; it will continue to add new functionality that is tightly aligned with machine learning workflows and will be the foundation for achieving the high level of automation and ease of use that Backend.AI strives for. This is the future we envision with WebUI Neo, a world where everyone can easily understand and benefit from AI infrastructure beyond its complexity.
Lablup Enterprise
The core of Lablup Enterprise, centered on Backend.AI Enterprise, can be described as ___ made easy. Lablup Enterprise aims to make deep-level AI technology innovation easy with end-to-end technology from device driver level to AIOps. We have three ___ made easy concepts: “Scaling made easy”, “Acceleration made easy”, and “Inference made easy”.
Scaling made easy: FastTrack 2, Finetun.ing, Cluster Designer
FastTrack 2
FastTrack 2, released with 24.09, is an automation solution for AI projects at scale. It provides pipeline management based on project groups, making it easy to define and execute complex workflows. It offers a wide range of reusable templates to minimize repetitive tasks. In addition, FastTrack 2 enables you to better leverage your resources by connecting with external partners. You can add model compression nodes and model serving services from partners to your pipeline.
Finetun.ing
Finetun.ing is a cloud-based fine-tuning service created in collaboration with FastTrack. It stands out from traditional fine-tuning services by eliminating the need for users to prepare their own data. Typically, fine-tuning involves uploading data to adjust a model, but Finetun.ing simplifies this process by allowing users to interactively input prompts. The service then generates synthetic data from these interactions to fine-tune the model. The fine-tuned models are automatically evaluated and made available for download, complete with a model card. Finetun.ing operates on NVIDIA Nemotron and supports Llama 3.1 and Gemma 2. Ongoing tests aim to enable fine-tuning for an array of new models, with plans to expand the selection in the future.
Finetun.ing is currently gearing up for its final unveiling, and we've decided to open a waitlist for the first time at this event. You can sign up for the waitlist at https://finetun.ing.
Cluster Designer
Backend.AI Cluster Designer is a GUI-based cluster design tool. It automatically calculates the effective performance of a cluster of your desired size and performance, along with the required hardware configuration and estimated cost. It's perfect for those who want to validate the optimal architecture before actually building.
Backend.AI Helmsman
Backend.AI Helmsman is an interactive cluster management interface. It makes complex cluster operations possible just by chatting in a terminal. Under the hood, it utilizes a fine-tuned Gemma-based model to accurately understand user intent. It combines packages such as TorchTune, LangGraph, and LangChain to build interactive fine-tuning pipelines for on-premises environments. The UI packages and models for the Helmsman CLI and WebUI will be released after the Backend.AI 24.09 release, by the end of the year.
Acceleration made easy
The second is “Acceleration made easy”. We support a wider variety of accelerators for AI workloads than any other AI infrastructure platform in existence.
CPU architecture coverage includes x86 as well as heterogeneous architectures such as Arm and RISC-V. We work closely with the latest accelerators, including NVIDIA's Grace Hopper, AMD's MI Series, Intel Gaudi, Graphcore Bow, GroqCard, Rebellions ATOM+, and Furiosa RNGD, to ensure you get the same user experience and peak performance on Backend.AI.
Inference made easy
Finally, “Inference made easy”.
We've simplified the sharing and distribution of pre-trained models with a unified model store. Inspired by package managers like Choco on Windows and Homebrew on macOS, Lablup ION model recipes allow you to install models and services contributed by the community via GitHub with a single command.
PALI, PALI PALI (PALI2), PALANG
There's also something new to introduce in terms of model service operations. It's PALI, PALI2, PALANG.
Performant AI Launcher for Inference (PALI) is a high-performance inference runtime that combines the Backend.AI model player with a curated model catalog and predefined models. It features flexible scalability and high performance. Anyone can easily install and run NVIDIA NIM, Hugging Face models, and Lablup ION recipes right out of the box as model services.
PALI2 is a dedicated hardware infrastructure appliance for PALI. You can easily scale by connecting multiple appliances with PALI. PALI2 is an architecture optimized for AI workloads, delivering high performance and low latency. Depending on your installation, we can provide and update models for different architectures and chip environments.
We are also preparing a PALI2 appliance that incorporates the NVIDIA reference platform GH200, and KYOCERA Mirai Envision Co., Ltd. in Japan will launch Instant.AI as the first reference platform for PALI2, which will be available for purchase on October 1.
Reference platforms for the Korean market will be available to reserve in October and for sale in Q4. PALI2 appliances targeting the U.S. and European markets will be available as early as Q4 of this year.
PALANG is a language model inference platform that includes PALI, FastTrack, Talkativot, and Helmsman. It provides ready-to-use inference and fine-tuning settings, greatly simplifying the deployment and operation of large-scale language models. Talkativot makes it easy to create custom chatbot interfaces and provides software components for model comparison and interface building during development. You can use PALI and PALI2 if you only need inference, or PALANG if you need both language model fine-tuning and inference.
G
Finally, One More Thing... We'd like to give you a sneak peek at a new project we're currently working on: G, a language model based on Gemma2. It features easy customization with Finetun.ing. It will be used for a variety of purposes, including a backend model for Helmsman and an enterprise agent. Details will be revealed soon.
From Uncharted AI to Industrial Revolution
During the Age of Discovery, countless adventurers sailed the globe in search of pepper. Their adventures led to the discovery of many parts of the world that remained uncharted, and the world became more connected through the routes they opened. Shipbuilding and navigation were improved, new trade routes were opened, and innovations were made in medicine, military technology, and more. But that's not all: the Age of Discovery spawned another important event: the Industrial Revolution.
We are currently living in what is known as the Age of Great AI. It's akin to the dawn of the Age of Discovery, where the doors to new possibilities are just now opening. One person is returning with pepper, while another is constructing a larger vessel to demonstrate that the Earth is round. We are witnessing the equivalent of the Industrial Revolution that the Age of Discovery brought about.
Engine of AI Infrastructure
The Industrial Revolution began with James Watt's steam engine. The invention of the steam engine ushered in an era of mass production and mechanization. Now we're in the midst of another revolution. In the face of the tidal wave that is the Age of Great AI, Lablup is building a new engine.
Lablup is the engine of AI infrastructure. Our technology fuels innovation across industries. While the steam engine harnessed the power of coal, our engine is fueled by data. Just as a car engine converts the energy of gasoline into motion, Lablup provides an efficient and powerful engine that converts the fuel of data into AI, and the value it brings.
Just as the internal combustion engine gave birth to the automotive industry, AI engines will reshape the data-driven IT industry. Lablup is preparing for the time when everyone and every organization will be able to derive insights and value from their own data, rather than just storing and managing it. Lablup's AI engine is unrivaled in scale and speed. It has the scale to run dozens to tens of thousands of GPUs simultaneously, processing petabytes of data in real time, for the IoT and beyond. Just as the performance of an engine determines the speed of a car, our infrastructure will determine your success in the AI ecosystem.
So far, you've seen the engines that we have built. With these engines, we want to drive the AI revolution beyond the Age of Great AI. We're going to work on designing and improving the engine so that each and every one of you can be in the driver's seat. We invite you to step on the gas pedal of the AI era with Lablup.
27 September 2024
Backend.AI Open Source Contribution Guide (Jul. 2024)
By Daehyun Sung
Backend.AI's core engine utilizes many open-source software components and is itself being developed as open source. When enterprise customers encounter inconveniences or bugs while using Backend.AI, we provide issue tracking and support through customer support and technical support channels. However, those using the open-source version can also directly contribute to the project.
There are mainly two ways to contribute: creating an issue that explains in detail what problem exists or what improvement ideas you have, and making a pull request to directly contribute code changes. In this post, we'd like to introduce a few things that are good to know in advance for more effective and faster communication with the development team during the contribution process.
Introduction to GitHub Repositories
As seen in the previous post Backend.AI Open Source Contribution Guide, Backend.AI was originally developed with repositories divided into the Backend.AI meta-repository and several sub-components.
However, from version "22.06", Backend.AI has changed to a mono-repository using Pants.
This transition in the development workflow has greatly helped in resolving package compatibility issues that often occur across multiple individual components, creating a more convenient development environment.
Pants is a fast, scalable, and user-friendly build system.
If you want to submit an issue, the first place to look is the Backend.AI repository. The repository named Backend.AI integrates multiple packages using Pants. This repository is not only for project management but also contains code that actually performs functions. All issues related to Backend.AI's server and Client SDK are managed here, and links to other projects are provided through the README.
When creating a new issue, two default templates are provided: bug report and feature request. It's not strictly necessary to follow these templates, but considering the complexity of Backend.AI and its various usage environments, following them makes it easier to share the context needed to identify a problem.
Introduction to Mono-repository
From version "22.06", Backend.AI has changed to a mono-repository using Pants. A mono-repository is a project with an integrated code base that shares the basic dependencies, data models, features, tooling, and processes of multiple projects. The projects that were previously maintained separately are integrated and operated as a single repository.
Introduction to Pants
Backend.AI is installed using Pants as a build system. For more details about Pants, please check the following link Pants - Getting started.
Relationship between Backend.AI Components
Figure 1. Relationship structure between major Backend.AI components
Figure 1 is a diagram showing the relationship between the major components of Backend.AI.
Figure 2. Major component structure of Backend.AI and examples of execution methods
Figure 2 is a diagram showing the major component structure of Backend.AI, and shows the location of the source code of the components and execution commands.
Most of Backend.AI's components are managed in the Backend.AI repository, and the source code is located in the src/ai/backend/ subdirectory. Briefly, summarizing what each component does by directory:
- src/ai/backend/manager (Manager): Core service that monitors computational resources of the entire cluster, handles session scheduling, provides user authentication and APIs for session execution
- src/ai/backend/agent (Agent): Service installed on compute nodes to manage and control containers
- src/ai/backend/common (Common): Library of functions and data formats commonly or frequently used across multiple server-side components
- src/ai/backend/client (Client SDK for Python): Official command-line interface and library providing API wrapper functions and classes for Python
- src/ai/backend/storage (Storage Proxy): Service that allows user web browsers or Client SDK to directly perform large-volume I/O from network storage
- src/ai/backend/web (Web Server): HTTP service that provides routing for Web UI and SPA (single-page app) implementation and web session-based user authentication
- src/ai/backend/webui (Web UI & Desktop App): Web component-based implementation of the actual UI that users interact with. Also supports Electron-based desktop app builds. Also includes a local lightweight version of the app proxy that allows users to directly access application ports running inside containers.
Backend.AI Version Management Method
Backend.AI has major releases every 6 months (March and September each year), with post-release support provided for about 1 year. Therefore, the version number follows the CalVer format in the form of YY.0M.micro (e.g., 20.09.14, 21.03.8). However, due to the version number normalization of the Python packaging system, the version of the wheel package is in the format YY.MM.micro without zero-padding in the month part (e.g., 20.9.14, 21.3.8). Some detailed components with version update cycles different from the main release cycle follow the general SemVer format.
Essential Packages to Install Before Development
Before installing Backend.AI, you need to install Docker, Docker Compose v2, etc. first. When installing Backend.AI using the scripts/install-dev.sh script in the repository, it checks for the installation of Docker, Docker Compose v2, etc., and guides you through the installation process. If Python, pyenv, Docker, npm are not installed, you need to install the essential packages as follows. For Python, please install it using the system package's Python3. Then, you need to install pyenv and pyenv-virtualenv.
$ curl https://pyenv.run | bash
Then, you can install Docker and Docker Compose v2 as follows:
macOS
On macOS, installing Docker Desktop on Mac provides both Docker and Docker Compose v2.
Ubuntu, Debian, CentOS, Fedora Core, and other Linux environments
For Ubuntu, Debian, CentOS, Fedora Core, you can automatically install Docker and Docker Compose v2 using the following script:
$ sudo curl -fsSL https://get.docker.io | bash
After installing Docker, you may get a unix:///var/run/docker.sock access permission error when running without sudo, like this:
$ docker ps
Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/json": dial unix /var/run/docker.sock: connect: permission denied
If such a permission problem exists, set the permissions using the following commands:
$ sudo usermod -aG docker $(whoami)
$ sudo chown root:docker /var/run/docker.sock
After that, reboot and run docker run hello-world to confirm that it runs normally.
$ docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
c1ec31eb5944: Pull complete
Digest: sha256:94323f3e5e09a8b9515d74337010375a456c909543e1ff1538f5116d38ab3989
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub. (amd64)
 3. The Docker daemon created a new container from that image which runs the executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/
Instead of changing the group ownership of /var/run/docker.sock with chown, changing the permissions of the /var/run/docker.sock file to 666 allows other users to access it without rebooting.
$ sudo chmod 666 /var/run/docker.sock
However, setting the permissions of the /var/run/docker.sock file to 666 creates a security vulnerability, because any local user can then read from and write to the Docker daemon socket.
You can check if Docker Compose v2 is installed as follows:
$ sudo docker compose version
Docker Compose version v2.28.1
If nvm is not installed, you should install nvm as shown in the following link nvm - install & Update Script.
$ curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash
After installing nvm, install the latest LTS version of Node.js and set it up for use.
$ nvm install --lts
$ nvm use --lts
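Before running the installer, it can be handy to confirm which of the tools above are already on your PATH. The following is a minimal sketch, not part of Backend.AI: the tool list and the missing_tools helper are illustrative assumptions. Note that nvm is a shell function rather than an executable, so the sketch looks for node and npm instead, and it does not verify the Docker Compose v2 plugin.

```python
import shutil

# Illustrative list based on the prerequisites named in this guide;
# adjust it to match your own setup.
REQUIRED_TOOLS = ("python3", "pyenv", "docker", "node", "npm")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that `which` cannot find on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

The `which` parameter exists only so the lookup can be replaced in tests. A tool reported as missing may still be installed somewhere outside PATH (pyenv, for example, often lives under the home directory until the shell profile is updated).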
How to Install the Development Environment
To actually contribute code, you need to write a pull request, and unless it's a simple typo correction or documentation-related contribution, you need to directly modify the code and run it, so it's essential to set up your own development environment. Backend.AI has a structure where multiple components work together, so installation is not complete just by cloning one repository and creating a Python virtual environment with an editable install[1]. At a minimum, you need to set up and run manager, agent, storage-proxy, webserver, and wsproxy to check the functioning GUI, and for the CLI environment, you need to install the client SDK separately. Also, Redis, PostgreSQL, and etcd servers need to be run together for manager operation and communication with the agent.
If you have installed the essential packages introduced earlier and want to install multiple components of Backend.AI, you can install them using the scripts/install-dev.sh script in the repository. This script does the following:
- Checks for the installation of pyenv, Python, Docker, npm, etc., and guides the installation method
- Installs all of these various components in their respective directories
- At this time, components such as accelerator-cuda, which are necessary for the operation of other components, are additionally installed in an editable state.
- Adds database/etcd fixtures, including basic port settings and example authentication keys, so that the components can find and talk to each other
- Creates and runs PostgreSQL, Redis, etcd services using Docker Compose under the name "halfstack"
When the install-dev script execution is successfully completed, it outputs commands to run service daemons such as manager and agent, and basic configured example account information. Following the instructions, use terminal multiplexers like tmux, screen, or multiple tab features of terminal apps to run service daemons in separate shells, and confirm that the hello world example works. Then you're ready to develop and test Backend.AI.
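If a daemon fails to start, one quick check is whether the halfstack services are accepting TCP connections at all. The sketch below is an assumption-laden illustration: the host and port values are placeholders (the stock default ports of PostgreSQL, Redis, and etcd), not necessarily the ports that install-dev configures, so replace them with the values printed by the installer.

```python
import socket

# Placeholder endpoints: these are the upstream default ports, NOT
# necessarily what your halfstack uses. Check the install-dev output.
HALFSTACK_SERVICES = {
    "postgresql": ("127.0.0.1", 5432),
    "redis": ("127.0.0.1", 6379),
    "etcd": ("127.0.0.1", 2379),
}


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services=HALFSTACK_SERVICES):
    """Map each service name to whether its endpoint accepts connections."""
    return {
        name: is_listening(host, port)
        for name, (host, port) in services.items()
    }


if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name}: {'up' if ok else 'not reachable'}")
```

A successful connection only shows that something is listening on the port; it does not prove the service is healthy or correctly configured.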
Currently, this method supports only Intel (amd64/x86_64) and ARM-based machines running macOS or a Linux distribution where Docker Compose can be installed, such as Ubuntu, Debian, CentOS, or Fedora.
Usually, when you first use this install-dev script, it often stops due to various errors or pre-check failures and needs to be run again. In this case, you can easily perform the deletion procedure using the scripts/delete-dev.sh script.
Installing and Uninstalling Backend.AI
Using these install-dev and delete-dev scripts, you can freely install and uninstall Backend.AI. First, clone the Backend.AI repository.
$ git clone https://github.com/lablup/backend.ai
Then install Backend.AI.
$ cd backend.ai
$ ./scripts/install-dev.sh
After the installation is complete, please take note of the result content that appears on the screen.
If you want to uninstall Backend.AI, run the scripts/delete-dev.sh script from the location where you cloned the Backend.AI repository.
$ cd backend.ai
$ ./scripts/delete-dev.sh
Things to Know Before Contributing
As with most projects managed in distributed version control systems, to contribute to Backend.AI, code work should be based on the latest commit of the main branch of the original remote repository, and if conflicts occur, they should be resolved before requesting a review. If you've forked the original repository, your forked repository and the original repository need to be kept synchronized.
Before explaining the method, please refer to the following terminology to help understanding:
- Original remote repository (upstream): The original Backend.AI repository. All major commit contents are reflected here.
- Forked original repository (origin): The Backend.AI repository copied to "your" account via GitHub. (Note: Original remote repository != Forked original repository)
- Code copy (local working copy): The forked repository currently downloaded to your local machine
Git command branch notation
- main: The main branch of the current local working copy
- origin/main: The main branch of the repository (origin) from which I cloned to create my local working copy
- upstream/main: The main branch belonging to the separately added upstream remote repository
Workflow concepts
- At the time of forking, origin/main is created
- When you clone the forked repository, main is created on your work computer
- Create a new topic branch from main and proceed with work
- When you upload this work branch to origin and create a PR, GitHub automatically points to the original repository of the fork
- At this point, to synchronize changes to the main of the original repository during work, follow the procedure below
The method of synchronization is as follows:
- step1: Add the original remote repository under the name upstream
$ git remote add upstream https://github.com/lablup/backend.ai
- step2: Fetch the latest commits of the main branch of the original remote repository to the code copy (local working copy)
$ git fetch upstream
- step3: Merge the latest commits of the main branch of the original remote repository into the main branch of the code copy (local working copy)
$ git switch main && git merge --ff upstream/main
- step4: Push the changes made to the code copy (local working copy) in steps 1 ~ 3 to origin (your forked repository on GitHub)
$ git push origin main
Now upstream/main and origin/main are synchronized through main.
- step5: Reflect the latest updates to my branch that I'm working on
$ git switch topic
$ git merge main
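Steps 2 to 5 can be collected into a small helper so they always run in the same order. This is a hedged sketch rather than a Backend.AI tool: sync_commands and run_sync are hypothetical names, and the sketch deliberately swaps the `--ff` shown in step 3 for `--ff-only`, which makes git stop instead of creating a merge commit when main has diverged from upstream/main. It assumes the upstream remote from step 1 already exists.

```python
import subprocess


def sync_commands(topic, upstream="upstream", origin="origin", main="main"):
    """Return the git commands for steps 2 to 5 as argument lists."""
    return [
        ["git", "fetch", upstream],
        ["git", "switch", main],
        # --ff-only (swapped in for --ff) refuses to merge if the local
        # main has commits that upstream/main does not contain.
        ["git", "merge", "--ff-only", f"{upstream}/{main}"],
        ["git", "push", origin, main],
        ["git", "switch", topic],
        ["git", "merge", main],
    ]


def run_sync(topic, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in sync_commands(topic):
        if dry_run:
            print("$", " ".join(cmd))
        else:
            # check=True stops at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_sync("fix/memory-leak-in-stats")
```

Running with the default dry_run=True only prints the commands, which makes it easy to review them before letting the helper touch a working copy.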
When performing this process, if the histories of origin/main and upstream/main diverge and step 5 is performed incorrectly, it can become extremely difficult to recover. Also, when the CI tools used by Backend.AI test PRs, they are set to find common ancestor commits to see the differences between upstream/main and origin/topic, but if you reuse the main name for the topic branch, these tools will not work properly. If possible, always give a new name when creating a new branch.
How to Write a Pull Request
To send a specific bug patch or feature implementation as a PR, you first need to upload it to GitHub. There are several methods, but the following is recommended:
- Fork the repository on the GitHub repository page. (If you have direct commit permissions, it's recommended to create a branch directly without forking.)
- In your local working copy, use git remote to point to that forked repository.
  - Following convention, it's good to name Lablup's original repository as upstream and the newly created forked repository as origin.
  - If you installed with install-dev first instead of cloning after forking, the original repository will be origin, so you need to rename the remote.
- Create a new branch.
  - For branch names, prepend fix/ for bug fixes or feature/ for feature additions or improvements, and summarize the topic in kebab-case. (e.g., feature/additional-cluster-env-vars, fix/memory-leak-in-stats) Other prefixes like docs/, refactor/ are also used.
  - It's possible to write a PR by directly modifying the main branch, but during PR review and modification periods, if additional changes occur on the main branch, you'll have to rebase or merge every time you synchronize with the upstream repository, which is more troublesome. Having a separate branch allows you to rebase and merge when you want.
- Commit changes to that branch.
  - Commit messages should follow the conventional commit style as much as possible. Like branch names, use title prefixes such as fix:, feat:, refactor:, docs:, release:, and for Backend.AI specifically, setup: for dependency-related commits, repo: for cases like gitignore updates or repository directory structure changes. You can also indicate affected components in parentheses. (e.g., fix(scripts/install-dev): Update for v21.03 release)
  - Commit messages should be written in English.
- Push the branch and write the PR.
  - For PRs with separate issues, you should write the issue number in the PR body. If you want to reference an issue in the repository, look at the number in the issue link like https://github.com/lablup/backend.ai/issues/401 and write it in the format #401, and GitHub will automatically link it.
  - There's no specific format required for the PR body, but it's good to write what problem it's solving, what principle it's written on, or what tools or libraries were used, and why those choices were made.
  - PR titles and bodies can be written in English or Korean.
  - When you create a PR, you'll see various automated inspection tools in action. In particular, you must sign (register your GitHub username) the CLA (contributor license agreement) for the review to proceed.
  - You must pass all basic coding style and coding rule checks for each language. (For Python code, flake8, mypy, etc.)
  - In repositories with a changes directory and towncrier check, when you create a PR and receive its number, create a file named changes/<PR number>.<modification type> and write a one-line English sentence summarizing the changes in Markdown syntax. (For relatively simple content or if there's a separate existing issue, this content can also serve as the PR body.) Modification types include fix, feature, breaking, misc, deprecation, doc, and parts that differ by project are defined in each repository's pyproject.toml. You can refer to files like CHANGELOG.md or CHANGES.md to see how existing messages were written.
- Proceed with the review process.
  - When completed, the reviewer usually organizes the commit log in a squash-merge form to create a single commit for merging.
  - Therefore, don't feel burdened about making frequent small modification commits during the review process, and feel free to make commits whenever you think of something.
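The naming conventions in the list above are regular enough to check mechanically before pushing. The following is a minimal sketch under stated assumptions: the helper names and regular expressions are illustrative and are not the checks that Backend.AI's CI actually runs, and the prefix lists contain only the values mentioned in this guide. Consult each repository's pyproject.toml for the authoritative modification types and file naming.

```python
import re

# Only the prefixes and types named in this guide; real projects may
# accept more.
BRANCH_PREFIXES = ("fix", "feature", "docs", "refactor")
COMMIT_TYPES = ("fix", "feat", "refactor", "docs", "release", "setup", "repo")
CHANGE_TYPES = ("fix", "feature", "breaking", "misc", "deprecation", "doc")

# <prefix>/<kebab-case-topic>
_BRANCH_RE = re.compile(
    r"^(?:%s)/[a-z0-9]+(?:-[a-z0-9]+)*$" % "|".join(BRANCH_PREFIXES)
)
# <type>[(<component>)]: <summary>
_COMMIT_RE = re.compile(
    r"^(?:%s)(?:\([^()]+\))?: \S.*$" % "|".join(COMMIT_TYPES)
)


def is_valid_branch_name(name):
    """Check the prefix/kebab-case branch naming pattern."""
    return _BRANCH_RE.match(name) is not None


def is_valid_commit_title(title):
    """Check a conventional-commit style title with optional component."""
    return _COMMIT_RE.match(title) is not None


def news_fragment_path(pr_number, change_type):
    """Build the changes/<PR number>.<modification type> file name."""
    if change_type not in CHANGE_TYPES:
        raise ValueError(f"unknown modification type: {change_type}")
    return f"changes/{pr_number}.{change_type}"


if __name__ == "__main__":
    print(is_valid_branch_name("feature/additional-cluster-env-vars"))
    print(is_valid_commit_title(
        "fix(scripts/install-dev): Update for v21.03 release"))
    # 1234 is a made-up PR number used purely for illustration.
    print(news_fragment_path(1234, "fix"))
```

Functions like these could be wired into a local pre-push hook, although the project's own CI checks remain the final word on what is accepted.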
It's even better to use tools like GitHub CLI, SourceTree, GitKraken along with git commands.
Summary
We've looked at Backend.AI's overall component structure and repository structure, how to install the development environment, and how to write pull requests. I hope this guide has helped you take one step closer to Backend.AI's source code.
[1]: An "editable" installation installs a Python package so that it points directly at the source directory. Changes made in the source directory are reflected immediately when the package is imported, without editing anything inside the site-packages directory.
10 July 2024
aiomonitor-ng: Debugging tool for complex asyncio applications
By Joongi Kim
As program complexity grows, software developers need robust debugging tools. The optimal debugging method involves pinpointing a reliable way to replicate an issue within a development setting conducive to free experimentation, followed by the creation of automated tests. However, when the reproduction scenario is overly complex or involves bugs that sporadically appear in production environments, detailed logging becomes the alternative to comprehend the issue retrospectively. In this post, we present aiomonitor-ng, a tool designed to simplify the debugging of intricate asyncio programs.
Debugging asyncio applications has its own difficulties. In Python, the stack trace is commonly used for debugging, revealing the program's location at the time of an exception. However, with asyncio's concurrent execution of multiple coroutine tasks, each with its own stack, it's crucial to examine not just the stack of the coroutine where the exception occurred but also those of 'related' coroutines to pinpoint if the error stemmed from another task. This issue intensifies when an external library implicitly generates a coroutine that invokes my code. Moreover, certain bugs, like coroutine task explosions that only manifest in production, or silent terminations of ongoing coroutine tasks, are particularly elusive in development settings, as they don't produce clear exceptions and are only detectable through post-incident logs.
aiomonitor is a production-grade live debugging tool created by the asyncio core developers. Wrapping asyncio-based code within a monitor object allows you to initiate a telnet session to a pre-set TCP port outside the process while the code is active. Through simple commands, you can inspect the list of coroutine tasks running in the event loop and the status of individual stacks. Backend.AI has integrated aiomonitor, assigning a unique debugging telnet port to each service process. (For security purposes, only local connections are permitted.) This integration has significantly aided in troubleshooting production-specific issues. Nonetheless, pinpointing the cause of a coroutine task's failure due to an external library, not specific to Backend.AI's code, remains a challenge when using aiomonitor at the time of the problem's occurrence.
We have developed an enhanced version named aiomonitor-ng, where "ng" signifies next-generation. This version includes the following additions and enhancements:
- Task creation tracker: For every running coroutine task, the stack trace at the moment of its creation (asyncio.create_task()) is preserved, together with the task that created it, so that the entire chain of task creation can be traced (the ps and where commands).
- Task termination tracker: Up to N recently terminated coroutine tasks are preserved for inspection. When one task cancels another (Task.cancel()), the stack trace of the cancellation trigger is also preserved, so that the entire cancellation chain can be traced (the ps-terminated and where-terminated commands).
- Persistent task marker: By default, only the N most recently terminated tasks are tracked, to prevent memory leaks. However, tasks that must keep running throughout the application's lifespan can be marked with a decorator (aiomonitor.task.preserve_termination_log) so that their termination logs are always preserved, regardless of the history limit. The termination log query command also offers an option to filter for these tasks.
- Sophisticated terminal UI: We improved the command-line processing, which was previously a simple REPL (read-evaluate-print loop) based on handcrafted command parsing. We rewrote the aiomonitor server-side implementation to use Click and prompt_toolkit. We also developed a Telnet client that runs natively on asyncio and provides autocompletion for commands and for arguments such as task IDs.
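The task creation tracker can be illustrated with a small standard-library sketch. It assumes a custom task factory is an acceptable place to capture the creation stack; this is one possible approach, not necessarily how aiomonitor-ng does it. The names creation_stacks, tracking_task_factory, and spawn_child are hypothetical.

```python
import asyncio
import traceback
import weakref

# Maps each task to the stack captured at the moment it was created.
# Weak keys let finished tasks be garbage-collected normally.
creation_stacks = weakref.WeakKeyDictionary()


def tracking_task_factory(loop, coro, **kwargs):
    # Build the task as usual, then remember the stack of whoever asked for it.
    task = asyncio.Task(coro, loop=loop, **kwargs)
    # Drop the last entry, which is this factory function itself.
    creation_stacks[task] = traceback.extract_stack()[:-1]
    return task


async def child() -> None:
    await asyncio.sleep(0)


async def spawn_child() -> asyncio.Task:
    # The creation stack of the child should include this function.
    return asyncio.create_task(child(), name="child")


async def main() -> list[str]:
    asyncio.get_running_loop().set_task_factory(tracking_task_factory)
    task = await spawn_child()
    creators = [frame.name for frame in creation_stacks[task]]
    await task
    return creators


creators = asyncio.run(main())
print(creators)
```

Following the recorded stacks from a task to its creator, and from that creator to its own creator, yields the kind of creation chain described in the first bullet.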
Here are some screenshots of the actual usage:
Using aiomonitor-ng, we have resolved resource leaks and performance issues caused by the grpcio library creating an excessive number of coroutine tasks through callbacks. We also fixed a problem where the tasks that monitor events from the Docker daemon would silently stop on specific input message patterns. This prevented the outcomes of container creation or deletion from being reported, which led to system crashes.
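The silent task termination described above can be made visible with a bounded termination log. The following is a rough standard-library sketch of that idea, assuming a done callback and a fixed-size deque; it does not record the cancellation chain that aiomonitor-ng tracks, and MAX_HISTORY, record_termination, and watch_events are hypothetical names.

```python
import asyncio
from collections import deque

MAX_HISTORY = 10  # hypothetical limit on remembered terminations
terminated = deque(maxlen=MAX_HISTORY)  # oldest entries are dropped first


def record_termination(task: asyncio.Task) -> None:
    # Called by the event loop once the task has finished for any reason.
    if task.cancelled():
        outcome = "cancelled"
    elif task.exception() is not None:
        outcome = f"failed: {task.exception()!r}"
    else:
        outcome = "finished"
    terminated.append((task.get_name(), outcome))


async def watch_events() -> None:
    # Stand-in for an event watcher that dies on an unexpected message.
    await asyncio.sleep(0)
    raise ValueError("unexpected message")


async def main() -> None:
    task = asyncio.create_task(watch_events(), name="event-watcher")
    task.add_done_callback(record_termination)
    await asyncio.sleep(0.05)  # nobody awaits the watcher, so it fails silently


asyncio.run(main())
print(list(terminated))
```

Because nothing awaits the watcher task, its exception would otherwise surface only as a late log message, if at all; the log keeps the outcome available for later inspection while the deque bounds memory use.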
We hope that developers, not only at Lablup but also those working on other Python asyncio applications, will find aiomonitor-ng useful for debugging.
aiomonitor-ng can be installed via PyPI using the command pip install aiomonitor-ng, and it is open-sourced on my GitHub account for anyone to use and contribute.
28 November 2022