Culture

Nov 22, 2023

Culture

2023 Summer Intern in Lablup

  • Dongjin Park

    Intern

Nov 22, 2023

Culture

2023 Summer Intern in Lablup

  • Dongjin Park

    Intern

#Overview

I applied to CUop, a collaboration between universities specializing in science and technology, and worked as an intern at Lablup for 8 weeks.

I wrote about my experiences through onboarding, developing Backend.AI, and attending PyCon.

Motivation for applying

I first learned about Lablup through a PyCon presentation session I stumbled upon. I could tell that the company had a lot of technically deep and passionate members. I applied to Lablup because I was interested in Python and asynchronous programming.

onboarding

During the first two weeks, we conducted an onboarding process.

We went through the Realtime Web Chat implementation, Backend.AI development environment, and code base seminar.

Realtime Web Chat

This is an assignment to familiarize yourself with Python asyncio. You will develop a real-time chat app using aiohttp, an asynchronous web framework, and redis, an in-memory database. The task also includes setting up docker compose to build python and redis at once. For more information, see the Github Readme.

To broadcast the messages through redis, we used Redis pub/sub. pub/sub acts as a platform that delivers messages without storing them. Since we didn't have a requirement to store messages, we used Redis pub/sub. We also registered the Redis pub/sub process as a task using asyncio.create_task() to run it in an event loop. Realtime Web Chat Launch Screen Realtime Web Chat Launch Screen

I was able to understand the basic behavior of asyncio. I was able to ask questions and solve the difficult parts. I think Lablup has a great environment for interns and junior developers to grow because they can freely ask questions through Microsoft Teams.

Build the Backend.AI development environment

I installed Backend.AI on a VM Farm and a local VM and tried to run it myself. I read the official documentation, but the process was not smooth. I encountered various errors, which I solved by sharing with my fellow interns and asking questions on Teams.

💡 A VM farm is an environment where virtual machines are managed and run. Lablup has its own VM Farm.

This was my first experience of developing with a VM Farm and VSCode connected via SSH. To develop Backend.AI, I need to run multiple processes (Manager, Agent, Storage proxy, Web server, etc.) and Docker Containers. If I use a laptop, it's a bit of a battery drain. However, with VM Farm, you can develop Backend.AI lightly because all you need is an SSH connection. In fact, I was able to develop for a long time using VM Farm when I was out of the office and couldn't charge my laptop.

Code Base Seminar

After looking at the code, focusing on the difficult parts of Backend.AI, I prepared a seminar based on my understanding. I was in charge of presenting the Manager, Agent, and GraphQL parts.

Since Backend.AI is open source, the official documentation is well written. I studied the architecture of Backend.AI by looking at the official documentation to understand the overall structure and asking the employees directly if I had any questions. Since the topic of the presentation was session/kernel creation and scheduling control of Backend.AI Manager, I studied the manager code and analyzed the logs of the manager process. A sequence diagram I drew in preparation for a seminar presentation. A sequence diagram I drew in preparation for a seminar presentation.

Analyzing the logs, I found a bug that caused the session state to change from Preparing back to Pulling. It was rewarding to analyze the logs one by one. It was difficult to analyze the logs in order because of the asynchronous code base, but drawing a call graph and a sequence diagram was very helpful.

Develop Backend.AI

After onboarding, I started working on Backend.AI. I worked on GitHub issues and volunteered to help or found and resolved issues myself.

I created 9 pull requests from the Backend.AI repository repository and 2 pull requests from the Backend.AI-WebUI repository, and they all merged!

I chose high-priority issues that I was confident in addressing. I wanted to make as many contributions as I could in the short time frame of two months.

First PR

https://github.com/lablup/backend.ai/pull/1395

I created a PR to fix a bug I found while preparing for a seminar. It was an easy PR to fix the API parameter. However, it was a good experience to learn about branch name convention, commit convention, and news fragment, and to experience the CI (Continuous Integration) process with GitHub Actions and to do some Git-related shoveling beforehand.

💡 A news fragment is a one-sentence Markdown description of what the branch created by the PR is trying to do. You want to keep it simple and clear so that if you see this PR again in the future, you'll know what it was trying to do.

PRs for vfolder

When I heard that Teams had an open issue for an intern, I jumped at the chance to apply. I had to learn a new concept called vfolder, but I knew it would be important to get to know the product.

PR (1)

https://github.com/lablup/backend.ai/pull/1397

Only admin can create a project type vfolder. It should be possible to create a vfolder regardless of the max_vfolder_count of the keypair resource policy, but if the user type vfolder exceeds the max_vfolder_count, the project type vfolder cannot be created. At first, I was confused by the terminology, but by analyzing the code and asking questions, I was able to interpret the terminology.

PR (2)

https://github.com/lablup/backend.ai/pull/1400

Fixed new bugs discovered while addressing PR (1).

PR (3)

https://github.com/lablup/backend.ai/pull/1417

I found an issue related to the PR (1) issue. DB migration and GraphQL were new to me, but I wanted to try them out, so I supported them. I used a DB migration tool called Alembic, studied GraphQL schema concepts, and modified my query and mutation code to support backward compatibility. I tried to use cURL to test the modified code, but GraphQL has a much longer request format than the REST API, which was cumbersome. I wrote the test code by asking questions to interns and employees who are familiar with GraphQL. I wrote python code to test the modified query and mutation in the form of CLI to make testing easier.

Supported an issue in Teams. There was a bug in the WebUI where it was not possible to delete a session if the wsproxy address of the resource group was invalid. I also wanted to get some experience with WebUI development.

PR (1)

https://github.com/lablup/backend.ai/pull/1423

I read through the WebUI code to troubleshoot the issue, but I couldn't quite grasp the concept of wsproxy. I realized that wsproxy has v1 and v2, but it was not easy to understand the difference between the two, so I asked the employees. The main difference between v1 and v2 is the path of the traffic: v1 goes through the manager to communicate with the container, while v2 can communicate directly with the container without going through the manager, which is faster. Once I understood what wsproxy does and the difference between v1 and v2, it was easier to understand how the code worked, and I realized that a lot of people didn't know the difference. I realized that questions that seemed easy to ask might not have been asked in the organization.

PR (2)

https://github.com/lablup/backend.ai-webui/pull/1819

I also modified the webui code to fix the issue. To fix the JavaScript code, I learned about callback functions, promise objects, and async/await. I handled errors so that they didn't affect other logic, and defined duplicate code as functions to eliminate duplication of code.

PR (3)

https://github.com/lablup/backend.ai-webui/pull/1833

However, since the WebUI needs to be backwards compatible with Backend.AI 22.09, we had a review from our CEO that it should also handle HTTP Status 404, so we made it handle both 404 and 500.

PR (4)

https://github.com/lablup/backend.ai/pull/1466

However, after the code was merged, a bug occurred: when setting up v1 wsproxy, the return value for wsproxy-version disappeared. This happened because I was modifying the core code and didn't handle all the branches. I fixed the code in a hurry, but it was a simple mistake, and I realized that I should write test code to prevent such mistakes.

https://github.com/lablup/backend.ai/pull/1444

In preparation for the seminar, I came across an issue with manager that I had been studying. With my internship coming to an end, I thought I could contribute by resolving the issue on the code I knew best.

This PR has been heavily modified by code reviews. Initially, we designed scheduler health check and scheduler trigger as the same API. After receiving code reviews, I split the two functions into different APIs to separate responsibilities. We originally stored health information only for the schedule function, but we also stored health information for the prepare function and the scale_services function because we needed to create the trigger API after storing health information for the three functions that run periodically according to the scheduler's global timer to get a complete picture of the scheduler's health. We also changed the design so that we can store the scheduler's health based on the manager ID because there may be multiple manager processes.

The scheduler's storage was also reviewed. Initially, we looked at the code in the existing manager state API to store manager state in etcd and did the same for scheduler state. etcd is good for storing configuration information that needs to be kept consistent, but it is slow to write. redis is a volatile database, but it performs well even with many reads/writes. Since the scheduler state is read/written periodically and does not need to be consistent, we switched to storing it in redis.

https://github.com/lablup/backend.ai/pull/1472

Now that I had a good understanding of the Manager part of Backend.AI, I wanted to understand another important component: the Agent. I came across an issue about the Agent, so I took a look.

While Backend.AI was running, there was a bug where the internal state of the Agent did not match the state of the actual working container. As a result, when creating a session, an InsufficientResource Error was thrown during the resource allocation process, even though there were actually enough resources. When the error occurred, we needed to improve the logging to understand what went wrong during the resource allocation process.

It took a while to figure out the resource allocation process. The concurrency issues were difficult, and it took a lot of Q&A with the CTO to get a general idea of the flow and what to log.

A few weeks after my internship ended, the CTO merged it with more than 10 commits, adding refactorings and test code. What impressed me was that he wrote test code to reproduce the error. I had to go through a complex process (see PR) to reproduce the error myself, which took a lot of development time. I could see the difference in productivity here. Of course, I thought about writing test code, but I thought the implementation would be too complicated, and I thought that writing test code would be the end of my internship. In the future, I shouldn't be so intimidated by writing test code and just try it out and learn as I go. Also, it seems that the refactorings were done with a focus on code readability. The functions in the part I modified were too long and not very readable, but after refactoring, the functions were shorter and the logging was cleaner. I realized that I shouldn't stop at implementation, but try to make good code.

Attending PyCon

On August 12 and 13, Lablup will have a booth at PyCon. Companies that sponsor PyCon are given the opportunity to have a booth. I was an intern, but I wanted to participate in the booth activities and listen to some of the talks. The company had some PyCon tickets left over, so I was able to participate.

At the Lablup booth, we had an event where we challenged Llama2 to write a 10-line pyramid at a prompt. The challenge wasn't that easy, and it was important to explain it in a way that Llama2 could understand. Two lucky people who submitted the correct answer were drawn to win a Nintendo Switch and a Logitech mouse. My role at the booth was to help direct PyCon attendees to the event, and since there were a lot of employees at PyCon, I was free to attend any talks I wanted to hear. Since Rableup is an open source company, we encourage people to contribute to open source and participate in conferences. In fact, four of our presenters attended PyCon, so we value participating in conferences.

Lablup Inc. booth Lablup Inc. booth

During the RustPython session, a tool called ruff was introduced as a replacement for the python lint tools flake8 and isort. Since ruff is composed of Rust, it is 100x faster than flake8. At Backend.AI, we were using flake8 and isort for lint, but after our CTO reviewed ruff, I watched him adopt ruff for the Backend.AI project right on the stairs of COEX. I realized that he was really good at the process involved in coding, applying new tools to the project in a short time and even revising the official documentation. I realized that I want to be a proficient developer someday. After PyCon, I read the updated official documentation, applied ruff to the Backend.AI development environment, and experienced linting 100x faster. If I hadn't participated in PyCon, I wouldn't have gotten up to speed on these great tools so quickly. I hope to continue participating in developer conferences in the future. Group photo with members of Lablup Inc. Group photo with members of Lablup Inc.

End of internship

During my internship, I tried to get as much experience as possible, and I wanted to contribute a lot. In the end, I was able to experience a lot because I tried to contribute a lot. I was quick to volunteer for issues that came up in Teams, so I was able to understand the core components of Backend.AI: vfolder, wsproxy, web-ui, manager, and agent. I also learned new concepts like DB Migration, GraphQL, and etcd. Although it was a bit physically demanding to attend the conference from morning to evening on the weekend, it was fun to listen to more than 10 presentation sessions, get inspired, and meet various people through booth activities.

During my internship, I think I actively asked questions about things I didn't understand, which helped me to solve issues quickly. I think the reason why I was able to ask a lot of questions was because there was a culture of horizontal rabble-rousing, and there were many people who were kind enough to answer my questions, so I was able to actively ask questions. I would like to take this opportunity to thank the members for their support.

I was able to experience a variety of things, including asynchronous programming experience, GitHub collaboration, presenting English seminars, and attending conferences. I feel like I've grown a lot as a developer through this program. I recommend the Lablup internship to anyone who is thirsty for growth.

This post is automatically translated from Korean

We're here for you!

Complete the form and we'll be in touch soon

Contact Us

Headquarter & HPC Lab

Namyoung Bldg. 4F/5F, 34, Seolleung-ro 100-gil, Gangnam-gu, Seoul, Republic of Korea

© Lablup Inc. All rights reserved.