Tag : Open Source

  • Backend.AI Open Source Contribution Guide (Jul. 2024)

    By Daehyun Sung

    Backend.AI's core engine utilizes many open-source software components and is itself being developed as open source. When enterprise customers encounter inconveniences or bugs while using Backend.AI, we provide issue tracking and support through customer support and technical support channels. However, those using the open-source version can also directly contribute to the project.

    There are mainly two ways to contribute: creating an issue that explains in detail what problem exists or what improvement ideas you have, and making a pull request to directly contribute code changes. In this post, we'd like to introduce a few things that are good to know in advance for more effective and faster communication with the development team during the contribution process.

    Introduction to GitHub Repositories

    As seen in the previous post Backend.AI Open Source Contribution Guide, Backend.AI was originally developed with repositories divided into the Backend.AI meta-repository and several sub-components.

    However, from version "22.06", Backend.AI has changed to a mono-repository using Pants.

    This transition in the development workflow has greatly helped in resolving package compatibility issues that often occur across multiple individual components, creating a more convenient development environment.

    Pants is a fast, scalable, and user-friendly build system.

    If you want to submit an issue, the first place to look is the Backend.AI repository. The repository named Backend.AI integrates multiple packages using Pants. This repository is not only for project management but also contains code that actually performs functions. All issues related to Backend.AI's server and Client SDK are managed here, and links to other projects are provided through the README.

    When creating a new issue, two default templates are provided: bug report and feature request. However, it's not strictly necessary to follow these templates. Considering the complexity of Backend.AI and its various usage environments, following these templates when writing content makes it easier to share context for problem identification.

    Introduction to Mono-repository

    From version "22.06", Backend.AI has changed to a mono-repository using Pants. A mono-repository is a project with an integrated code base that shares the basic dependencies, data models, features, tooling, and processes of multiple projects. It operates the repository by integrating multiple projects that were previously used into a single project.

    Introduction to Pants

    Backend.AI is installed using Pants as a build system. For more details about Pants, please check the following link Pants - Getting started.

    Relationship between Backend.AI Components

    Figure 1. Relationship structure between major Backend.AI components

    Figure 1 is a diagram showing the relationship between the major components of Backend.AI.

    Figure 2. Major component structure of Backend.AI and examples of execution methods

    Figure 2 is a diagram showing the major component structure of Backend.AI, and shows the location of the source code of the components and execution commands.

    Most of Backend.AI's components are managed in the Backend.AI repository, and the source code is located in the src/ai/backend/ subdirectory. Briefly, summarizing what each component does by directory:

    • src/ai/backend/manager (Manager): Core service that monitors computational resources of the entire cluster, handles session scheduling, provides user authentication and APIs for session execution
    • src/ai/backend/agent (Agent): Service installed on compute nodes to manage and control containers
    • src/ai/backend/common (Common): Library of functions and data formats commonly or frequently used across multiple server-side components
    • src/ai/backend/client (Client SDK for Python): Official command-line interface and library providing API wrapper functions and classes for Python
    • src/ai/backend/storage (Storage Proxy): Service that allows user web browsers or Client SDK to directly perform large-volume I/O from network storage
    • src/ai/backend/web (Web Server): HTTP service that provides routing for Web UI and SPA (single-page app) implementation and web session-based user authentication
    • src/ai/backend/webui (Web UI & Desktop App): Web component-based implementation of the actual UI that users interact with. Also supports Electron-based desktop app builds. Also includes a local lightweight version of the app proxy that allows users to directly access application ports running inside containers.

    Backend.AI Version Management Method

    Backend.AI has major releases every 6 months (March and September each year), with post-release support provided for about 1 year. Therefore, the version number follows the CalVer format in the form of YY.0M.micro (e.g., 20.09.14, 21.03.8). However, due to the version number normalization of the Python packaging system, the version of the wheel package is in the format YY.MM.micro without zero-padding in the month part (e.g., 20.9.14, 21.3.8). Some detailed components with version update cycles different from the main release cycle follow the general SemVer format.

    Essential Packages to Install Before Development

    Before installing Backend.AI, you need to install Docker, Docker Compose v2, etc. first. When installing Backend.AI using the scripts/install-dev.sh script in the repository, it checks for the installation of Docker, Docker Compose v2, etc., and guides you through the installation process. If Python, pyenv, Docker, npm are not installed, you need to install the essential packages as follows. For Python, please install it using the system package's Python3. Then, you need to install pyenv and pyenv-virtualenv.

    $ curl https://pyenv.run | bash
    

    Then, you can install Docker and Docker Compose v2 as follows:

    MacOS

    For MacOS, Docker Desktop on Mac automatically installs Docker and Docker Compose v2.

    Ubuntu, Debian, CentOS, Fedora Core, and other Linux environments

    For Ubuntu, Debian, CentOS, Fedora Core, you can automatically install Docker and Docker Compose v2 using the following script:

    $ sudo curl -fsSL https://get.docker.io | bash
    

    After installing Docker, if you get a unix:///var/run/docker.sock access permission error when running without sudo, like this:

    $ docker ps
    Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/json": dial unix /var/run/docker.sock: connect: permission denied
    

    If such a permission problem exists, set the permissions using the following command:

    $ sudo usermod -aG docker $(whoami)
    $ sudo chown root:docker /var/run/docker.sock
    

    After that, reboot and run docker run hello-world to confirm that it runs normally.

    $ docker run hello-world
    Unable to find image 'hello-world:latest' locally
    latest: Pulling from library/hello-world
    c1ec31eb5944: Pull complete
    Digest: sha256:94323f3e5e09a8b9515d74337010375a456c909543e1ff1538f5116d38ab3989
    Status: Downloaded newer image for hello-world:latest
    
    Hello from Docker!
    This message shows that your installation appears to be working correctly.
    
    To generate this message, Docker took the following steps:
    1. The Docker client contacted the Docker daemon.
    2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
        (amd64)
    3. The Docker daemon created a new container from that image which runs the
        executable that produces the output you are currently reading.
    4. The Docker daemon streamed that output to the Docker client, which sent it
        to your terminal.
    
    To try something more ambitious, you can run an Ubuntu container with:
    $ docker run -it ubuntu bash
    
    Share images, automate workflows, and more with a free Docker ID:
    https://hub.docker.com/
    
    For more examples and ideas, visit:
    https://docs.docker.com/get-started/
    

    Instead of changing the group ownership of /var/run/docker.sock with chown, changing the permissions of the /var/run/docker.sock file to 666 allows other users in the group to access it without rebooting.

    sudo chmod 666 /var/run/docker.sock
    

    However, setting the permissions of the /var/run/docker.sock file to 666 creates a security vulnerability.

    You can check if Docker Compose v2 is installed as follows:

    $ sudo docker compose version
    Docker Compose version v2.28.1
    

    If nvm is not installed, you should install nvm as shown in the following link nvm - install & Update Script.

    $ curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash
    

    After installing nvm, install the latest LTS version of Node.js and set it up for use.

    $ nvm install --lts
    $ nvm use --lts
    

    How to Install the Development Environment

    To actually contribute code, you need to write a pull request, and unless it's a simple typo correction or documentation-related contribution, you need to directly modify the code and run it, so it's essential to set up your own development environment. Backend.AI has a structure where multiple components work together, so installation is not complete just by cloning one repository and creating a Python virtual environment with an editable install[1]. At a minimum, you need to set up and run manager, agent, storage-proxy, webserver, and wsproxy to check the functioning GUI, and for the CLI environment, you need to install the client SDK separately. Also, Redis, PostgreSQL, and etcd servers need to be run together for manager operation and communication with the agent.

    If you have installed the essential packages introduced earlier and want to install multiple components of Backend.AI, you can install them using the scripts/install-dev.sh script in the repository. This script does the following:

    • Checks for the installation of pyenv, Python, Docker, npm, etc., and guides the installation method
    • Installs all of these various components in their respective directories
      • At this time, components such as accelerator-cuda, which are necessary for the operation of other components, are additionally installed in an editable state.
    • Adds database/etcd fixtures including basic port settings and example authentication keys that each component can look at each other
    • Creates and runs PostgreSQL, Redis, etcd services using Docker Compose under the name "halfstack"

    When the install-dev script execution is successfully completed, it outputs commands to run service daemons such as manager and agent, and basic configured example account information. Following the instructions, use terminal multiplexers like tmux, screen, or multiple tab features of terminal apps to run service daemons in separate shells, and confirm that the hello world example works. Then you're ready to develop and test Backend.AI.

    Currently, this method only supports Intel (amd64/x86_64) and ARM-based macOS and Ubuntu/Debian/CentOS/Fedora and Linux environments where Docker Compose can be installed.

    Usually, when you first use this install-dev script, it often stops due to various errors or pre-check failures and needs to be run again. In this case, you can easily perform the deletion procedure using the scripts/delete-dev.sh script.

    Installing and Uninstalling Backend.AI

    Using these install-dev and delete-dev scripts, you can freely install and uninstall Backend.AI. First, clone the Backend.AI repository.

    $ git clone https://github.com/lablup/backend.ai 
    

    Then install Backend.AI.

    $ cd backend.ai
    $ ./scripts/install-dev.sh 
    

    After the installation is complete, please take note of the result content that appears on the screen.

    If you want to uninstall Backend.AI, run the scripts/delete-dev.sh script from the location where you cloned the Backend.AI repository.

    $ cd backend.ai
    $ ./scripts/delete-dev.sh 
    

    Things to Know Before Contributing

    As with most projects managed in distributed version control systems, to contribute to Backend.AI, code work should be based on the latest commit of the main branch of the original remote repository, and if conflicts occur, they should be resolved before requesting a review. If you've forked the original repository, the current forked original repository and the actual original repository need to be synchronized.

    Before explaining the method, please refer to the following terminology to help understanding:

    • Original remote repository (upstream): The original Backend.AI repository. All major commit contents are reflected here.
    • Forked original repository (origin): The Backend.AI repository copied to "your" account via GitHub. (Note: Original remote repository != Forked original repository)
    • Code copy (local working copy): The forked repository currently downloaded to your local machine

    Git command branch notation

    • main: The main branch of the current local working copy
    • origin/main: The main branch of the repository (origin) from which I cloned to create my local working copy
    • upstream/main: The main branch belonging to the separately added upstream remote repository

    Workflow concepts

    • At the time of forking, origin/main is created
    • When you clone the forked repository, main is created on your work computer
    • Create a new topic branch from main and proceed with work
    • When you upload this work branch to origin and create a PR, GitHub automatically points to the original repository of the fork
    • At this point, to synchronize changes to the main of the original repository during work, follow the procedure below

    The method of synchronization is as follows:

    • step1: Add the original remote repository as a name called upstream
    $ git remote add upstream https://github.com/lablup/backend.ai
    
    • step2: Fetch the latest commits of the main branch of the original remote repository to the code copy (local working copy)
    $ git fetch upstream
    
    • step3: Bring the latest commit reflection history of the main branch of the original remote repository to origin (the code copy (local working copy) of the original repository you forked)
    $ git switch main && git merge --ff upstream/main
    
    • step4: Reflect the changes in the code copy (local working copy) made in steps 1 ~ 3 to origin (the remote repository of the original repository you forked)
    $ git push origin main
    

    Now upstream/main and origin/main are synchronized through main.

    • step5: Reflect the latest updates to my branch that I'm working on
    $ git switch topic
    $ git merge main
    

    When performing this process, if a history branch is created between origin/main and upstream/main and step 5 is performed incorrectly, it can become extremely difficult to recover. Also, when the CI tools used by Backend.AI test PRs, they are set to find common ancestor commits to see the differences between upstream/main and origin/topic, but if you reuse the main name for the topic branch, these tools will not work properly. If possible, think of always giving a new name when creating a new branch.

    How to Write a Pull Request

    To send a specific bug patch or feature implementation as a PR, you first need to upload it to GitHub. There are several methods, but the following is recommended:

    • Fork the repository on the GitHub repository page. (If you have direct commit permissions, it's recommended to create a branch directly without forking.)
    • In your local working copy, use git remote to point to that forked repository.
      • Following convention, it's good to name Lablup's original repository as upstream and the newly created forked repository as origin.
      • If you installed with install-dev first instead of cloning after forking, the original repository will be origin, so you need to rename the remote.
    • Create a new branch.
      • For branch names, prepend fix/ for bug fixes or feature/ for feature additions or improvements, and summarize the topic in kebab-case. (e.g., feature/additional-cluster-env-vars, fix/memory-leak-in-stats) Other prefixes like docs/, refactor/ are also used.
      • It's possible to write a PR by directly modifying the main branch, but during PR review and modification periods, if additional changes occur on the main branch, you'll have to rebase or merge every time you synchronize with the upstream repository, which is more troublesome. Having a separate branch allows you to rebase and merge when you want.
    • Commit changes to that branch.
      • Commit messages should follow the conventional commit style as much as possible. Like branch names, use title prefixes such as fix:, feat:, refactor:, docs:, release:, and for Backend.AI specifically, setup: for dependency-related commits, repo: for cases like gitignore updates or repository directory structure changes. You can also indicate affected components in parentheses. (e.g., fix(scripts/install-dev): Update for v21.03 release)
      • Commit messages should be written in English.
    • Push the branch and write the PR.
      • For PRs with separate issues, you should write the issue number in the PR body. If you want to reference an issue in the repository, look at the number in the issue link like https://github.com/lablup/backend.ai/issues/401 and write it in the format #401, and GitHub will automatically link it.
      • There's no specific format required for the PR body, but it's good to write what problem it's solving, what principle it's written on, or what tools or libraries were used, and why those choices were made.
      • PR titles and bodies can be written in English or Korean.
      • When you create a PR, you'll see various automated inspection tools in action. In particular, you must sign (register your GitHub username) the CLA (contributor license agreement) for the review to proceed.
      • You must pass all basic coding style and coding rule checks for each language. (For Python code, flake8, mypy, etc.)
      • In repositories with a changes directory and towncrier check, when you create a PR and receive its number, create a file named changes/<PR number>.<modification type> and write a one-line English sentence summarizing the changes in Markdown syntax. (For relatively simple content or if there's a separate existing issue, this content can also serve as the PR body.) Modification types include fix, feature, breaking, misc, deprecation, doc, and parts that differ by project are defined in each repository's pyproject.toml. You can refer to files like CHANGELOG.md or CHANGES.md to see how existing messages were written.
    • Proceed with the review process.
      • When completed, the reviewer usually organizes the commit log in a squash-merge form to create a single commit for merging.
      • Therefore, don't feel burdened about making frequent small modification commits during the review process, and feel free to make commits whenever you think of something.

    It's even better to use tools like GitHub CLI, SourceTree, GitKraken along with git commands.

    Summary

    We've looked at Backend.AI's overall component structure and repository structure, how to install the development environment, and how to write pull requests. I hope this guide has helped you take one step closer to Backend.AI's source code.


    [1]: An "editable" installation refers to a method of installing a Python package to directly look at the source directory, allowing changes to be immediately reflected when importing the package just by modifying the source directory without editing inside the site-packages directory.

    10 July 2024

  • aiomonitor-ng: Debugging tool for complex asyncio applications

    By Joongi Kim

    As program complexity grows, software developers need robust debugging tools. The optimal debugging method involves pinpointing a reliable way to replicate an issue within a development setting conducive to free experimentation, followed by the creation of automated tests. However, when the reproduction scenario is overly complex or involves bugs that sporadically appear in production environments, detailed logging becomes the alternative to comprehend the issue retrospectively. In this post, we presents the 'aiomonitor-ng', designed to simplify the debugging of intricate asyncio programs.

    Debugging asyncio applications has its own difficulties. In Python, the stack trace is commonly used for debugging, revealing the program's location at the time of an exception. However, with asyncio's concurrent execution of multiple coroutine tasks, each with its own stack, it's crucial to examine not just the stack of the coroutine where the exception occurred but also those of 'related' coroutines to pinpoint if the error stemmed from another task. This issue intensifies when an external library implicitly generates a coroutine that invokes my code. Moreover, certain bugs, like coroutine task explosions that only manifest in production, or silent terminations of ongoing coroutine tasks, are particularly elusive in development settings, as they don't produce clear exceptions and are only detectable through post-incident logs.

    aiomonitor is a production-grade live debugging tool created by the asyncio core developers. Wrapping asyncio-based code within a monitor object allows you to initiate a telnet session to a pre-set TCP port outside the process while the code is active. Through simple commands, you can inspect the list of coroutine tasks running in the event loop and the status of individual stacks. Backend.AI has integrated aiomonitor, assigning a unique debugging telnet port to each service process. (For security purposes, only local connections are permitted.) This integration has significantly aided in troubleshooting production-specific issues. Nonetheless, pinpointing the cause of a coroutine task's failure due to an external library, not specific to Backend.AI's code, remains a challenge when using aiomonitor at the time of the problem's occurrence.

    We have developed an enhanced version named aiomonitor-ng, where "ng" signifies next-generation. This version includes the following additions and enhancements:

    • Task creation tracker: For all running coroutine tasks, the momentary stack trace is preserved for each job that created the coroutine task (asyncio.create_task()) to allow the entire chain of task creation to be tracked (ps, where command).
    • Task termination tracker: Recently terminated coroutine tasks can be preserved and viewed up to a maximum of N, especially when one job cancels (Task.cancel()) another job. The momentary stack trace of the cancellation trigger is also preserved to enable tracking of the entire cancellation chain (ps-terminated, where-terminated command).
    • Persistent task marker: By default, to prevent memory leaks, recently terminated jobs are tracked up to a maximum of N. However, if specific jobs that must continue running throughout the application's lifespan are marked with a decorator, those jobs always preserve their termination logs, regardless of the history limit. They also provide a filtering function as an additional option in the termination log query command (aiomonitor.task.preserve_termination_log decorator).
    • Sophisticated terminal UI: We improved command-line processing, which was previously composed of a simple REPL (read-evaluate-print loop) based on handcrafted command parsing. We rewrote the aiomonitor server-side implementation to use Click and prompt_toolkit. We also developed a Telnet client that natively operates with asyncio to provide argument autocomplete, such as command and task ID.

    Here are some screenshots of the actual usage:

    We have successfully resolved resource leaks and performance issues stemming from excessive coroutine task creation in the grpcio library through callbacks. Additionally, we addressed problems where tasks that monitor events produced by the docker daemon would silently stop due to specific input message patterns. This was preventing the outcomes of container creation or deletion tasks from being reported, leading to system crashes.

    We anticipate that developers working not only on Lablup but also on various Python asyncio applications will find aiomonitor-ng useful for debugging purposes in the future.

    aiomonitor-ng can be installed via PyPI using the command pip install aiomonitor-ng, and it is open-sourced on my GitHub account for anyone to use and contribute.

    28 November 2022

We're here for you!

Complete the form and we'll be in touch soon

Contact Us

Headquarter & HPC Lab

Namyoung Bldg. 4F/5F, 34, Seolleung-ro 100-gil, Gangnam-gu, Seoul, Republic of Korea

© Lablup Inc. All rights reserved.