Engineering

Nov 28, 2022

Engineering

aiomonitor-ng: Debugging tool for complex asyncio applications

  • Joongi Kim

    Co-Founder / CTO

Nov 28, 2022

Engineering

aiomonitor-ng: Debugging tool for complex asyncio applications

  • Joongi Kim

    Co-Founder / CTO

As program complexity grows, software developers need robust debugging tools. The optimal debugging method involves pinpointing a reliable way to replicate an issue within a development setting conducive to free experimentation, followed by the creation of automated tests. However, when the reproduction scenario is overly complex or involves bugs that sporadically appear in production environments, detailed logging becomes the alternative to comprehend the issue retrospectively. In this post, we presents the 'aiomonitor-ng', designed to simplify the debugging of intricate asyncio programs.

Debugging asyncio applications has its own difficulties. In Python, the stack trace is commonly used for debugging, revealing the program's location at the time of an exception. However, with asyncio's concurrent execution of multiple coroutine tasks, each with its own stack, it's crucial to examine not just the stack of the coroutine where the exception occurred but also those of 'related' coroutines to pinpoint if the error stemmed from another task. This issue intensifies when an external library implicitly generates a coroutine that invokes my code. Moreover, certain bugs, like coroutine task explosions that only manifest in production, or silent terminations of ongoing coroutine tasks, are particularly elusive in development settings, as they don't produce clear exceptions and are only detectable through post-incident logs.

aiomonitor is a production-grade live debugging tool created by the asyncio core developers. Wrapping asyncio-based code within a monitor object allows you to initiate a telnet session to a pre-set TCP port outside the process while the code is active. Through simple commands, you can inspect the list of coroutine tasks running in the event loop and the status of individual stacks. Backend.AI has integrated aiomonitor, assigning a unique debugging telnet port to each service process. (For security purposes, only local connections are permitted.) This integration has significantly aided in troubleshooting production-specific issues. Nonetheless, pinpointing the cause of a coroutine task's failure due to an external library, not specific to Backend.AI's code, remains a challenge when using aiomonitor at the time of the problem's occurrence.

We have developed an enhanced version named aiomonitor-ng, where "ng" signifies next-generation. This version includes the following additions and enhancements:

  • Task creation tracker: For all running coroutine tasks, the momentary stack trace is preserved for each job that created the coroutine task (asyncio.create_task()) to allow the entire chain of task creation to be tracked (ps, where command).
  • Task termination tracker: Recently terminated coroutine tasks can be preserved and viewed up to a maximum of N, especially when one job cancels (Task.cancel()) another job. The momentary stack trace of the cancellation trigger is also preserved to enable tracking of the entire cancellation chain (ps-terminated, where-terminated command).
  • Persistent task marker: By default, to prevent memory leaks, recently terminated jobs are tracked up to a maximum of N. However, if specific jobs that must continue running throughout the application's lifespan are marked with a decorator, those jobs always preserve their termination logs, regardless of the history limit. They also provide a filtering function as an additional option in the termination log query command (aiomonitor.task.preserve_termination_log decorator).
  • Sophisticated terminal UI: We improved command-line processing, which was previously composed of a simple REPL (read-evaluate-print loop) based on handcrafted command parsing. We rewrote the aiomonitor server-side implementation to use Click and prompt_toolkit. We also developed a Telnet client that natively operates with asyncio to provide argument autocomplete, such as command and task ID.

Here are some screenshots of the actual usage:

We have successfully resolved resource leaks and performance issues stemming from excessive coroutine task creation in the grpcio library through callbacks. Additionally, we addressed problems where tasks that monitor events produced by the docker daemon would silently stop due to specific input message patterns. This was preventing the outcomes of container creation or deletion tasks from being reported, leading to system crashes.

We anticipate that developers working not only on Lablup but also on various Python asyncio applications will find aiomonitor-ng useful for debugging purposes in the future.

aiomonitor-ng can be installed via PyPI using the command pip install aiomonitor-ng, and it is open-sourced on my GitHub account for anyone to use and contribute.

We're here for you!

Complete the form and we'll be in touch soon

Contact Us

Headquarter & HPC Lab

Namyoung Bldg. 4F/5F, 34, Seolleung-ro 100-gil, Gangnam-gu, Seoul, Republic of Korea

© Lablup Inc. All rights reserved.