  • Backend.AI Meets Tool LLMs : Revolutionizing AI Interaction with Tools - Part 3

    By Sergey Leksikov

    Part 3. Building your own API retriever and question answering system locally with a few lines of code, without training or serving an LLM

    Previously, in Part 1 we talked about Tool LLMs and their usage, and Part 2 demonstrated how to run Gorilla LLM on Backend.AI. Part 3 covers the case where no GPU is available, but we still want help and assistance with our API.

    Suppose we have Backend.AI and want to get information about the Backend.AI REST API and Functional API in a more interactive, question-answering style. An example of the REST API is described in this documentation: https://docs.backend.ai/en/latest/manager/rest-reference/index.html

    Figure 1. Backend.AI REST API Documentation

    In addition, the Backend.AI REST API documentation can be exported in openapi.json format:

    Figure 2. Backend.AI openapi.json

    Another source of the Backend.AI API is the functional API defined in the Backend.AI Client. We want to know how to interact with Backend.AI and which parts of the code are responsible. The client code repository is responsible for managing and interacting with the cloud and computing environment:

    Steps to make a Question Answering API system

    1. Let’s set up the Backend.AI Client locally from https://github.com/lablup/backend.ai/tree/main/src/ai/backend/client on our local PC and create a new directory bai-dev/src/ai/backend/client/gpt_api_client

    Figure 3. The directory location of gpt_api_client

    2. In the vector_data directory, let’s create two subdirectories: data1/, which will store the REST API documentation (openapi.json), and data2/, which will store the selected B.AI Client files over which we want to do API question answering.

    Figure 4. Overview of data directories with openapi.json and client function code files

    3. Let’s install the LlamaIndex Python library: pip install llama-index. Note that LlamaIndex is not related to Meta's LLaMA language model; it provides data structures and methods for efficiently processing and storing documents for retrieval.

    4. Let’s convert our API and code files into embedding vectors and store them in a vector database with LlamaIndex. Let’s use the Jupyter Notebook interactive environment, which is also integrated into our VSCode on a local PC.

    Figure 5. Jupyter Notebook interactive environment. Loading openapi.json from data/ directory. Then asking questions from query engine over a vector index.

    5. Vectorize the data2/ directory with our code functions

    Figure 6. Load data2/ directory with code files from B.AI Client. Then vectorize them into index and create a question answering engine.
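
    To make the idea behind a vector index and a query engine more concrete, here is a minimal sketch in plain Python. It is not LlamaIndex code: it swaps in a toy bag-of-words vector for the learned embeddings a real index would use, and the document names and texts are hypothetical stand-ins for entries from openapi.json.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a sparse bag-of-words count vector (assumption, not a learned embedding)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """Stores one vector per document and ranks documents against a question."""

    def __init__(self, documents):
        # documents: mapping of (hypothetical) document name -> text
        self.entries = [(name, text, embed(text)) for name, text in documents.items()]

    def query(self, question, top_k=1):
        q = embed(question)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[2]), reverse=True)
        return [(name, text) for name, text, _ in ranked[:top_k]]


# Hypothetical snippets standing in for REST API documentation entries.
docs = {
    "create_session": "POST /session create a new compute session from an image",
    "list_folders": "GET /folders list the virtual folders of the user",
    "destroy_session": "DELETE /session terminate a running compute session",
}
index = ToyVectorIndex(docs)
best_name, best_text = index.query("How do I create a new session?")[0]
print(best_name, "->", best_text)
```

    A real query engine additionally passes the retrieved text to a language model to compose the final answer; this sketch stops at the retrieval step.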

    We can save both indexes using Python's pickle or the joblib library, which are commonly used for serializing objects so that they can be loaded back later: joblib.dump(index, "rest_api_index.joblib") and joblib.dump(index, "functional_index.joblib")
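
    As a small illustration of the save-and-reload round trip, the sketch below uses the standard-library pickle module mentioned above. The dictionary is only a hypothetical stand-in for a built index object, and the file name is illustrative.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a built index; a real index object would take its place.
toy_index = {"create_session": [0.1, 0.7, 0.2], "list_folders": [0.9, 0.0, 0.1]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "rest_api_index.pkl"
    # Serialize the object to disk.
    with path.open("wb") as f:
        pickle.dump(toy_index, f)
    # Load it back, as a server would do at start-up.
    with path.open("rb") as f:
        restored = pickle.load(f)

print(restored == toy_index)
```

    Pickle-based files (including joblib ones) can execute arbitrary code when loaded, so only load index files that you created yourself.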

    6. The Jupyter Notebook environment already lets us ask questions and get responses interactively. Additionally, we can load the saved vectorized indexes on a FastAPI server and answer questions over the web. In Part 2, we set up a computational session with Gorilla LLM, and from that demo we still have a computational session with a FastAPI server.

    7. Let’s transfer the files rest_api_index.joblib and functional_index.joblib to the api_helper/ vFolder in the Backend.AI Cloud session.

    8. In the server.py file, load the vector indexes and define the query engines.

    Figure 7. server.py definition of index files and query engine.

    9. For each query engine we specify a FastAPI endpoint.

    Figure 8. Code snippets for REST and Functional API retrieval

    10. Test the server response from your local PC using the curl command. When the server is queried on a specific endpoint, it returns an answer to the user.
    curl -X POST -H "Content-Type: application/json" -d '{"instruction":"Create a new session"}'

    Figure 9. Command line response from curl command. Example 1

    curl -X POST -H "Content-Type: application/json" -d '{"instruction":"Create a new session"}'

    Figure 10. Command line response from curl command. Example 2
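
    The same request can also be built from Python. The sketch below uses only the standard library and mirrors the header and JSON body of the curl command; the URL is a hypothetical placeholder (host, port, and endpoint path are assumptions), so replace it with the address of your own server.

```python
import json
import urllib.request


def build_question_request(url, instruction):
    """Build (but do not send) a POST request equivalent to the curl call."""
    body = json.dumps({"instruction": instruction}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical address: replace with your own forwarded host, port, and endpoint path.
EXAMPLE_URL = "http://127.0.0.1:8000/hypothetical_endpoint"

req = build_question_request(EXAMPLE_URL, "Create a new session")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually send it against a running server:
# answer = urllib.request.urlopen(req).read().decode("utf-8")
```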

    In addition, we can make a web app which receives user input, sends it to the corresponding endpoint, and receives the answer.

    Figure 11. A web app prototype for Question Answering over Backend.AI REST and Functional API. Example 1

    Figure 12. A web app prototype for Question Answering over Backend.AI REST and Functional API. Example 2


    In Part 3, we demonstrated how to create a Question-Answering system locally using the open-source Python library LlamaIndex, which helped to convert our documents and Backend.AI code into vector form. The question answering can be done interactively in a Jupyter Notebook environment, which Visual Studio Code supports with plugins. Furthermore, we moved those vector indexes to a Backend.AI Cloud environment where the API-tuned Gorilla LLM model is served. Then an API Question-Answering web app was implemented to assist users over the network.


    • LlamaIndex. https://docs.llamaindex.ai/en/stable/

    Demo video for Backend.AI API Helper and Gorilla LLM:

    30 January 2024

  • Backend.AI Meets Tool LLMs : Revolutionizing AI Interaction with Tools - Part 2

    By Sergey Leksikov

    Part 2. Backend.AI Gorilla LLM model serving

    Previously, we talked about Tool LLM capabilities and usage. This article gives a step-by-step demonstration of how to run the Gorilla LLM model on Backend.AI Cloud using the Backend.AI Desktop app.

    Figure 1. A Backend.AI Desktop app installed on macOS

    1. Press the start button to open the session creation menu.

    Figure 2. New session start interactive screen

    2. Select the NGC-Pytorch 23.07 image

    3. Attach a vFolder, which is a working directory containing the model files. For example: the api_helper/ directory.

    Figure 3. Attaching vFolder screen

    4. Select the resource amount: 128 GB RAM and 5 fGPU

    Figure 4. Resource selection screen

    5. Select the Visual Studio Code Desktop environment

    Figure 5. IDE environment selection screen

    6. In the /home/work/api_helper/ directory, create a server.py file

    7. Create a requirements.txt file

    Figure 6. Content of requirements.txt file

    To install the requirements, run the command: pip install -r requirements.txt

    Figure 7. Executing install requirements command

    8. In server.py, use the transformers library to define the tokenizer and the model loader.

    Figure 8. Code snippet of server.py

    9. Define the server IP address and port number

    Figure 9. Definition of server IP address and port number

    10. To run the model, type: python server.py

    Figure 10. Starting a server.py

    11. Accessing the created server

    VSCode automatically creates a port tunneling session from your device to the Backend.AI Cloud server. You can check the server status by accessing the localhost address, and the request will be tunneled to Backend.AI Cloud. In addition, you may define other custom endpoints according to your needs.

    Figure 11. The server run log

    Figure 12. VSCode Port Forwarding configuration

    Figure 13. Accessing the root of a server

    Up to this point, we created a computation session on Backend.AI Cloud and attached an api_helper/ vFolder directory with the requirements.txt file and server.py. Then we started our FastAPI server, where Gorilla LLM gets downloaded from the HuggingFace repository and loaded into the computation session memory, with an inference/ API endpoint.

    12. API inference testing. To test the API inference of Gorilla LLM, you may send a curl request from your local computer's command line:
    curl -X POST -H "Content-Type: application/json" -d '{"text":"Object detection on a photo. <<<api_domain>>>:"}'

    Figure 14. An example of curl request

    Figure 15. The GPU workload on a server after receiving the request

    Figure 16. The server logs of receiving the request and printing the result

    13. Defining a UI web app. You may use any web technology to make a UI app which displays the result in a better way. For example, you may use HTML and JavaScript files and place them in a static directory under the root of server.py. Then define an endpoint for the web app.

    Figure 17. Example of adding an html web app to a FastAPI server

    14. Gorilla LLM Web App prototype: an API-tuned Large Language Model for API question answering and code generation.

    Figure 18. Gorilla LLM web app prototype. Example 1

    Figure 19. Gorilla LLM web app prototype. Example 2


    Despite some difficulties in serving Gorilla LLM, an LLM tuned on your own API has large potential and promise, since the model can provide the most recent results with more accurate parameters and function calls than commercial large models, and can be useful in tasks such as question answering over an API, code autocompletion, and API code execution.

    Limitations and difficulties:

    While trying to serve the Gorilla LLM model, there were the following issues to consider:

    • The model may generate a response in an unexpected format
    • The model may generate different results for the same question
    • Parsing and rendering LLM response
    • Eliminating the duplicate sentences and lines
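
    For the last item, a simple post-processing step can go a long way. The following is a minimal sketch, not the code used in the demo: the function name and the sample response are hypothetical, and sentences are split naively on punctuation followed by whitespace.

```python
import re


def remove_duplicates(response):
    """Drop repeated lines and repeated sentences, keeping first occurrences in order."""
    seen = set()
    cleaned_lines = []
    for line in response.splitlines():
        # Split a line into sentences on '.', '!' or '?' followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", line.strip())
        kept = []
        for sentence in sentences:
            key = " ".join(sentence.lower().split())  # normalize case and spacing
            if key and key not in seen:
                seen.add(key)
                kept.append(sentence)
        if kept:
            cleaned_lines.append(" ".join(kept))
    return "\n".join(cleaned_lines)


# Hypothetical model output with repeated sentences and lines.
raw = (
    "Use the object detection API. Use the object detection API.\n"
    "Load the model first.\n"
    "Load the model first.\n"
    "Then run inference."
)
print(remove_duplicates(raw))
```

    Because the split only triggers on punctuation followed by whitespace, dotted names such as module.function stay intact, but generated code with intentionally repeated lines would need a gentler rule.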

    29 January 2024

  • Backend.AI Meets Tool LLMs : Revolutionizing AI Interaction with Tools - Part 1

    By Sergey Leksikov

    Part 1. Introduction to LLMs and Tool Interaction

    What if future AI technology capabilities were available now? Perhaps, while you are on your way home from work, you could ask an AI Assistant to turn on the air conditioner at home before your arrival. At the same time, you are planning a vacation, and after narrowing it down to a few options you ask an AI model to book a hotel on your behalf. As the model books your trip, you receive a notification from a cloud provider about your deep learning model's training progress. You ask the AI Assistant to run another session with another set of parameters for the experiment while targeting specific values for performance accuracy. How can such a futuristic scenario be realized today?

    This kind of interaction between an LLM and the real world is possible via Application Programming Interfaces (APIs). A Tool Large Language Model (LLM) fine-tuned on an API dataset can respond to a user’s query with a specific API call, and that API can invoke a program or function to make a real-world impact. Large Language Models (LLMs) are rising in popularity due to their outstanding ability to generate text in context while also having reasoning capability for problem solving. Text model utilization ranges from text generation and editing to serving as a copilot for a programmer. How else can LLMs extend their usage beyond their text-generating capabilities?

    With Tool LLMs, we are stepping into an era where AI, in addition to understanding our requests, can act on those requests using a universe of online tools. Tool LLMs are pushing the boundaries of what AI can do with tools via functional and REST APIs.

    GPT-4 is currently the state-of-the-art among LLMs, topping most AI benchmarks. Consider this scenario: a GPT-4 model is asked to transcribe an audio file into text in another language. When prompted to use specific APIs, GPT-4 may hallucinate and suggest non-existent APIs or provide incorrect arguments. As a consequence, the function execution fails and the user's task is not accomplished.

    Besides hallucinations and inaccuracies, API documentation and versions are constantly changing. Retraining a general-purpose LLM is costly, so it is not practical to keep such models updated with constantly changing documentation. Tool LLMs provide a solution to the hallucination issues of general large models, enabling interaction with the physical world via programmatic interfaces. Tool LLMs are much smaller, making it feasible to retrain them periodically with recent data. In addition, an API documentation Retriever module can be added to the model serving pipeline to supplement the model with the most recent API documentation relevant to the user’s input query.

    To overcome these challenges, researchers have recently proposed two notable open-source methods for enhancing LLMs' tool-use abilities, Gorilla LLM and ToolLLaMA, each with its own advantages and specific use cases. Moreover, these models can be prepared for inference serving on Backend.AI Cloud.

    What is Tool LLM?

    A Tool LLM is an LLM trained on a dataset of user queries and API requests, together with relevant context such as API code usage and API description documentation. The response from such an LLM can be executed as code, which means the LLM can interact with various online services and tools, such as cloud computing providers, Kubernetes, and machine learning and deep learning libraries and repositories such as HuggingFace, TorchHub, and TensorFlowHub.

    The main advantage of a Tool LLM is its ability to accurately generate an API response to a user query, which can then be executed to obtain the results.

    Understanding the Types of API

    An Application Programming Interface (API) is a crucial element in modern computing, serving as a set of rules and protocols for how different software applications or hardware systems can communicate and interact.

    Functional APIs are designed to be invoked through function calls within a programming environment. For instance, machine learning and deep learning libraries like HuggingFace and TensorFlow offer various models that can be loaded into memory and utilized through Functional API calls. These APIs are integral in executing specific functions and operations within the software.

    This capability of LLMs to generate code related to an API extends their utility far beyond basic text generation and processing. Tool LLMs can seamlessly integrate with diverse online services and tools, ranging from cloud computing platforms to advanced machine learning libraries. Furthermore, their application is not limited to human queries; they can also be integrated into systems where they interact with other programs or AI agents. This versatility positions Tool LLMs as vital components in complex systems and infrastructures, enhancing their potential for real-world applications.

    In the following sections, we'll delve into how Tool LLMs are trained and operated. After that, two specific research examples will be covered: Gorilla LLM and ToolLLaMA.

    Tool LLM Training and Inference Workflow

    Tool LLM training involves several steps: setting up an API database, creating a training dataset, model training, and inference.

    The API database includes descriptions and relevant code samples. To generate a Self-Instruct training dataset, the API database samples need to be pre-processed into {Input User Query-API Output} pairs. ChatGPT can help generate such a dataset automatically by covering the various scenarios and query complexities that humans might ask about, from specific cases to general and abstract ones. After the Self-Instruct dataset is generated, the model is trained to predict the correct API for a given user input query.
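
    The pre-processing into query and API pairs can be sketched as follows. This is only an illustration under stated assumptions: the example_sdk calls, the record field names, and the hand-written queries are all hypothetical, and in practice the queries would be generated by an LLM such as ChatGPT.

```python
import json

# Hypothetical API database entries; the example_sdk calls are illustrative only.
api_database = [
    {
        "api_call": "example_sdk.create_session(image='python:3.10')",
        "description": "Create a new compute session from a container image.",
    },
    {
        "api_call": "example_sdk.list_sessions(status='RUNNING')",
        "description": "List compute sessions filtered by status.",
    },
]

# Queries an LLM would normally generate for each entry; hand-written here.
generated_queries = {
    0: ["Start a new Python session", "I need a fresh compute environment"],
    1: ["Which sessions are running right now?"],
}


def build_pairs(database, queries):
    """Pair every generated user query with the API call it should map to."""
    pairs = []
    for idx, entry in enumerate(database):
        for query in queries.get(idx, []):
            pairs.append(
                {
                    "input": query,
                    "context": entry["description"],
                    "output": entry["api_call"],
                }
            )
    return pairs


pairs = build_pairs(api_database, generated_queries)
# One JSON object per line is a common layout for instruction-tuning data.
jsonl = "\n".join(json.dumps(p) for p in pairs)
print(jsonl)
```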

    For Tool LLM inference, it's crucial that the LLM not only responds with accurate argument parameters but also uses the latest API documentation. Thus, an API Document Retriever is used, which keeps the model up to date with the most recent API changes.

    Figure 1. An overview of the Tool LLM training and inference workflow over an API instruction dataset
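
    A rough sketch of the retriever step at inference time is shown below. It assumes a tiny in-memory documentation store with hypothetical entries, uses plain word overlap in place of a real retriever, and the wording of the instruction appended to the prompt is an assumption rather than a fixed format.

```python
import re


def tokens(text):
    """Lowercased word set used for a naive overlap score."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve(query, docs):
    """Return the doc sharing the most words with the query; ties go to the newest version."""
    q = tokens(query)
    return max(docs, key=lambda d: (len(q & tokens(d["text"])), d["version"]))


def build_prompt(query, docs):
    """Append the retrieved documentation to the user query before calling the model."""
    doc = retrieve(query, docs)
    # The wording of this appended instruction is an assumption.
    return f"{query}\nUse this API documentation for reference: {doc['text']}"


# Hypothetical documentation store with an outdated and a current entry.
docs = [
    {"version": 1, "text": "transcribe(audio_path) converts audio speech to text"},
    {"version": 2, "text": "transcribe(audio_path, language) converts audio speech to text"},
    {"version": 2, "text": "detect(image_path) finds objects in a photo"},
]
prompt = build_prompt("Convert this audio recording to text", docs)
print(prompt)
```

    Because the documentation is looked up at query time, updating the store is enough to expose a changed signature to the model without retraining it.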

    Case Studies: Gorilla LLM and ToolLLaMA


    Gorilla is a fine-tuned, LLaMA-based model with 7 billion parameters that outperforms GPT-4 in writing API calls. The notable aspects of Gorilla are:

    • It addresses the limitations of current LLMs in generating accurate input arguments for APIs and their tendency to hallucinate incorrect API usage.
    • Gorilla integrates with a document API retriever, allowing it to adapt to real-time changes in documentation, a significant advantage considering how frequently APIs get updated.
    • The authors have developed a dataset called APIBench to evaluate the model's abilities, which includes APIs from HuggingFace, TorchHub, and TensorHub totaling 1600+ APIs.
    • Gorilla seems to mitigate hallucination issues and improve the reliability of LLM outputs. Gorilla has also been updated and extended to work with cloud providers such as AWS and GCP and to manage Kubernetes clusters.


    ToolLLaMA is a model fine-tuned on ToolBench, an instruction-tuning dataset for tool use based on the RapidAPI repository. The key points of ToolLLaMA are:

    • ToolBench covers an impressive range of over 16,000 real-world APIs, offering diverse instruction sets and solution paths.
    • The paper proposes a novel Depth-First Search-Based Decision Tree algorithm (DFSDT) to enhance the reasoning capabilities of LLMs such as multiple tool usage and multi-step reasoning.
    • ToolLLaMA fine-tuned on ToolBench matches the performance of ChatGPT and demonstrates generalization abilities on out-of-distribution datasets like APIBench.
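
    To give a feel for depth-first search with backtracking over candidate actions, here is a toy sketch. It is not the DFSDT implementation from the paper: the task, the two "tools", and the function names are hypothetical, and in a real tool-use setting an LLM would propose and evaluate the candidate calls.

```python
def dfs_solve(state, expand, is_solved, max_depth=4, path=None):
    """Depth-first search over candidate actions, backtracking from dead ends."""
    path = path or []
    if is_solved(state):
        return path
    if len(path) >= max_depth:
        return None  # give up on this branch so the caller can try a sibling
    for action, next_state in expand(state):
        result = dfs_solve(next_state, expand, is_solved, max_depth, path + [action])
        if result is not None:
            return result
    return None  # every child failed: backtrack


# Hypothetical toy task: reach the value 10 from 1 using two "tools".
def expand(value):
    # In a tool-use setting, an LLM would propose these candidate calls.
    candidates = [("double", value * 2), ("add_three", value + 3)]
    return [(name, v) for name, v in candidates if v <= 10]


solution = dfs_solve(1, expand, lambda v: v == 10)
print(solution)
```

    The search first follows the "double" branch until it overshoots, then backs up one step and tries the alternative, which is the behavior that lets a model recover from an early wrong tool call instead of failing outright.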

    Both papers are significant in pushing the boundaries of LLMs’ capabilities in real-world tool use by navigating and utilizing a vast array of APIs. This advancement is crucial for practical applications. A comparative summary table is provided below.

    Figure 2. A comparative table of the two API-tuned LLMs

    Synergy between Backend.AI and ToolLLM

    Training or serving an LLM requires significant computing resources, and there is a huge demand for Graphics Processing Units (GPUs) with large memory capacity and high computational speed.

    Backend.AI offers a scalable foundation for building, training, and serving diverse models. Backend.AI includes an on-demand scaling feature for model inference, which adds external nodes for serving, and load balancing to optimize the workload. Backend.AI has vLLM and TensorRT servers, which can be used for high-performance inference of LLMs. In addition, there is a well-designed, user-friendly interface and the FastTrack pipeline maker tool for creating computing environment sessions of various complexities.


    The futuristic scenario in which various AI Assistants and Agents interact with various devices and services can be realized today through APIs and Tool LLMs specifically fine-tuned on such interactions. Gorilla LLM and ToolLLaMA offer a good opportunity to incorporate them into complex tasks. The workflow of how they are trained and served is easy to comprehend. Gorilla LLM can be recommended for machine learning and cloud administration tasks, while ToolLLaMA suits more general API usage, multi-tool, and multi-step cases.

    There is also an advantage to training your own model on your own API documentation or code, so that you have an LLM which understands your code. Such an LLM can be helpful for assisting or interacting with users who want to get relevant information.

    Backend.AI can effectively be a backbone for model training and scalable model serving while offering a simple GUI. How to set up such models will be explained step by step in the other parts.

    Commonly asked questions:

    • Q: What is the source of hallucinations and LLM limitations, and how is it addressed in Tool LLMs?
    • A: GPT-4, like other Large Language Models, faces limitations such as hallucinations and inaccuracies, which are primarily due to its training on extensive yet potentially outdated or inaccurate datasets from the internet. These 'hallucinations' refer to instances where the model confidently produces information that's either factually incorrect or not based in reality, a challenge stemming from the nature of its purely text-based training data and not directly from its size or lack of interaction with the physical world. To address these issues, Tool LLMs are being developed with a focus on specialization and frequent updates. They are fine-tuned on specific datasets, like API documentation, enabling direct interaction with real-world systems through programmatic interfaces for more accurate and current information. The retraining frequency of Tool LLMs varies, depending on the application and the pace of change in the relevant field, with updates potentially needed monthly, quarterly, or bi-annually to keep the model up-to-date with the latest trends and information.
    • Q: What are example pairs of user Query and API?
    • A: The example pairs are provided below.
    • User Query: "Summarize this article about space exploration."
    • API Output: HuggingFace.summarize(text="Article text here", model="facebook/bart-large-cnn")
    • User Query: "What is the sentiment of this customer review?"
    • API Output: HuggingFace.analyze_sentiment(text="Customer review text", model="distilbert-base-uncased-finetuned-sst-2-english")
    • User Query: "Identify the objects in this photo."
    • API Output: HuggingFace.image_recognition(image_file="path/to/photo.jpg", model="google/vit-base-patch16-224")
    • User Query: "Convert this speech recording to text."
    • API Output: HuggingFace.speech_to_text(audio_file="path/to/recording.wav", model="facebook/wav2vec2-base-960h")
    • Q: How do the GorillaLLM and ToolLLaMA papers differ in their approach to utilizing API documentation during the training and inference of their models?
    • A: GorillaLLM appends relevant API documentation during training and offers two inference modes, while ToolLLaMA employs Sentence-BERT for fine-tuning embeddings in the API domain. GorillaLLM uses BM25 and GPT-Retriever from LlamaIndex for documentation retrieval, whereas ToolLLaMA uses Sentence-BERT for a similar purpose.
    • Q: How frequently should small API models be retrained, and what role does the API Retriever play in handling changes in API documentation?
    • A: Training small API models annually is reasonable, but monthly retraining for API changes isn't practical. The API Retriever, using up-to-date documentation, can mitigate the need for frequent retraining. Evaluating and benchmarking fine-tuned API models and RAG methods is essential for effectiveness.
    • Q: What is the difference between ToolLLM and RAG systems, and how do they function in the context of LLMs?
    • A: ToolLLM is a model fine-tuned on API documentation, focusing on incorporating knowledge. RAG systems, on the other hand, are algorithms for data chunking, storage, search, re-ranking, and synthesis. They can work independently or in combination to enhance LLM efficiency, especially in handling context limits and knowledge updates.
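
    As an illustration of the first RAG step mentioned above, data chunking, the sketch below splits text into overlapping word windows. The chunk size, the overlap, and the function name are assumptions chosen for readability; production systems usually chunk by tokens or by document structure.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into chunks of chunk_size words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


text = " ".join(f"w{i}" for i in range(1, 15))  # placeholder words w1 ... w14
chunks = chunk_words(text)
for chunk in chunks:
    print(chunk)
```

    The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some words twice.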


    • Gorilla: Large Language Model Connected with Massive APIs. https://gorilla.cs.berkeley.edu/
    • ToolLLM: Facilitating Large Language Models To Master 16000+ Real-World APIs. https://github.com/OpenBMB/ToolBench

    28 January 2024

