In this tutorial, you will learn how to build a real-time AI chatbot with vision and voice capabilities using OpenAI, LiveKit and Deepgram deployed on DigitalOcean GPU Droplets. This chatbot will be able to engage in real-time conversations with users, analyze images captured from your camera, and provide accurate and timely responses.
In this tutorial, you will leverage three powerful technologies to build your real-time AI chatbot, each serving a specific purpose that enhances the chatbot’s capabilities, all while utilizing the robust infrastructure provided by DigitalOcean’s GPU Droplets:
OpenAI API: The OpenAI API will generate human-like text responses based on user input. By employing advanced models like GPT-4o, our chatbot will be able to understand context, engage in meaningful conversations, and provide accurate answers to user queries. This is crucial for creating an interactive experience where users feel understood and valued.
LiveKit: LiveKit will facilitate real-time audio and video communication between users and the chatbot. It allows us to create a seamless interaction experience, enabling users to speak to the chatbot and receive voice responses. This is essential for building a voice-enabled chatbot that can naturally engage users, making the interaction feel more personal and intuitive.
Deepgram: Deepgram will be employed for speech recognition, converting spoken language into text. This allows the chatbot to process user voice inputs effectively. By integrating Deepgram’s capabilities, you can ensure that the chatbot accurately understands user commands and queries, enhancing the overall interaction quality. This is particularly important in a real-time setting where quick and accurate responses are necessary for maintaining user engagement.
Why GPU Droplets?: Utilizing DigitalOcean’s GPU Droplets is particularly beneficial for this setup as they provide the necessary computational and GPU infrastructure to power and handle the intensive processing required by these AI models and real-time communication. The GPUs are optimized for running AI/ML workloads, significantly speeding up model inference and video processing tasks. This ensures the chatbot can deliver responses quickly and efficiently, even under heavy load, improving user experience and engagement.
Before you begin, ensure you have completed the following:
1. Create a New Project - You will need to create a new project from the cloud control panel and tie it to a GPU Droplet.
2. Create a GPU Droplet - Log into your DigitalOcean account, create a new GPU Droplet, and choose AI/ML Ready as the OS. This OS image installs all the necessary NVIDIA GPU drivers. You can refer to our official documentation on how to create a GPU Droplet.
3. Add an SSH Key for authentication - An SSH key is required to authenticate with the GPU Droplet, and by adding the SSH key, you can log in to the GPU Droplet from your terminal.
4. Finalize and Create the GPU Droplet - Once all of the above steps are completed, finalize and create a new GPU Droplet.
Firstly, you will need to create an account or sign in to your LiveKit Cloud account and create a LiveKit project. Please note down the LIVEKIT_URL, LIVEKIT_API_KEY, and LIVEKIT_API_SECRET environment variables from the Project Settings page, as you will need them later in the tutorial.
The below command will install the LiveKit CLI on your GPU Droplet.
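One common approach (assuming curl is available on the Droplet) is LiveKit’s official install script:

```bash
# Install the LiveKit CLI using the official install script
curl -sSL https://get.livekit.io/cli | bash

# Verify the installation
lk --version
```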
For LiveKit Cloud users, you can authenticate the CLI with your Cloud project to create an API key and secret. This allows you to use the CLI without manually providing credentials each time.
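Assuming the CLI was installed as the lk binary, authentication is started with:

```bash
# Link the CLI to your LiveKit Cloud project (opens a browser prompt)
lk cloud auth
```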
Then, follow instructions and log in from a browser.
You will be asked to add the device and authorize access to the LiveKit project you created earlier in this step.
The template provides a working voice assistant to build on. The template includes a complete voice pipeline with speech-to-text, an LLM, and text-to-speech.
Note: By default, the example agent uses Deepgram for STT and OpenAI for TTS and LLM. However, you aren’t required to use these providers.
Clone the starter template for a simple Python voice agent:
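With the CLI authenticated, a typical way to do this is the lk app create command:

```bash
# Bootstrap a new app from a LiveKit starter template
lk app create
```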
This will give you multiple existing LiveKit templates that you can use to deploy an app.
You will use the voice-pipeline-agent-python template.
Now, enter your application name, OpenAI API key, and Deepgram API key when prompted. If you aren’t using Deepgram and OpenAI, you can check out the other supported plugins.
First, switch to your application’s directory, which was created in the last step.
You can list the files that were created from the template.
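For example, assuming you named the application voice-pipeline-agent-python when prompted (the directory name matches the application name you entered):

```bash
# Switch into the directory created by the template
cd voice-pipeline-agent-python

# List the generated files
ls
```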
Here, agent.py is the main application file, which contains the logic and source code for the AI chatbot.
Now, you will create and activate a Python virtual environment using the below commands:
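A minimal sketch using Python’s built-in venv module:

```bash
# Create a virtual environment in the project directory
python3 -m venv venv

# Activate it
source venv/bin/activate
```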
Add the following API keys in your environment:
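The starter template typically reads credentials from a .env.local file in the project directory; the exact filename may differ in your version of the template. The values below are placeholders:

```bash
# .env.local -- replace the placeholder values with your own keys
LIVEKIT_URL=wss://<your-project>.livekit.cloud
LIVEKIT_API_KEY=<your-livekit-api-key>
LIVEKIT_API_SECRET=<your-livekit-api-secret>
OPENAI_API_KEY=<your-openai-api-key>
DEEPGRAM_API_KEY=<your-deepgram-api-key>
```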
You can find the LIVEKIT_URL, LIVEKIT_API_KEY, and LIVEKIT_API_SECRET on the LiveKit Project Settings page.
Activate the virtual environment:
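If the environment is not already active in your current shell:

```bash
source venv/bin/activate
```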
Note: On Debian/Ubuntu systems, you need to install the python3-venv package using the following command:
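For example (drop sudo if you are logged in as root):

```bash
# Install the venv module on Debian/Ubuntu
sudo apt install -y python3-venv
```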
Now, let’s install the dependencies required for the app to work.
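Assuming the template ships a requirements.txt file, as the Python starter templates generally do:

```bash
# Install the Python dependencies for the agent
python3 -m pip install -r requirements.txt

# (Optional, if supported by your template) pre-download model files used by the agent
python3 agent.py download-files
```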
To add the vision capabilities to your agent, you will need to modify the agent.py file with the below imports and functions.
First, let’s start by adding these imports alongside the existing ones. Open your agent.py file using a text editor like vi or nano.
Copy the below imports alongside the existing ones:
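Based on the livekit-agents v0.x package layout used by this template (module paths changed in later releases):

```python
from livekit import rtc
from livekit.agents.llm import ChatImage, ChatMessage
```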
These new imports include:
- rtc: access to LiveKit’s video functionality
- ChatMessage and ChatImage: classes you’ll use to send images to the LLM
Find the ctx.connect() line in the entrypoint function. Change AutoSubscribe.AUDIO_ONLY to AutoSubscribe.SUBSCRIBE_ALL:
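The updated call, inside the entrypoint function, looks roughly like this:

```python
# Subscribe to video tracks in addition to audio
await ctx.connect(auto_subscribe=AutoSubscribe.SUBSCRIBE_ALL)
```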
Note: If it is difficult to edit the agent.py file using the vi or nano text editor on the GPU Droplet, you can copy the agent.py file’s contents to your local system, make the required edits in a code editor such as VS Code, and then copy the updated code back.
This will enable the assistant to receive video tracks as well as audio.
Add these two helper functions after your imports but before the prewarm function:
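A sketch of the first helper, based on the rtc room/participant API in the v0.x Python SDK; the function name get_video_track is the one used throughout this tutorial’s snippets:

```python
async def get_video_track(room: rtc.Room):
    """Find and return the first available remote video track in the room."""
    for _, participant in room.remote_participants.items():
        for _, publication in participant.track_publications.items():
            if publication.track and isinstance(publication.track, rtc.RemoteVideoTrack):
                return publication.track
    raise ValueError("No remote video track found in the room")
```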
This function searches through all participants to find an available video track. It’s used to locate the video feed to process.
Now, you will add the frame capture function:
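Here is a sketch of that helper; it assumes the template’s module-level logger and the get_video_track helper above:

```python
async def get_latest_image(room: rtc.Room):
    """Capture a single frame from the room's video track."""
    video_stream = None
    try:
        video_track = await get_video_track(room)
        video_stream = rtc.VideoStream(video_track)
        async for event in video_stream:
            # Return the first frame received and stop iterating
            return event.frame
    except Exception as e:
        logger.error(f"Failed to get latest image: {e}")
        return None
    finally:
        if video_stream:
            # Release buffers and decoder resources held by the stream
            await video_stream.aclose()
```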
This function captures a single frame from the video track and ensures proper cleanup of resources. Using aclose() releases system resources like memory buffers and video decoder instances, which helps prevent memory leaks.
Now, inside the entrypoint function, add the below callback function, which will inject the latest video frame just before the LLM generates a response. Search for the entrypoint function inside the agent.py file:
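A sketch of the callback, defined inside entrypoint so it can access ctx.room; the chat_ctx.messages API shown here follows the v0.x livekit-agents release:

```python
async def before_llm_cb(assistant: VoicePipelineAgent, chat_ctx: llm.ChatContext):
    """Inject the most recent video frame into the chat context before the LLM replies."""
    latest_image = await get_latest_image(ctx.room)
    if latest_image:
        image_content = [ChatImage(image=latest_image)]
        chat_ctx.messages.append(ChatMessage(role="user", content=image_content))
```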
This callback is the key to efficient context management: it only adds visual information when the assistant is about to respond. If visual information were added to every message, it would quickly fill up the LLM’s context window, which would be highly inefficient and costly.
Find the initial_ctx creation inside the entrypoint function and update it to include vision capabilities:
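The system prompt wording below is illustrative; the important change is telling the model it can see images that appear in the conversation:

```python
initial_ctx = llm.ChatContext().append(
    role="system",
    text=(
        "You are a voice assistant created by LiveKit that can both see and hear. "
        "You should use short and concise responses, avoiding unpronounceable punctuation. "
        "When you see an image in our conversation, naturally incorporate what you see "
        "into your response."
    ),
)
```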
Find the VoicePipelineAgent creation inside the entrypoint function and add the callback:
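Your template’s exact constructor arguments may differ slightly; the key addition is the before_llm_cb parameter:

```python
agent = VoicePipelineAgent(
    vad=ctx.proc.userdata["vad"],
    stt=deepgram.STT(),
    llm=openai.LLM(),
    tts=openai.TTS(),
    chat_ctx=initial_ctx,
    before_llm_cb=before_llm_cb,  # inject the latest video frame before each LLM response
)
```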
The major update here is the before_llm_cb parameter, which uses the callback created earlier to inject the latest video frame into the conversation context.
Complete agent.py file with voice & vision capabilities
This is how the agent.py file would look after adding all the necessary functions and imports:
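Below is a condensed sketch of how the pieces fit together, modeled on the v0.x voice-pipeline-agent-python template; your generated agent.py may include extra options (logging setup, metrics, turn detection) that are omitted here for brevity:

```python
import logging

from dotenv import load_dotenv
from livekit import rtc
from livekit.agents import AutoSubscribe, JobContext, JobProcess, WorkerOptions, cli, llm
from livekit.agents.llm import ChatImage, ChatMessage
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import deepgram, openai, silero

load_dotenv(dotenv_path=".env.local")
logger = logging.getLogger("voice-agent")


def prewarm(proc: JobProcess):
    # Load the Silero VAD model once per worker process
    proc.userdata["vad"] = silero.VAD.load()


async def get_video_track(room: rtc.Room):
    """Find and return the first available remote video track in the room."""
    for _, participant in room.remote_participants.items():
        for _, publication in participant.track_publications.items():
            if publication.track and isinstance(publication.track, rtc.RemoteVideoTrack):
                return publication.track
    raise ValueError("No remote video track found in the room")


async def get_latest_image(room: rtc.Room):
    """Capture a single frame from the room's video track."""
    video_stream = None
    try:
        video_stream = rtc.VideoStream(await get_video_track(room))
        async for event in video_stream:
            return event.frame
    except Exception as e:
        logger.error(f"Failed to get latest image: {e}")
        return None
    finally:
        if video_stream:
            await video_stream.aclose()


async def entrypoint(ctx: JobContext):
    initial_ctx = llm.ChatContext().append(
        role="system",
        text=(
            "You are a voice assistant created by LiveKit that can both see and hear. "
            "You should use short and concise responses, avoiding unpronounceable punctuation. "
            "When you see an image in our conversation, naturally incorporate what you see "
            "into your response."
        ),
    )

    # Subscribe to both audio and video tracks
    await ctx.connect(auto_subscribe=AutoSubscribe.SUBSCRIBE_ALL)
    participant = await ctx.wait_for_participant()

    async def before_llm_cb(assistant: VoicePipelineAgent, chat_ctx: llm.ChatContext):
        # Inject the latest video frame just before the LLM generates a response
        latest_image = await get_latest_image(ctx.room)
        if latest_image:
            chat_ctx.messages.append(
                ChatMessage(role="user", content=[ChatImage(image=latest_image)])
            )

    agent = VoicePipelineAgent(
        vad=ctx.proc.userdata["vad"],
        stt=deepgram.STT(),
        llm=openai.LLM(),
        tts=openai.TTS(),
        chat_ctx=initial_ctx,
        before_llm_cb=before_llm_cb,
    )

    agent.start(ctx.room, participant)
    await agent.say("Hey, how can I help you today?", allow_interruptions=True)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm))
```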
Start your assistant with the command below and then run the following tests:
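Assuming the standard agents worker CLI that the template wires up in agent.py, the agent can be started in development mode like this:

```bash
# Run the agent in development mode (connects to your LiveKit project)
python3 agent.py dev
```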
Test Voice Interaction: Speak into your microphone and see the chatbot respond.
Test Vision Capability: Ask the chatbot to identify objects through your video cam stream.
You should observe logs similar to the following in your console:
Now, you will need to connect the app to the LiveKit room with a client that publishes both audio and video. The easiest way to do this is by using the hosted agent playground.
Since this agent requires a frontend application to communicate with, you can use one of our example frontends in livekit-examples, create your own following one of the client quickstarts, or test instantly against one of the hosted Sandbox frontends.
In this example, you will use an existing hosted agent playground. Simply open https://agents-playground.livekit.io/ in your browser and connect your LiveKit project. It should auto-populate with your project.
With these changes to your agent, your assistant can now:
1. Connect to both audio and video streams.
2. Listen for user speech as before.
3. Capture the latest video frame and add it to the conversation context just before generating each response.
4. Keep the context clean by only adding frames when needed.
Congratulations! You have successfully built a real-time AI chatbot with vision and voice capabilities using OpenAI, LiveKit, and Deepgram on DigitalOcean GPU Droplets. This powerful combination enables efficient, scalable and real-time interactions for your applications.
You can refer to LiveKit’s official documentation and its API reference for more details on building AI agents.
This is a very helpful tutorial, thanks Anish!
I noticed that this guide seems to be based on an older version of LiveKit (v0.x). With the release of LiveKit v1.0, some of the functions and classes used here, like ChatImage (replaced with ImageContent), have changed, and many of the other functions seem to be deprecated. It would be fantastic if you could publish an updated version of this tutorial for LiveKit v1.0. I’m particularly interested in learning how to integrate camera feeds with the Agents Playground. Additionally, it would be incredibly useful to see how the LLM could also view a screen share in a similar manner to the camera feed.
Thanks again for the great content!