Integrating AI models lets a web application accept and produce more than plain text. This tutorial covers the development of multi-modal bots, which combine natural language processing, image generation, and speech recognition so that users can interact with them in several ways.
You will build a multi-modal bot using Django and OpenAI’s GPT-4 Large Language Model (LLM) for conversational AI, Whisper for speech transcription, and DALL-E for image generation. The result is a web application that generates stories with accompanying images. Users specify the theme of the story by voice or text, and the application responds with a generated story and an illustration.
By the end of this tutorial, you will have developed a practical application that accepts user input as text or voice and responds with generated text and images, which makes the application more intuitive and accessible.
To complete this tutorial, you will need:
A basic understanding of Python and Django. If you’re new to Django, following the How To Install Django and Set Up a Development Environment tutorial is recommended.
An OpenAI API key: This tutorial requires you to interact with OpenAI’s GPT-4 and DALL-E models, which require an API key from OpenAI. You can obtain an API key by creating an OpenAI account and then creating a secret key.
Whisper: Visit the OpenAI Whisper GitHub page for detailed installation guides and verify that your development setup is properly configured for Whisper.
The OpenAI Python package: If you followed the tutorial in the first prerequisite, you should already have a virtual environment named env active within a directory named django-apps.
Note: Ensure your virtual environment is active by confirming that its name appears in parentheses at the start of your terminal prompt. If it’s not active, you can manually activate it by running the following command in your terminal from the directory containing your Django app.
sammy@ubuntu:$ source env/bin/activate
Once your environment is active, run the following to install the OpenAI Python package:
(env)sammy@ubuntu:$ pip install openai
If this is your first time using the OpenAI library, you should review the How to Integrate OpenAI GPT Models in Your Django Project tutorial.
In this step, you’ll set up OpenAI Whisper in your Django application to allow it to transcribe speech to text. Whisper is a robust speech recognition model that can provide accurate transcriptions, a crucial feature for our multi-modal bot. By integrating Whisper, our application will be able to understand user inputs provided through voice.
First, ensure that you are working in your Django project directory. Following the prerequisite tutorials, you should have a Django project ready for this integration. Open your terminal, navigate to your Django project directory, and ensure your virtual environment is activated:
sammy@ubuntu:$ cd path_to_your_django_project
sammy@ubuntu:$ source env/bin/activate
Next, create a function that uses Whisper to transcribe audio files to text. Create a new Python file named whisper_transcribe.py.
(env)sammy@ubuntu:$ touch whisper_transcribe.py
Open whisper_transcribe.py in your text editor and import Whisper. Next, define a function that takes the path of an audio file as input, uses Whisper to process the file, and then returns the transcription:
import whisper

# Load the model once at import time so every call reuses it
model = whisper.load_model("base")

def transcribe_audio(audio_path):
    result = model.transcribe(audio_path)
    return result["text"]
In this code snippet, you’re using the “base” model for transcription. Whisper offers different models tailored to various accuracy and performance needs. Feel free to experiment with other models based on your requirements.
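One way to experiment with model sizes without editing the script each time is to read the model name from the environment. The sketch below is only an illustration: WHISPER_MODEL is a hypothetical variable name chosen for this example, and the list covers only the general-purpose sizes named in the Whisper README, so check the README for the full, current set.

```python
import os

# General-purpose sizes listed in the openai-whisper README, smallest to largest.
# This is a subset chosen for the sketch, not an exhaustive list.
KNOWN_MODELS = ("tiny", "base", "small", "medium", "large")


def pick_model_name(env=None, default="base"):
    """Return the Whisper model name to load.

    WHISPER_MODEL is a hypothetical environment variable, not part of Whisper.
    """
    env = os.environ if env is None else env
    name = env.get("WHISPER_MODEL", default).strip().lower()
    if name not in KNOWN_MODELS:
        raise ValueError(
            f"Unknown Whisper model {name!r}; expected one of {KNOWN_MODELS}"
        )
    return name


# In whisper_transcribe.py you could then write:
#   model = whisper.load_model(pick_model_name())
```

Larger models are generally more accurate but slower and need more memory, so the default stays at "base" here.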
To test the transcription, save an audio file within your Django project directory. Ensure the file is in a format Whisper supports (e.g., MP3, WAV). Now, modify whisper_transcribe.py by adding the following lines at the bottom:
# For testing purposes
if __name__ == "__main__":
    print(transcribe_audio("path_to_your_audio_file"))
Run whisper_transcribe.py with Python to see the transcription of your audio file in your terminal:
(env)sammy@ubuntu:$ python whisper_transcribe.py
You should see the transcribed text output in the terminal if everything is set up correctly. This functionality will serve as the foundation for voice-based interactions within our bot.
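If the test run fails, a wrong path or an unexpected file type is a common cause. The following is a minimal sketch of a check you could run before calling transcribe_audio. The check_audio_path helper and the extension list are assumptions made for this example; Whisper decodes audio through ffmpeg, so formats outside this list may work too.

```python
from pathlib import Path

# Extensions assumed for this sketch; adjust to the formats you actually accept.
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}


def check_audio_path(audio_path):
    """Return a Path if the file exists and has an expected extension."""
    path = Path(audio_path)
    if not path.is_file():
        raise FileNotFoundError(f"No audio file at {path}")
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unexpected audio extension: {path.suffix!r}")
    return path
```

Calling check_audio_path first turns a confusing decoder error into a short message that names the problem.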
In this step, you’ll use the GPT-4 LLM to generate text responses based on the user input or the speech transcription obtained in the previous step. GPT-4 can generate coherent, contextually relevant responses, which makes it a good fit for our multi-modal bot application.
Before proceeding, ensure the OpenAI Python package is installed in your virtual environment as described in the prerequisites. Accessing the GPT-4 model requires an API key, so have it ready. You can store the OpenAI API key in an environment variable so that you don’t add it directly to the Python file:
(env)sammy@ubuntu:$ export OPENAI_KEY="your-api-key"
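The scripts in this tutorial read the key with os.environ["OPENAI_KEY"], which raises a bare KeyError when the variable is missing. As an optional sketch, a small helper can produce a clearer message instead. The get_api_key name is an assumption made for this example and is not part of the OpenAI package.

```python
import os


def get_api_key(env=None):
    """Return the key stored in OPENAI_KEY, or raise a descriptive error."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_KEY", "").strip()
    if not key:
        raise RuntimeError(
            'OPENAI_KEY is not set. Run: export OPENAI_KEY="your-api-key"'
        )
    return key
```

You could then create the client with OpenAI(api_key=get_api_key()). Keep in mind that export only affects the current shell session, so the variable must be set again in each new terminal.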
Navigate to your Django app’s directory and create a new Python file named chat_completion.py. This script will handle communication with the GPT-4 model to generate responses based on the input text.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_KEY"])

def generate_story(input_text):
    # Call the OpenAI API to generate the story
    response = get_story(input_text)
    # Format and return the response
    return format_response(response)
This code snippet first creates a client with the API key needed to authenticate with OpenAI’s services. The generate_story function then calls a separate function, get_story, to make the API call to OpenAI for the story, and another function, format_response, to format the response from the API.
Now, let’s focus on the get_story function. Add the following to the bottom of your chat_completion.py file:
def get_story(input_text):
    # Construct the system prompt. Feel free to experiment with different prompts.
    system_prompt = """You are a story generator.
    You will be provided with a description of the story the user wants.
    Write a story using the description provided."""
    # Make the API call
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": input_text},
        ],
        temperature=0.8
    )
    # Return the API response
    return response
In this function, you first set up the system prompt, which tells the model what task to perform, and then call the Chat Completions API to generate a story from the user’s input text.
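The messages list is plain data, so it can be built and checked without calling the API. The sketch below separates that step out. The build_messages helper and the empty-input check are assumptions made for this example; the tutorial’s get_story builds the list inline.

```python
SYSTEM_PROMPT = (
    "You are a story generator. "
    "You will be provided with a description of the story the user wants. "
    "Write a story using the description provided."
)


def build_messages(input_text, system_prompt=SYSTEM_PROMPT):
    """Return the system and user messages for a chat completion request."""
    text = input_text.strip()
    if not text:
        # Reject empty input before spending an API call on it.
        raise ValueError("input_text is empty")
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": text},
    ]
```

The system message comes first so that it frames how the model reads the user message that follows.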
Finally, you can implement the format_response function. Add the following to the bottom of your chat_completion.py file:
def format_response(response):
    # Extract the generated story from the response
    story = response.choices[0].message.content
    # Remove any unwanted text or formatting
    story = story.strip()
    # Return the formatted story
    return story
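Because format_response only reads response.choices[0].message.content, you can try it without an API call by passing in a stand-in object. The sketch below repeats the function so that it runs on its own. The fake_response helper is an assumption made for this example and only mimics the attributes that format_response reads, not the full OpenAI response type.

```python
from types import SimpleNamespace


def format_response(response):
    """Same logic as the tutorial's format_response, repeated so this runs alone."""
    story = response.choices[0].message.content
    return story.strip()


def fake_response(text):
    """Build a stand-in object with the attributes format_response reads."""
    message = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])


print(format_response(fake_response("\n  Once upon a time...  \n")))
# prints: Once upon a time...
```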
To test the text generation, modify chat_completion.py by adding a few lines at the bottom:
# For testing purposes
if __name__ == "__main__":
    user_input = "Tell me a story about a dragon"
    print(generate_story(user_input))
Run chat_completion.py with Python to see the generated response in your terminal:
(env)sammy@ubuntu:$ python chat_completion.py
Based on the prompt, you should observe a creatively generated response from GPT-4. Experiment with different inputs to see various responses.
In the next step, you will add images to the generated stories.
DALL-E is designed to create detailed images from textual prompts, enabling your multi-modal bot to enhance stories with visual creativity.
Create a new Python file named image_generation.py in your Django app. This script will use the DALL-E model for image generation:
(env)sammy@ubuntu:$ touch image_generation.py
Let’s create a function within image_generation.py that sends a prompt to DALL-E and retrieves the generated image:
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_KEY"])

def generate_image(text_prompt):
    response = client.images.generate(
        model="dall-e-3",
        prompt=text_prompt,
        size="1024x1024",
        quality="standard",
        n=1,
    )
    image_url = response.data[0].url
    return image_url
This function sends a request to the DALL-E model specifying the text prompt, the number of images to generate (n=1), and the size of the images. It then extracts and returns the URL of the generated image.
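Prompts built from a full story can be long, and the image API limits prompt length. The sketch below trims a prompt before it is sent. The fit_prompt helper is hypothetical, and the 4000-character limit is an assumption based on OpenAI’s published limit for dall-e-3, so verify it against the current API reference.

```python
# Assumed prompt limit for dall-e-3; confirm against the current API reference.
MAX_PROMPT_CHARS = 4000


def fit_prompt(text_prompt, limit=MAX_PROMPT_CHARS):
    """Collapse whitespace and trim the prompt to at most `limit` characters."""
    prompt = " ".join(text_prompt.split())
    if len(prompt) <= limit:
        return prompt
    # Cut at the last space before the limit so a word is not split in half.
    cut = prompt.rfind(" ", 0, limit)
    return prompt[: cut if cut > 0 else limit]
```

You could call generate_image(fit_prompt(prompt)) so that an unusually long story does not cause the request to be rejected.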
To illustrate the use of this function within your Django project, you can add the following example at the bottom of your image_generation.py file:
# For testing purposes
if __name__ == "__main__":
    prompt = "Generate an image of a pet and a child playing in a yard."
    print(generate_image(prompt))
Run image_generation.py with Python to generate an image based on the given prompt:
(env)sammy@ubuntu:$ python image_generation.py
If the script runs successfully, you will see the URL of the generated image in the terminal. You can then view the image by navigating to this URL in your web browser.
In the next step, you will bring speech recognition together with text and image generation for a unified user experience.
In this step, you will integrate the functionalities developed in the previous steps to provide a seamless user experience.
Your web application will be capable of processing text and voice input from users, generating stories, and complementing them with related images.
First, ensure that your Django project is organized and that you have whisper_transcribe.py, chat_completion.py, and image_generation.py in the Django app directory. You will now create a view that combines these components.
Open your views.py file and import the necessary modules and functions. Then create a new view called get_story_from_description:
import uuid
from django.core.files.storage import FileSystemStorage
from django.shortcuts import render
from .whisper_transcribe import transcribe_audio
from .chat_completion import generate_story
from .image_generation import generate_image

# other views

def get_story_from_description(request):
    context = {}
    user_input = ""
    if request.method == "GET":
        return render(request, "story_template.html")
    else:
        if "text_input" in request.POST:
            user_input += request.POST.get("text_input") + "\n"
        if "voice_input" in request.FILES:
            audio_file = request.FILES["voice_input"]
            file_name = str(uuid.uuid4()) + (audio_file.name or "")
            FileSystemStorage(location="/tmp").save(file_name, audio_file)
            user_input += transcribe_audio(f"/tmp/{file_name}")
        generated_story = generate_story(user_input)
        image_prompt = (
            f"Generate an image that visually illustrates the essence of the following story: {generated_story}"
        )
        image_url = generate_image(image_prompt)
        context = {
            "user_input": user_input,
            "generated_story": generated_story.replace("\n", "<br/>"),
            "image_url": image_url,
        }
        return render(request, "story_template.html", context)
This view retrieves the text and/or voice input from the user. If there is an audio file, it saves it with a unique name (using the uuid library) and uses the transcribe_audio function to convert speech to text. It then uses the generate_story function to generate a text response and the generate_image function to generate a related image. These outputs are passed to the context dictionary, then rendered with story_template.html.
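The view appends the uploaded file’s original name to the UUID. Client-supplied names can contain spaces or other awkward characters, so one option is to keep only the extension. The sketch below shows that idea. The unique_upload_name helper and its limit on extension length are assumptions made for this example, not part of Django.

```python
import uuid
from pathlib import Path


def unique_upload_name(original_name):
    """Return '<uuid4 hex><extension>', ignoring the rest of the client's name."""
    suffix = Path(original_name or "").suffix.lower()
    # Keep only short alphanumeric extensions such as .mp3 or .wav.
    if not (1 < len(suffix) <= 6 and suffix[1:].isalnum()):
        suffix = ""
    return uuid.uuid4().hex + suffix
```

In the view, file_name = unique_upload_name(audio_file.name) would replace the string concatenation. Keeping the extension helps the audio decoder identify the format.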
Next, create a file called story_template.html and add the following:
<div style="padding:3em; font-size:14pt;">
    <form method="post" enctype="multipart/form-data">
        {% csrf_token %}
        <textarea name="text_input" placeholder=" Describe the story you would like" style="width:30em;"></textarea>
        <br/><br/>
        <input type="file" name="voice_input" accept="audio/*" style="width:30em;">
        <br/><br/>
        <input type="submit" value="Submit" style="width:8em; height:3em;">
    </form>
    <p>
        <strong>{{ user_input }}</strong>
    </p>
    {% if image_url %}
    <p>
        <img src="{{ image_url }}" alt="Generated Image" style="max-width:80vw; width:30em; height:30em;">
    </p>
    {% endif %}
    {% if generated_story %}
    <p>{{ generated_story | safe }}</p>
    {% endif %}
</div>
This simple form allows users to submit their prompts through text or by uploading an audio file. It then displays the text and image generated by the application.
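The template marks generated_story as safe, so any HTML in the model’s output is rendered as-is. A cautious variant escapes the story before inserting the line breaks. The story_to_html helper below is a hypothetical sketch of that idea using only the standard library. Django’s built-in linebreaksbr template filter is another option worth considering.

```python
from html import escape


def story_to_html(story):
    """Escape HTML special characters, then turn newlines into <br/> tags."""
    return escape(story).replace("\n", "<br/>")


print(story_to_html("A <b>bold</b> dragon\nflew & sang."))
# prints: A &lt;b&gt;bold&lt;/b&gt; dragon<br/>flew &amp; sang.
```

In the view, "generated_story": story_to_html(generated_story) would replace the plain replace call, and the template could stay unchanged.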
Now that you have the get_story_from_description view ready, you must make it accessible by creating a URL configuration.
Open your urls.py file within your Django app and add a pattern for the get_story_from_description view:
from django.urls import path
from . import views

urlpatterns = [
    # other patterns
    path('generate-story/', views.get_story_from_description, name='get_story_from_description'),
]
You can now visit http://your_domain/generate-story/ in your web browser. You should see the form defined in story_template.html. Try submitting a text prompt through the text input field, or uploading an audio file using the file input. Upon submission, your application will process the input(s), generate a story and an accompanying image, and display them on the page.
For example, here is a sample story for the prompt: “Tell me a story about a pet and a child playing in a yard.”
By completing this step, you have created an application that accepts user input as text or voice and responds with a generated story and an accompanying image.
In this tutorial, you developed a multi-modal bot with Django that uses Whisper for speech recognition, GPT-4 for text generation, and DALL-E for image generation. Your application can now accept user input as text or voice and respond with generated text and images.
For further development, it is recommended to explore alternative versions of the Whisper, GPT, and DALL-E models, improve the UI/UX design of your application, or extend the bot’s functionality to include additional interactive features.
The author selected Direct Relief Program to receive a donation as part of the Write for DOnations program.
So, what size / characteristics of Droplet should one have to run all that is stated in this tutorial?