Developing Multi-Modal Bots with Django, GPT-4, Whisper, and DALL-E


Integrating artificial intelligence can make modern web applications more interactive. This tutorial focuses on building multi-modal bots, which combine natural language processing, image generation, and speech recognition to engage users through several modes of interaction.

This tutorial walks through developing a multi-modal bot using Django and OpenAI’s GPT-4 Large Language Model (LLM) for conversational AI, Whisper for speech transcription, and DALL-E for image generation. You will build a web application that generates stories with accompanying images. Users specify the theme of the story by voice or text, and the application responds with a generated story illustrated by a generated image.

By the end of this tutorial, you will have developed a practical application that accepts user input as text or voice and responds with text and images. Supporting more than one input mode makes the application more intuitive and accessible.

This tutorial covers the following steps:

  1. Integrating OpenAI Whisper for Speech Recognition
  2. Generating Text Responses with GPT-4
  3. Generating Images with DALL-E
  4. Combining Modalities for a Unified Experience


To complete this tutorial, you will need:

  1. A basic understanding of Python and Django. If you’re new to Django, following the How To Install Django and Set Up a Development Environment tutorial is recommended.

  2. An OpenAI API key: This tutorial requires you to interact with OpenAI’s GPT-4 and DALL-E models, which require an API key from OpenAI. You can obtain an API key by creating an OpenAI account and then creating a secret key.

  3. Whisper: Visit the OpenAI Whisper GitHub page for detailed installation guides and verify that your development setup is properly configured for Whisper.

  4. The OpenAI Python package: If you followed the tutorial in the first prerequisite, you should already have a virtual environment named env active within a directory named django-apps.

Note: Ensure your virtual environment is active by confirming that its name appears in parentheses at the start of your terminal prompt. If it’s not active, you can manually activate it by running the following command in your terminal from the directory containing your Django app.

sammy@ubuntu:$ source env/bin/activate

Once your environment is active, run the following to install the OpenAI Python package:

(env)sammy@ubuntu:$ pip install openai

If this is your first time using the OpenAI library, you should review the How to Integrate OpenAI GPT Models in Your Django Project tutorial.

Step 1 — Integrating OpenAI Whisper for Speech Recognition

In this step, you’ll set up OpenAI Whisper in your Django application to allow it to transcribe speech to text. Whisper is a robust speech recognition model that can provide accurate transcriptions, a crucial feature for our multi-modal bot. By integrating Whisper, our application will be able to understand user inputs provided through voice.

First, ensure that you are working in your Django project directory. Following the prerequisite tutorials, you should have a Django project ready for this integration. Open your terminal, navigate to your Django project directory, and ensure your virtual environment is activated:

sammy@ubuntu:$ cd path_to_your_django_project
sammy@ubuntu:$ source env/bin/activate

Setting Up Whisper in Your Django Application

Next, create a function that uses Whisper to transcribe audio files to text. From your Django app's directory, create a new Python file named whisper_transcribe.py:

(env)sammy@ubuntu:$ touch whisper_transcribe.py

Open whisper_transcribe.py in your text editor and import Whisper. Next, let’s define a function that takes the path of an audio file as input, uses Whisper to process the file, and then returns the transcription:

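import whisper

# Load the model once at import time so every call reuses it
model = whisper.load_model("base")

def transcribe_audio(audio_path):
    # Run speech recognition on the file and return only the transcribed text
    result = model.transcribe(audio_path)
    return result["text"]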

In this code snippet, you’re using the “base” model for transcription. Whisper offers different models tailored to various accuracy and performance needs. Feel free to experiment with other models based on your requirements.
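
One way to experiment with different models without editing the code each time is to read the model name from an environment variable and load the model on first use. The following is a minimal sketch under stated assumptions: the variable name WHISPER_MODEL and the helpers chosen_model_name and get_model are hypothetical names introduced for this example, and KNOWN_MODELS lists only the common Whisper model sizes.

```python
import os
from functools import lru_cache

# Common Whisper model sizes, smallest to largest (not an exhaustive list).
KNOWN_MODELS = ("tiny", "base", "small", "medium", "large")


def chosen_model_name(default="base"):
    """Return the model name from WHISPER_MODEL (hypothetical variable) or the default."""
    name = os.environ.get("WHISPER_MODEL", default).strip().lower()
    if name not in KNOWN_MODELS:
        raise ValueError(f"Unknown Whisper model: {name!r}")
    return name


@lru_cache(maxsize=None)
def get_model(name):
    """Load a Whisper model once per name and reuse it on later calls."""
    import whisper  # imported lazily, so this module loads even before Whisper is needed

    return whisper.load_model(name)


def transcribe_audio(audio_path):
    # Pick the configured model, then return only the transcribed text.
    model = get_model(chosen_model_name())
    return model.transcribe(audio_path)["text"]
```

Larger models are generally more accurate but slower and need more memory, so a configurable name makes it easier to compare them on your own audio.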

Testing the Transcription

To test the transcription, save an audio file within your Django project directory. Ensure the file is in a format Whisper supports (e.g., MP3, WAV). Now, modify whisper_transcribe.py by adding the following lines at the bottom:

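# For testing purposes
if __name__ == "__main__":
    # Replace the placeholder with the path to your own audio file
    print(transcribe_audio("path_to_your_audio_file"))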

Run whisper_transcribe.py with Python to see the transcription of your audio file in your terminal:

(env)sammy@ubuntu:$ python whisper_transcribe.py

You should see the transcribed text output in the terminal if everything is set up correctly. This functionality will serve as the foundation for voice-based interactions within our bot.
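
Before passing a file to Whisper, it can help to reject files that are clearly not audio. The sketch below checks only the file extension. The helper name and the extension list are assumptions made for illustration; Whisper decodes audio through ffmpeg, so the formats that actually work depend on your ffmpeg installation.

```python
from pathlib import Path

# Assumed set of common audio extensions; real support depends on ffmpeg.
LIKELY_SUPPORTED = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm"}


def looks_like_supported_audio(path):
    """Return True if the file extension is one of the assumed audio types."""
    return Path(path).suffix.lower() in LIKELY_SUPPORTED
```

An extension check is a cheap first filter, not a guarantee: a mislabeled file can still fail during transcription, so the calling code should also handle errors raised by Whisper.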

Step 2 — Generating Text Responses with GPT-4

In this step, you’ll use the GPT-4 LLM to generate text responses based on the user’s text input or the speech transcription obtained in the previous step. GPT-4 can generate coherent, contextually relevant responses, making it a good fit for the multi-modal bot application.

Before proceeding, ensure the OpenAI Python package is installed in your virtual environment as described in the prerequisites. The GPT-4 model requires an API key, so ensure you have it ready. Store the key in an environment variable rather than writing it directly into a Python file:

(env)sammy@ubuntu:$ export OPENAI_KEY="your-api-key"
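
If the variable is missing, a direct os.environ lookup fails with a bare KeyError. As a hedged alternative, the sketch below reads the key and raises a clearer error. The helper name get_openai_key is hypothetical; only the OPENAI_KEY variable name comes from this tutorial.

```python
import os


def get_openai_key(env=None):
    """Return the API key from OPENAI_KEY, or raise a descriptive error."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_KEY is not set. Export it before running the app.")
    return key
```

The optional env argument exists only so the function can be exercised with a plain dictionary, for example get_openai_key({"OPENAI_KEY": "your-api-key"}).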

Setting Up Chat Completion

Navigate to your Django app’s directory and create a new Python file named chat_completion.py. This script will handle communication with the GPT-4 model to generate responses based on the input text.

import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_KEY"])

def generate_story(input_text):
    # Call the OpenAI API to generate the story
    response = get_story(input_text)
    # Format and return the response
    return format_response(response)

This code snippet first creates an OpenAI client authenticated with the API key stored in the OPENAI_KEY environment variable. The generate_story function then calls a separate function, get_story, to request the story from the OpenAI API, and another function, format_response, to format the API response.

Now, let’s focus on the get_story function. Add the following to the bottom of your chat_completion.py file:

def get_story(input_text):
    # Construct the system prompt. Feel free to experiment with different prompts.
    system_prompt = """You are a story generator.
    You will be provided with a description of the story the user wants.
    Write a story using the description provided."""
    # Make the API call
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": input_text},
        ],
    )

    # Return the API response
    return response

In this function, you first set up the system prompt, which tells the model what task to perform, and then call the Chat Completions API to generate a story from the user’s input text.
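
The pairing of a system message and a user message can be isolated in a small helper, which also gives you a place to reject empty descriptions before spending an API call. This is a sketch only; build_messages is a hypothetical name and is not part of the tutorial's chat_completion.py.

```python
def build_messages(system_prompt, user_text):
    """Return the system and user messages for a chat completion request."""
    user_text = user_text.strip()
    if not user_text:
        raise ValueError("The story description is empty.")
    return [
        # The system message describes the task for the model.
        {"role": "system", "content": system_prompt},
        # The user message carries the typed or transcribed description.
        {"role": "user", "content": user_text},
    ]
```

The returned list has the same shape as the messages argument passed to the API, so it could be supplied as messages=build_messages(system_prompt, input_text).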

Finally, you can implement the format_response function. Add the following to the bottom of your chat_completion.py file:

def format_response(response):
    # Extract the generated story from the response
    story = response.choices[0].message.content
    # Remove any unwanted text or formatting
    story = story.strip()
    # Return the formatted story
    return story
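
To see which part of the response format_response relies on without calling the API, you can build a stand-in object with the same attribute path. The sketch below is illustrative: fake_response is a hypothetical helper, and the variant of format_response shown here additionally treats a missing message body as an empty string, which is an assumption beyond the tutorial's version.

```python
from types import SimpleNamespace


def format_response(response):
    """Return the stripped story text, or an empty string if there is none."""
    content = response.choices[0].message.content
    return (content or "").strip()


def fake_response(text):
    """Build a stand-in exposing only choices[0].message.content."""
    message = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])


if __name__ == "__main__":
    print(format_response(fake_response("  Once upon a time.\n")))  # Once upon a time.
```

A stand-in like this keeps quick checks of the formatting logic free of network calls and API costs.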

Testing Generated Responses

To test the text generation, modify chat_completion.py by adding a few lines at the bottom:

# For testing purposes
if __name__ == "__main__":
    user_input = "Tell me a story about a dragon"
    print(generate_story(user_input))

Run chat_completion.py with Python to see the generated response in your terminal:

(env)sammy@ubuntu:$ python chat_completion.py

Based on the prompt, you should observe a creatively generated response from GPT-4. Experiment with different inputs to see various responses.

In the next step, you will add images to the generated stories.

Step 3 — Generating Images with DALL-E

DALL-E is designed to create detailed images from textual prompts, enabling your multi-modal bot to enhance stories with visual creativity.

Create a new Python file named image_generation.py in your Django app. This script will use the DALL-E model for image generation:

(env)sammy@ubuntu:$ touch image_generation.py

Let’s create a function within image_generation.py that sends a prompt to DALL-E and retrieves the generated image:

import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_KEY"])

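def generate_image(text_prompt):
    # model and size are assumed example values; adjust them as needed
    response = client.images.generate(
        model="dall-e-3",
        prompt=text_prompt,
        n=1,
        size="1024x1024",
    )
    image_url = response.data[0].url
    return image_url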

This function sends a request to the DALL-E model specifying the text prompt, the number of images to generate (n=1), and the size of the images. It then extracts and returns the URL of the generated image.
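
The sizes the API accepts differ between DALL-E versions, so it can be useful to validate the size before sending a request. The following sketch assumes the sizes listed in OpenAI's documentation at the time of writing, which may change. ALLOWED_SIZES and check_image_size are hypothetical names introduced for this example.

```python
# Assumed from OpenAI's documentation at the time of writing; verify before relying on it.
ALLOWED_SIZES = {
    "dall-e-2": {"256x256", "512x512", "1024x1024"},
    "dall-e-3": {"1024x1024", "1792x1024", "1024x1792"},
}


def check_image_size(model, size):
    """Return size unchanged if the model accepts it, otherwise raise ValueError."""
    allowed = ALLOWED_SIZES.get(model)
    if allowed is None:
        raise ValueError(f"Unknown image model: {model!r}")
    if size not in allowed:
        raise ValueError(f"{model} does not accept size {size!r}")
    return size
```

Calling check_image_size before the request turns an API error into an immediate, readable exception in your own code.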

Testing the Script

To illustrate the use of this function within your Django project, you can add the following example at the bottom of your image_generation.py file:

# For testing purposes
if __name__ == "__main__":
    prompt = "Generate an image of a pet and a child playing in a yard."
    print(generate_image(prompt))

Run image_generation.py with Python to generate an image based on the given prompt:

(env)sammy@ubuntu:$ python image_generation.py

If the script runs successfully, you will see the URL of the generated image in the terminal. You can then view the image by navigating to this URL in your web browser.

In the next step, you will bring speech recognition together with text and image generation for a unified user experience.

Step 4 — Combining Modalities for a Unified Experience

In this step, you will integrate the functionalities developed in the previous steps to provide a seamless user experience.

Your web application will be capable of processing text and voice input from users, generating stories, and complementing them with related images.

Creating the Unified View

First, ensure that your Django project is organized and that you have whisper_transcribe.py, chat_completion.py, and image_generation.py in the Django app directory. You will now create a view that combines these components.

Open your views.py file and import the necessary modules and functions. Then create a new view called get_story_from_description:

import uuid
from django.core.files.storage import FileSystemStorage
from django.shortcuts import render
from .whisper_transcribe import transcribe_audio
from .chat_completion import generate_story
from .image_generation import generate_image

# other views

def get_story_from_description(request):
    context = {}
    user_input = ""
    if request.method == "GET":
        # Show the empty form
        return render(request, "story_template.html")
    else:
        if "text_input" in request.POST:
            user_input += request.POST.get("text_input") + "\n"
        if "voice_input" in request.FILES:
            # Save the upload under a unique name, then transcribe it
            audio_file = request.FILES["voice_input"]
            file_name = str(uuid.uuid4()) + (audio_file.name or "")
            FileSystemStorage(location="/tmp").save(file_name, audio_file)
            user_input += transcribe_audio(f"/tmp/{file_name}")

        generated_story = generate_story(user_input)
        image_prompt = (
            f"Generate an image that visually illustrates the essence of the following story: {generated_story}"
        )
        image_url = generate_image(image_prompt)

        context = {
            "user_input": user_input,
            "generated_story": generated_story.replace("\n", "<br/>"),
            "image_url": image_url,
        }

        return render(request, "story_template.html", context)

This view retrieves the text and/or voice input from the user. If there is an audio file, it saves it with a unique name (using the uuid library) and uses the transcribe_audio function to convert speech to text. It then uses the generate_story function to generate a text response and the generate_image function to generate a related image. These outputs are passed to the context dictionary, then rendered with story_template.html.
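
Because the whole story is embedded in the image prompt, a long story can exceed the prompt length the image API accepts. The sketch below caps the prompt length and cuts at a word boundary. The helper name build_image_prompt is hypothetical, and the default limit of 4000 characters is an assumption based on OpenAI's documented limit for DALL-E 3 at the time of writing; other model versions may allow less.

```python
PROMPT_PREFIX = (
    "Generate an image that visually illustrates the essence of the following story: "
)


def build_image_prompt(story, max_chars=4000):
    """Return the image prompt, shortened so it never exceeds max_chars."""
    room = max_chars - len(PROMPT_PREFIX)
    if room <= 0:
        raise ValueError("max_chars is too small to hold the prompt prefix.")
    # Collapse newlines and repeated spaces so they do not waste characters.
    story = " ".join(story.split())
    if len(story) > room:
        # Cut to the limit, then drop the trailing partial word.
        story = story[:room].rsplit(" ", 1)[0]
    return PROMPT_PREFIX + story
```

In the view, the result could replace the inline f-string, for example image_prompt = build_image_prompt(generated_story).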

Creating the Template

Next, create a file called story_template.html in your app's templates directory and add the following:

<div style="padding:3em; font-size:14pt;">
    <form method="post" enctype="multipart/form-data">
        {% csrf_token %}
        <textarea name="text_input" placeholder=" Describe the story you would like" style="width:30em;"></textarea>
        <input type="file" name="voice_input" accept="audio/*" style="width:30em;">
        <input type="submit" value="Submit" style="width:8em; height:3em;">
    </form>
    <strong>{{ user_input }}</strong>
    {% if image_url %}
        <img src="{{ image_url }}" alt="Generated Image" style="max-width:80vw; width:30em; height:30em;">
    {% endif %}
    {% if generated_story %}
        <p>{{ generated_story | safe }}</p>
    {% endif %}
</div>

This simple form allows users to submit their prompts through text or by uploading an audio file. It then displays the text and image generated by the application.
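
The template marks the story as safe so that the br tags inserted by the view are rendered, which also means any HTML in the model's output is rendered as-is. As a cautious sketch, the story can be escaped before the line breaks are added. The helper name story_to_html is hypothetical and uses only the Python standard library.

```python
import html


def story_to_html(story):
    """Escape HTML special characters, then turn newlines into line breaks."""
    escaped = html.escape(story.strip())
    return escaped.replace("\n", "<br/>")
```

In the view, this could stand in for the direct replace call, for example "generated_story": story_to_html(generated_story). Django's built-in linebreaksbr template filter is another option worth considering, since it lets the template handle line breaks without the safe filter.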

Creating a URL for the View

Now that you have the get_story_from_description view ready, you must make it accessible by creating a URL configuration.

Open your urls.py file within your Django app and add a pattern for the get_story_from_description view:

from django.urls import path
from . import views

urlpatterns = [
    # other patterns
    path('generate-story/', views.get_story_from_description, name='get_story_from_description'),
]

Testing the Unified Experience

You can now visit http://your_domain/generate-story/ in your web browser. You should see the form defined in story_template.html. Try submitting a text prompt through the text input field, or uploading an audio file using the file input. Upon submission, your application will process the input(s), generate a story and an accompanying image, and display them on the page.

For example, here is a sample story for the prompt: “Tell me a story about a pet and a child playing in a yard.”

Screenshot of a generated story

By completing this step, you have created an application that accepts text and voice input and responds with a generated story and an accompanying image.


In this tutorial, you developed a multi-modal bot with Django that integrates Whisper for speech recognition, GPT-4 for text generation, and DALL-E for image generation. Your application can now accept user input as text or speech and respond with generated stories and images.

For further development, consider exploring other versions of the Whisper, GPT, and DALL-E models, improving the UI/UX design of your application, or extending the bot with additional interactive features.

The author selected Direct Relief Program to receive a donation as part of the Write for DOnations program.
