Introducing Dia, a TTS model from Nari Labs

Published on May 7, 2025

Introduction

One of the most exciting areas in AI right now is the advancement of voice models. In our ongoing exploration of cutting-edge voice models, we previously highlighted the Conversational Speech Model from Sesame.

In this article, we will discuss Dia, a 1.6 billion parameter open-source TTS model from Nari Labs. Currently, little information is available about its architecture beyond the fact that it is heavily inspired by SoundStorm, Parakeet, and Descript Audio Codec. We’ll leave it up to you to speculate about how this model was trained, and we may cover it in a follow-up article once more information is available. For now, we’ll focus on its implementation.

We’re very impressed with the model’s performance. Test it out yourself in this Hugging Face Space or follow the implementation instructions below.

Implementation

We’ll cover two ways of testing this model. The first uses the Web Console, which suits one-off tests: you can quickly check the model’s capabilities in a Gradio interface. The second uses the Python library, which is better suited to developing more intricate applications.

Option 1: Web Console CLI

Step 1: Set up a GPU Droplet

Begin by setting up a DigitalOcean GPU Droplet: select AI/ML and choose the NVIDIA H100 option.

Step 2: Web Console

Once your GPU Droplet finishes loading, you’ll be able to open up the Web Console.

In the web console, copy and paste the following code snippet:

git clone https://github.com/nari-labs/dia.git
cd dia
python -m venv .venv
source .venv/bin/activate
pip install -e .
python app.py

The output will be a Gradio link that you can access within VS Code.

Step 3: Open VS Code

In VS Code, click on “Connect to…” in the Start menu.

Choose “Connect to Host…”.

Step 4: Connect to your GPU Droplet

Click “Add New SSH Host…” and enter the SSH command to connect to your Droplet. This command is usually in the format ssh root@[your_droplet_ip_address]. Press Enter to confirm, and a new VS Code window will open, connected to your Droplet.

You can find your Droplet’s IP address on the GPU Droplet page.

Step 5: Access the Gradio

In the new VS Code window connected to your Droplet, type >sim in the search bar at the top of the window and select “Simple Browser: Show”.

Paste the Gradio URL from the Web Console, press Enter, and click the arrow in the top right.

This is the Gradio interface. Feel free to modify the input text to your liking.

Using Dia Effectively

To use Dia effectively, it’s essential to consider the length of your input text. Nari Labs recommends aiming for text that corresponds to 5-20 seconds of audio for the most natural-sounding results. If your input text is too short, equivalent to under 5 seconds of audio, the output may sound unnatural. On the other hand, inputs that would take over 20 seconds to speak will be compressed, resulting in unnaturally fast speech. By keeping your text within the moderate range, you can achieve more realistic and engaging audio outputs.
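To get a feel for the 5-20 second window before generating anything, you can estimate how long a script would take to speak. The sketch below is a rough heuristic, not something Dia provides: the 150 words-per-minute rate and the helper names estimate_seconds and length_advice are our own assumptions, so tune the rate against audio you actually generate.

```python
import re

# Assumption: ~150 words per minute approximates conversational speech.
# Dia's real pacing may differ; adjust after listening to actual outputs.
WORDS_PER_MINUTE = 150

# Matches speaker tags such as [S1] and parenthesized non-verbals such as (laughs),
# which should not count as spoken words.
TAG_PATTERN = re.compile(r"\[S\d+\]|\([^)]*\)")


def estimate_seconds(text: str, words_per_minute: float = WORDS_PER_MINUTE) -> float:
    """Roughly estimate the spoken duration of a script, in seconds."""
    spoken = TAG_PATTERN.sub(" ", text)
    word_count = len(spoken.split())
    return word_count * 60.0 / words_per_minute


def length_advice(text: str, min_s: float = 5.0, max_s: float = 20.0) -> str:
    """Classify a script against the recommended 5-20 second window."""
    seconds = estimate_seconds(text)
    if seconds < min_s:
        return "too short"
    if seconds > max_s:
        return "too long"
    return "ok"


script = "[S1] Hello, how are you? [S2] I'm good, thank you."
print(f"{estimate_seconds(script):.1f}s -> {length_advice(script)}")
```

If a script comes back as "too long", splitting it into several generations is one way to stay inside the recommended range.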

When creating dialogue with Dia, using speaker tags correctly is crucial. Always begin your input text with the [S1] tag to indicate the first speaker. When switching between speakers, alternate between [S1] and [S2] tags, making sure to never use [S1] twice in sequence.
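These tag rules are easy to check before you send text to the model. The following is a minimal sketch with a hypothetical helper, check_speaker_tags, that is not part of the Dia library. It also flags a repeated [S2], which is our own extension of the rule about [S1].

```python
import re

# Only the two speaker tags mentioned in this article are recognized.
SPEAKER_TAG = re.compile(r"\[S[12]\]")


def check_speaker_tags(text: str) -> list[str]:
    """Return a list of problems found in the speaker tags (empty if none)."""
    problems = []
    if not text.lstrip().startswith("[S1]"):
        problems.append("input should begin with [S1]")

    tags = SPEAKER_TAG.findall(text)
    # Compare each tag with the one before it; position counts tags from 1.
    for position, (previous, current) in enumerate(zip(tags, tags[1:]), start=2):
        if previous == current:
            problems.append(
                f"tag {position} repeats {current}; speakers should alternate"
            )
    return problems


print(check_speaker_tags("[S1] Hello. [S2] Hi there. [S1] Nice to meet you."))
print(check_speaker_tags("[S1] Hello. [S1] Anyone home?"))
```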

In addition to speaker tags, non-verbal elements can also enhance your audio outputs. However, it’s recommended to use non-verbal tags sparingly for the most natural results. Stick to the officially supported non-verbal sounds listed in the documentation, as overusing these tags or attempting to use unlisted non-verbals may introduce unwanted artifacts.
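A small pre-flight check can catch both overuse and unlisted tags. In the sketch below, the helper review_nonverbals, the limit of two tags per input, and the allowed set are all our own assumptions. The set contains only (laughs), the one tag that appears in this article, so replace it with the full list from the Dia documentation before relying on it.

```python
import re

# Placeholder allow-list: only "(laughs)" appears in this article.
# Replace with the officially supported list from the Dia documentation.
EXAMPLE_SUPPORTED = frozenset({"(laughs)"})

# Assumption: "sparingly" is interpreted here as at most two tags per input.
MAX_PER_INPUT = 2

# Lowercase words in parentheses. Ordinary parenthetical asides in your text
# would also match, so treat the result as a hint rather than a verdict.
NONVERBAL = re.compile(r"\([a-z ]+\)")


def review_nonverbals(text, supported=EXAMPLE_SUPPORTED, max_count=MAX_PER_INPUT):
    """Count non-verbal tags and report any that are not in the allow-list."""
    found = NONVERBAL.findall(text)
    unlisted = sorted({tag for tag in found if tag not in supported})
    return {
        "count": len(found),
        "unlisted": unlisted,
        "too_many": len(found) > max_count,
    }


print(review_nonverbals("[S1] Wow. Amazing. (laughs) [S2] Try it now."))
```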

Option 2: Python Library

To work with Dia in a more programmatic way, we can use its Python library in VS Code. The code snippet below from voice_clone.py can be modified to your liking.

from dia.model import Dia


model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="float16")

# You should put the transcript of the voice you want to clone
# We will use the audio created by running simple.py as an example.
# Note that you will be REQUIRED TO RUN simple.py for the script to work as-is.
clone_from_text = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."
clone_from_audio = "simple.mp3"
# For your custom needs, replace above with below and add your audio file to this directory:
# clone_from_text = "[S1] ... [S2] ... [S1] ... corresponding to your_audio_name.mp3"
# clone_from_audio = "your_audio_name.mp3"

# Text to generate
text_to_generate = "[S1] Hello, how are you? [S2] I'm good, thank you. [S1] What's your name? [S2] My name is Dia. [S1] Nice to meet you. [S2] Nice to meet you too."

# It will only return the audio from the text_to_generate
output = model.generate(
    clone_from_text + text_to_generate, audio_prompt=clone_from_audio, use_torch_compile=True, verbose=True
)

model.save_audio("voice_clone.mp3", output)
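The voice cloning script expects simple.mp3 to exist already. If you have not produced it yet, the sketch below shows one way to create it using only the calls that appear in voice_clone.py (from_pretrained, generate, and save_audio). It is our approximation of what simple.py does, not a copy of it: the function name generate_simple is hypothetical, and the actual script in the repository may differ.

```python
# Same transcript as clone_from_text above, so the audio and its transcript match.
SIMPLE_TEXT = (
    "[S1] Dia is an open weights text to dialogue model. "
    "[S2] You get full control over scripts and voices. "
    "[S1] Wow. Amazing. (laughs) "
    "[S2] Try it now on Git hub or Hugging Face."
)


def generate_simple(text: str = SIMPLE_TEXT, output_path: str = "simple.mp3") -> str:
    """Generate audio for `text` without a voice prompt and save it to disk."""
    # Imported here so the file can be loaded on a machine without Dia installed.
    from dia.model import Dia

    model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="float16")
    # Assumption: audio_prompt is optional, so omitting it gives plain generation.
    output = model.generate(text, use_torch_compile=True, verbose=True)
    model.save_audio(output_path, output)
    return output_path
```

Calling generate_simple() on the GPU Droplet should write simple.mp3 to the current directory, after which the voice cloning snippet can use it as clone_from_audio.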

Conclusion

Kudos to Nari Labs for pushing the frontier of text-to-speech models. What’s even more remarkable is that the project is driven by just two passionate undergraduate students. You can really just do things.

We’re excited to hear about how you’re leveraging TTS models. Share your experiences with DigitalOcean GPU Droplets in the comments below: how are you harnessing their power for your TTS applications?

About the author

Melani Maheswaran

Author

See author profile

Melani is a Technical Writer at DigitalOcean based in Toronto. She has experience in teaching, data quality, consulting, and writing. Melani graduated with a BSc and Master’s from Queen's University.

Category:

Tutorial

Tags:

AI/ML