How To Make Karaoke Videos using Whisper and Spleeter AI Tools

Published on January 3, 2023

Machine Learning

PyTorch

By Alex Garnett

Senior DevOps Technical Writer


Introduction

AI tools are useful for manipulating images, audio, or video to produce a novel result. Until recently, automatically editing images or audio was challenging to implement without using a significant amount of time and computing power, and even then it was often only possible to run turnkey filters to remove certain frequencies from sounds or change the color palette of images. Newer approaches, using AI models and enormous amounts of training data, are able to run much more sophisticated filtering and transformation techniques.

Spleeter and Whisper are open source AI tools designed for audio analysis and manipulation. Both were released along with their own pre-trained models, making it possible to run them directly on your own input, such as MP3 or AAC audio files, without any additional configuration. Spleeter separates vocal tracks from the instrumental tracks of music. Whisper generates subtitles for spoken language. Each has many uses individually, and they are particularly useful together: they can generate karaoke tracks from regular audio files. In this tutorial, you’ll use Whisper and Spleeter together to make your own karaoke selections, which you can also integrate into another application stack.

Prerequisites

These tools are available on most platforms. This tutorial will provide installation instructions for an Ubuntu 22.04 server, following our guide to Initial Server Setup with Ubuntu 22.04. You will need at least 3GB of memory to run Whisper and Spleeter, so if you are running on a resource-constrained server, you should consider enabling swap for this tutorial.
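
A quick way to check the 3GB guideline before installing anything is sketched below. The function names has_enough_memory and total_memory_kb are hypothetical helpers, not part of any tool in this tutorial. The sketch assumes a Linux system that exposes /proc/meminfo, and it counts swap toward the total on the assumption that swap is an acceptable, if slower, substitute for RAM here.

```shell
#!/bin/sh
# Memory guideline from this tutorial: 3GB, expressed in kB.
REQUIRED_KB=$((3 * 1024 * 1024))

# has_enough_memory KB: succeeds when KB meets the 3GB guideline.
has_enough_memory() {
    [ "$1" -ge "$REQUIRED_KB" ]
}

# total_memory_kb [FILE]: sums MemTotal and SwapTotal (in kB) from a
# meminfo-style file. Defaults to /proc/meminfo, which is Linux-specific.
total_memory_kb() {
    awk '/^(MemTotal|SwapTotal):/ { sum += $2 } END { print sum + 0 }' "${1:-/proc/meminfo}"
}

# Example:
#   has_enough_memory "$(total_memory_kb)" || echo "Consider enabling swap."
```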

Both Spleeter and Whisper are Python libraries and require you to have installed Python and pip, the Python package manager. On Ubuntu, you can refer to Step 1 of How To Install Python 3 and Set Up a Programming Environment on an Ubuntu 22.04 Server.

Additionally, both Spleeter and Whisper use machine learning libraries that can optionally run up to 10-20x more quickly on a GPU. If a GPU is not detected, they will automatically fall back to running on your CPU. Configuring GPU support is outside the scope of this tutorial, but should work after installing PyTorch in GPU-enabled environments.

Step 1 – Installing Spleeter, Whisper, and Other Tools

First, you’ll need to use pip, Python’s package manager, to install the tools you’ll be using for this project. In addition to spleeter, you should also install youtube-dl, a script that can be used to download YouTube videos locally, which you’ll use to retrieve a sample video. Install them with pip install:

sudo pip install spleeter youtube-dl

Rather than installing Whisper directly, you can install another library called yt-whisper directly from GitHub, also by using pip. yt-whisper includes Whisper itself as a dependency, so you’ll have access to the regular whisper command after installation, but this way you’ll also get the yt_whisper script, which makes downloading and subtitling videos from YouTube a one-step process. pip install can parse GitHub links to Python repositories when you precede them with git+:

sudo pip install git+https://github.com/m1guelpf/yt-whisper.git

Finally, you’ll want to make sure you have ffmpeg installed to do some additional audio and video manipulation. ffmpeg is a universal tool for manipulating, merging, and reencoding audio and video files. On Ubuntu, you can install it using the system package manager by running an apt update followed by apt install:

sudo apt update
sudo apt install ffmpeg

Now that you have the necessary tools installed, you’ll obtain sample audio and video in the next step.
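
Before moving on, it can help to confirm that every command used in this tutorial is actually on your PATH. The following is a minimal sketch; missing_tools is a hypothetical helper name, and it only checks that the commands exist, not that they run correctly.

```shell
#!/bin/sh
# missing_tools NAME...: prints the name of each command that cannot be
# found on the PATH. Prints nothing when every command is available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example:
#   missing_tools spleeter youtube-dl yt_whisper whisper ffmpeg
```

If the example prints nothing, all five commands were found. Any name it does print points to an installation step worth repeating.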

Step 2 – Downloading and Separating Audio from a Video

youtube-dl, which you installed in Step 1, is a tool for downloading videos from YouTube to your local environment. Although you should take care when using potentially copyrighted material out of context, this can be useful in a number of contexts, especially when you need to run some additional processing on videos or use them for source material.

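Using youtube-dl, download the video that you’ll be using for this tutorial. This sample link is to a public domain song called “Lie 2 You”, but you can use another. The URL is quoted so that your shell does not interpret the & character, and --no-playlist tells youtube-dl to download only this video rather than the whole playlist named in the link:

youtube-dl --no-playlist "https://www.youtube.com/watch?v=dA2Iv9evEK4&list=PLzCxunOM5WFJxaj103IzbkAvGigpclBjt"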

youtube-dl will download the song along with some metadata and merge it into a single .webm video file. You can play this video in a local media player such as mpv, but that will depend on your environment.

Note: Because the use of youtube-dl is not explicitly supported by YouTube, downloads can occasionally be slow.

Next, you’ll separate the audio track from the video you just downloaded. This is a task where ffmpeg excels. You can use the following ffmpeg command to output the audio to a new file called audio.mp3:

ffmpeg -i "Lie 2 You (ft. Dylan Emmet) – Leonell Cassio (No Copyright Music)-dA2Iv9evEK4.webm" -c:a libmp3lame -qscale:a 1 audio.mp3

This is an example of ffmpeg command syntax. In brief:

-i /path/to/input is the path to your input file, in this case the .webm video you just downloaded.
-c:a libmp3lame specifies an audio codec to encode to. All audio and video needs to be encoded somehow, and libmp3lame is the most common mp3 encoder.
-qscale:a 1 sets the variable bit rate quality of your output mp3, in this case corresponding to an average of around 220kbps. You can review other options in the ffmpeg documentation.
audio.mp3 is the name of your output file, presented at the end of the command without any other flags.

After running this command, FFmpeg will create a new file called audio.mp3.
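
Because the downloaded filename is long and contains spaces and punctuation, you may prefer to look it up by video ID instead of typing it. The sketch below assumes the title-ID.extension naming seen in the filename above; find_download is a hypothetical helper name.

```shell
#!/bin/sh
# find_download VIDEO_ID [DIR]: prints the first file in DIR (default: the
# current directory) whose name ends in -VIDEO_ID.<extension>.
# Returns a non-zero status when no such file exists.
find_download() {
    for candidate in "${2:-.}"/*-"$1".*; do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Example:
#   ffmpeg -i "$(find_download dA2Iv9evEK4)" -c:a libmp3lame -qscale:a 1 audio.mp3
```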

Note: You can learn more about ffmpeg options from ffmprovisr, a community-maintained catalog of ffmpeg command examples, or refer to the official documentation.

In the next step, you’ll use Spleeter to isolate the instrumental track from your new audio.mp3 file.

Step 3 – Separating Vocal Tracks Using Spleeter

Now that you have your standalone audio file, you’re ready to use spleeter to separate the vocal track. Spleeter contains several models for use with the spleeter separate command, allowing you to perform even more sophisticated separation of piano, drum, and bass tracks, but for now, you’ll use the default 2stems model. Run spleeter separate on your audio.mp3, also providing a path to an output directory with -o:

spleeter separate -p spleeter:2stems -o output audio.mp3

If you are running Spleeter without a GPU, this command may take a few minutes to complete. This will produce a new directory called output. Inside it, a subdirectory named audio, after your input file, contains two files called vocals.wav and accompaniment.wav. These are your separated vocal and instrumental tracks. If you encounter any errors, or need to further customize your Spleeter output, refer to the documentation.
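
The location of each separated track follows from the output directory and the input filename, which is why a later step in this tutorial reads output/audio/accompaniment.wav. The sketch below shows that mapping, assuming Spleeter's default output layout and WAV output; spleeter_stem_path is a hypothetical helper name.

```shell
#!/bin/sh
# spleeter_stem_path OUTPUT_DIR INPUT_FILE STEM: prints the path where a
# separated stem is expected, assuming the default layout of
# OUTPUT_DIR/<input name without extension>/STEM.wav
spleeter_stem_path() {
    name=$(basename "$2")
    printf '%s/%s/%s.wav\n' "$1" "${name%.*}" "$3"
}

# Example:
#   spleeter_stem_path output audio.mp3 accompaniment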

You can try listening to these files in mpv or another audio player. They will have a relatively large file size for now because spleeter writes them as uncompressed WAV output, but in the next steps, you’ll encode them back into a single video.

Step 4 – Generating Captions Using Whisper

Now that you have your instrumental audio track, you just need to generate captions from the original video. You could run whisper directly on the .webm video you downloaded, but it will be even quicker to run the yt_whisper command on the original YouTube video link:

yt_whisper "https://www.youtube.com/watch?v=dA2Iv9evEK4"

If you review the yt_whisper source code, you can understand the presets that yt_whisper is passing to whisper to generate captions from a YouTube video. For example, it defaults to the --model small parameter. The Whisper documentation suggests that this model provides a good tradeoff between memory requirements, performance, and accuracy. If you ever need to run whisper by itself on another input source or with different parameters, you can use these presets as a frame of reference.

If you are running Whisper without a GPU, this command may take a few minutes to complete. This will generate a caption file for the video in the .vtt format. You can inspect the captions using head or a text editor to verify that they match the song lyrics:

head -20 Lie_2_You__ft__Dylan_Emmet____Leonell_Cassio__No_Copyright_Music.vtt

Output
WEBVTT

00:00.000 --> 00:07.000
I need feeling you on me And I guess in a way you do

00:07.000 --> 00:19.000
All my breath on revelin' emotions I need some space to think this through

00:19.000 --> 00:29.000
Call me all night long Try to give you hints in a hard to see

00:29.000 --> 00:39.000
Right on the line, no Losing it on you is the last thing I need

00:39.000 --> 00:49.000
If I'm honest, I'll just make you cry And I don't wanna fight with you

00:49.000 --> 00:57.000
I would rather lie to you But if I'm honest, now's not the right time
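
Because a .vtt file is plain text, you can also pull out just the lyrics to proofread them against the song. The following is a minimal sketch that assumes a simple file like the one above, with a WEBVTT header, timestamp lines containing -->, and blank lines between cues; vtt_lyrics is a hypothetical helper name.

```shell
#!/bin/sh
# vtt_lyrics FILE: prints only the caption text from a WebVTT file,
# skipping the WEBVTT header, the timestamp lines, and blank lines.
vtt_lyrics() {
    awk 'NR == 1 && /^WEBVTT/ { next }
         /-->/                { next }
         NF == 0              { next }
                              { print }' "$1"
}

# Example:
#   vtt_lyrics Lie_2_You__ft__Dylan_Emmet____Leonell_Cassio__No_Copyright_Music.vtt
```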

You now have your separate audio tracks and your caption file. In the final step, you’ll assemble them all back together using ffmpeg.

Step 5 – Merging Audio and Video Tracks with Captions

Finally, it’s time to combine your outputs into a finalized video containing 1) the original background video, 2) the isolated instrumental track you generated using Spleeter, and 3) the captions you generated using Whisper. This can be done with a single, slightly more complicated, ffmpeg command:

ffmpeg -i "Lie 2 You (ft. Dylan Emmet) – Leonell Cassio (No Copyright Music)-dA2Iv9evEK4.webm" -i output/audio/accompaniment.wav -i "Lie_2_You__ft__Dylan_Emmet____Leonell_Cassio__No_Copyright_Music.vtt" -map 0:v -map 1:a -map 2 -metadata:s:s:0 language=eng -c:v copy -c:a aac -c:s mov_text final.mp4

Unlike the earlier ffmpeg command, this command uses three different inputs: the .webm video, the .wav audio, and the .vtt captions. It uses several -map arguments to select the video track from the first input (input 0, since ffmpeg counts from 0), the audio track from the second, and the subtitles from the third, then tags the subtitle track as English: -map 0:v -map 1:a -map 2 -metadata:s:s:0 language=eng. Next, it specifies the codecs being used for each track:

-c:v copy means that you are preserving the original video source and not reencoding it. This usually saves time and preserves video quality (video encoding is usually the most CPU-intensive use of ffmpeg by far) as long as the original source is in a compatible format. youtube-dl often downloads video in the common H264 format, which can be used for streaming video, standalone .mp4 files, Blu-ray discs, and so on. A .webm download such as the one in this tutorial typically contains VP9 video instead, which ffmpeg can also copy into an .mp4 file, although fewer players support it. If your player cannot display the video, you can reencode it by replacing copy with libx264.
-c:a aac means that you are reencoding the audio to the AAC format. AAC is the default for most .mp4 video, is supported in virtually all environments, and provides a good balance between file size and audio quality.
-c:s mov_text specifies the subtitle format you are encoding. Even though your subtitles were in vtt format, mov_text is the typical subtitle format to embed within an .mp4 video itself. It is stored as a separate subtitle track rather than drawn onto the picture, so viewers may need to turn subtitles on in their player.

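Note: You may also want to shift your subtitles a couple of seconds earlier to help viewers anticipate which lines are coming next. You can do this by adding -itsoffset -2 to the ffmpeg command immediately before the -i for the .vtt file, because -itsoffset applies to the input that follows it.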
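
The placement of -itsoffset matters because it is an input option, so it only affects the next -i on the command line. The sketch below wraps the merge command in a function to make that placement explicit. The name merge_karaoke and the FFMPEG override variable are hypothetical conveniences, not part of ffmpeg itself.

```shell
#!/bin/sh
# merge_karaoke VIDEO AUDIO SUBTITLES OFFSET OUTPUT
# Runs the merge command from this step with -itsoffset placed directly
# before the subtitle input, so only the captions are shifted.
# Setting FFMPEG=echo prints the arguments instead of running ffmpeg.
merge_karaoke() {
    ${FFMPEG:-ffmpeg} -i "$1" -i "$2" -itsoffset "$4" -i "$3" \
        -map 0:v -map 1:a -map 2 -metadata:s:s:0 language=eng \
        -c:v copy -c:a aac -c:s mov_text "$5"
}

# Example (captions appear two seconds early):
#   merge_karaoke video.webm output/audio/accompaniment.wav captions.vtt -2 final.mp4
```

Lines that would start before the beginning of the video after the shift may not display as expected, so it is worth checking the first caption in the result.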

Finally, you provide an output file, final.mp4. Notice that you did not actually specify .mp4 output other than in this filename: ffmpeg will automatically infer an output format based on the output path you provide. When working with audio and video files, the codecs you use are generally more important than the file types themselves, which act as containers for the content. The important differences are in which video players expect to be able to read which kinds of files.

An .mp4 file containing H264 video and AAC audio is, as of this writing, the most common media file used anywhere. It will play in almost any environment, including directly in a browser without needing to download the file or configure a streaming server, and it can contain subtitles, so it is a very safe target. .mkv is another popular container format that supports more features, but it is not as widely deployed.
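
To see which codecs actually ended up inside the container, and to confirm that the captions were embedded, you can list the streams with ffprobe, which is installed alongside ffmpeg. This is a sketch only: list_streams and has_stream_type are hypothetical helper names, and the order of the comma-separated fields in ffprobe's output may differ from what you expect, so the check looks for the stream type in any position.

```shell
#!/bin/sh
# list_streams FILE: prints one comma-separated line per stream, containing
# the stream index, codec name, and codec type (video, audio, or subtitle).
list_streams() {
    ffprobe -v error -show_entries stream=index,codec_name,codec_type \
        -of csv=p=0 "$1"
}

# has_stream_type TYPE: reads list_streams output on standard input and
# succeeds when any line has TYPE as one of its fields.
has_stream_type() {
    grep -q -E "(^|,)$1(,|\$)"
}

# Example:
#   list_streams final.mp4 | has_stream_type subtitle || echo "No captions found."
```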

Your final.mp4 video can now be downloaded, shared, or projected up on the wall for karaoke night. Good luck with your performance!

You now have an end-to-end karaoke video solution using four tools. These can be combined into a standalone script, integrated into another application, or run interactively as needed.
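
As one possible starting point for a standalone script, the sketch below chains the local steps of this tutorial (Steps 2, 3, and 5) for a video and caption file you have already obtained. The names run and make_karaoke and the DRY_RUN variable are hypothetical. The script also assumes the intermediate file is called audio.mp3, so that Spleeter writes its output to output/audio.

```shell
#!/bin/sh
# run COMMAND...: executes the command, or only prints it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# make_karaoke VIDEO SUBTITLES OUTPUT
make_karaoke() {
    video=$1
    subs=$2
    out=$3

    # Step 2: extract the audio track to audio.mp3.
    run ffmpeg -i "$video" -c:a libmp3lame -qscale:a 1 audio.mp3 || return 1

    # Step 3: separate the vocals from the accompaniment.
    run spleeter separate -p spleeter:2stems -o output audio.mp3 || return 1

    # Step 5: merge the video, instrumental audio, and captions.
    run ffmpeg -i "$video" -i output/audio/accompaniment.wav -i "$subs" \
        -map 0:v -map 1:a -map 2 -metadata:s:s:0 language=eng \
        -c:v copy -c:a aac -c:s mov_text "$out"
}

# Example:
#   DRY_RUN=1 make_karaoke video.webm captions.vtt final.mp4
```

Running it once with DRY_RUN=1 prints the three commands without executing them, which is a low-risk way to check the filenames before committing to a long Spleeter run.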

Conclusion

In this tutorial, you used two machine learning tools to separate the vocals from a song and generate a set of captions from a source video, then joined the results back together. This is particularly useful for making karaoke videos from existing audio sources, but can also be applied to many other tasks.

Next, you may want to configure a video streaming server, or experiment with some other AI or Machine Learning libraries.


About the author

Alex Garnett

Author

Senior DevOps Technical Writer


Former Senior DevOps Technical Writer at DigitalOcean. Expertise in topics including Ubuntu 22.04, Linux, Rocky Linux, Debian 11, and more.


Gary Fonseca

February 14, 2023

I was looking for how to install whisper, I managed to complete all the steps and everything looked fine, but the captions never end in the final file, is just the Video without the Vocals, but no captions… I guess for my karaoke singers will have to know the lyrics xD

Peter Cox

February 16, 2024

Great tutorial! i am totally impressed from step-by-step instructions on using Whisper and Spleeter for creating karaoke videos. and you clearly explained it. Good Job

This work is licensed under a Creative Commons Attribution-NonCommercial- ShareAlike 4.0 International License.
