// Tutorial //

How To Make Karaoke Videos using Whisper and Spleeter AI Tools

Published on January 3, 2023
By Alex Garnett
Senior DevOps Technical Writer

Introduction

AI tools are useful for manipulating images, audio, or video to produce a novel result. Until recently, automatically editing images or audio was challenging to implement without using a significant amount of time and computing power, and even then it was often only possible to run turnkey filters to remove certain frequencies from sounds or change the color palette of images. Newer approaches, using AI models and enormous amounts of training data, are able to run much more sophisticated filtering and transformation techniques.

Spleeter and Whisper are open source AI tools designed for audio analysis and manipulation. Both were developed and released along with their own pre-trained models, making it possible to run them directly on your own input, such as MP3 or AAC audio files, without any additional configuration. Spleeter is used to separate vocal tracks from instrumental tracks of music. Whisper is used to generate subtitles for spoken language. They both have many uses individually, and they have a particular use together: they can be used to generate karaoke tracks from regular audio files. In this tutorial, you’ll use Whisper and Spleeter together to make your own karaoke tracks, which you can enjoy on their own or integrate into another application stack.

Prerequisites

These tools are available on most platforms. This tutorial will provide installation instructions for an Ubuntu 22.04 server, following our guide to Initial Server Setup with Ubuntu 22.04. You will need at least 3GB of memory to run Whisper and Spleeter, so if you are running on a resource-constrained server, you should consider enabling swap for this tutorial.

Both Spleeter and Whisper are Python libraries and require you to have installed Python and pip, the Python package manager. On Ubuntu, you can refer to Step 1 of How To Install Python 3 and Set Up a Programming Environment on an Ubuntu 22.04 Server.

Additionally, both Spleeter and Whisper use machine learning libraries that can run 10-20x faster on a GPU. If a GPU is not detected, they will automatically fall back to running on your CPU. Configuring GPU support is outside the scope of this tutorial, but it should work automatically after installing PyTorch in a GPU-enabled environment.
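If you want to check ahead of time whether Whisper’s PyTorch backend will find a GPU, you can probe for one from Python. This is a best-effort sketch (the function name is illustrative; Spleeter uses TensorFlow, which performs its own detection):

```python
def gpu_available():
    """Best-effort check: True if PyTorch reports a usable CUDA device."""
    try:
        import torch  # Whisper's machine learning backend
        return torch.cuda.is_available()
    except ImportError:
        # PyTorch not installed yet; Whisper will install it as a dependency
        return False
```

If this returns False, the tools will still work; they will simply run on your CPU and take longer.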

Step 1 – Installing Spleeter, Whisper, and Other Tools

First, you’ll need to use pip, Python’s package manager, to install the tools you’ll be using for this project. In addition to spleeter, you should also install youtube-dl, a script that can be used to download YouTube videos locally, which you’ll use to retrieve a sample video. Install them with pip install:

  sudo pip install spleeter youtube-dl

Rather than installing Whisper directly, you can install another library called yt-whisper directly from GitHub, also by using pip. yt-whisper includes Whisper itself as a dependency, so you’ll have access to the regular whisper command after installation, but this way you’ll also get the yt_whisper script, which makes downloading and subtitling videos from YouTube a one-step process. pip install can parse GitHub links to Python repositories when they are preceded with git+:

  sudo pip install git+https://github.com/m1guelpf/yt-whisper.git

Finally, you’ll want to make sure you have ffmpeg installed to do some additional audio and video manipulation. ffmpeg is a universal tool for manipulating, merging, and reencoding audio and video files. On Ubuntu, you can install it using the system package manager by running an apt update followed by apt install:

  sudo apt update
  sudo apt install ffmpeg

Now that you have the necessary tools installed, you’ll obtain sample audio and video in the next step.

Step 2 – Downloading and Separating Audio from a Video

youtube-dl, which you installed in Step 1, is a tool for downloading videos from YouTube to your local environment. Although you should take care when working with potentially copyrighted material, this can be useful in a number of situations, especially when you need to run additional processing on videos or use them as source material.

Using youtube-dl, download the video that you’ll be using for this tutorial. This sample link is to a public domain song called “Lie 2 You”, but you can use another:

  youtube-dl "https://www.youtube.com/watch?v=dA2Iv9evEK4&list=PLzCxunOM5WFJxaj103IzbkAvGigpclBjt"

youtube-dl will download the song along with some metadata and merge it into a single .webm video file. Depending on your environment, you can play this video in a local media player such as mpv.

Note: Because the use of youtube-dl is not explicitly supported by YouTube, downloads can occasionally be slow.

Next, you’ll separate the audio track from the video you just downloaded. This is a task where ffmpeg excels. You can use the following ffmpeg command to output the audio to a new file called audio.mp3:

  ffmpeg -i "Lie 2 You (ft. Dylan Emmet) – Leonell Cassio (No Copyright Music)-dA2Iv9evEK4.webm" -c:a libmp3lame -qscale:a 1 audio.mp3

This is an example of ffmpeg command syntax. In brief:

  • -i /path/to/input is the path to your input file, in this case the .webm video you just downloaded
  • -c:a libmp3lame specifies an audio codec to encode to. All audio and video needs to be encoded somehow, and libmp3lame is the most common mp3 encoder.
  • -qscale:a 1 specifies the quality of your output mp3, in this case corresponding to a variable bit rate around 220kbps. You can review other options in the ffmpeg documentation.
  • audio.mp3 is the name of your output file, presented at the end of the command without any other flags.

After running this command, ffmpeg will create a new file called audio.mp3.
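If you plan to script these steps later, the same command can be assembled in Python with the standard subprocess module. This is a minimal sketch; the helper name is illustrative:

```python
import subprocess

def extract_audio_cmd(video_path, out_path="audio.mp3", quality=1):
    """Build the ffmpeg argument list used above:
    ffmpeg -i INPUT -c:a libmp3lame -qscale:a QUALITY OUTPUT"""
    return [
        "ffmpeg",
        "-i", video_path,           # input video file
        "-c:a", "libmp3lame",       # encode the audio track as MP3
        "-qscale:a", str(quality),  # VBR quality setting (1 is roughly 220kbps)
        out_path,                   # output filename; format inferred from extension
    ]

# To actually run it (requires ffmpeg on your PATH):
# subprocess.run(extract_audio_cmd("input.webm"), check=True)
```

Passing the arguments as a list also avoids shell-quoting problems with filenames that contain spaces or parentheses, like the one youtube-dl produced.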

Note: You can learn more about ffmpeg options from ffmprovisr, a community-maintained catalog of ffmpeg command examples, or refer to the official documentation.

In the next step, you’ll use Spleeter to isolate the instrumental track from your new audio.mp3 file.

Step 3 – Separating Vocal Tracks Using Spleeter

Now that you have your standalone audio file, you’re ready to use spleeter to separate the vocal track. Spleeter contains several models for use with the spleeter separate command, allowing you to perform even more sophisticated separation into piano, drum, and bass tracks, but for now, you’ll use the default 2stems model, which separates vocals from accompaniment. Run spleeter separate on your audio.mp3, also providing a path to an -o output directory:

  spleeter separate -p spleeter:2stems -o output audio.mp3

If you are running Spleeter without a GPU, this command may take a few minutes to complete. This will produce a new directory called output, containing two files called vocals.wav and accompaniment.wav. These are your separated vocal and instrumental tracks. If you encounter any errors, or need to further customize your Spleeter output, refer to the documentation.

You can try listening to these files in mpv or another audio player. They will have relatively large file sizes for now, because spleeter decodes them directly to raw WAV output, but in the next steps, you’ll encode them back into a single video.
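You can also inspect the separated files programmatically. This is a small sketch using Python’s standard wave module (the function name is illustrative; the paths are the ones Spleeter produced above):

```python
import wave

def wav_info(path):
    """Return (channels, sample_rate, duration_seconds) for a WAV file."""
    with wave.open(path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return w.getnchannels(), rate, frames / rate

# Example:
# channels, rate, seconds = wav_info("output/audio/vocals.wav")
```

Both output files should report the same duration as the original audio.mp3, since Spleeter separates the tracks without trimming them.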

Step 4 – Generating Captions Using Whisper

Now that you have your instrumental audio track, you just need to generate captions from the original video. You could run whisper directly on the .webm video you downloaded, but it will be even quicker to run the yt_whisper command on the original YouTube video link:

  yt_whisper "https://www.youtube.com/watch?v=dA2Iv9evEK4&list=PLzCxunOM5WFJxaj103IzbkAvGigpclBjt"

If you review the yt_whisper source code, you can understand the presets that yt_whisper is passing to whisper to generate captions from a YouTube video. For example, it defaults to the --model small parameter. The Whisper documentation suggests that this model provides a good tradeoff between memory requirements, performance, and accuracy. If you ever need to run whisper by itself on another input source or with different parameters, you can use these presets as a frame of reference.

If you are running Whisper without a GPU, this command may take a few minutes to complete. This will generate a caption file for the video in the .vtt format. You can inspect the captions using head or a text editor to verify that they match the song lyrics:

  head -20 Lie_2_You__ft__Dylan_Emmet____Leonell_Cassio__No_Copyright_Music.vtt
Output
WEBVTT

00:00.000 --> 00:07.000
I need feeling you on me And I guess in a way you do

00:07.000 --> 00:19.000
All my breath on revelin' emotions I need some space to think this through

00:19.000 --> 00:29.000
Call me all night long Try to give you hints in a hard to see

00:29.000 --> 00:39.000
Right on the line, no Losing it on you is the last thing I need

00:39.000 --> 00:49.000
If I'm honest, I'll just make you cry And I don't wanna fight with you

00:49.000 --> 00:57.000
I would rather lie to you But if I'm honest, now's not the right time
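Each cue in the .vtt file pairs a start --> end timestamp range with a line of lyrics. If you ever need to post-process the captions, a minimal parser is straightforward. This sketch assumes the MM:SS.mmm timestamp form shown above (Whisper can emit an HH:MM:SS.mmm form for longer files, which this does not handle), and the function name is illustrative:

```python
import re

# Matches cue timing lines like "00:07.000 --> 00:19.000"
CUE_RE = re.compile(r"(\d+):(\d+)\.(\d+) --> (\d+):(\d+)\.(\d+)")

def parse_cues(vtt_text):
    """Return (start_seconds, end_seconds, text) tuples from WebVTT content."""
    lines = vtt_text.splitlines()
    cues = []
    for i, line in enumerate(lines):
        m = CUE_RE.match(line.strip())
        if m:
            mm1, ss1, ms1, mm2, ss2, ms2 = (int(g) for g in m.groups())
            start = mm1 * 60 + ss1 + ms1 / 1000
            end = mm2 * 60 + ss2 + ms2 / 1000
            # The cue text is on the line following the timing line
            text = lines[i + 1].strip() if i + 1 < len(lines) else ""
            cues.append((start, end, text))
    return cues
```

This can be useful for sanity checks, such as confirming that the cues cover the full length of the song and do not overlap.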

You now have your separate audio tracks and your caption file. In the final step, you’ll assemble them all back together using ffmpeg.

Step 5 – Merging Audio and Video Tracks with Captions

Finally, it’s time to combine your outputs into a finalized video containing 1) the original background video, 2) the isolated instrumental track you generated using Spleeter, and 3) the captions you generated using Whisper. This can be done with a single, slightly more complicated, ffmpeg command:

  1. ffmpeg -i "Lie 2 You (ft. Dylan Emmet) – Leonell Cassio (No Copyright Music)-dA2Iv9evEK4.webm" -i output/audio/accompaniment.wav -i "Lie_2_You__ft__Dylan_Emmet____Leonell_Cassio__No_Copyright_Music.vtt" -map 0:v -map 1:a -map 2 -metadata:s:s:0 language=eng -c:v copy -c:a aac -c:s mov_text final.mp4

Unlike the earlier ffmpeg command, this command uses three different inputs: the .webm video, the .wav audio, and the .vtt captions. It uses several map arguments to map the first input (numbered 0, since ffmpeg counts from zero) to the video track, the second to the audio track, and the third to a subtitle track, which it then tags as English, like so: -map 0:v -map 1:a -map 2 -metadata:s:s:0 language=eng. Next, it specifies the codecs being used for each track:

  • c:v copy means that you are preserving the original video source and not reencoding it. This usually saves time and preserves video quality (video encoding is usually the most CPU-intensive use of ffmpeg by far) as long as the original source is in a compatible format. youtube-dl will almost always default to the common H.264 format, which can be used for streaming video, standalone .mp4 files, Blu-ray discs, and so on, so you should not need to change this.

  • c:a aac means that you are reencoding the audio to the AAC format. AAC is the default for most .mp4 video, is supported in virtually all environments, and provides a good balance between file size and audio quality.

  • c:s mov_text specifies the subtitle format you are encoding to. Even though your subtitles were generated in .vtt format, mov_text is the standard subtitle format for embedding within an .mp4 video itself.

Note: You may also want to shift your subtitles a couple of seconds earlier to help viewers anticipate which lines are coming next. You can do this by adding -itsoffset -2 before the subtitle input (the third -i) in the ffmpeg command.
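Alternatively, you could shift the caption file itself before merging. This is a rough sketch that rewrites every timestamp in a WebVTT string, assuming the MM:SS.mmm form Whisper produced above (the function name is illustrative, and negative results are clamped to zero):

```python
import re

# Matches MM:SS.mmm timestamps like "00:07.000"
TS = re.compile(r"(\d+):(\d+)\.(\d+)")

def shift_vtt(text, offset_seconds):
    """Shift every MM:SS.mmm timestamp in a WebVTT string by offset_seconds."""
    def repl(m):
        mm, ss, ms = (int(g) for g in m.groups())
        total = max(0.0, mm * 60 + ss + ms / 1000 + offset_seconds)
        return f"{int(total // 60):02d}:{int(total % 60):02d}.{round(total * 1000) % 1000:03d}"
    return TS.sub(repl, text)
```

For example, shift_vtt(captions, -2) moves every cue two seconds earlier. Note that the pattern would also match a timestamp-like string inside a lyric line, which is unlikely but worth knowing about.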

Finally, you provide an output format, final.mp4. Notice that you did not actually specify .mp4 output other than in this filename: ffmpeg will automatically infer an output format based on the output path you provide. When working with audio and video files, the codecs you use are generally more important than the file types themselves, which act as containers for the content; what matters in practice is which kinds of files a given video player expects to be able to read. An .mp4 file containing H.264 video and AAC audio is, as of this writing, the most common media file used anywhere, and will play in almost any environment, including directly in a browser, without needing to download the file or configure a streaming server. It can also contain subtitles, so it is a very safe target. .mkv is another popular container format that supports more features, but it is not as widely deployed.

Your final.mp4 video can now be downloaded, shared, or projected up on the wall for karaoke night. Good luck with your performance!

You now have an end-to-end karaoke video solution using four tools. These can be combined into a standalone script, integrated into another application, or run interactively as needed.

Conclusion

In this tutorial, you used two machine learning tools to create a separated vocal track and a set of captions from a source video, then joined them back together. This is uniquely useful for making karaoke videos from existing audio sources, but can also be applied to many other tasks.

Next, you may want to configure a video streaming server, or experiment with some other AI or Machine Learning libraries.
