By Alex Garnett
Senior DevOps Technical Writer
AI tools are useful for manipulating images, audio, or video to produce a novel result. Until recently, automatically editing images or audio was challenging to implement without using a significant amount of time and computing power, and even then it was often only possible to run turnkey filters to remove certain frequencies from sounds or change the color palette of images. Newer approaches, using AI models and enormous amounts of training data, are able to run much more sophisticated filtering and transformation techniques.
Spleeter and Whisper are open source AI tools designed for audio analysis and manipulation. Both were developed and released along with their own pre-trained models, making it possible to run them directly on your own input, such as MP3 or AAC audio files, without any additional configuration. Spleeter is used to separate vocal tracks from instrumental tracks of music. Whisper is used to generate subtitles for spoken language. They both have many uses individually, and they have a particular use together: they can be used to generate karaoke tracks from regular audio files. In this tutorial, you’ll use Whisper and Spleeter together to make your own karaoke selections, which you can also integrate into another application stack.
These tools are available on most platforms. This tutorial provides installation instructions for an Ubuntu 22.04 server, following our guide to Initial Server Setup with Ubuntu 22.04. You will need at least 3GB of memory to run Whisper and Spleeter, so if you are running on a resource-constrained server, you should consider enabling swap for this tutorial.
Both Spleeter and Whisper are Python libraries and require you to have installed Python and pip, the Python package manager. On Ubuntu, you can refer to Step 1 of How To Install Python 3 and Set Up a Programming Environment on an Ubuntu 22.04 Server.
Additionally, both Spleeter and Whisper use machine learning libraries that can optionally run up to 10-20x more quickly on a GPU. If a GPU is not detected, they will automatically fall back to running on your CPU. Configuring GPU support is outside the scope of this tutorial, but should work after installing PyTorch in GPU-enabled environments.
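If you want to confirm which device Whisper is likely to use before starting a long job, a minimal sketch like the following may help. It assumes only that PyTorch (which Whisper is built on) may or may not be installed; the helper name pick_device is hypothetical. It does not cover Spleeter, which relies on TensorFlow rather than PyTorch.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch is installed and can see a GPU, else "cpu"."""
    try:
        import torch  # normally installed as a Whisper dependency
    except ImportError:
        # Without PyTorch there is nothing to query, so report the CPU fallback.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Whisper would likely run on: {pick_device()}")
```

If this prints cpu on a machine that has a GPU, the GPU-enabled build of PyTorch is probably not installed.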
First, you’ll need to use pip, Python’s package manager, to install the tools you’ll be using for this project. In addition to spleeter, you should also install youtube-dl, a script that can be used to download YouTube videos locally, which you’ll use to retrieve a sample video. Install them with pip:
- sudo pip install spleeter youtube-dl
Rather than installing Whisper directly, you can install another library called yt-whisper directly from GitHub, also by using pip. yt-whisper includes Whisper itself as a dependency, so you’ll have access to the regular whisper command after installation, but this way you’ll also get the yt-whisper script, which makes downloading and subtitling videos from YouTube a one-step process.
pip install can parse GitHub links to Python repositories when you precede them with git+:
- sudo pip install git+https://github.com/m1guelpf/yt-whisper.git
Finally, you’ll want to make sure you have ffmpeg installed to do some additional audio and video manipulation. ffmpeg is a universal tool for manipulating, merging, and reencoding audio and video files. On Ubuntu, you can install it using the system package manager by running apt update followed by apt install ffmpeg:
- sudo apt update
- sudo apt install ffmpeg
Now that you have the necessary tools installed, you’ll obtain sample audio and video in the next step.
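Before moving on, it can be useful to confirm that each command-line tool actually landed on your PATH. The following is a minimal sketch using only the Python standard library; the tool names are the commands this tutorial uses, and find_missing is a hypothetical helper name.

```python
import shutil

# Commands used in this tutorial; adjust the list if your setup differs.
REQUIRED_TOOLS = ["spleeter", "youtube-dl", "yt_whisper", "whisper", "ffmpeg"]


def find_missing(tools):
    """Return the tool names that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = find_missing(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All tools found.")
```

A tool installed with pip can be missing here if pip placed its scripts in a directory that is not on your PATH.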
youtube-dl, which you installed in Step 1, is a tool for downloading videos from YouTube to your local environment. Although you should take care when using potentially copyrighted material out of context, this can be useful in a number of contexts, especially when you need to run some additional processing on videos or use them for source material.
Using youtube-dl, download the video that you’ll be using for this tutorial. This sample link is to a public domain song called “Lie 2 You”, but you can use another:
- youtube-dl --no-playlist "https://www.youtube.com/watch?v=dA2Iv9evEK4&list=PLzCxunOM5WFJxaj103IzbkAvGigpclBjt"
youtube-dl will download the song along with some metadata and merge it into a single .webm video file. You can play this video in a local media player such as mpv, but that will depend on your environment.
Note: Because the use of youtube-dl is not explicitly supported by YouTube, downloads can occasionally be slow.
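The sample link carries a list parameter that points at a playlist, joined to the video ID with an & character that shells treat specially when the URL is not quoted. If you plan to pass links around in a script, one possible approach is to reduce each link to its video ID first. This is a sketch using only the standard library; single_video_url is a hypothetical helper, and it assumes YouTube watch links that carry the ID in a v parameter.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def single_video_url(url: str) -> str:
    """Drop every query parameter except the video ID (v)."""
    parts = urlsplit(url)
    video_ids = parse_qs(parts.query).get("v")
    if not video_ids:
        raise ValueError(f"no video ID found in {url!r}")
    query = urlencode({"v": video_ids[0]})
    # Rebuild the URL with the reduced query string and no fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


print(single_video_url(
    "https://www.youtube.com/watch?v=dA2Iv9evEK4&list=PLzCxunOM5WFJxaj103IzbkAvGigpclBjt"
))
```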
Next, you’ll separate the audio track from the video you just downloaded. This is a task where ffmpeg excels. You can use the following ffmpeg command to output the audio to a new file called audio.mp3:
- ffmpeg -i "Lie 2 You (ft. Dylan Emmet) – Leonell Cassio (No Copyright Music)-dA2Iv9evEK4.webm" -c:a libmp3lame -qscale:a 1 audio.mp3
This is an example of ffmpeg command syntax. In brief:
-i /path/to/input is the path to your input file, in this case the .webm video you just downloaded.
-c:a libmp3lame specifies an audio codec to encode to. All audio and video needs to be encoded somehow, and libmp3lame is the most common mp3 encoder.
-qscale:a 1 specifies the quality level of your output mp3, in this case corresponding to a variable bit rate around 220kbps. You can review other options in the ffmpeg documentation.
audio.mp3 is the name of your output file, presented at the end of the command without any other flags.
After running this command, FFmpeg will create a new file called audio.mp3.
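If you later call this step from Python rather than typing it by hand, the same options can be assembled as an argument list. The sketch below only builds the list; the function name extract_audio_cmd is hypothetical, and the flags are the ones used in the command above.

```python
def extract_audio_cmd(video_path: str, audio_path: str = "audio.mp3") -> list:
    """Build the ffmpeg argument list that re-encodes a video's audio to MP3."""
    return [
        "ffmpeg",
        "-i", video_path,       # input file
        "-c:a", "libmp3lame",   # MP3 encoder
        "-qscale:a", "1",       # variable bit rate quality level
        audio_path,             # output file, given last without a flag
    ]


print(extract_audio_cmd("input.webm"))
```

Passing a list like this to subprocess.run(cmd, check=True) avoids shell quoting entirely, which is convenient for file names containing spaces and parentheses like the one youtube-dl produced.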
Note: You can learn more about ffmpeg options from ffmprovisr, a community-maintained catalog of ffmpeg command examples, or refer to the official documentation.
In the next step, you’ll use Spleeter to isolate the instrumental track from your new audio.mp3 file.
Now that you have your standalone audio file, you’re ready to use spleeter to separate the vocal track. Spleeter contains several models for use with the spleeter separate command, allowing you to perform even more sophisticated separation of piano, guitar, drum, bass tracks and so on, but for now, you’ll use the default 2stems model. Run spleeter separate on your audio.mp3, also providing a path to an -o output directory:
- spleeter separate -p spleeter:2stems -o output audio.mp3
If you are running Spleeter without a GPU, this command may take a few minutes to complete. This will produce a new directory called output, with a subdirectory named after your input file (audio) containing two files called vocals.wav and accompaniment.wav. These are your separated vocal and instrumental tracks. If you encounter any errors, or need to further customize your Spleeter output, refer to the documentation.
You can try listening to these files in mpv or another audio player. They will have a relatively large file size for now because spleeter decodes them directly to raw WAV output, but in the next steps, you’ll encode them back into a single video.
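When scripting this step, it helps to know in advance where the separated tracks will be written. The sketch below assumes the layout used later in this tutorial (output/audio/accompaniment.wav), meaning a subdirectory named after the input file without its extension; the helper names are hypothetical.

```python
from pathlib import Path


def spleeter_cmd(audio_path: str, output_dir: str = "output") -> list:
    """Build the same spleeter command used in this step as an argument list."""
    return ["spleeter", "separate", "-p", "spleeter:2stems", "-o", output_dir, audio_path]


def expected_stems(audio_path: str, output_dir: str = "output") -> dict:
    """Predict where the 2stems model writes its two WAV files."""
    stem_dir = Path(output_dir) / Path(audio_path).stem
    return {
        "vocals": stem_dir / "vocals.wav",
        "accompaniment": stem_dir / "accompaniment.wav",
    }


for name, path in expected_stems("audio.mp3").items():
    print(name, "->", path)
```

After the command finishes, checking path.exists() for each entry is a quick way to confirm the separation succeeded.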
Now that you have your instrumental audio track, you just need to generate captions from the original video. You could run whisper directly on the .webm video you downloaded, but it will be even quicker to run the yt_whisper command on the original YouTube video link:
- yt_whisper https://www.youtube.com/watch?v=dA2Iv9evEK4&list=PLzCxunOM5WFJxaj103IzbkAvGigpclBjt
If you review the yt_whisper source code, you can understand the presets that yt_whisper is passing to whisper to generate captions from a YouTube video. For example, it defaults to the --model small parameter. The Whisper documentation suggests that this model provides a good tradeoff between memory requirements, performance, and accuracy. If you ever need to run whisper by itself on another input source or with different parameters, you can use these presets as a frame of reference.
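If you would rather call Whisper from Python than from the command line, the following sketch shows one way to transcribe a local file with the small model and format the result as WebVTT. It uses Whisper’s load_model and transcribe functions; the helper names and the hand-written VTT formatting are assumptions for illustration, not the code that yt_whisper itself runs.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as MM:SS.mmm, adding an hours field past one hour."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    prefix = f"{hours:02d}:" if hours else ""
    return f"{prefix}{minutes:02d}:{secs:02d}.{millis:03d}"


def segments_to_vtt(segments) -> str:
    """Turn Whisper-style segments (start, end, text) into WebVTT text."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        lines.append(f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")  # blank line between cues
    return "\n".join(lines)


def transcribe_to_vtt(audio_path: str, model_name: str = "small") -> str:
    """Transcribe a local file with Whisper and return WebVTT captions."""
    import whisper  # imported here so the helpers above work without it

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return segments_to_vtt(result["segments"])
```

Calling transcribe_to_vtt("audio.mp3") and writing the returned string to a .vtt file should give you captions comparable to the ones yt_whisper generates, although line breaks within each caption may differ.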
If you are running Whisper without a GPU, this command may take a few minutes to complete. This will generate a caption file for the video in the .vtt format. You can inspect the captions using head or a text editor to verify that they match the song lyrics:
- head -20 Lie_2_You__ft__Dylan_Emmet____Leonell_Cassio__No_Copyright_Music.vtt
Output
WEBVTT

00:00.000 --> 00:07.000
I need feeling you on me
And I guess in a way you do
00:07.000 --> 00:19.000
All my breath on revelin' emotions
I need some space to think this through
00:19.000 --> 00:29.000
Call me all night long
Try to give you hints in a hard to see
00:29.000 --> 00:39.000
Right on the line, no
Losing it on you is the last thing I need
00:39.000 --> 00:49.000
If I'm honest, I'll just make you cry
And I don't wanna fight with you
00:49.000 --> 00:57.000
I would rather lie to you
But if I'm honest, now's not the right time
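To check the captions programmatically instead of by eye, you could parse the file into timed cues. This is a minimal sketch that handles only simple output like the sample above (a WEBVTT header followed by timestamp lines and caption text); it is not a full WebVTT parser, and parse_vtt is a hypothetical helper name.

```python
def parse_vtt(text: str):
    """Return a list of (start, end, caption) tuples from simple WebVTT text."""
    cues = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if "-->" in line:
            # A timestamp line starts a new cue.
            start, end = (part.strip() for part in line.split("-->", 1))
            # Keep only the timestamp in case cue settings follow it.
            cues.append((start, end.split()[0], []))
        elif line and cues:
            # Any other non-empty line after the first cue is caption text.
            cues[-1][2].append(line)
    return [(start, end, " ".join(body)) for start, end, body in cues]


sample = """WEBVTT

00:00.000 --> 00:07.000
I need feeling you on me
And I guess in a way you do
00:07.000 --> 00:19.000
All my breath on revelin' emotions
I need some space to think this through
"""

for start, end, caption in parse_vtt(sample):
    print(start, end, caption)
```

An empty result from a real caption file is a quick signal that Whisper did not detect any speech, which is worth ruling out before you embed the captions in a video.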
You now have your separate audio tracks and your caption file. In the final step, you’ll assemble them all back together using ffmpeg.
Finally, it’s time to combine your outputs into a finalized video containing 1) the original background video, 2) the isolated instrumental track you generated using Spleeter, and 3) the captions you generated using Whisper. This can be done with a single, slightly more complicated, ffmpeg command:
- ffmpeg -i "Lie 2 You (ft. Dylan Emmet) – Leonell Cassio (No Copyright Music)-dA2Iv9evEK4.webm" -i output/audio/accompaniment.wav -i "Lie_2_You__ft__Dylan_Emmet____Leonell_Cassio__No_Copyright_Music.vtt" -map 0:v -map 1:a -map 2 -metadata:s:s:0 language=eng -c:v copy -c:a aac -c:s mov_text final.mp4
Unlike the earlier ffmpeg command, this command uses three different inputs: the .webm video, the .wav audio, and the .vtt captions. It uses several -map arguments to take the video track from the first input (the 0th, counting from 0), the audio track from the second, and the subtitles from the third, and it labels the subtitle track as English, like so: -map 0:v -map 1:a -map 2 -metadata:s:s:0 language=eng. Next, it specifies the codecs being used for each track:
-c:v copy means that you are preserving the original video source and not reencoding it. This usually saves time and preserves video quality (video encoding is usually the most CPU-intensive use of ffmpeg by far) as long as the original source is in a compatible format. youtube-dl will almost always default to using the common H264 format, which can be used for streaming video, standalone .mp4 files, Blu-ray discs, and so on, so you should not need to change this.
-c:a aac means that you are reencoding the audio to the AAC format. AAC is the default for most .mp4 video, is supported in virtually all environments, and provides a good balance between file size and audio quality.
-c:s mov_text specifies the subtitle format you are encoding. Even though your subtitles were in .vtt format, mov_text is a typical subtitle format to embed within a video itself.
Note: You may also want to shift your subtitles earlier by a couple of seconds to help viewers anticipate which lines are coming next. You can do this by adding -itsoffset -2 to the ffmpeg command.
Finally, you provide an output file, final.mp4. Notice that you did not actually specify .mp4 output other than in this filename: ffmpeg will automatically infer an output format based on the output path you provide. When working with audio and video files, the codecs you use are generally more important than the file types themselves, which act as containers for the content. The important differences are in which video players expect to be able to read which kinds of files. An .mp4 file containing H264 video and AAC audio is, as of this writing, the most common media file used anywhere, and will play in almost any environment, including directly in a browser without needing to download the file or configure a streaming server, and it can contain subtitles, so it is a very safe target. .mkv is another popular container format that supports more features, but it is not as widely deployed.
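To see how the inputs, stream mappings, and codecs fit together, the sketch below assembles the same command as an argument list, with an optional subtitle offset. The function name karaoke_cmd is hypothetical, and placing -itsoffset immediately before the captions input is an assumption based on ffmpeg applying that option to the input that follows it.

```python
def karaoke_cmd(video, instrumental, captions, output="final.mp4", subtitle_offset=None):
    """Build the ffmpeg argument list that muxes video, audio, and captions."""
    cmd = ["ffmpeg", "-i", video, "-i", instrumental]
    if subtitle_offset is not None:
        # Assumption: the offset should affect only the captions input,
        # so it is placed directly before that -i.
        cmd += ["-itsoffset", str(subtitle_offset)]
    cmd += [
        "-i", captions,
        "-map", "0:v",                      # video from input 0
        "-map", "1:a",                      # audio from input 1
        "-map", "2",                        # subtitles from input 2
        "-metadata:s:s:0", "language=eng",  # label the subtitle track
        "-c:v", "copy",                     # keep the video as-is
        "-c:a", "aac",                      # re-encode audio to AAC
        "-c:s", "mov_text",                 # embed subtitles as mov_text
        output,
    ]
    return cmd


print(karaoke_cmd("video.webm", "output/audio/accompaniment.wav", "captions.vtt", subtitle_offset=-2))
```

The file names in the final line are placeholders. Substitute the paths produced in the earlier steps, then pass the list to subprocess.run(cmd, check=True) to run it.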
Your final.mp4 video can now be downloaded, shared, or projected up on the wall for karaoke night. Good luck with your performance!
You now have an end-to-end karaoke video solution using four tools. These can be combined into a standalone script, integrated into another application, or run interactively as needed.
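As one possible starting point for a standalone script, the sketch below runs a list of commands in order and stops at the first failure, with a dry-run mode that only returns the commands for review. The run_steps helper and the video.webm and captions.vtt file names are hypothetical placeholders for the files produced in the earlier steps.

```python
import shlex
import subprocess


def run_steps(steps, dry_run=True):
    """Run each command in order, stopping at the first failure.

    With dry_run=True nothing is executed; the shell-quoted commands
    are returned so they can be reviewed first.
    """
    rendered = [shlex.join(cmd) for cmd in steps]
    if not dry_run:
        for cmd in steps:
            # check=True raises CalledProcessError if a command fails.
            subprocess.run(cmd, check=True)
    return rendered


# Placeholder file names; replace them with your own downloaded files.
STEPS = [
    ["ffmpeg", "-i", "video.webm", "-c:a", "libmp3lame", "-qscale:a", "1", "audio.mp3"],
    ["spleeter", "separate", "-p", "spleeter:2stems", "-o", "output", "audio.mp3"],
    ["ffmpeg", "-i", "video.webm", "-i", "output/audio/accompaniment.wav",
     "-i", "captions.vtt", "-map", "0:v", "-map", "1:a", "-map", "2",
     "-metadata:s:s:0", "language=eng",
     "-c:v", "copy", "-c:a", "aac", "-c:s", "mov_text", "final.mp4"],
]

for line in run_steps(STEPS):
    print(line)
```

Calling run_steps(STEPS, dry_run=False) would execute the commands for real, so review the dry-run output first.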
In this tutorial, you used two machine learning tools to create a separated vocal track and a set of captions from a source video, then joined them back together. This is uniquely useful for making karaoke videos from existing audio sources, but can also be applied to many other tasks.
Next, you may want to configure a video streaming server, or experiment with some other AI or Machine Learning libraries.
I was looking for how to install whisper, I managed to complete all the steps and everything looked fine, but the captions never end in the final file, is just the Video without the Vocals, but no captions… I guess for my karaoke singers will have to know the lyrics xD