Update: We have also created Part 2 of this tutorial on Textual Inversion, and it is now available online.
Dreambooth offers an exciting new approach to fine-tuning Latent Diffusion models, including the widely used pre-trained model Stable Diffusion.
This method lets users supply as few as three images of a subject along with its class name (e.g., “dog,” “human,” “building”) to fine-tune and personalize any text-to-image model. The fine-tuned model binds a unique identifier to the subject. Combined with the remarkable versatility of Stable Diffusion, this lets us create impressive models with robust representations of any object or style we choose.
In this tutorial, we will walk step-by-step through the setup, training, and inference of a Dreambooth Stable Diffusion model within a Jupyter notebook.
Once we have launched the Notebook, make sure to follow the instructions on the page to set up the environment. We will start by running the install cell at the top to get the necessary packages.
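The exact install cell is not reproduced here, but it typically boils down to a single pip command along these lines (the package list is assumed; defer to the notebook's own cell):

```python
# Install the core libraries the notebook relies on (package list assumed;
# the notebook's own install cell is authoritative).
!pip install -qq diffusers transformers accelerate ftfy
```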
We will log into Hugging Face to access the model files. Make sure to obtain permission to access the files using your Hugging Face account. Once you have permission, paste the token found here into the code cell below where it says <your_huggingface_token>.
Run the cell to log in.
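One way this login cell can look, assuming the huggingface_hub helper, is:

```python
# Authenticate with Hugging Face so the gated model files can be downloaded.
# Replace the placeholder with the access token from your account settings.
from huggingface_hub import login

login(token="<your_huggingface_token>")
```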
Finally, we will complete the setup by importing the relevant libraries and creating an image_grid function to display our image outputs in a grid based on the number of samples and rows.
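A minimal version of that helper, mirroring the one used throughout the Diffusers notebooks, looks like this:

```python
from PIL import Image

def image_grid(imgs, rows, cols):
    # Paste a list of PIL images into a single rows x cols grid image.
    assert len(imgs) == rows * cols
    w, h = imgs[0].size
    grid = Image.new("RGB", size=(cols * w, rows * h))
    for i, img in enumerate(imgs):
        grid.paste(img, box=(i % cols * w, i // cols * h))
    return grid
```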
Now that the setup is complete, we will begin setting up the data and concept variables that we will use for fine-tuning later on.
We have two options moving forward: we can either follow the demo exactly as it is or use our own images. If we choose to use our own images, we should create an "inputs" directory and upload our images there. If we prefer to follow the demo, we can run the cell below. This will create the "inputs" directory and set up a list of image URLs to download into that directory.
Now, we need to begin setting up the subject for our DreamBooth training by uploading or downloading images for our concept into the inputs directory.
You can also use the code in the cell above to alter the sample image URLs downloaded for this demo, if the data is stored publicly online.
This cell allows users to download data to the inputs folder for the demo.
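Since the cell itself is not reproduced above, here is a rough sketch of what such a download cell could look like (the URLs are placeholders; substitute your own publicly hosted images):

```python
import os
import requests

# Download each sample image URL into the "inputs" directory.
urls = [
    "https://example.com/cat-toy-1.jpeg",  # placeholder URLs
    "https://example.com/cat-toy-2.jpeg",
]

os.makedirs("inputs", exist_ok=True)
for i, url in enumerate(urls):
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open(f"inputs/{i}.jpeg", "wb") as f:
        f.write(response.content)
```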
To ensure quality concept tuning, these images should be consistent across a theme but differ in subtle ways like lighting, perspective, distance, and noise. Below is a grid showing the images we have now downloaded.
As we can see, the same subject is shown from a variety of angles and perspectives and with diverse backgrounds to prevent over-fitting.
Next, we will instantiate the training variables for model training and those that will correspond to the image data. We will use Runway ML’s Stable Diffusion-v1-5 checkpoint for this demonstration. Be sure to click the acknowledgment to access the model on its homepage, if you have not yet done so.
Next, we set the settings for our model training.
The instance_prompt is a prompt that should contain a good description of what your object or style is. It is paired together with the initializer word sks.
We can set prior_preservation to True if we want the class of the concept (e.g., toy, dog, painting) to be preserved. This increases quality and helps with generalization at the cost of training time.
We then set prior_preservation_class_folder and class_data_root to set the input folder path. The batch size is set to 4, and the prior_loss_weight is set to 0.5 to determine the strength of the class for prior preservation. The total number of class images is determined by the folder size.
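Put together, the concept settings described above might be collected in a cell like this (a sketch; the notebook's exact variable names and defaults may differ):

```python
import os

# Prompt describing the subject, paired with the rare identifier "sks"
instance_prompt = "a photo of sks cat toy"

# Prior preservation keeps the broader class (e.g., "toy") from drifting during tuning
prior_preservation = True
prior_preservation_class_prompt = "a photo of a cat toy"
prior_preservation_class_folder = "./class_images"
class_data_root = prior_preservation_class_folder

sample_batch_size = 4
prior_loss_weight = 0.5  # strength of the class images for prior preservation

# The total number of class images is determined by the folder's contents
num_class_images = len(os.listdir(class_data_root)) if os.path.isdir(class_data_root) else 0
```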
To get started teaching the desired concept, we need to instantiate the DreamBoothDataset and PromptDataset classes to handle the organization of the data inputs for training. These Dataset objects are constructed to ensure the input data is optimized for fine-tuning the model. The code for this is shown below:
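The cell below is a condensed sketch that follows the structure of the Diffusers DreamBooth example; the notebook's version adds further options (such as padding and crop controls), but the core idea is the same:

```python
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class DreamBoothDataset(Dataset):
    """Pairs instance images (and optional class images) with their tokenized prompts."""

    def __init__(self, instance_data_root, instance_prompt, tokenizer,
                 class_data_root=None, class_prompt=None, size=512):
        self.tokenizer = tokenizer
        self.instance_images = list(Path(instance_data_root).iterdir())
        self.instance_prompt = instance_prompt
        self.class_images = list(Path(class_data_root).iterdir()) if class_data_root else []
        self.class_prompt = class_prompt
        self._length = max(len(self.instance_images), len(self.class_images) or 1)

        self.image_transforms = transforms.Compose([
            transforms.Resize(size),
            transforms.CenterCrop(size),
            transforms.ToTensor(),
            transforms.Normalize([0.5], [0.5]),
        ])

    def __len__(self):
        return self._length

    def _tokenize(self, prompt):
        return self.tokenizer(
            prompt, padding="do_not_pad", truncation=True,
            max_length=self.tokenizer.model_max_length,
        ).input_ids

    def __getitem__(self, index):
        example = {}
        instance_image = Image.open(self.instance_images[index % len(self.instance_images)]).convert("RGB")
        example["instance_images"] = self.image_transforms(instance_image)
        example["instance_prompt_ids"] = self._tokenize(self.instance_prompt)

        if self.class_images:
            class_image = Image.open(self.class_images[index % len(self.class_images)]).convert("RGB")
            example["class_images"] = self.image_transforms(class_image)
            example["class_prompt_ids"] = self._tokenize(self.class_prompt)
        return example


class PromptDataset(Dataset):
    """Simple dataset of a single prompt, used to generate class images for prior preservation."""

    def __init__(self, prompt, num_samples):
        self.prompt = prompt
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, index):
        return {"prompt": self.prompt, "index": index}
```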
Next, we will load in the three separate components we will need to run this tuning process. These together form the full pipeline for the model. The final output for our concept will be in a similar format. Later in this tutorial, we will show how to convert this repository concept format into a classic Stable Diffusion checkpoint.
Remember, we need to visit the v1-5 checkpoint page and accept the user agreement to download these files.
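Assuming the v1-5 checkpoint, loading those components looks roughly like this:

```python
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel

# Load the pieces of the Stable Diffusion v1-5 pipeline we will fine-tune or
# condition on: the text encoder, the VAE, and the UNet (plus the tokenizer).
model_id = "runwayml/stable-diffusion-v1-5"

tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
```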
Now that the model has been set up, we need to instantiate the training parameters. By running the cell below, we instantiate the tuning arguments for the DreamBooth training process using Namespace.
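A sketch of the kind of argument bundle built with Namespace is shown here; the field names follow the list below, but the specific values are illustrative rather than the notebook's exact defaults:

```python
from argparse import Namespace

# Illustrative training arguments for the DreamBooth run (values are assumptions).
args = Namespace(
    pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5",
    instance_data_dir="inputs",
    instance_prompt="a photo of sks cat toy",
    resolution=512,
    train_batch_size=1,
    learning_rate=5e-6,
    max_train_steps=250,
    seed=3434554,
    mixed_precision="fp16",
    output_dir="dreambooth-concept",
)
```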
In particular, we may want to edit:
- max_train_steps: Arguably the most important argument. Raise or lower this to change the amount of training, which can drastically affect the outcome.
- seed: The seed for the sample images generated to compute the loss during training. It has a significant effect on the final results and on which features the model considers salient after training.
- output_dir: The final output directory for your trained Dreambooth concept.
- resolution: The size of the input training images.
- mixed_precision: Tells the model to use both full and half precision to accelerate training even further.

Here, we instantiate the training function for the training run below. This uses the accelerate package to add the ability to train on multi-GPU configurations. Follow the comments within to see what each step of the training function does before running the code in the cell below.
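The full training cell is long; the sketch below keeps only its skeleton and assumes that args, tokenizer, text_encoder, vae, unet, and a train_dataset (a DreamBoothDataset instance) already exist from earlier cells. It also omits details the notebook handles, such as the prior-preservation loss term, gradient accumulation and checkpointing, and 8-bit optimizers:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from accelerate import Accelerator
from diffusers import DDPMScheduler, StableDiffusionPipeline

def collate_fn(examples):
    # Stack instance images and pad the tokenized prompts into a single batch.
    input_ids = [e["instance_prompt_ids"] for e in examples]
    pixel_values = torch.stack([e["instance_images"] for e in examples])
    input_ids = tokenizer.pad({"input_ids": input_ids}, padding=True, return_tensors="pt").input_ids
    return {"pixel_values": pixel_values, "input_ids": input_ids}

def training_function(text_encoder, vae, unet):
    accelerator = Accelerator(mixed_precision=args.mixed_precision)
    noise_scheduler = DDPMScheduler.from_pretrained(
        args.pretrained_model_name_or_path, subfolder="scheduler"
    )

    train_dataloader = DataLoader(
        train_dataset, batch_size=args.train_batch_size, shuffle=True, collate_fn=collate_fn
    )
    optimizer = torch.optim.AdamW(unet.parameters(), lr=args.learning_rate)

    # Only the UNet is trained in this sketch; the VAE and text encoder stay frozen.
    vae.requires_grad_(False)
    text_encoder.requires_grad_(False)
    unet, optimizer, train_dataloader = accelerator.prepare(unet, optimizer, train_dataloader)
    vae.to(accelerator.device)
    text_encoder.to(accelerator.device)

    unet.train()
    global_step = 0
    while global_step < args.max_train_steps:
        for batch in train_dataloader:
            with torch.no_grad():
                # 1. Encode images into latents (frozen VAE) and embed the prompt (frozen text encoder)
                latents = vae.encode(batch["pixel_values"].to(accelerator.device)).latent_dist.sample() * 0.18215
                encoder_hidden_states = text_encoder(batch["input_ids"].to(accelerator.device))[0]

            # 2. Add noise at a random timestep and predict it with the UNet
            noise = torch.randn_like(latents)
            timesteps = torch.randint(
                0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],), device=latents.device
            ).long()
            noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
            noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample

            # 3. Compute the noise-prediction loss and take an optimizer step
            loss = F.mse_loss(noise_pred.float(), noise.float(), reduction="mean")
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()

            global_step += 1
            if global_step >= args.max_train_steps:
                break

    # Save the tuned components as a Diffusers pipeline in the output directory
    if accelerator.is_main_process:
        pipeline = StableDiffusionPipeline.from_pretrained(
            args.pretrained_model_name_or_path,
            unet=accelerator.unwrap_model(unet),
            text_encoder=text_encoder,
            vae=vae,
        )
        pipeline.save_pretrained(args.output_dir)
```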
Finally, we are ready to begin fine-tuning our DreamBooth concept. If we are using more than one GPU, we may now take advantage of acceleration and change the num_processes argument accordingly.
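With accelerate, that typically amounts to a single launch call (a sketch; raise num_processes to the number of GPUs available):

```python
from accelerate import notebook_launcher

# Launch the training function defined above on the available devices.
notebook_launcher(training_function, args=(text_encoder, vae, unet), num_processes=1)
```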
Once training has been completed, we can use the code in the inference section to sample the model with any prompt.
We first instantiate the pipe in half-precision format for less expensive sampling. The StableDiffusionPipeline.from_pretrained() function takes in our path to the concept directory to load in the fine-tuned model using the binary files inside.
We can then load our prompt variable into this pipeline to generate images corresponding to our desired output.
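A sketch of that inference cell, assuming the output directory name used earlier, might look like:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the fine-tuned concept in half precision for cheaper sampling.
pipe = StableDiffusionPipeline.from_pretrained(
    "dreambooth-concept", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of a cat toy riding a bicycle"
num_samples, num_rows = 3, 3

# Generate num_rows batches of num_samples images and tile them into one grid
all_images = []
for _ in range(num_rows):
    all_images.extend(pipe(prompt, num_images_per_prompt=num_samples).images)

image_grid(all_images, num_rows, num_samples)
```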
Following the demo, our prompt “a photo of a cat toy riding a bicycle” is called. We can then use num_samples and num_rows with the image_grid function to place the outputs within a single photo grid. Let’s look at an example of images generated with this prompt from a model trained on the sample images for a relatively short 250 epochs:
The generated images retain many qualities of the original cat toy images. The bottom center image closely resembles the original object. The remaining images display varying degrees of similarity to the original sample as the model attempts to position it on the bicycle. This is because a cat riding a bike is unlikely to be present in the original data, making it harder to recreate with diffusion than scattered features. The final images show a random collection of Dreambooth concept aspects combined with different prompt-related features. With further training, these images could better represent our prompt.
Let’s examine the results from a similar concept extended to 500 epochs of training:
As you can see, several of these samples approximate the human, qualitative interpretation of the prompt more accurately than the 250 epoch concept. In particular, the three samples in the middle row show all of the features of a clay cat toy riding a mostly accurate bicycle. The other rows feature a more realistic cat object but still show more complete bicycle images than the previous test. The difference is likely because, in the further-trained model, the concept object’s features are less likely to confound inference. Having been more exposed to the initial data, the model can recreate the central object of the concept with less interference in the generation of the additional features extrapolated from the prompt - in this case, a bicycle.
Furthermore, it is important to call attention to the consistent variation across each row caused by the random seed. Each row shows similar salient features that are not conserved across the entire grid. This is clear evidence that the seed effectively controls the randomness of the concept’s inference. A seed is also applied during training, and altering it can have major effects on the results of concept training. Be sure to test out various seeds in both tasks to get the best final outputs.
To continue from this stage, try extending training to 1000 epochs and see how the outputs improve or degrade. As the number of epochs grows, we need to be wary of overfitting. If the input images are not variable enough in terms of background features and object perspectives, and training runs long enough, it will be difficult to move the objects out of their original positions during generation. Make sure the input images contain diverse content to ensure a more robust final output for inference.
Finally, now that we have finished creating our Dreambooth concept, we are left with a Diffusers-compatible folder at the directory dreambooth-concept. If we want to use this with classic Stable Diffusion scripts or other projects like the Stable Diffusion Web UI, then we can use the following script to create a shareable, downloadable .ckpt file.
This script will clone the huggingface/diffusers library to our Notebook. We then change directories into scripts to make use of the convert_diffusers_to_original_stable_diffusion.py script. Simply input the model_path as the path to your repository holding the Dreambooth concept, the desired output path and name for your new checkpoint to checkpoint_path, and use the half flag if you would like to save it in half-precision format for less computationally expensive inference.
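In a notebook, that conversion step can be run roughly as follows (paths here are illustrative; point model_path at your concept folder and checkpoint_path at the desired .ckpt location):

```python
# Clone Diffusers and run the conversion script from its scripts directory.
!git clone https://github.com/huggingface/diffusers
%cd diffusers/scripts
!python convert_diffusers_to_original_stable_diffusion.py \
    --model_path ../../dreambooth-concept \
    --checkpoint_path ../../dreambooth-concept.ckpt \
    --half
```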
This tutorial is based on a Jupyter Notebook from the HuggingFace Diffusers Notebooks repo. We walked through the complete process of creating a Dreambooth concept from scratch and demonstrated how you can adapt this notebook for any type of image dataset to generate new, personalized concepts. Using text prompts, we showed how to sample from the trained model to produce novel images that reflect the original concept. We also explored how changes in training duration and seed values affect the output. To wrap up, we covered how to export the trained concept as a .ckpt
file, making it easy to reuse in future projects—including those running on platforms like DigitalOcean with powerful GPU Droplets for faster training and inference.