Self-supervised learning has driven major progress in natural language processing (NLP), allowing models to learn useful representations from large amounts of unlabelled text. Among these approaches, denoising autoencoders—which train models to reconstruct original text after masking out random words—have shown particularly strong results.
By learning to predict missing parts of a sentence, these models develop a deep understanding of grammar, context, and meaning. Recent research has further improved these methods by experimenting with how words are masked, the order in which predictions are made, and the context provided during training. While these improvements have pushed performance even further, many of the resulting models tend to be limited to specific tasks like span prediction or generation, restricting their broader usefulness. The BART model was introduced to address this limitation—offering a more general, flexible approach to self-supervised training that can handle a wide range of NLP tasks with high performance.
In order to follow along with this article, you will need experience with Python code and a beginner's understanding of Deep Learning. We will operate under the assumption that all readers have access to sufficiently powerful machines so they can run the code provided.
If you do not have access to a GPU, we suggest using DigitalOcean GPU Droplets.
To help you get started with Python code, we recommend following this beginner’s guide to set up your system, which will prepare you to run beginner tutorials.
The BART research paper presents a pre-training technique that integrates Bidirectional and Auto-Regressive Transformers. As a denoising autoencoder employing a sequence-to-sequence framework, BART proves valuable across diverse applications. Its pretraining process involves two stages: first, text is corrupted through a chosen noising function; second, a sequence-to-sequence model is trained to recover the original text.
BART's Transformer-based neural machine translation architecture can be seen as a generalization of BERT (with its bidirectional encoder), GPT (with its left-to-right decoder), and many other contemporary pre-training approaches.
In addition to its strength in comprehension tasks, BART is especially effective when fine-tuned for text generation. It matches the performance of RoBERTa on GLUE and SQuAD with comparable training resources, and it achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks.
Aside from changing the ReLU activation functions to GeLUs and initializing parameters from N(0, 0.02), BART follows the general sequence-to-sequence Transformer design (Vaswani et al., 2017). The encoder and decoder each have six layers in the base model and twelve layers in the large model.
The architecture is closely related to that of BERT, with two main differences: (1) each layer of BART's decoder additionally performs cross-attention over the final hidden layer of the encoder (as in the original Transformer sequence-to-sequence model); and (2) BERT uses an additional feed-forward network before word prediction, whereas BART does not.
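To make these design choices concrete, here is a small sketch using the Hugging Face BartConfig class; the layer counts, activation function, and initialization standard deviation mirror the base-model description above, while every other value is simply left at the library defaults.

```python
from transformers import BartConfig, BartForConditionalGeneration

# Base-sized BART: 6 encoder and 6 decoder layers, GeLU activations,
# and parameters initialized from N(0, 0.02).
config = BartConfig(
    encoder_layers=6,
    decoder_layers=6,
    activation_function="gelu",
    init_std=0.02,
)

# Instantiate a randomly initialized model from this configuration
model = BartForConditionalGeneration(config)
print(model.config)
```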
To train BART, we first corrupt documents and then optimize a reconstruction loss, which is the cross-entropy between the decoder’s output and the original document. In contrast to conventional denoising autoencoders, BART may be used for any type of document corruption.
In the extreme case, where all information about the source is lost, BART becomes equivalent to a language model. The researchers experiment with several new and existing transformations, but they also note there is considerable room for developing further alternatives.
In the following, we outline the transformations the authors experimented with; the original paper includes a figure illustrating examples of each:

- Token Masking: random tokens are replaced with a [MASK] element, as in BERT.
- Token Deletion: random tokens are deleted, and the model must decide which positions are missing.
- Text Infilling: spans of text, with lengths drawn from a Poisson distribution, are each replaced with a single [MASK] token.
- Sentence Permutation: the document is split into sentences, which are then shuffled into a random order.
- Document Rotation: a token is chosen at random, and the document is rotated so that it begins with that token.
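As a rough illustration (not the authors' implementation), a toy version of the token-masking transformation might look like the following, where each token is independently replaced with a mask symbol with some small probability:

```python
import random

def token_masking(tokens, mask_token="<mask>", mask_prob=0.15):
    """Toy noising function: replace each token with mask_token with probability mask_prob."""
    return [mask_token if random.random() < mask_prob else tok for tok in tokens]

corrupted = token_masking("the quick brown fox jumps over the lazy dog".split())
print(corrupted)  # e.g. ['the', '<mask>', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
```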
The representations BART produces can be used in several downstream settings, including sequence classification, token classification, and sequence generation. BART can also be fine-tuned for machine translation, where the entire model serves as a single pretrained decoder placed behind a new source-language encoder.
More specifically, the authors replace BART's encoder embedding layer with a new, randomly initialized encoder. The model is trained end to end, so the new encoder learns to map foreign words into an input that BART can then de-noise into English. In both stages of training, the cross-entropy loss is backpropagated from the output of the BART model to train the source encoder.
In the first stage, they freeze most of BART's parameters and update only the randomly initialized source encoder, BART's positional embeddings, and the self-attention input projection matrix of the first layer of BART's encoder. In the second stage, they train all model parameters for a small number of iterations.
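As a rough sketch of the first stage (not the authors' code), the snippet below shows how one might freeze BART and unfreeze only the pieces mentioned above using the Hugging Face implementation; the small new_source_encoder here is a hypothetical stand-in for the randomly initialized source encoder.

```python
import torch.nn as nn
from transformers import BartForConditionalGeneration

bart = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# Hypothetical stand-in for the new, randomly initialized source encoder
new_source_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True),
    num_layers=2,
)

# Stage 1: freeze all of BART's parameters ...
for param in bart.parameters():
    param.requires_grad = False

# ... then unfreeze the positional embeddings ...
for module in (bart.model.encoder.embed_positions, bart.model.decoder.embed_positions):
    for param in module.parameters():
        param.requires_grad = True

# ... and the self-attention input projections of the encoder's first layer.
first_attn = bart.model.encoder.layers[0].self_attn
for proj in (first_attn.q_proj, first_attn.k_proj, first_attn.v_proj):
    for param in proj.parameters():
        param.requires_grad = True

# The new source encoder (and the unfrozen pieces above) would then be trained
# with the cross-entropy loss backpropagated from BART's output.
```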
Sifting through all the long-form information on the internet to find what they need costs researchers and journalists a great deal of time. A summary or paraphrased synopsis lets you skim the highlights of lengthy material and save time and energy.
Transformer models can automate the NLP task of text summarization. There are two broad approaches: extractive and abstractive. Extractive summarization selects the most important sentences from the text and reproduces them verbatim, which makes it closer to a form of information retrieval. Abstractive summarization is more challenging: it aims to understand the entire document and produce a paraphrased text that captures the key points. This second type of summarization is what transformer models such as BART perform.
HuggingFace gives us quick and easy access to thousands of pretrained and fine-tuned Transformer checkpoints, including BART. You can pick a BART model tailored to text summarization from the HuggingFace model hub, where each hosted model comes with a description of its configuration and training.
The beginner-friendly bart-large-cnn model is a good place to start. To get set up, either follow the HuggingFace installation page or run pip install transformers. Next, we'll follow three easy steps to create our summary:
First, load the Transformers pipeline by naming both the task and the model. The task here is "summarization", and the model is "facebook/bart-large-xsum". If we want to try something different from the standard news dataset, we can use this checkpoint, which was fine-tuned on the Extreme Summarization (XSum) dataset and trained to generate one-sentence summaries exclusively.
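A minimal way to do this, assuming transformers (and PyTorch) are installed, looks like the following:

```python
from transformers import pipeline

# Build a summarization pipeline backed by the XSum-fine-tuned BART checkpoint
summarizer = pipeline("summarization", model="facebook/bart-large-xsum")
```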
The last step is to construct an input sequence and run it through the summarizer() pipeline. The summary length, measured in tokens, can be adjusted with the pipeline's optional max_length and min_length arguments.
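For example, using an arbitrary passage as input (the text below is just an illustration; any long-form text will do):

```python
# Example passage to summarize
text = (
    "The tower is 324 metres tall, about the same height as an 81-storey building, "
    "and is the tallest structure in Paris. Its base is square, measuring 125 metres "
    "on each side. During its construction, the Eiffel Tower surpassed the Washington "
    "Monument to become the tallest man-made structure in the world."
)

# max_length and min_length are measured in tokens, not characters
summary = summarizer(text, max_length=60, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```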
Another option is to use BartTokenizer to generate tokens from text sequences and BartForConditionalGeneration for summarizing.
Assume you have to summarize the same text as in the example above. You can make use of the tokenizer’s batch_encode_plus() feature for this purpose. When called, this method produces a dictionary that stores the encoded sequence or sequence pair and any other information provided.
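Continuing with the same example text, a sketch of this approach looks like the following (batch_encode_plus() returns a dictionary of model inputs such as input_ids and attention_mask):

```python
from transformers import BartTokenizer, BartForConditionalGeneration

# Load the tokenizer and model for the same checkpoint as before
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-xsum")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-xsum")

# batch_encode_plus() returns a dictionary with the encoded sequence(s)
inputs = tokenizer.batch_encode_plus(
    [text],               # the same example passage as above
    max_length=1024,      # truncate the encoded input to at most 1024 tokens
    truncation=True,
    return_tensors="pt",  # return PyTorch tensors
)
```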
How can we control the length of the sequences involved? In batch_encode_plus(), the max_length parameter caps the length of the encoded input, while the min_length and max_length arguments of model.generate() bound the length of the generated summary. To get the IDs of the summary output, we feed the input_ids into the model.generate() function.
The model.generate() method returns the summary of the original text as a sequence of token IDs. model.generate() accepts many parameters, among which are:

- input_ids: the encoded input sequence used as the prompt for generation.
- max_length: the maximum number of tokens to generate.
- min_length: the minimum number of tokens to generate.
- num_beams: the number of beams for beam search; a value of 1 means no beam search.
- early_stopping: whether to stop beam search as soon as enough complete candidates are found.
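Continuing the example, the generation step might look like this:

```python
# Generate the summary as a sequence of token IDs
summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=4,          # beam-search width
    min_length=10,        # shortest allowed summary, in tokens
    max_length=60,        # longest allowed summary, in tokens
    early_stopping=True,  # stop beam search once enough complete candidates exist
)
```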
The decode() function can be used to transform the ID sequence into plain text.
The decode() method converts a sequence of token IDs into a single string. It accepts several parameters, two of which are worth mentioning here:

- skip_special_tokens: whether to remove special tokens (such as <s>, </s>, and <pad>) from the output.
- clean_up_tokenization_spaces: whether to clean up extra whitespace around punctuation left over from tokenization.
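Putting it together, the decoding step might look like this:

```python
# Turn the generated token IDs back into a readable string
summary_text = tokenizer.decode(
    summary_ids[0],
    skip_special_tokens=True,            # drop special tokens such as <s>, </s>, <pad>
    clean_up_tokenization_spaces=True,   # tidy whitespace around punctuation
)
print(summary_text)
```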
Printing the decoded output gives us the final, human-readable summary of the input text.
Ktrain is a Python package that reduces the amount of code required to implement machine learning. Wrapping TensorFlow and other libraries, it aims to make cutting-edge ML models accessible to non-experts while satisfying the needs of experts in the field. With ktrain’s streamlined interface, you can handle a wide variety of problems with as little as three or four “commands” or lines of code, regardless of whether the data being worked with is textual, visual, graphical, or tabular.
Using a pretrained BART model from the transformers library, ktrain can summarize text. First, we’ll create a TransformerSummarizer instance to perform the actual summarizing. (Please note that the installation of PyTorch is necessary to use this function.)
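Assuming ktrain and PyTorch are installed, the setup might look like this (the import path below follows ktrain's text-summarization module):

```python
# pip install ktrain  (the summarization module also requires PyTorch)
from ktrain.text.summarization import TransformerSummarizer

# By default this wraps a pretrained BART summarization checkpoint from transformers
ts = TransformerSummarizer()
```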
Now let's give it an article to summarize:
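The article below is a short, made-up example used purely for illustration; in practice you would substitute any long-form text you want condensed:

```python
# A short example article; replace with your own long-form text
article = (
    "BART is a sequence-to-sequence model that combines a bidirectional encoder with "
    "an autoregressive decoder. It is pretrained by corrupting documents with a noising "
    "function and learning to reconstruct the original text. After fine-tuning, it can be "
    "applied to tasks such as summarization, question answering, and machine translation."
)

print(ts.summarize(article))
```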
In this article, we explored the BART (Bidirectional and Auto-Regressive Transformers) model. By examining both theoretical insights and practical code examples, we demonstrated how BART’s powerful seq2seq capabilities can be leveraged for tasks like text generation, summarization, and translation.
With the growing demand for AI/ML solutions and scalable infrastructure, DigitalOcean offers the perfect environment to host and scale your machine learning workloads. With their powerful GPU Droplets and managed Kubernetes solutions, you can efficiently deploy and manage Transformer models like BART for real-time inference and training tasks. Whether you’re running intensive NLP models or handling large-scale AI projects, DigitalOcean provides the resources you need to scale quickly, ensuring both cost-efficiency and performance.
Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.