Generating a dataset for training a Large Language Model (LLM) involves several crucial steps to ensure the model captures the nuances of language. From selecting diverse text sources to preprocessing to splitting the dataset, each stage requires attention to detail. Additionally, it’s important to balance the dataset’s size and complexity to optimize the model’s learning process. By curating a well-structured dataset, you lay a strong foundation for training an LLM capable of understanding and generating natural language with proficiency and accuracy.
This brief guide will walk you through generating a classification dataset to train and validate a Large Language Model (LLM). While the dataset created here is small, it lays a solid foundation for exploration and further development.
To follow along, it helps to be familiar with Python and libraries such as pandas and numpy, as well as deep learning frameworks like TensorFlow or PyTorch.

Several sources provide great datasets for fine-tuning and training your LLMs. A few of them are listed below:
1. Kaggle: Kaggle hosts datasets across a wide range of domains. You can find datasets for NLP tasks, including text classification, sentiment analysis, and more. Visit: Kaggle Datasets
2. Hugging Face Datasets: Hugging Face provides large datasets specifically curated for natural language processing tasks. It also offers easy integration with its transformers library for model training. Visit: Hugging Face Datasets
3. Google Dataset Search: Google Dataset Search is a search engine designed to help researchers locate online data that is freely available for use. You can find a variety of datasets for language modeling tasks here. Visit: Google Dataset Search
4. UCI Machine Learning Repository: While not exclusively focused on NLP, the UCI Machine Learning Repository contains various datasets that can be used for language modeling and related tasks. Visit: UCI Machine Learning Repository
5. GitHub: GitHub hosts numerous repositories that contain datasets for different purposes, including NLP. You can search for repositories related to your specific task or model architecture. Visit: GitHub
6. Common Crawl: Common Crawl is a nonprofit organization that crawls the web and freely provides its archives and datasets to the public. It can be a valuable resource for collecting text data for language modeling. Visit: Common Crawl
7. OpenAI Datasets: OpenAI periodically releases datasets for research purposes. These datasets often include large-scale text corpora that can be used for training LLMs. Visit: OpenAI Datasets
The code and concept for this article are inspired by Sebastian Raschka’s excellent course, which provides comprehensive insights into constructing a substantial language model from the ground up.
1. We will start by installing the necessary packages.
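Only pandas needs installing for this walkthrough; the download and extraction steps below use the Python standard library. Assuming a pip-based environment, the install looks like this:

```
pip install pandas
```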
2. The lines of code below will help download the raw dataset and extract it.
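As a minimal sketch, we first define the download URL and local paths. The UCI archive URL below is an assumption based on the publicly hosted SMS Spam Collection dataset and may change over time:

```python
import os
import urllib.request
import zipfile
from pathlib import Path

# Assumed URL for the UCI SMS Spam Collection archive; adjust if it has moved
url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
zip_path = "sms_spam_collection.zip"
extracted_path = "sms_spam_collection"
```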
3. Next, we will use the with statement to open both the URL and the local file, and then extract the downloaded archive.
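Continuing the sketch above, one way to do this is to stream the remote response into a local zip file inside a single with statement, then extract the archive with the standard zipfile module:

```python
# Download the zip archive: read from the URL and write to a local file
with urllib.request.urlopen(url) as response, open(zip_path, "wb") as out_file:
    out_file.write(response.read())

# Extract the archive into the target directory
with zipfile.ZipFile(zip_path, "r") as zip_ref:
    zip_ref.extractall(extracted_path)
```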
4. The code below ensures that the extracted file is renamed with a “.tsv” extension.
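Following on from the extraction step, a simple way to handle this is to rename the extracted file and print a confirmation:

```python
# Rename the extracted file so it carries a .tsv extension
original_file_path = Path(extracted_path) / "SMSSpamCollection"
data_file_path = Path(extracted_path) / "SMSSpamCollection.tsv"
os.rename(original_file_path, data_file_path)
print(f"File downloaded and saved as {data_file_path}")
```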
After this code executes successfully, we will see the message “File downloaded and saved as sms_spam_collection/SMSSpamCollection.tsv”.
5. Use the pandas library to load the saved dataset and further explore the data.
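One way to load the tab-separated file and inspect the label distribution (the column names "Label" and "Text" are our own choice, since the file has no header row):

```python
import pandas as pd

# The file is tab-separated with no header row, so we assign column names ourselves
df = pd.read_csv(data_file_path, sep="\t", header=None, names=["Label", "Text"])
print(df["Label"].value_counts())
```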
Label
ham     4825
spam     747
Name: count, dtype: int64
6. Let’s define a function with pandas to generate a balanced dataset. First, we count the number of ‘spam’ messages, then randomly sample the same number of ‘ham’ messages so that both classes have the same count.
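A sketch of such a function, assuming the DataFrame loaded above with a "Label" column (the random seed is an illustrative choice for reproducibility):

```python
def create_balanced_dataset(df):
    # Count how many messages are labeled as spam
    num_spam = df[df["Label"] == "spam"].shape[0]
    # Randomly sample the same number of ham messages
    ham_subset = df[df["Label"] == "ham"].sample(num_spam, random_state=123)
    # Combine the sampled ham messages with all spam messages
    return pd.concat([ham_subset, df[df["Label"] == "spam"]])

balanced_df = create_balanced_dataset(df)
```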
Let us run value_counts to check the counts of ‘spam’ and ‘ham’.
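Using the balanced DataFrame from the sketch above:

```python
print(balanced_df["Label"].value_counts())
```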
Label
ham     747
spam    747
Name: count, dtype: int64
As we can see, the DataFrame is now balanced.
7. Next, we will write a function that randomly splits the dataset into train, validation, and test sets.
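A minimal sketch of such a split; the 70/10/20 train/validation/test ratio and the random seed are illustrative choices:

```python
def random_split(df, train_frac, validation_frac):
    # Shuffle the whole DataFrame with a fixed seed for reproducibility
    df = df.sample(frac=1, random_state=123).reset_index(drop=True)

    # Compute the boundary indices for each split
    train_end = int(len(df) * train_frac)
    validation_end = train_end + int(len(df) * validation_frac)

    train_df = df[:train_end]
    validation_df = df[train_end:validation_end]
    test_df = df[validation_end:]
    return train_df, validation_df, test_df

train_df, validation_df, test_df = random_split(balanced_df, 0.7, 0.1)
```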
Next, save the datasets locally.
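For instance, each split can be written to its own CSV file (the file names below are illustrative):

```python
train_df.to_csv("train.csv", index=False)
validation_df.to_csv("validation.csv", index=False)
test_df.to_csv("test.csv", index=False)
```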
Building a large language model (LLM) is quite complex. However, as the AI field evolves and new technologies emerge, the process is becoming less complicated. From laying the groundwork with robust algorithms to fine-tuning hyperparameters and managing vast datasets, every step is critical in creating a model capable of understanding and generating human-like text.
One crucial aspect of training LLMs is creating high-quality datasets. This involves sourcing diverse and representative text corpora, preprocessing them to ensure consistency and relevance, and, perhaps most importantly, curating balanced datasets to avoid biases and enhance model performance.
With this, we come to the end of the article, having seen how easy it is to create a classification dataset from a delimited file. We highly recommend using this article as a base and creating more complex datasets.
We hope you enjoyed reading the article!