How to Create Data for Fine-Tuning LLMs

Published on January 15, 2026

By Shaoni Mukherjee, Technical Writer

Fine-tuning a large language model (LLM) is only as good as the data on which it is trained. High-quality, well-structured datasets directly determine how well the model follows instructions, answers questions, and behaves in production.

This guide walks through end-to-end data preparation for LLM fine-tuning, from raw text to a production-ready dataset.

Key Takeaways

  • High-quality data matters more than large quantities when fine-tuning LLMs. Clean, well-structured, and task-aligned datasets consistently outperform noisy, oversized ones.
  • Start with a clearly defined objective before collecting data. Knowing whether you’re fine-tuning for instruction-following, domain expertise, or conversational behavior guides every data decision.
  • Instruction–response pairs are the foundation of effective fine-tuning. Well-written prompts with accurate, concise responses help models learn desired behaviors faster.
  • Proper data formatting is critical. Aligning your dataset with the target framework (JSON, JSONL, or chat format) prevents training failures and improves learning efficiency.
  • Continuous evaluation and iteration are necessary. Fine-tuning is an iterative process: monitor results, refine data, and retrain to achieve optimal performance.

Understanding LLM Fine-Tuning Data Requirements

Fine-tuning data is not raw text scraped from the internet. Unlike pretraining, where models learn language patterns from a massive corpus, fine-tuning teaches a model how it should behave. The goal is to shape responses so the model follows instructions, answers questions accurately, and adopts a consistent persona.

In practice, this means transforming knowledge into structured examples that pair a user intent with an ideal response. Each data point explicitly shows the model what a good answer looks like for a given prompt. Over time, the model internalizes these patterns and begins to generalize them to unseen inputs.

Most modern fine-tuning pipelines rely on either instruction-style data or chat-style data. Instruction-style datasets consist of an instruction, an optional input, and an output. Chat-style datasets use role-based messages such as system, user, and assistant. Both approaches serve the same purpose; the choice usually depends on the target model and training framework.

Data Formats for LLM Fine-Tuning

Before creating or exporting a dataset, it is critical to understand the different data formats used for LLM fine-tuning. The format we choose directly affects how the model interprets instructions, learns conversational flow, and generalizes to real-world usage. While most modern training frameworks support multiple formats, each format serves a slightly different purpose and is suited to different fine-tuning goals.

Here are a few of the most widely used data formats for model fine-tuning.

Completion-Style Format

Completion-style datasets are the simplest and most traditional form of fine-tuning data. In this format, the model is shown a prompt and trained to predict the continuation. This format was commonly used in early GPT-style fine-tuning and is still supported by many APIs.

A completion-style example typically looks like this:

{
  "prompt": "What is a DigitalOcean GPU Droplet?\nAnswer:",
  "completion": " A GPU Droplet is a virtual machine on DigitalOcean that includes NVIDIA GPUs for AI, machine learning, and high-performance computing workloads."
}

In this setup, the model learns to continue generating text after the prompt. While this approach is simple, it has limitations. Because the structure is implicit, the model may struggle with complex instruction-following or multi-turn conversations. Completion-style fine-tuning is best suited for narrowly scoped tasks such as short answers, classification, or controlled text generation.

Instruction-Style Format

Instruction-style fine-tuning makes the learning objective explicit by clearly separating the instruction from the expected output. This format has become the de facto standard for adapting open-source LLMs because it improves instruction-following behavior and reduces ambiguity.

An instruction-style example usually contains three fields: an instruction, an optional input, and an output. The instruction describes the task, the input provides additional context if needed, and the output represents the ideal response.

{
  "instruction": "Explain what a GPU Droplet is in DigitalOcean",
  "input": "",
  "output": "A GPU Droplet is a virtual machine provided by DigitalOcean that comes with NVIDIA GPUs and is designed for AI, machine learning, deep learning, and high-performance computing workloads."
}

This format is highly readable and easy to debug. It also allows mixing different task types, such as explanations, summaries, and troubleshooting, within the same dataset. Instruction-style data is ideal when the goal is to teach a model to follow commands reliably and produce consistent, domain-specific responses.

Chat-Style Format

Chat-style datasets are designed to mirror real conversational interactions. Each training example consists of a sequence of messages, where each message is associated with a role such as system, user, or assistant. This format closely aligns with how chat-based LLMs are used in production.

A chat-style example typically looks like this:

{
  "messages": [
    {"role": "system", "content": "You are a helpful DigitalOcean support assistant."},
    {"role": "user", "content": "What is a GPU Droplet?"},
    {"role": "assistant", "content": "A GPU Droplet is a virtual machine on DigitalOcean equipped with NVIDIA GPUs, designed for AI, machine learning, and high-performance computing workloads."}
  ]
}

The system message defines the model’s behavior and tone, while the user and assistant messages teach conversational flow. This format is particularly effective for chatbots, customer support agents, and multi-turn assistants.

However, chat-style datasets are more verbose and slightly more complex to curate. They are best used when the final application is conversational and context-dependent.

Before deciding on the appropriate format for model training, we must first clarify the specific task we aim to accomplish.

If we are adapting a model for simple prompt–response tasks, completion-style data may be sufficient. For most domain adaptation and instruction-following use cases, instruction-style datasets offer the best balance of clarity and flexibility. When building conversational agents or assistants that must maintain context and persona, chat-style datasets are the most natural choice. What matters most is consistency. A well-structured, clean dataset in any of these formats will perform better than a larger but poorly designed one.

Where Fine-Tuning Data Comes From

High-quality fine-tuning datasets are almost always derived from authoritative sources. These include product documentation, API references, internal support tickets, help center articles, and curated expert explanations. In many real-world systems, teams also create synthetic examples to cover edge cases or underrepresented queries.

Start by identifying reliable sources that reflect the knowledge and behavior you want the model to learn. For domain-specific fine-tuning, this often includes official documentation, internal knowledge bases, support tickets, FAQs, and expert-written guides. The closer the data is to real user questions and authoritative answers, the more effective the fine-tuned model will be. In some cases, ethical web scraping may also be needed to gather reliable data for certain domain-specific chatbots.

In practice, many teams rely on a combination of proprietary data and open datasets. Platforms like Hugging Face play a central role in this process. Hugging Face hosts thousands of publicly available datasets covering instruction tuning, question answering, summarization, and conversational tasks. These datasets can be used directly, adapted to a new domain, or serve as templates for creating custom data. Hugging Face Datasets also provide standardized loading, versioning, and streaming utilities, which make large-scale data acquisition and preprocessing significantly easier.

LLM-Generated Data and Synthetic Dataset Creation

One of the modern practices in LLM fine-tuning is generating training data using an existing large language model. This approach, often referred to as synthetic data generation or LLM-generated data, has become increasingly popular because it significantly reduces the time and cost required to build high-quality datasets.

In this workflow, a strong base model is prompted to generate instruction–response pairs, question–answer examples, or multi-turn conversations based on a predefined schema. The generated outputs are then reviewed, filtered, and refined before being added to the fine-tuning dataset. When done carefully, this technique can produce data that closely mirrors real user interactions.
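As an illustration, here is a minimal sketch of such a schema-driven generation prompt (the wording, field names, and sample fact are assumptions, not a fixed standard):

# Illustrative generation prompt; the schema and wording are assumptions
GENERATION_PROMPT = """\
You are creating fine-tuning data for a support assistant.
Using only the fact below, write {n} varied user questions,
each paired with an accurate, concise answer.
Return a JSON list of objects with "instruction" and "output" keys.

Fact: {fact}
"""

prompt = GENERATION_PROMPT.format(
    n=5,
    fact="A GPU Droplet is a DigitalOcean virtual machine with NVIDIA GPUs."
)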

LLM-generated data is particularly useful when human-written examples are scarce, when covering edge cases, or when expanding an existing dataset to improve coverage and diversity. For example, once we define a small set of trusted DigitalOcean facts, an LLM can generate dozens of semantically varied questions and high-quality answers while maintaining consistency with the original knowledge.

However, synthetic data must be used with caution. Models tend to reinforce their own biases and phrasing patterns, which can lead to overfitting or reduced linguistic diversity if the dataset is entirely synthetic. For this reason, LLM-generated data works best when combined with human-curated or authoritative source data. Human review, automatic validation checks, and deduplication are essential steps in this process.
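Deduplication, for example, can start with a simple exact-match pass. Here is a minimal sketch (it assumes instruction-style fields named instruction and output; near-duplicate detection, such as embedding similarity, is typically layered on top):

import hashlib

def deduplicate(examples):
    # Normalize case and whitespace so trivially repeated pairs collapse
    seen, unique = set(), []
    for ex in examples:
        key = hashlib.md5(
            f"{ex['instruction'].strip().lower()}||{ex['output'].strip().lower()}".encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique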

In production pipelines, teams often follow a hybrid approach. Human-written data establishes correctness and tone, while LLM-generated data scales the dataset and fills in gaps. This balance allows teams to achieve strong fine-tuning results without sacrificing accuracy or reliability.

Preparing Hugging Face Datasets for LLM Fine-Tuning

Hugging Face hosts a vast number of open-source datasets that can be used to create high-quality fine-tuning data. Many of these datasets already contain instruction–response pairs but still need to be formatted into a consistent prompt structure before training.

Example 1: Formatting the Dolly Dataset

from datasets import load_dataset
from itertools import islice

# Load Dolly dataset (non-streaming; small enough)
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

# Take first 1000 samples
samples = list(islice(dataset, 1000))

print("Dolly sample structure:")
print(samples[0])

# Format for instruction fine-tuning
def format_dolly(example):
    instruction = example["instruction"]
    context = example.get("context", "")
    response = example["response"]

    prompt_parts = [
        f"### Instruction:\n{instruction}"
    ]

    if context.strip():
        prompt_parts.append(f"### Input:\n{context}")

    prompt_parts.append(f"### Response:\n{response}")

    return {"text": "\n".join(prompt_parts)}

# Apply formatting
formatted_samples = [format_dolly(sample) for sample in samples]

Output:

Dolly sample structure:
{'instruction': 'When did Virgin Australia start operating?', 'context': "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.", 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.', 'category': 'closed_qa'}

Example 2: Formatting the OpenOrca Dataset (Streaming)

from datasets import load_dataset
from itertools import islice

dataset = load_dataset(
    "Open-Orca/OpenOrca",
    split="train",
    streaming=True
)

samples = list(islice(dataset, 1000))

def format_openorca(example):
    system = example.get("system_prompt", "You are a helpful assistant.")
    question = example["question"]
    answer = example["response"]

    text = (
        f"### System:\n{system}\n"
        f"### Instruction:\n{question}\n"
        f"### Response:\n{answer}"
    )

    return {"text": text}

formatted_samples = [format_openorca(s) for s in samples]

Output:

{   'text': '### System:\n'
            '\n'
            '### Instruction:\n'
            'You will be given a definition of a task first, then some input '
            'of the task.\n'
            'This task is about using the specified sentence and converting '
            'the sentence to Resource Description Framework (RDF) triplets of '
            'the form (subject, predicate object). The RDF triplets generated '
            'must be such that the triplets accurately capture the structure '
            'and semantics of the input sentence. The input is a sentence and '
            'the output is a list of triplets of the form [subject, predicate, '
            'object] that capture the relationships present in the sentence. '
            'When a sentence has more than 1 RDF triplet possible, the output '
            'must contain all of them.\n'
            '\n'
            "AFC Ajax (amateurs)'s ground is Sportpark De Toekomst where Ajax "
            'Youth Academy also play.\n'
            'Output:\n'
            '### Response:\n'
            '[\n'
            '  ["AFC Ajax (amateurs)", "has ground", "Sportpark De '
            'Toekomst"],\n'
            '  ["Ajax Youth Academy", "plays at", "Sportpark De Toekomst"]\n'
            ']'}

This formatting approach:

  • Ensures consistent prompt structure across all samples
  • Handles optional inputs cleanly
  • Produces a single "text" field that works seamlessly with tokenizers
  • Is compatible with LoRA, QLoRA, and full fine-tuning pipelines

Once formatted, this dataset can be directly tokenized and passed to a training loop using Hugging Face Trainer, SFTTrainer, or custom PyTorch code.

Tip: Always inspect a few formatted samples before training. Small formatting inconsistencies can significantly impact model behavior during fine-tuning.
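For instance, here is a minimal sketch that spot-checks a few of the formatted_samples produced above and writes them to a JSONL file (the filename train.jsonl is an assumption):

import json

# Spot-check a few formatted samples before training
for sample in formatted_samples[:3]:
    print(sample["text"])
    print("-" * 40)

# One JSON object per line: the layout most SFT pipelines expect
with open("train.jsonl", "w", encoding="utf-8") as f:
    for sample in formatted_samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")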

Creating Data for Domain-Specific LLM Fine-Tuning

Fine-tuning a large language model for a specific domain, such as healthcare, legal, finance, or education, requires collecting and transforming raw domain data.

This section outlines a structured approach to creating high-quality, domain-specific fine-tuning data.

Define the Domain Scope and Target Tasks

Start by clearly defining what the model should learn and how it will be used. Domain-specific fine-tuning is most effective when the scope is narrow and task-driven.

Key questions to answer:

  • What domain knowledge should the model specialize in?
  • What tasks will users perform using the model?
  • What level of expertise should the responses reflect?

Examples:

  • Healthcare: clinical note summarization, medical concept explanation
  • Finance: financial metric interpretation, earnings analysis
  • DevOps: log analysis, incident troubleshooting

A well-defined scope prevents noisy or irrelevant data from degrading model performance.

Collect High-Quality Domain Data

Domain-specific datasets should be sourced from reliable and authoritative materials.

Common data sources include:

  • Internal documentation and knowledge bases
  • Industry whitepapers and research publications
  • Technical manuals and product documentation
  • Support tickets, FAQs, and customer interactions
  • Transcripts or notes from subject matter experts

Ensure that all data collection complies with privacy, security, and licensing requirements.

Transform Raw Content into Instruction–Response Pairs

Raw domain text must be converted into supervised learning examples that teach the model how to respond to domain-specific queries.

Each sample should represent a realistic task the model is expected to perform.

Example (Finance):

  • Instruction: Explain EBITDA and its role in company valuation.
  • Response: EBITDA represents earnings before interest, taxes, depreciation, and amortization, and is commonly used to evaluate a company’s operating performance.

This transformation can be done manually, semi-automatically using LLM assistance, or fully programmatically with validation.

Use a Consistent Prompt Structure

Consistency in formatting is critical for stable fine-tuning. A standardized prompt template helps the model learn task boundaries and response expectations.

### Instruction:
<Task description>

### Input:
<Optional domain context>

### Response:
<Expected output>

This structure is widely compatible with Alpaca-style, Dolly-style, and LLaMA-based fine-tuning pipelines.

Format Domain Data Programmatically

Once instruction–response pairs are defined, they can be converted into training-ready text samples.

def format_domain_example(example):
    instruction = example["instruction"]
    context = example.get("context", "")
    response = example["response"]

    sections = [f"### Instruction:\n{instruction}"]

    if context.strip():
        sections.append(f"### Input:\n{context}")

    sections.append(f"### Response:\n{response}")

    return {"text": "\n".join(sections)}

The resulting "text" field can be directly tokenized and passed to supervised fine-tuning workflows.

Validate Data Quality Before Fine-Tuning

Before training begins, manually review a subset of samples to ensure the data is factually correct, the instructions are clear and unambiguous, and the responses are high quality.

Poor-quality data, even in small quantities, can significantly affect how a model behaves after fine-tuning. For further information on this topic, feel free to check out the detailed blog on LLM Poisoning.
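Automated pre-checks can catch structural problems before human review reaches the data. Below is a minimal sketch (the field names, the raw_examples variable, and the threshold are assumptions; factual correctness still requires a human):

def passes_basic_checks(example, min_response_words=5):
    # Reject samples with missing instructions or very short responses
    if not example.get("instruction", "").strip():
        return False
    response = example.get("response", "").strip()
    return len(response.split()) >= min_response_words

clean_examples = [ex for ex in raw_examples if passes_basic_checks(ex)]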

Choose an Appropriate Fine-Tuning Strategy

For most domain-specific use cases:

  • LoRA / QLoRA offer fast, cost-effective adaptation
  • Full fine-tuning provides maximum performance at a higher cost

Most teams achieve strong results using LoRA-based supervised fine-tuning with well-curated data. We have already created a step-by-step tutorial showing how to use LoRA for model fine-tuning with a custom dataset. Feel free to check the resources section for more information.

Generating Domain-Specific Fine-Tuning Data via Web Scraping

Web scraping is a practical way to collect domain-specific content from public documentation, blogs, or knowledge bases. Once scraped, this raw text can be transformed into instruction–response pairs for supervised fine-tuning.

Warning: Always ensure the website permits scraping (check its robots.txt and terms of service) and scrape ethically, respecting rate limits.

Step 1: Install Required Libraries

Use pip to install the necessary libraries.

pip install requests beautifulsoup4

Step 2: Scrape the Content Using Beautiful Soup

In this example, we scrape headings and paragraphs from a technical documentation page.

import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    title = soup.find("h1")
    paragraphs = soup.find_all("p")

    content = {
        "title": title.get_text(strip=True) if title else "",
        "paragraphs": [
            p.get_text(strip=True)
            for p in paragraphs
            if len(p.get_text(strip=True)) > 100
        ]
    }

    return content

Step 3: Convert Scraped Content into Instruction–Response Pairs

After scraping, we convert the raw text into training samples suitable for instruction tuning.

def create_instruction_data(scraped_content):
    instruction = (
        f"Explain the following topic in a clear and concise manner: "
        f"{scraped_content['title']}"
    )

    response = " ".join(scraped_content["paragraphs"])

    return {
        "instruction": instruction,
        "response": response
    }

Step 4: Format Data for LLM Fine-Tuning

Finally, format the instruction–response pairs into a single prompt field.

def format_for_finetuning(example):
    text = (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['response']}"
    )
    return {"text": text}

Step 5: End-to-End Example

url = "https://www.digitalocean.com/community/tutorials/deploy-coreflux-mqtt-mongodb
"
scraped = scrape_page(url)

instruction_data = create_instruction_data(scraped)
formatted_sample = format_for_finetuning(instruction_data)

print(formatted_sample["text"][:500])

Output:

### Instruction:
Explain the following topic in a clear and concise manner: Deploy Coreflux MQTT Broker with MongoDB on DigitalOcean

### Response:
MQTT brokers are essential for modern IoT infrastructure and automation systems, where the need for a centralized, unified, and fast data hub is the key part for system interoperability and data exchange. Coreflux is a powerful, low code MQTT broker that expands the traditional MQTT broker to a system that provides advanced features for real-time data processin

Generating Synthetic Data Using LLMs (Without Paid APIs)

Collecting and manually annotating domain-specific data is often expensive and time-consuming. A practical alternative is synthetic data generation using LLMs, where an LLM is used to create realistic instruction–response pairs at scale. Oftentimes, a Groq API key or another paid service is used to drive the generating LLM. In this pipeline, however, we will demonstrate how to generate synthetic customer support data using a free, locally running LLM from Hugging Face; no API keys or paid services are required.

Example: Synthetic Customer Support Data Generator

The following example uses the Hugging Face transformers library with a lightweight open-source model (flan-t5-base) to generate instruction–response pairs for customer support scenarios.

Install Dependencies

pip install transformers torch

Python Code: Generating Synthetic Data Using a Local LLM

import random
import json
from transformers import pipeline

class FreeSyntheticDataGenerator:
    def __init__(self):
        # Instruction-tuned model
        self.generator = pipeline(
            "text2text-generation",
            model="google/flan-t5-base"
        )

        self.templates = {
            "order_inquiry": [
                "Where is my order #{order_id}?",
                "Can you track order #{order_id}?",
                "What is the status of order #{order_id}?"
            ],
            "return_request": [
                "I want to return my {product}",
                "How do I get a refund for {product}?"
            ],
            "technical_support": [
                "My {device} is not turning on.",
                "I am facing error code {error_code} on {software}"
            ]
        }

        self.variables = {
            "order_id": ["12345", "67890", "ABC999"],
            "product": ["laptop", "phone", "headphones"],
            "device": ["laptop", "router", "tablet"],
            "software": ["Windows", "Android", "website"],
            "error_code": ["404", "E-001"]
        }

    def generate_examples(self, category, count=5):
        dataset = []

        for _ in range(count):
            instruction = random.choice(self.templates[category])

            for var, values in self.variables.items():
                instruction = instruction.replace(
                    f"{{{var}}}", random.choice(values)
                )

            prompt = f"""
You are a professional customer support agent.
Respond clearly and concisely.
Customer query: {instruction}
"""

            response = self.generator(
                prompt,
                max_length=150,
                do_sample=False
            )[0]["generated_text"]

            dataset.append({
                "instruction": instruction,
                "output": response.strip(),
                "category": category
            })

        return dataset


if __name__ == "__main__":
    generator = FreeSyntheticDataGenerator()
    samples = generator.generate_examples("order_inquiry", 3)
    print(json.dumps(samples, indent=2))


Output Format for Fine-Tuning

Each generated example follows an instruction-tuning format, making it suitable for supervised fine-tuning (SFT):

[
  {
    "instruction": "What is the status of order #12345?",
    "output": "Order #12345 has been placed.",
    "category": "order_inquiry"
  },
  {
    "instruction": "What is the status of order #12345?",
    "output": "Your order is currently being processed and is expected to be delivered within the estimated delivery timeline.",
    "category": "order_inquiry"
  },
  {
    "instruction": "Where is my order #12345?",
    "output": "Where is your order #12345?",
    "category": "order_inquiry"
  }
]

Notice that the third sample merely echoes the customer's question, a common failure mode for small models like flan-t5-base; such weak outputs should be filtered out during review. Feel free to explore other free options as well; if you have access to paid APIs, they can produce higher-quality synthetic data (and more). Here are some examples:

  • OpenAI (GPT-4.1, GPT-4.1-Nano, GPT-4.5): high-quality text, instruction following, code, and multi-turn data; widely used with a strong ecosystem.
  • Groq (Llama 3, 8B/70B): fast inference with fixed cost tiers; a very good mix of speed and accuracy.
  • Anthropic (Claude 3, e.g., Claude 3 Opus): conversational, safe responses; great for chat assistants.
  • Cohere (Command R): retrieval-augmented generation; good for RAG workflows.
  • Google Vertex AI (Gemini models): multimodal support; integrates with GCP workflows.

However, paid APIs are at times better for synthetic data generation because they use large, instruction-tuned models that are specifically trained to follow prompts accurately and produce structured, high-quality responses.

Unlike free or base models, paid APIs understand intent, role instructions, and formatting requirements, which results in outputs that are coherent, relevant, and consistent across thousands of samples. This significantly reduces hallucinations, repetition, and off-topic text, meaning the generated data requires far less manual cleaning before it can be used for fine-tuning.

Additionally, paid APIs handle scalability, reliability, and performance automatically, allowing teams to generate large volumes of synthetic data quickly. This makes them an efficient choice for production-grade datasets.

Why Data Quality Matters More Than Data Volume

One of the most common mistakes in fine-tuning is prioritizing dataset size over dataset quality. Large volumes of noisy, repetitive, or ambiguous examples can actively harm model performance. In contrast, a smaller dataset with clean instructions and precise answers often yields far better results.

High-quality fine-tuning data has a few defining characteristics. Each example focuses on a single intent, the answer is complete but concise, the tone is consistent across the dataset, and there are no contradictions or hallucinated facts. Duplicate or near-duplicate examples should be avoided unless repetition is intentional for reinforcement.

As a general rule, thousands of carefully curated examples are usually sufficient for domain adaptation and behavior shaping. We only need tens of thousands of samples when we are pushing for deep specialization or covering a very broad range of tasks.

FAQs

How much data is required to fine-tune an LLM?

The amount of data required depends on the fine-tuning method and the task. For parameter-efficient fine-tuning methods like LoRA or QLoRA, high-quality results can often be achieved with as few as 500 to 5,000 instruction–response pairs. Full model fine-tuning typically requires much larger datasets, often tens or hundreds of thousands of examples. In most cases, data quality and relevance matter more than sheer volume.

Can I fine-tune an LLM using synthetic data only?

Yes, LLMs can be fine-tuned using only synthetic data, especially for well-defined tasks such as customer support, summarization, or domain-specific Q&A. Synthetic data works best when it is generated using instruction-tuned models and validated through automated checks and human review. Many production systems use a mix of synthetic and real data, but high-quality synthetic data alone can still deliver strong performance.

What’s the best data format for LLM fine-tuning?

The most commonly used and recommended format is JSONL (JSON Lines), where each line represents one training example. For instruction tuning, this usually includes fields like instruction and output. JSONL is widely supported by Hugging Face, PEFT (LoRA/QLoRA), and most fine-tuning pipelines, making it easy to stream and scale during training.
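For example, two JSONL lines might look like this (illustrative values):

{"instruction": "What is a GPU Droplet?", "output": "A GPU Droplet is a DigitalOcean virtual machine with NVIDIA GPUs."}
{"instruction": "Explain EBITDA.", "output": "EBITDA represents earnings before interest, taxes, depreciation, and amortization."}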

How long does data preparation take?

Data preparation time varies based on dataset size and complexity. For small to medium datasets (1,000–5,000 samples), preparation can take a few hours to a couple of days, including cleaning and validation. Larger or domain-specific datasets may take several days or weeks, especially if human validation is involved. Automating parts of the pipeline can significantly reduce this time.

How often should fine-tuning data be updated?

Fine-tuning data should be updated whenever the domain, user behavior, or requirements change. For fast-evolving products or customer support systems, updates every few weeks or months are common. Regular updates help the model stay accurate, reduce drift, and adapt to new terminology, policies, or user expectations.

Conclusion

Preparing high-quality data is the foundation of successful LLM fine-tuning. While model choice and training techniques often get the most attention, it is the quality, structure, and relevance of the data that ultimately determine how well a fine-tuned model performs. A well-designed data pipeline covering acquisition, cleaning, validation, bias checks, and human review ensures that the model learns the right behaviors and produces reliable, aligned outputs.

As this article demonstrated, data preparation does not have to rely solely on expensive or hard-to-source datasets. Synthetic data generated using instruction-tuned LLMs, when combined with proper quality controls, can be an effective and scalable solution for many fine-tuning use cases. By exporting data in standardized formats and continuously updating it as requirements evolve, teams can build fine-tuning workflows that are both efficient and future-proof. Ultimately, investing time in data preparation leads to more stable models, better generalization, and stronger real-world performance.

References and Resources


About the author

Shaoni Mukherjee
Technical Writer

With a strong background in data science and over six years of experience, I am passionate about creating in-depth content on technologies. Currently focused on AI, machine learning, and GPU computing, working on topics ranging from deep learning frameworks to optimizing GPU-based workloads.
