How To Design a Document Schema in MongoDB

Published on November 29, 2021
By Mateusz Papiernik

Software Engineer, CTO @Makimo

The author selected the Open Internet/Free Speech Fund to receive a donation as part of the Write for DOnations program.

Introduction

If you have a lot of experience working with relational databases, it can be difficult to move past the principles of the relational model, such as thinking in terms of tables and relationships. Document-oriented databases like MongoDB make it possible to break free from the rigidity and limitations of the relational model. However, the flexibility and freedom that come with being able to store self-descriptive documents in the database can lead to other pitfalls and difficulties.

This conceptual article outlines five common guidelines related to schema design in a document-oriented database and highlights various considerations one should make when modeling relationships between data. It will also walk through several strategies one can employ to model such relationships, including embedding documents within arrays and using child and parent references, as well as when these strategies would be most appropriate to use.

Guideline 1 — Storing Together What Needs to be Accessed Together

In a typical relational database, data is kept in tables, and each table is constructed with a fixed list of columns representing various attributes that make up an entity, object, or event. For example, in a table representing students at a university, you might find columns holding each student’s first name, last name, date of birth, and a unique identification number.

Typically, each table represents a single subject. If you wanted to store information about a student’s current studies, scholarships, or prior education, it could make sense to keep that data in a separate table from the one holding their personal information. You could then connect these tables to signify that there is a relationship between the data in each one, indicating that the information they contain has a meaningful connection.

For instance, a table describing each student’s scholarship status could refer to students by their student ID number, but it would not store the student’s name or address directly, avoiding data duplication. In such a case, to retrieve a student’s personal details together with their current studies, prior education, and scholarships, a query would need to access more than one table at a time and then compile the results from the different tables into one.

This method of describing relationships through references is known as a normalized data model. Storing data this way — using multiple separate, concise objects related to each other — is also possible in document-oriented databases. However, the flexibility of the document model and the freedom it gives to store embedded documents and arrays within a single document means that you can model data differently than you might in a relational database.

The underlying concept for modeling data in a document-oriented database is to “store together what will be accessed together.” Digging further into the student example, say that most students at this school have more than one email address. Because of this, the university wants the ability to store multiple email addresses with each student’s contact information.

In a case like this, an example document could have a structure like the following:

{
    "_id": ObjectId("612d1e835ebee16872a109a4"),
    "first_name": "Sammy",
    "last_name": "Shark",
    "emails": [
        {
            "email": "sammy@digitalocean.com",
            "type": "work"
        },
        {
            "email": "sammy@example.com",
            "type": "home"
        }
    ]
}

Notice that this example document contains an embedded list of email addresses.

Representing more than a single subject inside a single document characterizes a denormalized data model. It allows applications to retrieve and manipulate all the relevant data for a given object (here, a student) in one go, without needing to access multiple separate objects and collections. Because MongoDB applies a write to a single document atomically, this also keeps operations on such a document consistent without having to use multi-document transactions.
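To make the single-read idea concrete, here is a minimal sketch in plain JavaScript. It assumes an in-memory Map standing in for a hypothetical students collection, and it uses plain strings in place of ObjectId values so that it runs in Node.js without a database connection. The helper name findStudentById is illustrative only.

```javascript
// In-memory stand-in for a hypothetical "students" collection.
// Plain strings replace ObjectId values so this runs without MongoDB.
const students = new Map();

students.set("612d1e835ebee16872a109a4", {
    _id: "612d1e835ebee16872a109a4",
    first_name: "Sammy",
    last_name: "Shark",
    emails: [
        { email: "sammy@digitalocean.com", type: "work" },
        { email: "sammy@example.com", type: "home" }
    ]
});

// Hypothetical helper: one lookup returns the student together with
// every embedded email address. In mongosh, a comparable call would be
// db.students.findOne({ _id: ... }) on a collection named "students".
function findStudentById(id) {
    return students.get(id) || null;
}

const sammy = findStudentById("612d1e835ebee16872a109a4");
const addresses = sammy.emails.map((entry) => entry.email);
console.log(addresses);
```

No second lookup is needed to obtain the email addresses, which is the practical benefit of the denormalized shape.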

Storing together what needs to be accessed together using embedded documents is often the optimal way to represent data in a document-oriented database. In the following guidelines, you’ll learn how different relationships between objects, such as one-to-one or one-to-many relationships, can be best modeled in a document-oriented database.

Guideline 2 — Modeling One-to-One Relationships with Embedded Documents

A one-to-one relationship represents an association between two distinct objects where one object is connected with exactly one of another kind.

Continuing with the student example from the previous section, each student has only one valid student ID card at any given point in time. One card never belongs to multiple students, and no student can have multiple identification cards. If you were to store all this data in a relational database, it would likely make sense to model the relationship between students and their ID cards by storing the student records and the ID card records in separate tables that are tied together through references.

One common method for representing such relationships in a document database is by using embedded documents. As an example, the following document describes a student named Sammy and their student ID card:

{
    "_id": ObjectId("612d1e835ebee16872a109a4"),
    "first_name": "Sammy",
    "last_name": "Shark",
    "id_card": {
        "number": "123-1234-123",
        "issued_on": ISODate("2020-01-23"),
        "expires_on": ISODate("2023-01-23")
    }
}

Notice that instead of a single value, this example document’s id_card field holds an embedded document representing the student’s identification card, described by an ID number, the card’s date of issue, and the card’s expiration date. The identity card essentially becomes a part of the document describing the student Sammy, even though it’s a separate object in real life. Usually, structuring the document schema like this so that you can retrieve all related information through a single query is a sound choice.
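As a small illustration of working with an embedded one-to-one document, the following sketch checks whether a student’s card is valid on a given date. It is plain JavaScript with no database involved; the dates are illustrative values, and hasValidCard is a hypothetical helper rather than anything MongoDB provides.

```javascript
// Hypothetical student document; the dates are illustrative values only.
const student = {
    _id: "612d1e835ebee16872a109a4",
    first_name: "Sammy",
    last_name: "Shark",
    id_card: {
        number: "123-1234-123",
        issued_on: new Date("2020-01-23T00:00:00Z"),
        expires_on: new Date("2023-01-23T00:00:00Z")
    }
};

// Hypothetical helper: the card travels with the student document, so no
// second lookup is needed to check whether it is valid on a given date.
function hasValidCard(doc, onDate) {
    const card = doc.id_card;
    return card.issued_on <= onDate && onDate < card.expires_on;
}

const validInMid2021 = hasValidCard(student, new Date("2021-06-01T00:00:00Z"));
const validInMid2024 = hasValidCard(student, new Date("2024-06-01T00:00:00Z"));
console.log(validInMid2021, validInMid2024);
```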

Things become less straightforward if you encounter relationships connecting one object of a kind with many objects of another type, such as a student’s email addresses, the courses they attend, or the messages they post on the student council’s message board. In the next few guidelines, you’ll use these data examples to learn different approaches for working with one-to-many and many-to-many relationships.

Guideline 3 — Modeling One-to-Few Relationships with Embedded Documents

When an object of one type is related to multiple objects of another type, it can be described as a one-to-many relationship. A student can have multiple email addresses, a car can have numerous parts, or a shopping order can consist of multiple items. Each of these examples represents a one-to-many relationship.

While the most common way to represent a one-to-one relationship in a document database is through an embedded document, there are several ways to model one-to-many relationships in a document schema. When considering your options for how to best model these, though, there are three properties of the given relationship you should consider:

  • Cardinality: Cardinality is the measure of the number of individual elements in a given set. For example, if a class has 30 students, you could say that class has a cardinality of 30. In a one-to-many relationship, the cardinality can be different in each case. A student could have one email address or multiple. They could be registered for just a few classes or they could have a completely full schedule. In a one-to-many relationship, the size of “many” will affect how you might model the data.
  • Independent access: Some related data will rarely, if ever, be accessed separately from the main object. For example, it might be uncommon to retrieve a single student’s email address without other student details. On the other hand, a university’s courses might need to be accessed and updated individually, regardless of the student or students that are registered to attend them. Whether or not you will ever access a related document alone will also affect how you might model the data.
  • Whether the relationship between data is strictly a one-to-many relationship: Consider the courses an example student attends at a university. From the student’s perspective, they can participate in multiple courses. On the surface, this may seem like a one-to-many relationship. However, university courses are rarely attended by a single student; more often, multiple students will attend the same class. In cases like this, the relationship in question is not really a one-to-many relationship, but a many-to-many relationship, and thus you’d take a different approach to model this relationship than you would a one-to-many relationship.

Imagine you’re deciding how to store student email addresses. Each student can have multiple email addresses, such as one for work, one for personal use, and one provided by the university. A document representing a single email address might take a form like this:

{
    "email": "sammy@digitalocean.com",
    "type": "work"
}

In terms of cardinality, there will be only a few email addresses for each student, since it’s unlikely that a student will have dozens — let alone hundreds — of email addresses. Thus, this relationship can be characterized as a one-to-few relationship, which is a compelling reason to embed email addresses directly into the student document and store them together. You don’t run any risk that the list of email addresses will grow indefinitely, which would make the document big and inefficient to use.

Note: Be aware that there are certain pitfalls associated with storing data in arrays. For instance, a single MongoDB document cannot exceed 16MB in size. While it is possible and common to embed multiple documents using array fields, if the list of objects grows uncontrollably, the document could reach this size limit. Additionally, storing a large amount of data inside embedded arrays can have a significant impact on query performance.

Embedding multiple documents in an array field will likely be suitable in many situations, but know that it also may not always be the best solution.
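One rough way to watch for this pitfall is to estimate how a document’s size changes as an embedded array grows. The sketch below rests on several assumptions: it measures the byte length of the JSON text, which is only a proxy for the BSON size that MongoDB actually enforces, it takes 16MB to mean 16 * 1024 * 1024 bytes, and approximateSizeInBytes is a hypothetical helper name.

```javascript
// 16MB expressed in bytes, assuming 16 * 1024 * 1024.
const MAX_DOCUMENT_BYTES = 16 * 1024 * 1024;

// Hypothetical helper: byte length of the JSON text, used only as a rough
// proxy. MongoDB enforces the limit on the BSON encoding, which differs.
function approximateSizeInBytes(doc) {
    return Buffer.byteLength(JSON.stringify(doc), "utf8");
}

const student = { first_name: "Sammy", last_name: "Shark", emails: [] };
const sizeBefore = approximateSizeInBytes(student);

// Simulate an embedded array that keeps growing.
for (let i = 0; i < 1000; i++) {
    student.emails.push({ email: `sammy${i}@example.com`, type: "home" });
}
const sizeAfter = approximateSizeInBytes(student);

console.log(sizeBefore, sizeAfter, sizeAfter < MAX_DOCUMENT_BYTES);
```

Even a thousand small entries stay far below the limit in this sketch, which suggests the performance cost of large arrays tends to matter well before the hard size limit does.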

Regarding independent access, email addresses will likely not be accessed separately from the student. As such, there is no clear incentive to store them as separate documents in a separate collection. This is another compelling reason to embed them inside the student’s document.

The last thing to consider is whether this relationship is really a one-to-many relationship instead of a many-to-many relationship. Because an email address belongs to a single person, it’s reasonable to describe this relationship as a one-to-many relationship (or, perhaps more accurately, a one-to-few relationship) instead of a many-to-many relationship.

These three considerations suggest that embedding students’ various email addresses within the same documents that describe the students themselves would be a good choice for storing this kind of data. A sample student’s document with email addresses embedded might take this shape:

{
    "_id": ObjectId("612d1e835ebee16872a109a4"),
    "first_name": "Sammy",
    "last_name": "Shark",
    "emails": [
        {
            "email": "sammy@digitalocean.com",
            "type": "work"
        },
        {
            "email": "sammy@example.com",
            "type": "home"
        }
    ]
}

Using this structure, every time you retrieve a student’s document you will also retrieve the embedded email addresses in the same read operation.

If you model a relationship of the one-to-few variety, where the related documents do not need to be accessed independently, embedding documents directly like this is usually desirable, as this can reduce the complexity of the schema.
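The following sketch shows what the one-to-few embedding looks like from the application side. It is plain JavaScript over an in-memory array standing in for a hypothetical students collection, with strings in place of ObjectId values. The helpers findStudentByEmail and addEmail, as well as the added university address, are assumptions made for illustration.

```javascript
// In-memory stand-in for a hypothetical "students" collection.
const students = [
    {
        _id: "612d1e835ebee16872a109a4",
        first_name: "Sammy",
        last_name: "Shark",
        emails: [
            { email: "sammy@digitalocean.com", type: "work" },
            { email: "sammy@example.com", type: "home" }
        ]
    }
];

// Hypothetical helper: find the student who owns a given address.
// In mongosh, a comparable filter would be { "emails.email": address }.
function findStudentByEmail(address) {
    const match = students.find((s) => s.emails.some((e) => e.email === address));
    return match || null;
}

// Hypothetical helper: adding an address touches exactly one document.
// In mongosh, this would typically be an updateOne call using $push.
function addEmail(studentId, email, type) {
    const student = students.find((s) => s._id === studentId);
    if (!student) return false;
    student.emails.push({ email, type });
    return true;
}

addEmail("612d1e835ebee16872a109a4", "sammy@university.example", "university");
const owner = findStudentByEmail("sammy@university.example");
console.log(owner.first_name, owner.emails.length);
```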

As mentioned previously, though, embedding documents like this isn’t always the optimal solution. The next section provides more details on why this might be the case in some scenarios, and outlines how to use child references as an alternative way to represent relationships in a document database.

Guideline 4 — Modeling One-to-Many and Many-to-Many Relationships with Child References

The nature of the relationship between students and their email addresses informed how that relationship could best be modeled in a document database. There are some differences between this and the relationship between students and the courses they attend, so the way you model the relationships between students and their courses will be different as well.

A document describing a single course that a student attends could follow a structure like this:

{
    "name": "Physics 101",
    "department": "Department of Physics",
    "points": 7
}

Say that you decided from the outset to use embedded documents to store information about each student’s courses, as in this example:

{
    "_id": ObjectId("612d1e835ebee16872a109a4"),
    "first_name": "Sammy",
    "last_name": "Shark",
    "emails": [
        {
            "email": "sammy@digitalocean.com",
            "type": "work"
        },
        {
            "email": "sammy@example.com",
            "type": "home"
        }
    ],
    "courses": [
        {
            "name": "Physics 101",
            "department": "Department of Physics",
            "points": 7
        },
        {
            "name": "Introduction to Cloud Computing",
            "department": "Department of Computer Science",
            "points": 4
        }
    ]
}

This would be a perfectly valid MongoDB document and could well serve the purpose, but consider the three relationship properties you learned about in the previous guideline.

The first one is cardinality. A student will likely only maintain a few email addresses, but they can attend multiple courses during their study. After several years of attendance, there could be dozens of courses the student took part in. Plus, they’d attend these courses along with many other students who are likewise attending their own set of courses over their years of attendance.

If you decided to embed each course like the previous example, the student’s document would quickly get unwieldy. With a higher cardinality, the embedded document approach becomes less compelling.

The second consideration is independent access. Unlike email addresses, it’s sound to assume there would be cases in which information about university courses would need to be retrieved on their own. For instance, say someone needs information about available courses to prepare a marketing brochure. Additionally, courses will likely need to be updated over time: the professor teaching the course might change, its schedule may fluctuate, or its prerequisites might need to be updated.

If you were to store the courses as documents embedded within student documents, retrieving the list of all the courses offered by the university would become troublesome. Also, each time a course needs an update, you would need to go through all student records and update the course information everywhere. Both are good reasons to store courses separately and not embed them fully.
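The update problem can be sketched in a few lines of plain JavaScript. The example assumes two hypothetical students (the second one, Alex, is invented for illustration) who each embed a full copy of the same course, and a hypothetical helper that counts how many student documents a single course change has to touch.

```javascript
// Two hypothetical students who each embed a full copy of the same course.
const students = [
    {
        first_name: "Sammy",
        courses: [
            { name: "Physics 101", department: "Department of Physics", points: 7 }
        ]
    },
    {
        first_name: "Alex",
        courses: [
            { name: "Physics 101", department: "Department of Physics", points: 7 },
            { name: "Introduction to Cloud Computing", department: "Department of Computer Science", points: 4 }
        ]
    }
];

// Hypothetical helper: changing one course means visiting every student
// document that embeds a copy of it. Returns how many documents changed.
function updateEmbeddedCoursePoints(courseName, newPoints) {
    let touched = 0;
    for (const student of students) {
        let changed = false;
        for (const course of student.courses) {
            if (course.name === courseName) {
                course.points = newPoints;
                changed = true;
            }
        }
        if (changed) touched += 1;
    }
    return touched;
}

const touchedDocuments = updateEmbeddedCoursePoints("Physics 101", 8);
console.log(touchedDocuments);
```

The number of documents touched grows with the number of students enrolled, whereas a course stored once would need a single update.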

The third thing to consider is whether the relationship between a student and a university course is actually one-to-many or instead many-to-many. In this case, it’s the latter, as more than one student can attend each course. This relationship’s cardinality and independent access aspects argue against embedding each course document, primarily for practical reasons like ease of access and update. Considering the many-to-many nature of the relationship between courses and students, it might make sense to store course documents in a separate collection with unique identifiers of their own.

The documents representing courses in this separate collection might have a structure like these examples:

{
    "_id": ObjectId("61741c9cbc9ec583c836170a"),
    "name": "Physics 101",
    "department": "Department of Physics",
    "points": 7
},
{
    "_id": ObjectId("61741c9cbc9ec583c836170b"),
    "name": "Introduction to Cloud Computing",
    "department": "Department of Computer Science",
    "points": 4
}

If you decide to store course information like this, you’ll need to find a way to connect students with these courses so that you will know which students attend which courses. In cases like this where the number of related objects isn’t excessively large, especially with many-to-many relationships, one common way of doing this is to use child references.

With child references, a student’s document will reference the object identifiers of the courses that the student attends in an embedded array, as in this example:

{
    "_id": ObjectId("612d1e835ebee16872a109a4"),
    "first_name": "Sammy",
    "last_name": "Shark",
    "emails": [
        {
            "email": "sammy@digitalocean.com",
            "type": "work"
        },
        {
            "email": "sammy@example.com",
            "type": "home"
        }
    ],
    "courses": [
        ObjectId("61741c9cbc9ec583c836170a"),
        ObjectId("61741c9cbc9ec583c836170b")
    ]
}

Notice that this example document still has a courses field which also is an array, but instead of embedding full course documents like in the earlier example, only the identifiers referencing the course documents in the separate collection are embedded. Now, when retrieving a student document, courses will not be immediately available and will need to be queried separately. On the other hand, it’s immediately known which courses to retrieve. Also, in case any course’s details need to be updated, only the course document itself needs to be altered. All references between students and their courses will remain valid.
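A minimal sketch of this two-step access pattern follows. It uses a plain JavaScript Map as a stand-in for a hypothetical courses collection and strings in place of ObjectId values; coursesForStudent is an illustrative helper name.

```javascript
// In-memory stand-in for a hypothetical "courses" collection, keyed by _id.
const courses = new Map([
    ["61741c9cbc9ec583c836170a", {
        _id: "61741c9cbc9ec583c836170a",
        name: "Physics 101",
        department: "Department of Physics",
        points: 7
    }],
    ["61741c9cbc9ec583c836170b", {
        _id: "61741c9cbc9ec583c836170b",
        name: "Introduction to Cloud Computing",
        department: "Department of Computer Science",
        points: 4
    }]
]);

// The student document stores only child references (course identifiers).
const student = {
    _id: "612d1e835ebee16872a109a4",
    first_name: "Sammy",
    last_name: "Shark",
    courses: ["61741c9cbc9ec583c836170a", "61741c9cbc9ec583c836170b"]
};

// Hypothetical helper: a second lookup resolves the references. In mongosh,
// a comparable filter on the courses collection would be
// { _id: { $in: student.courses } }.
function coursesForStudent(doc) {
    return doc.courses
        .map((id) => courses.get(id))
        .filter((course) => course !== undefined);
}

// Updating a course happens in one place; the references stay valid.
courses.get("61741c9cbc9ec583c836170a").points = 8;

const resolved = coursesForStudent(student);
console.log(resolved.map((course) => `${course.name}: ${course.points}`));
```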

Note: There is no firm rule for when the cardinality of a relation is too great to embed child references in this manner. You might choose a different approach at either a lower or higher cardinality if it’s what best suits the application in question. After all, you will always want to structure your data to suit the manner in which your application queries and updates it.

If you model a one-to-many relationship where the number of related documents is within reasonable bounds and related documents need to be accessed independently, favor storing the related documents separately and embedding child references to connect to them.

Now that you’ve learned how to use child references to signify relationships between different types of data, this guide will outline an inverse concept: parent references.

Guideline 5 — Modeling Unbounded One-to-Many Relationships with Parent References

Using child references works well when there are too many related objects to embed them directly inside the parent document, but their number is still within known bounds. However, there are cases when the number of associated documents might be unbounded and will continue to grow with time.

As an example, imagine that the university’s student council has a message board where any student can post whatever messages they want, including questions about courses, travel stories, job postings, study materials, or just a free chat. A sample message in this example consists of a subject and a message body:

{
    "_id": ObjectId("61741c9cbc9ec583c836174c"),
    "subject": "Books on kinematics and dynamics",
    "message": "Hello! Could you recommend good introductory books covering the topics of kinematics and dynamics? Thanks!",
    "posted_on": ISODate("2021-07-23T16:03:21Z")
}

You could use either of the two approaches discussed previously — embedding and child references — to model this relationship. If you were to decide on embedding, the student’s document might take a shape like this:

{
    "_id": ObjectId("612d1e835ebee16872a109a4"),
    "first_name": "Sammy",
    "last_name": "Shark",
    "emails": [
        {
            "email": "sammy@digitalocean.com",
            "type": "work"
        },
        {
            "email": "sammy@example.com",
            "type": "home"
        }
    ],
    "courses": [
        ObjectId("61741c9cbc9ec583c836170a"),
        ObjectId("61741c9cbc9ec583c836170b")
    ],
    "message_board_messages": [
        {
            "subject": "Books on kinematics and dynamics",
            "message": "Hello! Could you recommend good introductory books covering the topics of kinematics and dynamics? Thanks!",
            "posted_on": ISODate("2021-07-23T16:03:21Z")
        },
        . . .
    ]
}

However, if a student is a prolific poster, their document will quickly become very long and could eventually exceed the 16MB size limit, so the cardinality of this relation argues against embedding. Additionally, the messages might need to be accessed separately from the student, as could be the case if the message board page is designed to show the latest messages posted by students. This also suggests that embedding is not the best choice for this scenario.

Note: You should also consider whether the message board messages are frequently accessed when retrieving the student’s document. If not, having them all embedded inside that document would incur a performance penalty when retrieving and manipulating this document, even when the list of messages would not be used often. Infrequent access of related data is often another clue that you shouldn’t embed documents.

Now consider using child references instead of embedding full documents as in the previous example. The individual messages would be stored in a separate collection, and the student’s document could then have the following structure:

{
    "_id": ObjectId("612d1e835ebee16872a109a4"),
    "first_name": "Sammy",
    "last_name": "Shark",
    "emails": [
        {
            "email": "sammy@digitalocean.com",
            "type": "work"
        },
        {
            "email": "sammy@example.com",
            "type": "home"
        }
    ],
    "courses": [
        ObjectId("61741c9cbc9ec583c836170a"),
        ObjectId("61741c9cbc9ec583c836170b")
    ],
    "message_board_messages": [
        ObjectId("61741c9cbc9ec583c836174c"),
        . . .
    ]
}

In this example, the message_board_messages field now stores the child references to all messages written by Sammy. However, changing the approach solves only one of the issues mentioned before: it is now possible to access the messages independently. Although the student’s document would grow more slowly with child references, the array of object identifiers could still become unwieldy given the unbounded cardinality of this relation. After all, a student could easily write thousands of messages during their four years of study.

In such scenarios, a common way to connect one object to another is through parent references. Unlike the child references described previously, it’s now not the student document referring to individual messages, but rather a reference in the message’s document pointing towards the student that wrote it.

To use parent references, you would need to modify the message document schema to contain a reference to the student who authored the message:

{
    "_id": ObjectId("61741c9cbc9ec583c836174c"),
    "subject": "Books on kinematics and dynamics",
    "message": "Hello! Could you recommend good introductory books covering the topics of kinematics and dynamics? Thanks!",
    "posted_on": ISODate("2021-07-23T16:03:21Z"),
    "posted_by": ObjectId("612d1e835ebee16872a109a4")
}

Notice the new posted_by field contains the object identifier of the student’s document. Now, the student’s document won’t contain any information about the messages they’ve posted:

{
    "_id": ObjectId("612d1e835ebee16872a109a4"),
    "first_name": "Sammy",
    "last_name": "Shark",
    "emails": [
        {
            "email": "sammy@digitalocean.com",
            "type": "work"
        },
        {
            "email": "sammy@example.com",
            "type": "home"
        }
    ],
    "courses": [
        ObjectId("61741c9cbc9ec583c836170a"),
        ObjectId("61741c9cbc9ec583c836170b")
    ]
}

To retrieve the list of messages written by a student, you would use a query on the messages collection and filter against the posted_by field. Having them in a separate collection makes it safe to let the list of messages grow without affecting any of the student’s documents.

Note: When using parent references, creating an index on the field referencing the parent document can significantly increase the query performance each time you filter against the parent document identifier.
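The lookup by parent reference can be sketched as follows. This is plain JavaScript over an in-memory array standing in for a hypothetical messages collection, with strings in place of ObjectId values. The second and third messages, their identifiers, and the helper messagesPostedBy are invented for illustration.

```javascript
// In-memory stand-in for a hypothetical "messages" collection.
const messages = [
    {
        _id: "61741c9cbc9ec583c836174c",
        subject: "Books on kinematics and dynamics",
        posted_on: new Date("2021-07-23T16:03:21Z"),
        posted_by: "612d1e835ebee16872a109a4"
    },
    {
        _id: "example-message-2", // hypothetical identifier
        subject: "Study group for Physics 101",
        posted_on: new Date("2021-08-02T09:15:00Z"),
        posted_by: "612d1e835ebee16872a109a4"
    },
    {
        _id: "example-message-3", // hypothetical identifier
        subject: "Selling a used laptop",
        posted_on: new Date("2021-08-05T12:00:00Z"),
        posted_by: "another-student-id" // hypothetical second student
    }
];

// Hypothetical helper: filter on the parent reference, newest first.
// In mongosh, a comparable query would be
// db.messages.find({ posted_by: studentId }), and an index could be
// created with db.messages.createIndex({ posted_by: 1 }).
function messagesPostedBy(studentId) {
    return messages
        .filter((message) => message.posted_by === studentId)
        .sort((a, b) => b.posted_on - a.posted_on);
}

const sammyMessages = messagesPostedBy("612d1e835ebee16872a109a4");
console.log(sammyMessages.map((message) => message.subject));
```

The student document is never read or modified here, so the list of messages can grow without affecting it.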

If you model a one-to-many relationship where the number of related documents is unbounded, regardless of whether the documents need to be accessed independently, it’s generally advised that you store related documents separately and use parent references to connect them to the parent document.
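The rules of thumb from Guidelines 3 to 5 can be condensed into a small decision helper. The sketch below is a simplification and not a firm rule: the labels "few", "many", and "unbounded", the option names, and the function suggestModel are all assumptions made for illustration, and every case the guidelines do not address explicitly falls back to child references.

```javascript
// Hypothetical helper that encodes the rules of thumb from this article.
// cardinality is one of "few", "many", or "unbounded" (assumed labels).
function suggestModel({ cardinality, independentAccess, manyToMany = false }) {
    // Unbounded growth: parent references, regardless of access pattern.
    if (cardinality === "unbounded") {
        return "parent references";
    }
    // One-to-few, never read on its own, strictly one-to-many: embed.
    if (cardinality === "few" && !independentAccess && !manyToMany) {
        return "embed";
    }
    // Everything else is treated as a case for child references here.
    return "child references";
}

const emailChoice = suggestModel({ cardinality: "few", independentAccess: false });
const courseChoice = suggestModel({ cardinality: "many", independentAccess: true, manyToMany: true });
const messageChoice = suggestModel({ cardinality: "unbounded", independentAccess: true });
console.log(emailChoice, courseChoice, messageChoice);
```

The three calls correspond to the email, course, and message board examples used throughout the guidelines.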

Conclusion

Thanks to the flexibility of document-oriented databases, determining the best way to model relationships in a document database is less of a strict science than it is in a relational database. By reading this article, you’ve acquainted yourself with embedding documents and using child and parent references to store related data. You’ve learned about considering the relationship cardinality and avoiding unbounded arrays, as well as taking into account whether the document will be accessed separately or frequently.

These are just a few guidelines that can help you model typical relationships in MongoDB, but modeling a database schema is not a one-size-fits-all exercise. Always take into account your application and how it uses and updates the data when designing the schema.

To learn more about schema design and common patterns for storing different kinds of data in MongoDB, we encourage you to check the official MongoDB documentation on that topic.

Tutorial Series: How To Manage Data with MongoDB

MongoDB is a document-oriented NoSQL database management system (DBMS). Unlike traditional relational DBMSs, which store data in tables consisting of rows and columns, MongoDB stores data in JSON-like structures referred to as documents.

This series provides an overview of MongoDB’s features and how you can use them to manage and interact with your data.
