Collecting user data is common practice in modern websites and applications as a way of providing creators with more information to make decisions and create better experiences. Among other benefits, data can be used to help tailor content, drive product direction, and provide insight into problems in current implementations. Collecting relevant information and using it wisely can give organizations an edge over competitors and increase the impact of limited resources.
While data can help your organization fulfill its objectives, it is important to keep in mind that there are downsides to collecting and storing information about users. Privacy, security, ethical, and legal considerations can influence what type of data you collect, what you do with it, and your responsibilities to data owners. Failure to handle these concerns responsibly can result in significant financial and reputational damage, and potentially expose you to legal ramifications.
In this guide, we will discuss some of the ways that gathering and analyzing data about your users can help make your organization more effective. We will also consider some of the risks and trade-offs that are associated with harvesting and retaining data and how to strike a balance that makes sense for your organization.
When talking about collecting, storing, and analyzing data about your site or application users, it is important to define what type of data we are referring to. In the broadest sense, user data means any type of data generated by people interacting with your products. This data can be divided into groups based on how it was collected.
Explicit data refers to data that was given by a user directly. This includes preferences and personally identifiable information like name, mailing address, email, social accounts, billing data, and more. This type of data can be obtained through forms on your site, by issuing surveys, or by asking users to share data from other profiles they might maintain online. This category of data offers solid, reliable information about individual users and can be used as-is, without the extensive analysis or interpretation required by the second category of data, described below. It may be needed to implement basic functionality like collecting payments, or it can be used to allow customized experiences based on personal preferences.
A second type of data is implicit data. This category is not provided by users directly, but is instead gleaned by collecting and analyzing data from user interactions or from existing explicit data. This might include behavior-based analytics like session duration, pages visited, or device profile but may also include inferences made from provided data, such as user persona segment, likely work and sleep schedules, or closest shopping district.
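As a rough illustration of how implicit data such as session duration and pages visited might be derived from raw interactions, here is a minimal Python sketch. The event format, the `events` sample, and the `summarize_sessions` helper are assumptions made up for this example, not part of any particular analytics product.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical page-view events: (session_id, ISO timestamp, path).
events = [
    ("s1", "2024-01-01T09:00:00", "/"),
    ("s1", "2024-01-01T09:02:30", "/pricing"),
    ("s1", "2024-01-01T09:07:00", "/signup"),
    ("s2", "2024-01-01T22:15:00", "/"),
]


def summarize_sessions(events):
    """Derive per-session duration (in seconds) and the ordered pages visited."""
    by_session = defaultdict(list)
    for session_id, timestamp, path in events:
        by_session[session_id].append((datetime.fromisoformat(timestamp), path))

    summary = {}
    for session_id, views in by_session.items():
        views.sort()  # order page views by time
        first, last = views[0][0], views[-1][0]
        summary[session_id] = {
            "duration_seconds": (last - first).total_seconds(),
            "pages": [path for _, path in views],
        }
    return summary


summary = summarize_sessions(events)
print(summary["s1"]["duration_seconds"])  # 420.0
```

None of these values were entered by the user. They are inferred from behavior, which is why they usually need more interpretation than explicit data.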
A third category consists of data acquired from external parties. This may have been gathered explicitly or implicitly originally, but your organization’s relationship to the data is filtered by another entity that has offered access to the information.
Explicit, implicit, and externally collected data are useful for developing a holistic representation of how users are interacting with your site and what would serve them best. In the next section, we’ll look at how this data can be used to improve user experience and define opportunities for product enhancements.
Another term that is helpful to know when discussing user data is personally identifiable information. Also known as PII, this refers to any type of information that can be traced back to a single, known individual. This type of data can be especially useful for many business functions, but it generally refers to a more sensitive class of data that might require special handling or consideration.
Working with relevant data about your users can transform the way you think about product design, resource allocation, and implementing iterative solutions. In this section, we will talk about some of the ways that user data can help your organization build better products, communicate with the right people with messages that resonate, and understand the significance of different changes or behaviors.
One of the most important applications of user data is to inform development and design decisions. Both explicit feedback provided from users and insight gained by interpreting user behavioral data from your site or application can provide guidance on how to improve your products.
Fundamentally, data is essential for iterative, feedback-oriented design. Without understanding how well current solutions function, it is difficult to make meaningful enhancements with confidence. Data can help you identify areas of friction within user flows, discover which designs provide the best outcomes, and determine which work would have the largest impact on users. Explicit feedback can help surface user requirements that you might not have thought of and opportunities to expand your offerings to address specific concerns.
In short, data is essential for planning, implementing, and evaluating changes to a system. While some data, like application performance benchmarks, might come from internal systems, a large percentage of the data organizations care about is directly tied to how changes affect the people who interact with their products.
Collected data is frequently used to offer personalized experiences or messaging. By collecting user preferences directly or by analyzing past experiences to guess at what might be most relevant to a user, data can be used to create unique interactions that more closely align with your users’ interests and needs.
The ability to customize your interactions and user experiences has large implications for marketing materials, user interfaces, recommendation engines, and more. Data can be used to make sure you are targeting the correct audience, using messaging that is appropriate, and engaging at the most opportune time. It can help your users find the information they need more quickly and discover new content or features that match their interests. For many organizations, the goal of this process is to monetize their visitors by means of targeted advertisements.
Beyond personalization and driving product development, collecting data about your users can be required or helpful in a variety of other circumstances.
For example, prompting users for information like phone number or email address might be necessary to implement account recovery when users forget their credentials. Similarly, certain transactions require personally identifiable information to be submitted when processed by external parties (though for transactions like payment processing, this information is typically handled by the credit card processor itself).
Another instance where user data might be useful is in providing context for existing data from other sources. If your monitoring system shows a large increase in traffic over a short amount of time, it is probably useful to evaluate web analytics to determine where visitors are coming from. Similarly, if a subset of your users are reporting a problem with your application, understanding their geographic location can help you troubleshoot possible issues.
Other reasons you might collect data about your users are for auditing purposes and for compliance with government requirements. Records of user actions can help with mitigation and disclosure in case of security incidents. Certain industries require very specific records of information access, modification, creation, and deletion.
There are plenty of other cases where collected data can be useful for improving processes, providing decision makers with relevant information, and building products that users feel connected to.
With abundant examples of the benefits of collecting user data, it is also important to keep in mind some of the risks and problems that can arise from gathering and using this information. As with any solution, there are significant trade-offs that you should consider before determining if and how you leverage these resources.
One of the most important questions to pose when considering collecting information is how that information may compromise an individual’s privacy. Privacy is the ability to limit or deny access to information to outside parties. Collecting information about your users can impact their privacy, whether you share the data you gather or not.
Data privacy is important for personally identifiable information like names, addresses, and credit card information, but also for other data like page history and location data. Many people are familiar with the necessity for privacy with commonly recognized sensitive data, like medical or financial records, but privacy concerns must be evaluated in a broader context. Even seemingly harmless information can compromise user privacy. For example, displaying when a user was last seen on a site might not appear to have negative consequences, but it could expose sensitive information about that user’s activities to outside parties.
While users may not object to certain types of information collection, they often consent under assumptions about the scope of your use, how long the data will be kept, and how it will be shared with external parties. For instance, users might be comfortable sharing their preferences to enhance the recommendation engine on your site, but they might not want those same preferences used for targeted advertising. Data usage and sharing beyond the agreed intent is considered a privacy violation.
The possibility of security-related incidents exposing private data is another related factor. Privacy can only be ensured if there are strong guarantees that the organization collecting and storing the data is capable of acting as a safe steward. Many high-profile data breaches have highlighted the danger of centralizing private information because collected data is often shared more widely than was originally intended, either by accident or through malicious activities.
A separate, but related, impact of data collection is the erosion of anonymity. Anonymity means that activity or information cannot be attributed to a specific individual. While privacy is concerned mainly with controlling access to information, anonymity is a question of associating activity with identity.
Anonymity is important for a number of reasons. For instance, most people recognize the value of anonymity when identification can lead to consequences for whistleblowers — individuals disclosing unlawful or unethical conduct. However, anonymity is also an important option in many other contexts, like allowing people to avoid discrimination or bias that would be present if their identity were known. While anonymity and privacy are separate concepts, users often value anonymity as a necessary component of achieving privacy.
Like privacy, anonymity can be compromised by data collection through either intentional information sharing or accidental exposure. Information like IP addresses of visitors can be used to identify a user or household that accessed a site when cross-referenced against records from an ISP. Posts might be identified by a pseudonym or username, but may be traced back to a person by correlating data from other services. While anonymity can also be used to mask the identity of participants in illegal or harmful activities, legitimate users value anonymity on the internet as a method of operating in a hostile environment without exposing themselves to unnecessary harm.
Organizations often attempt to “anonymize” data by removing or obscuring identifiable attributes from datasets, but frequently, identities can be reestablished when combined with other sources of information. For example, if the names on published medical records are removed, the identities of the individuals can still be determined if there are other unique attributes within the disclosed information. Some data disclosure is harmful specifically because it destroys anonymity and privacy. For instance, exposing that an individual is a member of a certain organization or website can have negative consequences because it breaks anonymity (by tying user activities back to an identity) and privacy (because membership to the site is a data item that users may wish to keep private).
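One way to reason about this re-identification risk is to count how many records share the same combination of remaining attributes, an idea related to what is commonly called k-anonymity. The sketch below is only illustrative: the sample `records`, the choice of `QUASI_IDENTIFIERS`, and both helper functions are assumptions for this example, and passing such a check does not by itself guarantee anonymity.

```python
from collections import Counter

# Hypothetical "anonymized" records: names removed, other attributes kept.
records = [
    {"zip": "10001", "birth_year": 1980, "gender": "F", "diagnosis": "A"},
    {"zip": "10001", "birth_year": 1980, "gender": "F", "diagnosis": "B"},
    {"zip": "10002", "birth_year": 1975, "gender": "M", "diagnosis": "C"},
]

# Attributes that are not names but could be matched against other data sets.
QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")


def group_sizes(records, keys=QUASI_IDENTIFIERS):
    """Count how many records share each combination of quasi-identifiers."""
    return Counter(tuple(record[key] for key in keys) for record in records)


def smallest_group_size(records, keys=QUASI_IDENTIFIERS):
    """Return the size of the rarest combination (0 for an empty data set)."""
    return min(group_sizes(records, keys).values(), default=0)


def unique_records(records, keys=QUASI_IDENTIFIERS):
    """Return records whose quasi-identifier combination appears only once."""
    sizes = group_sizes(records, keys)
    return [r for r in records if sizes[tuple(r[key] for key in keys)] == 1]


print(smallest_group_size(records))  # 1
print(len(unique_records(records)))  # 1
```

A smallest group size of 1 means at least one person is unique on those attributes alone, so anyone holding another data set with the same attributes could plausibly link the record back to an identity.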
Increased data collection can help organizations optimize their practices, but this can come with serious ethical side effects. Especially when it comes to processing data automatically, systems responsible for categorizing users can inadvertently implement discriminatory practices based on unconscious human biases. Many of the same techniques used to programmatically segment users to better serve their interests can unintentionally entrench bias into the behavior of your applications or sites.
This problem is especially prevalent when working with big data systems and machine learning. While some proponents claim that these mechanisms only reveal patterns that are already present in the data, the algorithms used to locate those patterns can unintentionally put users from certain demographics at a disadvantage. They can also deepen any bias present in the algorithms or data sets by finding patterns and emphasizing them. If you are not careful about how you use these tools, it’s possible that you may cross legal and ethical lines unintentionally.
Overly aggressive segmentation can divide people, with a high degree of accuracy, along the lines of protected classes even without specifically trying to target those users. For example, depending on how you use segmentation data, you might inadvertently be dividing users along racial or gender lines in ways that have a measurable impact on the price each customer sees. Even if you are not using the segmentation data for those purposes yourself, if that data is available to your advertising partners, it could enable similar situations.
One consideration that people tend to oversimplify when considering collecting and storing data about their users is access control. Access control can mean making sure that outside parties cannot read the data you are collecting, but more broadly speaking, it can mean defining boundaries for anyone interacting with the data. For instance, this can mean cordoning off entry to employees whose functions are unrelated to the data, ensuring that vendors or partners cannot access it without informing users, and considering what it would mean to be asked for access from government agencies. Access to data is usually more complicated than initially expected.
When it comes to government involvement, the complexity is compounded. Requests for access to collected data can come in the form of a subpoena or warrant, and in some cases, agencies have been known to request extremely broad access to data that is not well-targeted to specific users or investigations. In absence of access to sensitive, specific information, many governments have become adept at collecting and analyzing metadata, which is generally less well-protected but can still reveal important information. It is often a legal gray area as to whether collecting or requesting this type of information requires a warrant. Collecting user data requires organizations to be prepared to deal with these ambiguities and anticipate requests for data.
An additional factor in the potential reach of collected data is security. While you may be vigilant about whom you intentionally grant access to, when data is exposed either by accident or through a data breach, it can effectively make the information you’ve gathered public. While taking security seriously is always important, the amount of valuable data on your system can affect your risk of exposure, the liability you may be burdened with, and the attractiveness of your organization as a target for malicious actors. Sharing agreements with partners or vendors also increases the security footprint of the data you’ve collected.
Before deciding to collect or use data from your visitors, it is important to know your legal responsibilities and understand what you must do to fulfill them.
When collecting data, you have certain responsibilities according to the laws of the countries you are operating in and, in some cases, the location(s) of your users. Being aware of the interplay of regulations from different jurisdictions is important for understanding the requirements you must fulfill and the liabilities you hold. Regulations can apply based on the type of data you’re collecting, how you intend to use it, where you plan on storing or processing it, how you are obligated to secure it, and where your users are from, for instance.
Some jurisdictions have clearly defined requirements for working with user data. The European Union, for instance, adopted a comprehensive set of rules under the General Data Protection Regulation (GDPR), which went into effect on May 25, 2018 and replaced a previous generation of regulations. The rules have very broad scope and apply to anyone collecting or processing data within the EU or dealing with data from EU visitors. Organizations are required to provide explicit information about data use to visitors and receive explicit consent for working with a broad range of potentially sensitive data. Other parts of the regulation allow users to request removal of their data, require organizations to report certain security incidents involving personal data to regulators within 72 hours and to inform affected users without undue delay when the risk to them is high, and put forth penalties of heavy fines for noncompliance.
Many other locations have similar sets of data protection laws that can apply when working within those countries or processing information from their residents. When designing your specific policies, it is important to be aware of the types of laws that might apply to your intended collection and use. To minimize your risk, it is always a good idea to talk to a lawyer to get a better idea of the legal landscape and to review your proposed privacy and data collection policies.
Some common requirements or recommendations for privacy policies to disclose are:
For certain types of information, or in certain jurisdictions, you may also legally be required to incorporate descriptions of:
So far, we’ve talked about some of the risks inherent with collecting and storing information about your users and some of your legal requirements if you choose to do so. With that in mind, we can discuss some strategies you may wish to consider to help you implement a compliant and ethically-minded data policy. These suggestions are focused on limiting potential misuse and safeguarding user privacy while still yielding actionable information to capture the benefits outlined in the first part of this article.
Often, casting a wide net when gathering information seems like the most forward-thinking option for organizations. Many advocates of making data-driven decisions encourage gathering as many data points as possible, not only to help shape decisions you are making now, but to have a repository of historical data available for the future.
Rather than collecting data that you may need later, consider limiting the data you handle to current or near-future requirements. This restraint will ensure that you are not putting more data at risk of exposure than necessary. Beyond limiting security risks, being mindful of data creep can help you keep a clean signal in the data you are processing instead of inadvertently collecting data until it matches the patterns you are expecting. While connecting disparate data sources can help you discover interesting insights, an overabundance of data can be easily misinterpreted and lead you to draw dubious conclusions based on your previous assumptions.
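Limiting what you collect can also mean storing data at a lower precision than you receive it. As one hedged example, the Python sketch below truncates visitor IP addresses before they are stored, using the standard library `ipaddress` module. The `truncate_ip` helper and the prefix lengths (/24 for IPv4, /48 for IPv6) are illustrative assumptions, not a recommendation or a legal standard, and truncation alone does not guarantee anonymity.

```python
import ipaddress


def truncate_ip(address, v4_prefix=24, v6_prefix=48):
    """Zero out the host portion of an IP address before storing it.

    The default prefix lengths are assumptions chosen for illustration.
    """
    ip = ipaddress.ip_address(address)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False lets us build the enclosing network from a host address.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


print(truncate_ip("203.0.113.77"))                  # 203.0.113.0
print(truncate_ip("2001:db8:abcd:12:34:56:78:9a"))  # 2001:db8:abcd::
```

A truncated address is often still enough for coarse questions, such as roughly where traffic is coming from, while being harder to tie to a single household.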
After collecting data, you should also consider your policies regarding storage and retention. While some types of data are useful over long periods of time, others become less useful over time. Removing stale or dated data can help decrease your costs and can again limit the opportunity for misuse, accidents, or exploitation. If your historical data may be helpful in the future, consider aggregating, distilling, and analyzing the data and storing the results instead of the original data. While this limits the way that the historical data can be used and may require you to anticipate future needs, it can provide many of the benefits of storing historical data while minimizing some of the dangers.
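The approach of keeping aggregates instead of raw history can be sketched as follows. This is a minimal example under stated assumptions: the event shape (`day`, `user_id`, `path`), the `aggregate_and_purge` helper, and the 30-day retention window are all hypothetical and would need to reflect your own requirements.

```python
from collections import Counter
from datetime import date, timedelta


def aggregate_and_purge(raw_events, today, retention_days=30):
    """Summarize expired events as daily page counts and drop the raw rows.

    raw_events: list of dicts with "day" (datetime.date), "user_id", "path".
    Returns (daily_counts, kept_events). The aggregate has no user identifiers.
    """
    cutoff = today - timedelta(days=retention_days)
    expired = [event for event in raw_events if event["day"] < cutoff]
    kept = [event for event in raw_events if event["day"] >= cutoff]
    daily_counts = Counter((event["day"], event["path"]) for event in expired)
    return daily_counts, kept


raw_events = [
    {"day": date(2024, 1, 1), "user_id": "u1", "path": "/"},
    {"day": date(2024, 1, 1), "user_id": "u2", "path": "/"},
    {"day": date(2024, 3, 1), "user_id": "u1", "path": "/pricing"},
]

daily_counts, kept = aggregate_and_purge(raw_events, today=date(2024, 3, 10))
print(daily_counts[(date(2024, 1, 1), "/")])  # 2
print(len(kept))                              # 1
```

Note that very small counts can still point to individuals, so aggregated results may deserve their own review before being shared widely.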
Your systems and policies should distinguish between fundamentally sensitive data, like PII, and data that is not. While it might make sense to share basic business data across the organization rather freely, access to data that contains personal information should be closely guarded. This includes access to data that might not, at first glance, seem especially susceptible to misuse, like client names and addresses.
Being careful to limit access to this data to team members who need it for their specific job functions can help you provide better guarantees to your visitors. Reviewing the data policies of each of your vendors or partners can help ensure you are protecting your customers’ interests beyond your own systems. Limiting the collection of this personally identifiable information can also limit the amount of data you can be compelled to turn over in the case of government requests.
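Restricting sensitive fields by job function can be sketched in a few lines of Python. The field names in `SENSITIVE_FIELDS`, the roles in `ROLES_WITH_PII_ACCESS`, and the `view_for_role` helper are hypothetical. A real system would typically enforce this in the database or an access-control layer rather than in application code alone.

```python
# Hypothetical field classification and roles, for illustration only.
SENSITIVE_FIELDS = {"name", "email", "address"}
ROLES_WITH_PII_ACCESS = {"support", "billing"}


def view_for_role(record, role):
    """Return a copy of the record containing only fields the role may see."""
    if role in ROLES_WITH_PII_ACCESS:
        return dict(record)
    return {
        field: value
        for field, value in record.items()
        if field not in SENSITIVE_FIELDS
    }


record = {"user_id": 1, "name": "Ada", "email": "ada@example.com", "plan": "pro"}

print(view_for_role(record, "analytics"))  # {'user_id': 1, 'plan': 'pro'}
```

Defaulting to the restricted view means that a new or unrecognized role sees no personal information until someone deliberately grants it.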
You might collect some types of sensitive information as a matter of course for things like processing payments. When possible, it is often preferable to defer these activities to dedicated service providers. This can help you avoid responsibility for handling your customers’ most sensitive data and often results in better security. Reliable processors have a deep understanding of how to appropriately collect and store this type of information and have insurance policies beyond what you may be able to invest in.
While your legal responsibilities related to user disclosure and control over the information you collect can vary depending on regulatory jurisdictions, being upfront about as many of your practices as possible is usually a good idea. Making it easy for users to view, download, or delete their data in your systems gives them agency over their own information and ensures that the data you do have is provided with consent.
Designing self-service pages where users can control their data can be a huge step forward for user privacy and consensual collection. Users can understand the data they’ve explicitly provided, the data you’ve gathered in the background based on their usage, and the ongoing ways that data is currently entering your systems. This encourages users to take an active and considered approach to their own privacy and allows users to refuse specific types of collection with an understanding of how that may affect their access.
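Behind a self-service page like this, the core operations are usually an export and a delete. The in-memory `UserDataStore` class below is a hypothetical sketch of those two operations, keeping explicitly provided and inferred data separate so users can see both. It is not a production design and ignores concerns such as authentication, backups, and copies held by partners.

```python
import json


class UserDataStore:
    """Toy in-memory store that separates explicit and implicit user data."""

    def __init__(self):
        self._explicit = {}  # user_id -> data the user provided directly
        self._implicit = {}  # user_id -> data inferred from usage

    def record_explicit(self, user_id, key, value):
        self._explicit.setdefault(user_id, {})[key] = value

    def record_implicit(self, user_id, key, value):
        self._implicit.setdefault(user_id, {})[key] = value

    def export(self, user_id):
        """Return everything held about a user as a JSON document."""
        return json.dumps(
            {
                "explicit": self._explicit.get(user_id, {}),
                "implicit": self._implicit.get(user_id, {}),
            },
            sort_keys=True,
        )

    def delete(self, user_id):
        """Remove all data for a user. Returns True if anything was removed."""
        had_explicit = self._explicit.pop(user_id, None) is not None
        had_implicit = self._implicit.pop(user_id, None) is not None
        return had_explicit or had_implicit


store = UserDataStore()
store.record_explicit("u1", "email", "ada@example.com")
store.record_implicit("u1", "segment", "night_owl")
print(store.export("u1"))
```

Showing inferred data next to provided data in the export is a design choice that helps users understand what has been gathered in the background, not only what they typed into a form.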
Access to data about your visitors and customers can help focus your work and align your strategies in ways that would be difficult otherwise. By being thoughtful about the type of information that would be most valuable for your products and exploring options to help you gain that insight, you can answer difficult questions, make better decisions, and serve your users better.
Data stewardship must be taken seriously precisely because user data is so valuable and has so many potential uses. Making visitors choose between protecting their privacy and using your services is an undesirable situation for both parties, so it is important to temper your reliance on user data with responsible policies on collection, use, sharing, and security.
Note: The information in this article is provided for informational purposes only and should not be construed as legal advice. Please consult a legal professional to understand your full responsibilities.