Data Anonymization: Protecting Your Customers and Your Brand

Carlos Perdigão - July 31, 2020 - 9 min read

The mass collection of data is an unstoppable trend. Vast data collection, previously the domain of enterprise-level organizations or governments, is now a passive and ongoing task built into the average business.

Something as trivial as a visit to your homepage now generates hundreds of data points. As the visitor connects, your servers log the network activity, then your back-end servers connect to third-party or internal services and cache data at each node. The client discovers a piece of personally identifying information (PII) in a cookie and uses it to recover additional account-specific data. Meanwhile, an array of client-side marketing tools fires up and sends data back to home base.

The result is a cascading wave of data transmissions that ripples across a network of hundreds of interconnected providers, like a single human thought igniting a fibrous network of neurons.

With so many digital touch points, it’s no surprise that the IDC, a market intelligence company, found that human activity had generated 18 zettabytes of data by 2018. And they predicted we’re on track to generate 175 zettabytes by 2025.

If you’re wondering how many bytes are in a zettabyte, the answer is a 1 followed by 21 zeroes (1 trillion gigabytes).

With Much Data, Comes Much Risk

In 2018, the European Union began enforcing the General Data Protection Regulation (GDPR) to stem the negligent manner in which companies retained and unwittingly exposed PII to malicious actors.

The GDPR, among other things, required companies to anonymize or remove from their passive data collections any personally identifying information: names, addresses, IPs, phone numbers, birth dates, and beyond.

Unfortunately, the guidelines proved difficult to follow; since the GDPR’s inception, the EU has fined organizations a total of €158 Million in penalties. But government fines are only the beginning of the financial risks assumed by companies who collect PII en masse.

One of Europe’s most popular airlines, UK-based EasyJet Plc, is facing a lawsuit from 10,000+ victims of data theft, each seeking £2,000 in damages. With roughly 9 million customers affected by the breach, the potential liability is worth upwards of £18,000,000,000.

Across the pond, US-based companies are faring no better. Yahoo!, Walmart, Salesforce, and others face multi-million dollar settlements all related to data breaches that led to the improper exposure of PII.

But even if EasyJet, Walmart, Salesforce, and the rest shake off their lawsuits, they will suffer lost business due to consumer mistrust.

According to research conducted across the Eurozone by PCI Pal, a payments compliance company, of every 10 customers compromised by a data breach, 4 will cut ties with that business.

And in the United States, a study conducted by Ketchum revealed that 48% of US consumers do not believe companies which claim to protect consumer data, and an astounding 25% do not trust any industry with their personal information.

Once lost, trust is nearly impossible to recover. It’s best to do everything in your power to avoid losing that trust in the first place.

Careful Precautions

A responsible CIO may then decide to reinforce her organization’s external defenses: enhance firewalls, obfuscate code, encrypt data in transit and at rest — that should do it, right?

Occasionally, those we trust implicitly with access to sensitive information are the same individuals who breach that trust.

Late last year, Google fired 4 employees for accessing and distributing confidential employee data. At Ring, an IoT doorbell company and subsidiary of Amazon, 4 employees were fired at the beginning of 2020 for improperly accessing live video feeds from customer devices.

Around the same time, Beaumont Health, a network of 8 hospitals that employs over 38,000 people, fired an employee for exposing the information of nearly 1,200 patients.

Not only are we liable for protecting data as it enters and exits our information facilities, we must take further steps to prevent unauthorized access from within.

One Step at a Time

Data breaches appear to be growing in both frequency and severity. To protect customers, and ultimately ourselves, upgrading external defenses is a start, but it only reduces the number of attack vectors and ignores the heart of the matter: the data itself.

Precise and accurate PII data, at rest in our virtual warehouses, is a risk by virtue of its existence.

Hypothetically, if we have no valuable data to leak, a breach would do as much harm as a thief cracking an empty safe.

But how can we, in the day-to-day operations of our business, leave our safes empty while retaining access to critical information? We begin by following some best practices.

1. Only Store What You Need

An unused piece of PII data, resting in your database, is a resting liability.

If you sell products that consumers purchase on average every four years, such as high-end televisions, do you really improve the customer experience by retaining their credit card information?

And if you plan to market exclusively through email, what do you gain by storing customer last names and complete mailing addresses?

By simply refusing to collect unnecessary data, you protect yourself and your customers. However, shrewd executives may say, “Data we don’t have is data we can’t act on,” and opt to collect as much PII as they can wrench from the customer’s hands.

2. Out With The Old

At the moment of transaction and until the customer receives their products, retaining their address and other relevant PII can be critical to the smooth functioning of your business.

But one month after? Six months after? Five years after?

Protect yourself and your consumers by deleting data once it has served its purpose.
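As a rough illustration, a scheduled job could blank out PII once an order is older than a retention window. The sketch below is only one way to do it: the field names (`customer_name`, `mailing_address`, `phone`, `delivered_at`) and the 90-day default are assumptions made for the example, not a recommendation for any particular schema or legal requirement.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List

# Hypothetical field names; adapt them to your own schema.
PII_FIELDS = ("customer_name", "mailing_address", "phone")


def purge_expired_pii(
    orders: Iterable[Dict[str, object]],
    now: datetime,
    retention: timedelta = timedelta(days=90),
) -> List[Dict[str, object]]:
    """Return copies of the orders with PII blanked once the retention window has passed."""
    cutoff = now - retention
    purged = []
    for order in orders:
        delivered_at = order.get("delivered_at")
        # Only delivered orders older than the cutoff are considered expired.
        expired = isinstance(delivered_at, datetime) and delivered_at < cutoff
        purged.append(
            {
                key: (None if expired and key in PII_FIELDS else value)
                for key, value in order.items()
            }
        )
    return purged
```

Non-PII fields such as the order ID survive, so sales reporting keeps working after the personal details are gone.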

3. Databases, Not Spreadsheets

While easy to share and well-understood by a modern workforce, spreadsheets which retain customer or employee PII are ticking row-column time-bombs.

By converting your spreadsheets into internal applications, you gain a host of security controls that include encryption and strict access management.

4. Avoid Migrations for Testing Purposes

Another common task which places PII at risk is the duplication of production databases into testing environments.

Avoid this practice where you can. Instead, create specific test cases and generate synthetic records that resemble those of the production database (see Synthesis below).

If it is unavoidable (that is, your developers refuse to build synthetic test data), mutate the PII within the data set in such a way that allows it to retain its testing value while securing private information.

5. Assess Application & Code-Level Security

Depending on their level of education, your developers may lack the requisite skill or a principled concern for data security.

I’ve personally witnessed authentication requests that opted to pass the user’s unencrypted credentials via URL. If you’re wondering why that’s a problem, you’ve come to the right place.

Routinely assess the security practices followed by your engineers, retrain them as new techniques become available, and upgrade your dependencies as their security strengthens.

Learn more by reading OWASP’s top 10 web application security risks.

But even after following these precautions (storing only what you need, regularly deleting what you don’t, isolating your PII and governing its access), you may still retain thousands, if not hundreds of thousands, of PII data points for business purposes.

To secure them in the event of a data breach, you can take an additional and critical step.

Data Anonymization

Data anonymization is the application of one or more data-manipulation techniques which compromise the precision of a data set, thereby securing the anonymity of the individuals it represents, while retaining some or all of its business value.

Practically speaking, these techniques apply to big data upon which specialists use statistical analysis and machine learning to generate predictions or discover correlations — outside of these applications, these techniques may be of limited use, but can be applied in nuanced ways throughout your software ecosystem.

A sample of these techniques is found below. For more details you can also read this great article from Imperva.

Masking

Masking is familiar from password fields, where the input masks each character you enter with an ‘*’ symbol.

This swap-and-replace technique allows you to replicate a true data source for testing or training purposes while maintaining user security.

For example, you may modify the email addresses in a duplicated ‘email’ column by swapping the domain name with `email.fake`, thereby obfuscating the true domain name of each individual email address.

This technique is ideal for situations in which your data may appear outwardly false without sacrificing its purpose.
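A minimal Python sketch of both flavors of masking follows. The function names and the choice to keep the last four characters visible are assumptions made for illustration; only the `email.fake` domain comes from the example above.

```python
def mask_email_domain(email: str, fake_domain: str = "email.fake") -> str:
    """Replace the domain of an email address with a placeholder domain."""
    local_part, separator, _domain = email.rpartition("@")
    if not separator or not local_part:
        # Fail loudly instead of passing an unmasked value through.
        raise ValueError("value does not look like an email address")
    return f"{local_part}@{fake_domain}"


def mask_characters(value: str, visible: int = 4, symbol: str = "*") -> str:
    """Hide all but the last `visible` characters, much as a password field would."""
    if visible <= 0:
        return symbol * len(value)
    hidden = max(len(value) - visible, 0)
    return symbol * hidden + value[hidden:]
```

Note that swapping the domain leaves the local part of the address intact, so this is best paired with another technique when the local part is itself identifying.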

Pseudonymization

This process modifies the original data set by swapping personally-identifying information with pre-generated misinformation; pseudonymization retains the statistical integrity of a data set.

For example, in a first name column, you swap all names that begin with ‘J’ with either ‘Alexis’ or ‘Stephen.’

This process allows the data to retain a semblance of precision for situations which demand such (presenting the names on-screen to end-users, for example).
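One possible way to implement this is to map each real name to an entry in a pre-generated pool using a keyed hash, so the same input always yields the same pseudonym. This is a sketch under assumptions: the pool contents, the function name, and the use of HMAC-SHA256 are illustrative choices, not the only way to pseudonymize.

```python
import hashlib
import hmac
from typing import Sequence

# Hypothetical pool of pre-generated replacement names.
PSEUDONYMS = ("Alexis", "Stephen", "Morgan", "Riley", "Jordan", "Casey")


def pseudonymize_name(
    name: str, secret_key: bytes, pool: Sequence[str] = PSEUDONYMS
) -> str:
    """Deterministically map a real name to a pseudonym from the pool."""
    normalized = name.strip().lower().encode("utf-8")
    # A keyed hash keeps the mapping stable but hard to rebuild without the key.
    digest = hmac.new(secret_key, normalized, hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]
```

Because the mapping is deterministic, joins and counts on the pseudonymized column still line up across tables. Keep the key out of the anonymized environment, since anyone who holds it can test guesses against the data.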

Generalization

This technique weakens the statistical accuracy of a data set, but helps secure the privacy of the persons represented. Specifically, generalization removes precision from a data set.

For example, you may generalize a database featuring complete mailing addresses by removing the street number, or the street name, or even the town, so long as you avoid degrading the statistical integrity below the point of usefulness.

Google employs data generalization prior to sharing PII across its various services.
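The address example could be sketched as follows, assuming each address is stored as a dictionary whose fields run from most to least specific. The field names and the `drop_levels` parameter are hypothetical.

```python
from typing import Dict, Mapping

# Hypothetical address fields, ordered from most to least specific.
ADDRESS_FIELDS = ("street_number", "street_name", "town", "region", "country")


def generalize_address(address: Mapping[str, str], drop_levels: int) -> Dict[str, str]:
    """Return a copy of the address without its `drop_levels` most specific fields."""
    if not 0 <= drop_levels <= len(ADDRESS_FIELDS):
        raise ValueError("drop_levels is out of range")
    kept = ADDRESS_FIELDS[drop_levels:]
    return {field: address[field] for field in kept if field in address}
```

Raising `drop_levels` trades precision for privacy: 1 removes the street number, 2 also removes the street name, and 3 removes the town as well.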

Permutation

Otherwise known as ‘data swapping’ or ‘shuffling,’ permutation is the process by which column values in a database are reassigned to new rows.

For example, the first name field of row zero swaps with the first name field of row 10, row one swaps with row 33, and so on, until every value in the column has moved from its original position.
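Here is a small sketch of column shuffling over a list of row dictionaries. Rather than pairwise swaps, it rotates values along a randomly ordered cycle, which is one way (an assumption of this example) to guarantee that no row keeps the value it started with when there are at least two rows.

```python
import random
from typing import Dict, List, Optional, Sequence


def permute_column(
    rows: Sequence[Dict[str, object]], column: str, seed: Optional[int] = None
) -> List[Dict[str, object]]:
    """Return copies of the rows with one column's values reassigned to other rows."""
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)
    shuffled = [dict(row) for row in rows]
    # Each row hands its value to the next row in the shuffled cycle.
    for position, source in enumerate(order):
        target = order[(position + 1) % len(order)]
        shuffled[target][column] = rows[source][column]
    return shuffled
```

The column keeps exactly the same set of values, so frequency counts are unchanged, while the link between a value and the rest of its row is broken.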

Perturbation

Simply put, perturbation is the rounding, multiplication, or slight computational adjustment of numerical data by a standard quantum.

To perturb the birth date of individuals, you may multiply their birth month (1-12) by a random factor, then perform a modulus operation to determine a new, perturbed month of birth.

Similarly, you can add (or subtract) a small constant to their birth year such that your statistical models retain relative accuracy while securing privacy.
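The birth date example might look like the sketch below. The function names, the factor range of 2 to 11, and the default offset of 2 years are assumptions chosen for illustration.

```python
import random


def perturb_birth_month(month: int, rng: random.Random) -> int:
    """Multiply the month (1-12) by a random factor, then wrap it back into 1-12."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    factor = rng.randint(2, 11)
    return (month * factor) % 12 + 1


def perturb_birth_year(year: int, offset: int = 2) -> int:
    """Shift every birth year by the same small constant, preserving age gaps."""
    return year + offset


def round_to_quantum(value: float, quantum: float) -> float:
    """Round a numeric value to the nearest multiple of `quantum`."""
    return round(value / quantum) * quantum
```

A constant offset keeps differences between records intact, which is why models retain relative accuracy. It is also trivial to reverse for anyone who learns the constant, so treat the offset as a secret.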

Synthesis

When all else fails, you can generate synthetic records which adhere to the patterns and standard deviations found in the original data set.

By performing statistical analysis on the true data set, you can build a synthetic data set that exhibits many behaviors found in the original.
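As a deliberately simple sketch, the function below measures the mean and standard deviation of one numeric column and samples synthetic values from a normal distribution with those parameters. Assuming the column is roughly normal is a strong simplification made for this example; real data sets usually need a richer model that also preserves correlations between columns.

```python
import random
import statistics
from typing import List, Optional, Sequence


def synthesize_column(
    real_values: Sequence[float], count: int, seed: Optional[int] = None
) -> List[float]:
    """Generate `count` synthetic values matching the mean and spread of the real ones."""
    if len(real_values) < 2:
        raise ValueError("need at least two real values to estimate the spread")
    rng = random.Random(seed)
    mean = statistics.fmean(real_values)
    stdev = statistics.stdev(real_values)
    # Assumes an approximately normal distribution, which may not hold for your data.
    return [rng.gauss(mean, stdev) for _ in range(count)]
```

None of the generated values belongs to a real person, yet aggregate statistics computed on the synthetic column stay close to those of the original.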

Combining two or more of these techniques further reduces the chances that malicious actors may deduce valuable, accurate PII from pilfered data.

How and When to Anonymize

Data anonymization is often an intermediary step which takes place during a data transaction at the application layer.

For example, a back-end staging server may query the production database, anonymize the data in transit, cache a local anonymized copy, and reply to a front-end staging server with the anonymized result — the front-end being none the wiser.

To enable this behavior, your developers must build and deploy nuanced functions that apply these manipulations at runtime.

Crafting these functions may be time consuming and, if they are not well documented, challenging to unravel as staging software moves to the production environment.
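To give a feel for what such a function involves, here is a minimal sketch of a staging-side wrapper that queries production, anonymizes each row in transit, and caches the anonymized copy. Everything here is hypothetical: `run_query` stands in for whatever database client you use, and `rules` maps column names to anonymizing functions of your choosing.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


def make_anonymizing_query(
    run_query: Callable[[str], Iterable[Row]],
    rules: Dict[str, Callable[[object], object]],
) -> Callable[[str], List[Row]]:
    """Wrap a query function so callers only ever see anonymized rows."""
    cache: Dict[str, List[Row]] = {}

    def query(sql: str) -> List[Row]:
        if sql not in cache:
            # Anonymize in transit; raw rows are never stored on the staging side.
            cache[sql] = [
                {
                    column: rules[column](value) if column in rules else value
                    for column, value in row.items()
                }
                for row in run_query(sql)
            ]
        # Hand out copies so callers cannot alter the cached result.
        return [dict(row) for row in cache[sql]]

    return query
```

The front-end calls the returned `query` function exactly as it would call the original, which is what keeps it none the wiser.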

Low-Code Data Anonymization

In my tech talk, you will discover details of the techniques I’ve shared here and see how a low-code platform enables your developers to deploy data anonymization techniques in minutes.

If you’re unfamiliar with the term, a low-code development platform is a tool which enables developers to build backend, frontend, databases, and integrations within a single visual development environment.

The benefits of low-code take the form of speed, code quality, compliance, consistency, and platform reliability.

I encourage you to watch our talk, Data Anonymization: Best Practices for Handling Sensitive Data, in which we demonstrate how developers may mask, mutate, and synthesize data to enhance its anonymity all within the comfort of a drag-and-drop low-code environment.

Carlos Perdigão

After years in the government and financial sectors, struggling to achieve the delivery speed demanded by the business, Carlos found in low-code the solution for the IT challenges he had experienced. With a Master's Degree in Computer Science and Engineering, Carlos is a Solution Architect at OutSystems focused on the Southern Europe area and on a mission to share his experience with everyone.
