Is data anonymous when we remove personal identifiers?

Sharing data for the purposes of data analysis can have huge benefits. The question is how to do so in a way that protects privacy.

This is a question we get all the time, and answering it requires some knowledge on the legal-tech side.

We all know: anonymisation is a critical piece of the digital health landscape and really hard to achieve. Sharing health data for the purposes of data analysis, product improvements, and research can have huge benefits.
The question is how to do so in a way that protects individual privacy but still ensures that the data is of sufficient quality and that the analytics are useful.

When personal identifiers are removed from a dataset, the result can be considered de-identified data, but it may not always be fully anonymous.

It's important to note that the level of anonymisation necessary will depend on the sensitivity of the data, the purpose for which it will be used, and the relevant legal and ethical requirements.

Why is it so important?

Because if the data is not anonymous, you will need some legal basis (like consent) to share or use it.

What are personal identifiers?

Personal Identifiers (PID) are a subset of personal data which identify an individual and can permit another person to “assume” that individual's identity, such as:

  • Name and identifying information (e.g., date of birth, titles, ID numbers).
  • Contact data (e.g., mailing and e-mail address, phone number).
  • Biometric and health data (e.g., height, weight, hair color, genetic fingerprint, medical conditions, drug use).
  • Psychological information (e.g., political opinions, religious or ideological convictions, desires, attitudes, beliefs, legal competence).
  • Connections and relationships (e.g., friends and relations, employers).
  • Other data (e.g., location data, usage data, activities, statements, value judgments, career, banking information, etc.).

Data is pseudonymised (or de-identified) when it doesn’t contain explicit personal data but only unique references to it. Pseudonymisation is a good and relatively easy-to-manage security technique to make sensitive health data less explicit while still linking it to a physical subject.
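As a minimal sketch (assuming a toy patient table with made-up values and a hypothetical SECRET_KEY that would in practice be stored separately and securely), pseudonymisation can be as simple as replacing the explicit identifiers with keyed tokens:

```python
import hmac
import hashlib

# Hypothetical secret key: whoever holds it (or the original table) can
# re-link the pseudonyms to real identities, so keep it stored separately.
SECRET_KEY = b"replace-with-a-securely-stored-key"

def pseudonym(patient_id: str) -> str:
    """Replace an explicit identifier with a keyed, hard-to-reverse token."""
    return hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256).hexdigest()[:16]

# Toy records with explicit personal data (all values are made up).
records = [
    {"patient_id": "IT-000123", "name": "Maria Rossi", "diagnosis": "asthma"},
    {"patient_id": "IT-000456", "name": "Luca Bianchi", "diagnosis": "diabetes"},
]

# Drop the explicit identifiers and keep only a pseudonym that still links
# each row back to one physical subject.
pseudonymised = [
    {"pseudonym": pseudonym(r["patient_id"]), "diagnosis": r["diagnosis"]}
    for r in records
]

print(pseudonymised)
```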

Remember that pseudonymised data are still personal data according to the GDPR.

What is anonymisation?

Just briefly: anonymisation is the process of removing personal data and then treating the remaining data to remove indirect identifiers. There must be nothing left, no piece of information, that links the data back to a patient.

Anonymisation is relevant when health data is used for secondary purposes. Secondary purposes are generally understood to be purposes that are not related to providing patient care, so things such as research, healthcare planning, and marketing would be considered secondary purposes.

                                                             ⬇️ ⬇️ ⬇️

🔎 Secondary purpose: the possibility of re-use of health data that were collected initially in the context of providing care but which may later be re-used for another purpose. It may be exercised by public entities (including universities and public health laboratories for research purposes), regulators, med-tech companies, and small pharma.

The standard techniques to perform anonymisation include the following (sketched in code right after this list):

➡️ Generalisation
➡️ Swapping
➡️ Perturbation
➡️ Aggregation
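
To make these techniques less abstract, here is a minimal, hypothetical sketch using pandas on a toy dataset (all values are made up; a real project would also need a formal re-identification risk assessment):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy dataset with quasi-identifiers (hypothetical values).
df = pd.DataFrame({
    "age":       [34, 37, 52, 58, 61, 45],
    "zip_code":  ["38121", "38122", "39100", "39012", "38068", "38121"],
    "condition": ["asthma", "asthma", "diabetes", "diabetes", "asthma", "diabetes"],
})

# Generalisation: replace precise values with broader categories.
df["age_band"] = pd.cut(df["age"], bins=[0, 40, 60, 120], labels=["<40", "40-60", "60+"])
df["zip_area"] = df["zip_code"].str[:3] + "xx"

# Swapping: exchange attribute values between records, so no row keeps
# its original combination of attributes.
df["condition_swapped"] = rng.permutation(df["condition"].to_numpy())

# Perturbation: add small random noise to numeric values.
df["age_noisy"] = df["age"] + rng.integers(-2, 3, size=len(df))

# Aggregation: publish only group-level counts, never individual rows.
aggregated = (
    df.groupby(["age_band", "condition"], observed=True)
      .size()
      .reset_index(name="count")
)

print(aggregated)
```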

Keep in mind that anonymous data is not covered by GDPR.

In healthcare, anonymisation allows health information to be shared when it is not mandated or practical to obtain consent, or when the sharing is discretionary and the data controller does not want to share identifiable data.

Keep in mind that full anonymisation is very hard to achieve. The point is this: before the data became anonymous, it was either personal or pseudonymised data, and that data was still under your responsibility as a data controller.

Turning pseudonymised data into anonymous data is itself considered a processing activity. This means you still have obligations under the GDPR: you need consent or another proper legal basis for that processing.

To guarantee that the data is anonymous, you have to be able to guarantee that it cannot be re-identified, even by combining it with any other publicly available dataset.

If you are interested in learning more about this, we talked about the legal basis for anonymisation here!

What is de-identified data?


De-identified data refers to information that has had personal identifiers removed, such as names, addresses, and Social Security numbers. However, even after removing personal identifiers, it may still be possible to re-identify individuals through the remaining data, especially if it contains other sensitive information or is combined with other datasets. This is known as re-identification risk.

While de-identification and anonymisation both aim to remove key identifiers from data, they take different approaches that result in different outcomes.

De-identification is an important capability. It looks at a single record and removes sensitive information, such as the person's name or social security number, so that outsiders can't tell who it is. What's considered sensitive depends on the use case; in clinical trials, it could be a patient's current health information or medical history.

De-identification involves protecting fields covering things like demographics and individuals’ socioeconomic information. This can be useful if you are training an ML model.
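
As a rough sketch of that field-level approach (the column names and the list of direct identifiers below are hypothetical and depend on the use case), de-identification for model training can look like this:

```python
import pandas as pd

# Toy clinical dataset mixing direct identifiers and analytical fields (made-up values).
df = pd.DataFrame({
    "name":        ["Maria Rossi", "Luca Bianchi"],
    "ssn":         ["123-45-6789", "987-65-4321"],
    "email":       ["maria@example.com", "luca@example.com"],
    "age":         [54, 61],
    "sex":         ["F", "M"],
    "income_band": ["medium", "high"],
    "outcome":     [1, 0],
})

# What counts as a direct identifier depends on the use case.
DIRECT_IDENTIFIERS = ["name", "ssn", "email"]

# De-identification: strip the direct identifiers, keep demographic and
# socioeconomic fields as model features.
training_data = df.drop(columns=DIRECT_IDENTIFIERS)

print(training_data)  # usable for ML training, but not anonymous by itself
```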

So, is de-identified data anonymous?


As we can understand, de-identification is part of the anonymisation process, BUT it is not anonymisation according to the GDPR (although it is useful for data minimisation).

In fact, de-identification only sometimes succeeds in anonymising data, because there are so many other data sources out in the world that still contain information that can help re-identify individuals. Re-identification thus remains possible. This is not anonymisation, because the data is still pseudonymous.
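
A toy, hypothetical example of such a linkage: if the "de-identified" release still contains quasi-identifiers like birth year, sex, and postcode, a simple join against a public dataset that carries names is enough to re-identify people (all values below are made up):

```python
import pandas as pd

# "De-identified" health data: names removed, quasi-identifiers kept.
health = pd.DataFrame({
    "birth_year": [1956, 1972],
    "sex":        ["F", "M"],
    "zip_code":   ["38121", "39100"],
    "diagnosis":  ["diabetes", "asthma"],
})

# Publicly available data (e.g. a public register or social profile) with names.
public = pd.DataFrame({
    "name":       ["Maria Rossi", "Luca Bianchi"],
    "birth_year": [1956, 1972],
    "sex":        ["F", "M"],
    "zip_code":   ["38121", "39100"],
})

# Joining on the shared quasi-identifiers re-attaches names to diagnoses.
re_identified = health.merge(public, on=["birth_year", "sex", "zip_code"])
print(re_identified[["name", "diagnosis"]])
```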

On the other hand, with truly anonymised data this is not possible, no matter what other information you have at hand!

So, if you plan to perform a de-identification process on your data set, remember that you will still need consent (or another legal basis) to process those data!

What about aggregated data? Are they anonymous?

The answer is: technically yes, but it depends. Let's see why:

➡️ Aggregated data can be anonymous, but it depends on the level of aggregation and the data used. Aggregated data refers to data that has been combined or summarised from individual-level data to provide insights at a higher level.

➡️ If the level of aggregation is high enough and there is enough variability in the data, it can be difficult or even impossible to identify individuals from the aggregated data. On the other hand, if the level of aggregation is low or there are only a few data points, it may be possible to re-identify individuals.

➡️ In addition, even if the aggregated data is anonymous, there may still be privacy concerns if the data is sensitive.

For these reasons, we suggest you conduct an anonymisation assessment to protect yourself.
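
As part of such an assessment, one simple sanity check on aggregated releases is to publish only group-level counts and suppress small groups. The sketch below is hypothetical, and the threshold of 5 is an illustrative choice, not a legal standard:

```python
import pandas as pd

# Toy individual-level data (made-up values).
df = pd.DataFrame({
    "age_band":  ["60+"] * 6 + ["40-60"] * 2 + ["<40"],
    "region":    ["North"] * 6 + ["South"] * 3,
    "condition": ["asthma"] * 9,
})

MIN_CELL_SIZE = 5  # illustrative suppression threshold

# Aggregate to group-level counts.
counts = (
    df.groupby(["age_band", "region", "condition"])
      .size()
      .reset_index(name="count")
)

# Suppress small cells: tiny groups make individuals easy to single out.
released = counts[counts["count"] >= MIN_CELL_SIZE]

print(released)
```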


Use case: Is it better to de-identify or anonymise data for clinical trials?


Under the HIPAA safe harbor method, companies and hospitals must remove a host of potential identifiers, including names, email addresses, IP addresses, Social Security numbers, patient IDs, and biometric identifiers.

Expert determination, meanwhile, requires the evaluation of de-identification techniques by someone with knowledge and experience in this area, to verify that the overall risk of re-identification is small. Under the GDPR, by contrast, anonymisation must ensure that an individual's personal data cannot be reconstructed and used.

In practice, de-identification and anonymisation help clear the way for improved clinical trial speed without sacrificing patient privacy. It’s worth noting, however, that regulatory obligations are a moving target: While HIPAA currently requires de-identification, this may change as larger and larger data sets are leveraged to inform new healthcare efforts.

De-identification is now considered the “base expectation” for data handling in clinical trials. However, it is important to ensure that these data are collected and used in a way that protects patient privacy and maintains data security.

As a rule of thumb, never use anonymisation as a way to avoid consent for data processing. You can't say "we don't need consent" or "we don't need to comply with the GDPR" simply because you declare that the data is anonymous. Carry out a proper anonymisation assessment, and keep in mind that declaring the output anonymous does not make the input anonymous: the original data remains personal data, and the processing that anonymises it still needs a legal basis.

What should you do to set up a clinical trial?

These are some simple steps to follow if you want to set up a clinical study and share the results in a paper (with the peace of mind of having done everything to ensure data privacy).

🟢 Ask for consent, giving all the information about the organisations involved in the study.

🟢 Inform participants that the institutions involved in the study may process their data for the same purpose.

🟢 Explain that the study will share those data (in an aggregated form) with the public, and that those data are not re-identifiable (because you have made every effort to make them unidentifiable).

🟢 An anonymisation assessment is not strictly necessary, but as a matter of accountability it is recommended.

Even if you assume you have anonymous data at the end of the process, you still have obligations under GDPR.

Chino.io: your trusted compliance partner


The one-stop shop for solving all privacy and security compliance aspects.

As a partner of our clients, we combine regulatory and technical expertise with a modular IT platform that allows digital applications to eliminate compliance risks and save costs and time.

Chino.io makes compliant-by-design innovation happen faster, combining legal know-how and data security technology for innovators.

To learn more, book a call with our experts.

Talk to an expert