Anonymizing Quantitative Data

Overview

Exercises: 30 min
Questions
  • Why do we need to anonymise data?

  • How do we anonymise data?

Objectives
  • Gain an understanding of how to anonymise data

  • Create a checklist of anonymisation steps for quantitative data

What is anonymisation?

Anonymisation is the process of turning data from which individual people can be identified into data from which individual people cannot be identified.

Pseudonymisation is the process of turning data from which individual people can be identified into data from which individual people can only be identified using other, non-shared information.
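The difference between the two definitions can be made concrete with a minimal sketch of pseudonymisation. It assumes the data are a list of dictionaries with a single identifying field; the function name `pseudonymise`, the field names, and the example records are all hypothetical. The key table that links real identifiers to codes is the "other, non-shared information": whoever holds it can re-identify people, so it must be stored separately from the shared data.

```python
import secrets


def pseudonymise(records, id_field):
    """Replace `id_field` with a random code.

    Returns (shared_records, key_table). The key table maps each real
    identifier to its code and must be kept separate and never shared.
    """
    key_table = {}
    shared = []
    for record in records:
        real_id = record[id_field]
        if real_id not in key_table:
            # Random code, not derived from the identifier itself
            code = "P" + secrets.token_hex(4)
            while code in key_table.values():  # guard against a rare collision
                code = "P" + secrets.token_hex(4)
            key_table[real_id] = code
        # Copy the record so the original data are left untouched
        pseudonymised = dict(record)
        pseudonymised[id_field] = key_table[real_id]
        shared.append(pseudonymised)
    return shared, key_table


# Hypothetical example data
records = [
    {"name": "Alice Example", "score": 12},
    {"name": "Bob Example", "score": 7},
    {"name": "Alice Example", "score": 15},
]
shared, key_table = pseudonymise(records, "name")
```

In this sketch the same person always receives the same code, so repeated measurements can still be linked. Destroying the key table removes that route back to the names, but the remaining fields may still allow identification, so this step alone does not make the data anonymous.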

How might we identify people in a dataset?

Discussion 10 min

How easy might it be to identify individual people in the datasets below?
What data might need to be included to allow us to identify individual people?

  • Chimpanzee community members and their relationships
  • National Census Data
  • One week of smartphone location-tracking data
  • Advertising preference information for students in a university year group
  • Advertising preference information for all students nationwide
  • Behavioural data for a task conducted on Amazon Mechanical Turk
  • Genome-wide association study data
  • Longitudinal study of mental health of adopted children
  • Structural MRI and depression screening questionnaire

Building a checklist

Activity 20 min

Divide the resources below among the group, and make a checklist of things to consider when anonymising or pseudonymising data. How would you go about doing this in your own projects, and at what stage?

As you go along, work together in the collaborative editing document to build the checklist. If your needs are incompatible with someone else’s, add conditional items to the checklist or create a separate copy for each approach.

Resources

Location data

Location data make people especially identifiable. You can use the tool at https://cpg.doc.ic.ac.uk/individual-risk/ to explore this. Enter some details and scroll down on the second page to where you can add extra attributes. What happens to the identifiability when you check or uncheck the postcode field?
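The effect of adding a location field can also be illustrated with a simple count. The sketch below is not the model used by the tool above. It only measures, for a small made-up table, what fraction of records have a combination of attributes that nobody else shares. The function name `fraction_unique`, the field names, and the postcodes are hypothetical.

```python
from collections import Counter


def fraction_unique(records, fields):
    """Fraction of records whose combination of `fields` appears only once."""
    combos = [tuple(record[f] for f in fields) for record in records]
    counts = Counter(combos)
    return sum(1 for combo in combos if counts[combo] == 1) / len(combos)


# Made-up records for illustration only
people = [
    {"birth_year": 1990, "gender": "F", "postcode": "AB1 2CD"},
    {"birth_year": 1990, "gender": "F", "postcode": "AB1 3EF"},
    {"birth_year": 1990, "gender": "M", "postcode": "AB1 2CD"},
    {"birth_year": 1990, "gender": "M", "postcode": "XY9 8ZW"},
    {"birth_year": 1985, "gender": "F", "postcode": "AB1 2CD"},
    {"birth_year": 1985, "gender": "F", "postcode": "XY9 8ZW"},
]

without_postcode = fraction_unique(people, ["birth_year", "gender"])
with_postcode = fraction_unique(people, ["birth_year", "gender", "postcode"])
print(without_postcode, with_postcode)  # 0.0 1.0
```

In this toy table nobody is unique on birth year and gender alone, but everyone becomes unique once the postcode is included. Real datasets will be less clear-cut, but the same count is a quick way to see which fields drive identifiability.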

Key Points

  • Anonymised data are easier to share legally

  • Remove direct identifiers

  • Reduce precision to stop outliers leading to identification

  • Consider the potential for reidentification when fields are cross-tabulated
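As a starting point for the checklist, the last three key points can be sketched in code. This is one possible approach under stated assumptions, not a complete anonymisation procedure: the list of direct identifiers, the 10-year age bands, the top band of 80 and over, and all function and field names are hypothetical choices that would need adapting to a real dataset.

```python
from collections import Counter

# Assumption: these are the direct identifiers in this made-up dataset
DIRECT_IDENTIFIERS = {"name", "email"}


def drop_direct_identifiers(record):
    """Return a copy of the record without the direct identifier fields."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}


def age_band(age, width=10, top=80):
    """Coarsen an age into a band; ages at or above `top` share one band."""
    if age >= top:
        return f"{top}+"
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def smallest_group(records, fields):
    """Size of the smallest group when records are cross-tabulated by `fields`."""
    counts = Counter(tuple(record[f] for f in fields) for record in records)
    return min(counts.values())


# Made-up records for illustration only
raw = [
    {"name": "A", "email": "a@example.org", "age": 34, "region": "North", "score": 3},
    {"name": "B", "email": "b@example.org", "age": 37, "region": "North", "score": 5},
    {"name": "C", "email": "c@example.org", "age": 91, "region": "South", "score": 2},
    {"name": "D", "email": "d@example.org", "age": 84, "region": "South", "score": 4},
]

cleaned = []
for record in raw:
    out = drop_direct_identifiers(record)
    out["age"] = age_band(out["age"])  # reduce precision, top-code outliers
    cleaned.append(out)

print(smallest_group(raw, ["age", "region"]))      # 1
print(smallest_group(cleaned, ["age", "region"]))  # 2
```

A smallest group of 1 means at least one person is unique on those fields. After banding, every combination of age band and region here is shared by two people. How large the smallest group needs to be is a judgement that depends on the data and on who will have access to them.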