Anonymizing Quantitative Data
Overview
Exercises: 30 min
Questions
Why do we need to anonymize data?
How do we anonymize data?
Objectives
Gain an understanding of how to anonymise data
Create a checklist of anonymisation steps for quantitative data
What is anonymisation?
Anonymisation is the process of turning data from which individual people can be identified into data from which individual people cannot be identified.
Pseudonymisation is the process of turning data from which individual people can be identified into data from which individual people can only be identified using other, non-shared information.
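The difference between the two definitions can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the function name `pseudonymise`, the field name `name`, and the example records are all made up for this page. It replaces an identifying field with a random code and returns a separate key table. If the key table is kept private, the shared records are pseudonymised; only if the key table is destroyed, and no other field can identify anyone, would they approach being anonymised.

```python
import secrets


def pseudonymise(records, id_field):
    """Replace `id_field` with a random code.

    Returns (shared_records, key_table). The key table maps each original
    identifier to its code and is the "other, non-shared information":
    it must be stored separately and never shared with the data.
    """
    key_table = {}
    used_codes = set()
    shared_records = []
    for record in records:
        original = record[id_field]
        if original not in key_table:
            # Draw random codes until we get one not already in use.
            code = "P" + secrets.token_hex(4)
            while code in used_codes:
                code = "P" + secrets.token_hex(4)
            used_codes.add(code)
            key_table[original] = code
        new_record = dict(record)  # copy, so the input is left untouched
        new_record[id_field] = key_table[original]
        shared_records.append(new_record)
    return shared_records, key_table


# Hypothetical example data: two measurements for Ada, one for Ben.
records = [
    {"name": "Ada", "score": 12},
    {"name": "Ben", "score": 7},
    {"name": "Ada", "score": 15},
]
shared, key_table = pseudonymise(records, "name")
```

The same person receives the same code throughout, so repeated measures can still be linked, but the names themselves no longer appear in `shared`.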
How might we identify people in a dataset?
Discussion
10 min
How easy might it be to identify individual people in the datasets below?
What data might need to be included to allow us to identify individual people?
- Chimpanzee community members and their relationships
- National Census Data
- One week of smartphone location tracking data
- Advertising preference information for students in a university year group
- Advertising preference information for all students nationwide
- Behavioural data for a task conducted on Amazon Mechanical Turk
- Genome-wide association study data
- Longitudinal study of mental health of adopted children
- Structural MRI and depression screening questionnaire
Building a checklist
Activity
20 min
Divide up the resources below between the group, and make a checklist of things to consider when anonymising or pseudonymising data. How would you go about doing it in your projects, and at what stage?
As you go along, work together in the collaborative editing document to create a checklist. If you find your needs are incompatible with someone else’s, add some conditional items to the checklist or create a new copy for each approach.
Resources
- UK Data Service
- Finnish Social Science Data Archive
- UK Information Commissioner’s Office Guidance, Appendix 2
- Consortium of European Social Science Data Archives
- What else can you find?
  - Guidance from your institution
  - Guidance from your government
  - Papers from others in your field
  - Good old web search
Location data
Location data render people especially identifiable. You can use the tool at https://cpg.doc.ic.ac.uk/individual-risk/ to explore this: enter some details and scroll down on the second page to where you can add extra attributes. What happens to identifiability when you check or uncheck the postcode field?
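To see the same effect on a dataset of your own, one rough approach is to count how many records have a combination of attributes that nobody else shares. The sketch below is a toy count on made-up records, not the method used by the tool linked above; the function name `unique_fraction` and the field names are assumptions chosen for illustration.

```python
from collections import Counter


def unique_fraction(records, fields):
    """Fraction of records whose combination of `fields` occurs only once."""
    combos = Counter(tuple(r[f] for f in fields) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[f] for f in fields)] == 1)
    return unique / len(records)


# Hypothetical records: sex and birth year alone never single anyone out,
# but adding a postcode area makes two of the four people unique.
people = [
    {"sex": "F", "birth_year": 1990, "postcode": "AB1"},
    {"sex": "F", "birth_year": 1990, "postcode": "CD2"},
    {"sex": "M", "birth_year": 1985, "postcode": "AB1"},
    {"sex": "M", "birth_year": 1985, "postcode": "AB1"},
]

without_postcode = unique_fraction(people, ["sex", "birth_year"])
with_postcode = unique_fraction(people, ["sex", "birth_year", "postcode"])
print(without_postcode, with_postcode)  # prints: 0.0 0.5
```

A record that is unique within your dataset is not necessarily unique in the wider population, so treat this count as a warning sign rather than a measure of actual risk.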
Key Points
Anonymised data are easier to share legally.
Remove direct identifiers.
Reduce precision to stop outliers leading to identification.
Consider whether people could be re-identified by cross-tabulating fields.
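The last three points can be sketched in a few lines of Python. This is a minimal illustration, not a complete anonymisation procedure: the field names, the ten-year age bands, the top-coding threshold of 80, and the minimum cell size `k=2` are all assumptions chosen for the example, and the right choices will depend on your data and the guidance you follow.

```python
from collections import Counter

# Assumed names of the fields that identify a person directly.
DIRECT_IDENTIFIERS = {"name", "email"}


def drop_direct_identifiers(record):
    """Return a copy of the record without the direct identifier fields."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}


def band_age(age, width=10, top=80):
    """Coarsen an age into a band; ages of `top` or more are grouped together."""
    if age >= top:
        return f"{top}+"
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def small_cells(records, fields, k=2):
    """Cross-tabulate `fields` and return the combinations seen fewer than k times."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical raw data.
raw = [
    {"name": "Ada", "email": "ada@example.org", "age": 34, "region": "North"},
    {"name": "Ben", "email": "ben@example.org", "age": 37, "region": "North"},
    {"name": "Cal", "email": "cal@example.org", "age": 92, "region": "South"},
    {"name": "Dee", "email": "dee@example.org", "age": 85, "region": "South"},
    {"name": "Eve", "email": "eve@example.org", "age": 51, "region": "North"},
]

cleaned = []
for record in raw:
    out = drop_direct_identifiers(record)  # remove direct identifiers
    out["age"] = band_age(out["age"])      # reduce precision, top-code outliers
    cleaned.append(out)

# Cross-tabulate the remaining fields and flag cells with a single person.
risky = small_cells(cleaned, ["age", "region"], k=2)
print(risky)  # prints: {('50-59', 'North'): 1}
```

Here the two oldest participants, who would stand out with exact ages, fall into a shared "80+" band, while the cross-tabulation still flags one combination of age band and region that contains a single person and may need further coarsening or suppression.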