The limits of anonymisation
by Ian Brown
Arvind Narayanan and Dr Vitaly Shmatikov (University of Texas at Austin) have a fascinating new paper on the impact of social networks on the anonymisation of personal data (thanks, Mo!):
Operators of online social networks are increasingly sharing potentially sensitive information about users and their relationships with advertisers, application developers, and data-mining researchers. Privacy is typically protected by anonymization, i.e., removing names, addresses, etc.
We present a framework for analyzing privacy and anonymity in social networks and develop a new re-identification algorithm targeting anonymized social-network graphs. To demonstrate its effectiveness on real-world networks, we show that a third of the users who can be verified to have accounts on both Twitter, a popular microblogging service, and Flickr, an online photo-sharing site, can be re-identified in the anonymous Twitter graph with only a 12% error rate.
Our de-anonymization algorithm is based purely on the network topology, does not require creation of a large number of dummy “sybil” nodes, is robust to noise and all existing defenses, and works even when the overlap between the target network and the adversary’s auxiliary information is small.
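To give a feel for what matching "purely on the network topology" can mean, here is a toy sketch in Python. It is not the authors' algorithm. It assumes the adversary already knows a couple of "seed" pairs linking named nodes in an auxiliary graph to numbered nodes in the anonymised graph, then greedily extends that mapping by counting shared, already-mapped neighbours. The function name `propagate`, the `min_margin` parameter and the five-node graphs are all made up for illustration.

```python
def propagate(aux, anon, seeds, min_margin=1):
    """Greedily extend a seed mapping from aux-graph nodes to anon-graph nodes.

    aux, anon: undirected graphs as {node: set_of_neighbours}.
    seeds: known {aux_node: anon_node} pairs (an assumption of this sketch).
    min_margin: how far the best candidate must beat the runner-up.
    """
    mapping = dict(seeds)
    changed = True
    while changed:
        changed = False
        for u in sorted(aux):
            if u in mapping:
                continue
            used = set(mapping.values())
            scores = {}
            # Each already-mapped neighbour of u votes for the unmapped
            # neighbours of its counterpart in the anonymised graph.
            for nbr in aux[u]:
                if nbr not in mapping:
                    continue
                for cand in anon[mapping[nbr]]:
                    if cand not in used:
                        scores[cand] = scores.get(cand, 0) + 1
            if not scores:
                continue
            ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
            best, best_score = ranked[0]
            runner_up = ranked[1][1] if len(ranked) > 1 else 0
            # Only accept a match that clearly stands out from the alternatives.
            if best_score - runner_up >= min_margin:
                mapping[u] = best
                changed = True
    return mapping


# Auxiliary graph: the adversary knows names and who is connected to whom.
aux = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice", "carol"},
    "carol": {"alice", "bob", "dave"},
    "dave": {"alice", "carol", "erin"},
    "erin": {"dave"},
}

# "Anonymised" release: same structure, names replaced by numbers.
anon = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2, 4},
    4: {1, 3, 5},
    5: {4},
}

seeds = {"alice": 1, "bob": 2}
recovered = propagate(aux, anon, seeds)
print(recovered)
```

On this tiny example the two seeds are enough to relabel every node, even though no names or attributes appear in the released graph. Real networks are far larger and noisier, and the paper's method is considerably more careful than this.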
These results matter for anyone relying on exemptions in data protection law for the processing of anonymised data. The authors identify the following non-exhaustive list of those affected:
- Advertisers on online social networks.
- Application developers on online social networks.
- Anonymous graphs published for academic research, either crawled from online social networks or collected through other means.
- Phone-call graphs outsourced for fraud detection, etc.
- A variety of special-purpose graphs, such as health networks (of doctors and patients).
As it becomes increasingly easy to relink this type of data with data subjects, data protection authorities are increasingly likely to apply data protection rules to this processing. The Article 29 Working Party of EU Data Protection Commissioners has published an opinion covering the definition of personal data in much more detail. On p.17, they state that the protection required for “indirectly identifiable” personal data is directly related to the ease with which it can be reidentified with an individual. It seems that Narayanan and Shmatikov’s work has raised this bar considerably.