Medical data secondary use issues
by miguel.malheiros
Patient data is extremely useful and valuable. Hospitals and clinics use it to provide and manage medical services, while pharmaceutical companies use it for research, which can lead to new treatments and drugs, and for marketing directed at both physicians and patients. But how is the secondary use of this data controlled or managed? According to Bradley Malin, speaking at a data mining session at the Computers, Freedom and Privacy conference in Washington DC last week, there are currently several flaws in the handling of medical data in hospitals (in the U.S. at least), including: a lack of role-based access control; manual auditing of large numbers of patient records; organisational drift in workflows (actual business processes differ from the designed ones); and a complex relational system in which several people access the data at the same time and team members share accounts.
What about the control and protection offered to patients, then? Khaled El Emam, a research chair in Electronic Health Information at the University of Ottawa who currently works with the university’s children’s hospital, mentioned that, in order to run an Infection Control Portal based on patient data, and following the Canadian legal framework, patients’ consent was sought and de-identification algorithms were applied to the data. However, consent is sometimes hard to obtain because contact with the patient has been lost, and even when it is obtained it introduces bias into studies, because consenters and non-consenters differ on many levels, including behaviour.
De-identification has several limitations as well. Any linkage of data sets must be done before de-identification, since it is no longer possible afterwards. A more serious problem is that data can be re-identified. El Emam gave the example of a data set comprising several million patients in which, considering only the date of birth, the full Canadian postal code and one year of medical data, 98% of records were unique. Uniqueness, however, is not a perfect indicator of re-“identifiability”: even a data set with a low percentage of unique records may still be relatively easy to re-identify, for example when it is compared against another data set with high uniqueness. El Emam told the story of an anonymous girl whose death was associated with a prescribed acne drug and who was eventually identified by the media, who compared the de-identified prescription data set against obituaries. This comparison yielded four possible girls, and by contacting all the families the right one was found. Because of this possibility, some data protection agencies recommend that each data set be de-identified to the point where it contains at least four similar records for each record. This of course has implications for the conclusions that can be drawn from the data set. Thus, there is an implicit trade-off between the ease of re-identification and the usefulness of the data set, i.e., between privacy and the economic and scientific value of the data.
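To make the uniqueness measure and the "similar records" recommendation concrete, here is a minimal Python sketch. It is not the method El Emam described: the function names, the field names (birth_date, postal_code) and the four records are all made up for illustration, and the generalisation step (keeping only the birth year and the first three characters of the postal code) is just one assumed way of coarsening the data.

```python
from collections import Counter

def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def uniqueness(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination appears exactly once."""
    sizes = group_sizes(records, quasi_identifiers)
    unique = sum(1 for n in sizes.values() if n == 1)
    return unique / len(records)

def smallest_group(records, quasi_identifiers):
    """Size of the smallest group of records that look alike on the quasi-identifiers."""
    return min(group_sizes(records, quasi_identifiers).values())

# Hypothetical records, invented for this example only.
records = [
    {"birth_date": "2001-03-14", "postal_code": "K1H 8L1", "drug": "A"},
    {"birth_date": "2001-09-02", "postal_code": "K1H 7W9", "drug": "B"},
    {"birth_date": "1999-07-02", "postal_code": "K1N 6N5", "drug": "A"},
    {"birth_date": "1999-01-20", "postal_code": "K1N 5T2", "drug": "C"},
]
QI = ["birth_date", "postal_code"]

# Assumed generalisation: keep only the birth year and the postal code prefix.
generalized = [
    {**r, "birth_date": r["birth_date"][:4], "postal_code": r["postal_code"][:3]}
    for r in records
]

print(uniqueness(records, QI), smallest_group(records, QI))
print(uniqueness(generalized, QI), smallest_group(generalized, QI))
```

On the raw records every combination is unique, while after generalisation each record shares its values with one other. The price is visible too: exact dates and full postal codes are gone, which is the trade-off between privacy and usefulness described above. As the text notes, a low uniqueness score on its own does not rule out re-identification through comparison with another data set.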