Olivia Rousseau
Nantes Université, INSERM, Centre de Recherche Translationnelle en Transplantation et Immunologie, CR2TI, Nantes, France

Abstract: For many years, clinicians and researchers have collected huge amounts of health-related data. However, more information implies a greater re-identification risk for patients [1], and pseudonymization alone is not enough to overcome this data security problem. The aim of the Avatar method is to compute a synthetic dataset that conserves the global characteristics of a sensitive dataset, while re-identification metrics are computed to verify data security. Individuals are projected into a multidimensional space obtained by dimension reduction (PCA, FAMD), and a local model is then built for each sensitive individual from their 𝑘 nearest neighbors. The avatar is generated randomly within this local area, so that a sensitive individual cannot easily be associated with their synthetic version. This method has been tested on a randomized clinical trial of a multiple sclerosis treatment: the REFLEX study. A Cox analysis performed by Comi et al. gives hazard ratios of 0.49 [95% CI 0.38-0.64] and 0.69 [0.54-0.87] for the two arms [2]. In comparison, the hazard ratios computed on a synthetic dataset generated with the Avatar method are 0.44 [0.34-0.57] and 0.64 [0.50-0.82]. Synthetic datasets can reproduce analyses of pseudonymized datasets and can be shared with the scientific community with less re-identification risk for patients than pseudonymized datasets.
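
The projection, neighborhood and random generation steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes purely numerical variables (so plain PCA rather than FAMD), excludes each individual from its own neighborhood, and draws the avatar as a random convex combination of the 𝑘 neighbors with Dirichlet weights. The function name `avatar_sketch`, its parameters and the weighting scheme are all hypothetical choices made for illustration.

```python
import numpy as np

def avatar_sketch(X, k=5, n_components=2, seed=0):
    """Illustrative sketch only: returns one synthetic row per original row."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    # 1. Project individuals into a reduced space (PCA computed via SVD).
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]        # shape (n_components, n_features)
    Z = Xc @ components.T                 # coordinates in the reduced space

    # 2. Find the k nearest neighbors of each individual in that space
    #    (assumption: the individual itself is excluded).
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]

    # 3. Draw the avatar at random inside the local area spanned by the
    #    neighbors (assumption: random convex weights from a Dirichlet law).
    weights = rng.dirichlet(np.ones(k), size=len(Z))
    Z_avatar = np.einsum("ik,ikd->id", weights, Z[neighbors])

    # 4. Map the avatars back to the original variable space.
    return Z_avatar @ components + mean
```

A call such as `avatar_sketch(data, k=5)` returns an array with the same shape as `data`. Mixed numerical and categorical data would require a FAMD-style projection, and the re-identification metrics mentioned in the abstract would be computed separately on the result.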

Keywords: Synthetic data; Anonymization; Data protection; Multiple sclerosis

1. Rocher L et al. (2019). Estimating the success of re-identifications in incomplete datasets using generative models. Nat. Commun.
2. Comi G et al. (2012). Comparison of two dosing frequencies of subcutaneous interferon beta-1a in patients with a first clinical demyelinating event suggestive of multiple sclerosis (REFLEX): a phase 3 randomised controlled trial. Lancet Neurol.

Attachment Size
Olivia Rousseau30juin2022.pdf 2.06 MB