Privacy-preserving Methods and Tools for Handling Missing Data in Distributed Health Data Networks

Specific Aims

Distributed health data networks (DHDNs) that leverage electronic health records (EHRs) (e.g., eMerge, pSCANNER, PEDSnet) have attracted substantial interest in recent years, as they a) eliminate the need to create, maintain, and secure access to central data repositories, b) minimize the need to disclose protected health information outside the data-owning entity, and c) mitigate many security, proprietary, legal, and privacy concerns. Missing data are ubiquitous in DHDNs and present analytical challenges, yet very little research has addressed missing data in such settings. When applied to a distributed environment, current state-of-the-art approaches for handling missing data typically require pooling raw data into a central data repository before analysis. Pooling requires sharing patient-level data, which may not be feasible for a number of reasons, including institutional policies prohibiting such sharing, high regulatory hurdles, public privacy concerns, and the cost of moving massive amounts of data. In particular, improper disclosure of patient-level data may have serious implications. A large body of research has demonstrated that, given some background information about an individual (such as data from EHRs), an adversary can learn sensitive information about that individual from de-identified data. To address the challenges of handling missing data in distributed analysis and to fill a crucial methodological gap, we propose the following specific aims.

Aim 1: Develop privacy-preserving distributed methods for handling missing data in horizontally partitioned data. In this setting, different data custodians, such as hospitals and healthcare service providers, hold the same types of data for different sets of patients. They would like to collaborate and address the missing data problem together but are reluctant to transmit their patient-level data for the reasons described above. We will develop, implement, and evaluate privacy-preserving inverse probability weighting and multiple imputation methods for univariate and general missing data patterns under both missing at random (MAR) and missing not at random (MNAR) mechanisms.
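As an illustration only, the following minimal sketch shows one way inverse probability weighting could be carried out on horizontally partitioned data under MAR without moving patient-level records. It assumes a logistic model for the response indicator and fits it by Newton-Raphson, with each site sending only its summed gradient and Hessian to a coordinator. The function names (local_summaries, fit_distributed_logistic, local_ipw_weights) and the simulated data are hypothetical, and the sketch is not the specific method to be developed under this aim; sharing gradients and Hessians by itself does not constitute a formal privacy guarantee.

```python
import numpy as np

def local_summaries(X, r, beta):
    """Run inside one site: gradient and Hessian of the logistic log-likelihood.

    X is the local covariate matrix, r the local response indicator
    (1 = observed, 0 = missing). Only these two aggregates leave the site.
    """
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    grad = X.T @ (r - p)
    hess = (X * (p * (1.0 - p))[:, None]).T @ X
    return grad, hess

def fit_distributed_logistic(sites, n_iter=25, tol=1e-10):
    """Run at the coordinator: Newton-Raphson on the summed site aggregates."""
    beta = np.zeros(sites[0][0].shape[1])
    for _ in range(n_iter):
        parts = [local_summaries(X, r, beta) for X, r in sites]
        grad = sum(g for g, _ in parts)
        hess = sum(h for _, h in parts)
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def local_ipw_weights(X, r, beta):
    """Run inside one site: weight 1/p for observed rows, 0 for missing rows."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return np.where(r == 1, 1.0 / p, 0.0)

# Simulated example (assumption): three sites, an intercept plus two covariates.
rng = np.random.default_rng(0)
true_beta = np.array([0.5, 1.0, -0.5])
sites = []
for _ in range(3):
    X = np.column_stack([np.ones(300), rng.normal(size=(300, 2))])
    r = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))
    sites.append((X, r))

beta_hat = fit_distributed_logistic(sites)
weights = [local_ipw_weights(X, r, beta_hat) for X, r in sites]
```

Because the log-likelihood is a sum over patients, the summed gradients and Hessians are the same quantities a pooled analysis would compute, so the distributed fit should agree with the pooled fit up to numerical error.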

Aim 2: Develop privacy-preserving distributed methods for handling missing data in vertically partitioned data. In this setting, different data custodians, such as hospitals, insurance companies, and sequencing centers, hold different pieces of information about the same patients (i.e., data from the same patient are distributed across institutions). In many cases, access to the complete patient profile would help address the missing data problems at individual institutions more effectively, but patient-level data cannot be shared between institutions. We will develop, implement, and evaluate privacy-preserving inverse probability weighting and multiple imputation methods for vertically partitioned data, for both univariate and general missing data patterns, under both MAR and MNAR.
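As an illustration only, the following minimal sketch shows why vertically partitioned data can support a joint imputation model without exchanging raw columns: the linear kernel (Gram) matrix of the combined covariates is the sum of each party's local Gram matrix. The sketch assumes two parties with patients in the same row order, an outcome with missing values held by one party, and a ridge regression fitted in dual form. The names local_gram and dual_ridge_predict and the simulated data are hypothetical. It produces a single regression prediction rather than proper multiple imputations, and a Gram matrix is not by itself a formal privacy guarantee.

```python
import numpy as np

def local_gram(X_party):
    """Run inside one party: n x n linear-kernel matrix of its own columns."""
    return X_party @ X_party.T

def dual_ridge_predict(K, y, observed, lam=1.0):
    """Fit ridge regression in dual form on rows with observed y.

    K is the Gram matrix summed over all parties, so no party's raw
    columns are needed. Returns predictions for rows where y is missing.
    """
    missing = ~observed
    K_oo = K[np.ix_(observed, observed)]
    alpha = np.linalg.solve(K_oo + lam * np.eye(K_oo.shape[0]), y[observed])
    return K[np.ix_(missing, observed)] @ alpha

# Simulated example (assumption): party A holds 2 covariates, party B holds 3.
rng = np.random.default_rng(1)
n = 60
X_a = rng.normal(size=(n, 2))
X_b = rng.normal(size=(n, 3))
y = (X_a @ np.array([1.0, -1.0])
     + X_b @ np.array([0.5, 0.0, 2.0])
     + rng.normal(scale=0.1, size=n))
observed = rng.random(n) > 0.3  # True where the outcome was recorded

# Each party contributes only its local Gram matrix.
K = local_gram(X_a) + local_gram(X_b)
y_imputed = dual_ridge_predict(K, y, observed, lam=1.0)
```

The dual-form predictions coincide with those of an ordinary ridge regression fitted on the concatenated covariates, which is what makes the additive decomposition of the kernel matrix useful here.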

Aim 3: Develop a user-friendly open-source toolkit that enables researchers to handle missing data in distributed analyses across DHDNs. We will develop an integrated toolkit that implements the methods developed in Aims 1 and 2. The toolkit will comprise modules for communication, storage, and algorithms: each client keeps its data in a local private zone and computes the aggregated statistics (e.g., covariance and kernel matrices) required by the proposed methods, which are stored in a shared zone. We will implement the toolkit as open-source software with a web interface and APIs, so that users can incorporate it into their own systems, and we will provide Docker containers to support efficient deployment.
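As an illustration only, the following minimal sketch shows the kind of aggregated statistic a client might place in the shared zone: each site publishes its sample size, column sums, and cross-product matrix, and the pooled sample covariance is reconstructed from those aggregates alone. The function names local_moments and pooled_covariance and the simulated sites are hypothetical and do not describe the toolkit's actual API.

```python
import numpy as np

def local_moments(X):
    """Run in a client's private zone: aggregates that are safe to publish
    under the assumptions of this sketch (count, column sums, cross-products)."""
    return X.shape[0], X.sum(axis=0), X.T @ X

def pooled_covariance(moments):
    """Run on the shared zone: combine per-site aggregates into the
    sample covariance of the (never materialized) pooled data."""
    n = sum(m[0] for m in moments)
    s = sum(m[1] for m in moments)
    ss = sum(m[2] for m in moments)
    mean = s / n
    return (ss - n * np.outer(mean, mean)) / (n - 1)

# Simulated example (assumption): three sites of different sizes, 4 variables.
rng = np.random.default_rng(2)
site_data = [rng.normal(size=(n_i, 4)) for n_i in (120, 80, 200)]

shared_zone = [local_moments(X) for X in site_data]
cov_hat = pooled_covariance(shared_zone)
```

In practice, very small sites could still leak information through such aggregates, so a deployed toolkit would likely need minimum cell-size rules or additional protection on top of this decomposition.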

Aim 4: Evaluate and validate the proposed methods and toolkit using the UCSD obesity patient data prepared for pSCANNER and data from PEDSnet, as well as simulation studies.

The proposed approaches will enable the use of data across multiple sites without pooling patient-level data into a central data repository. They are also computationally efficient, because the decomposed computation can be parallelized across all participating parties, and can therefore be scaled up to handle massive amounts of data in DHDNs. The results of our study will significantly advance the state of the art in missing data methodology for DHDNs. The privacy-preserving software toolkit will enable researchers to use more complete data by leveraging information from multiple sites without compromising patient privacy, help lower regulatory and other hurdles to collaboration across institutions, and help build public trust. As such, it will encourage more institutions and healthcare systems to join clinical data research networks and more patients to participate in clinical studies, which will improve the validity, robustness, and generalizability of research findings and offer substantial benefits in areas including, but not limited to, precision medicine and informatics practice.

Collaborator Institutions