This study aims to develop an accurate and scalable approach to early detection of coronavirus infection and deterioration in the pediatric population. We propose to use state-of-the-art artificial intelligence (AI) technologies, including natural language processing (NLP) and machine learning, to analyze research articles, identify pertinent risk factors, and predict the risk of coronavirus infection and deterioration from real-world data.
Specific Aim 1: Construct a comprehensive set of risk factors for coronavirus infection in the pediatric population. The CORD-19 dataset is the most extensive collection of coronavirus literature, extracted from PubMed Central, medRxiv and bioRxiv. To date, the dataset includes 31,755 full-text scholarly articles about COVID-19 and the coronavirus family. Of these, 1,371 are pediatric-related articles, covering infants (380), children (867), and adolescents (124). The objective of this aim is to identify a comprehensive and accessible set of risk factors that characterize coronavirus infection. We will test Hypothesis 1a: our manual review of the CORD-19 dataset will capture a core set of risk factors associated with coronavirus infection in the pediatric population, and Hypothesis 1b: NLP technologies will enable exploration of the entire CORD-19 dataset for additional risk characteristics. Our approach is to manually review the pediatric-related articles in the CORD-19 dataset and to automatically analyze the entire dataset using NLP technologies. The expected outcome is a comprehensive and clinically meaningful set of risk factors for detecting coronavirus infection in the pediatric population.
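As an illustration of the automated screening step in Aim 1, the following is a minimal sketch of keyword-based tagging of articles by pediatric age group. The keyword patterns, the record fields (`title`, `abstract`), and the sample records are hypothetical and chosen only for illustration; they are not the CORD-19 schema or the study's actual NLP method, which would likely rely on a curated vocabulary and more robust text processing. Note that this sketch allows an article to fall into several groups, whereas the counts reported above assign each article to one group.

```python
import re
from collections import Counter

# Hypothetical keyword patterns per age group (illustrative, not a curated vocabulary).
AGE_GROUP_PATTERNS = {
    "infants": re.compile(r"\b(infants?|neonates?|newborns?)\b", re.IGNORECASE),
    "children": re.compile(r"\b(child|children|pediatric|paediatric)\b", re.IGNORECASE),
    "adolescents": re.compile(r"\b(adolescents?|teenagers?)\b", re.IGNORECASE),
}


def tag_age_groups(text):
    """Return the set of age groups whose keywords appear in the text."""
    return {group for group, pattern in AGE_GROUP_PATTERNS.items() if pattern.search(text)}


def count_by_group(articles):
    """Count how many articles mention each age group (an article may count in several)."""
    counts = Counter()
    for article in articles:
        text = article["title"] + " " + article["abstract"]
        counts.update(tag_age_groups(text))
    return counts


# Made-up records standing in for literature entries.
SAMPLE_ARTICLES = [
    {"title": "Coronavirus in newborns", "abstract": "Case series of infected neonates."},
    {"title": "Respiratory outcomes", "abstract": "Children and adolescents with comorbid asthma."},
    {"title": "Adult ICU cohort", "abstract": "Outcomes in patients over 65."},
]

if __name__ == "__main__":
    print(dict(count_by_group(SAMPLE_ARTICLES)))
```

A filter of this kind would only narrow the literature to candidate articles; extracting the risk factors themselves would still require manual review or a dedicated NLP pipeline.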
Specific Aim 2: Develop and evaluate a machine learning-based system for detecting coronavirus infection and deterioration in pediatric patients. Given the risk factors identified from the literature, the objective of this aim is to develop a machine learning-based system that predicts the risk of coronavirus infection and deterioration for individual patients. We will test Hypothesis 2a: a machine learning-based system will predict the risk of coronavirus infection and deterioration with an AUC greater than 75%, and Hypothesis 2b: the system can identify key predictors for diagnosis and subsequent intervention. Our approach is to collect a large set of clinical data from the institutional electronic health records (EHRs) and the PEDSnet learning health system, and to use machine learning and feature selection technologies to predict risk outcomes and identify clinical predictors. The expected outcome is a prototype machine learning system that analyzes the risk factors, predicts the risk of coronavirus infection and deterioration, and identifies key clinical predictors for subsequent intervention. A study protocol will be submitted to the CCHMC Institutional Review Board and PEDSnet to perform data collection and technology development for this Aim.
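To make the evaluation criterion in Hypothesis 2a concrete, the sketch below computes AUC for a set of predicted risk scores and checks it against the 75% target, then ranks candidate predictors by their single-feature AUC. It assumes that AUC refers to the area under the receiver operating characteristic curve. The univariate ranking is a simple stand-in for the feature selection step, not the method the study will necessarily use, and the feature names and data are invented for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen positive
    case receives a higher score than a randomly chosen negative case (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rank_features(rows, labels):
    """Rank features by single-feature AUC, treating an AUC below 0.5 as an
    equally informative predictor in the opposite direction."""
    scored = {}
    for name in rows[0]:
        auc = roc_auc(labels, [row[name] for row in rows])
        scored[name] = max(auc, 1.0 - auc)
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


AUC_TARGET = 0.75  # threshold stated in Hypothesis 2a

# Invented outcomes (1 = infection or deterioration) and model risk scores.
LABELS = [1, 1, 0, 0, 1, 0]
SCORES = [0.9, 0.8, 0.3, 0.4, 0.35, 0.6]

# Invented candidate predictors for four hypothetical patients.
FEATURE_ROWS = [
    {"fever": 1, "noise": 0},
    {"fever": 1, "noise": 1},
    {"fever": 0, "noise": 0},
    {"fever": 0, "noise": 1},
]
FEATURE_LABELS = [1, 1, 0, 0]

if __name__ == "__main__":
    auc = roc_auc(LABELS, SCORES)
    print(f"AUC = {auc:.3f}, meets target: {auc > AUC_TARGET}")
    print(rank_features(FEATURE_ROWS, FEATURE_LABELS))
```

In practice the AUC would be estimated on held-out patients rather than on the data used to fit the model, so that the comparison against the target reflects performance on unseen cases.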
By successfully completing this project, we will be well-positioned to compete for federal funding from the National Institutes of Health. The developed algorithms will be incorporated into subsequent grant applications to expand the research to a diverse patient population and a broad range of healthcare institutions.