Computable phenotypes to identify respiratory viral infections in the All of Us research program

May 29, 2025

We analyzed EHR data recorded between 1981 and 2022 for 265,222 All of Us participants and developed computable phenotypes for eight respiratory viruses: rhinovirus (RV), human metapneumovirus (hMPV), respiratory syncytial virus (RSV), adenovirus (ADV), SARS-CoV-2, parainfluenza virus (PIV), common human coronavirus (hCoV), and influenza virus. Patient encounters were flagged in the EHR if they had a virus-specific ICD code (e.g., ICD-9-CM 487 “Influenza” or ICD-9-CM 487.0 “Influenza with pneumonia”), a positive laboratory test, or an antiviral prescription (for influenza virus and SARS-CoV-2 only). All virus-specific ICD codes, laboratory results, and medications used for phenotyping are provided in Tables S1–S3. All subsequent related events within 90 days were grouped into the same illness episode (Fig. 1a)⁵.
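The 90-day grouping rule can be illustrated with a minimal sketch. It assumes the window is anchored at the first event of each episode, which the text does not specify (a rolling window measured from the most recent event is another reading), and the function name and input layout are hypothetical rather than the study's actual pipeline.

```python
from datetime import date, timedelta


def group_into_episodes(event_dates, window_days=90):
    """Group one participant's event dates for one virus into illness episodes.

    Assumption (not stated in the source): the 90-day window is measured
    from the FIRST event of the current episode.
    """
    episodes = []
    for event_date in sorted(event_dates):
        # Start a new episode when there is none yet, or when this event
        # falls outside the window opened by the current episode's first event.
        if episodes and event_date - episodes[-1][0] <= timedelta(days=window_days):
            episodes[-1].append(event_date)
        else:
            episodes.append([event_date])
    return episodes


# Toy dates for illustration only (not study data).
events = [date(2019, 1, 5), date(2019, 2, 20), date(2019, 12, 1), date(2019, 4, 10)]
episodes = group_into_episodes(events)
for number, episode in enumerate(episodes, start=1):
    print(number, [d.isoformat() for d in episode])
```

In this toy input the April event falls 95 days after the first January event, so it opens a second episode under the anchored-window assumption.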

Cohort characteristics

The respiratory virus cohorts varied substantially in size and composition (Fig. 1b). The largest cohorts were SARS-CoV-2 (n = 28,729 distinct episodes) and influenza virus (n = 19,784), followed by RV (n = 1,620), hCoV (n = 1,437), and RSV (n = 1,161); the smallest cohorts were hMPV (n = 486), PIV (n = 400), and ADV (n = 238). EHR data availability varied by virus, with the earliest records dating back to 1981 (influenza virus), followed by 1987 (ADV), 1997 (RSV), 2002 (PIV, hMPV), 2003 (RV), 2012 (hCoV), and 2020 (SARS-CoV-2).

Across all cohorts, participants were predominantly female (61–68%), with median ages mostly between 50 and 58 years (Table S4). Participants who self-reported as White were the plurality for every virus (32.9–60.1%), compared to participants self-reporting as Black (16.5–28.5%) or Hispanic/Latino (17–32.1%). All other options (Asian, multiple selected, Middle Eastern or North African, and Native Hawaiian or Other Pacific Islander) were rare (0–2.2%). SARS-CoV-2 and influenza virus participant demographics most closely mirrored the overall All of Us cohort with ICD, laboratory, or medication data (Table S4). Compared to all other groups, the SARS-CoV-2 and influenza virus cohorts had a higher proportion of participants who self-reported as White (50.4–60.1%) and more frequently reported a higher income, higher education, and employer-provided insurance. Demographic data were notably missing only for insurance type (46,487/265,222 = 17.5% of all participants with EHR data). For each virus, participants identified as a viral case by the phenotype algorithm (‘infected’ in Table S4) received more tests per person than participants ever tested for that virus (‘tested’ in Table S4).

Analysis of episode composition revealed differences between viruses. For some viruses, the most common episode type consisted of laboratory results alone (RV [74.7%], PIV [65.0%], hMPV [32.9%], SARS-CoV-2 [31.3%]); for others, it consisted of a single ICD code (ADV [45.8%], hCoV [43.1%], RSV [35.4%], influenza virus [34.8%]; Fig. 1b). Antiviral use varied: SARS-CoV-2 episodes rarely included antiviral prescriptions (4.57%), while medication-only episodes were frequently observed for influenza virus (22.9%), even after excluding prophylactic prescriptions.

Phenotype performance for detecting true positives

To understand how ICD code counts affected phenotype performance, we calculated sensitivity, specificity, and positive predictive value (PPV) for each virus using nucleic acid amplification and virus culture test results as the reference standard (Fig. 2). We compared phenotypes requiring at least 1, 2, 3, or 4 instances of a relevant ICD code per episode. The sensitivity of requiring at least one virus-specific ICD code varied between viruses and decreased as more codes were required. The sensitivity of a single ICD code was highest for influenza virus (66.8%), compared to moderate sensitivity for RSV (55.2%), SARS-CoV-2 (44.8%), ADV (42.4%), hMPV (40.2%), and hCoV (33.4%). RV (9.2%) and PIV (8.3%) were rarely identified by ICD codes alone, regardless of how many appeared in an encounter.
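As a rough illustration of this comparison, the sketch below computes sensitivity, specificity, and PPV for a phenotype that requires a minimum number of ICD codes per episode, using the laboratory result as the reference standard. The function name, the `(icd_count, test_positive)` record layout, and the toy records are assumptions for illustration, not the study's data or code.

```python
def diagnostic_metrics(records, min_icd_codes=1):
    """Compute sensitivity, specificity, and PPV for an ICD-count phenotype.

    records: iterable of (icd_count, test_positive) pairs, one per tested
    episode (hypothetical layout). An episode is phenotype-positive when it
    has at least `min_icd_codes` virus-specific ICD codes.
    """
    tp = fp = fn = tn = 0
    for icd_count, test_positive in records:
        phenotype_positive = icd_count >= min_icd_codes
        if phenotype_positive and test_positive:
            tp += 1
        elif phenotype_positive and not test_positive:
            fp += 1
        elif test_positive:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        # Return NaN rather than raising when a denominator is empty.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
    }


# Toy records for illustration only (not study data).
toy_records = [(2, True), (1, True), (0, True), (1, False), (0, False), (0, False)]
for threshold in (1, 2):
    print(threshold, diagnostic_metrics(toy_records, min_icd_codes=threshold))
```

In the toy data, raising the threshold from one to two codes lowers sensitivity while raising specificity and PPV, the same direction of trade-off the text describes for influenza virus and SARS-CoV-2.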

Specificity and PPV showed similar patterns; the variation was more pronounced for PPV, which initially fell into three groupings. First, for influenza virus and SARS-CoV-2, the PPV for one or more ICD codes was lower (69.7% and 68.8%, respectively) but increased as the minimum number of ICD codes increased (78.1% and 76.7% for at least 2 ICD codes, respectively; Fig. 2). Second, for the other respiratory viruses except hCoV, the PPV was high (89.7–97.3%) regardless of ICD code count. Third, hCoV initially showed a high PPV (79.5%) that decreased as the ICD code count increased (71.8% for at least 2 ICD codes; Figure S1a).

The unusual pattern for hCoV occurred during the COVID-19 pandemic, when the nonspecific hCoV ICD code counts spiked above historical maxima despite an absence of positive tests (Figure S1b). After excluding hCoV ICD codes recorded after February 1, 2020, the PPV pattern for hCoV aligned with non-influenza, non-SARS-CoV-2 viruses (Fig. 2, Figure S1a).
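The date-based exclusion amounts to a simple filter, sketched below. The event layout (dictionaries with `virus`, `source`, and `date` keys) and the function name are hypothetical, and "after February 1, 2020" is read here as strictly later than that date.

```python
from datetime import date

# Cutoff taken from the text; events dated strictly after it are dropped.
HCOV_ICD_CUTOFF = date(2020, 2, 1)


def drop_pandemic_era_hcov_icd(events, cutoff=HCOV_ICD_CUTOFF):
    """Remove hCoV ICD-code events recorded after the cutoff.

    events: list of dicts with 'virus', 'source', and 'date' keys
    (hypothetical layout). Laboratory results and other viruses are kept.
    """
    return [
        event
        for event in events
        if not (
            event["virus"] == "hCoV"
            and event["source"] == "icd"
            and event["date"] > cutoff
        )
    ]


# Toy events for illustration only (not study data).
toy_events = [
    {"virus": "hCoV", "source": "icd", "date": date(2019, 11, 3)},
    {"virus": "hCoV", "source": "icd", "date": date(2020, 4, 15)},
    {"virus": "hCoV", "source": "lab", "date": date(2020, 4, 15)},
    {"virus": "RSV", "source": "icd", "date": date(2020, 12, 1)},
]
kept_events = drop_pandemic_era_hcov_icd(toy_events)
print(len(kept_events))
```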

Adding medication use to the phenotype had varying effects on performance. As with the phenotypes that excluded medications, specificity and PPV increased with each additional ICD code for the medication-inclusive influenza virus and SARS-CoV-2 cohorts. Although only a small proportion (1,345/28,741 = 4.67%) of SARS-CoV-2 episodes included a prescription for remdesivir, molnupiravir, or nirmatrelvir, adding medication to the phenotype increased PPV for this subset (Fig. 2). For influenza virus, medication use alone was poorly predictive (PPV = 46.8%), but combining a medication with one ICD code improved PPV relative to one or more codes alone (87.1% vs. 69.7%, respectively; Fig. 2).

We further evaluated combinations of ICD code and antiviral requirements for both influenza virus and SARS-CoV-2 and found the expected trade-off in performance (Table 1). The broadest criteria, requiring only one ICD code or a medication, maximized sensitivity (76.0% for influenza virus, 45.1% for SARS-CoV-2) but produced the highest number of false positives and the lowest PPVs (65.8% and 68.8%, respectively). For influenza virus, requiring at least two ICD codes or a medication accompanied by an ICD code lowered sensitivity (47.7%) but markedly reduced false positives (from 778 to 238) and increased PPV (from 65.8% to 79.8%). Similar trends were observed for SARS-CoV-2. Despite these trade-offs, the φ coefficient, which quantifies the correlation between laboratory results and phenotypes, was highest for the most inclusive phenotypes for both viruses.
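For reference, the φ coefficient of a 2×2 table of phenotype versus laboratory result is conventionally computed as (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below applies that standard formula; the function name and the toy counts are illustrative assumptions and do not reproduce Table 1.

```python
import math


def phi_coefficient(tp, fp, fn, tn):
    """Phi coefficient for a 2x2 table of phenotype vs. laboratory result.

    Ranges from -1 to 1; returns NaN when any marginal total is zero,
    because the coefficient is undefined in that case.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return float("nan")
    return (tp * tn - fp * fn) / denominator


# Toy counts for illustration only (not values from Table 1).
print(phi_coefficient(tp=2, fp=1, fn=1, tn=2))
```

Because φ rewards agreement on both positives and negatives, a broad phenotype can score higher than a stricter one when the gain in true positives outweighs the added false positives, which is consistent with the pattern reported above.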

Table 1 Phenotype performance

Geographic analysis identified broad nationwide coverage of infections, particularly for SARS-CoV-2 and influenza virus (Fig. 3). While episodes generally matched the distribution of All of Us participants with EHR data (Fig. 3b) and participants tested for each virus (Figure S2c), incorporating ICD codes and medications resulted in higher infection rates in the Southeast and Texas despite lower testing coverage in these regions.

Temporal analysis showed that infection episodes composed of 1–3 ICD codes exhibited seasonality patterns consistent with laboratory test-positive episodes for all commonly detected viruses (Fig. 4). During the early COVID-19 pandemic (winter 2020 to spring 2021), only SARS-CoV-2 and RV were consistently identified.

Patterns in phenotype composition by level of care

Encounter level of care varied by virus and episode composition. For RV, hMPV, PIV, hCoV, and SARS-CoV-2, episodes defined by at least one test without ICD codes were the most frequent. For RSV, ADV, and influenza virus, ICD-only episodes predominated (Fig. 5). Influenza virus episodes with antiviral prescriptions were similar in the distribution of visit types compared to those without antiviral prescriptions, while SARS-CoV-2 episodes rarely included prescriptions during our study period.

By percentage, influenza virus and SARS-CoV-2 episodes included a mix of outpatient, ER, and inpatient encounters, while rates of ER visits and hospitalizations were higher for the other viruses. Episodes with positive tests had higher hospitalization rates compared to test-negative episodes, and hospitalization rates increased with the number of ICD codes per episode. The cohorts included very few post-acute care encounters and almost no urgent care encounters.

Laboratory result comparison

Using national epidemiological data from NREVSS, COVID Data Tracker, and GISRS, we compared All of Us laboratory results by geographic coverage, virus type proportion, and temporal trends.

We found broad national coverage of All of Us participants with relevant EHR data (265,222 participants), with enriched sampling near population centers in the Northeast corridor; Western Pennsylvania; Great Lakes Region; Southeast; Arizona; California; and the metropolitan areas of Austin/Dallas, Kansas City, Denver, and Seattle (Figure S2a, Table S7). Only 3.7% of zip3 codes had no All of Us participants with ICD, laboratory, or medication data.

Testing patterns in the All of Us data overlapped with CDC clinical laboratories reporting to NREVSS (Figure S2b) and mirrored participant distribution (Figure S2a) with a notable decrease in testing for all respiratory viruses in the Southeast relative to participant density (Figure S2c). Testing frequency varied substantially by virus; participants were more frequently tested for influenza virus and SARS-CoV-2 compared to all other viruses.

Virus type distributions in All of Us were similar to national surveillance data from NREVSS and GISRS¹⁸,¹⁹,²⁰. For PIV (2011–2019), HPIV-3 was most commonly detected and all other types were less frequent (Figure S3a). For hCoV (2014–2021), OC43 was most common and 229E was least common, while the order of NL63 and HKU1 differed (Figure S3b). Influenza virus type proportions (2010–2020) were nearly identical, with influenza virus A more common than influenza virus B (Figure S3c). Cross-dataset influenza subtype comparisons were not available, but in All of Us, H3N2 and H1N1 pdm09 were markedly more common than H1N1 and H5N1, as expected.

Test positivity patterns from 2017 to 2022 matched CDC rates for most viruses (mean absolute error of 5.89% positive tests per week for RV and 1.18–2.82% for all other viruses; Fig. 6). SARS-CoV-2, influenza virus, and RV had the highest percent positivity, and test positivity for most viruses followed expected seasonal patterns: PIV and RV test positivity showed two seasonal peaks per year (spring-dominant for PIV, fall-dominant for RV), while RSV, influenza virus, and hMPV had single overlapping winter peaks. SARS-CoV-2 test positivity rates matched expected variant waves (e.g., Alpha, Delta, and Omicron BA.1). Notable differences in the All of Us data, relative to CDC data, included more week-to-week variability in virus positivity, positivity undercounted by ~10% during peak respiratory season for influenza virus and RSV, and lower ADV positivity.
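The mean absolute error reported here can be understood as the average weekly gap, in percentage points, between two percent-positivity series. The sketch below assumes the two series are already aligned week by week; the function name and the toy values are hypothetical and are not taken from All of Us or CDC data.

```python
def mean_absolute_error(series_a, series_b):
    """Mean absolute difference between two equally long weekly series.

    Both inputs are percent-positive values (0-100), aligned so that
    position i refers to the same surveillance week in each series.
    """
    if len(series_a) != len(series_b) or not series_a:
        raise ValueError("series must be non-empty and the same length")
    total = sum(abs(a - b) for a, b in zip(series_a, series_b))
    return total / len(series_a)


# Toy weekly percent-positivity values for illustration only.
cohort_weekly = [10.0, 12.5, 8.0, 5.0]
reference_weekly = [9.0, 14.0, 8.5, 5.0]
mae = mean_absolute_error(cohort_weekly, reference_weekly)
print(mae)
```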