Label efficient phenotyping for Long COVID using electronic health records

July 4, 2025

Study design and setting

The overall workflow (Fig. 4) comprises cohort curation of patients with COVID-19 from two healthcare systems, development and validation of the proposed phenotyping algorithm, and an illustrative downstream clinical application (i.e., healthcare utilization trends related to Long COVID).

**Fig. 4: Overall workflow for development of LATCH phenotype.**

Data source

The Consortium for Clinical Characterization of COVID-19 by EHR (4CE) is an international consortium for data-driven studies pertaining to COVID-19 and Long COVID³³. Two healthcare systems from the 4CE Consortium contributed data and chart review results for the current study: the Veterans Health Administration (VHA) and the University of Pittsburgh Medical Center (UPMC)^34,35,36. The VHA is the largest integrated healthcare system in the United States, with 171 medical centers, and UPMC is a Pittsburgh-based healthcare system with 43 hospitals. Together, these two healthcare systems provide care to over 15 million patients. The Institutional Review Boards (IRBs) at each of the participating healthcare systems approved the study (MVP: Supported by the Million Veteran Program (MVP000), Central IRB 10-02; Phenotyping Protocol: Boston IRB 3097; Innovative Analytics: Central IRB 18–38; UPMC: STUDY20070095). Waivers of informed consent and waivers of HIPAA Authorization were received for these data-only analysis studies.

We used data from VHA to train and internally evaluate the phenotyping algorithm, and data from UPMC to externally validate the component of the algorithm trained on structured data. At each healthcare system, we curated the cohort using the same strategy:

(1) Inclusion criteria and index date. The study cohort comprised patients who were either assigned at least one ICD-10 code of U07.1 (“COVID-19, virus identified”) or had a confirmed positive result from a SARS-CoV-2 reverse transcription polymerase chain reaction (PCR) test between March 1, 2020 and September 30, 2022 for VHA, or between March 1, 2020 and March 31, 2023 for UPMC. For each patient, the index date was set to the date of their first U07.1 code or positive PCR test for SARS-CoV-2.

(2) Inpatient vs. outpatient. Patients meeting the inclusion criteria were further grouped as hospitalized (inpatient) or non-hospitalized (outpatient) depending on their hospital admission status within a window of 7 days before to 14 days after the index date.

(3) Pre- vs. post-U09.9. To assess the potential role of the introduction of the U09.9 code on Long COVID phenotyping, we divided the cohort into two periods: pre- and post-U09.9. The infection cutoff date was set to September 1, 2021, to accommodate a lag window of up to 30 days for coding Long COVID, corresponding with the introduction of the U09.9 code in October 2021.
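The three curation rules above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the function names are hypothetical, the window boundaries are assumed to be inclusive, and an infection on the cutoff date itself is assumed to fall in the post-U09.9 period.

```python
from datetime import date, timedelta

# Cutoff separating the pre- and post-U09.9 periods, as stated in the text.
U099_CUTOFF = date(2021, 9, 1)

def index_date(u071_dates, positive_pcr_dates):
    """Earliest U07.1 code or positive PCR date; None if the patient has neither."""
    candidates = list(u071_dates) + list(positive_pcr_dates)
    return min(candidates) if candidates else None

def hospitalization_status(index, admission_dates):
    """'inpatient' if any admission falls 7 days before to 14 days after the index date.
    Assumption: both ends of the window are inclusive."""
    start = index - timedelta(days=7)
    end = index + timedelta(days=14)
    hospitalized = any(start <= d <= end for d in admission_dates)
    return "inpatient" if hospitalized else "outpatient"

def infection_period(index):
    """'pre-U09.9' for index dates before the cutoff, otherwise 'post-U09.9'.
    Assumption: the cutoff date itself counts as post-U09.9."""
    return "pre-U09.9" if index < U099_CUTOFF else "post-U09.9"
```

Each patient then carries a (period, hospitalization) pair that identifies the sub-cohort used later in model training.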

Long COVID definition and chart review process

Chart-review validation of Long COVID adhered to the World Health Organization (WHO) definition of Long COVID^37,38, following a VHA-developed protocol³⁹ with input from the 4CE Consortium. Eleven common Long COVID symptoms were grouped into a “core” symptom cluster^40,41,42,43, and additional symptoms were grouped into an “extended” cluster by disease domain (e.g., cardiovascular, respiratory). Chart reviews were conducted on sampled patients who had more than six months of post-infection clinical notes and at least one U09.9 code or new-onset Long COVID-related ICD code. This ensured documentation of symptom onset and duration aligning with Long COVID definitions (Supplementary Fig. 1). Descriptive statistics for the chart review cohort can be found in the previously published study by Maripuri et al.³⁹. A case was classified under the less stringent criteria (WHO-1) if a single core symptom persisted for more than 60 days post-infection, whereas the more stringent criteria (WHO-2) required at least two new symptoms (either two core or one core plus one extended) to persist for more than 60 days post-infection.
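The two case definitions can be written as a small rule. The sketch below is illustrative only: the input format (one `(cluster, days_persisted)` pair per new-onset symptom) and the function name are assumptions, and "a single core symptom" is read as "at least one core symptom".

```python
PERSISTENCE_DAYS = 60  # a symptom must persist for more than 60 days post-infection

def classify_who(symptoms):
    """Return (who1, who2) for one patient.

    symptoms: list of (cluster, days_persisted) pairs for new-onset symptoms,
    where cluster is 'core' or 'extended' (hypothetical input format).
    """
    persistent = [cluster for cluster, days in symptoms if days > PERSISTENCE_DAYS]
    n_core = persistent.count("core")
    n_extended = persistent.count("extended")
    who1 = n_core >= 1                                       # less stringent
    who2 = n_core >= 2 or (n_core >= 1 and n_extended >= 1)  # more stringent
    return who1, who2
```

Under this reading, every WHO-2 case is also a WHO-1 case, since both WHO-2 routes require at least one persistent core symptom.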

Domain experts reviewed clinical notes dating back up to one year before the initial acute COVID-19 episode, so that any symptom present before or concurrently with the acute phase was not counted as new onset. At the VHA, 474 patients were reviewed, including 332 randomly selected patients with a U09.9 code and 142 patients without. At UPMC, 178 patients were reviewed, including 74 randomly selected patients with a U09.9 code and 104 without.

Data process, feature curation and selection

Data from the VHA comprised both structured and unstructured types, while data from UPMC were solely structured. For structured data, we rolled up all ICD-10 diagnosis codes to one-digit-level PheCodes⁴⁴ to capture broader diagnoses, intentionally omitting multi-level PheCodes. For unstructured data, we extracted Long COVID-related concepts as Concept Unique Identifiers (CUIs) using our established NLP pipeline^45,46. This involved applying named entity recognition (NER) to eight PubMed review articles and seven online knowledge databases to construct comprehensive CUI dictionaries. NLP was then used to process the narrative notes and extract the CUIs in the dictionary. We curated PheCode and NLP data into two types of features: the count of new-onset features post-COVID-19 infection per patient and the duration (in months) over which those features were observed. After excluding features with over 99% zero occurrences, we employed surrogate-assisted feature selection⁴⁷ to define our candidate feature set.
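As a rough illustration of the count and duration features and the sparsity filter, consider the sketch below. The event format, the function names, and two definitions are assumptions not stated in the text: a feature is treated as new-onset when it never appears before the index date, and duration is taken as the number of distinct post-infection months in which the feature is observed.

```python
from collections import defaultdict

def new_onset_features(events):
    """Map each new-onset feature to (count, duration_months) for one patient.

    events: list of (feature, month) pairs, with month relative to the index
    date and negative values meaning before infection (hypothetical format).
    """
    seen_before = {feature for feature, month in events if month < 0}
    counts = defaultdict(int)
    months = defaultdict(set)
    for feature, month in events:
        if month >= 0 and feature not in seen_before:
            counts[feature] += 1
            months[feature].add(month)
    return {f: (counts[f], len(months[f])) for f in counts}

def drop_sparse(patient_features, max_zero_fraction=0.99):
    """Return the features kept after removing those that are zero (absent)
    in more than max_zero_fraction of patients."""
    n_patients = len(patient_features)
    n_nonzero = defaultdict(int)
    for features in patient_features:
        for f in features:
            n_nonzero[f] += 1
    return {
        f for f, k in n_nonzero.items()
        if (n_patients - k) <= max_zero_fraction * n_patients
    }
```

The surrogate-assisted feature selection step that follows in the text is not shown here.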

Semi-supervised phenotyping

Using VHA data, we developed a three-step semi-supervised LATCH phenotyping algorithm, illustrated in Fig. 5. We first built unsupervised models without using gold-standard chart review labels, and then built a supervised model that incorporates those labels.

1. Unsupervised XGBoost. We trained a set of XGBoost tree models⁴⁸ to classify the presence of U09.9, used as a binary noisy label for the true Long COVID status, from curated candidate features in the EHR data. Note that U09.9 data was retrospectively available even for the pre-U09.9 cohort due to back-coding procedures well established at institutions like the VHA and UPMC. Using the noisy U09.9 label allowed for training on the full study cohort without gold-standard labels, and the XGBoost tree method was chosen to accommodate high-dimensional data and to capture non-linear associations. The XGBoost models were tailored to specific sub-cohorts based on infection period (i.e., pre-U09.9 or post-U09.9) and hospitalization type. For each sub-cohort, models were trained with either 1) PheCode features only, or 2) combined PheCode and NLP features. This resulted in a set of XGBoost probabilities tailored to specific sub-cohorts and data types.

2. Alignment of cohort-specific probabilities. We consolidated the XGBoost probabilities specific to each sub-cohort into a single feature, assigning one probability per patient. Each patient’s final XGBoost probability was taken from a model trained on the sub-cohort that matched the patient, based on infection period (i.e., pre-U09.9 or post-U09.9) and type of hospitalization. Within each sub-cohort, U09.9 status (presence or absence of the code) determined which data type model was used: patients without a U09.9 code were assigned the probability from the model trained with both PheCode and NLP features, whereas patients with a U09.9 code were assigned the probability from the model trained with PheCode features only.

3. Supervised logistic regression model. Using the chart-review cohort, we fit a logistic regression of the gold-standard label on the binary indicator for period, the unified XGBoost probability, and the logarithm of the U09.9 code count to refine patient-level Long COVID status classification.
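Steps 2 and 3 can be sketched as follows, assuming the sub-cohort XGBoost probabilities from step 1 have already been computed. Everything here is illustrative: the dictionary layout, the key names, and the coefficient values are hypothetical, the coefficients would in practice be estimated from the chart-review labels, and `log1p` (log of one plus the count) is an assumption made so that patients with zero U09.9 codes are handled.

```python
import math

def aligned_probability(patient, probs):
    """Step 2: pick one XGBoost probability per patient.

    probs[(period, hospitalization, data_type)][patient_id] holds the output of
    the model trained on that sub-cohort and data type (hypothetical layout).
    """
    # Patients with a U09.9 code use the PheCode-only model; others use PheCode + NLP.
    data_type = "phecode" if patient["has_u099"] else "phecode+nlp"
    key = (patient["period"], patient["hospitalization"], data_type)
    return probs[key][patient["id"]]

def latch_score(patient, xgb_prob, coef):
    """Step 3: logistic model on period, aligned probability, and log U09.9 count."""
    z = (
        coef["intercept"]
        + coef["period"] * (patient["period"] == "post-U09.9")
        + coef["xgb_prob"] * xgb_prob
        + coef["log_u099"] * math.log1p(patient["u099_count"])  # assumed log(1 + count)
    )
    return 1.0 / (1.0 + math.exp(-z))
```

Keeping the supervised step to three predictors is what makes the approach label-efficient: only a handful of coefficients need to be estimated from the limited chart-review labels.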

**Fig. 5: Three-step semi-supervised LATCH phenotyping.**

Methods for comparison and evaluation metrics

We compared our semi-supervised LATCH algorithm against several benchmarks: binary U09.9 code presence; rule-based phenotypes at varying U09.9 code count thresholds (e.g., ≥ 2, 3, 4); unsupervised XGBoost only; unsupervised XGBoost only, restricted to structured data; and the proposed model restricted to structured data. The different model architectures of the proposed method and benchmark models are summarized in Supplementary Table 2. Performance metrics, including area under the receiver operating characteristic curve (AUROC), F-score, true positive rate (TPR), positive predictive value (PPV), and the proportion of Long COVID identified among COVID-19 patients, were evaluated against gold-standard chart review labels across periods (pre-U09.9, post-U09.9) and against different Long COVID definitions (WHO-1, WHO-2). For internal validation of the proposed method using VHA data, we conducted 10-fold cross-validation because the method relies on labeled data for both training and evaluation. To improve model interpretability, particularly for the unsupervised portion of our model, we also calculated Shapley values for feature importance.
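For reference, the rule-based benchmark and the threshold-based metrics can be computed as in the sketch below. The function names are hypothetical, the F-score is assumed to be the F1 score (the text does not specify the weighting), and AUROC is computed with the pairwise-comparison definition rather than a library routine.

```python
def rule_based(u099_counts, min_count):
    """Rule-based phenotype: positive if the patient has at least min_count U09.9 codes."""
    return [count >= min_count for count in u099_counts]

def confusion_metrics(y_true, y_pred):
    """TPR, PPV and F1 against gold-standard labels (F1 is an assumed choice)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * tpr * ppv / (tpr + ppv) if tpr + ppv else 0.0
    return {"TPR": tpr, "PPV": ppv, "F": f1}

def auroc(y_true, scores):
    """Probability that a random positive scores above a random negative (ties count 0.5).
    Requires at least one positive and one negative label."""
    pos = [s for t, s in zip(y_true, scores) if t]
    neg = [s for t, s in zip(y_true, scores) if not t]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUROC is quadratic in the number of labeled patients, which is acceptable for a chart-review cohort of a few hundred patients.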

External validation

Beyond internal validation, we assessed the generalizability of the proposed algorithm through external validation on UPMC data, focusing on the PheCode-only model because NLP features were not available for UPMC patients. We used the same benchmarks and evaluation metrics as detailed above. However, we did not stratify by time period because UPMC had fewer gold-standard labels than VHA.

Downstream clinical application: temporal trend analysis of pre- and post-infection healthcare utilization

We demonstrate a proof of concept for a downstream clinical application of our method by comparing pre- and post-infection healthcare utilization between patients identified as Long COVID-positive and Long COVID-negative, in order to understand the healthcare impact of Long COVID. In contrast to existing studies relying on rule-based or survey data to identify Long COVID cases, we use the proposed computable phenotypes^31,49,50,51,52. Moreover, our analysis tracks the degree of healthcare utilization month by month, including the pre-infection period. Using a longitudinal mixed-effects model with both fixed and random effects, we analyzed healthcare utilization, measured as the monthly total number of days with any PheCodes observed in the EHR. The model includes variables such as period (pre-U09.9 vs post-U09.9), time in months pre- and post-infection, and the logarithm of baseline PheCode counts, with patient ID as the random effect. Nonlinear temporal trends were captured using spline functions at the 2nd, 4th, and 6th months post-infection.
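One way to write such a model down is shown below. This is a sketch under several assumptions rather than the fitted specification: the spline is taken to be a linear truncated-power spline with knots at months 2, 4, and 6, the patient effect is a random intercept only, the baseline count enters as log(1 + B_i), and the way Long COVID status enters (separate fits per group or interaction terms) is left out because the text does not specify it.

```latex
Y_{it} = \beta_0 + \beta_1 P_i + \beta_2 \log(1 + B_i) + \beta_3 t
       + \sum_{k \in \{2, 4, 6\}} \gamma_k \, (t - k)_+ + b_i + \varepsilon_{it},
\qquad b_i \sim N(0, \sigma_b^2), \quad \varepsilon_{it} \sim N(0, \sigma^2)
```

Here Y_{it} is the number of days with any PheCode for patient i in month t relative to infection, P_i is the period indicator (pre-U09.9 vs post-U09.9), B_i is the baseline PheCode count, (x)_+ = max(x, 0), and b_i is the patient-level random intercept. The truncated terms let the slope of the time trend change at each knot, which is how the nonlinear post-infection trajectory is captured in this formulation.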
