Adjudication of LRTI status
Gold standard adjudication of LRTI status was performed retrospectively following ICU discharge by two or more physicians using all available information in the EMR, and based on the U.S. Centers for Disease Control and Prevention (CDC) PNEU1 criteria27 as well as an identified pulmonary pathogen. Patients with negative microbiological testing and a clear alternative reason for their acute respiratory failure besides pulmonary infection, representing the clinically relevant control group, were also identified (No LRTI group). Any adjudication discrepancies were resolved by a third physician, and patients with indeterminate LRTI status were excluded.
Extraction of EMR data
The primary medical or ICU team’s clinical note from the day prior to study enrollment and the CXR read from the day of enrollment were extracted from the EMR. If no note was written on the day prior to enrollment, a note from two days prior was substituted (Table 1). Notes were written in the Epic EMR platform by physicians from the primary care team, which included Internal Medicine, Critical Care, and several additional services (Table 1). Notes varied in length and structure, reflecting the real-world diversity of clinical practice and providing a realistic scenario for GPT-4 use. If no CXR was performed on the day of enrollment, the closest CXR read prior to the date of enrollment was used instead. Patients with no clinical notes available prior to study enrollment were excluded (N = 7 derivation cohort, N = 12 validation cohort). The clinical treatment team’s LRTI diagnosis was inferred from administration of empiric antimicrobials (antibacterial, antiviral, and/or antifungal agents) for at least 24 h within one day of study enrollment, excluding agents given for established non-pulmonary infections or prophylaxis.
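The antimicrobial-based rule for inferring the treatment team's diagnosis can be expressed programmatically. This is a minimal illustrative sketch only (the actual extraction was from EMR medication administration records); the record fields `start`, `stop`, and `indication` are hypothetical names invented for this example.

```python
from datetime import datetime, timedelta

def infer_team_lrti_dx(courses, enroll_date):
    """Infer the treatment team's LRTI diagnosis from antimicrobial courses.

    `courses` is a list of dicts with hypothetical keys `start`, `stop`
    (datetimes), and `indication` (a string). Returns True if any empiric
    antimicrobial was given for at least 24 h starting within one day of
    enrollment, excluding agents for established non-pulmonary infections
    or prophylaxis, per the rule described in the text.
    """
    lo = enroll_date - timedelta(days=1)
    hi = enroll_date + timedelta(days=1)
    for c in courses:
        # Exclude agents given for non-pulmonary infections or prophylaxis.
        if c["indication"] in {"non-pulmonary", "prophylaxis"}:
            continue
        started_in_window = lo <= c["start"] <= hi
        given_24h = (c["stop"] - c["start"]) >= timedelta(hours=24)
        if started_in_window and given_24h:
            return True
    return False
```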
RNA sequencing
RNA was extracted from tracheal aspirates collected on the day of enrollment and underwent rRNA depletion followed by library preparation using the NEBNext Ultra II kit on a Beckman Coulter Echo liquid handling instrument, as previously described8. Finished libraries underwent paired-end sequencing on an Illumina NovaSeq.
FABP4 diagnostic classifier
All analyses were performed in R version 4.5.0. FABP4 expression was normalized using the varianceStabilizingTransformation function from the DESeq2 package (v1.48.1)28 and used to train a logistic regression classifier. We chose logistic regression because, among machine learning methods, it is best suited to the one or two features we sought to test given the sample sizes of the cohorts. More specifically, logistic regression is less vulnerable to overfitting than more complex models such as random forest or gradient-boosting classifiers. In addition, logistic regression is among the most broadly utilized statistical methods in the medical literature, and we believe this familiarity and interpretability make it particularly appealing in clinical settings compared to more advanced but less transparent machine learning models.
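The classifier itself is a single-feature logistic regression. The paper's implementation is in R with DESeq2's variance-stabilizing transformation (VST); the sketch below is an illustrative Python analogue only, using simulated counts and a simple log2 transform as a stand-in for the VST. The direction and magnitude of the simulated FABP4 effect are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated FABP4 counts for 50 LRTI and 50 No LRTI patients; the
# direction and size of the effect are illustrative only.
counts = np.concatenate([rng.poisson(20, 50), rng.poisson(200, 50)])
y = np.array([1] * 50 + [0] * 50)  # 1 = LRTI, 0 = No LRTI

# log2(count + 1) is a simple stand-in for DESeq2's
# variance-stabilizing transformation used in the paper.
x = np.log2(counts + 1).reshape(-1, 1)

# Single-feature logistic regression classifier.
clf = LogisticRegression().fit(x, y)
prob_lrti = clf.predict_proba(x)[:, 1]
```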
In each iteration of 5-fold cross-validation, both training and test sets were filtered to retain only genes with at least 10 counts in at least 20% of the samples in the training set. The test fold’s FABP4 expression was normalized using the variance-stabilizing transformation with dispersions estimated from the training folds, and input to the trained logistic regression classifier to assign LRTI or No LRTI status for each patient in the test fold. The performance and receiver operating characteristic (ROC) curve for each of the five folds were evaluated using the pROC package (v1.19.0.1)29. The mean and standard deviation of the AUC were calculated across the test folds. The sensitivity and specificity at Youden’s index were extracted for each test fold separately using the coords function from the pROC package, and their mean and standard deviation were calculated across the cross-validation folds.
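The cross-validation procedure above was implemented in R with pROC; as an illustrative sketch only, an analogous per-fold evaluation in Python might look like the following, with scikit-learn's roc_curve standing in for pROC and Youden's index taken as the threshold maximizing sensitivity + specificity − 1. Gene filtering and normalization would likewise be fit on the training fold only, as the text describes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc

def cross_validated_performance(x, y, n_splits=5, seed=0):
    """Per-fold AUC and Youden's-index sensitivity/specificity.

    Returns (mean, sd) tuples for AUC, sensitivity, and specificity,
    averaged across the cross-validation folds.
    """
    aucs, sens, spec = [], [], []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(x, y):
        clf = LogisticRegression().fit(x[train_idx], y[train_idx])
        p = clf.predict_proba(x[test_idx])[:, 1]
        fpr, tpr, _ = roc_curve(y[test_idx], p)
        aucs.append(auc(fpr, tpr))
        j = np.argmax(tpr - fpr)   # Youden's index
        sens.append(tpr[j])
        spec.append(1 - fpr[j])
    return ((np.mean(aucs), np.std(aucs)),
            (np.mean(sens), np.std(sens)),
            (np.mean(spec), np.std(spec)))
```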
GPT-4 input, scoring, and prompt engineering
We used the GPT-4 turbo model with 128k context length and a temperature setting of 0.2, implemented in Versa, a University of California, San Francisco (UCSF) Health Insurance Portability and Accountability Act (HIPAA)-compliant platform. For each patient, compiled clinical notes and CXR reads were input into the GPT-4 chat interface. Prompt engineering was initially carried out by iterative testing on clinical notes and CXR reads from five randomly selected patients in the derivation cohort, who were excluded from subsequent analyses. We employed a chain-of-thought prompting strategy30 that involved asking GPT-4 to analyze the note and CXR step-by-step. The validation cohort included patients enrolled during the height of the COVID-19 pandemic, and thus we redacted the terms “SARS-CoV-2” and “COVID-19” from their notes to avoid biasing the GPT-4 analysis. In the final version of the prompt (Supplementary Appendix 1), we asked GPT-4 to choose either LRTI or No LRTI, as exemplified in two example responses (Supplementary Appendices 2 and 3). We found that GPT-4 would sometimes give different answers to the same prompt and EMR input data in separate chat sessions. Therefore, for each patient, GPT-4 was asked to diagnose LRTI in three separate sessions, and a per-patient GPT-4 score (range 0–3) was calculated as the total number of LRTI-positive diagnoses.
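The per-patient score is simply a count of LRTI-positive calls across the three independent sessions. A minimal sketch of that aggregation step (the actual queries were made through the Versa chat interface; the label strings here are illustrative):

```python
def gpt4_score(session_answers):
    """Per-patient GPT-4 score: the number of LRTI-positive diagnoses
    across three independent chat sessions, ranging from 0 to 3."""
    assert len(session_answers) == 3, "one answer per session expected"
    return sum(answer == "LRTI" for answer in session_answers)
```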
Integrated classifier
The integrated classifier’s performance was tested using 5-fold cross-validation in the derivation cohort; because of the smaller sample size, 3-fold cross-validation was used in the validation cohort. For each test fold, a logistic regression classifier was trained on the remaining training folds using both normalized FABP4 expression and the GPT-4 score. The performance and ROC curve for each fold were evaluated as described above. Sensitivity, specificity, and accuracy were calculated based on an out-of-fold predicted LRTI probability threshold of at least 50%.
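As with the single-gene classifier, the integrated model was implemented in R; the following is an illustrative Python sketch only of the out-of-fold evaluation at the 50% probability threshold, with a two-feature logistic regression on normalized FABP4 expression and the GPT-4 score. The input names are invented for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def integrated_metrics(fabp4_vst, gpt4_score, y, n_splits=5, seed=0):
    """Out-of-fold sensitivity, specificity, and accuracy at a 50%
    predicted-probability threshold for a two-feature logistic
    regression (normalized FABP4 expression + GPT-4 score)."""
    X = np.column_stack([fabp4_vst, gpt4_score])
    oof = np.empty(len(y))  # out-of-fold predicted LRTI probabilities
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr, te in skf.split(X, y):
        clf = LogisticRegression().fit(X[tr], y[tr])
        oof[te] = clf.predict_proba(X[te])[:, 1]
    pred = (oof >= 0.5).astype(int)
    tp = np.sum((pred == 1) & (y == 1))
    tn = np.sum((pred == 0) & (y == 0))
    fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(y)
    return sensitivity, specificity, accuracy
```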
Comparing GPT-4 to physicians provided with the same data
We compared LRTI diagnoses made by GPT-4 against those made by three physicians trained in internal medicine (ADK) or additionally subspecialized in infectious diseases (AC, NLR). The physicians were provided with the same information and prompts as GPT-4 and were asked to assign each patient as either LRTI or No LRTI. The comparison physician group score (0–3) was calculated as the total number of LRTI-positive diagnoses made by the comparison physicians.
Ethics statement
We studied patients from two prospective observational cohorts of critically ill adults with acute respiratory failure enrolled within 72 h of intubation at the University of California San Francisco (UCSF) Medical Center (Fig. 1, Table 1). The derivation cohort7 (N = 202) was enrolled between 10/2013 and 01/2019, and the validation cohort (N = 115) between 04/2020 and 12/2023. This research was approved by the University of California, San Francisco Institutional Review Board (IRB) under the following protocols: #10-02701 for the derivation cohort, and #20-30497 and #10-02852 for the validation cohort.
If a patient met inclusion criteria, a study coordinator or physician obtained written informed consent for enrollment from the patient or their surrogate. Patients or surrogates were provided with detailed written and verbal information about the goals of the study, the data and specimens that would be collected, and potential risks to the subject. Patients and their surrogates were also informed that there would be no benefit to them from enrollment in these studies and that they could withdraw informed consent at any time during the course of the study. All questions were answered, and informed consent was documented by obtaining the signature of the patient or their surrogate on the consent document or on an IRB-approved electronic equivalent. As previously described25,26, the IRB granted an initial waiver of consent for patients who could not provide informed consent at the time of enrollment.
More specifically, subjects who were unable to provide informed consent at the time of enrollment could have biological samples as well as clinical data from the medical record collected. Surrogate consent was actively pursued, and each patient was regularly examined to determine if and when they would be able to consent for themselves. For patients whose surrogates provided informed consent, direct consent from the patient was then obtained if they survived their acute illness and regained the ability to consent. A full waiver of consent was approved for subjects who died prior to consent being obtained. Further details on the enrollment and consent process for these studies can be found in two recent publications25,26.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.