Study population
This study retrospectively analyzed data from a multicenter prospective trial. The study population comprised consecutive patients admitted to the participating hospitals for suspected COVID-19 pneumonia. Data were drawn from two distinct periods: the first Belgian wave (March 1 – May 15, 2020) at Universitair Ziekenhuis Brussel (UZ Brussel, Dataset 1) and the late second and early third waves (January 1 – March 22, 2021) at AZ Delta (Dataset 2). UZ Brussel, an academic hospital affiliated with Vrije Universiteit Brussel, serves as a hub for clinical care, education, and research, while AZ Delta is a supraregional secondary care center serving a population of approximately 500,000.
All patients were admitted on suspicion of COVID-19 pneumonia according to hospital-approved triage protocols, and all underwent chest CT imaging and SARS-CoV-2 PCR testing within 24 h of admission as part of the standard diagnostic workup. Only patients with PCR-confirmed SARS-CoV-2 infection were included in this analysis. Exclusion criteria were limited to patients under 14 years of age, pregnant individuals, and those without symptoms of COVID-19 (because admission chest CT was not routinely performed in asymptomatic suspected cases). Dataset 1 included 252 records and Dataset 2 comprised 175 records. Notably, none of the included patients had received COVID-19 vaccination.
The study received ethical approval from the Medical Ethics Committees (Commissie Medische Ethiek, IRB) of UZ Brussel (reference BUN 2020/106) and AZ Delta (Clinical Trial Number: IRB B1172020000008). Both committees waived the requirement for written informed consent because the study involved secondary use of anonymized data collected during standard care. Financial support was provided by the European Commission’s Horizon 2020 Research and Innovation program (grant agreement 101016131, ICOVID). All methods of the present study were performed in accordance with the relevant guidelines and regulations.
CT protocol
At UZ Brussel, all patients underwent CT imaging on the GE Apex Revolution CT system with a slice thickness of 1 mm. The protocol used spiral acquisition with a pitch of 1, a rotation time of 0.35 s, and automated selection of tube voltage (kVp) and tube current (mA), resulting in an average dose-length product (DLP) of 149 mGy·cm.
At AZ Delta, patients were scanned using multidetector computed tomography (MDCT) on one of the following systems: Siemens Somatom Force, Siemens Somatom AS (both utilizing a 1-mm slice thickness), or the GE Optima 660 scanner (with a 1.25-mm slice thickness). The scanning parameters included a tube voltage of 120 kVp and automatic tube current modulation ranging from 30 to 70 mA·s, with a median DLP of 520 mGy·cm (range: 310 to 906 mGy·cm), and a median estimated effective dose of 7.6 mSv (range: 4.2 to 11.2 mSv).
All imaging procedures were conducted without intravenous contrast administration, with patients positioned supine and instructed to perform end-inspiratory breath-holding.
Image evaluation
Two thoracic radiologists, each with over ten years of experience, retrospectively analyzed the CT scans on PACS workstations (Agfa Impax 6.5.3 at UZ Brussel; Sectra IDS7 at AZ Delta). Multiplanar reconstruction tools were used to evaluate the number of affected lobes and to calculate the CT severity score (CTSS)13, which quantifies the severity of pulmonary involvement. For each patient, the five lung lobes were visually scored for the extent of lung abnormalities as follows: 0 (no involvement), 1 (< 5% involvement), 2 (5–25% involvement), 3 (26–49% involvement), 4 (50–75% involvement), and 5 (> 75% involvement). The CTSS was determined by summing the individual lobe scores, yielding a total score from 0 to 25. Final scoring was reached by consensus.
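The lobe-to-score mapping and the summation above can be sketched as follows; the lobe abbreviations and the example percentages are illustrative, not taken from the study data.

```python
# Hypothetical sketch of the CT severity score (CTSS) computation:
# each lobe's visually estimated % involvement maps to a 0-5 score,
# and the five lobe scores are summed (total range 0-25).

def lobe_score(percent_involved: float) -> int:
    """Map the estimated % involvement of one lobe to a 0-5 score."""
    if percent_involved == 0:
        return 0
    if percent_involved < 5:
        return 1
    if percent_involved <= 25:
        return 2
    if percent_involved <= 49:
        return 3
    if percent_involved <= 75:
        return 4
    return 5

def ctss(lobe_percentages: dict) -> int:
    """Sum the five lobe scores into a total CTSS."""
    return sum(lobe_score(p) for p in lobe_percentages.values())

# Illustrative patient: right upper/middle/lower, left upper/lower lobes
example = {"RUL": 10, "RML": 0, "RLL": 60, "LUL": 3, "LLL": 80}
print(ctss(example))  # 2 + 0 + 4 + 1 + 5 = 12
```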
Software-based quantitative assessment of lung parenchyma involvement was performed using Icolung (version 0.7; URL: https://icovid.ai/) by Icometrix (Leuven, Belgium), a commercially available, cloud-based AI algorithm and one of the first AI tools to receive CE marking for the quantification of lung pathology on chest CT scans. Icolung uses deep learning models, specifically convolutional neural networks (CNNs) based on 2D and 3D U-Net architectures, trained on a large dataset of clinical CT scans with voxel-level delineations of lung abnormalities annotated by experienced radiologists. The precise composition of the training dataset is detailed in the software’s technical documentation available from Icometrix. Icolung enables fully automated, quantitative analysis of lung pathology on non-contrast chest CT scans of patients with COVID-19, generating reports that provide objective evaluations of lung lesions (Fig. 1). These reports include volumetric analysis per lobe and for the total lung, alongside quantification and classification of CT findings such as ground-glass opacities (GGOAI), crazy paving patterns (CPPAI), consolidations (COAI), and the total percentage of lung parenchyma affected (TOTALAI). Performance validation demonstrated robust classification performance in detecting COVID-19 pneumonia. Icolung was developed collaboratively by Icometrix, University Hospital Brussels, VUB/ETRO, and KU Leuven14,15.
Fig. 1. Structured report with analysis results. The report includes 2D visualizations of 3D segmentation masks of abnormalities in both axial and coronal views. It also features a table summarizing total lung involvement percentages and the specific involvement percentages for each type of lung abnormality: ground glass opacities (GGO), crazy paving patterns (CPP), and consolidations (CO). These values are provided for each lung lobe and for the lungs as a whole.
Laboratory testing
COVID-19 detection for Dataset 1 (UZ Brussel) utilized real-time RT-PCR with the RealStar® SARS-CoV-2 RT-PCR Kit 1.0 (Altona Diagnostics, Germany), targeting the E-gene and S-gene. Hematological analyses (white blood cell (WBC), lymphocyte, and neutrophil counts) were performed using the CELL-DYN Sapphire analyzer (Abbott Diagnostics, USA), while clinical chemistry parameters (CRP and eGFR) were measured on the Cobas c501 module (Roche Diagnostics, Switzerland).
For Dataset 2 (AZ Delta), a multiplex RT-PCR assay (Allplex™ 2019-nCoV, Seegene Inc., Korea) targeting the E, N, and RdRP genes was used for COVID-19 detection. Hematological parameters were assessed with Sysmex XN analyzers (Sysmex Corporation, Japan), and clinical chemistry evaluations (CRP and eGFR) were performed on the Cobas c501 module (Roche Diagnostics, Switzerland).
Although the analyzer platforms differed between sites, the standardized assays ensured consistent and comparable data across both datasets.
Data preprocessing, characterization and feature selection
Before feature selection, both hospital datasets underwent preliminary preprocessing, including checks for duplicate records and handling of missing data. Missing values were imputed to obtain complete datasets; imputation was favored over record removal to preserve data, given the relatively small dataset sizes.
Descriptive and comparative statistics were computed for both datasets. Pairwise comparisons of all features were conducted based on mortality outcomes within each dataset separately. Given that none of the continuous variables followed a normal distribution, the Mann-Whitney U test was employed for these comparisons. For dichotomous variables, proportions were analyzed using the Chi-squared test. Additionally, cohort-level comparisons were performed for the 13 shared features across both datasets, analyzing the complete patient populations and stratifying by mortality outcomes between the two cohorts.
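The two comparison tests named above can be sketched with SciPy on synthetic values (the study's own measurements are not reproduced here; the feature names are illustrative).

```python
# Sketch of the pairwise comparisons: Mann-Whitney U for non-normal
# continuous features, Chi-squared for dichotomous features, both
# stratified by mortality outcome. Synthetic, illustrative data only.
import numpy as np
from scipy.stats import mannwhitneyu, chi2_contingency

rng = np.random.default_rng(1)

# Continuous, non-normally distributed feature (e.g. CRP) by outcome
crp_survivors = rng.lognormal(mean=3.0, sigma=0.8, size=120)
crp_deceased = rng.lognormal(mean=3.6, sigma=0.8, size=25)
u_stat, p_mwu = mannwhitneyu(crp_survivors, crp_deceased)

# Dichotomous feature as a 2x2 contingency table (yes / no per outcome)
table = np.array([[30, 90],   # survivors
                  [15, 10]])  # deceased
chi2, p_chi, dof, expected = chi2_contingency(table)
print(p_mwu, p_chi)
```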
Starting with the complete feature set of Dataset 1, collinearity was assessed using the Pearson correlation coefficient: for each strongly correlated pair (absolute r > 0.6), one feature was removed. Multicollinearity was then examined using the Variance Inflation Factor (VIF); features with a VIF exceeding 10 were iteratively excluded until no remaining feature exhibited significant multicollinearity.
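The two-stage collinearity filter can be sketched as below, assuming the statsmodels VIF helper; the column names and synthetic data are illustrative, not the study's features.

```python
# Illustrative sketch of the correlation + VIF pruning described above.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_correlated(df: pd.DataFrame, threshold: float = 0.6) -> pd.DataFrame:
    """For each pair with |Pearson r| > threshold, drop the later column."""
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)

def drop_high_vif(df: pd.DataFrame, max_vif: float = 10.0) -> pd.DataFrame:
    """Iteratively remove the highest-VIF feature until all VIF <= max_vif."""
    X = df.copy()
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        if vifs.max() <= max_vif:
            return X
        X = X.drop(columns=[vifs.idxmax()])

# Synthetic example: 'age_copy' is nearly identical to 'age'
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "age": a,
    "age_copy": a + rng.normal(scale=0.1, size=200),
    "wbc": rng.normal(size=200),
})
reduced = drop_high_vif(drop_correlated(df))
print(list(reduced.columns))
```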
Advanced feature selection methods (complete description in Supplementary Materials and Methods) were subsequently applied to the reduced feature set. These methods included wrapper techniques based on classification models (Sequential Feature Selectors and the Boruta method) as well as statistical scoring methods, such as Scikit-learn’s SelectKBest and the Minimum Redundancy Maximum Relevance (mRMR) method.
In total, 11 distinct feature selection techniques were implemented across 30 train-test splits (75%-25%) of Dataset 1. Feature rankings were determined by the frequency with which each feature was identified as most important during the selection process. The final feature set retained only those features consistently ranked as important across multiple methods and iterations.
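The frequency-based ranking across methods and splits amounts to counting how often each feature is selected; a minimal sketch, with hypothetical selector outputs standing in for the 11 techniques × 30 splits:

```python
# Minimal sketch of the frequency-based feature ranking: each selector run
# returns the features it deems important, and features are ranked by how
# often they are selected overall. The runs below are made-up examples.
from collections import Counter

runs = [
    ["Age", "WBC", "TOTALAI"],   # e.g. Boruta on split 1 (hypothetical)
    ["Age", "TOTALAI", "CRP"],   # e.g. mRMR on split 1 (hypothetical)
    ["Age", "WBC", "TOTALAI"],   # e.g. SelectKBest on split 2 (hypothetical)
]
counts = Counter(f for run in runs for f in run)
ranking = [f for f, _ in counts.most_common()]
print(ranking[:3])  # the features most consistently selected
```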
Training of machine learning algorithms
Machine learning models were developed using Python (v3.11.4) with the Scikit-learn library (v1.2.2). Nine distinct algorithms were evaluated: Multiple Logistic Regression (MLR), Random Forest (RF), Support Vector Classifier with a Linear Kernel (SVC-Linear), Support Vector Classifier with a Radial Basis Function (RBF) Kernel (SVC-RBF), K-Nearest Neighbors (KNN), Gaussian Naive Bayes (GNB), Extreme Gradient Boosting (XGBoost), a Keras-based Neural Network (Keras NN), and an Ensemble Voting Classifier (EVC) that combines the outputs of all these models. The input features used in all models included age, white blood cell (WBC) count, and the total percentage of lung parenchyma affected (TOTALAI), as assessed by the Icolung algorithm.
Before the machine learning procedures, Dataset 1 was split, stratified by outcome, into an 80% training set and a 20% holdout set. Because of the class imbalance (15.5% positive samples), the Adaptive Synthetic Over-sampling Technique (ADASYN), based on SMOTE principles22,23,24,25, was applied to the training folds to increase the positive class to 33.3%; validation checks ensured that the synthetic samples remained within a controlled Euclidean distance of the original data in feature space, minimizing bias. For each machine learning model, hyperparameters were optimized by grid search on the upsampled training set, using stratified 5-fold cross-validation to reduce the risk of overfitting.
The ADASYN upsampling, implemented as part of the machine learning pipeline, was applied only to the training data within each cross-validation fold. This ensures the validation folds as well as the holdout set remain independent of any influence from the upsampling procedure, enabling an objective assessment of model performance.
Due to the small dataset size, a common challenge in medical research, the described method is sensitive to the specific split of Dataset 1 into training and holdout sets. This sensitivity can make the performance metrics less reliable and not fully representative of real-world generalizability. To mitigate this limitation, the process was repeated 30 times, yielding 30 sets of optimized hyperparameters and corresponding performance metrics. The holdout ROC-AUC and F1-score values were averaged, and a final training/holdout split was determined by iteratively adjusting the split until its holdout ROC-AUC deviated by less than 0.02 from the average ROC-AUC across the 30 iterations. This final split was then used to train the ultimate estimator. A detailed explanation of this methodology is provided in the Supplementary Materials and Methods.
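The split-stability procedure can be approximated as follows: repeat the stratified split over different seeds, record each holdout ROC-AUC, and retain a split close to the 30-run average. This is a simplified sketch on synthetic data (hyperparameter tuning is omitted; the 0.02 tolerance follows the text).

```python
# Sketch of the repeated-split procedure: 30 stratified 80/20 splits,
# average the holdout ROC-AUCs, then pick the split closest to the average.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=250, n_features=3, n_informative=3,
                           n_redundant=0, weights=[0.845], random_state=0)

def holdout_auc(seed: int) -> float:
    X_tr, X_ho, y_tr, y_ho = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_ho, model.predict_proba(X_ho)[:, 1])

aucs = [holdout_auc(seed) for seed in range(30)]
mean_auc = float(np.mean(aucs))
# Retain the split whose holdout ROC-AUC lies closest to the average
final_seed = min(range(30), key=lambda s: abs(aucs[s] - mean_auc))
print(mean_auc, final_seed)
```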
Performance evaluation
To assess the performance of the final estimators, Dataset 2, which is temporally and geographically distinct from Dataset 1, was used as an external validation set. Each final estimator was applied to predict outcomes in Dataset 2, enabling the calculation of ROC-AUC values for each model. To statistically compare the ROC-AUCs of Dataset 1 and Dataset 2, the degree of overlap between their 95% confidence intervals was calculated following the methodology described by Cumming et al. (2005)26. At a significance level of 0.05, ROC-AUC differences were considered statistically significant if the proportion of overlap was 0.5 or less.
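A hedged sketch of this overlap criterion, under the common reading of Cumming et al. (2005) in which the overlap between two 95% CIs is expressed as a proportion of the average margin of error; the CI values below are made up for illustration.

```python
# Sketch of the CI-overlap comparison: proportion of overlap <= 0.5 is
# treated as significant at alpha = 0.05. Illustrative intervals only.

def proportion_overlap(ci_a, ci_b) -> float:
    """Overlap of two CIs as a fraction of the mean margin of error."""
    lo_a, hi_a = ci_a
    lo_b, hi_b = ci_b
    overlap = min(hi_a, hi_b) - max(lo_a, lo_b)      # negative if disjoint
    mean_moe = ((hi_a - lo_a) + (hi_b - lo_b)) / 4   # average half-width
    return overlap / mean_moe

# e.g. ROC-AUC 95% CIs for Dataset 1 vs Dataset 2 (hypothetical numbers)
p = proportion_overlap((0.80, 0.92), (0.78, 0.90))
print(p, "significant" if p <= 0.5 else "not significant")
```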
Categorization based on likelihood ratios
To enhance clinical interpretability and provide actionable insights16,17,18,19,20,21, the predicted probabilities for in-hospital mortality were categorized into clinically relevant groups based on increasing likelihood ratios (LRs): highly unlikely, unlikely, equivocal (gray zone), more likely, and likely. This categorization was derived using the UZ Brussel training set for the MLR model with the reduced feature set (“Age”, “TOTALAI”, and “WBC”), referred to as the “M3-score”.
The M3-scores for Dataset 2 were calculated and categorized according to the thresholds established in Dataset 1. For each category, LRs with corresponding 95% confidence intervals (CIs) were determined. Two complementary approaches were used to statistically compare the LRs of the UZ Brussel training set and the AZ Delta test set. First, the proportion of overlap between the 95% CIs was calculated following the method described by Cumming et al. (2005)26, with a significance threshold of 0.05 and statistical significance defined as a proportion of overlap of 0.5 or less. Second, to further evaluate statistical differences in LRs between the training and test sets, the delta method27 was applied to estimate confidence intervals and p-values for the ratio of likelihood ratios (ROR), using log-transformation for variance stabilization and Z-tests for significance; a small-value substitution was used in the lowest M3-score category (0.00–0.10) to accommodate zero likelihood ratios in the training set. A detailed explanation of this methodology is provided in the Supplementary Materials and Methods.
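A hedged sketch of the delta-method comparison, assuming the standard variance approximation for a log-transformed likelihood ratio, Var(ln LR) ≈ 1/a − 1/n1 + 1/b − 1/n2, where a of n1 deceased and b of n2 survivors fall in the category; all counts below are illustrative, not the study's.

```python
# Sketch of the ratio-of-likelihood-ratios (ROR) Z-test via the delta method.
import math
from scipy.stats import norm

def lr_and_var(a: int, n1: int, b: int, n2: int):
    """Category LR = (a/n1)/(b/n2) and delta-method variance of ln(LR)."""
    lr = (a / n1) / (b / n2)
    var_ln = 1/a - 1/n1 + 1/b - 1/n2
    return lr, var_ln

# Hypothetical category counts: training set vs test set
lr_tr, v_tr = lr_and_var(20, 40, 10, 200)   # training: LR = 0.5 / 0.05 = 10
lr_te, v_te = lr_and_var(15, 35, 12, 150)   # test set

ror = lr_tr / lr_te
z = (math.log(lr_tr) - math.log(lr_te)) / math.sqrt(v_tr + v_te)
p_value = 2 * (1 - norm.cdf(abs(z)))
print(round(ror, 2), round(p_value, 3))
```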
Statistical differences in the proportions of samples per LR category between the training set (Dataset 1, UZ Brussel) and the test set (Dataset 2, AZ Delta) were evaluated using the Chi-squared test.
Additionally, posttest probabilities were computed for all LR categories using Bayes’ theorem20, with the mortality prevalence serving as the pretest probability.
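The posttest-probability step is a direct odds calculation: pretest odds from the prevalence, multiplied by the category LR, then converted back to a probability. A minimal sketch with illustrative numbers:

```python
# Posttest probability via Bayes' theorem in odds form:
# posttest odds = pretest odds x LR, then convert odds back to probability.

def posttest_probability(prevalence: float, lr: float) -> float:
    pre_odds = prevalence / (1 - prevalence)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

# e.g. 15.5% mortality prevalence and a hypothetical LR of 4
print(round(posttest_probability(0.155, 4.0), 3))  # -> 0.423
```

With an LR of 1 the posttest probability equals the prevalence, which is a useful sanity check on the implementation.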
