Trial registration and consent
A prospective monocenter cohort study was carried out from April 2020 to November 2020. It received approval from an independent review board (CPP Est-III, reference 2020-A00885-34) and was registered on https://clinicaltrials.gov/ in April 2020 (reference NCT04357847, principal investigator: Pr E. Besnier). This study was classified as non-interventional and verbal informed consent was obtained from the patients before their participation. This study adhered to the principles outlined in the Helsinki Declaration.
Clinical data and sample analysis
Patients eligible for this study were required to have been admitted to an ICU for COVID-19. The criteria for non-eligibility included documented bacterial co-infection, pregnancy, known limitations in life-support due to the patient’s choice or significant comorbidities, and a life expectancy of less than 24 h.
The patient’s characteristics encompassed a range of factors, including age, gender, body mass index (BMI), underlying comorbidities, COVID-19 hemodynamic parameters, ICU transfer since the onset of symptoms, length of stay (days), clinical scores such as SAPS II (simplified acute physiological score) and SOFA (sequential organ failure assessment) and the PaO2/FIO2 ratio. Additionally, various biological parameters were meticulously documented, which included urea, creatininemia, C-reactive protein, high-sensitive troponin, NT-proBNP (N-terminal prohormone of brain natriuretic peptide), lactatemia, hemoglobin, platelets, leukocytes, neutrophils, lymphocytes, monocytes, fibrinogen, and D-dimers. We defined a positive composite score by the presence of at least one of the following clinical outcomes: invasive mechanical ventilation, organ failures, shock, or death. We used the composite score values (labels as zero or one) as the definition of the two groups we classified patients into (non-critical and critical respectively).
Blood samples were collected in ethylenediaminetetraacetic acid (EDTA) tubes within the first 24 h following ICU admission. Given that all patients were transferred from other departments (e.g., emergency or general hospitalization), pharmacological treatments were necessarily initiated prior to or during this 24-hour window. Following collection, the samples were centrifuged at 2,250 RCF for 10 min, and the resulting plasma was extracted and stored at -80 °C for subsequent analysis.
Continuous and count data were expressed as medians [interquartile range, IQR] and n (%), or n/N (%), respectively, where N represents the total number of patients with available data. P values were calculated using chi-square or Fisher’s exact tests for nominal data, and the Wilcoxon rank sum test for continuous data. The data analyses and statistical computations for these data were performed using R v.4.2.215.
Sample preparation and UPLC-MS analysis
Metabolites were extracted from 300 µL of patient plasma containing 30 µL of isotopically labeled internal standards using 700 µL of ice-cold methanol and subsequently purified on an Ostro column (Waters). After freeze-drying the resulting eluate, the samples were reconstituted in 200 µL of a 50:50 mixture of water and acetonitrile (H2O: ACN) and analyzed in random order. This analysis was carried out using an Acquity UPLC system coupled with a Synapt G2 quadrupole time-of-flight mass spectrometer (Waters, Manchester, UK) equipped with an electrospray ionization interface.
UPLC-MS analysis was performed as described by Paglia et al.16. Briefly, chromatographic separation was achieved in acidic conditions through hydrophilic interaction liquid chromatography (HILIC) using an Acquity amide column, 1.7 μm (2.1 × 150 mm; Waters) using as mobile phase A 100% H2O and as mobile phase B 100% ACN, both containing 0.1% formic acid. The elution gradient was as follows: 0 min 99% B, 7 min 30% B, 7.1 min 99% B, and 10 min 99% B. The flow rate was set at 0.4 mL/min, the column temperature was maintained at 45 °C, and the injection volume was 7.5 µL. Under acidic conditions, all samples were analyzed in both positive (Pos-Acidic) and negative modes (Neg-Acidic). Samples were similarly analyzed in basic pH conditions using the same column chemistry with mobile phase A consisting of 95:5 H2O: ACN containing 20 mM ammonium acetate and 20 mM ammonium hydroxide and mobile phase B 100% ACN. The elution gradient was as follows: 0 min 10% A, 5 min 20% A, 11 min 50% A, 13.5 min 60% A, 15 min 10% A, 20 min 10% A. The flow rate was set at 0.3 mL/min, the column temperature was maintained at 25 °C, and the injection volume was 7.5 µL. Samples were analyzed in negative ion mode under basic pH conditions (Neg-Basic).
The mass spectrometer operated with a capillary voltage of 1.5 kV, and the sampling cone and the extraction cone voltages were set at 30 V and 5 V, respectively. The cone and desolvation gas flow were 50 L/hr and 800 L/hr, respectively, while the source and desolvation gas temperatures were 120 °C and 500 °C, respectively. Mass spectra were acquired in centroid mode over the mass-to-charge (m/z) range of 50 to 1,000 with a scan time of 0.3 s. Leucine enkephalin (2 ng/µL) served as the lock mass (m/z 556.2771 and 554.2615 in positive and negative modes, respectively). Argon was used as the collision gas for both positive and negative modes.
For the high-resolution data-dependent acquisition (DDA) experiments, the collision energy in the trap cell was off, while in the transfer cell, it ranged from 10 to 30 eV in the positive mode and from 25 to 40 eV in the negative mode.
Metabolomics quantification
Data from each of the three LC methods (Pos-Acidic, Neg-Acidic and Neg-Basic) were preprocessed independently. Initially, the MassLynx files were centroided and converted to .mzData files using an in-house script in R15. Subsequently, XCMS17 was employed for automatic peak-picking (centWave;18) and retention time alignment (OBI-Warp;19). Following this, data underwent crop filtering, wherein all features eluting within the first 80 s and those with an m/z ratio less than 70 were eliminated from the dataset. The remaining feature intensity values were subjected to a generalized logarithmic transformation (glog;20) to achieve normally distributed feature intensities. This particular transformation was selected to accommodate the transformation of intensity values that could be zero.
To rectify the drift in feature intensity values that occurs following the UPLC-MS procedure, normalization was carried out using a quality control sample-based robust LOESS (locally estimated scatterplot smoothing) signal correction (QC-RLSC)21. To ensure quality control and reproducibility, features with a relative standard deviation of over 30% in the QC samples and those with a D-ratio exceeding 50% were systematically removed22.
To identify potential outliers, the density-based spatial clustering of applications with noise (DBSCAN) algorithm was utilized. A neighbor count of k = 5 was employed, and the threshold for outlier samples across all three UPLC-MS methods was automatically determined using the KneeLocator function from the kneed Python module23. Based on these results, one patient sample (from the combined LC methods) was excluded from further analysis. Subsequently, ComBat24 was used to eliminate any remaining batch effects. Afterward, the feature intensity values were standardized by subtracting the mean and scaling them to achieve unit variance. The data from the three LC methods were then combined into a single dataset comprising 6,667 features. A supplementary file containing a Jupyter Notebook with all the metabolomic data preprocessing steps is available in GitHub repository https://github.com/siggitrausti/covid19_metabolomics.
Module identification and annotation
We constructed a weighted correlation network to identify functionally interconnected features aggregated into modules. The construction of this network involved utilizing the NetworkX Python package to implement the network structure25, with weights assigned based on the absolute Pearson correlation coefficient values between features. Following network construction, we employed the Louvain community detection algorithm26 to identify modules within the weighted correlation network.
Each identified module within the weighted correlation network was subsequently represented by its eigenvector, obtained using singular value decomposition. To annotate individual modules within the weighted metabolomic correlation network, a random sampling-based enrichment analysis was conducted. Initially, a comprehensive list of 400,000 metabolites associated with metabolic pathways (HMDB-to-Pathway data set) was obtained from PathBank27. Subsequently, the mass-to-charge ratio (m/z) values of all features within the weighted correlation network were compared to annotated m/z values associated with known metabolites from the Human Metabolome Database (HMDB), including potential adducts formed during electrospray ionization. We accepted all potential matches within a 5 ppm range, and summarized them in a structured manner in the MZ-to-HMDB data set.
We counted the occurrence of all pathways mapped to each metabolic module feature (using the MZ-to-HMDB and HMDB-to-Pathway data sets). To assess the significance of pathway enrichment within each module, we conducted a random sampling-based analysis. This analysis involved building 1,000 lists of randomly sampled features from the entire dataset, with the occurrence of pathways counted for each sample. Subsequently, empirical P values were calculated for all pathways associated with a module using the formula:
$$\:P=\:\frac{r\:+\:1}{n\:+\:1}$$
where P represents the empirical P value, r denotes the number of random samples where the number of pathway occurrences is equal to or greater than that observed for the module, and n is the total number of random samples.
Jupyter Notebooks with all the weighted correlation network construction and annotation steps are available in GitHub repository https://github.com/siggitrausti/covid19_metabolomics.
Metabolites manually curated annotation
Importantly, the annotations for the features produced by this procedure are considered only putative, as they are solely based on m/z ratios. This necessitated the execution of MS2 analysis to confidently annotate the top features of interest. The selected m/z features were annotated using a previously published in-house database16. For m/z features not encompassed within that database and to enhance the annotation confidence of selected statistically significant m/z features within modules, a combination of two strategies was employed: (1) The CAMERA algorithm28 was applied to the output of the XCMS peak identification process, and peaks were subsequently grouped together based on the PC group number. All masses within a PC group were then incorporated as a fragment list into the Metlin v 2.0 precursor search engine. (2) The fragmentation pattern acquired from the DDA analysis of the sample pools was also examined. The list of fragment m/z values corresponding to each feature was compared to the Metlin online database and an in-house library. A minimum of 3 common fragment masses (ppm < 10) between the uploaded experimental spectra and those in Metlin were considered valid. The combination of strategies 1 and 2 resulted in annotation levels of 1 and 2. The methodology and the respective levels of annotation were adhered to as previously described29. Curated annotated metabolites are available in Supplementary Information Table SIT1.
Modeling of patient outcomes based on metabolomic features
The objective of the modeling procedure was to identify which modules exhibited the greatest discriminatory power in classifying COVID-19 patients according to their composite outcomes. To accomplish this, the module eigenvectors obtained from the weighted correlation network analysis served as reference points for feature selection. The module eigenvectors, which capture the overall pattern of features within a module, were employed as a robust metric for assessing the core characteristics of the module. This approach enabled the pinpointing of a singular metabolic feature that encapsulates the predominant metabolic behavior, providing a concise representation of the module’s biological significance.
Subsequently, a heuristic approach was conducted to determine the optimal combination of features representing each module within a logistic regression framework for predicting the composite outcome of patients. The search spanned from a minimum of 1 feature up to 20 features, allowing for the identification of a parsimonious yet informative set of predictors. We employed a bootstrap resampling approach to assess the performance of each logistic regression model comprehensively. This method involved repeatedly sampling with replacement from the dataset and evaluating the model performance on the out-of-bag set, therefore providing a robust estimate of the model’s predictive accuracy. This iterative process aimed to enhance the discriminatory power of the model by identifying a subset of features that collectively contribute to robust predictions in the context of COVID-19 severity. Administered drugs were excluded as top metabolic features predictive of patient outcome.
All the necessary code reproducing the modeling results is readily available in GitHub repository https://github.com/siggitrausti/covid19_metabolomics.