Comparison of different AI systems for diagnosing sepsis, septic shock, and cardiogenic shock: a retrospective study


May 7, 2025

Data source and selection

Given the difficulties of obtaining primary medical data, the MIMIC-III database was utilised in this study. This extensive, single-centre database consists of deidentified clinical data from patients admitted to critical care units of the Beth Israel Deaconess Medical Center in Boston, Massachusetts, between 2001 and 2012¹¹. The data are widely accessible to researchers under a data use agreement, so the study was exempt from the need for specific ethical review. Nevertheless, we adhered to fundamental ethical principles and ensured the validity and fairness of the study process.

The database contains data from 53,423 adult hospital admissions (aged 16 years or above)¹¹. Of these, only patients aged ≥ 18 years were included in this study.

The inclusion criterion was admission with a diagnosis of sepsis, septic shock, or cardiogenic shock according to the International Classification of Diseases, 9th Revision (ICD-9) diagnosis table (codes 785.51, 785.52, 995.91, and 995.92). Because the MIMIC-III data were collected before 2012, sepsis and septic shock admissions were re-evaluated on the basis of criteria from the 2016 and 2021 Surviving Sepsis Campaign guidelines¹,² and cardiogenic shock guidelines³,⁴.

A two-step reclassification process was conducted to differentiate between the sepsis and septic shock groups. Initially, patients diagnosed with sepsis, severe sepsis, or septic shock who had a SOFA score ≥ 2 were classified as having sepsis. Patients who did not fulfil this criterion were excluded from the dataset. Within the sepsis group, patients were then reclassified as septic shock cases if they also met the diagnostic criteria of vasopressor-dependent circulatory failure and hyperlactatemia (plasma lactate level > 2 mmol/L). Cases reallocated to the septic shock group were removed from the sepsis group.
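The two-step rule can be written as a small decision function. The following is a minimal Python sketch for illustration only (the study itself used R); the `Admission` record and its field names are hypothetical and do not correspond to MIMIC-III column names. It assumes the admission already carries an ICD-9 diagnosis of sepsis, severe sepsis, or septic shock.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Admission:
    """Hypothetical record; field names are illustrative, not MIMIC-III columns."""
    sofa: int                # SOFA score
    on_vasopressors: bool    # vasopressor-dependent circulatory failure
    lactate_mmol_l: float    # plasma lactate in mmol/L


def reclassify(adm: Admission) -> Optional[str]:
    """Return 'sepsis', 'septic_shock', or None if the admission is excluded."""
    # Step 1: a SOFA score >= 2 is required for the sepsis group.
    if adm.sofa < 2:
        return None
    # Step 2: vasopressor dependence together with lactate > 2 mmol/L
    # moves the case from the sepsis group to the septic shock group.
    if adm.on_vasopressors and adm.lactate_mmol_l > 2.0:
        return "septic_shock"
    return "sepsis"
```

Because the function returns exactly one label per admission, a case moved to septic shock can never also be counted as sepsis.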

We used MIT-LCP/mimic-code to create the SOFA score table¹². The criteria for the cardiogenic shock group were a low systolic blood pressure (< 90 mmHg) for a period of ≥ 30 min or the presence of two of the following three clinical signs: bilateral lung sounds, cool/cold skin temperature and pale/mottled/cyanotic skin colour. Admissions that did not meet these criteria were excluded. Across all groups, patients whose data did not satisfy the conditions described above were either reclassified or excluded from the study.
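The cardiogenic shock criteria amount to an either/or check. A minimal Python sketch, assuming the duration of hypotension and the documented clinical signs have already been derived per admission; the function name, parameter names and sign labels are hypothetical:

```python
# Hypothetical labels for the three clinical signs named in the text.
CLINICAL_SIGNS = {"bilateral_lung_sounds", "cool_skin", "abnormal_skin_colour"}


def meets_cardiogenic_criteria(minutes_sbp_below_90: float, signs_present: set) -> bool:
    """True if SBP < 90 mmHg lasted >= 30 min, or at least two of the three signs are present."""
    sustained_hypotension = minutes_sbp_below_90 >= 30
    # Count only recognised signs, ignoring any other labels in the input.
    n_signs = len(CLINICAL_SIGNS & set(signs_present))
    return sustained_hypotension or n_signs >= 2
```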

To prevent duplicate diagnoses from affecting the learning and assessment processes of the models, patients diagnosed with multiple conditions (cardiogenic shock and either sepsis or septic shock) during a single admission were also excluded.

The data adjustment process is shown in Fig. 1.

Data processing

The final dataset comprised 5,970 distinct hospital admissions, each considered a unique observation. To enhance the dataset, the Elixhauser comorbidity index¹³ was extracted from the MIMIC-III diagnostic table, and selected categories were incorporated as variables in the dataset. This classification, along with the index value, has been demonstrated to be highly important in predicting prognoses and mortality for a range of diseases and injuries. Compared with detailed ICD-9 codes, it allows better organisation of the corresponding data. The data from Quan et al.¹⁴, constructed via the MIT-LCP/mimic-code¹² methodology, were employed to generate predictor variables for this research.

In addition, specific predictor variables for conditions such as cardiac arrhythmias, myocardial infarction and infections were formulated by combining different ICD-9 codes into diagnostic groups. We also introduced predictor variables for age, sex, SOFA score ≥ 2, vasopressor dependence, lactate level > 2 mmol/L, and other clinical signs and indicators. Supplementary Table 1 online offers an overview of the created variables.

Feature selection

After the dataset had been cleaned, restructured and adapted to the new guidelines, a three-phase feature selection process was applied. In the first phase, variables that were deemed irrelevant to disease diagnosis were excluded. In the second phase, features that did not provide sufficient value, and features of unsatisfactory importance owing to low frequency or low variance between classes, were discarded. In the final phase, the remaining variables were evaluated with the mutual information score (MIS), using the median rather than the arithmetic mean as the cut-off threshold because the median is more robust to outliers. After the feature selection process, 12 of the 51 variables were included in the final dataset.
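The third phase can be illustrated with a short Python sketch for discrete features. The function names are hypothetical, and keeping features that score at or above the median (rather than strictly above it) is an assumption, since the text does not specify how ties with the threshold were handled.

```python
import math
from collections import Counter
from statistics import median


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences of equal length."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), expressed with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def select_by_median_mi(features, labels):
    """Keep features whose score is >= the median score (assumed tie rule)."""
    scores = {name: mutual_information(col, labels) for name, col in features.items()}
    cutoff = median(scores.values())
    kept = [name for name, score in scores.items() if score >= cutoff]
    return kept, scores
```

A single very informative feature raises the mean score but leaves the median unchanged, which is why a median cut-off is less sensitive to outliers.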

Table 1 offers an in-depth overview of the included prediction variables.

Table 1 Prediction variables.

Classifiers employed

To address the challenge of differentiating between the clinically similar conditions of sepsis, septic shock and cardiogenic shock, Bayesian network classifiers (BNCs) were employed. These classifiers leverage probabilistic relationships among selected clinical parameters (e.g., vasopressor dependency, lactate levels, SOFA scores, and infection status) to calculate diagnostic probabilities, thereby enabling nuanced differentiation among these clinically overlapping syndromes.

Various BNCs were trained on the dataset and evaluated for their predictive power in terms of accuracy, sensitivity, specificity, precision, F1-score, and interpretability. To assess the usefulness of the BNCs, the results were compared with those of other commonly used classifiers, such as a naive Bayes (NB) classifier, the One Rule classifier (OneR), a classification and regression tree (CART)-based classifier, and a feed-forward backpropagation artificial neural network (ANN). The basic concepts, methods and relevant algorithms of the individual classifiers are briefly summarised below.

Naive Bayes (NB)

A probabilistic algorithm based on Bayes’ theorem that operates under the assumption that all features are conditionally independent of one another given the class.
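To make the independence assumption concrete, the following is a minimal Python sketch of a naive Bayes classifier for discrete features. It is not the bnclassify implementation used in the study; the class name and the choice of Laplace (add-one) smoothing are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class CategoricalNB:
    """Naive Bayes for discrete features with Laplace (add-one) smoothing."""

    def fit(self, rows, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.n = len(labels)
        n_features = len(rows[0])
        # counts[j][c][v] = number of class-c rows whose feature j equals v
        self.counts = [defaultdict(Counter) for _ in range(n_features)]
        # values[j] = set of distinct values seen for feature j during training
        self.values = [set() for _ in range(n_features)]
        for row, c in zip(rows, labels):
            for j, v in enumerate(row):
                self.counts[j][c][v] += 1
                self.values[j].add(v)
        return self

    def log_posterior(self, row):
        """Unnormalised log P(class | row) for every class."""
        scores = {}
        for c in self.classes:
            lp = math.log(self.class_counts[c] / self.n)  # log prior
            for j, v in enumerate(row):
                # Independence assumption: per-feature likelihoods simply multiply.
                num = self.counts[j][c][v] + 1
                den = self.class_counts[c] + len(self.values[j])
                lp += math.log(num / den)
            scores[c] = lp
        return scores

    def predict(self, row):
        scores = self.log_posterior(row)
        return max(scores, key=scores.get)
```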

Tree-augmented naive Bayes (TAN)

An enhancement of NB, the TAN algorithm introduces feature dependencies via a tree structure, which can improve prediction accuracy over its naive counterpart. The structure is learned with methods such as the Chow–Liu algorithm, and candidate structures are scored with the Akaike information criterion (AIC), the Bayesian information criterion (BIC), or the log-likelihood (LOG).

Semi-naive Bayes classifier (SNBC)

SNBC is an adaptation of NB that considers feature interdependencies and eliminates, selects or joins features to improve classification accuracy. It uses algorithms such as backward sequential elimination and joining (BSEJ) and forward sequential selection and joining (FSSJ).

Benchmarking

We used three alternative techniques/algorithms as benchmarks to evaluate our models.

One rule classifier (OneR)

This one-rule algorithm classifies data on the basis of a single attribute. Despite the simplicity of the resulting model, OneR has been shown to deliver good results and thus serves as an excellent benchmark¹⁵.
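The idea behind OneR fits in a few lines. The sketch below is a simplified Python illustration for discrete features, not the OneR R package used in the study; the function name and return format are hypothetical.

```python
from collections import Counter, defaultdict


def one_r(rows, labels):
    """Return (feature_index, rule, training_errors) for the best single-feature rule."""
    best = None
    for j in range(len(rows[0])):
        # Tally class frequencies for each observed value of feature j.
        table = defaultdict(Counter)
        for row, y in zip(rows, labels):
            table[row[j]][y] += 1
        # The rule maps each feature value to its most frequent class.
        rule = {v: counts.most_common(1)[0][0] for v, counts in table.items()}
        errors = sum(rule[row[j]] != y for row, y in zip(rows, labels))
        # Keep the feature whose rule makes the fewest training errors.
        if best is None or errors < best[2]:
            best = (j, rule, errors)
    return best
```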

Classification and regression tree (CART)

A decision tree trained through recursive partitioning. The tree’s structure represents the course of the subsequent classification process, determined by the iterative division of the data into subgroups¹⁶.
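One partitioning step can be sketched as follows. This Python example uses Gini impurity, a common CART splitting criterion; whether the study's trees were grown with this criterion is an assumption, and the function names are hypothetical. A full tree would apply `best_split` recursively to each resulting subgroup.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, impurity) of the best binary split.

    Rows with feature value <= threshold go left, the rest go right.
    If no split improves on the unsplit node, feature and threshold are None.
    """
    n = len(labels)
    best = (None, None, gini(labels))
    for j in range(len(rows[0])):
        for t in sorted({row[j] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            if not left or not right:
                continue  # a split must produce two non-empty subgroups
            # Impurity of the children, weighted by subgroup size.
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if impurity < best[2]:
                best = (j, t, impurity)
    return best
```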

Feed-forward backpropagation neural network (ANN)

This algorithm is loosely modelled on the biological nervous system and is designed to mimic the communication between neurons. The feed-forward aspect describes the direction of signal transmission, whereas the backpropagation aspect describes the neural network’s learning process, wherein the loss function is minimised by adjusting the network’s weights and biases.

Evaluation metrics and procedures

All classifiers were trained, validated, and tested. First, the data were randomly divided into training and test datasets at a ratio of 80:20. Performance metrics, including accuracy, sensitivity/recall, specificity, precision, and F1-score, were derived from the resulting confusion matrices. Specifically, the true positive (TP), true negative (TN), false positive (FP) and false negative (FN) values presented in the confusion matrix were used to calculate the following metrics for evaluating the classifiers:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Sensitivity/Recall} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{F1-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

These metrics are class-specific and must therefore be calculated three times for this multiclass problem. Furthermore, the area under the curve (AUC) was calculated for the multiclass classification following the methods of Hand and Till¹⁷.
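Computing the metrics once per class means treating each class in turn as "positive" and the other two as "negative" (one-vs-rest). A minimal Python sketch, with a hypothetical function name; it assumes every class has at least one true and one predicted case, so that no denominator is zero.

```python
def per_class_metrics(confusion, classes):
    """One-vs-rest metrics from a square confusion matrix.

    confusion[i][j] = number of cases with true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    out = {}
    for k, name in enumerate(classes):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                  # true class k, predicted otherwise
        fp = sum(row[k] for row in confusion) - tp   # predicted k, true class otherwise
        tn = total - tp - fn - fp
        sensitivity = tp / (tp + fn)
        precision = tp / (tp + fp)
        out[name] = {
            "accuracy": (tp + tn) / total,
            "sensitivity": sensitivity,
            "specificity": tn / (tn + fp),
            "precision": precision,
            "f1": 2 * precision * sensitivity / (precision + sensitivity),
        }
    return out
```

For example, with purely illustrative counts (not results from the study):

```python
classes = ["sepsis", "septic_shock", "cardiogenic_shock"]
confusion = [[8, 2, 0], [1, 7, 2], [0, 1, 9]]  # made-up numbers
metrics = per_class_metrics(confusion, classes)
```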

We performed tenfold cross-validation using accuracy and Cohen’s kappa as internal validation metrics to assess the accuracy and reliability of the models independently of the random data split. Additionally, we attempted to address potential class imbalance through upsampling and then compared the results with those obtained without upsampling.
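The three ingredients of this step (fold assignment, Cohen's kappa, and upsampling) can be sketched in plain Python. The study used the caret package in R; the function names, the fixed seeds and the choice to upsample every class to the size of the largest class are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of nearly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cohens_kappa(y_true, y_pred):
    """Observed agreement corrected for the agreement expected by chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)


def upsample(rows, labels, seed=0):
    """Resample smaller classes with replacement up to the largest class size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, y in zip(rows, labels):
        by_class[y].append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(y)
    return out_rows, out_labels
```

A common precaution is to upsample only the training portion of each fold, so that duplicated cases never appear in the data used for validation.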

Computation and visualisation

All computations and visualisations were performed with R version 4.2.2, RStudio 2022.07.1 Build 554 and the packages bnclassify, caret, nnet, OneR, pROC and rpart¹⁸⁻²³.
