Semi-automated surveillance of surgical site infections using machine learning and rule-based classification models

Design, setting, and study population

This prospective cohort study was performed between 1 October 2016 and 30 September 2022 at Geneva University Hospitals (Geneva, Switzerland) as part of a national quality improvement programme (Swissnoso, Swiss National Center for Infection Prevention). We included adult patients (≥18 years old) undergoing elective or urgent surgery in the following surgical categories: cardiac surgery; coronary artery bypass grafting; colorectal surgery; laminectomy; and spinal fusion. SSI surveillance was performed according to Swissnoso SSI surveillance system guidelines34, which follow the definitions of the United States Centers for Disease Control and Prevention (CDC)35,36. Patients undergoing selected surgical procedures were prospectively included, with postoperative surveillance for 30 days, or 90 days if an implant was placed. Surveillance targeted deep incisional and organ/space SSIs, defined according to Swissnoso/CDC criteria. Deep SSIs are infections involving the fascia or muscle layers, diagnosed based on purulent drainage, wound dehiscence with clinical signs, or evidence of infection (e.g., abscess) on imaging or reoperation. Organ/space SSIs are infections involving anatomical compartments (e.g., peritoneum, mediastinum, bone), diagnosed based on purulent drainage from a drain, a positive culture, or similar evidence of infection in the affected compartment. SSI status was determined by trained infection prevention and control (IPC) professionals based on manual chart review and post-discharge telephone follow-up. Surveillance of surgical procedures involving implants was initially performed up to one year post-surgery but was shortened to 90 days after October 2021 in accordance with updated guidelines.
Superficial infections were excluded because their detection poses substantial methodological and practical challenges: they are mostly identified through post-discharge surveillance37, which could lead to underdetection and misclassification of SSI if included in a (semi-)automated surveillance system. Post-discharge surveillance involved up to five telephone call attempts by IPC professionals, who administered a standardised questionnaire. For study purposes, follow-up ended at 90 days for all patients undergoing surgery with implants. The clinical outcome was the occurrence of deep and organ/space SSI during the follow-up period.

The study was part of a quality improvement programme and did not require approval from an institutional review board or informed consent from participants. Data were de-identified prior to analysis.

Features

For each patient, we extracted a number of additional postoperative features from the electronic health record (eHR) up to the end of the follow-up period, including: number of days with antibiotic treatment; postoperative fever (temperature >38 °C); number of infectious disease consultations; number of bacteriological cultures (sterile fluids, biopsies, prosthetic materials, wound swabs, and aspirates from abscesses or joints); radiological examinations (diagnostic scans, targeted biopsies, and image-guided procedures); and frequency of occurrence of certain keywords in follow-up notes (e.g., “infection”, “redness”, “pus”). Supplementary Table 1 lists all input features, their encoding schemes, and preprocessing steps.

Outcome

The primary outcome was the classification performance and diagnostic accuracy of ML and rule-based models for detecting deep and organ/space SSIs in a semi-automated framework.

Statistical analysis

We split the dataset randomly into a training and a validation set using an 80/20 split. The validation set was used solely for model evaluation. The machine learning target variable was a binary outcome (y ∈ {0, 1}), where y = 1 denoted the occurrence of a deep or organ/space SSI, and y = 0 indicated no SSI, as determined by IPC professionals following routine surveillance procedures. This final classification, performed after the end of the follow-up period, was used as the reference standard for model training and evaluation. All predictor features were derived from data available prior to this timepoint to avoid data leakage.
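A minimal sketch of such a split, on synthetic data standing in for the real cohort: the `stratify` argument keeps the SSI proportion similar across the two sets (the text says only "randomly", so stratification here is an assumption; the prevalence and feature values are illustrative, not from the study).

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_procedures = 1000
X = rng.normal(size=(n_procedures, 5))              # synthetic procedure-level features
y = (rng.random(n_procedures) < 0.05).astype(int)   # illustrative ~5% SSI prevalence

# 80/20 split; stratify=y preserves the outcome proportion in both sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
```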

Each observation in the dataset corresponded to a unique surgical procedure, including associated perioperative and follow-up data. A schematic representation of the temporal alignment between surgery, hospital stay, and follow-up period for each procedure is shown in Supplementary Fig. 1. The train/test split was performed at the procedure level, in line with our aim to predict SSI occurrence following individual surgical interventions. All features and outcomes were specific to each procedure. The dataset was structured to minimise data sharing across procedures; only a single instance of overlapping follow-up between training and validation sets was identified. While the unique instance of overlap makes any practical data leakage negligible, dividing the data by procedure could, in theory, convey information whenever the same patient undergoes multiple operations. We acknowledge this methodological limitation and will therefore implement patient-level splits in forthcoming, larger cohorts to yield a more stringent assessment of model generalisability.

Within the training set, we used five-fold cross-validation38, ensuring similar proportions of SSI cases in each fold. We selected a combination of linear and non-linear models to balance interpretability and predictive performance39. Logistic regression and discriminant analysis were included for their transparency and widespread use in clinical research. Naïve Bayes was selected for its simplicity, speed, and robustness in high-dimensional data, despite its strong independence assumption. Random forests and XGBoost were included as ensemble models for their capacity to capture non-linear relationships and complex interactions. A dense neural network was included for its capacity to learn deeper representations of non-linear and complex relationships through its multiple fully connected layers with non-linear activation functions. Models were implemented using the scikit-learn, TensorFlow, and XGBoost libraries in Python.
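The cross-validation scheme can be sketched as follows, using a stratified five-fold split and a subset of the model families named above on toy data (the data, model settings, and metric reporting here are illustrative assumptions, not the study's actual configuration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = (X[:, 0] + rng.normal(scale=2.0, size=400) > 2.0).astype(int)  # imbalanced toy outcome

# StratifiedKFold keeps a similar SSI proportion in each of the five folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    auroc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: AUROC {auroc.mean():.3f} ± {auroc.std():.3f}")
```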

Hyperparameter tuning was performed using a grid search approach within the cross-validation framework, selecting the combination that maximised the area under the receiver operating characteristic curve (AUROC) and negative predictive value (NPV). All numerical predictors were then rescaled with minimum-maximum normalisation39. We favoured this operation over z-standardisation because many variables are naturally bounded counts or proportions, minimum-maximum scaling preserves the sparsity of binary indicators, and preliminary tests showed slightly faster neural-network convergence with identical AUROC. The final hyperparameters are displayed in Supplementary Table 2.
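One standard way to realise this grid search is to place the scaler inside a pipeline, so that minimum-maximum parameters are refit on each training fold only and never leak from the held-out fold. The sketch below optimises AUROC alone for simplicity (the study also considered NPV), and the model, grid values, and data are hypothetical:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.default_rng(1)
X = rng.poisson(3.0, size=(300, 4)).astype(float)     # count-like features, as in the study
y = (X[:, 0] + rng.normal(size=300) > 4.0).astype(int)

# The scaler is part of the pipeline, so it is refit within each CV training fold
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}       # hypothetical grid
search = GridSearchCV(
    pipe, param_grid, scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same pipeline object, refit on the full training set, then applies the training-derived scaling parameters to the validation set, as described for the second stage.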

In the second stage, final models were trained on the entire training set using the optimised hyperparameters and minimum-maximum normalisation. Model performance was evaluated on the independent validation set using minimum-maximum normalisation parameters derived from the training set to prevent data leakage.

We assessed the impact of omitting variables that may be difficult to extract from eHRs. Specifically, we evaluated model performance when excluding contamination class and keyword frequency features.

To benchmark ML models, we developed a rule-based classification model for SSI detection based on established clinical indicators of SSI routinely documented in the eHR (Supplementary Fig. 2). Patients were classified as having an SSI if they met any of the following criteria: ≥5 days of postoperative antibiotics; ≥1 readmission; ≥2 positive cultures; or ≥1 infectious disease consultation. The predictive performance of the model was compared with ML models to quantify improvements in prediction accuracy and workload reduction.
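The rule set reduces to a simple disjunction over eHR counts. A minimal sketch, with hypothetical field names but the thresholds stated above:

```python
def rule_based_ssi_flag(record: dict) -> bool:
    """Flag a procedure as possible SSI if any routine eHR criterion fires.
    Field names are hypothetical; thresholds follow the study's rule set."""
    return (
        record["antibiotic_days"] >= 5
        or record["readmissions"] >= 1
        or record["positive_cultures"] >= 2
        or record["id_consultations"] >= 1
    )

uneventful = {"antibiotic_days": 2, "readmissions": 0,
              "positive_cultures": 0, "id_consultations": 0}
suspect = {"antibiotic_days": 7, "readmissions": 0,
           "positive_cultures": 1, "id_consultations": 0}
print(rule_based_ssi_flag(uneventful), rule_based_ssi_flag(suspect))  # False True
```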

Assessment of model performance

Model performance was assessed using essential indicators of diagnostic accuracy: AUROC, area under the precision-recall curve (AUPRC), NPV, false-negative rate (FNR), and workload reduction. The latter was defined as the proportion of patients classified as not having SSI, and therefore not requiring manual chart review, under a semi-automated surveillance framework. For the primary analysis, we applied a default classification threshold of 0.5 for ML models, at which sensitivity, NPV and workload reduction were calculated. To evaluate threshold-dependent performance, we assessed model metrics, including sensitivity, specificity, NPV, workload reduction, and F2 score, across a range of classification thresholds (from 0.0 to 1.0 in increments of 0.1). Performance metrics were averaged across cross-validation folds for each ML model and reported with 95% confidence intervals (CIs). The best-performing linear and non-linear ML models were selected based on sensitivity, NPV and workload reduction.
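Under these definitions, workload reduction is simply the fraction of charts the model predicts negative (and hence excludes from manual review). A sketch of the threshold sweep, on a hand-made toy example rather than study data:

```python
import numpy as np

def semi_automated_metrics(y_true, prob, threshold=0.5):
    """Confusion-matrix metrics at a given threshold; workload reduction is the
    share of charts predicted negative and thus excluded from manual review."""
    y_true = np.asarray(y_true)
    pred = (np.asarray(prob) >= threshold).astype(int)
    tp = int(np.sum((pred == 1) & (y_true == 1)))
    tn = int(np.sum((pred == 0) & (y_true == 0)))
    fn = int(np.sum((pred == 0) & (y_true == 1)))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "npv": tn / (tn + fn) if tn + fn else float("nan"),
        "fnr": fn / (fn + tp) if fn + tp else float("nan"),
        "workload_reduction": (tn + fn) / len(y_true),
    }

y_true = [0, 0, 1, 1, 0]
prob = [0.1, 0.2, 0.9, 0.4, 0.6]
for t in np.arange(0.0, 1.01, 0.1):      # threshold sweep, 0.0 to 1.0 in steps of 0.1
    m = semi_automated_metrics(y_true, prob, t)
```

Note the trade-off this exposes: raising the threshold increases workload reduction but also moves true cases into the false-negative cell, lowering sensitivity and NPV.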

Model interpretability was assessed using SHapley Additive exPlanations (SHAP)40, which quantifies the contribution of each feature to model predictions by considering all possible feature combinations to calculate its marginal impact on the output. This ensures a fair attribution of prediction influence to each feature, providing both global feature-importance insights and local interpretability for individual predictions. This approach is particularly beneficial for complex models, offering transparency in decision-making and feature relevance.
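The study used the SHAP library; purely to illustrate the underlying computation, the sketch below enumerates all feature coalitions to obtain exact Shapley values for a toy linear model (absent features replaced by a background mean, independence assumed). For a linear model this reduces to phi[i] = w[i] * (x[i] - mu[i]):

```python
from itertools import combinations
from math import factorial

def exact_shapley_linear(w, x, mu):
    """Exact Shapley values for f(x) = sum_j w[j]*x[j]; features absent from a
    coalition are set to their background mean mu (independence assumed)."""
    d = len(w)

    def v(S):  # model output with features in S set to x, the rest to mu
        return sum(w[j] * (x[j] if j in S else mu[j]) for j in range(d))

    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for k in range(d):                      # coalition sizes 0 .. d-1
            for S in combinations(others, k):
                weight = factorial(k) * factorial(d - k - 1) / factorial(d)
                phi[i] += weight * (v(set(S) | {i}) - v(set(S)))
    return phi

w, x, mu = [2.0, -1.0, 0.5], [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
phi = exact_shapley_linear(w, x, mu)  # [2.0, -2.0, 1.5]
```

The attributions sum to the difference between the prediction and the baseline output, which is the "fair attribution" property the text refers to; real SHAP implementations approximate this computation efficiently for complex models.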

Model performance was assessed on a held-out validation set, and we compared cross-validation and validation performance to evaluate the risk of overfitting. In a sensitivity analysis, we introduced three random noise variables, generated from Gaussian (N[0,1]), uniform (U[0,1]), and Bernoulli (B[1, 0.1]) distributions, into the training dataset. These probes were not correlated with the outcome and served as negative controls. We evaluated their relative importance across models using SHAP values to assess whether the models inappropriately prioritised irrelevant features.
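Generating and appending such negative-control probes is straightforward; a minimal sketch, with a synthetic stand-in for the real feature matrix:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
X = rng.normal(size=(n, 6))          # stand-in for the real training feature matrix

# Three outcome-independent noise probes, matching the distributions in the text
noise_probes = np.column_stack([
    rng.normal(0.0, 1.0, n),         # Gaussian N(0, 1)
    rng.random(n),                   # uniform U(0, 1)
    rng.binomial(1, 0.1, n),         # Bernoulli with p = 0.1
])
X_augmented = np.hstack([X, noise_probes])
```

A well-behaved model should rank these probes near the bottom of the SHAP importance ordering; a probe outranking genuine clinical features would signal that the model is fitting noise.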

Statistical analyses were performed using Python version 3.9. The following packages were used for ML: TensorFlow (version 2.8.4); XGBoost (version 1.5.0); and scikit-learn (version 1.1.1).
