Enhanced diagnosis of multi-drug-resistant microbes using group association modeling and machine learning

Categories: Disease & Virus

March 26, 2025

Ethical approval

The use of archived clinical Mtb sample data analyzed in this study^25,53 was approved of use by the Beijing Chest Hospital, and West China Hospital, Sichuan University. For all prospective data analyzed, ethical approval was obtained from the Research Ethics Committee of Third People’s Hospital of Shenzhen.

Dataset and exclusion criteria

The CRyPTIC database was selected for its complete and consistent phenotyping across 13 drugs, as well as its extensive global coverage, features that distinguish it from other larger databases. Sequence and drug susceptibility testing (DST) data for 13 antibiotics were obtained from 12,289 Mtb isolates in the CRyPTIC database³⁷. These files were filtered using CRyPTIC criteria to remove entries that lacked high-quality DST results or sequence data²⁹, and to exclude entries with incomplete DST data profiles for any drug (N/A, blank, or indeterminate) prior to their use as input files in a GAM analysis (Supplementary Fig. 1).

GAM analysis procedure in identification of AMR-associated sequence variants

CRyPTIC database entries that did not meet study exclusion criteria (Supplementary Fig. 1) were grouped based on their unique drug resistance profiles to produce a control group that contained the genetic variations present in isolates susceptible to all 13 drugs, and an array of groups with distinct drug resistance profiles and associated genetic variants. Resistance groups analyzed by GAM were defined based on drug susceptibility testing (DST) profiles, where isolates sharing identical drug resistance profiles were grouped together. To ensure statistical robustness, we selected all observed resistance groups that included two or more Mtb strains to maximize both the number and diversity of resistance groups. Non-synonymous genetic variants that differed from the Mtb H37Rv reference genome (NC000962.3) were identified in all groups. Differences in variant detection frequency between the drug-susceptible control group and each drug-resistant group were analyzed using Fisher’s exact tests corrected for multiple tests to identify genetic variants associated with drug resistance without considering specific associations (Supplementary Fig. 2). Variants not significantly enriched in at least one drug-resistant group were eliminated from further analyses. Next, all variants enriched in groups resistant to a specific drug were compared to those present in all groups not resistant to that drug by Fisher’s exact tests corrected for multiple tests to identify variants associated with specific resistance phenotypes.

The GAM procedure identifies DNA variants that are significantly associated with specific single drug AMR phenotypes, while results are interpreted at both the gene and sequence variant level. AMR-associated variants are first linked with their respective genes and then all mutations within these genes are then analyzed to identify those that are significantly associated with the targeted single-drug AMR phenotype.

Significant drug-gene interactions detected by GAM were then further analyzed for variants associated with drug resistance by screening coding region mutations with these genes using WHO-proposed confidence gating criteria³⁵. A variant was considered to have positive predictive value (PPV) if it was the only variant in the target gene of ≥5 isolates (solo mutation), had a resistance-associated solo mutation odds ratio (OR-solo) ≥1, an OR-solo PPV 95% lower confidence interval ≥25%, and an OR-solo false discovery rate-corrected Fisher’s exact test p-value ≤ 0.05. Double mutations were considered when there were no solo mutations in a target gene and one mutation was a neutral mutation.

Correction for multiple testing

Fisher’s exact test results were adjusted using Bonferroni corrections for multiple tests based on the number of unique variants employed in each test. Variants were considered significantly associated with resistance if they were detected at odds ratios ≥1 and had p-values < α/n, where α = 0.05 was employed as the false positive rate and n denoted the number of analyzed genetic variants^28,54. There were 8320 mutations across all groups that significantly differed from the control group, resulting in a final p-value threshold of −log₁₀^p-value 5.22. The −log₁₀^p-value significance thresholds for all 126 group associations are available on Figshare (SF9) along with a full list of Mtb isolates within each group (SF10).

GAM identification of S. aureus AMR variants

A database of 3942 S. aureus isolates resistant to a spectrum of 13 drugs (gentamicin, penicillin, cefoxitin, fusidic acid, teicoplanin, vancomycin, erythromycin, clindamycin, linezolid, ciprofloxacin, rifampicin, tetracycline, and trimethoprim)⁵⁵ were analyzed using a methodology that mirrored that used to analyze the CRyPTIC Mtb isolates. However, given the smaller size of the overall sample and its control group, the order of GAM analyses was reversed, so that resistance groups, were first compared to identify SNPs that were detected at different frequencies in the resistance groups that were resistant and sensitive to a specific drug, and these SNPs were then compared to the control group to exclude neutral mutations. Groups were also allowed to contain single isolates, and the number of solo mutations required for inclusion was lowered to ≥1. For instances where drug resistance mechanisms stemmed from gene acquisition rather than mutation, a dataset of all genes present within each isolate was used and the analysis terminated at the gene level, as further data extraction was not feasible. S. aureus HO 5096 0412 (HE681097) was used as the reference genome.

Evaluation of GAM performance

To evaluate the accuracy of GAM identification of variants associated with resistance, GAM and GWAS LMM outputs were compared for true positive, false positive, and true negative counts based on known variant-resistance associations summarized in the WHO Mtb mutation catalogue, which is the clinically recognized gold standard reference for resistance-associated Mtb mutations³⁶.

Additionally, we assessed predictive performance by comparing the sensitivity, specificity, and PPV of GAM outputs versus the WHO mutation catalogue as predictors of phenotypic drug susceptibility across different datasets.

Comparison of ML models

Eight machine learning models (Gradient Boosting, Naive Bayes, Random Forest, AdaBoost, Nearest Neighbor, RBF SVC, Decision Tree, and Neural Network) were trained on the same 7129 Mtb isolates using default hyperparameter tuning set by scikit-learn library (Version 1.3.0). Statistical significance of accuracy differences among these models, the primary performance metric, was assessed using one-way analysis of variance (ANOVA) against the Gradient Boosting reference model with Dunnett’s test used to correction for multiple comparisons. Analyses were performed in Python using scikit-learn using a ten-fold cross-validation approach to ensure model robustness.

GAM + ML model generation

A multi-output classification was performed using the scikit-learn library (Version 1.3.0), where input data (X) was GAM-highlighted mutations and labeled data (y) was the drug resistance profiles for individual isolates. To assess the ML model’s performance, the data was analyzed using 10-fold stratified cross-validation (90% and 10% into training and testing sets, respectively) and fold outputs were conjugated for calculation of overall results. This ML classification model used a Gradient Boosting classifier with a learning rate of 0.1 and 950 estimators and a multi-output classifier wrapper to allow it to manage multiple target variables, using the mean of 100 repeats to evaluate the model’s performance.

ML model ROC and AUC analyses

ROC curves were generated for each drug using the scikit-learn library (Version 1.3.0) to assess the ability of the Gradient Boosting models to predict resistance to each analyzed drug. This process was repeated three times using different random seeds to ensure data robustness, after which mean AUC ± standard deviations were computed for each drug. Optimal cutoff value determinations for individual drug resistance predictions were identified as the values that matched the highest F1 scores.

Comparison of ML models generated with GAM and WHO catalog input data

Significant GAM outputs and 2021 and 2023 WHO mutation catalog results were utilized as inputs for a Gradient Boosting model that used consistent model parameters and differed only in their respective input sources. The 7129 Mtb isolate dataset was partitioned into training and testing subsets containing 75% and 25% of these isolates, respectively, and model accuracy for each drug was compared using 10 consecutive random seeds. Statistical significance of accuracy disparities versus the GAM reference model was analyzed using row-matched one-way ANOVAs employing Geisser-Greenhouse corrections and Dunnett’s test for multiple comparisons. These analyzes were performed in Python using the scikit-learn library (Version 1.3.0) using a 10-fold cross-validation approach to improve result reliability.

This process was replicated using held out data from 428 DS3 isolates employed as a validation dataset and with an independent dataset of 427 isolates from a multisite cross validation dataset (DS4) generated from three sites in China. DS4 consisted of 80 samples from West China Hospital (Chengdu), 281 from Beijing Chest Hospital (Beijing), and 102 samples from Shenzhen Third People’s Hospital (Shenzhen), after excluding 36 samples with low genome alignment rates (<50%) or read depths (<10), poor DST quality, or missing phenotypic information.

Clinical sample culturing, phenotyping, and genotyping

Archived data from West China Hospital and Beijing Chest Hospital isolates analyzed in this study was previously collected using published protocols^25,53. Data from Shenzhen Third People’s Hospital was generated from isolates cultured between 2015 and 2023. Sputum specimens obtained from patients were inoculated into Bactec MGIT 960 culture tubes (Becton, Dickinson and Co., Sparks, MD, USA, 245122) to verify positive isolates after standard NALC-NaOH decontamination (BaSO, Wuhan Jinhong Biotech Development Co., China, BC1999). Positive isolates were identified using acid-fast staining kit (BaSO, Wuhan Jinhong Biotech Development Co., China, Catalog Catalog BA4092A). Strains isolated from sputum were subcultured on Lowenstein-Jensen (L-J) medium agar (BaSO, Wuhan Jinhong Biotech Development Co., China, CUSTOM0038) for phenotypic drug susceptibility testing (DST) and DNA extraction.

Phenotypic MIC DST was performed according to the manufacturer’s instructions (Trek Diagnostic Systems Ltd., UK, V3020)^56,57. Colonies in the log-phase growth stage were suspended in a saline-Tween solution (Trek Diagnostic Systems, T3491), adjusted to a McFarland standard of 0.5, and allowed to settle for 15 min. A 100 µL aliquot was transferred to 11 mL of Middlebrook 7H9 broth (Trek Diagnostic Systems, T3441) and vortex-mixed for 20 s. Another 100 µL of this material was inoculated into each well of the Sensititre MYCOTB MIC plate (Trek Diagnostic Systems, MYCOTBI). Each MYCOTB plate consisted of 2 antibiotic-free positive control wells and 94 antibiotic-containing wells testing 7 antibiotics: amikacin, ethambutol, ethionamide, isoniazid, kanamycin, moxifloxacin, and rifampicin. Plates were covered with permanent plastic seals provided in the test kit and incubated at 37 °C in 5% CO₂. Growth was monitored at days 7, 10, 14, and 21 using a mirrored viewer. The lowest concentration with no visible growth was recorded as the MIC. Unless otherwise specified, MYCOTB test results were based on the first time point with adequate growth in the drug-free control wells.

Genomic DNA was extracted using the CTAB method with reagents from Sangon Biotech (Shanghai) Co., Ltd. Colonies from Lowenstein-Jensen (L-J) slants were collected into 500 µL of Tris-EDTA buffer (pH 8.0 Sangon Biotech, Catalog #B540625) and heated at 80 °C for 20 min. Lysozyme (50 mg/mL, Sangon Biotech, Catalog #B541002) was added (10 µL per tube), followed by vortex mixing and incubation at 37 °C for 2 h. Proteinase K (2 mg/mL, Sangon Biotech, Catalog #A414170) and 10% sodium dodecyl sulfate (SDS, Sangon Biotech, Catalog #A425678) were then added (50 µL each), vortexed gently, and incubated at 65 °C for 20 min. A 150 µL mixture of N-acetyl-N,N,N-trimethyl ammonium bromide (CTAB, Sangon Biotech, Catalog #A600108) and NaCl was added, followed by the addition of NaCl alone. The suspension was vortexed until milky and incubated at 65 °C for 10 min. Chloroform-isoamyl alcohol (24:1, Sangon Biotech, Catalog #A610278) was added (700 µL), vortexed, and centrifuged at 13,000 rpm for 5 min at room temperature using a microcentrifuge (Thermo Fisher Scientific Inc., USA, ModelST40R). The genomic DNA in the aqueous phase was then isolated by ethanol precipitation and resuspended in 30 µL of nuclease-free water (Sangon Biotech, Catalog #B300591).

DNA quality control was performed using an Agilent 5400 system. DNA shearing was conducted using a Covaris instrument (Covaris, USA, model ML230) to generate ~350 bp fragments. End repair was performed using T4 DNA polymerase (New England Biolabs, M0203L), and 3’ adenylation was performed with Klenow fragment (New England Biolabs, Catalog #M0212). DNA adaptors (Illumina, Catalog #FC-121-1031) were ligated using T4 DNA ligase (New England Biolabs, Catalog #M0202). Size selection was performed using SPRIselect beads (Beckman Coulter, USA, Catalog #2358413). The DNA library was amplified using high-fidelity polymerase (New England Biolabs, Catalog #M0530), and sequencing was performed using the Illumina NovaSeq platform (Illumina, USA, model NovaSeqX) with 150 bp paired-end reads (PE150).

Sample size and data completeness analyses

The effect of sample size on the ability for GAM to distinguish true-positive from false-positive associations was assessed by systematically reducing the dataset to contain 179, 529, 879, 1579, 2279, 2979, 4329, 5779, and 7179 samples. Each subset was analyzed using GAM to identify true- and false-positive drug-gene associations.

A comparative analysis of the effect of sample size on the ability of GAM and LMM to identify drug-gene associations was performed using the same datasets used to evaluate sample size effects on GAM performance. The LMM model employed in this analysis used FaST-LMM⁵⁸ with adjustments for kinship, lineage, geographic region of sample collection, and the testing site of each sample, where the kinship matrix was assessed using the Jaccard similarity index, and lineage, geographic region, and test site were included as covariates.

The effect of missing data was assessed by eliminating random drug resistance data fields from random isolates to generate data sets with missing DST data rates of 1, 5, 7.5 10, 12.5, 15, 17.5, 20, and 25%. Sample and data exclusion processes were repeated 10 times, and the resulting data was analyzed to determine the mean and standard error of the number of true- and false-positive drug-gene associations detected at each sample size or data drop rate. This analysis was not performed using the LMM approach, as FaST-LMM inherently excludes isolates with missing phenotype information, and thus the effects for missing DST data can be drawn from the effects of reduced sample size.

Both studies analyzed and/or dropped data from the three first-line drugs and six second-line drugs (amikacin, ethambutol, ethionamide, isoniazid, kanamycin, levofloxacin, moxifloxacin, rifabutin, and rifampicin), since the four new/reproposed drugs analyzed in the CRyPTIC dataset lacked significant drug-gene associations and thus did not contribute to GAM performance.

ML adjustment for missing data

To assess the impact of ML on GAM analyses performed with incomplete DST data, a multi-output classification task was performed using the scikit-learn library (Version 1.3.0). In this analysis, the 7179 CRyPTIC isolates suitable for GAM analyses were systematically divided into ML training sets containing 1, 25, 50, and 75% of these entries, with the remaining samples employed for the corresponding GAM testing dataset. These ML training datasets were supplemented with 428 entries (DS3) with unique drug resistance profiles that could not be grouped for GAM (Supplementary Fig. 1) but provided valuable information to train the ML model. These training datasets were used to train Gradient Boosting classifiers, configured with a learning rate of 0.1 and 100 estimators, and later applied to predict missing values in the testing datasets employed for GAM analyses. Missing data in these testing datasets was produced by randomly excluding a drug resistance data field from a randomly selected isolate to generate testing datasets 0, 5, 20, 40, 60, 80, and 99% missing DST data rates, and repeating this procedure five times for each missing data rate value. Missing DST data values were then imputed by the ML models and the adjusted data was subjected to GAM evaluation. All GAM results were analyzed to determine mean and standard error values.

Data visualization and statistics

Data visualization and statistical analysis were carried out using various software tools. Figures were created using GraphPad Prism (Version 10.0.2), with p-values obtained through a one-way analysis of variance (ANOVA) and Tukey’s multiple comparison test. Additionally, Matplotlib (Version 3.8.0) and Seaborn (Version 0.13.0) were used for generating other data plots to provide comprehensive visual representations of the study’s findings. The performance of machine learning models was assessed using the scikit-learn library (Version 1.3.0), enabling rigorous evaluation of their predictive capabilities. In investigating associations between categorical variables and addressing the need for multiple comparisons, we conducted Fisher’s exact tests and applied Bonferroni correction, making use of the statsmodel library (Version 0.14.0) to maintain statistical rigor. For the visualization of phylogenetic trees and their associated data, we utilized the phylobase (Version 0.8.10) and ggtree (Version 3.2.1) packages, providing a comprehensive representation of evolutionary relationships within the dataset. The GAM analysis code was developed and tested in Python (version 3.8.1) using Jupyter Notebook (version 7.3.2). Key libraries included Pandas (version 1.3.5), NumPy (version 1.18.2), SciPy (version 1.6.2), PySnpTools (version 0.0.2), FastLMM (version 0.0.1), and scikit-learn (version 1.6.1).