Plasma proteomics for biomarker discovery in childhood tuberculosis

July 19, 2025

Ethical considerations

This study complies with all relevant ethical regulations. All caregivers provided written informed consent, including for storage of samples for future studies, and children provided assent as applicable. The studies were approved by the Mulago Hospital Ethics Research Committee; the Gambian Government and MRC joint ethics committee; the London School of Hygiene and Tropical Medicine; the Institutional Ethics Committee for Research of the National Institute of Health, Peru; the University of Cape Town; and the University of California, San Francisco (UCSF) IRB.

Pediatric TB cohort

We analyzed plasma samples collected from children less than 15 years old who were evaluated for pulmonary TB and had previously been enrolled in prospective diagnostic cohort studies in the Gambia, Peru, South Africa, and Uganda. Children were included if they had signs and symptoms of pulmonary TB, and excluded if they had already received treatment for TB infection or disease for more than 72 h. All children completed a standard TB evaluation, including clinical exam, chest X-ray, and respiratory sample collection for Xpert MTB/RIF molecular testing and mycobacterial culture. All children had follow-up after 2–3 months and were assessed for clinical response to any treatment. They were classified according to NIH consensus definitions as Confirmed, Unconfirmed, or Unlikely TB. Confirmed TB was defined as having microbiological evidence of TB disease, either a positive Xpert MTB/RIF Ultra result or a mycobacterial culture positive for M. tuberculosis. Unconfirmed TB cases did not have microbiological evidence of TB, but had signs and symptoms of TB disease with other clinical signs or risk factors suggestive of TB, including abnormal chest X-ray and/or known TB contact. They were started on anti-TB treatment and showed improvement at the follow-up visit. Unlikely TB cases were symptomatic, but had neither microbiological evidence of TB disease nor other signs or risk factors. In addition, asymptomatic healthy children from Uganda were enrolled and underwent interferon-gamma release assay (IGRA) testing for TB infection with QuantiFERON-Gold (Qiagen, Hilden, Germany). Healthy controls were defined as asymptomatic and IGRA negative, while Latent TB infection cases were defined as asymptomatic with positive IGRA results. The gender of participants was self-reported in the baseline questionnaire, and was not considered in the study design.

Sample collection and selection

Trained staff performed venipuncture and collected blood samples from all children at baseline and within 72 h of any TB treatment. Blood samples were centrifuged, and plasma was aliquoted and stored in −80 °C freezers. For this analysis, each study site randomly selected plasma samples from Confirmed, Unconfirmed, and Unlikely TB cases in a 1:1:2 ratio, respectively. In addition, a convenience sample of plasma specimens was selected from asymptomatic children from Uganda.

Sample preparation for plasma proteomics

We analyzed a total of 511 plasma samples, with each sample representing an individual patient (n = 1). From each sample, 1 μL of undepleted plasma was transferred to a 96-well plate with 200 μL of inactivation buffer (8 M urea, 100 mM ammonium bicarbonate, 150 mM NaCl), and 0.75 μL/mL of RNase (NEB) was added. The proteins were transferred to a 96-well filter plate and processed similarly to what we previously described¹⁴. Briefly, the plates were dried by centrifugation (1800 × g at 25 °C for 30 min) and 50 μL of TUA buffer (8 M urea, 20 mM ammonium bicarbonate, 5 mM TCEP) was added. Following incubation at RT on a shaker (500 rpm, 25 °C), chloroacetamide (CAA) was added to a final concentration of 10 mM and the plates were incubated in the dark for 1 h at room temperature. TCEP/CAA were removed by centrifugation (2000 × g, 30 min, RT) and the plates were washed three times with 200 μL of ddH2O. Trypsin was added in a 1:50 ratio and the samples were digested overnight at 37 °C on a shaker (800 rpm). Peptides were collected by centrifugation (2000 × g, 30 min at RT) and the plate was washed once with 100 μL of ddH2O. Resulting peptides were dried under vacuum and resuspended at approximately 200 ng/μL prior to MS injection and DIA-PASEF analysis. Additionally, from these samples, a representative pool of HIV-positive and TB-positive cases was further high-pH fractionated on C18 tips and measured by DDA-PASEF to generate a spectral library⁴⁷. Briefly, this high-pH fractionation was performed using C18 spin columns. These columns were first activated with one column volume of acetonitrile, followed by equilibration with two column volumes of 0.1% TFA. Peptides were subsequently loaded onto the C18 columns and washed twice with 0.1% TFA. A stepwise elution of bound peptides was performed using increasing concentrations of acetonitrile (5%, 7.5%, 10%, 12.5%, 15%, 17.5%, 20%, 50%) in 0.1% triethylamine (pH 10), and lastly with 2 washes of 50% acetonitrile. The resulting fractions were dried by vacuum centrifugation and resuspended in 0.1% formic acid prior to MS analysis by DDA-PASEF.

DIA-PASEF data acquisition for abundance proteomics

Approximately 200 ng per sample was analyzed on a Bruker timsTOF Pro interfaced with an UltiMate 3000 UHPLC. Peptides were separated using a 15 cm PepSep column (Bruker, 150 mm length, 1.7 μm Reprosil Saphir C18 beads) and sprayed into the CaptiveSpray source kept at 1700 V and 200 °C. The peptides were separated from 2 to 33% of buffer B (0.1% formic acid in acetonitrile) for 26 min, then buffer B was increased to 90% for 5 min, and then the column was re-equilibrated at 5% buffer B for 2 min, reaching a total gradient time of 33 min. Buffer A of this separation was 0.1% formic acid. The samples were acquired in DIA-PASEF mode using nine 32 m/z DIA-PASEF windows (500–966 m/z) and ion mobility between 0.85 and 1.3 Vs/cm². Data for selected samples was re-acquired when significant mass shifts were observed or when consecutive injections had reduced signal.

DDA-PASEF and DIA-PASEF data analysis

To generate a spectral library for the analysis of DIA-PASEF data files, DDA-PASEF files were searched using MSFragger⁴⁸ within the FragPipe toolkit (v1.8) with the library generation workflow (“DIA-Speclib-quant”) and a human FASTA downloaded in January 2022 (20408 entries). This search was performed using tryptic cleavage specificity with 2 missed cleavages, fixed modification of carbamidomethylation on cysteine residues, variable modifications of methionine oxidation and protein N-terminal acetylation, a precursor mass tolerance optimized per sample and ranging from −20 to +20 ppm (default in FragPipe), a product ion mass tolerance of 20 ppm, and a minimum peptide length of 8. Resulting peptide identifications were filtered to a 1% FDR at the peptide and protein level. The generated library and our previously reported plasma library⁴⁷ were merged using easypqp (https://github.com/grosenberger/easypqp). All DIA-PASEF samples were searched with DIA-NN (v1.8)⁴⁹ using a library-based strategy. MS1 and MS2 tolerances were set to 10 ppm. Protein grouping was performed based on the library IDs and cross-run normalization was disabled. Following the search, the global report file was filtered to ≤ 1% protein group Q-values (‘Lib.PG.Q.Value’). Samples were excluded if their number of peptides was more than 3 standard deviations below the median number of peptides (2591), which removes samples with fewer than 1700 peptides. The peptide-level data was normalized using median-centering of the peptides identified in all samples.
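The sample exclusion and median-centering steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `filter_and_median_center`, the peptides-by-samples layout, the use of log-scale intensities, and the choice to re-center on the mean of the sample medians are all assumptions made for the example.

```python
import pandas as pd


def filter_and_median_center(log_intensity: pd.DataFrame, n_sd: float = 3.0) -> pd.DataFrame:
    """Drop low-identification samples, then median-center the rest.

    Rows are peptides, columns are samples, values are log-scale
    intensities with NaN for missing identifications (assumed layout).
    """
    # Number of peptides identified in each sample.
    counts = log_intensity.notna().sum(axis=0)

    # Exclude samples more than n_sd standard deviations below the median count.
    cutoff = counts.median() - n_sd * counts.std()
    kept = log_intensity.loc[:, counts >= cutoff]

    # Use only peptides identified in every retained sample to estimate
    # the per-sample shift.
    complete = kept.dropna(axis=0, how="any")
    sample_medians = complete.median(axis=0)

    # Subtract each sample's median and add back a common offset
    # (assumption: the mean of the sample medians) to keep the scale.
    return kept - sample_medians + sample_medians.mean()
```

After this step, every retained sample has the same median over the shared peptides, so remaining differences reflect relative rather than global intensity shifts.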

Following normalization, the missing values were imputed using a heuristic strategy based on their identification frequency, to leverage the large number of samples analyzed in this study.

The following rules were applied:

Peptides identified in >50% of the samples (at least 250 independent identifications) were imputed with the mean identification value.
Peptides identified in <50% but >10% of the samples were imputed with a random value drawn from a Gaussian distribution parameterized by the mu and sigma of the data and downshifted by 1.8 × sigma.
Peptides identified in <10% of the samples were removed.
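The three rules above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function name `impute_by_frequency` is hypothetical, mu and sigma are taken over all observed values in the matrix, the Gaussian keeps the full sigma as its width, and the boundary cases (exactly 50% or exactly 10%), which the text does not specify, are assigned to the Gaussian rule.

```python
import numpy as np
import pandas as pd


def impute_by_frequency(data: pd.DataFrame, high: float = 0.5, low: float = 0.1,
                        downshift: float = 1.8, seed: int = 0) -> pd.DataFrame:
    """Frequency-based imputation; rows are peptides, columns are samples."""
    rng = np.random.default_rng(seed)

    # Fraction of samples in which each peptide was identified.
    freq = data.notna().mean(axis=1)

    # Rule 3: remove peptides identified in fewer than `low` of the samples.
    out = data.loc[freq >= low].copy()

    # Assumption: mu and sigma are computed over all observed values.
    values = data.to_numpy(dtype=float)
    observed = values[~np.isnan(values)]
    mu, sigma = observed.mean(), observed.std()

    for peptide in out.index:
        missing = out.loc[peptide].isna()
        if not missing.any():
            continue
        if freq[peptide] > high:
            # Rule 1: frequently identified peptides get their own mean.
            out.loc[peptide, missing] = out.loc[peptide].mean()
        else:
            # Rule 2: draw from a Gaussian shifted down by downshift * sigma.
            draws = rng.normal(mu - downshift * sigma, sigma, size=int(missing.sum()))
            out.loc[peptide, missing] = draws
    return out
```

The downshift places imputed values for rarely seen peptides toward the low-intensity end of the distribution, which reflects the common assumption that such peptides are missing because they are near the detection limit.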

Following imputation, the peptide-level data was batch corrected using ComBat¹⁸ to remove variation between the clinical sites, batches of sample preparation, or MS acquisition batches. We used the clinical sites as batches, with the MS acquisition and sample preparation batches (i.e., the different plates) added as covariates. Peptides were rolled up into proteins using only proteotypic peptides and a topN strategy (max 3 proteotypic peptides per protein), with the mean intensity representing the protein intensity. For gene set enrichment analysis, we used the MDtest function (nperm = 1000) from the GSAR R package with the protein intensity values from Confirmed and Unlikely TB samples as input⁵⁰. Protein sets corresponding to known biological pathways were used as the input gene sets. For each signaling pathway, this function performed a two-sided mean difference test of the null hypothesis that there is no difference in the mean of a set of features (i.e., proteins) between two conditions (Confirmed TB vs. Unlikely TB). Resulting p-values were then adjusted by the Benjamini–Hochberg (BH) approach.
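The peptide-to-protein rollup described above might look like the sketch below. Several details are assumptions made for illustration: the function name `rollup_top_n` and the mapping argument are hypothetical, a peptide is treated as proteotypic when it maps to exactly one protein, and the top peptides are chosen by their mean intensity across samples (the text does not state the ranking criterion).

```python
import pandas as pd


def rollup_top_n(peptides: pd.DataFrame, peptide_to_proteins: dict, n: int = 3) -> pd.DataFrame:
    """Average the top-n proteotypic peptides per protein.

    `peptides` has peptides as rows and samples as columns;
    `peptide_to_proteins` maps each peptide to a list of protein IDs.
    """
    # Keep only peptides that map to a single protein (assumed definition
    # of proteotypic for this sketch).
    owner = {pep: prots[0] for pep, prots in peptide_to_proteins.items() if len(prots) == 1}
    unique = peptides.loc[[pep for pep in peptides.index if pep in owner]]
    protein = pd.Series({pep: owner[pep] for pep in unique.index})

    # Rank peptides within each protein by mean intensity, highest first.
    rank = unique.mean(axis=1).groupby(protein).rank(ascending=False, method="first")

    # Keep at most n peptides per protein and average them per sample.
    top = unique.loc[rank <= n]
    return top.groupby(protein.loc[top.index]).mean()
```

Capping the number of peptides per protein keeps abundant proteins with many detected peptides from being summarized differently from proteins with only a few.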

Machine learning based identification of a TB biosignature

Protein-level intensities after normalization across all clinical sites and HIV status for Confirmed TB (n = 120) and Unlikely TB (n = 211) were selected and z-scored. For increased stringency in biosignature development, we restricted the analysis to proteins with 50% or less missing values among the combined patient samples from the Confirmed and Unlikely TB groups. From the remaining proteins, we then selected combinations exceeding the required WHO target product profile (TPP) for a diagnostic test. Confirmed TB and Unlikely TB cases were included, given their clear reference standards for TB and not TB. First, a random 75% of the data was selected for training a LASSO model using the scikit-learn LassoCV function (20 folds stratified by TB class, max_iter = 10000, tol = 0.0001). The feature importance was calculated and the proteins with nonzero coefficients were used for combinatorial analysis (n = 50 proteins). In this analysis, we generated all possible combinations of features ranging in size from 1 (n = 50 combinations) to 6 (n = 15,890,700 combinations) and trained a logistic regression model on the z-scored abundance for each specific combination. The remaining 25% of the data was then used as a test set to evaluate all models and was not used for training at any step in this initial analysis. Models for every N were ranked by the sensitivity achieved at 90% specificity (on our 25% test split) and the top-scoring models for every N were kept for subsequent analysis. Confidence intervals were calculated using the Clopper-Pearson (exact binomial) method. We then applied the models achieving the required WHO TPP (3-, 4-, 5-, and 6-protein models) to the Unconfirmed TB cases to determine what proportion could be diagnosed using these models.
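To make the evaluation metrics above concrete, the sketch below counts the feature combinations, computes sensitivity at a fixed specificity, and derives a Clopper-Pearson interval. It is a minimal illustration and not the authors' code: the function names are hypothetical, and the thresholding convention (a sample is called positive when its score is at or above the threshold) is an assumption.

```python
import math

import numpy as np
from scipy.stats import beta


def count_combinations(n_features: int, max_size: int) -> list:
    """Number of feature subsets of each size from 1 to max_size."""
    return [math.comb(n_features, k) for k in range(1, max_size + 1)]


def sensitivity_at_specificity(y_true, scores, target_specificity: float = 0.90):
    """Return (sensitivity, threshold) at the lowest threshold meeting the target.

    Assumed convention: a sample is called positive when score >= threshold.
    """
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    # Candidate thresholds in increasing order; sensitivity can only fall
    # as the threshold rises, so the first hit gives the highest sensitivity.
    for t in np.append(np.unique(scores), np.inf):
        if np.mean(neg < t) >= target_specificity:
            return float(np.mean(pos >= t)), float(t)


def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact binomial confidence interval for k successes out of n."""
    lower = 0.0 if k == 0 else float(beta.ppf(alpha / 2, k, n - k + 1))
    upper = 1.0 if k == n else float(beta.ppf(1 - alpha / 2, k + 1, n - k))
    return lower, upper
```

For example, `count_combinations(50, 6)` reproduces the 50 single-protein and 15,890,700 six-protein combinations stated in the text, and `clopper_pearson(k, n)` can be applied to the number of correctly classified Confirmed TB cases in the test split to bound the reported sensitivity.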

Computational packages utilized

Raw proteomics data was analyzed with either MSFragger⁴⁸ (DDA data) or DIA-NN (DIA data)⁴⁹, and the generated DDA library and our previously reported plasma library⁴⁷ were merged using easypqp (https://github.com/grosenberger/easypqp). For data processing, model training, and figure generation, we used the following packages in Python (v3.8.2): scikit-learn (v1.5.1), pandas (v2.2.2), numpy (v1.26.4), pyComBat (v), https://github.com/epigenelabs/pyComBat, joblib (v1.4.2), seaborn (v0.13.2), matplotlib (v3.9.2), matplotlib-base (v3.9.2), scipy (v1.13.1), statsmodels (v0.14.2). The following packages in R (v4.3.1, release ‘Beagle Scouts’) were used for figure generation: ggplot2 (v3.5.1), RColorBrewer (v1.1.3), viridis (v0.6.5), ggpubr (v0.6.0), ggsci (v3.2.0). Additionally, the GSAR R package (v1.40.0) was used for analysis of the log2FC between Confirmed and Unlikely TB. All code for data analysis, imputation, and figure plots is available here: https://github.com/anfoss/COMBO_code.git.

Statistics and reproducibility

We randomly selected plasma samples in a 1:1:2 ratio of Confirmed:Unconfirmed:Unlikely TB, and sample size was determined by availability of specimens and to ensure adequate precision in the test set. With a sample size of 500 and 25% held for the test set, we would be powered to measure a sensitivity of 90% ± 12% and specificity of 70% ± 10% when comparing Confirmed to Unlikely TB. Samples were batched by country and randomized within each sample preparation plate and data acquisition run for each country, and staff were blinded to TB status during data acquisition. All samples were analyzed once, with the exception of selected samples where there was evidence of instrument performance deviation, including the observation of significant mass shifts or consecutive injections with reduced signal. Data for these samples was re-collected, and this re-collected data is presented in this study. Samples not passing the QC criteria defined in the section “DIA-PASEF enabled high-throughput plasma proteomics” were removed (n = 7). In the machine learning analysis, proteins with greater than 50% missingness were excluded.