Machine learning approaches to dissect hybrid and vaccine-induced immunity


July 9, 2025

Participants

A group of 116 healthy participants, vaccinated with the 2-dose primary SARS-CoV-2 vaccination cycle followed by a booster ~5–7 months later, was included in the study, as described in Table 1. Among them, 36 (31.03%) were male and 80 (68.97%) were female, with a median age of 49 (range: 24–81). Eighty-two participants (70.69%) received mRNA vaccines (mRNA-1273 or BNT162b2) during their 2-dose primary vaccination cycle, and 34 (29.31%) received an adenovirus-based vaccine (ChAdOx1 nCoV-19, AZD1222). For the booster dose, all participants received the original Wuhan monovalent mRNA vaccines. Sixty-eight participants (58.62%) never self-reported a SARS-CoV-2 infection, while 48 participants (41.38%) self-reported a previous SARS-CoV-2 infection. Among the latter, 9 (18.75%) self-reported an infection before the booster dose, and 39 (81.25%) self-reported an infection after the booster dose.

Long-term immune response upon SARS-CoV-2 vaccine booster dose

To determine the long-term immunity generated by the vaccine and/or infection, a blood sample was collected 6 months after the booster dose and analysed for spike- and RBD-specific immune responses targeting the wt strain and the Delta, Omicron BA.1 and Omicron BA.2 variants. Compared to pre-boost, a significant increase in wt spike-specific IgG levels was detected post-boost (median values of 582.6 [210.5–1218] and 6847 [2791–14988] ng/ml, respectively, ***P < 0.001; Fig. 1a). After boosting, IgG levels specific for the wt spike were similar to those specific for the spike of the BA.2 variant (median of 7547 [2773–12886] ng/ml), and significantly higher than those specific for the Delta and Omicron BA.1 variants (median of 4457 [1850–8813] and 2125 [935.3–4265] ng/ml, *P = 0.028 and ***P < 0.001, respectively; Fig. 1b). The IgG response specific for the wt RBD was also significantly higher than that specific for the Omicron BA.1 and BA.2 RBD (median of 10314 [4462–20192], 3548 [1241–7344] and 3170 [1206–7361] ng/ml, respectively; ***P < 0.001; Fig. 1c). The functionality of the spike-specific antibodies was assessed via their ability to block the RBD/ACE-2 interaction, using an sVNT. After the booster dose, a significantly higher number of participants had antibodies with binding inhibition capacity above the threshold value compared to the pre-boost analysis, for all viral variants (***P < 0.001, Fig. 1d). Nevertheless, when comparing binding inhibition capacity after the booster dose, a significant difference was observed between Omicron BA.1 and wt strain values (^###P < 0.001; Fig. 1d). The frequency of circulating wt RBD-specific B cells, identified among non-naïve CD19⁺ B cells (gating strategy in Supplementary Fig. 1), was similar before and 6 months after booster administration (0.21 [0.11–0.34] and 0.17 [0.08–0.3] % of CD19⁺ cells, respectively; Fig. 1e). However, upon in vitro stimulation, the frequency of wt spike-specific IgG-secreting MBC was significantly higher at 6 months post-boost than pre-boost (2.52 [1.70–3.79] and 0.28 [0.09–1.02] % of total IgG-secreting cells, respectively, ***P < 0.001; Fig. 1f).

**Fig. 1: Spike- and RBD-specific immune responses.**

In conclusion, the immunological analysis performed 6 months after the booster dose highlighted the critical role of the third vaccine dose in enhancing both the humoral and antigen-specific B cell responses, not only against the spike/RBD antigens of the wild-type strain but also against those of the Delta and Omicron variants. However, the wide IQRs across all variables indicated considerable dispersion of the data, suggesting a heterogeneous response.

Dimensionality reduction and Gaussian mixture clustering identify high and low responders

To explore post-boost data in an unsupervised manner, the 12 serological variables previously analysed for each participant (reported in Supplementary Table 1) were computationally processed. To capture complex, non-linear relationships within this 12-dimensional feature space and obtain a meaningful two-dimensional representation, two distinct dimensionality reduction techniques, UMAP and tSNE, were employed. Following dimensionality reduction, the unsupervised Gaussian Mixture Model (GMM) clustering algorithm identified two distinct clusters of immune response in both the UMAP- and tSNE-derived embeddings, this being the configuration that yielded the lowest BIC value. To quantitatively compare the clustering performance of the two approaches, the Within-Cluster Sum of Squares (WCSS) and the Average Silhouette Width were computed. The WCSS values were 693.39 for the UMAP-GMM strategy and 494.39 for the tSNE-GMM strategy, indicating greater intra-cluster compactness in the latter. Similarly, the Average Silhouette Width was higher for the tSNE-based approach (0.63) than for the UMAP-based one (0.56), reflecting better-defined clusters. Given these results, the tSNE-GMM strategy was selected for downstream analyses, and its visual representation is shown in Fig. 2a.
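The workflow described above (two-dimensional embedding, GMM fitting with BIC-based selection of the number of components, then WCSS and silhouette evaluation) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the log-transform, scaling, perplexity, candidate range of 1 to 5 components and random seeds are all assumptions, and only the tSNE branch is shown because UMAP requires an additional package.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for 116 participants x 12 serological variables
# (two shifted log-normal populations; NOT the study data).
low = rng.lognormal(mean=7.0, sigma=0.5, size=(60, 12))
high = rng.lognormal(mean=9.0, sigma=0.5, size=(56, 12))
X = StandardScaler().fit_transform(np.log10(np.vstack([low, high])))

# Non-linear projection of the 12-dimensional space onto two dimensions.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
embedding = embedding.astype(np.float64)

# Fit one GMM per candidate number of components; keep the lowest-BIC model.
candidates = range(1, 6)  # assumed search range
models = [GaussianMixture(n_components=k, random_state=0).fit(embedding)
          for k in candidates]
bic_values = [m.bic(embedding) for m in models]
best_model = models[int(np.argmin(bic_values))]
labels = best_model.predict(embedding)


def wcss(points, cluster_labels):
    """Sum of squared distances of each point to its cluster centroid."""
    total = 0.0
    for c in np.unique(cluster_labels):
        members = points[cluster_labels == c]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return float(total)


compactness = wcss(embedding, labels)
n_found = len(np.unique(labels))
# The silhouette is only defined when at least two clusters are present.
silhouette = silhouette_score(embedding, labels) if n_found > 1 else None

print("components chosen by BIC:", best_model.n_components)
print("WCSS:", compactness, "average silhouette width:", silhouette)
```

Lower WCSS and higher average silhouette width both point to tighter, better-separated clusters, which is the basis on which the two embeddings are compared in the text. WCSS depends on the scale of the embedding, so it is only meaningful when the compared embeddings have similar coordinate ranges.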

**Fig. 2: Serological data dimensionality reduction and clustering.**

tSNE-GMM cluster 2 consistently exhibited a significantly higher IgG response against wt, Delta, Omicron BA.1 and Omicron BA.2 spike and RBD antigens (Fig. 2b, c, ***P < 0.001), and a significantly higher proportion of participants with positive values for RBD/ACE-2 binding inhibition against the Omicron BA.1 and BA.2 variants compared to tSNE-GMM cluster 1 (Fig. 2d, ***P < 0.001 for the BA.1 variant, *P = 0.012 for the BA.2 variant). Consequently, tSNE-GMM cluster 2 is hereafter referred to as the High Responders (HR) group and tSNE-GMM cluster 1 as the Low Responders (LR) group.

The potential influence of clinical and demographic variables, including gender, age, vaccine formulation, past infection and time since infection, on classification into HR and LR was evaluated. Age, gender and vaccine formulation did not significantly influence cluster assignment (P > 0.05, Table 1), whereas a significantly higher proportion of participants with a self-reported infection were classified as HR (71% of self-reported infected participants, ***P < 0.001).
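An association between a binary clinical variable and cluster membership, such as the one reported above, is typically tested on a 2x2 contingency table. The sketch below uses Fisher's exact test from SciPy as an assumed choice, since this section does not name the test that was applied, and the counts are hypothetical placeholders rather than the study's data.

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 contingency table (illustrative counts, NOT the study data):
# rows    = self-reported infected / self-reported non-infected
# columns = High Responders / Low Responders
table = [[30, 10],
         [15, 45]]

# Two-sided test of independence between infection status and cluster.
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")

print("sample odds ratio:", odds_ratio)
print("P value:", p_value)
```

An odds ratio above 1 with a small P value would indicate that infected participants are over-represented among High Responders in the tabulated sample.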

Among self-reported Infected participants (sI) clustered within HR and LR, a statistically significant difference was observed in the number of days elapsed between the last infection and the 6-month post-boost blood sample collection (Fig. 3a and Table 1). Specifically, infected participants in the HR group had contracted the infection more recently than those in the LR group (median = 94 days, IQR 31–151.5 days for sI-HR, versus median = 180.5 days, IQR 73–534.5 days for sI-LR; *P = 0.037).

**Fig. 3: Classification of self-reported infectious status into HR and LR clusters.**

Given the high proportion of asymptomatic and mildly symptomatic infections associated with the emergence of Omicron variants¹⁰, the possibility that some self-reported Non-Infected participants within the High Responders group (sNI-HR) might have experienced unrecognized infections was investigated. This possibility was corroborated by the observation that sNI-HR showed significantly higher frequencies of N-specific MBC than self-reported Non-Infected participants within the Low Responders group (sNI-LR; median of 0.03% versus 0.00%, respectively; *P = 0.013; Fig. 3b). Moreover, sNI-HR tended to exhibit a higher N-specific antibody response than sNI-LR, although the difference was not statistically significant (median AUC of 0.73 and 0.61, respectively; P > 0.05; Fig. 3c). This suggested that some participants may have been unaware of a past infection.

Identification of unaware infected individuals via machine learning classifiers

To identify Unaware Infected participants (UI), a predictive model was developed leveraging three distinct Machine Learning classifiers, namely k-NN, SVM-RBF and RF. These models were trained to distinguish immunological profiles of infected and non-infected individuals based on 13 serological variables (reported in Supplementary Table 1). The analysis comprised the Model Construction phase, performed on k-NN, SVM-RBF and RF classifiers, and the Model Application phase, implemented using a majority voting-based consensus approach of the three classifiers (Fig. 4).

**Fig. 4: Overall strategy for the identification of Unaware Infected participants using Machine Learning Classifiers.**

Model construction

From the initial cohort of 116 individuals, 25 were excluded due to incomplete serological and B cell data, as these variables were essential for the subsequent Model Construction and Application phases, respectively (Fig. 4). This left a subset of 91 individuals. Based on predefined criteria (positive swab results, N-specific memory B cells, and anti-N IgG values, as detailed in the Classification Models section of the Methods and Materials), a subset of 34 participants was selected, comprising 18 mcI (model construction Infected) and 16 mcNI (model construction Non-Infected) participants. The clinical characteristics of these subgroups are reported in Table 2. These 34 participants were used to train and evaluate the three classifiers, k-NN, SVM-RBF and RF, via a 5-fold cross-validation strategy, with 70% of the data allocated for model training and the remaining 30% for testing. The classification performance of each model during this phase is reported in Table 3, which shows that all three models achieved high performance. Variable importance analysis across all three models identified Omicron BA.2 N-specific IgG AUC values, Omicron BA.2 spike-specific IgG concentrations and ACE2-BA.1 RBD binding inhibition percentages as the most important features (Table 4). However, feature importance attribution differed somewhat across classifiers. While k-NN did not highlight any additional informative feature beyond those shared across models, both the SVM-RBF and the RF models assigned non-zero importance scores to a broader subset of serological variables (Table 4). In particular, the SVM-RBF assigned a zero importance score to wt spike-specific concentrations, as well as wt and Delta RBD-specific IgG concentrations, whereas the RF model excluded only the ACE2-wt RBD binding inhibition percentages. Despite these differences, all three classifiers showed high predictive performance and were therefore retained for downstream analysis and included in the consensus-based approach during the Model Application phase.
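The Model Construction step can be illustrated with a short scikit-learn sketch that evaluates k-NN, SVM-RBF and RF classifiers by 5-fold cross-validation. The data are synthetic (18 "infected" and 16 "non-infected" rows with 13 features), and the hyperparameters, feature scaling and use of stratified folds are assumptions, not settings reported in this section. The 70%/30% train/test allocation mentioned in the text is not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(42)

# Synthetic stand-in for the 34 model-construction participants
# (13 features each; NOT the study data).
X = np.vstack([rng.normal(1.0, 1.0, size=(18, 13)),    # "infected"
               rng.normal(-1.0, 1.0, size=(16, 13))])  # "non-infected"
y = np.array([1] * 18 + [0] * 16)

# Distance- and kernel-based models are scaled inside a pipeline so that
# scaling parameters are learned on the training folds only.
classifiers = {
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "SVM-RBF": make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale")),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_accuracy = {
    name: float(cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean())
    for name, model in classifiers.items()
}

for name, score in cv_accuracy.items():
    print(f"{name}: mean 5-fold accuracy = {score:.2f}")
```

With only 34 samples, each fold holds about 7 participants, so fold-level scores are coarse. Reporting the mean across folds, as done here, is one common way to stabilise the estimate.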

**Table 3: Performance metrics of the classifier models during cross-validation.**

**Table 4: Variable importance analysis for the classifier models.**

Model application

The pre-trained k-NN, RF and SVM-RBF models were independently applied to the remaining 57 participants, who did not meet the inclusion criteria of the Model Construction phase and whose non-infection status was uncertain (Fig. 4). Among these 57 participants, whose clinical and demographic characteristics are reported in Table 2, 18 self-reported an infection (sI) and 39 self-reported no infection (sNI). Majority-voting consensus among the outputs of the three models correctly identified 16 out of 18 self-reported Infected individuals, yielding a Recall of 0.89 in this Model Application phase. Recall was the only performance metric that could be reliably assessed, given the uncertainty regarding the non-infection status of the remaining participants. These 16 individuals, who self-reported a previous infection and were correctly identified by the consensus strategy, are hereafter referred to as Infected (I). Among the 39 sNI participants, 14 were classified as infected and are therefore referred to as Unaware Infected (UI). The remaining 25 participants were confirmed and classified as Non-Infected (NI) (Fig. 4).
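The consensus rule and the Recall figure above can be reproduced with a few lines of NumPy. The helper names `majority_vote` and `recall` are illustrative and the per-model votes are toy values. Only the 16-of-18 count comes from the text.

```python
import numpy as np


def majority_vote(*predictions):
    """Return 1 where more than half of the binary classifiers predict 1."""
    votes = np.vstack(predictions)
    return (votes.sum(axis=0) * 2 > votes.shape[0]).astype(int)


def recall(y_true, y_pred):
    """True positives divided by all actual positives."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    true_pos = np.sum((y_true == 1) & (y_pred == 1))
    false_neg = np.sum((y_true == 1) & (y_pred == 0))
    return true_pos / (true_pos + false_neg)


# Toy votes from three classifiers for five participants (hypothetical).
knn_votes = np.array([1, 1, 0, 0, 1])
svm_votes = np.array([1, 0, 0, 1, 1])
rf_votes = np.array([0, 1, 0, 0, 1])
consensus = majority_vote(knn_votes, svm_votes, rf_votes)
print(consensus)  # [1 1 0 0 1]

# Recall as reported in the text: 16 of 18 self-reported infections detected.
reported_true = [1] * 18
reported_pred = [1] * 16 + [0] * 2
print(round(recall(reported_true, reported_pred), 2))  # 0.89
```

With three voters a tie is impossible, so the rule reduces to "at least two of three models agree".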

To further confirm the UI classification assigned by the consensus strategy, the frequency of N-specific MBC, assessed by ELISPOT, was compared between the UI and NI groups. Participants classified as UI by the consensus approach showed a significantly higher frequency of N-specific MBC than NI participants (median 0.09% and 0%, respectively; **P = 0.003; Supplementary Fig. 2), confirming an unaware infected profile. In summary, this strategy allowed the reliable identification of 14 participants with an unreported infection history based on their immunological profiles, demonstrating its potential to uncover hidden infection status.

Characterization of the immunological profile of participants stratified in Infected, Unaware Infected and Non-Infected participants

The immunological response was analysed based on the stratification of participants into the I, UI and NI groups, as determined by the consensus strategy. The 2 participants who self-reported an infection but were erroneously classified by the model as NI, along with the 34 used for the Model Construction phase (mcI and mcNI), were excluded from this analysis. Participants classified as UI exhibited levels of IgG specific for wt and BA.2 RBD (median of 27,311 and 9487 ng/ml, respectively) comparable to those of participants classified as I (median of 21,914 and 8876 ng/ml, respectively). Moreover, their IgG levels were significantly higher than those of NI participants (median of 5710 and 2188 ng/ml for wt and BA.2 RBD-specific IgG; ***P < 0.001; Fig. 5a, b), while no significant differences in the proportion of participants above the binding inhibition threshold value were observed between groups (all P > 0.05; Fig. 5c, d). Similar results were observed when the analysis was performed for the serological response specific for the Delta and BA.1 variants (Supplementary Fig. 3). A significantly higher frequency of wt RBD⁺ B cells was observed in I and UI (median of 0.27% and 0.28%) compared to the NI group (median of 0.16%; *P = 0.018 and **P = 0.005, respectively; Fig. 5e). Participants classified as I and UI also presented significantly higher frequencies of circulating IgG-secreting RBD-specific MBC capable of reactivating upon in vitro stimulation compared to NI participants (median frequency of 4.55% in I, 3.71% in UI and 1.54% in NI; ***P < 0.001; Fig. 5f).

**Fig. 5: RBD-specific immune responses in groups with different immunological profile, as classified by the consensus-based model.**

To compare the phenotypes of the RBD⁺ B cells developed among the I, UI and NI groups, the SOM clustering algorithm was applied to the multidimensional flow cytometry data (Fig. 6). According to the combined expression of 7 markers (IgD, CD27, CD21, CD38, IgM, IgA, IgG), 12 MBC clusters were identified among the total CD19⁺ non-naïve B cells and grouped into Ig-switched MBC (IgD^– CD27⁺), plasmablast/plasma cells (PB/PC; IgD^– CD38⁺), double negative (DN; IgD^– CD27^–) and unswitched MBC (IgD⁺ CD27⁺) (Fig. 6a). Most of the RBD⁺ B cells fell into IgG⁺ resting MBC (cluster 3), DN CD21⁺ MBC (cluster 4), DN CD21^– MBC (cluster 12) and IgG⁺ activated MBC (cluster 13) (Fig. 6b). When comparing the phenotypes of RBD⁺ B cells among I, UI and NI, significantly higher levels of RBD⁺ IgG⁺ resting B cells (cluster 3) were detected in participants belonging to the I and UI groups compared to NI (median of 24.53%, 28.1% and 14.69%, respectively; *P = 0.041 and *P = 0.016, respectively; Fig. 6c). Conversely, NI showed significantly higher levels of RBD⁺ DN1 CD21⁺ B cells (cluster 4) compared to I (median of 37.17% and 20%, respectively; *P = 0.02; Fig. 6d).
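For readers unfamiliar with self-organizing maps, the clustering idea can be sketched in plain NumPy: each grid node holds a codebook vector in marker space, the best-matching node and its grid neighbours are pulled toward each sampled cell, and every cell is finally assigned to its nearest node. This is a minimal illustration under stated assumptions, not the implementation used in the study. The grid size, learning-rate and neighbourhood schedules, and the synthetic 7-marker data are all hypothetical.

```python
import numpy as np


def train_som(data, grid=(4, 3), n_iter=2000, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular SOM and return its codebook vectors."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    # Grid coordinates of every node, shape (rows * cols, 2).
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)],
                      dtype=float)
    # Initialise codebook vectors from randomly chosen data points.
    weights = data[rng.integers(0, len(data), size=rows * cols)].astype(float)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                 # decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 1e-3    # shrinking neighbourhood
        x = data[rng.integers(len(data))]
        # Best-matching unit: node whose codebook vector is closest to x.
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
        # Gaussian neighbourhood measured on the grid, not in marker space.
        grid_dist2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
        influence = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
        weights += lr * influence[:, None] * (x - weights)
    return weights


def assign_nodes(data, weights):
    """Index of the nearest codebook vector for every row of data."""
    dist2 = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return dist2.argmin(axis=1)


# Synthetic stand-in for 500 cells x 7 markers (NOT flow cytometry data).
rng = np.random.default_rng(1)
cells = np.vstack([rng.normal(loc=m, scale=0.3, size=(250, 7))
                   for m in (0.0, 2.0)])

codebook = train_som(cells)
node_of_cell = assign_nodes(cells, codebook)
print("cells per node:", np.bincount(node_of_cell, minlength=len(codebook)))
```

In cytometry workflows the SOM nodes are usually merged afterwards into a smaller number of annotated populations. That merging step is omitted here.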