Early detection of emerging SARS-CoV-2 Variants from wastewater through genome sequencing and machine learning

Categories: Disease & Virus

July 9, 2025

Wastewater sample collection, processing, and sequencing

A total of 3659 wastewater samples were collected from urban and rural locations in Southern Nevada from August 2021 to November 2023. After collection, samples were placed on ice in the field and stored under refrigeration until processing (hold time <36 h). Nucleic acids from wastewater samples were isolated using the Promega Wizard Enviro Total Nucleic Acid Kit (Cat #A2991) following the manufacturer’s protocol. In addition, we modified the Promega protocol by lysing wastewater with the protease solution and binding free nucleic acids using NucleoMag Beads from Macherey-Nagel (Cat #744970). Total RNA (>10 ng) was processed for first-strand cDNA synthesis using the LunaScript RT SuperMix Kit (New England BioLabs). Amplicon-based sequencing libraries were constructed using the CleanPlex SARS-CoV-2 FLEX Panel from Paragon Genomics. Libraries were sequenced on an Illumina NextSeq 500 or NextSeq 1000 platform with 300 cycle flow cells.

Wastewater sequence data processing

Processing of sequencing data followed a modification of our previously published pipeline¹⁹. Upon sequencing, Illumina adapter sequences were trimmed from read pairs using cutadapt version 4.2⁵². Sequencing reads were then mapped to the SARS-CoV-2 reference genome (NC_045512.2) using bwa mem, version 0.7.17-r1188⁵³. Paragon Genomics CleanPlex SARS-CoV-2 FLEX tiled-amplicon primers were trimmed from the aligned reads using fgbio TrimPrimers version 2.1.0 in hard-clip mode. Variants were called by iVar variants v1.4.1⁴⁰ using mutation sites with alternative allele frequencies with respect to the reference SARS-CoV-2 2020 initial genome⁵⁴, Genome coverage and read depth were calculated using samtools v1.16.1⁵⁵. After removing duplicated samples and positive/negative controls, 2,684 samples remained for quality control (QC) analysis (Supplementary Fig. 1). Strict QC was enforced as only wastewater samples with 50x depth covering more than 80% of SARS-CoV-2 genome were retained in the following analyses. Collectively, a total of 1,385 samples, from August 2021 to November 2023, covering 59,422 mutation sites of SARS-Cov-2 variants were used for the following analyses (Supplementary Figs. 1 and 2C-D).

Public health sample analyses

We analyzed 8,810 high-coverage clinical SARS-CoV-2 sequences from Nevada, spanning September 2021 to November 2023, downloaded from GISAID (Supplementary Fig. 10).

Retrospective independent component analysis of Variants (ICA-Var)

Mathematically, let \({{\bf{Y}}}\in {{\mathbb{R}}}^{{\mathrm{1,385}}\times {\mathrm{59,422}}}\) denote the 59,422 mutation frequencies (i.e., the proportion of reads at a site that contains the mutation) from 1385 wastewater samples. Since wastewater samples are aggregations of genomes from multiple infected individuals with various virus lineages, \(Y\) could be considered as a multivariate mixed signal of SARS-CoV-2 variants spanning the local community. The data-driven ICA approach separates this multivariate signal into additive subcomponents⁴⁵: \({{\bf{Y}}}={{\bf{AS}}}\), where \({{\bf{A}}}\in {{\mathbb{R}}}^{{\mathrm{1,385}}\times {n}_{{{\rm{ica}}}}}\) denotes the mixing matrix and \({{\bf{S}}}\in {{\mathbb{R}}}^{{n}_{{{\rm{ica}}}}\times {\mathrm{59,422}}}\) represents the source matrix (Fig. 1A, shaded gray box and Supplementary Fig. 9). In ICA-Var, the number of ICA components (\({n}_{{ica}}\)) was determined from the minimum description length (MDL) criterion, and fastICA algorithm was utilized to perform ICA. Similar to the Bayesian information criterion (BIC), MDL provides stronger penalties for model complexity, leading to a preference for simpler models, as compared to Akaike information criterion (AIC)⁵⁶. In our analysis, ICA was repeated 50 times with different initial values and components from each run were clustered and visualized using ICASSO⁵⁷, a software for investigating the reliability of ICA estimates by clustering and visualization. Only reliable estimates corresponding to tight clusters were retained as final sources (\({{\bf{S}}}\), Supplementary Fig. 1A).

The original ICA method assumes that all sources (i.e., subcomponents S) are non-Gaussian and that the sources are statistically independent from one another⁴⁵. In analyzing our wastewater samples using ICA-Var, the independence comes mostly from different mutation patterns and various circulating windows of time for each VoCs^42,43,44. We quantitatively validated this independence assumption by calculating pairwise Spearman’s correlation coefficients among the obtained sources (S), which revealed consistently low values (mean absolute correlation: 0.08 ± 0.08). The non-Gaussian nature of these components, another critical ICA assumption, derives primarily from the inherent sparsity in our input data matrix \({{\bf{Y}}}\) and was statistically confirmed through Kolmogorov-Smirnov tests, which rejected the null hypothesis of normal distribution for each source in S at α = 0.05. These validations ensure that the mathematical foundations of ICA are appropriately satisfied in our application. In this case, the source matrix could represent a co-varying mutation pattern in different time windows, and could therefore serve as data-driven reference barcodes for mutation frequencies co-existing in wastewater samples. From this perspective, the ICA-Var pipeline can be considered as running a multivariate regression and determining the design matrix (S, i.e., reference barcodes) from the data under the constraint of independence of the sources. In our study, we conducted a retrospective analysis using ICA-Var on 1,385 wastewater samples (Supplementary Fig. 1) spanning from August 2021 to November 2023. We predicted that the ICA sources could capture the evolving dominant SARS-CoV-2 VoCs over time, each characterized by unique determinant mutation patterns.

Next, we identified a set of contributing mutations (i.e., significant mutations) for each source in \({{\boldsymbol{S}}}\). We identified significant mutations for each source by selecting mutation sites with values exceeding mean ± 2 standard deviations (representing 4.55% of all mutation sites) in each row of S⁴⁹. The contribution of each mutation site reflects its overall prevalence across all samples analyzed by ICA-Var. A binary matrix \(\hat{{{\bf{S}}}}\) was then computed to retain only these contributing mutations (Supplementary Fig. 2B). The pipeline developed in this manuscript and the data used to generate the results are available at https://github.com/zhuangx15/ICAvar.

Dual-regression to back-project source matrix onto weekly wastewater samples

To further determine VoCs for each week, we performed a dual-regression analysis⁴⁶ to project the ICA source matrix (S) back onto weekly wastewater samples (Fig. 1A, shaded gray boxes and Supplementary Fig. 9). The term “dual-regression” stems from the utilization of two regression procedures employed to estimate source and de-mixing dynamics for each week against the original data. More specifically, let \({{{\bf{y}}}}_{{{\rm{i}}}}\in {{\mathbb{R}}}^{{Nsampl}{e}_{i}\times {\mathrm{59,422}}}\) denote mutation frequencies for \({Nsampl}{e}_{i}\) samples in the \({i}^{{th}}\) week, we then, 1) used the all-sample source matrix (S) as a set of source regressors in a general linear model (GLM), to find week-specific de-mixing dynamics (\({{{\bf{a}}}}_{{{\rm{i}}}}=\,{{{\bf{y}}}}_{{{\rm{i}}}}{{{\bf{S}}}}^{-1},\,{{{\bf{a}}}}_{{{\rm{i}}}}\in {{\mathbb{R}}}^{{Nsampl}{e}_{i}\times {n}_{{ica}}}\)) associated with all-sample source matrix (S); and 2) used week-specific de-mixing dynamics (\({{{\bf{a}}}}_{{{\bf{i}}}}\)) as a set of regressors in a second GLM, to find the week-specific source matrix (\({{{\bf{s}}}}_{{{\rm{i}}}}=\,{{{{\bf{a}}}}_{{{\rm{i}}}}^{-1}{{\bf{y}}}}_{{{\rm{i}}}},\,{{{\bf{s}}}}_{{{\rm{i}}}}\in {{\mathbb{R}}}^{{n}_{{ica}}\times {\mathrm{59,422}}}\)) that were still associated with the all-sample source matrix (S). This process yields pairs of estimates forming a dual space, jointly providing the best approximation for the original all-sample ICA source matrix in each weekly sample. In summary, we obtained dual-regressed week-specific source matrix \({{{\boldsymbol{s}}}}_{i}\) for 113 weeks from August 2021 to November 2023.

ICA source matrix annotation to SARs-CoV-2 VoCs

To delineate the VoCs each week, we annotated our dual-regressed ICA source matrix \({{{\boldsymbol{s}}}}_{i}\) by comparing them against the known mutations in VoCs from clinical SARS-CoV-2 sequencing data (Fig. 1A, bottom row). Next, we focused on 18 VoCs that have either been or were circulating in Southern Nevada between 2021 and 2023. Due to the potential shared dominant mutations among VoCs during evolution, a hierarchical structure was formed from these 18 VoCs based on the phylogenetic tree from www.covspectrum.org (Fig. 1B, top panel). Dominant mutation sites for each VoC were determined as follows: 1) mutations with more than 90% prevalence among clinical sequences reported at www.covspectrum.org were retained; 2) for lineages in level 2, 3, and 4 of the hierarchical tree, mutations that existed in their higher level VoCs were excluded to maintain a unique determinant mutation set (Fig. 1A, button row); and 3) mutations with both substitutions and deletions were included in step 1) and 2). The number of dominant mutations of each VoC were listed in parentheses in Fig. 1B, top panel.

We next binarized each row of week-specific source matrix (\({{{\bf{s}}}}_{{{\rm{i}}}}\)) by keeping only mutations with values greater than mean ± 2 standard deviations, as these mutations were contributing significantly towards the source (\(\hat{{{{\bf{s}}}}_{{{\rm{i}}}}}\)). We annotated the binarized week-specific source matrix (\(\hat{{{{\bf{s}}}}_{{{\rm{i}}}}}\)) using the dominant mutations from the VoCs by computing the following six matrices: 1) Spearman’s rank correlation coefficient (\(\rho\)); 2) sensitivity; 3) specificity; 4) area under the receiver operating characteristic curve (AUC); 5) F1 score; and 6) Jaccard Index (JI). As shown in the dendrogram of these six matrices (Supplementary Fig. 11C), JI and F1score, AUC and sensitivity were highly similar to each other, respectively. The measure of specificity is dependent on the count of non-dominant mutations within each VoC. This count is arbitrarily determined and influences the number of dominant mutations observed in other VoCs. Therefore, we established our annotation criteria using the F1 score, sensitivity, and the Spearman’s correlation values (Supplementary Fig. 11B), and further based on hierarchical levels and number of dominant mutations of each VoC (detailed in Fig. 1B).

We compared VoC annotations of the proposed pipeline against results from the state-of-the-art tool Freyja³⁵ (version 1.4.5, Fig. 1A, dashed boxes). For this comparison, Freyja was retrospectively and independently applied to each of the 1,385 samples, utilizing a barcode comprising 18 VoCs, generated in October 2023. We organized samples into individual weeks, ranging from August 2021 to November 2023. In each week, if the results from Freyja indicated that any wastewater sample contained a VoC with an abundance exceeding 15%, we considered this VoC as detected by Freyja in that specific week (Supplement Figure 11A and Fig. 1C).

Potential novel mutations

Given that the identification of contributing mutation sites did not necessitate prior knowledge of reference barcodes for VoCs, ICA-Var provides a distinctive approach to discern emerging mutations. This capability extends to contributions that might give rise to novel lineages across local communities, even in the absence of clinical sequencing data. From this perspective, we focused on capturing the time-evolving contributions of significant mutations in each week-specific source matrix (Fig. 1A, top row), using: |\({{{\bf{s}}}}_{i}^{\left(j\right)}|={{\bf{T}}}{{\mathbf{\beta }}}+{{\mathbf{\epsilon }}}\), where \(i=1,\ldots,113,\) \({{{\boldsymbol{s}}}}_{i}^{\left(j\right)}\) denoted the source values for \({j}^{{th}}\) contributing mutations in \({i}^{{th}}\) week, T denoted a time vector for 113 weeks from August 2021 to November 2023, and \({{\mathbf{\beta }}}\) represented the time-evolving effect of each contributing mutation. Since flipping signs of the de-mixing matrix in ICA would result in a flipping sign of the source matrix⁴⁵, we focused on the amplitude (absolute value) of each week-specific source \(({s}_{i})\).

A significant \(\beta\) indicated a critical time-changing contribution for this mutation from 2021 to 2023. As a proof of concept, we cross-checked mutations with significant \(\beta\) against known dominant mutations in Delta (B.1.617.2), Omicron (BA.1), and more recent XBB.1 variants, and examined their time-evolving contributions. Following clinical reports, dominant mutations in Delta variant (B.1.617.2) should demonstrate significant contributions in 2021, dominant mutations in Omicron variant (BA.1) should demonstrate significant contributions from late 2021 to 2022, and dominant mutations in XBB.1 variants should demonstrate significant contributions after late 2022.

Subsequently, we refined our pool of potential novel mutations by cross-referencing known SARS-CoV-2 variants with mutations exhibiting significant time-evolving contributions. Among the mutations retained, those demonstrating emerging contributions in more recent weeks were identified as candidates with the potential to give rise to novel lineages. To delve deeper, our focus extended to their time-evolving contributions over recent three months, from August 2023 to November 2023. We further performed a hierarchical clustering on these mutations with contributions in the three months as features, to identify co-varying patterns among these mutations. To validate the identified co-varying patterns, we examined co-varying patterns of mutations within each cluster, and cross-referenced them with dominant mutations from the emerging VoCs EG.5, HV.1, and BA.2.86.

The significant contributing mutation sites identified by ICA-Var did not necessitate prior knowledge of reference barcodes for VoCs, and thereby provided a data-driven candidate pool that could be aligned with both known and emerging variant mutation profiles. To further evaluate how different thresholds affect the identification of significant mutations, we conducted a sensitivity analysis by altering this parameter. Compared to the reported mean ± 2 standard deviations used in thresholding ICA source matrix S, we considered mean ± 1 standard deviation and mean ± 3 standard deviations as lenient and stringent thresholds, respectively. With each threshold, we identified mutations with significant time-evolving contributions and cross-referenced them with known and emerging variant mutation profiles.

Simulation to demonstrate the number of samples and QC metrics required for ICA-Var

For ground truth in our simulations, we designated wastewater samples with Freyja-calling variant abundance greater than 95% for individual variants as “real” samples. Approximately 25% of our wastewater samples qualified as “real” for a single variant. We generated simulated datasets by combining varying numbers of “real” samples (\([1:1:10,\,12:2:20,\,25:5:50]\) for each SARS-CoV-2 variant under different QC metrics (50x, 20x, or 10x depth at >80% or >40% coverage). Only variants with at least 10 “real” samples under each QC metric were included, causing the number of analyzed variants to vary slightly under different QC metrics.

This process was repeated for “mixed” samples, i.e., samples with >55% Freyja-calling abundance for an individual VoC. The corresponding VoC with >55% Freyja-calling abundance was considered the ground truth for each sample in the simulation. This updated criterion significantly increased the number of VoCs in the simulation, thereby enhancing the evaluation of ICA-Var’s performance in generating data-driven reference barcodes for VoCs with both unique and overlapping mutations from mixed datasets.

We processed these simulated datasets through ICA-Var and compared the resulting group independent sources against Freyja’s clinical barcodes to assess their reliability as data-driven reference barcodes. We evaluated performance using sensitivity and area under the ROC curve (AUC, the average of sensitivity and specificity). These measures were plotted against the number of “real” samples used for each variant under each QC metric. We used elbow criteria on sensitivity curves to determine the optimal number of samples needed for reliable barcodes.

After establishing reliable data-driven barcodes, we tested variant detection performance under various conditions. Using barcodes generated from high-quality data (50x depth, 80% coverage) with 20 samples per variant, we simulated detection of B.1.617.2, BA.1, BA.2, BF.7, BQ.1, and XBB.1.5.

We assembled “weekly” datasets in two ways. First, to compare performance with Freyja, we used “real” samples (>95% abundance) under various QC metrics. Second, to demonstrate ICA-Var’s advantage in early detection, we used samples with low variant prevalence (1–10% abundance, with few dominant mutation sites and low mutation frequencies). These low-abundance simulations were conducted separately for each variant to avoid cross-variant interactions.

Weekly sources were generated by projecting data-driven barcodes onto simulated weekly data using dual-regression. These were compared against Freyja’s barcodes and evaluated using our calling pipeline (Fig. 1B) to determine variant presence.