Bias and robustness of the selection advantage estimates
We consider the incidence of variants \(X\) and \(Y\), which vary according to the model (Supplementary A)
$${X}^{\prime} (t)={\beta }_{x}X(t)-\gamma X(t) \, {{{\rm{and}}}} \, {Y}^{\prime} (t)={\beta }_{y}Y(t)-\gamma Y(t)$$
(3)
In this model, \(\beta\) represents the transmission rate, i.e., the rate at which infected individuals infect non-infected individuals, and \(\gamma\) represents the rate at which infected individuals stop being infectious, yielding a mean generation interval time of \({\gamma }^{-1}\). We model a constant selection advantage \(s\) of variant \(X\) over variant \(Y\) as \(s={\beta }_{x}/{\beta }_{y}-1\), such that \({R}_{x}/{R}_{y}=1+s\), where \({R}_{x}\) and \({R}_{y}\) are the instantaneous reproduction numbers of \(X\) and \(Y\).
We consider a general procedure for estimating \({s}_{x}\) from relative incidence time series data \(f(t)=\frac{X(t)}{X(t)+Y(t)}\). This procedure starts by estimating the time derivative of the logit-transformed relative abundances \(\phi (t)\), which give the difference in transmission rates
$${\phi }^{\prime} (t)=\log (\frac{f(t)}{1-f(t)})^{\prime}={\beta }_{x}-{\beta }_{y}$$
(4)
Commonly, \({\phi }^{\prime}\) is estimated by modeling \(f(t) \sim t\) using some variant of logistic regression11,12,13. The estimate of \({\phi }^{\prime}\) is translated into an estimate of the selection advantage \(\hat{s}\) following
$$\hat{s}={\phi }^{\prime} {\gamma }^{-1}$$
(5)
This approach makes some key assumptions. First and foremost, it assumes that the transmission rate of each variant stays constant through time. When they vary (for example due to an intervention or due to a depletion of available susceptible individuals), the estimated selection advantage will appear to vary even though it is constant. Second and third, they assume that the generation times are equal for both variants and that the instantaneous reproduction number for variant \(Y\) is close to one. Violations of both of these assumptions can lead to a bias in the estimation of the selection advantage of \(X\) over \(Y\). Furthermore, the model assumes a large, well-mixed isolated population. Population dynamic effects such as high rates of introduction of a new variant from other locations can confound the analysis of its selection advantage. Lastly, this model is built on the assumption that the selection coefficient \(s\) is constant. This model is here formulated for simplicity for two variants, but it generalizes to any number of variants, for example by modelling the relative abundance \({X}_{i}/({X}_{i}+{X}_{j})\) for all pairs of variants \({X}_{i},{X}_{j}.\)
In practice, we do not have access to the true relative incidence \(f(t)\) but need to use a proxy, generally based on clinical testing and typing. When using wastewater-derived data, the analysis is complicated by the shedding load profiles \(g\) of the virus, a function which represents the average shedding of viral particles through time after infection. When \(g\) is normalized by the total amount of particles shed, we speak of the shedding load distribution. In general, the shedding load profiles will scale, shift, and spread the observed viral loads in wastewater relative to the true incidence (Supplementary B, Supplementary Fig. 4). We say that the estimates of the selection advantage \({\hat{s}}^{w}\) based on relative loads of variants in wastewater data \({f}^{w}(t)\) are unbiased if they have no expected difference from the estimate derived from the true relative incidences. We say that they are robust if the two variants having different shedding load profiles \({g}_{x}\) and \({g}_{y}\) does not introduce any bias. To investigate the bias and robustness of wastewater-based estimates of relative abundance, we investigate under which conditions the time derivative of the logit transformed relative loads
$${\phi }^{w}(t)^{\prime}=\frac{(X * {g}_{x})^{\prime} (t)}{(X * {g}_{x})(t)}-\frac{(Y * {g}_{y})^{\prime} (t)}{(Y * {g}_{y})(t)}$$
(6)
is altered relative to \(\phi (t)^{\prime}\) (Supplementary C). Here, \(X*{g}_{x}\) and \(Y*{g}_{y}\) represent the convolution product of the incidences of variants \(X\) and \(Y\) with their respective shedding load profiles. With this approach we show that under the common assumption of constant transmission rates, the wastewater-derived estimates are both unbiased and robust to any arbitrary changes of the shedding load profile (Supplementary C i). Furthermore, we also show that if we further assume that \({g}_{x}\) and \({g}_{y}\) have the same shape, so that only the total amount shed is different, then the estimates of selective advantage are always robust even when transmission rates are not constant (Supplementary C ii). The intuition behind these derivations is that, if the observed loads correspond to scalings of the true incidences, then the logit transformed relative loads \({\phi }^{w}(t)\) will be equal to the logit transformed relative incidence \(\phi (t)\) up to an additive constant, and therefore have the same time derivative. For showing that this applies to arbitrary shedding profiles in the case where transmission rates are constant, we can use a key property of exponential growth where convolution with a distribution is equivalent to a scaling.
We also use this framework to assess the impact of incorrectly assuming identical mean generation times and shedding profiles for both variants. In this case, estimates derived from wastewater data are affected in the same way as those based on true relative incidence (Supplementary C viii). Similarly, we use the same framework to show that if we wrongly assume a constant generation time, then the estimates based on wastewater data are affected in the same way as estimates based on true relative incidence, although the bias might be delayed in time (Supplementary C v). We also show that in that case the wastewater-based estimates remain robust to changes in total shedding.
We further develop approximations of \({\phi }^{w}(t)^{\prime}\) to investigate the eventual bias in the apparent selective advantage when the assumption of constant transmission rates does not hold or when the assumption of constant selection does not hold (Supplementary C iii, iv). We use it to investigate its eventual loss of robustness when, in addition, \({g}_{x}\) is arbitrarily different in shape from \({g}_{y}\) (Supplementary C vi). These approximations show that with non-constant transmission rates or with a non-constant selection factor \(s(t)\), the wastewater-based estimates \({\hat{s}}^{w}(t)\) are delayed with respect to true relative incidence-based estimator \(\hat{s}(t)\) by an amount equal to the mean \(\mu\) of the shedding load distribution \(g\) (although it is still robust to changes in total shedding from a variant).
$${\hat{s}}^{w}(t)\approx \hat{s}(t-\mu )$$
(7)
When we additionally further relax the assumption that \({g}_{x}\) and \({g}_{y}\) have the same shape, then the approximations show that robustness can be lost, and that the main contributing factors are the difference in mean of the shedding load distributions of \(X\) and \(Y\), and the rate at which the transmission rates are varying (Supplementary Equations 38, 39).
Bias in R
e estimates due to differences in shedding
We derived closed-form expressions to quantify the potential bias in wastewater-based estimates of \({R}_{e}\) during the introduction of a variant with a different shedding load profile (Supplementary D). For the case where the shedding load profiles of the variants differ in total shedding by a factor \(c\), we obtained (Supplementary D ii) the additive bias
$${R}^{w}(t)=R(t)+{\gamma }^{-1}\frac{(c-1)f(t)^{\prime} }{1+(c-1)f(t)}\le \frac{(c-1)s}{4}$$
(8)
In addition, when the shedding load distributions differ (Supplementary D iii), we have the bias
$${R}^{w}(t)\approx {R(t)+\gamma }^{-1}\frac{({\beta }_{x}-\gamma ){\varDelta }_{\mu }\,f(t)^{\prime} }{1+({\beta }_{x}-\gamma ){\varDelta }_{\mu }\,f(t)}\le \frac{s({\beta }_{x}-\gamma ){\Delta }_{\mu }}{4}$$
(9)
Where \({\Delta }_{\mu }={\mu }_{y}-{\mu }_{x}\). This also shows that the bias is greatest when the relative abundance \(f(t)\) changes rapidly over time, but goes to zero if \(f(t)\) does not vary over time (such as at the end of a sweep).
Simulations of variant emergence
We performed deterministic simulations of variant emergence for four different scenarios, two with fixed transmission rates and two with varying transmission rates. First, two variants are introduced at the same time and compete, with one having a selective advantage. Second, one variant is going extinct and is swept by the second variant. Third, two variants are introduced at the same time, but their transmission rates decrease according to the mass action dynamics of a Susceptible-Infected-Recovered (SIR) model. Fourth, two variants are introduced at the same time, and the transmission rates are reduced to almost zero by an intervention in the middle of the time series. In all cases, we assumed a fixed multiplicative selective advantage. We performed the simulations in R v4.1.3 using the package deSolve v1.3524.
In the derivations of closed-form expressions for the biases in selection advantage \(s\) and the effective reproduction number \({R}_{e}\) due to reduced or increased shedding, we relied on an expression for \({R}_{e}\) which assumes an exponentially-distributed generation time interval. We relaxed this assumption in stochastic simulations to assess the robustness of our results, using the simulation framework used in Huisman et al. 2,23, based on the framework described by Cori et al. 23,25 (Supplementary D). In this case the generation time interval is assumed to be gamma distributed, with a mean of 4.8 days and a standard deviation of 2.3 days. The results were analyzed using EpiEstim v2.226 using a sliding window approach.
Wastewater sampling and processing
Wastewater-based surveillance for SARS-CoV-2 was conducted in six sewersheds (Altenrhein, Chur, Geneva, Laupen, Lugano, Zurich) across Switzerland from 24 November 2021 through 10 January 2022 (Fig. 3A, Supplementary Fig. 1). Volume-proportional 24 h composite samples were collected daily from raw wastewater at the influent of each of the six sites and stored at 4 °C for up to 5 days before being transported on ice for processing at a central laboratory (Eawag, Dübendorf, Switzerland). Processing included total nucleic acid extraction from 40 ml samples (Wizard Enviro Total Nucleic Acid Extraction Kit, CN A2991, Promega Corporation, USA) with an elution volume of 80 ul and subsequent inhibitor removal using OneStep PCR Inhibitor Removal columns (CN D6030, Zymo Research, USA). From these samples, a subset were analyzed for variants of concern using drop-off RT-dPCR assays targeting signature mutations for Delta (S:L452R, n = 74 samples) and Omicron BA.1 (S:HV69-70, n = 79), following Caduff et al. 4. Almost all (n = 280) samples were also analyzed using NGS to identify Delta versus Omicron BA.1, following Jahn et al. 5. RNA extracts were stored at −80 °C for up to 1 week prior to sequencing and up to 3 months prior to RT-dPCR analysis.
Drop-off RT-dPCR assays for detection of signature mutations
Drop-off RT-dPCR assays targeted the S:HV69-70 deletion indicative of Omicron BA.1, as previously described4, and the S:L452R mutation indicative of Delta (lineage B.1.617.2).The assay targeting S:L452R includes two hydrolysis probes binding to a single amplicon: a universal probe targeting a conserved region on the amplicon and a variant-specific probe that binds to the S:L452R mutation (Supplementary Table 3, Supplementary Fig. 3). In the digital PCR, generated droplets with dual fluorescence indicates the presence of an amplicon from the mutation (i.e., Delta), whereas single fluorescence indicative of only the universal probe indicates an amplicon without the S:L452R mutation (i.e., Omicron). For the S:HV69-70 assay, the variant-specific probe only binds when S:HV69-70 is present27, and therefore dual fluorescence in a given droplet indicates presence of an amplicon without the mutation (i.e. Delta), whereas single fluorescence of only the universal probe indicates the presence of the mutation (i.e., Omicron).
Clinical VOC data
For each canton surrounding a WWTP, we downloaded counts of infected individuals binned by variant of sequenced PCR-positive clinical samples through the LAPIS API of CoV-Spectrum28. The data was restricted to sequences originating from the Viollier lab (8525 sequences, Supplementary Fig. 1), which during that period sent a random subset of their PCR-positive samples out for sequencing. This ensured that we compared to the most random clinical samples available for the populations served by the WWTP we analyzed. For the Geneva sewershed, we also compared estimates to data on the qPCR S-gene target failure (SGTF) data from the Geneva University Hospital (HUG) available at https://www.hug.ch/laboratoire-virologie/surveillance-variants-sars-cov-2-geneve-national (5794 tests, Supplementary Fig. 1). The SGTF qPCR is based on the detection of S:HV69-70; failed amplification of a clinical sample previously positive for the N1 gene target is used as a proxy to indicate that the clinical sample is Omicron BA.1.
Data analysis
Computational analysis of wastewater sequencing samples was performed using V-Pipe 3.029. We analyzed the data using the R v4.1.3 statistical programming language and the R package WWdPCR v0.1.014. The package was used to obtain maximum likelihood estimates (MLE) and confidence intervals of the logistic growth rate of the Omicron BA.1 variant in each region. Confidence intervals for the logistic growth parameter were computed assuming a quasibinomial (for the clinical and wastewater sequencing data) or quasimultinomial (for the dPCR data) model of the counts to account for overdispersion, but without allowing underdispersion (i.e., overdispersion factors <1 were not considered). Confidence bands for the fitted values were computed on the logit scale, with standard errors projected using the Delta method, and then back-transformed to the linear scale. The normality assumption is more likely to hold in this reparametrization, which optimizes the coverage of the intervals as well as ensures that they are constrained to the [0,1] range30.
The logistic growth model was fitted separately using the wastewater S:L452R dPCR data, the wastewater S:HV69-70 dPCR data, the wastewater sequencing data, and the clinical sequencing data. For Geneva, we fit the model also on the SGTF data. For the S:L452R dPCR, the dual fluorescence droplets are indicative of Delta, so we assumed that the single fluorescence droplets indicated Omicron BA.1. The growth advantage of BA.1 was calculated using S:L452R data assuming a negative logistic decay rate of the mutated fraction. For wastewater sequencing, we proceeded as previously described5, and we estimated the relative abundance of BA.1 using the observed fractions of reads bearing mutations characteristic of BA.1, while excluding mutations also present in B.1.617.2*. Amplicons suffering potential differential dropout rates or altered amplification due to mutations in the primer regions were discarded.
Inclusion & Ethics statement
All collaborators who contributed to this study and fulfilled the authorship criteria required by Nature Portfolio journals have been included as authors.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.