The mutational landscape of SARS-CoV-2 provides new insight into viral evolution and fitness


July 12, 2025

Overall strategy

The mutation rates and spectra of RNA viruses (including SARS-CoV-2) are notoriously difficult to measure. For example, even though more than 8 million SARS-CoV-2 genomes have been documented across the globe (GISAID, https://gisaid.org/), this dataset only contains mutations that were successful enough to become major variants in patients. Accordingly, most studies of within-host genetic diversity are limited to variants with an allele frequency that exceeds 0.5%, while mutations that are detrimental to the virus are missed¹. In contrast, the negative correlation between mutation rate and genome size observed in viruses suggests that the spontaneous mutation rate of the >30 kb SARS-CoV-2 genome is <1 × 10⁻⁵ per base², which is significantly lower than the detection threshold of the sequencing methods used on patients³. As a result, RNA-sequencing methods with improved sensitivity are required to determine the mutation rate of SARS-CoV-2. With these considerations in mind, we used an ultra-sensitive and highly accurate rolling-circle RNA consensus sequencing method termed CirSeq⁴ to determine the mutation rate and spectrum of 6 SARS-CoV-2 variants. This method was previously used to determine the mutational landscape of other RNA viruses, including the polio virus⁵, the Ebola virus⁶, the Dengue virus⁷, and the Zika virus⁸. The improved accuracy of CirSeq relies on the circularization of short RNA fragments to synthesize long cDNA molecules that carry tandem repeats of the original RNA template. These tandem repeats can then be analyzed to generate a consensus sequence, which eliminates sequencing and reverse-transcription errors from the final sequencing results (Supplementary Fig. 1). Mutation frequencies are then obtained by dividing the number of mutations observed at a given position by the number of molecules that covered this position.
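The consensus step and the frequency calculation described above can be illustrated with a minimal sketch. It is not the CirSeq pipeline itself: it assumes that the repeat length is already known, that the read starts at a repeat boundary, and that a simple majority vote is used, whereas a real pipeline also has to detect the repeat period and weigh base qualities. The function names and the example read are hypothetical.

```python
from collections import Counter

def consensus_from_repeats(read, repeat_len):
    """Majority-vote consensus over the tandem repeats contained in one read.

    Assumes the read starts at a repeat boundary; positions without a
    strict majority are masked as 'N'.
    """
    repeats = [read[i:i + repeat_len]
               for i in range(0, len(read) - repeat_len + 1, repeat_len)]
    consensus = []
    for column in zip(*repeats):
        base, votes = Counter(column).most_common(1)[0]
        consensus.append(base if 2 * votes > len(column) else "N")
    return "".join(consensus)

def mutation_frequency(mutant_molecules, covering_molecules):
    """Mutations observed at a position divided by molecules covering it."""
    return mutant_molecules / covering_molecules

# Three copies of a 6-nt template; the third copy carries one error (G -> C).
read = "ACGUAC" + "ACGUAC" + "ACCUAC"
print(consensus_from_repeats(read, 6))   # ACGUAC
print(mutation_frequency(3, 1_000_000))  # 3e-06
```

Because an error introduced during reverse transcription or sequencing usually affects only one copy of the repeat, it is outvoted by the other copies, while a true mutation in the RNA template appears in every copy and survives into the consensus.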

To explore the mutational landscape of SARS-CoV-2, we cultured the virus in VeroE6 cells, a preferred cell line for COVID-19 research because of its susceptibility to infection, efficient viral replication, and permissiveness to mutations⁹. Accordingly, VeroE6 cells can support a higher degree of viral genetic diversity than other cell lines, which is useful for studies that examine viral evolution during prolonged culture conditions. In total, we cultured 6 major strains of the SARS-CoV-2 virus, including the USA-WA1/2020, Alpha and Delta strains (corresponding to clades 19B, 20I and 21J, respectively), as well as the Beta, Gamma and Omicron strains. Although each strain was cultured in duplicate, the majority of our experiments were performed on the USA-WA1/2020, Alpha, and Delta strains, which we cultured over seven serial passages, while the Beta, Gamma, and Omicron strains were profiled for a single passage (Table 1). For the strains we cultured over seven passages, we initiated each passage at a low multiplicity of infection (MOI = 0.1) to minimize potential complementation effects. This strategy ensures that most cells are infected by a single virion during the initial phase of each passage, significantly reducing the likelihood of co-infections. Co-infections, where multiple viral particles infect the same cell, could allow defective viral genomes to be rescued by functional ones, distorting the mutation spectrum and artificially lowering the observed fitness cost of deleterious mutations. Thus, by maintaining a low MOI across passages, we consistently and repeatedly limit the propagation of defective genomes that may have been rescued transiently during the expansion phase of the prior cycle. Identical approaches were previously used to limit the impact of co-infections on fitness measurements of other viruses⁵. 
Finally, because the VeroE6 cells were derived from the kidney of an African green monkey, we wanted to make sure that our measurements were not skewed by this unique biological environment. To do so, we also cultured the Delta strain for one passage in Calu-3 cells (a human lung adenocarcinoma cell line) and in primary human nasal epithelial cells (HNEC) grown at an air–liquid interface (ALI), which more closely mimics human SARS-CoV-2 infections (Table 1). After each passage, we sequenced the SARS-CoV-2 genome by CirSeq to take a snapshot of its mutational landscape. A schematic of our cell culture and sequencing approach is depicted in Fig. 1. Across all strains and conditions, we sequenced over 200 billion bases and identified more than three million mutations. Lastly, we assigned the most common mutations a fitness value to determine whether they are selected for or against, and mapped these mutations onto the viral genome and proteome to determine the biological basis for selection.
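The rationale for the low-MOI strategy described above can be made concrete with a short calculation. The sketch below assumes that the number of virions entering each cell follows a Poisson distribution with a mean equal to the MOI. This is a common modelling assumption rather than something stated in the text, and the function names are hypothetical.

```python
import math

def poisson_pmf(k, moi):
    """Probability that a cell receives exactly k virions (Poisson model)."""
    return math.exp(-moi) * moi ** k / math.factorial(k)

def infection_fractions(moi):
    """Return (fraction of cells infected, fraction of infected cells
    that received two or more virions)."""
    p_none = poisson_pmf(0, moi)
    p_single = poisson_pmf(1, moi)
    infected = 1.0 - p_none
    coinfected_among_infected = (infected - p_single) / infected
    return infected, coinfected_among_infected

infected, coinfected = infection_fractions(0.1)
print(f"infected: {infected:.1%}, co-infected among infected: {coinfected:.1%}")
# infected: 9.5%, co-infected among infected: 4.9%
```

Under this assumption, roughly 95% of the cells infected at the start of a passage receive a single virion, which is the sense in which a low MOI limits complementation during the initial round of infection.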

**Fig. 1: Overall strategy for data collection.**

Table 1 Tabulation of main results and extent of sequencing data

The mutation rate and spectrum of the SARS-CoV-2 genome

After we profiled the mutational landscape of each SARS-CoV-2 strain across the length of its genome (Fig. 2A), we used lethal and highly detrimental mutations to estimate its mutation rate. Because these mutations cannot be carried over between passages, they must be produced anew each generation, so that their frequency is equal to the mutation rate⁵. We used two complementary methods to identify these mutations. First, we considered mutations to be lethal or highly deleterious if they introduce a premature stop codon (PTC) in the open reading frame of the RNA-dependent RNA polymerase (RdRp), an essential viral protein required for replication¹⁰. This strategy provides the most reliable way to identify lethal mutations and, by extension, to calculate the mutation rate. However, one limitation of this approach is that it cannot capture A → C, U → C, G → C, and A → G mutations, because none of these substitutions can convert a sense codon into a stop codon. To ensure a comprehensive assessment of the mutation rate across all base substitutions, we therefore employed a second, complementary strategy.
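The limitation of the PTC approach follows directly from the genetic code and can be checked by brute force. The sketch below enumerates every single-base substitution in every sense codon of the standard genetic code and records which substitution types can produce one of the three stop codons (UAA, UAG, UGA). It considers all 61 sense codons rather than the actual RdRp sequence, and the function name is hypothetical.

```python
from itertools import product

BASES = "ACGU"
STOP_CODONS = {"UAA", "UAG", "UGA"}

def ptc_capable_substitutions():
    """Substitution types (from, to) that can turn a sense codon into a stop."""
    capable = set()
    for codon in map("".join, product(BASES, repeat=3)):
        if codon in STOP_CODONS:
            continue  # only sense codons can acquire a premature stop
        for pos, new_base in product(range(3), BASES):
            if new_base == codon[pos]:
                continue
            mutant = codon[:pos] + new_base + codon[pos + 1:]
            if mutant in STOP_CODONS:
                capable.add((codon[pos], new_base))
    return capable

all_substitutions = {(a, b) for a in BASES for b in BASES if a != b}
missing = sorted(f"{a}>{b}" for a, b in all_substitutions - ptc_capable_substitutions())
print(missing)  # ['A>C', 'A>G', 'G>C', 'U>C']
```

Eight of the twelve substitution types can therefore be measured through premature stop codons, and the remaining four are exactly the ones listed above.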

**Fig. 2: Broad overview of the dataset.**

For this strategy, we analyzed over eight million SARS-CoV-2 genomes previously aligned by UShER¹¹,¹² and Ensembl¹³ and identified mutations that are absent from these databases. These genomes represent the consensus sequences of the most common viral variants in individual patients, meaning that mutations with severe fitness consequences, including lethal or highly detrimental mutations, are unlikely to be present. Consistent with this idea, we found that the mutations identified through this method were significantly depleted in our experimental dataset (Supplementary Fig. 2), supporting their classification as highly detrimental or lethal. However, we noticed that 68 of these mutations were present in our own dataset at frequencies exceeding 1 × 10⁻⁴ (>10-fold higher than the average mutation frequency), strongly suggesting that they are neither lethal nor highly deleterious. Thus, we excluded them from the list of mutations used to determine mutation rates.

By combining these strategies, we created a comprehensive list of lethal and highly detrimental mutations and used it to calculate the mutation rate across the length of the SARS-CoV-2 genome. This analysis revealed that ~1.5 × 10⁻⁶ mutations occur per nucleotide per viral passage (Fig. 2B), whether the virus was grown in VeroE6 cells, Calu-3 cells, or primary HNEC grown in an ALI (Table 1). To ensure that our “combination strategy” was an appropriate tool to determine mutation rates, we also calculated separate mutation rates based on either the PTC or the “absent mutations” method and found that they yielded nearly identical estimates, strongly supporting the idea that these approaches provide appropriate, complementary datasets for determining the mutation rate of the SARS-CoV-2 genome (Supplementary Fig. 3). Interestingly, we found that the Delta strain displayed the highest mutation rate of the three strains that we monitored over seven passages, potentially contributing to the increased virulence it displayed compared to the USA-WA1/2020 and Alpha strains. For each strain, we found that the mutation rate varied greatly between different base substitutions, ranging from ~2 × 10⁻⁵ for C → U mutations to ~1 × 10⁻⁶ for G → C mutations, with C → U substitutions being ~4 times more common than any other base substitution (Fig. 2C and Supplementary Fig. 4). The rate at which C → U substitutions arose depended to a significant degree on the upstream (i.e., 5′ adjacent) and downstream (i.e., 3′ adjacent) nucleotides. For example, we found that C → U mutations occur most commonly in a 5′-UCG-3′ context (Fig. 3A–D), consistent with analyses based on SARS-CoV-2 phylogeny¹⁴. When taken together, these observations demonstrate that C → U substitutions add the greatest amount of genetic variation to the SARS-CoV-2 genome and provide the largest substrate for evolution to act upon, a conclusion that is also supported by more indirect observations¹⁵. 
Because our measurements are independent of positive and negative selection (both of which play a key role in published SARS-CoV-2 genome sequences), our analyses provide an unfiltered view of the impact of genetic context on viral mutagenesis.
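The context analysis described above amounts to pooling C → U observations by the nucleotides immediately 5′ and 3′ of each cytidine. The following sketch shows one way to do this, assuming per-position mutation counts and coverage are already available. The reference fragment, counts, and coverage values are invented for illustration, and the function name is hypothetical.

```python
from collections import defaultdict

def c_to_u_rate_by_context(reference, c_to_u_counts, coverage):
    """Pool C->U counts and coverage by trinucleotide context (5' base, C, 3' base)."""
    mutations = defaultdict(int)
    depth = defaultdict(int)
    for i in range(1, len(reference) - 1):
        if reference[i] != "C":
            continue
        context = reference[i - 1:i + 2]
        mutations[context] += c_to_u_counts[i]
        depth[context] += coverage[i]
    return {ctx: mutations[ctx] / depth[ctx] for ctx in depth if depth[ctx] > 0}

# Hypothetical 11-nt fragment with cytidines at indices 2, 5 and 8.
reference = "AUCGACAUCGU"
counts = [0, 0, 4, 0, 0, 1, 0, 0, 6, 0, 0]   # invented C->U observations
coverage = [100_000] * len(reference)         # invented per-position coverage
rates = c_to_u_rate_by_context(reference, counts, coverage)
print(rates)
```

Pooling counts and coverage before dividing, rather than averaging per-site frequencies, gives sites with deeper coverage proportionally more influence on the estimate for each context.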

**Fig. 3: The mutation rate of the SARS-CoV-2 genome is altered by genetic context.**

Notably, the mutation rate of SARS-CoV-2 is ~10-fold lower than that of the polio virus⁵ and ~5-fold lower than that of the Dengue virus⁷, two other RNA viruses previously examined by CirSeq. The decreased mutation rate of the SARS-CoV-2 genome is most likely due to the proofreading ability of its RdRp¹⁰,¹⁶, which is absent in the polio and Dengue viruses. In this context, it is important to note that G → A and U → C mutations displayed the largest reduction in mutation rate compared to the polio virus (48-fold and 28-fold, respectively, Fig. S5). When the proofreading activity of eukaryotic RNA polymerase II is compromised¹⁷,¹⁸,¹⁹, these base substitutions increase the most, suggesting the existence of a universal set of rules that govern the proofreading capabilities of RNA polymerases in eukaryotes and viruses.

Selection for nucleotide composition

It is likely that the mutation rate and spectrum of the SARS-CoV-2 genome affect the evolution of the virus in various ways. One of the most fundamental attributes of a genome is its nucleotide composition, which depends on the balance between the mutation spectrum and the intensity of selection for each of the four nucleotides. Using the mutation rate for each of the 12 possible base substitutions, we estimate the equilibrium frequencies for all four nucleotides as: U = 0.42, A = 0.29, G = 0.21, and C = 0.07 (Table 2). This analysis translates into an equilibrium GC content of 28%, which is substantially higher than the 17% previously reported¹⁵. However, this previous estimate is based on indirect estimations of mutation rates at 4-fold degenerate sites across lineages sequenced in GISAID, which might be impacted by selection. Regardless, both estimates are significantly lower than the observed 38% GC content of the SARS-CoV-2 genome, indicating that the GC content in the SARS-CoV-2 genome is actively preserved by natural selection, particularly in the case of cytidines. Cytidines were even preserved at 4-fold degenerate sites (Table 2), suggesting that natural selection also preserves cytidines at sites where mutations would not alter amino acid composition. This pattern indicates a broader, possibly structural or regulatory, role for cytidines in the SARS-CoV-2 genome. To gain more insight into the molecular mechanisms that suppress cytidine depletion at 4-fold degenerate sites, and examine the impact of C → U mutations on viral fitness, we calculated fitness values for 3603 C → U mutations that were scattered across the SARS-CoV-2 genome.
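The link between a mutation spectrum and an equilibrium base composition can be sketched as a four-state Markov chain, in which each base mutates to the others at the measured rates until inflow and outflow balance for every nucleotide. The example below is an illustration under that assumption and is not the calculation used for Table 2. The rates in `example_rates` are invented placeholders in arbitrary units, not the measured values, and all names are hypothetical.

```python
BASES = "ACGU"

def equilibrium_frequencies(rates, iterations=5000):
    """Stationary base frequencies for substitution rates rates[(x, y)] (x -> y).

    Only relative rates matter, so they are rescaled to make the iteration
    converge quickly; the frequencies are then propagated until they settle.
    """
    outflow = {x: sum(rates[(x, y)] for y in BASES if y != x) for x in BASES}
    scale = 0.5 / max(outflow.values())
    freqs = {b: 0.25 for b in BASES}
    for _ in range(iterations):
        updated = {}
        for y in BASES:
            stay = freqs[y] * (1.0 - scale * outflow[y])
            gain = sum(freqs[x] * scale * rates[(x, y)] for x in BASES if x != y)
            updated[y] = stay + gain
        freqs = updated
    return freqs

# Invented relative rates (NOT the measured spectrum): C->U is set highest.
example_rates = {(x, y): 1.0 for x in BASES for y in BASES if x != y}
example_rates[("C", "U")] = 20.0
example_rates[("G", "U")] = 5.0
example_rates[("G", "A")] = 4.0
example_rates[("A", "G")] = 3.0
example_rates[("U", "C")] = 3.0

eq = equilibrium_frequencies(example_rates)
print({b: round(f, 3) for b, f in eq.items()})
print("equilibrium GC content:", round(eq["G"] + eq["C"], 3))
```

The equilibrium GC content is simply the sum of the G and C frequencies, which is how the reported values of G = 0.21 and C = 0.07 correspond to 28%.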

Table 2 The base composition of the SARS-CoV-2 genome

Fitness landscape of SARS-CoV-2

Because fitness analyses require large amounts of data gathered from a single strain over an extended period of time, we selected one replicate of the SARS-CoV-2 Delta variant and tracked it over the course of seven passages. After each passage, we monitored its genome by CirSeq, ultimately sequencing 155 billion bases and covering each base 1.7 million times on average. This sequencing effort allowed us to identify 64,967 unique mutations across all passages, with each mutation being observed 42 times on average, for a total of 2.7 million mutation observations. Because the SARS-CoV-2 genome is ∼30,000 bases in length, and each base can be mutated into 3 different nucleotides, a total of ∼90,000 base substitutions is theoretically possible, meaning that we identified 66% of all possible mutations in the SARS-CoV-2 genome. We then used this dataset to determine the consequences of C → U mutations on viral evolution by characterizing their impact on the fitness of the SARS-CoV-2 virus with a strategy previously employed for the polio virus⁵. Due to technical considerations, we did not determine fitness values for other base substitutions (see “Methods” section). Briefly, the fitness of a mutation is related to its change in frequency between consecutive passages as described in the following equation:

$$f_{n}=f_{n-1}\times w+\mu_{n-1} \qquad (1)$$

Here, fₙ and fₙ₋₁ are the observed frequencies of the mutation at passages n and n − 1, w is the relative fitness, and μ is the rate of C → U substitutions (Fig. 4A and Supplementary Data 1). These fitness values were calculated as a weighted average of the fitness values derived at each of the seven passages, so that values with higher coverage (and thus higher precision) contribute more to the final estimates. We performed three tests to determine the veracity of these fitness values. First, we separated the C → U mutations into three groups and found that, as expected, synonymous C → U mutations were less deleterious than non-synonymous C → U mutations (0.78 vs 0.70, P = 2.0 × 10⁻⁶, Mann–Whitney U-test), and non-synonymous mutations were less deleterious than non-sense mutations (0.70 vs 0.62, P = 0.006, Mann–Whitney U-test). In a second test, we examined the fitness values of mutations that were predicted to be either lethal or highly detrimental because they produce a PTC, or because they were absent from the alignment of 8 million genomes. We found that these mutations displayed significantly lower fitness values compared to all the other mutations we detected (mean fitness: 0.38 vs 0.73, P = 3.4 × 10⁻⁴, Mann–Whitney U-test). Moreover, mutations with similar fitness values frequently clustered together, as expected of mutations that affect similar regions of the genome or the proteome (Fig. 4B). Finally, we compared our fitness estimates to studies that used changes in mutation frequency throughout the SARS-CoV-2 phylogeny to infer fitness values²⁰,²¹. When we compared our values to the only other study to provide fitness estimates for both synonymous and nonsynonymous mutations²¹, we found a moderate but significant correlation between our data and this independent dataset (r = 0.47, P < 2 × 10⁻¹⁶, Fig. S6). Together, these analyses strongly support the idea that our algorithms provide predictive information about the impact of mutations on viral fitness.
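Equation 1 can be rearranged to give a fitness estimate for each pair of consecutive passages, w = (f_n − μ) / f_(n−1), and these estimates can then be combined into a coverage-weighted average. The sketch below illustrates that idea under two assumptions that are not specified in the text: μ is treated as constant across passages, and the weight of each estimate is the coverage at passage n. The function names and numerical values are hypothetical.

```python
def simulate_frequencies(f0, w, mu, passages):
    """Iterate Eq. 1, f_n = f_(n-1) * w + mu, starting from frequency f0."""
    freqs = [f0]
    for _ in range(passages):
        freqs.append(freqs[-1] * w + mu)
    return freqs

def estimate_fitness(freqs, coverage, mu):
    """Coverage-weighted average of per-passage fitness estimates.

    Each estimate solves Eq. 1 for w; passages where the mutation was not
    observed at n-1 are skipped because w is undefined there.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for n in range(1, len(freqs)):
        if freqs[n - 1] == 0:
            continue
        w_n = (freqs[n] - mu) / freqs[n - 1]
        weighted_sum += w_n * coverage[n]   # assumed weighting scheme
        total_weight += coverage[n]
    if total_weight == 0:
        raise ValueError("no usable pair of consecutive passages")
    return weighted_sum / total_weight

# Invented example: a mutation with true fitness 0.7 and mu = 2e-5.
freqs = simulate_frequencies(f0=1e-4, w=0.7, mu=2e-5, passages=7)
coverage = [1_700_000] * len(freqs)
print(round(estimate_fitness(freqs, coverage, mu=2e-5), 6))  # 0.7
```

Setting w = 0 in Eq. 1 gives f_n = μ, which is the property used earlier to estimate mutation rates from lethal mutations: a mutation that cannot be inherited is present only at the frequency at which it is newly generated.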

**Fig. 4: Fitness analysis of SARS-CoV-2 mutations.**

Paired bases contribute disproportionately to SARS-CoV-2 fitness

Next, we used our fitness values to investigate why synonymous C → U mutations are selected against in the SARS-CoV-2 genome, even if they occur at 4-fold degenerate sites. Potentially, this phenomenon could be explained by stronger, more frequent purifying selection against synonymous mutations in SARS-CoV-2 compared to other viruses, such as the polio virus⁵. It was recently shown that the SARS-CoV-2 genome adopts a highly specific secondary structure²² and that bases that pair with each other to form these structures tend to display lower nucleotide diversity²³. Interestingly, we observed a similar specificity for secondary structures in our CirSeq dataset. The enzyme used to fragment viral RNA (RNase III) prefers to cleave RNA at specific double-stranded structures, causing strong peaks and valleys in genome coverage that reflect the secondary structure of the SARS-CoV-2 genome. We found that these coverage peaks are identical between all the variants we tested, indicating that the secondary structure of the genome is highly conserved across the SARS-CoV-2 phylogeny (Fig. S7). Based on these observations, we hypothesized that the need to preserve this secondary structure could be a significant factor driving purifying selection against synonymous mutations. To test this hypothesis, we split synonymous C → U mutations into two groups: those at sites that form base-pair interactions (henceforth referred to as “paired” sites) and those at sites that do not (henceforth referred to as “unpaired” sites). This classification is based on a study that used DMS-MaPseq to determine whether nucleotides are paired or not²².

Consistent with the idea that there is strong purifying selection against synonymous mutations that affect secondary structures in the SARS-CoV-2 genome, we found that the average fitness value of synonymous C → U mutations was lower at paired sites compared to unpaired sites (0.60 vs 0.93, P < 2 × 10⁻¹⁶, Mann–Whitney U-test, Fig. 4C, D). We observed a similar pattern for nonsynonymous mutations (average fitness: 0.50 vs 0.81 for paired and unpaired sites, respectively, P < 2 × 10⁻¹⁶, Mann–Whitney U-test, Fig. 4E, F) and non-sense mutations (average fitness: 0.29 vs 0.71 for paired and unpaired sites, respectively, P < 2 × 10⁻¹⁶, Mann–Whitney U-test, Fig. 4G, H). To support this idea further, we re-examined our fitness values with the help of an independent assessment of secondary structures based on SHAPE scores²⁴ and found a weak but significant positive correlation between the SHAPE reactivity score and our fitness estimates for synonymous C → U mutations (r = 0.28, P < 2 × 10⁻¹⁶, Fig. S8). Taken together, these results suggest that mutations that disrupt base-pairing interactions are more likely to be deleterious to SARS-CoV-2 fitness than those that do not.

Because our fitness estimates are limited to C → U mutations, we used the fitness estimates previously published by Bloom and Neher²¹ to investigate whether purifying selection against synonymous mutations at paired sites was present for all types of base substitutions. Restricting our analysis to base substitutions with enough observations in both paired and unpaired categories, we found that synonymous mutations are significantly more deleterious at paired vs unpaired sites for U → C, G → U, G → C, C → U, C → A, and A → U base substitutions (P < 0.01 for all, Mann–Whitney U-test, Fig. S9). U → A, U → G, C → G, and A → C substitutions did not yield enough observations to calculate fitness values, while two types of base substitutions (G → A and A → G) showed no significant difference. Interestingly, though, it was previously shown that A:C and G:U base pairs (which would arise from G → A and A → G mutations, respectively) allow wobble base pairing in RNA molecules²⁵. Therefore, it is possible that these mutations do not significantly alter the secondary structure of the SARS-CoV-2 genome, even when they occur at paired sites, allowing them to escape purifying selection.

Paired bases display a reduced mutation rate

Our data suggests that the secondary structure of the SARS-CoV-2 genome is critical for viral fitness and that SARS-CoV-2 conserves these structures by strong purifying selection. However, the pace of evolution is also controlled by the mutation rate. Accordingly, we wanted to test the impact of the secondary structure on the mutation rate. To do so, we compared the rate of mutation between paired and unpaired bases and found that C → U mutations (but not other base substitutions) are ~3 times more frequent at unpaired bases compared to paired bases in all strains (P < 0.01, Figs. 4G and S10). Other base substitutions are not increased at unpaired bases (Fig. 4G), indicating that the mechanism responsible for this observation is highly specific. A similar discrepancy is seen at mutational hot spots and cold spots, which are defined as locations where the mutation frequency either increases or decreases 10-fold. In hot spots, only 14.2% of nucleotides are predicted to be paired (vs 47.3% overall, P < 2 × 10⁻¹⁶, chi-square test), while paired nucleotides make up 81.1% of bases in cold spots (P < 2 × 10⁻¹⁶, chi-square test, Fig. 4H). One potential explanation for this observation is spontaneous cytidine deamination, as unpaired cytidines are 140 times more prone to spontaneous deamination into uracil than paired bases²⁶. In support of this idea, we found that C → U mutations are elevated at CpG sites (Fig. 4I). The electron density of guanine slightly alters a cytosine’s electron distribution, particularly around the amino group at the 4-position, thereby increasing the likelihood of spontaneous hydrolytic deamination²⁷,²⁸. Another possibility is that the reduced rate of cytidine deamination at paired sites reflects the preference of APOBEC proteins for ssRNA. For example, APOBEC3A has a strong preference for unpaired cytidines that are flanked by a 5′ uracil and a 3′ guanosine²⁹,³⁰,³¹, the exact conditions that show the highest C → U mutation rate in our dataset (Fig. 3C). 
Regardless of the mechanism, though, these observations suggest that the secondary structure of the SARS-CoV-2 genome is not only preserved by strong purifying selection, but also by local changes in the mutation rate that spare paired bases. Since paired bases display lower C → U mutation rates, we hypothesized that selection should result in an excess of essential components of proteins in paired regions of the genome. In support of this hypothesis, we found that the more detrimental a mutation is for a protein, the greater the chance it is located in a paired region (Fig. 4D, F, H). For example, bases in which C → U mutations would result in a premature stop codon with a fitness value of 0 have a 60% chance of being placed in a paired region, compared to 40% for non-synonymous mutations and 30% for synonymous mutations. Moreover, when we compiled a short-list of 120 amino acids that have been proposed to undergo mutational scanning, conservation between coronaviruses and patient information, […]. Together, these observations suggest a synergistic relationship between the secondary structure of the SARS-CoV-2 genome and its mutation rate, which reinforce each other to promote viral fitness.

Fitness values, viral evolution, and potential weaknesses of SARS-CoV-2

Because fitness values reflect the forces of natural selection, we wondered whether they could predict the evolution of the SARS-CoV-2 virus. Since the emergence of the Delta variant that we sequenced for our fitness analysis, multiple variants have evolved and swept the globe. Each of these strains contains defining mutations that were positively selected for during the evolution of SARS-CoV-2 in human populations (Fig. 5A, B). Interestingly, some of these mutations were also detected in our short-term evolution experiment. When we examined the fitness values of the mutations that define strain 23H (the most advanced strain at the time of writing), we found that these mutations displayed significantly higher fitness values compared to all other mutations detected in our evolution experiment (0.97 vs 0.72, P = 5 × 10⁻⁵, Wilcoxon rank sum test). Thus, the fitness landscape obtained from our dataset could help predict the mutations that arise in future variants. Accordingly, mutations with high fitness values that have not been observed in known variants so far could be of interest to researchers trying to predict the evolutionary trajectory of SARS-CoV-2³²,³³. Although mutations with high fitness values tend to be dispersed across the SARS-CoV-2 genome (Fig. 5C), regions where they cluster together might be of particular interest for this purpose (Fig. 5D).

**Fig. 5: Fitness values for mutations in the SARS-CoV-2 genome.**

Conversely, we also identified clusters of mutations with fitness values below 0.5, suggesting that these mutations are targets for negative selection. Our data suggests that one mechanism by which these clusters affect viral fitness is by disrupting secondary structures in the genome. Consistent with this idea, we identified multiple secondary structures in which mutations on either side of the structure lead to large fitness defects, even if they are hundreds of bases apart in the primary sequence of the genome (Fig. 5E, F). Figure S11 contains a comprehensive 2D map of the SARS-CoV-2 genome containing all the C → U mutations for which we established fitness values. In addition to the secondary structure of the genome, C → U mutations may also affect critical amino acids in protein structures. To visualize the potential impact of mutations on the proteome, we mapped clusters of mutations that are subject to negative selection onto the spike protein and the viral replisome (Fig. 5G–L and see also Supplementary Movie 1–4). Because the clusters of mutations that negatively affect the structure of the genome or the proteome highlight immutable components of the SARS-CoV-2 virus, they could be valuable targets for vaccines and treatments. Frequently, viruses develop resistance to vaccines and treatments by mutation, leading to variants that lack the targets that treatments or vaccines were developed against; however, because mutation of these essential components significantly lowers viral fitness, targeting these clusters could limit the number of escapees that emerge.
