Overall strategy
The mutation rates and spectra of RNA viruses (including SARS-CoV-2) are notoriously difficult to measure. For example, even though more than 8 million SARS-CoV-2 genomes have been documented across the globe (GISAID, https://gisaid.org/), this dataset only contains mutations that were successful enough to become major variants in patients. Accordingly, most studies of within-host genetic diversity are limited to variants with an allele frequency that exceeds 0.5%, while mutations that are detrimental to the virus are missed1. In contrast, the negative correlation between mutation rate and genome size observed in viruses suggests that the spontaneous mutation rate of the >30 kb SARS-CoV-2 genome is <1 × 10−5 per base2, which is significantly lower than the detection threshold of the sequencing methods used on patients3. As a result, RNA-sequencing methods with improved sensitivity are required to determine the mutation rate of SARS-CoV-2. With these considerations in mind, we used an ultra-sensitive and highly accurate rolling-circle RNA consensus sequencing method termed CirSeq4 to determine the mutation rate and spectrum of 6 SARS-CoV-2 variants. This method was previously used to determine the mutational landscape of other RNA viruses, including the polio virus5, the Ebola virus6, the Dengue virus7, and the Zika virus8. The improved accuracy of CirSeq relies on the circularization of short RNA fragments to synthesize long cDNA molecules that carry tandem repeats of the original RNA template. These tandem repeats can then be analyzed to generate a consensus sequence, which eliminates sequencing and reverse-transcription errors from the final sequencing results (Supplementary Fig. 1). Mutation frequencies are then obtained by dividing the number of mutations observed at a given position by the number of molecules that covered this position.
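The consensus logic described above can be illustrated with a minimal sketch. It assumes a simple majority vote across at least three aligned repeat copies; the function names, the masking rule and the toy sequences are hypothetical and are not taken from the CirSeq pipeline itself.

```python
from collections import Counter


def repeat_consensus(copies):
    """Majority-vote consensus across the tandem-repeat copies of one fragment."""
    if len(copies) < 3:
        raise ValueError("at least three repeat copies are needed for a vote")
    if len({len(copy) for copy in copies}) != 1:
        raise ValueError("repeat copies must be aligned to the same length")
    consensus = []
    for column in zip(*copies):
        base, votes = Counter(column).most_common(1)[0]
        # Keep a base only when a strict majority of copies agree; mask it otherwise.
        consensus.append(base if 2 * votes > len(column) else "N")
    return "".join(consensus)


def mutation_frequency(mutant_molecules, covering_molecules):
    """Mutations observed at a position divided by the molecules covering it."""
    if covering_molecules <= 0:
        raise ValueError("position has no coverage")
    return mutant_molecules / covering_molecules


reference = "ACGUGC"  # hypothetical reference fragment
# The G->A change at index 2 is in every copy (a template mutation);
# the C->A change at index 5 is in one copy only (a technical error).
copies = ["ACAUGC", "ACAUGA", "ACAUGC"]
consensus = repeat_consensus(copies)
differences = [(i, r, c) for i, (r, c) in enumerate(zip(reference, consensus)) if r != c]
print(consensus, differences)
```

In this toy case only the change shared by all copies survives into the consensus, which is the property that allows true mutations to be separated from sequencing and reverse-transcription errors.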
To explore the mutational landscape of SARS-CoV-2, we cultured the virus in VeroE6 cells, a preferred cell line for COVID-19 research because of its susceptibility to infection, efficient viral replication, and permissiveness to mutations9. Accordingly, VeroE6 cells can support a higher degree of viral genetic diversity than other cell lines, which is useful for studies that examine viral evolution during prolonged culture conditions. In total, we cultured 6 major strains of the SARS-CoV-2 virus, including the USA-WA1/2020, Alpha and Delta strains (corresponding to clades 19B, 20I and 21J, respectively), as well as the Beta, Gamma and Omicron strains. Although each strain was cultured in duplicate, the majority of our experiments were performed on the USA-WA1/2020, Alpha, and Delta strains, which we cultured over seven serial passages, while the Beta, Gamma, and Omicron strains were profiled for a single passage (Table 1). For the strains we cultured over seven passages, we initiated each passage at a low multiplicity of infection (MOI = 0.1) to minimize potential complementation effects. This strategy ensures that most cells are infected by a single virion during the initial phase of each passage, significantly reducing the likelihood of co-infections. Co-infections, where multiple viral particles infect the same cell, could allow defective viral genomes to be rescued by functional ones, distorting the mutation spectrum and artificially lowering the observed fitness cost of deleterious mutations. Thus, by maintaining a low MOI across passages, we consistently and repeatedly limit the propagation of defective genomes that may have been rescued transiently during the expansion phase of the prior cycle. Identical approaches were previously used to limit the impact of co-infections on fitness measurements of other viruses5. 
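The effect of a low MOI on co-infection can be estimated with a short calculation. The sketch below assumes that the number of virions entering a cell follows a Poisson distribution with a mean equal to the MOI, which is a common simplification and is not stated in the text; the function names are hypothetical.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k events for a Poisson distribution with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def infection_fractions(moi):
    """Fractions of cells receiving 0, 1, or more than 1 virion (Poisson assumption)."""
    uninfected = poisson_pmf(0, moi)
    single = poisson_pmf(1, moi)
    multiple = 1.0 - uninfected - single
    infected = 1.0 - uninfected
    return {
        "uninfected": uninfected,
        "single": single,
        "multiple": multiple,
        # Share of infected cells that received more than one virion.
        "coinfected_among_infected": multiple / infected,
    }


fractions = infection_fractions(0.1)
for name, value in fractions.items():
    print(f"{name}: {value:.4f}")
```

Under this assumption, roughly 5% of the cells infected at MOI = 0.1 receive more than one virion at the start of a passage, so complementation is limited but not excluded.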
Finally, because the VeroE6 cells were derived from the kidney of an African green monkey, we wanted to make sure that our measurements were not skewed by this unique biological environment. To do so, we also cultured the Delta strain for 1 passage in Calu-3 cells (a human lung adenocarcinoma cell line) and primary human nasal epithelial cells (HNEC) that were grown in an air–liquid interface (ALI), which more closely mimics human SARS-CoV-2 infections (Table 1). After each passage, we monitored the sequence of the SARS-CoV-2 genome by CirSeq to take a snapshot of its mutational landscape. A schematic of our cell culture and sequencing approach is depicted in Fig. 1. Across all strains and conditions, we sequenced ~200 billion bases and identified more than three million mutations. We then assigned the most common mutations a fitness value to determine whether they are subject to positive or negative selection and mapped these mutations onto the viral genome and proteome to determine the biological basis for selection.
Nasal swabs of SARS-CoV-2 patients were collected, and the virus was cultured in VeroE6 cells. Positive culture supernatant was serially diluted to an extinction endpoint, and dilutions with less than 33% positive wells were cultured further in VeroE6 cells to attain unique cultures. Two of these cultures were propagated per variant. The TCID50 was determined, and the virus was passaged at an MOI of 0.1 for all subsequent passages. Viral supernatant was concentrated, and RNA was extracted from 15 mL of viral culture. After extraction, viral RNA was cleaved into 60–80 bp fragments, which were ligated to themselves to form circular RNA molecules. These circular RNA molecules were then reverse transcribed to generate linear concatemers of the RNA template. If a mutation were present in the template (yellow line), this mutation would be present in every copy of the concatemer. In contrast, sequencing errors (green line) or reverse transcription errors (red line) would be present in only one copy, thereby allowing true mutations to be discriminated from technical artifacts. This figure was created in BioRender. Nijhuis (2025) https://BioRender.com/mftt8y6.
The mutation rate and spectrum of the SARS-CoV-2 genome
After we profiled the mutational landscape of the SARS-CoV-2 strains across the length of its genome (Fig. 2A), we used lethal and highly detrimental mutations to estimate their mutation rate. Because these mutations cannot be carried over between passages, they must be produced anew each generation, so that their frequency is equal to the mutation rate5. We used two complementary methods to identify these mutations. First, we considered mutations to be lethal or highly deleterious if they introduce premature stop codons (PTC) in the open reading frame of the RNA-dependent RNA polymerase (RdRP), an essential viral protein required for replication10. This strategy provides the most reliable way to identify lethal mutations and, by extension, to calculate the mutation rate. However, one limitation of this approach is that it cannot capture A → C, U → C, G → C, and A → G mutations, which cannot produce stop codons. To ensure a comprehensive assessment of the mutation rate across all base substitutions, we therefore employed a second, complementary strategy.
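The limitation of the PTC approach can be checked by enumeration. The sketch below assumes the standard genetic code with the stop codons UAA, UAG and UGA, and lists which of the 12 base substitutions can convert a sense codon into a stop codon; the function name is hypothetical.

```python
from itertools import product

BASES = "ACGU"
STOP_CODONS = {"UAA", "UAG", "UGA"}


def stop_creating_substitutions():
    """Return the (ref, alt) substitutions that can turn a sense codon into a stop codon."""
    creating = set()
    for codon in map("".join, product(BASES, repeat=3)):
        if codon in STOP_CODONS:
            continue  # only sense codons can gain a premature stop
        for pos, ref in enumerate(codon):
            for alt in BASES:
                if alt == ref:
                    continue
                mutant = codon[:pos] + alt + codon[pos + 1:]
                if mutant in STOP_CODONS:
                    creating.add((ref, alt))
    return creating


all_substitutions = {(ref, alt) for ref in BASES for alt in BASES if ref != alt}
creating = stop_creating_substitutions()
not_creating = all_substitutions - creating
print(sorted(f"{ref}->{alt}" for ref, alt in not_creating))
```

The four substitutions left over are the ones for which the PTC method provides no information, which is why a second, complementary strategy is needed.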
A Circos plot depicting the coverage and mutations detected in SARS-CoV-2 sequencing libraries. Black dots represent the mutations detected, with mutations detected at higher frequencies located further away from the center of the plot. The blue surface represents sequencing coverage, with higher coverage extending further away from the center of the plot. The two outer rings depict the genes present in the SARS-CoV-2 genome, while the numbering outside of these rings indicates the position of these genes along the genome in 1 kb steps. B Mutation rate of six variants of the SARS-CoV-2 virus. Each dot represents a single measurement of a variant at 1 out of 7 passages for 2 replicates (n = 14 for USA, Alpha, and Delta, Mann–Whitney U-test, two-sided alternative, no correction for multiple testing). The mutation rates of the Beta, Gamma, Delta, Delta Calu, and Omicron variants were only measured once, for one replicate and one passage, and are depicted here for comparison. All mutations are presented in the context of the sense strand of the SARS-CoV-2 genome. C The mutation spectrum of the USA, Alpha, Delta, and Omicron strains. Of these strains, USA, Alpha, and Delta were monitored in duplicate across seven passages, while Omicron was monitored once across one passage. Error bars represent the standard deviation of the mean across the seven passages.
For this strategy, we analyzed over eight million SARS-CoV-2 genomes previously aligned by UShER11,12 and Ensembl13 and identified mutations that are absent from these databases. These genomes represent the consensus sequences of the most common viral variants in individual patients, meaning that mutations with severe fitness consequences, including lethal or highly detrimental mutations, are unlikely to be present. Consistent with this idea, we found that the mutations identified through this method were significantly depleted in our experimental dataset (Supplementary Fig. 2), supporting their classification as highly detrimental or lethal. However, we noticed that 68 of these mutations were present in our own dataset at frequencies exceeding 1 × 10⁻⁴ (>10-fold higher than the average mutation frequency), strongly suggesting that they are neither lethal nor highly deleterious. Thus, we excluded them from the list of mutations used to determine mutation rates.
By combining these strategies, we created a comprehensive list of lethal and highly detrimental mutations and used it to calculate the mutation rate across the length of the SARS-CoV-2 genome. This analysis revealed that ~1.5 × 10⁻⁶ mutations occur per nucleotide per viral passage (Fig. 2B), whether the virus was grown in VeroE6 cells, Calu-3 cells, or primary HNEC grown in an ALI (Table 1). To ensure that our "combination strategy" was an appropriate tool to determine mutation rates, we also calculated separate mutation rates based on either the PTC or the "absent mutations" method and found that they yielded nearly identical mutation rate estimates, strongly supporting the idea that these approaches provide appropriate, complementary datasets for determining the mutation rate of the SARS-CoV-2 genome (Supplementary Fig. 3). Interestingly, we found that the Delta strain displayed the highest mutation rate of the three strains that we monitored over seven passages, potentially contributing to the increased virulence it displayed compared to the USA and Alpha strains. For each strain, we found that the mutation rate varied greatly between different base substitutions, ranging from ~2 × 10⁻⁵ for C → U mutations to ~1 × 10⁻⁶ for G → C mutations, with C → U substitutions being ~4 times more common than any other base substitution (Fig. 2C and Supplementary Fig. 4). The rate at which C → U substitutions arose depended to a significant degree upon the upstream (i.e., 5′ adjacent) and downstream (i.e., 3′ adjacent) nucleotides. For example, we found that C → U mutations occur most commonly in a 5′-UCG-3′ context (Fig. 3A–D), consistent with analyses based on SARS-CoV-2 phylogeny14. When taken together, these observations demonstrate that C → U substitutions add the greatest amount of genetic variation to the SARS-CoV-2 genome and provide the largest substrate for evolution to act upon, a conclusion that is also supported by more indirect observations15.
However, because our measurements are independent of positive and negative selection (which play a key role in shaping published SARS-CoV-2 genome sequences), our analyses provide an unfiltered view of the impact of genetic context on viral mutagenesis.
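The dependence of a substitution on its flanking bases can be tabulated by pooling counts per trinucleotide context. The sketch below is a simplified illustration that pools raw mutation counts and coverage per 5′-NCN-3′ context; it does not restrict the analysis to lethal mutations as the rate estimates in the text do, and the function name, input layout and numbers are hypothetical.

```python
from collections import defaultdict


def context_rates(reference, mutant_counts, coverage, focal_base="C"):
    """Pooled substitution frequency for each 5'-N[focal]N-3' context.

    mutant_counts and coverage map a 0-based position to the number of
    mutant molecules and the number of covering molecules, respectively.
    """
    mutants = defaultdict(int)
    depth = defaultdict(int)
    # The first and last bases lack a flanking neighbour and are skipped.
    for i in range(1, len(reference) - 1):
        if reference[i] != focal_base:
            continue
        context = reference[i - 1:i + 2]
        mutants[context] += mutant_counts.get(i, 0)
        depth[context] += coverage.get(i, 0)
    return {ctx: mutants[ctx] / depth[ctx] for ctx in depth if depth[ctx] > 0}


# Hypothetical toy data: two cytosines in different contexts.
reference = "AUCGAACAU"
c_to_u_counts = {2: 40, 6: 5}
coverage = {2: 1_000_000, 6: 1_000_000}
rates = context_rates(reference, c_to_u_counts, coverage)
print(rates)
```

In this toy input, the cytosine in the UCG context carries more C → U observations than the one in the ACA context, mirroring the kind of comparison shown in Fig. 3.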
A–D The mutation rate of the SARS-CoV-2 genome differs depending on the bases that directly flank the focal base. The focal base (the mutated base at the center of a triplet) is listed underneath the graph, while the bases that are on its 5′ or 3′ side are located above and below the focal base. The type of mutation that is analyzed is depicted above the bars. So, for example, the first 4 bars on the left-hand side of (A) correspond to an adenine base that is at the center of the triplet (the focal base), and is flanked on its 5′ side by adenine, and on its 3′ side by one of four possible bases. Each bar then corresponds to the impact of these flanking bases on the mutation of adenine (the focal base) to uracil. For example, the mutation rate of A → C is highest when adenine is flanked by guanine on its 5′ side and 3′ side. In contrast, the mutation rate of A → C substitutions is lowest when adenine is flanked on its 5′ side by uracil, and by adenine, cytosine, or guanine on its 3′ side. Note that the y-axes for each panel differ for increased visibility. Values are averages of all passages generated (n = 47); error bars represent the standard deviation of the mean.
Notably, the mutation rate of SARS-CoV-2 is ~10-fold lower than that of the poliovirus5 and ~5-fold lower than that of the Dengue virus7, two other RNA-based viruses previously examined by CirSeq. The decreased mutation rate of the SARS-CoV-2 genome is most likely due to the proofreading ability of its RdRp10,16, which is absent in the polio and Dengue viruses. In this context, it is important to note that G → A and U → C mutations displayed the largest reduction in mutation rate compared to the polio virus (48-fold and 28-fold, respectively, Fig. S5). When the proofreading activity of eukaryotic RNA polymerase II is compromised17,18,19, these base substitutions increase the most, suggesting the existence of a universal set of rules that govern the proofreading capabilities of RNA polymerases in eukaryotes and viruses.
Selection for nucleotide composition
It is likely that the mutation rate and spectrum of the SARS-CoV-2 genome affect the evolution of the virus in various ways. One of the most fundamental attributes of a genome is its nucleotide composition, which depends on the balance between the mutation spectrum and the intensity of selection for each of the four nucleotides. Using the mutation rate for each of the 12 possible base substitutions, we estimate the equilibrium frequencies for all four nucleotides as: U = 0.42, A = 0.29, G = 0.21, and C = 0.07 (Table 2). This analysis translates into an equilibrium GC content of 28%, which is substantially higher than the 17% previously reported15. However, this previous estimate is based on indirect estimations of mutation rates at 4-fold degenerate sites across lineages sequenced in GISAID, which might be impacted by selection. Regardless, both estimates are significantly lower than the observed 38% GC content of the SARS-CoV-2 genome, indicating that the GC content in the SARS-CoV-2 genome is actively preserved by natural selection, particularly in the case of cytidines. Cytidines were even preserved at 4-fold degenerate sites (Table 2), suggesting that natural selection also preserves cytidines at sites where mutations would not alter amino acid composition. This pattern indicates a broader, possibly structural or regulatory, role for cytidines in the SARS-CoV-2 genome. To gain more insight into the molecular mechanisms that suppress cytidine depletion at 4-fold degenerate sites, and examine the impact of C → U mutations on viral fitness, we calculated fitness values for 3603 C → U mutations that were scattered across the SARS-CoV-2 genome.
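The equilibrium composition implied by a mutation spectrum can be computed as the stationary distribution of a four-state substitution model. The sketch below assumes that the 12 substitution rates act independently at every site and that selection is absent; the rates used here are placeholders chosen for illustration and are not the measured values from Table 2.

```python
BASES = "ACGU"


def equilibrium_frequencies(rates, tol=1e-12, max_iter=100_000):
    """Stationary base composition for rates[(x, y)] = rate of x -> y substitutions."""
    outflow = {x: sum(rates[(x, y)] for y in BASES if y != x) for x in BASES}
    # Rescaling time leaves the equilibrium unchanged and speeds up convergence.
    scale = 0.5 / max(outflow.values())
    freq = {base: 0.25 for base in BASES}
    for _ in range(max_iter):
        new = {}
        for y in BASES:
            stay = freq[y] * (1.0 - scale * outflow[y])
            gain = sum(freq[x] * scale * rates[(x, y)] for x in BASES if x != y)
            new[y] = stay + gain
        if max(abs(new[b] - freq[b]) for b in BASES) < tol:
            return new
        freq = new
    return freq


# Placeholder spectrum: every substitution at 1e-6 except an elevated C -> U rate.
example_rates = {(x, y): 1e-6 for x in BASES for y in BASES if x != y}
example_rates[("C", "U")] = 2e-5
equilibrium = equilibrium_frequencies(example_rates)
gc_content = equilibrium["G"] + equilibrium["C"]
print({base: round(value, 3) for base, value in equilibrium.items()}, round(gc_content, 3))
```

With an elevated C → U rate, the model drifts toward a U-rich, C-poor composition, which is the qualitative pattern behind the low equilibrium GC content reported above. Comparing such an equilibrium with the observed composition is what indicates selection.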
Fitness landscape of SARS-CoV-2
Because fitness analyses require large amounts of data gathered from a single strain over an extended period of time, we selected one replicate of the SARS-CoV-2 Delta variant and tracked it over the course of seven passages. After each passage, we monitored its genome by CirSeq, ultimately sequencing 155 billion bases and covering each base 1.7 million times on average. This sequencing effort allowed us to identify 64,967 unique mutations across all passages, with each mutation being observed 42 times on average, for a total of 2.7 million mutation observations. Because the SARS-CoV-2 genome is ∼30,000 bases in length, and each base can be mutated into 3 different nucleotides, a total of ∼90,000 base substitutions is theoretically possible, meaning that we identified 66% of all possible mutations in the SARS-CoV-2 genome. We then used this dataset to determine the consequences of C → U mutations on viral evolution by characterizing their impact on the fitness of the SARS-CoV-2 virus with a strategy previously employed for the polio virus5. Due to technical considerations, we did not determine fitness values for other base substitutions (see "Methods" section). Briefly, the fitness of a mutation is related to its change in frequency between consecutive passages as described in the following equation:
$$f_{n} = f_{n-1} \times w + \mu_{n-1} \tag{1}$$
Here, fn and fn−1 are the observed frequencies of the mutation at passages n and n − 1, w is the relative fitness, and µ is the rate of C → U substitutions (Fig. 4A and Supplementary Data 1). These fitness values were calculated as a weighted average of the fitness values derived at each of the seven passages, so that values with higher coverage (and thus higher precision) contribute more to the final estimates. We performed three tests to determine the validity of these fitness values. First, we separated the C → U mutations into three groups (synonymous, non-synonymous, and non-sense) and found that, as expected, synonymous C → U mutations were less deleterious than non-synonymous C → U mutations (0.78 vs 0.70, P = 2.0 × 10−6, Mann–Whitney U-test), and non-synonymous mutations were less deleterious than non-sense mutations (0.70 vs 0.62, P = 0.006, Mann–Whitney U-test). In a second test, we examined the fitness values of mutations that were predicted to be either lethal or highly detrimental because they produce a PTC, or because they were absent from the alignment of 8 million genomes. We found that these mutations displayed significantly lower fitness values compared to all the other mutations we detected (mean fitness: 0.38 vs 0.73, P = 3.4 × 10−4, Mann–Whitney U-test). Moreover, mutations with similar fitness values frequently clustered together, as expected of mutations that affect similar regions of the genome or the proteome (Fig. 4B). Finally, we compared our fitness estimates to studies that used changes in mutation frequency throughout the SARS-CoV-2 phylogeny to infer fitness values20,21. When we compared our values to the only other study to provide fitness estimates for both synonymous and nonsynonymous mutations21, we found a moderate but significant correlation between our data and this independent dataset (r = 0.47, P < 2 × 10−16, Fig. S6). Together, these analyses strongly support the idea that our algorithms provide predictive information about the impact of mutations on viral fitness.
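Equation (1) can be rearranged to w = (fn − µ) / fn−1, which gives one fitness estimate per pair of consecutive passages. The sketch below illustrates this rearrangement together with a coverage-weighted average. It assumes a single, constant µ and uses the coverage at passage n as the weight; the exact weighting scheme, the function names and the numbers are assumptions made for illustration.

```python
def passage_fitness(f_prev, f_next, mu):
    """Solve f_next = f_prev * w + mu for the relative fitness w."""
    if f_prev <= 0:
        raise ValueError("previous frequency must be positive")
    return (f_next - mu) / f_prev


def weighted_fitness(frequencies, coverages, mu):
    """Coverage-weighted mean of the per-passage fitness estimates."""
    estimates, weights = [], []
    for n in range(1, len(frequencies)):
        if frequencies[n - 1] <= 0:
            continue  # no estimate possible without a starting frequency
        estimates.append(passage_fitness(frequencies[n - 1], frequencies[n], mu))
        weights.append(coverages[n])  # assumed weight: depth at passage n
    if not weights or sum(weights) == 0:
        raise ValueError("no usable passages")
    return sum(w * c for w, c in zip(estimates, weights)) / sum(weights)


# Hypothetical trajectory generated with w = 0.7 and mu = 2e-5 over seven passages.
true_w, mu = 0.7, 2e-5
frequencies = [1e-4]
for _ in range(6):
    frequencies.append(frequencies[-1] * true_w + mu)
coverages = [1_500_000, 1_800_000, 1_600_000, 1_700_000, 1_900_000, 1_650_000, 1_750_000]
estimate = weighted_fitness(frequencies, coverages, mu)
print(round(estimate, 6))
```

Because the toy trajectory is noise-free, the estimator recovers the fitness used to generate it; with real data, the weighting reduces the influence of low-coverage passages.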
A Fitness distribution of the 3603 C → U mutations for which we calculated fitness values. B Mutations with similar fitness values tend to cluster together. Blue bases indicate locations where C → U mutations are positively selected for, and orange bases indicate locations where C → U mutations are selected against. The intensity of the color indicates the intensity of selection. Five clusters are highlighted. C, D Fitness distribution of all synonymous C → U mutations for which we calculated fitness values, split up into paired or unpaired bases. E, F Fitness distribution of all non-synonymous C → U mutations for which we calculated fitness values, split up into paired or unpaired bases. G, H Fitness distribution of all non-sense C → U mutations for which we calculated fitness values, split up into paired or unpaired bases. I Mutation rate of all mutations, split up into paired or unpaired bases (P = 0.0006 for C-to-U, Mann–Whitney U-test, n = 7). J Presence of paired and unpaired bases in hot spots or cold spots of mutation. K Mutation rate of cytosine in CpG vs non-CpG islands (P = 0.002, Mann–Whitney U-test, n = 7). All analyses were done by the Mann–Whitney U-test. **P < 0.01; ***P < 0.001. All error bars represent the standard deviation of the mean across the seven passages of the Delta (replicate B) virus.
Paired bases contribute disproportionately to SARS-CoV-2 fitness
Next, we used our fitness values to investigate why synonymous C → U mutations are selected against in the SARS-CoV-2 genome, even if they occur at 4-fold degenerate sites. Potentially, this phenomenon could be explained by stronger, more frequent purifying selection against synonymous mutations in SARS-CoV-2 compared to other viruses, such as the polio virus5. It was recently shown that the SARS-CoV-2 genome adopts a highly specific secondary structure22 and that bases that pair with each other to form these structures tend to display lower nucleotide diversity23. Interestingly, we observed a similar specificity for secondary structures in our CirSeq dataset. The enzyme used to fragment viral RNA (RNAse III) prefers to cleave RNA at specific double-stranded structures, causing strong peaks and valleys in genome coverage that reflect the secondary structure of the SARS-CoV-2 genome. We found that these coverage peaks are identical between all the variants we tested, indicating that the secondary structure of the genome is highly conserved across the SARS-CoV-2 phylogeny (Fig. S7). Based on these observations, we hypothesized that the need to preserve this secondary structure could be a significant factor driving purifying selection against synonymous mutations. To test this hypothesis, we split synonymous C → U mutations into two groups: those that form base-pair interactions (henceforth referred to as “paired” sites) and those that do not (henceforth referred to as “unpaired” sites). This classification is based on a study that used DMS MapSeq to determine whether nucleotides are paired or not22.
Consistent with the idea that there is strong purifying selection against synonymous mutations that affect secondary structures in the SARS-CoV-2 genome, we found that the average fitness value of synonymous C → U mutations was lower at paired sites compared to unpaired sites (0.60 vs 0.93, P < 2 × 10−16, Mann–Whitney U-test, Fig. 4C, D). We observed a similar pattern for nonsynonymous mutations (average fitness: 0.50 vs 0.81 for paired and unpaired sites, respectively, P < 2 × 10−16, Mann–Whitney U-test, Fig. 4E, F) and non-sense mutations (average fitness: 0.29 vs 0.71 for paired and unpaired sites, respectively, P < 2 × 10−16, Mann–Whitney U-test, Fig. 4G, H). To support this idea further, we re-examined our fitness values with the help of an independent assessment of secondary structures based on SHAPE scores24 and found a weak but significant positive correlation between the SHAPE reactivity score and our fitness estimates for synonymous C → U mutations (r = 0.28, P < 2 × 10−16, Fig. S8). Taken together, these results suggest that mutations that disrupt base-pairing interactions are more likely to be deleterious to SARS-CoV-2 fitness than those that do not.
Because our fitness estimates are limited to C → U mutations, we used the fitness estimates previously published by Bloom and Neher21 to investigate if purifying selection for synonymous mutations at paired sites was present for all types of base-substitutions. Restricting our analysis to base-substitutions with enough observations in both paired and unpaired categories, we found that synonymous mutations are significantly more deleterious at paired vs unpaired sites for U → C, G → U, G → C, C → U, C → A, and A → U base substitutions (P < 0.01 for all, Mann–Whitney U-test, Fig. S9). U → A, U → G, C → G, and A → C substitutions did not yield enough observations to calculate fitness values, while two types of base-substitutions (G → A and A → G) showed no significant difference. Interestingly, though, it was previously shown that A:C and G:U base pairs (which would arise from G → A and A → G mutations, respectively) allow wobble base pairing in RNA molecules25. Therefore, it is possible that these mutations do not significantly alter the secondary structure of the SARS-CoV-2 genome, even when they occur at paired sites, allowing them to escape purifying selection.
Paired bases display a reduced mutation rate
Our data suggest that the secondary structure of the SARS-CoV-2 genome is critical for viral fitness and that SARS-CoV-2 conserves these structures by strong purifying selection. However, the pace of evolution is also controlled by the mutation rate. Accordingly, we wanted to test the impact of the secondary structure on the mutation rate. To do so, we compared the rate of mutation between paired and unpaired bases and found that C → U mutations (but not other base substitutions) are ~3 times more frequent at unpaired bases compared to paired bases in all strains (P < 0.01, Figs. 4I and S10). Other base substitutions are not increased at unpaired bases (Fig. 4I), indicating that the mechanism responsible for this observation is highly specific. A similar discrepancy is seen at mutational hot spots and cold spots, which are defined as locations where the mutation frequency either increases or decreases 10-fold. In hot spots, only 14.2% of nucleotides are predicted to be paired (vs 47.3% overall, P < 2 × 10−16, chi-square test), while paired nucleotides make up 81.1% of bases in cold spots (P < 2 × 10−16, chi-square test, Fig. 4J). One potential explanation for this observation is spontaneous cytidine deamination, as unpaired cytidines are 140 times more prone to spontaneous deamination into uracil than paired cytidines26. In support of this idea, we found that C → U mutations are elevated at CpG sites (Fig. 4K). The electron density of guanine slightly alters a cytosine's electron distribution, particularly around the amino group at the 4-position, thereby increasing the likelihood of spontaneous hydrolytic deamination27,28. Another possibility is that the reduced rate of cytidine deamination at paired sites reflects the preference of APOBEC proteins for ssRNA. For example, APOBEC3A has a strong preference for unpaired cytidines that are flanked by a 5′ uracil and a 3′ guanosine29,30,31, the exact conditions that show the highest C → U mutation rate in our dataset (Fig. 3C).
Regardless of the mechanism, these observations suggest that the secondary structure of the SARS-CoV-2 genome is preserved not only by strong purifying selection, but also by local changes in the mutation rate that spare paired bases. Since paired bases display lower C → U mutation rates, we hypothesized that selection should result in an excess of essential components of proteins in paired regions of the genome. In support of this hypothesis, we found that the more detrimental a mutation is for a protein, the greater the chance it is located in a paired region (Fig. 4D, F, H). For example, bases in which C → U mutations would result in a premature stop codon with a fitness value of 0 have a 60% chance of being placed in a paired region, compared to 40% for non-synonymous mutations and 30% for synonymous mutations. Moreover, when we compiled a short-list of 120 amino acids that have been proposed to undergo mutational scanning, conservation between coronaviruses and patient information, [...]. Together, these observations suggest a synergistic relationship between the secondary structure of the SARS-CoV-2 genome and its mutation rate, which reinforce each other to promote viral fitness.
Fitness values, viral evolution, and potential weaknesses of SARS-CoV-2
Because fitness values reflect the forces of natural selection, we wondered whether they could predict the evolution of the SARS-CoV-2 virus. Since the emergence of the Delta variant that we sequenced for our fitness analysis, multiple variants have evolved and swept the globe. Each of these strains contains defining mutations that were positively selected for during the evolution of SARS-CoV-2 in human populations (Fig. 5A, B). Interestingly, some of these mutations were also detected in our short-term evolution experiment. When we examined the fitness values of the mutations that define strain 23H (the most advanced strain at the time of writing), we found that these mutations displayed significantly higher fitness values compared to all other mutations detected in our evolution experiment (0.97 vs 0.72, P = 5 × 10−5, Wilcoxon rank sum test). Thus, the fitness landscape obtained from our dataset could help predict the mutations that arise in future variants. Accordingly, mutations with high fitness values that have not been observed in known variants so far could be of interest to researchers trying to predict the evolutionary trajectory of SARS-CoV-232,33. Although mutations with high fitness values tend to be dispersed across the SARS-CoV-2 genome (Fig. 5C), regions where they cluster together might be of particular interest for this purpose (Fig. 5D).
A Distribution of fitness values for all defining C → U mutations in SARS-CoV-2 clade 23H compared to all other mutations detected during our short-term evolution experiment (P = 5 × 10−5, Wilcoxon rank sum test, two-sided alternative). B Details of all defining C → U mutations in SARS-CoV-2 clade 23H. C Although mutations with similar fitness values tend to cluster together, clusters that are positively or negatively selected for tend to be semi-randomly distributed across the genome. Orange bases indicate locations where C → U mutations are selected against, and blue indicates positive selection. The intensity of the color indicates the intensity of selection. D–F However, some regions are strongly enriched in either positive or negative selected cytosines. G–L We plotted the mutations that are most detrimental to viral fitness (ω < 0.5) on the structure of the SARS-CoV-2 spike protein (G–I) and the replisome (J–L) in various orientations. The most interesting mutations are those that occur in clusters, highlighting regions of the viral proteome, or the underlying structure of the genome, that are especially vital to fitness. Three subunits of the spike protein are depicted in pink, blue, and green. Mutations that are predicted to be detrimental are depicted in red (all mutations detected were detrimental) or purple (a subset of mutations detected were detrimental). In J–L, individual units of the replisome are depicted in various colors, including the exonuclease, polymerase, and the viral genome itself.
Conversely, we also identified clusters of mutations with fitness values below 0.5, suggesting that these mutations are targets for negative selection. Our data suggest that one mechanism by which these clusters affect viral fitness is by disrupting secondary structures in the genome. Consistent with this idea, we identified multiple secondary structures in which mutations on either side of the structure lead to large fitness defects, even if they are hundreds of bases apart in the primary sequence of the genome (Fig. 5E, F). Figure S11 contains a comprehensive 2D map of the SARS-CoV-2 genome containing all the C → U mutations for which we established fitness values. In addition to the secondary structure of the genome, C → U mutations may also affect critical amino acids in protein structures. To visualize the potential impact of mutations on the proteome, we mapped clusters of mutations that are subject to negative selection onto the spike protein and the viral replisome (Fig. 5G–L; see also Supplementary Movies 1–4). Because the clusters of mutations that negatively affect the structure of the genome or the proteome highlight immutable components of the SARS-CoV-2 virus, they could be valuable targets for vaccines and treatments. Viruses frequently develop resistance to vaccines and treatments by mutation, leading to variants that lack the targets that these treatments or vaccines were developed against; however, because mutation of these essential components significantly lowers viral fitness, targeting these clusters could limit the number of escape variants that emerge.




