Genomic data acquisition and lineage classification
We queried the complete global dataset of 16.6 million whole genome sequences of SARS-CoV-2, along with related metadata, available on the GISAID database (https://gisaid.org) as of 28 February 2024. Of these, approximately 8 million sequences, classified using the ‘Phylogenetic Assignment of Named Global Outbreak Lineages’ (PANGOLIN) method19, belong to the BA lineages of the Omicron variant, which served as the focus of our study. To facilitate downstream analysis, we grouped lineages into their broader classification (BA.1 through BA.5, and BA.2.86) while distinguishing all descendant and recombinant Omicron lineages according to their respective pango-lineage designation, as outlined in the sequence designation list available on the PANGO GitHub repository (github.com/cov-lineages/pango-designation/milestones).
Sequence alignment and metadata extraction related to sequence quality were performed using Nextstrain’s Nextclade CLI20,21. The resulting metadata was used to filter the dataset, retaining only sequences with complete collection dates and high coverage (>90% genome coverage) that passed Nextclade’s quality check. Additionally, sequences with clade-level inconsistencies between GISAID’s PANGOLIN assignment and Nextclade’s PANGO lineage designation were excluded. For a detailed breakdown of sequence counts by BA lineage, refer to Supplementary Table 1.
Genomic sampling
Given the sensitivity of ancestral state reconstruction methods to sampling biases22, we employed a subsampling strategy to generate representative datasets for each lineage. For this purpose, we used subsampler (https://github.com/andersonbrito/subsampler), a pipeline designed to systematically subsample genomic data based on epidemiological time series data. This approach allocated sequences per location, by country and province for global and subnational-level investigations, respectively, in proportion to reported case counts accounting for geographic and epidemiologic diversity. To ensure temporal representativeness, we applied a weekly sampling criteria, aggregating reported cases and sequences on a weekly basis. A crude estimate of the number of weekly cases by lineage was derived by multiplying the number of reported cases by the proportion of sampled genomes designated to the respective BA lineages, using metadata from GISAID. This systematic subsampling approach ensured that the datasets accurately represented the temporal and proportional circulation of the lineages, aligning with the underlying epidemiological trends over time. Specific considerations for subsampling, according to the objectives of each analysis, are detailed below.
Global ‘mugration’ analysis
To establish and confirm the broader global origins of the respective BA lineages, we subsampled 5000 globally representative sequences per lineage using the subsampler tool based on available case count data (scaled by population) from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (https://github.com/CSSEGISandData/COVID-19). To infer the early movements and global origins of the respective lineages, we performed a ‘mugration’ analysis with TreeTime23, modelling transitions between discrete geographic states at the country level. Maximum-likelihood tree topologies for each lineage were inferred using IQ-TREE version 2.3.624 under a GTR model of nucleotide substitution, with branch support estimated using ultrafast bootstrap approximation (1000 replicates) to ensure statistical robustness25. Temporal molecular clock signals were evaluated using the clock functionality of TreeTime, and outlier sequences that violated the strict molecular clock assumption were identified and removed using the Ape package in R. The refined trees were then transformed into time-scaled phylogenies in TreeTime. Discrete country locations were mapped to tips, and ancestral country locations were inferred for internal nodes under a GTR model using the ‘mugration’ extension of TreeTime. Finally, a custom python script (https://github.com/CERI-KRISP/SARS_CoV_2_VOC_dissemination) was used to derive the number of state changes over the span of the tree. The first 100 inferred transition states were visualised using ggmap and ggplot2 in R on a global map with country-level boundaries from rnaturalearth, providing a comprehensive overview of global spatial dynamics during the early dispersal phase of each lineage.
Phylogenetic investigation
To investigate the evolutionary dynamics of the six lineages in the region associated with their emergence, we applied the previously described subsampling strategy, utilising provincial-level case data from COVID-19 South Africa (https://www.covid19sa.org/provincial-breakdown). For each lineage, we selected 200 sequences from the early phase of each wave, up to the peak in cases inferred from genomic data, while ensuring proportional representation across provinces and retaining the first 20 identified sequences per lineage. Preliminary maximum likelihood trees for each lineage, as well as for the combined set of lineages, were generated using IQ-TREE version 2.3.624. We then inspected these trees using TempEst version 1.5.326 to assess the strength of temporal signal and to examine the root-to-tip distance versus genetic regressions for outlier identification27, with further validation performed using TreeTime23. Linear regression of root-to-tip genetic distances against sampling dates indicated that the sequences had been evolving in a strong clock-like manner (correlation coefficient = 0.98, R2 = 0.95), warranting the use of a strict clock model for further Bayesian phylogenetic analyses. We then estimated time-calibrated phylogenies using BEAST version 10.5.028, applying a strict molecular clock model, the GTR + I + G nucleotide substitution model, and an exponential growth coalescent model29. Model parameter selection was validated through marginal likelihood estimation (MLE) across a range of parameter combinations (Supplementary Table 2). We performed Markov Chain Monte Carlo (MCMC) sampling of the model parameter space with runs of 100 million states each, sampling every 10,000 steps. Proper mixing of the MCMC was verified using Tracer version 1.7.330. MCC trees were annotated with model parameters (and their 95% credibility intervals) and used to summarise the sampled MCMC chain states using TreeAnnotator after discarding 10% of samples as burn-in. The MCC tree was visualised using the R packages ggplot2 and ggtree31,32.
Phylogeographic inference
To obtain information on the geographical origins of the six lineages within southern Africa we inferred the geographical locations of ancestors of the sampled sequences: information that, given the geographical coordinates where sequences were sampled, is expected to be encoded in the phylogenetic relationships of the sampled sequences. We constructed a time-resolved MCC phylogenetic tree to determine when the ancestral sequences represented by phylogenetic tree nodes existed: i.e., we determined the tMRCA of every pair of sampled sequences represented in the tree (Fig. 3a). We then performed a phylogeographic analysis, to identify both the most probable geographical location coordinates of the MRCAs of each of the BA.1, BA.2, BA.3, BA.4, BA.5 and BA.2.86 lineages and the 95% HPD regions that bounded these coordinates. Lastly, we used the inferred geographical coordinates of the sampled sequences in each lineage to infer the most likely geographical coordinates of the MRCAs of every pair of sequences and used these together with the temporal scaling of the phylogenetic trees to track the spatiotemporal dispersal dynamics of the lineages during the earliest stages of their emergence (Fig. 3b).
The same subset of sequences that was used to produce the time-scaled MCC tree was used to perform continuous phylogeographic reconstructions of the respective BA lineages. The spatial resolution was determined by the available sampling location information for each sequence, ranging from province (all) to district (~34%) level. For sequences with only province-level information, the coordinates of the province’s most densely populated city were used. We first constructed maximum likelihood trees for each lineage separately and visualised these in TempEst to assess whether each dataset contained a sufficiently strong temporal signal to warrant the use of a molecular clock model, and to remove potential outliers27. We then used the ModelFinder33 component of IQ-TREE to identify GTR + I + G as a suitable nucleotide substitution model for all of the datasets. To model the geographical movements of the respective BA lineages across South Africa, we used a flexible relaxed random walk diffusion model implemented in BEAST that accommodates branch-specific variation in rates of dispersal with a Cauchy distribution that is well-suited to assumptions of SARS-CoV-2 transmission dynamics, and which enables estimation of diffusion statistics34,35. In addition to the spatial diffusion and nucleotide substitution models we applied a strict clock model and an exponential growth coalescent model, as identified through MLE model comparison (Supplementary Table 2).
MCMC chains were run in duplicate for 300 million steps with sampling every 30,000 steps, and convergence being assessed using Tracer. MCC trees were produced from the sampled MCMC chain states using TreeAnnotator after discarding 10% of samples as burn-in. We used the R package seraphim36 to extract and map spatiotemporal information embedded in the sampled trees.
To investigate whether the predicted origins of the respective BA lineages could be traced to one or more neighbouring countries of South Africa, we additionally conducted discrete phylogeographic reconstructions of each lineage at a regional level (South Africa and its neighbouring countries). We utilised the same sampling strategy as described for the continuous phylogeographic reconstruction, subsampling 200 sequences per lineage based on available case count data (scaled by population). We first reconstructed phylogenies using 100 million MCMC iterations without incorporating location data. We then parameterised the Continuous-Time Markov Chain (CTMC) model using PrioriTree37, specifying a dispersal process with an asymmetric prior, where relative dispersal rates between countries were estimated while allowing for directional differences in movement probabilities. The CTMC models were run in duplicate for 50 million MCMC iterations each, with the MCC trees produced from the sampled MCMC chain states using TreeAnnotator after discarding 10% of samples as burn-in. The spatiotemporal information embedded in the sampled trees was extracted and visualised using SpreaD338 software to determine the dispersal dynamics of the respective BA lineages across the region.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.