Method overview
Our analysis is based on the detection of superspreading events and the assignment of containment scores to each event by quantifying secondary infections (Fig. 1b). As the first step of the pipeline, we download and preprocess the GISAID EpiCoV database29. Unfortunately, the sequencing rate in Hungary was too low for a meaningful comparison with the survey results. In the interest of data quality and a close match with the survey experiment, we focused on sequences collected in European countries with a sequencing rate of at least 2% from the Delta, Omicron BA.1 and BA.2 variants. For our analysis, we mainly relied on the amino-acid-level substitution dataset precomputed from the raw clinical genetic sequences by the GISAID pipeline – a dataset that has been previously used to detect variants of interest30 and to visualize mutation trends42. We partition the genetic sequences with identical amino acid substitutions into subsets, which we call collision clusters (CCs). We group together collision clusters that were collected in the same country and that belong to the same variant, filtering out clusters that are prevalent in multiple countries. Following43, we assume that SARS-CoV-2 viruses from the same variant had similar fitness profiles, there was no significant selection between them, and the infection probability and recovery time of the patients were similar.
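The partitioning into collision clusters can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the record fields (`id`, `country`, `variant`, `substitutions`) and the toy records are hypothetical and do not reflect the GISAID schema. As a simplifying assumption, a cluster is dropped as soon as it appears in more than one country, whereas the text only filters clusters that are prevalent in multiple countries.

```python
from collections import defaultdict


def collision_clusters(records):
    """Partition records by their exact set of amino acid substitutions."""
    clusters = defaultdict(list)
    for rec in records:
        # frozenset makes the key independent of the order of substitutions
        clusters[frozenset(rec["substitutions"])].append(rec)
    return clusters


def group_by_country_variant(clusters):
    """Keep clusters seen in a single country and variant (a simplification),
    and group their sequence ids by (country, variant)."""
    grouped = defaultdict(list)
    for recs in clusters.values():
        countries = {r["country"] for r in recs}
        variants = {r["variant"] for r in recs}
        if len(countries) == 1 and len(variants) == 1:
            pair = (countries.pop(), variants.pop())
            grouped[pair].append([r["id"] for r in recs])
    return dict(grouped)


# Toy records, made up for illustration only.
records = [
    {"id": "s1", "country": "BE", "variant": "Delta",
     "substitutions": ["Spike_L452R", "Spike_T478K"]},
    {"id": "s2", "country": "BE", "variant": "Delta",
     "substitutions": ["Spike_T478K", "Spike_L452R"]},
    {"id": "s3", "country": "DE", "variant": "Delta",
     "substitutions": ["Spike_L452R"]},
    {"id": "s4", "country": "BE", "variant": "Delta",
     "substitutions": ["Spike_L452R"]},
]
grouped = group_by_country_variant(collision_clusters(records))
```

In this toy input, `s1` and `s2` form one Belgian cluster, while the cluster shared by `s3` and `s4` spans two countries and is filtered out.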
We detect superspreading events in each collision cluster by tracking unexpectedly large increases in their size after proper normalization (see Methods). Our superspreading event detection method is closely related to previous thresholding approaches33,40 and requires only minor preprocessing. The detected events agree with our intuition after visual inspection (Fig. 2b) and after a more in-depth analysis based on location metadata in Supplementary Section A. Thereafter, we assign Event Containment Scores (ECSs) to each superspreading event by comparing the size of the collision clusters after superspreading events and after appropriately selected baseline events during the same time period (see Methods). Finally, to acquire aggregate descriptions of event containment, we compute the median of the ECS values in each country-variant pair c, denoted by MECS, which is the output of the pipeline in Fig. 1b. Intuitively, a positive MECS means that superspreading events typically led to smaller collision cluster sizes, and therefore fewer secondary infections, than the baselines, i.e., the superspreading events were well-contained (Fig. 2b, red squares). Similarly, a negative ECS would suggest superspreading events that were not contained as well as the baselines (Fig. 2b, blue squares).
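The final aggregation step, taking the median of the ECS values per country-variant pair, can be sketched as below. The ECS values themselves come from the procedure described in Methods and are treated here as given inputs; the function name, the `min_events` parameter and the toy numbers are illustrative assumptions.

```python
from collections import defaultdict
from statistics import median


def mecs(ecs_records, min_events=1):
    """Median ECS per (country, variant) pair.

    ecs_records: iterable of (country, variant, ecs) tuples.
    min_events: hypothetical filter on the number of events per pair.
    """
    by_pair = defaultdict(list)
    for country, variant, ecs in ecs_records:
        by_pair[(country, variant)].append(ecs)
    return {
        pair: median(values)
        for pair, values in by_pair.items()
        if len(values) >= min_events
    }


# Toy ECS values, made up for illustration only.
toy = [
    ("BE", "Delta", 1.0),
    ("BE", "Delta", -0.5),
    ("BE", "Delta", 2.0),
    ("DE", "Delta", -1.0),
]
```

With these toy inputs, the Belgian pair has a positive median and the German pair a negative one; raising `min_events` to 2 drops the German pair.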
a Bar plot showing the weekly number of SARS-CoV-2 genetic sequences collected in Belgium and shared via the GISAID platform for each major variant from July 2021 to July 2022. Dashed lines indicate the weeks when a new variant became dominant. The solid red line represents the number of reported SARS-CoV-2 cases. b Temporal evolution of seven identified collision clusters in Belgium. Within these clusters, our proposed thresholding approach detected four superspreading events, marked with square symbols — typically occurring near the beginning of a cluster. The color of each square represents the sign of the corresponding containment score.
Both the superspreading event detection and the ECS assignment algorithms are efficient but imperfect methods, potentially introducing significant amounts of noise in our results. However, we expect that if enough superspreading events are detected in a country-variant pair, the median of the ECS values will still contain information about event containment, and subsequently, local awareness behavior. We confirm this hypothesis by simulation results and by the analysis of COVID-19 genetic sequences.
Event Containment Scores on Synthetic Genetic Sequence Data
We set up a synthetic pipeline (Fig. 1b) to generate genetic sequence datasets similar to the GISAID EpiCoV dataset, which we can analyze with our superspreading event detection and ECS assignment pipeline. First, we simulate Susceptible-Infected-Recovered (SIR) epidemics on various synthetic and real networks, then we apply the Jukes-Cantor44 genetic substitution model on the resulting infection tree to produce genetic sequence data (see Methods). To model the combined effect of not all infectious individuals being identified (detection rate), and not all identified individuals being sequenced (sequencing rate), we randomly subsample the generated sequences with probability p. Finally, we compute the MECS values as before, with c denoting the model parameters instead of the country-variant pair.
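The sequence-generation and subsampling steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation described in Methods: every edge of the infection tree is given the same branch length (expected substitutions per site), the infection tree is supplied as a hypothetical `parent` dictionary, and all function names are our own. The per-site change probability along a branch of length d is the standard Jukes-Cantor expression 3/4 (1 - exp(-4d/3)).

```python
import math
import random

NUCLEOTIDES = "ACGT"


def jc_mutate(seq, branch_length, rng):
    """Evolve a sequence along one branch under the Jukes-Cantor model."""
    # Probability that a site ends up different from its starting base.
    p_change = 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_change:
            # All three other nucleotides are equally likely.
            base = rng.choice([b for b in NUCLEOTIDES if b != base])
        out.append(base)
    return "".join(out)


def evolve_on_tree(parent, root_seq, branch_length, rng):
    """Assign a sequence to every node of an infection tree.

    parent maps each infected node to its infector (None for the root).
    """
    children = {}
    root = None
    for node, par in parent.items():
        if par is None:
            root = node
        else:
            children.setdefault(par, []).append(node)
    seqs = {root: root_seq}
    stack = [root]
    while stack:
        u = stack.pop()
        for v in children.get(u, []):
            seqs[v] = jc_mutate(seqs[u], branch_length, rng)
            stack.append(v)
    return seqs


def subsample(seqs, p, rng):
    """Keep each sequence independently with probability p."""
    return {node: s for node, s in seqs.items() if rng.random() < p}
```

A single parameter p is enough here because thinning by the detection rate and then by the sequencing rate is equivalent to one independent thinning with the product of the two probabilities.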
For the underlying network, we select four real social networks and three types of synthetic random networks. Two company friendship networks45, which encode personal connections (recorded by Facebook), have medium size (around 5000 nodes) and have characteristics similar to those of the contact networks on which a viral disease (such as SARS-CoV-2) can spread. Two online social networks, the Google+ friendship network46 and the Twitter mutual mention network47, are large (over 200,000 nodes), and they model the underlying network of online contagion processes (e.g., rumor, misinformation). All four networks have a heterogeneous degree distribution and a relatively high clustering coefficient (Supplementary Fig. B8). To model these characteristics separately, we select three synthetic network models: the Configuration Model has a heterogeneous degree distribution but no clustering, the Stochastic Block Model (SBM) has high clustering but a homogeneous degree distribution, and the Geometric Inhomogeneous Random Graph (GIRG) model48 has both a heterogeneous degree distribution and high clustering. On all network models, due to the heterogeneous degree distribution (or the community structure in the case of the SBM), we expect large infection events that can be detected with our superspreading event detection algorithm.
We include local and global awareness in our simulations as a modification of the SIR model with adaptively changing infection probabilities. Inspired by49, for local awareness we set the infection probability of an infectious node u at time t to be
$$\beta_{u,t}=\beta_{0}e^{-\alpha_{l}I_{u,t}}, \tag{1}$$
where β0 ∈ [0, 1] is the basic infection probability, αl sets the strength of the local awareness behavior, and Iu,t is the number of infectious neighbors of node u at time t. In the case of global awareness, all infectious nodes u have the same infection probability at time t:
$$\beta_{u,t}=\beta_{0}e^{-\alpha_{g}I_{t}/N}, \tag{2}$$
where It is the total number of infectious nodes in the network, αg sets the strength of the global awareness behavior, and N is the size of the network. The exponential function in equation (1) (resp., (2)) aims to model a scenario where each neighbor (resp., node) may alert node u about their infectious status, and each of these independent alerts causes a multiplicative reduction in the infection probability. This model is similar to alternative approaches that treat local awareness as a contagion process, where the probability of staying unaware decays exponentially in the number of aware neighbors12,15,16. As a robustness check, we also implement linearly decaying local awareness functions, since it has been reported that they may be more cost-effective from an epi-economic point of view50 (Supplementary Fig. B7).
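Equations (1) and (2) translate directly into code. The sketch below is illustrative only; the function and argument names are our own, and the default value β0 = 0.15 is the one quoted in the caption of Fig. 3.

```python
import math


def beta_local(beta0, alpha_l, infectious_neighbors):
    """Equation (1): infection probability under local awareness."""
    return beta0 * math.exp(-alpha_l * infectious_neighbors)


def beta_global(beta0, alpha_g, total_infectious, n_nodes):
    """Equation (2): infection probability under global awareness."""
    return beta0 * math.exp(-alpha_g * total_infectious / n_nodes)


# With no awareness (alpha = 0) both reduce to the basic probability.
no_awareness = beta_local(0.15, 0.0, 5)

# Each additional infectious neighbor multiplies beta by exp(-alpha_l).
ratio = beta_local(0.15, 0.5, 3) / beta_local(0.15, 0.5, 2)
```

The constant ratio between consecutive neighbor counts is the multiplicative reduction per alert described above.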
In Fig. 3, we plot the dependence of MECS on the awareness-strength parameters αl and αg and two potential confounding factors: the basic infection probability β0, and the subsampling probability p. The results indicate that MECS primarily depends on the parameter αl (Fig. 3a). Importantly, we were only able to generate positive MECS values with the local awareness model, apart from the noisy MECS values near zero for low subsampling probability in smaller networks. This is a strong indication that the positive MECS values are signs of local awareness behavior.
Epidemics were simulated on synthetic and real networks as a function of a the local, b the global awareness function parameter, c the infection probability and d the subsampling probability of the resulting genetic sequences. For each set of parameters, we simulated n = 200 independent epidemic processes with different random seeds. Colored intervals show the 25th and 75th percentiles of the ECS values, while black intervals indicate confidence intervals for the median, computed using a normal approximation. Source data are available in Supplementary Data 2. When not stated otherwise, all parameters are set to be their default values αl = 0, αg = 0, β0 = 0.15, and p = 1. We observe positive Median Event Containment Scores (MECS) in the case of local awareness, and noisy MECS values near zero if the subsampling probability is low.
The observation that only local awareness can produce positive MECS values has an intuitive explanation. When a superspreading event occurs, there is usually a common trait between the individuals that become infected at the same time; they all tend to belong to the same community as the initial infector. It is also likely that there exist many additional individuals who belong to the same community, but do not become immediately infected. Indeed, reports of early superspreading events during COVID-19 do not describe all individuals in the communities becoming infected at the same time51,52, and the same is true in simulations, unless the infection probability inside the community is close to 1. If the structure of the contact network remains unchanged after the superspreading event, then these additional community members become infected in the next timestep (week), which causes the number of sequences in the collision cluster to grow, and therefore produces a negative MECS value. Note that there are extreme examples of static networks and epidemic parameters that produce a positive MECS value. For instance, in a star network with infection probability close to 1, an epidemic from the center node produces a single superspreading event, and then dies out in the next step, resulting in MECS > 0. However, we conclude that besides a few extreme cases, positive MECS values, such as the ones observed in the empirical dataset in Fig. 4, are signs of local awareness behavior.
Bar plots and black dots mark median ECS (blue) and CHI (green) values in European countries with at least 15 detected superspreading events in the a Delta, b Omicron BA.1 variants, and c when all Omicron variants are merged. The number of ECS values corresponding to each Median ECS (MECS) value is shown in Supplementary Table B.1. Colored intervals show the 25th and 75th percentiles of the distribution, while black intervals indicate confidence intervals for the median, computed using a normal approximation. Country-variant pairs with a confidence interval larger than 3 around the MECS values are filtered out. Gray background signifies a statistically significant correlation between MECS and the median CHI values (Table 1).
Local awareness in the COVID-19 Genetic Dataset – Spatial analysis
We compute the MECS values for all country-variant pairs with at least 15 detected superspreading events during the Delta or the Omicron BA.1 variants in the GISAID EpiCoV dataset (Fig. 4a, b), and we analyze how these values are related to behavioral metrics and potential confounding factors. Since we only have 5 datapoints in the Omicron BA.1 wave due to data availability, we also performed the same experiment on all Omicron sequences merged together (Fig. 4c).
Fig. 4a, b show statistically significantly positive containment scores for Germany in the Delta wave and for Germany, Slovenia and Belgium during the Omicron BA.1 wave, a sign of local awareness behavior as established in the previous section. To understand the factors that could explain the variability between the observed MECS values, we compute the sequencing rate, the attack rate, and the Containment Health Index (CHI) in each country-variant pair (see Methods). CHI is a composite epidemic response measure based on thirteen policy indicators maintained by the Oxford Coronavirus Government Response Tracker (OxCGRT) project, similar to the stringency index41. We plot the CHI in Fig. 4a–c, and we compute the Spearman-r statistic between the CHI values and the MECS values (Table 1). Interestingly, we find a positive correlation between the MECS values and the Containment Health Index, which becomes statistically significant in the Delta wave and when we merge Omicron waves into a single dataset, suggesting that government policies may also impact the local awareness behavior we measure.
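For readers unfamiliar with the Spearman statistic, the sketch below computes it as the Pearson correlation of the ranks, with tied values sharing their average rank. It is a minimal pure-Python illustration with hypothetical function names; it does not compute p-values, for which a statistics library such as SciPy (`scipy.stats.spearmanr`) would be the usual choice. The example inputs are made-up numbers, not values from Table 1.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_r(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


# Made-up example: any strictly increasing relationship gives r = 1.
r_monotone = spearman_r([1.0, 2.0, 3.0, 4.0], [1.0, 4.0, 9.0, 16.0])
```

Because only ranks enter the computation, the statistic captures whether countries with higher CHI also tend to have higher MECS, regardless of the scale of either quantity.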
While we find no significant correlation between the MECS values and the attack rate (Supplementary Fig. B3), we do observe a statistically significant negative correlation with the sequencing rate during the Delta wave (Table 1), which could suggest that MECS is an artefact of how the data was collected. However, in the Delta wave, sequencing rate and CHI happened to be highly and negatively correlated, potentially because countries aimed to lift the economic burden of strict containment policies by a higher quality sequencing and monitoring project. In the Omicron BA.1 wave and when all Omicron samples are merged, there is no significant correlation between the MECS values and the sequencing rate, suggesting that MECS measures a behavioral signal instead of confounding effects.
Local awareness in the COVID-19 Genetic Dataset – Temporal analysis
Having validated containment scores in real and synthetic datasets, we return to our motivating research question: whether drops in local awareness behavior can be observed in the genetic sequence dataset during the Omicron BA.1 wave of the COVID-19 pandemic. One approach to answer this question is to compare the variant-aggregated MECS scores from Fig. 4 between the Delta and the Omicron BA.1 waves. Fig. 5a shows that MECS values during the Omicron BA.1 wave were lower compared to the Delta wave in Ireland and the United Kingdom, with other European countries either showing no change between the two waves (Belgium), or an increased MECS in the Omicron BA.1 wave (Sweden, Denmark, Germany, Slovenia). As opposed to the spatial analysis in Fig. 4, the temporal trends in the MECS do not seem to be explained by the Containment Health Index. Fig. 5b shows that while the rankings of the MECS values and the CHI are still correlated, the median stringency of the policies became more relaxed only in Sweden and in Belgium, and no change can be observed in the case of Ireland and the UK. However, the purpose of ECS values is to measure the impact of local awareness instead of the policy stringency in the country. As an alternative explanation, we highlight the fact that the Omicron BA.1 wave arrived in the UK and Ireland a few weeks before its arrival in continental Europe, in late December instead of early January. The extreme changes in mixing behavior during the holiday season may have contributed to the lower containment scores measured in Fig. 5a.
a Median Event Containment Scores (MECS) during the Delta and the Omicron BA.1 variants as computed in Fig. 4. Datapoints below the dashed (x = y) line hint at drops in local awareness during Omicron BA.1 variant. b Containment Health Index (CHI) during the Delta and the Omicron BA.1 variants as computed in Fig. 4. c MECS values computed biweekly with a 4-week sliding window in the UK for the Delta, Omicron BA.1 and BA.2 variants. Confidence intervals were computed using a normal approximation, and datapoints with a confidence interval larger than 2 are filtered out. We observe a drop in MECS in December 2021 – January 2022 during the Omicron BA.1 wave.
Up until this point, we focused on the MECS values, computed as the median of all ECS values for a country-variant pair. However, in the United Kingdom, where thousands of superspreading events are detected across multiple variants, a higher temporal resolution can be achieved by calculating the median of ECS values biweekly with a 4-week sliding window. The resulting signal (Fig. 5c), obtained purely based on genetic sequence data, shares a remarkable similarity with the Hungarian survey results in Fig. 1a. Both curves show a relatively stable signal between October 2021 and July 2022, with a smaller drop during November 2021 and a significant drop at the beginning of the Omicron BA.1 wave.
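The biweekly median with a 4-week sliding window can be sketched as follows. Several details are assumptions made for illustration: windows are half-open intervals aligned to their start date, each event is represented by a single date, and the function name and the `min_events` parameter are hypothetical. The confidence-interval filter mentioned in the caption of Fig. 5 is not reproduced here.

```python
from datetime import date, timedelta
from statistics import median


def sliding_median(events, start, end, step_days=14, window_days=28,
                   min_events=1):
    """Median ECS in windows of window_days, advanced by step_days.

    events: list of (event_date, ecs) tuples.
    Returns a list of (window_start, median_ecs) for windows that lie
    entirely within [start, end] and contain at least min_events events.
    """
    window = timedelta(days=window_days)
    step = timedelta(days=step_days)
    out = []
    t = start
    while t + window <= end:
        values = [ecs for d, ecs in events if t <= d < t + window]
        if len(values) >= min_events:
            out.append((t, median(values)))
        t += step
    return out


# Toy events, made up for illustration only.
toy_events = [
    (date(2021, 10, 1), 1.0),
    (date(2021, 10, 10), 3.0),
    (date(2021, 10, 20), -1.0),
    (date(2021, 11, 5), -2.0),
]
signal = sliding_median(toy_events, date(2021, 10, 1), date(2021, 11, 26))
```

Consecutive windows overlap by two weeks, so each event contributes to two points of the signal, which smooths the curve at the cost of correlated neighboring values.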
Notably, the decline in ECS values coincides with the transition from the Delta to the Omicron BA.1 wave, raising concerns that this trend may reflect the increased transmissibility of the BA.1 variant compared to Delta. However, simulations (Fig. 3c) suggest that transmissibility alone has only a minimal impact on ECS values. Furthermore, the BA.2 variant, which also had increased transmissibility relative to BA.1, does not exhibit a similar discontinuity in the ECS signal. A possible alternative explanation is that a temporal drop in ECS could also be driven by spatial variations in behavior. Although our analysis treats the UK as one homogeneous population, it has been reported that the introduction of the BA.1 variant into the UK was initially localized in the London area53, and the drop in ECS could be a result of a limited ability to engage in awareness behavior in this region. In contrast, the Hungarian dataset was a representative survey, and with the Omicron wave arriving later in Hungary, the introduction of the disease was likely more uniform than in the UK.
While the uncertainties and the differences in the data collection render the direct comparison of the British ECS signal and the Hungarian survey inherently challenging, their alignment opens an array of new questions and research directions in behavioral epidemiology. Moreover, the temporal resolution of the ECS signal in the UK underscores the potential of our approach as a new tool to evaluate the impact of local awareness behavior during a pandemic situation.



