Data sources and processing
For the model-inference system, we utilized multiple sources of epidemiologic data, including confirmed and probable COVID-19 cases, ED visits, deaths, vaccination, and variant proportions49,50,51. As done and described previously11,12,13, we aggregated all COVID-19 confirmed and probable cases52,53, COVID-19-associated ED visits13,54, and COVID-19-associated deaths53 reported to the NYC Health Department by age group ( < 1, 1-4, 5-14, 15-24, 25-44, 45-64, 65-74, and 75+ year-olds), neighborhood of residence (42 United Hospital Fund neighborhoods in NYC55), and week of occurrence13. For mortality, we note a change in COVID-19-associated death definitions. From March 1, 2020 – April 2, 2023, COVID-19-associated deaths included (1) deaths occurring in persons with laboratory-confirmed SARS-CoV-2 infection (i.e., confirmed COVID-19-associated death) at any point (March 1, 2020–July 23, 2020), within 60 days (July 24, 2020–August 2, 2021), or within 30 days (August 3, 2021–April 2, 2023) of diagnosis; and (2) deaths with COVID-19, SARS-CoV-2 or a similar term listed on the death certificate as an immediate, underlying, or contributing cause of death but without laboratory-confirmation of COVID-19 (i.e., probable COVID-19-associated death)56. From April 3, 2023 through the week of August 27, 2023 (i.e., end of this study), COVID-19-associated deaths included any death where the death certificate included COVID-19 or a common variation of COVID-19, SARS-CoV-2, coronavirus, etc.49. For vaccinations, we included all available vaccine doses to date (i.e., 1st to 5th dose), and aggregated data for each vaccine dose to the same age/neighborhood strata, by date of vaccination50.
To model the impact of NPIs, as done previously11,12,13, we used mobility data from SafeGraph48 to adjust SARS-CoV-2 transmission rate. Note, however, the model-inference system also included a parameter to capture the overall impacts of NPIs not limited to mobility reduction (e.g., additional interventions such as masking; see below). The SafeGraph data were aggregated to the neighborhood level by week without age stratification, and available from the week of March 1, 2020 to the week of December 19, 2022. For the week of December 26, 2022 to the week of August 27, 2023 (i.e., end of our study period), a comparison of historical SafeGraph data (i.e., weeks during March 2020–December 2022, using the maximum mobility recorded for the corresponding week of year to account for seasonal changes) showed a close agreement with real-time subway ridership data (Supplementary Fig. 4). Thus, we used historical SafeGraph data for those weeks.
To compute the variant-specific cases, ED visits, deaths estimated infection rates, we used reported weekly percentage of individual variants among sequenced samples51,57. Variant proportion data started from the week of December 27, 2020, and likely did not fully capture the share of Iota, a major variant that emerged around Fall 2020. Therefore, we combined the ancestral and Iota variants when computing the total number of cases or infections attributable to these variants.
For model validation, we used SARS-CoV-2 wastewater surveillance data, available from August 31, 2020 onward. Note the measurement of SARS-CoV-2 concentration in wastewater during each week (here, averaged over samples from the same week given the roughly 1-week timespan from infection to recovery) provided a snapshot of the prevalence of SARS-CoV-2 presence in the population (i.e., a proxy of SARS-CoV-2 infection prevalence in the population). Specifically, SARS-CoV-2 RNA concentrations were measured at each of the city’s 14 wastewater treatment plants, often twice per week, using quantitative reverse transcription polymerase chain reaction (RT-qPCR) assays during August 31, 2020–April 11, 2023 and reverse transcription digital PCR (RT-dPCR) assays from November 1, 2022 through the week of August 27, 2023 (i.e., end of this study). For weeks after April 11, 2023 when the samples were measured using RT-dPCR alone, we converted the RT-dPCR measurements to RT-qPCR equivalents, by multiplying a simple conversion ratio (i.e., the mean of all RT-qPCR measurements dividing the mean of all RT-dPCR measurements during November 1, 2022–April 11, 2023 when both assays were conducted). To compute the citywide weekly per-capita SARS-CoV-2 wastewater concentrations, we first averaged the per-capita SARS-CoV-2 concentrations (i.e., normalized by sewershed flow rate and population size) for each week and sewershed, and then further aggregated the sewershed-level measurements to the city level (i.e., weighted mean per the population size).
This activity was classified as public health surveillance and exempt from ethical review and informed consent by the Institutional Review Boards of both Columbia University and NYC Health Department.
Model inference to estimate key epidemiological variables and parameters
We used a model-inference system to estimate epidemiological variables and parameters based on case, ED visit, and mortality data, accounting for NPIs, vaccinations, under-detection of infection, and seasonal changes. Built on an approach described in Yang et al.13, here the model-inference system additionally tracks the number of vaccinated individuals and accounts for all vaccine doses as done in ref. 58. Briefly, the model-inference system uses a metapopulation network SEIRSV (Susceptible-Exposed-Infectious-(re)Susceptible-Vaccination) model (Eq. 1) to simulate the transmission of SARS-CoV-2 across the 42 neighborhoods in the city by age group:
$$\left\{\begin{array}{c}\frac{d{S}_{i}}{{dt}}=\frac{{R}_{i}}{L}-\left({S}_{i}{\sum }_{j=1}^{j=42}\frac{{{{b}_{s}{b}_{j}\beta }_{{city}}m}_{{ij}}{I}_{j}}{{N}_{j}}\right)+{\sum }_{\tau =0}^{\tau ={\rm T}}{{\rho }_{\tau }V}_{i,t-\tau }-{\sum }_{k=1}^{k=K}{v}_{i,k} \\ \frac{d{E}_{i}}{{dt}}=\left({S}_{i}{\sum }_{j=1}^{j=42}\frac{{{{b}_{s}{b}_{j}\beta }_{{city}}m}_{{ij}}{I}_{j}}{{N}_{j}}\right)-\frac{{E}_{i}}{Z} \hfill \\ \frac{d{I}_{i}}{{dt}}=\frac{{E}_{i}}{Z}-\frac{{I}_{i}}{D} \hfill \\ \frac{d{R}_{i}}{{dt}}=\frac{{I}_{i}}{D}-\frac{{R}_{i}}{L} \hfill \\ \frac{d{V}_{i}}{{dt}}={\sum }_{k=1}^{k=K}{v}_{i,k}-{\sum }_{\tau =0}^{\tau ={\rm T}}{{\rho }_{\tau }V}_{i,t-\tau } \hfill \end{array}\right.$$
(1)
where Si, Ei, Ii, Ri, Vi, and Ni are the number of susceptible, exposed (but not yet infectious), infectious, recovered and immune (i.e., protected against infection), vaccinated and immune individuals, and the total population59, respectively, from a given age group (i.e., <1, 1–4, 5–14, 15–24, 25–44, 45–64, 65–74, or 75+ years) in neighborhood-i (i = 1,…42, for the 42 neighborhoods in the city). \({\beta }_{{city}}\) is the average citywide transmission rate; bs is the estimated seasonal trend12. The term bi represents the neighborhood-level transmission rate relative to the city average. The term mij represents the changes in contact rate in each neighborhood (for i = j) or spatial transmission from neighborhood-j to i (for i ≠ j) and was computed based on the mobility data12. Here, we did not explicitly model the impact of individual NPI such as masking, due to the lack of data and the minor impact of masking at the population level (estimated 5–20% reduction11,60). Rather, to account for the overall impact of NPIs including masking, we scaled the mobility data by a multiplicative factor to capture the overall NPI effectiveness when computing mij12. Z, D, and L are the latency period, infectious period, and immunity period, respectively. Note that as all state variables and parameters are time varying and for each age group separately, Eq. 1 omits time (t) and age in the subscripts.
To account for vaccination, \({\upsilon }_{i,k}\) is the number of neighborhood-i residents who were immunized after the k-th dose (k = 1, 2, …, 5 here for up to 5 doses of vaccines to date) at the time step (t), and was computed using vaccination data adjusting for vaccine effectiveness (VE) against infection61,62,63,64,65. Thus, the term \({\sum }_{k=1}^{k=K}{v}_{i,k}\) represents the total number of neighborhood-i residents immunized by any dose of vaccine at the time step. The term \({\sum }_{\tau =0}^{\tau ={{{\rm T}}}}{{\rho }_{\tau }V}_{i,t-\tau }\) accounts for the waning of vaccine protection against infection, where Vi,t-τ is the number of neighborhood-i residents who got vaccinated τ days ago and lost protection on day-t, and ρτ is the VE waning probability computed based on VE duration data64. Note, here we focused on modeling the impact of vaccination on population susceptibility, and that the posterior estimates of population susceptibility were made along with other factors (e.g., infection) using several data streams and model inference as described below.
Using the model-simulated number of infections occurring each day, we further computed the number of cases, ED visits, and deaths each week to match with the observations, as described in refs. 12,13. Using cases as an example, we multiplied the model-simulated number of new infections per day by the infection-detection rate (i.e., case ascertainment rate, or the fraction of infections reported as cases), and further distributed these estimates in time per a distribution of time-from-infection-to-case-detection (Supplementary Table 3); we then aggregated the daily lagged, simulated estimates to weekly totals for model inference.
Each week, the system uses the ensemble adjustment Kalman filter (EAKF)66 to compute the posterior estimates of model state variables and parameters based on the model (prior) estimates and observed case, ED visit, and mortality data per Bayes’ rule12,13. In particular, model posterior estimates include (1) the underlying infection rate including those not reported as cases, for each week (Fig. 2a and Fig. 2c); (2) the number of susceptible individuals (i.e., Si), which provides estimates of population susceptibility over time (Fig. 2b); (3) the citywide transmission rate (\({\beta }_{{city}}\)) and infectious period (see estimates in Supplementary Fig. 2B and C); (4) the time-varying virus transmissibility (RTX, a measure of variant-specific infectiousness as described in refs. 17,22; Fig. 2c), computed as the product of the citywide transmission rate and infectious period; and (5) other key parameters such as the infection-detection rate (Fig. 3c), IFR (Fig. 4), and the real-time production number (Rt; see estimates in Supplementary Fig. 2A). Computer code for a previous study22 using similar model-inference approach is available at Zenodo67.
We ran the model-inference system for the pre-Omicron and Omicron periods separately, given the large differences in variant characteristics and thus prior distributions for the model states and parameters (Supplementary Table 3). For the pre-Omicron period, we initiated the system at the week of March 1, 2020 (i.e., the week the first cases were detected in NYC), and ran it continuously through the week of December 5, 2021 (i.e., the week before the Omicron BA.1 variant was detected in >50% of sequenced cases). For the Omicron period, we reinitiated the system at the week of November 21, 2021 and ran it continuously through the end of the study period; given the initial overlap with the Delta variant in November/December 2021, we computed the number of cases, ED visits, and deaths due to Omicron based on the variant proportion data and used those variant-specific estimates for inference. To account for model uncertainty, we ran the model-inference system 10 times, each with 500 ensemble members randomly drawn from the initial prior ranges (Supplementary Table 3), and combined the posteriors from all runs, as done in ref. 12. Note the credible intervals (CrIs) were constructed directly using the ensemble members; as such, the 95% CrIs tended to be wide particularly during early weeks of the study period when data were sparse, which reflected the large model uncertainty (see, e.g., the wide lighter blue area in Fig. 2b showing the 95% CrIs for the susceptibility estimates during the 1st and 2nd waves) and model outliers (vs. the much tighter 50% CrIs in darker blue in Fig. 2b for the susceptibility estimates).
Model validation using SARS-CoV-2 wastewater surveillance data
To validate model-inference estimates, we compared the infection prevalence estimates (i.e., the estimated number of infectious individuals, including those not detected as cases, in the population each week) to independent SARS-CoV-2 wastewater concentration data (i.e., the collective SARS-CoV-2 viral shedding of the population, regardless of clinical testing practices). While both quantities represent the presence of SARS-CoV-2 in the population, the measurements are on different scales and viral shedding per infection could vary by the infecting variant. Thus, for comparison, we separated the data into three periods: (i) August 31, 2020 (i.e., the first day of wastewater surveillance) – June 26, 2021, predominantly the ancestral and Iota variants; (ii) June 27, 2021 (i.e., the first week the share of Delta exceeding 50% among the sequenced samples) – November 20, 2021, predominantly the Delta variant; and (iii) November 21, 2021 (i.e., the first week Omicron BA.1 was detected) – August 29, 2023 (i.e., the last wastewater sample during the study period), predominantly the Omicron subvariants. We scaled the wastewater measurements by multiplying the ratio of mean infection prevalence estimates and mean wastewater concentrations across all weeks of each period, and overlay the two time series for visual inspection (see Fig. 1b). For simplicity and given that wastewater viral load data and the infection prevalence estimates both represent active SARS-CoV-2 viral shedding, we used concurrent measures/estimates for the same week (i.e., with no lead/lag time) for this comparison.
Estimating variant-specific infection rates
The weekly infection rate estimates from the model-inference system are based on surveillance data combining all reported variants and thus represent infections by any variant circulating during the week. To estimate the variant-specific infection rates for each week, we multiplied the overall infection rate estimate by the proportion among the sequenced samples for each variant during that week. To compute the total variant-specific infection rate, we then summed the weekly estimates across all weeks that a given variant was detected. For each variant, to identify the main circulation period (i.e., calendar weeks when 95% of all infections occurred), we recorded the first week that the cumulative infection rate surpassed 2.5% (i.e., the start) and 97.5% (i.e., the end) of the total.
Qualitative illustration of immunity from vaccinations and infections by different variants
The model-inference system accounted for immunity conferred by prior infection and vaccination and waning (Eq. 1) to compute the posterior estimates of population susceptibility, using epidemiological data and the EAKF inference algorithm as described above. However, because the two immune components overlap (e.g., a recoveree could subsequently get vaccinated and have mixed immunity for both) and the EAKF may not perfectly preserve mass balance, it is difficult to separately quantify their contributions. Thus, to qualitatively examine the population immunity landscape, we used the rolling sum of prior infection as a proxy of infection-induced immunity and that of vaccinations as a proxy of vaccine-induced immunity (shown in Fig. 2b). Specifically, the rolling sum of prior infection was computed by adding all estimated infections during the preceding 0.5Trs days (i.e., the estimated half time of immunity period; see Supplementary Fig. 2d). The rolling sum of vaccinations was computed by adding vaccinations of the primary series, 3rd, and 4th, 5th dose during the preceding 0.5Tvax days (Tvax is the estimated vaccine-induced immunity period) and further multiplying the estimated variant-specific VE (Supplementary Table 3).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.