Revealing spatiotemporal variations in areas potentially linked to COVID-19 spread using fine-grained population data


July 2, 2025

Correlation-based potential risk calculation

We integrated two datasets for epidemiological modelling. The first was spatiotemporal population data consisting of hourly population estimates for each 125-metre-square grid cell, grouped by the home wards of the visitors. The data were provided by LY Corporation (Tokyo, Japan; Yahoo Japan Corporation at the time of data provision) and were collected from the mobile devices of users who had opted in, with privacy-preserving aggregation. They covered 182,669 grid cells across Tokyo and Kanagawa Prefectures, which are central to the Greater Tokyo Area (see the leftmost image in Fig. 1a). Considering the time span of the data, we set the period of our analysis from the third wave (beginning on 2 Nov. 2020) to the fifth wave (ending on 31 Oct. 2021) of the pandemic in Tokyo. The second dataset comprised the daily number of confirmed cases in each ward of Tokyo, aggregated by reporting date. Wards are the smallest units for which confirmed cases were reported in Tokyo.

Our metric of potential risk, or measure of potential concern, was calculated using the most detailed information available from the datasets described above. Specifically, the indicator was defined for each grid cell (c) at a given time of day (visiting hour h) and depended on the residential ward (w). The ward-specific dependence was intended to capture the varying trends of confirmed cases across wards (see Supplementary Fig. S4) and to provide local administrators with insights into areas requiring heightened attention. The grid cells represented different urban locations (e.g. business or downtown districts), while the time of day reflected distinct types of activities at these locations (e.g. the population at 13:00 consists largely of workers, whereas the population at 20:00 primarily reflects evening or nightlife activities). We hypothesised that the combination of a specific grid cell and time of day would characterise distinct urban scenarios and activities, which have been shown to influence COVID-19 transmissibility²².

Given a triplet (c, h, w) representing these factors, we quantified the potential risk during a wave by calculating Pearson’s correlation coefficient between two daily time series: the daily population at grid cell c during hour h, denoted as $\textrm{Population}_{d}(c,h)=\textrm{Population}_{t(d,h)}(c)$, and the daily instantaneous effective reproduction number $R_t$ for ward w on date d, denoted as $R_d(w)$. Here, $\textrm{Population}_{t}(c)$ is the original hourly time series of population fluctuation in cell c, and t(d, h) is the time index corresponding to hour h on date d. Furthermore, to account for fluctuating delays between population changes and their subsequent impact on infection spread (beyond the reporting delays partially corrected in the $R_t$ calculation), we introduced a time shift $\delta$ when calculating correlation coefficients. This mitigates the influence of such short-term delays and reduces the chance of overlooking potential areas of concern. We applied forward or backward shifts (in days) to the population time series for each cell and calculated the correlation for each shift within a predefined range $[-\Delta , \Delta ]$. The potential risk was then defined as the maximum correlation found across these shifts:

$$\begin{aligned} \mathrm{Risk}_{\delta }(c, h, w)&= \textrm{Pearson}[\textrm{Population}_{d+\delta }(c, h), R_d(w)], \end{aligned} \tag{1}$$

$$\begin{aligned} \textrm{Risk}(c, h, w)&= \max _{\delta \in [-\Delta , \Delta ]} \mathrm{Risk}_{\delta }(c, h, w), \end{aligned} \tag{2}$$

where the Pearson correlation is calculated over the daily index d corresponding to the selected credible period for $R_d(w)$ during the target wave (see “Identifying periods for correlation calculations per ward” subsection), and $\Delta$ is the maximum acceptable lag in days (determined in the “Parameter determination” subsection to be $\Delta =1$).
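The lagged correlation in Eqs. (1) and (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `pearson` and `risk`, the array layout (one value per day, with population and $R_d(w)$ sharing the same day index), and the `period` argument standing in for the credible period are all hypothetical.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation of two equal-length 1-D series (nan if either is constant)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    return float((xc * yc).sum() / denom) if denom > 0 else float("nan")

def risk(population, r, period, max_lag=1):
    """Eq. (2): maximum over delta in [-max_lag, max_lag] of Eq. (1).

    population : daily series Population_d(c, h) for one cell and hour
    r          : daily series R_d(w) for one ward, same day index
    period     : (start, stop) day indices of the credible period, stop exclusive
                 (assumes start - max_lag >= 0 and stop + max_lag <= len(population))
    Returns (risk value, shift delta that attains it).
    """
    start, stop = period
    pop = np.asarray(population, dtype=float)
    r_win = np.asarray(r, dtype=float)[start:stop]
    scores = {}
    for delta in range(-max_lag, max_lag + 1):
        # Eq. (1): correlate Population_{d+delta}(c, h) with R_d(w) over d in the period
        scores[delta] = pearson(pop[start + delta:stop + delta], r_win)
    valid = {d: s for d, s in scores.items() if not np.isnan(s)}
    if not valid:
        return float("nan"), None
    best = max(valid, key=valid.get)
    return valid[best], best
```

With `max_lag=1`, as in the text, each triplet (c, h, w) requires only three correlations, so the calculation remains cheap even when repeated over thousands of cells and 24 hours.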

An overview of our framework is illustrated in Fig. 1b. Here, $R_t$ represents the expected number of individuals infected by a patient at a given time point³⁵. We estimated it using a Bayesian method³⁶ with data on confirmed cases adjusted for reporting delays (see the “Calculation of R_t” subsection of the “Methods” section). Conceptually, our method identifies grid cells where fluctuations in population tend to synchronise with changes in $R_t$ (see the “Risk modelling” subsection of the “Methods” section). While we acknowledge that correlation does not imply causation, previous studies have demonstrated the utility of such correlations in monitoring epidemiological trends¹⁹,³⁰. Consistent with these findings, the results below show that our approach highlighted potentially concerning areas in line with insights from prior research.

To enhance the accuracy of our indicator calculation and address spurious correlations, our method included two steps before computing the indicator for each cell (see corresponding subsections of the “Methods” section). First, we performed a two-phase cell screening process to extract non-residential cells and select candidate cells with more than a certain number of visitors from a target home ward, leveraging home ward information in our spatiotemporal population data. This resulted in 61,053 non-residential cells in Tokyo and Kanagawa Prefectures (about one-third of the total cells) and, for instance, 3,295 candidate cells for Setagaya Ward during the third wave. Second, for the correlation calculation, we selected the periods within each wave in which $R_t$ had low uncertainty (i.e. enough confirmed cases). The calibration of the parameters involved and verification are detailed in the “Methods” section.

Influence of urban-suburban structure on highly-correlated areas

The Greater Tokyo Area comprises a central metropolitan area and suburbs, which are connected to each other mainly by trains. To investigate the impact of this urban-suburban structure on the risk of COVID-19 spread, we analysed correlation maps during the third wave (the initial wave in our datasets) for residents in Setagaya and Shinjuku Wards, as representative suburban and urban wards, respectively. During the wave, Setagaya Ward had the largest number of confirmed cases among all Tokyo’s wards, while Shinjuku Ward had the most cases out of the central wards (see Supplementary Fig. S2).

The left panel of Fig. 2 shows a correlation map of the third wave for residents in Setagaya Ward. To visualise spatial variation in correlations, each cell in the map is colour-coded by its highest correlation coefficient across all times of day, thus omitting temporal information. Areas around terminal stations (e.g. Tokyo Sta., Shinjuku Sta., Shibuya Sta., and Ikebukuro Sta.) had high correlation coefficients. This result is consistent with the infection risks reported in previous studies¹⁸,²⁶,²⁷,³⁷,³⁸,³⁹, which supports the validity of our metric. The correlation map also highlights areas around railways connected to these stations. We note that the majority of highly-correlated areas were located outside Setagaya Ward, although a few downtown areas within the ward (e.g. around Futako-Tamagawa Sta.) were identified. Additionally, certain areas in neighbouring Kanagawa Prefecture exhibited high correlations, albeit rarely (see Supplementary Fig. S6 for a map including Kanagawa).

In the right panel of Fig. 2, we present a correlation map of the third wave for residents in Shinjuku Ward. Once again, areas around the terminal stations showed high correlations. However, in contrast to Setagaya Ward’s map, Shinjuku Ward’s map contained most of the high-correlation areas inside the Yamanote Line, the main circular railway surrounding the central metropolitan area of Tokyo.

Spatiotemporal variations of potential areas of concern

Next, we analysed the spatiotemporal characteristics of the areas requiring heightened attention. We first identified the top 300 cells with high correlation coefficients for each wave and ward as cells of potential concern, which accounted for approximately 0.5% of the non-residential cells. Here we emphasise that our method highlights areas requiring further epidemiological consideration, but does not directly quantify the possibility of infection at these areas. To investigate whether these cells of potential concern were common across different wards, we selected Adachi, Edogawa, Nerima, Ota, and Setagaya as target wards (Fig. 3a), which had the largest number of confirmed cases during the third to fifth waves (see Supplementary Fig. S2). We counted the number of wards that shared each cell of potential concern for each wave. As shown in Fig. 3a, cells of potential concern for the third wave were concentrated in the central metropolitan area (through which the Yamanote Line runs) and shared by multiple wards. However, as the pandemic progressed, these cells became more distinct for each ward, and tended to shift to the suburbs.
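The selection and sharing count described above can be illustrated with a small Python sketch. The names `top_cells` and `shared_ward_counts` and the dictionary layout (ward, then cell id, then that cell's highest correlation over all hours) are assumptions made for illustration; the source does not specify a data structure or how ties are broken.

```python
from collections import Counter

def top_cells(risk_by_cell, k=300):
    """Return the ids of the k cells with the highest correlation.

    risk_by_cell maps a cell id to its highest correlation over all hours
    (hypothetical layout). Ties are broken by dictionary order.
    """
    ranked = sorted(risk_by_cell, key=risk_by_cell.get, reverse=True)
    return set(ranked[:k])

def shared_ward_counts(risk_by_ward, k=300):
    """Map each cell of potential concern to the number of wards that share it."""
    counts = Counter()
    for risk_by_cell in risk_by_ward.values():
        counts.update(top_cells(risk_by_cell, k))
    return counts
```

A cell whose count equals the number of target wards is of potential concern for all of them, which corresponds to the shared central cells described for the third wave. For scale, 300 of 61,053 non-residential cells is about 0.49%, matching the approximately 0.5% stated above.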

To better understand the spatiotemporal variations of cells of potential concern over different waves, we next focused our analysis on Setagaya Ward, which consistently had the largest number of confirmed cases. Figure 3b shows these cells for Setagaya Ward residents during each of the three waves. The areas identified for the later waves contained fewer metropolitan areas (including the terminal stations mentioned above) and more areas within Setagaya Ward. For example, cells of potential concern in the fourth wave were located near the ward office and along the Odakyu Line, a railway connecting Kanagawa Prefecture (below Tokyo in the figure) to Shinjuku Station through Setagaya Ward. Other examples included Yoyogi Park and Meguro River, popular spots for viewing cherry blossoms during their blooming period (the end of March until the beginning of April). For the fifth wave, the cells of potential concern were mainly located near railway lines. These results imply a possible association between railway usage and surges in infections, which is in accordance with the findings of previous studies⁴⁰,⁴¹.

We also analysed the times of day at which correlations were highest. To do this, we classified the time of day into four categories: early morning (3:00–8:59), working hours (9:00–16:59), after work (17:00–19:59), and night (20:00–2:59). Figure 3b shows the time category in which each cell had its highest correlation with $R_t$. While cells of potential concern for the third wave were concentrated at night, those in the fourth and fifth waves were mostly associated with working hours. In particular, downtown districts around major terminals exhibited a high correlation at night for the third and fourth waves, indicating that leisure activities after work potentially contributed to the spread of COVID-19 during these waves. Conversely, areas including suburban downtowns and those surrounding railway lines were highly correlated during working hours for the fourth and fifth waves. This could suggest that daytime activities, including work, were associated with the disease spread during these waves.
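The four time categories can be written as a small lookup, sketched below in Python. The function names `time_category` and `best_hour`, and the assumption that hours are integers from 0 to 23, are illustrative choices rather than details from the source.

```python
def time_category(hour):
    """Classify an hour of day (0-23) into the four categories used in the text."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    if 3 <= hour <= 8:
        return "early morning"   # 3:00-8:59
    if 9 <= hour <= 16:
        return "working hours"   # 9:00-16:59
    if 17 <= hour <= 19:
        return "after work"      # 17:00-19:59
    return "night"               # 20:00-2:59, wrapping past midnight

def best_hour(risk_by_hour):
    """Return (hour, correlation, category) for the hour with the highest correlation.

    risk_by_hour maps an hour of day to Risk(c, h, w) for one cell and ward
    (hypothetical layout).
    """
    hour = max(risk_by_hour, key=risk_by_hour.get)
    return hour, risk_by_hour[hour], time_category(hour)
```

Because the night category wraps past midnight, it is handled as the fall-through case rather than as a single contiguous range.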

Points-of-interest (POIs) in potential areas of concern

The above analysis revealed that areas of potential concern exhibited certain spatiotemporal patterns. To gain a better understanding of the dynamics of these patterns, we analysed the POIs associated with these areas using a POI dataset consisting of 634,107 POIs in Tokyo and Kanagawa Prefectures, along with their locations and categories (see the “Points-of-interest (POIs) data” subsection under the “Methods” section). Specifically, we counted the number of POIs within cells of potential concern for each category. Figure 4a shows the distribution of POIs detected for Setagaya Ward residents during each wave. During the third wave, the number of detected POIs varied greatly from category to category; in particular, “Restaurants” was the most common category, which is consistent with previous findings²²,²⁶,³⁸. Note that this large variance cannot be attributed solely to differences in the total POI counts among the categories (see Supplementary Fig. S9). Subsequently, there was a gradual decrease in this variance over time. We observed the same trend in other wards as well (see Supplementary Figs. S10–S13).

To quantify the temporal changes in variance, we computed the entropy of the POI distribution for the five wards shown above. Fig. 4b shows a monotonic increase in entropy for all of the wards except Adachi, for which the maps of potentially concerning areas contained significant noise during the fourth wave (see Supplementary Fig. S7). These results suggest that, as the pandemic progressed, the spread of COVID-19 was no longer attributable to any single POI category, and may have been associated with many different POIs.
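As an illustration of the entropy measure, the sketch below computes the Shannon entropy of a distribution of POI counts over categories. The source does not state which entropy definition or logarithm base was used, so Shannon entropy with the natural logarithm is an assumption here, as are the function name `poi_entropy` and the category names in the usage note.

```python
import math

def poi_entropy(counts):
    """Shannon entropy (in nats) of POI counts per category.

    counts maps a category name to the number of POIs detected in cells of
    potential concern. Shannon entropy with natural log is an assumption.
    """
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("at least one POI is required")
    probs = [n / total for n in counts.values() if n > 0]
    return -sum(p * math.log(p) for p in probs)
```

Under this definition, a distribution concentrated in one category, such as `{"Restaurants": 10, "Parks": 0}`, has entropy 0, whereas counts spread evenly over four categories give `log(4)`. An increase in entropy over successive waves therefore corresponds to POIs being spread more evenly across categories.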

Population time series at potential areas of concern

Finally, we analysed the population time series of the highly-correlated areas directly to investigate which temporal events may have produced the high correlation between these time series and $R_t$. For Setagaya Ward residents, we averaged, over the entire analysis period, the population time series of the top 300 cells identified as potentially concerning for each wave. Each cell contributes its population time series evaluated at the specific hour of day ($h_{\max}$) that yielded its highest correlation during the wave for which it was identified. The resulting average time series is then normalised by dividing by its maximum value within the plotted date range, so that the displayed peak equals 1. The three resulting time series, one for the set of cells of potential concern of each wave, are plotted in Fig. 5 alongside the $R_t$ time series of the ward.
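The averaging and normalisation step can be sketched as follows. The function name `mean_normalised_series` and the input layout are hypothetical: each cell is assumed to supply a daily series already evaluated at its own $h_{\max}$, with all series covering the same plotted date range.

```python
import numpy as np

def mean_normalised_series(series_by_cell):
    """Average daily population series over cells, then scale the peak to 1.

    series_by_cell maps a cell id to its daily population series at that
    cell's h_max (hypothetical layout); all series must have equal length.
    """
    stacked = np.vstack([np.asarray(s, dtype=float) for s in series_by_cell.values()])
    mean = stacked.mean(axis=0)      # average over cells for each day
    peak = mean.max()                # maximum within the supplied date range
    if peak <= 0:
        raise ValueError("mean series has no positive peak")
    return mean / peak
```

Normalising after averaging, rather than per cell, means that cells with larger populations contribute more to the shape of the curve.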

In the figure, the three population time series exhibit similar upward and downward trends on the whole, but there are also some notable differences between them. In what follows, we detail how the population and $R_t$ time series changed for the period of each wave. We note that, during the pandemic, the Tokyo Metropolitan Government declared (quasi-)states of emergency (SoEs) several times, including non-compulsory requests for citizens to refrain from going out⁴²,⁴³, which may have affected mobility patterns⁴⁴.

During the third wave (2 Nov. 2020 to 23 Feb. 2021), the population time series for the third-wave cells of potential concern showed the largest drop around New Year’s Day among the three series, coinciding with the rapid decline of $R_t$. The sharp peak of $R_t$ before the decline is partially due to the suspension of new case reporting over the New Year holidays. During the SoE in early 2021, the population in these cells remained low compared with that in the cells identified for the other waves. Our POI analysis suggests that the dynamics of populations in potentially concerning areas during the third wave may have been related to year-end parties at restaurants.

During the fourth wave (5 Mar. 2021 to 12 Jun. 2021), both the population and $R_t$ peaked at the end of March, coinciding with the season for the spring vacation and cherry-viewing parties. This trend aligns with our analysis of potential areas of concern discussed in the previous section. A significant drop in the population was then observed around the end of April and the beginning of May, likely attributable to a declared SoE and the national holidays in Japan. By contrast, $R_t$ continued its gradual decline after peaking at the end of March and did not mirror the population’s recovery in mid-May, leading to the abatement of the wave. This divergence between the population and the $R_t$ time series suggests that daily human mobility did not solely dictate $R_t$ trends, particularly toward the end of a wave.

Finally, during the fifth wave (16 Jun. 2021 to 31 Oct. 2021), both the population time series for cells of potential concern and the $R_t$ time series showed a decreasing trend from mid-July to mid-August, a period that includes consecutive public holidays and a major holiday season in Japan. Later, while the population time series increased, the $R_t$ time series continued to decrease. This discrepancy may have been due to the increased vaccination coverage among the population with Pfizer/BioNTech and Moderna vaccines⁴⁵,⁴⁶ by that time (see Supplementary Fig. S5 for statistics of the vaccination ratio in Japan⁴⁷,⁴⁸).
