Explaining COVID-19 dynamics through user activity data from digital platforms with Yandex’s self-isolation index as a case study

Categories: Disease & Virus

November 19, 2025

Data

Official data on daily reported COVID-19 infections and deaths in the Russian Federation were obtained from⁴⁴. The data are available at the level of individual administrative units (oblasts) and has a daily resolution, starting from 12 March 2020.

The Yandex’s self-isolation index was designed to reflect the average intensity of person-to-person contacts⁴⁵. This was possible due to the platform’s extensive user base. In years 2020 – 2022, for example, on average 80 to 100 million users in Russia (approximately 55–70% of the Russian population, with slight underrepresentation of adults of above 65 years of age – 10% of Yandex users vs. 27% of total adult population^46,47) were visiting Yandex platform at least once a month. A typical user was spending about 13 h a month engaging with the Yandex ecosystem, including portal, mail service, movie and music streaming, blog, news, marketplace, and a range of other services⁴⁸. Thus, the Yandex user base is a good cross-section of people living in Russian urban centers, which tend to have above-average access to computers and mobile devices.

The self-isolation index was derived from the anonymized user activity data from various services available through the Yandex ecosystem, including geolocation data from navigation and ride-hailing apps, delivery requests, movie streaming as well as search queries and use of blog platforms⁴⁹. For example, higher values of self-isolation index were attained when fever than usual routing requests were registered for the navigator and subway route planning apps. Likewise, an elevated usage of movie streaming, delivery services and activity on internet forums resulted in higher values of the SII. The index was calibrated against a pre-pandemic baseline and takes values between 0 and 5, with $0$ representing the level of activity during rush hours on a regular weekday, when most people are not at home, and $5$ corresponding to virtually empty streets, with a vast majority of the population staying at home. The SII time series are available for individual cities at daily resolution for the period from 23 February 2020 to 22 September 2021.

The period for which the epidemic data and the SII data overlap spans between 12 March 2020 and 22 September 2021. There is a mismatch, however, in the level of spatial resolution, as the epidemic data is available for entire administrative units (federal subjects), while SII is calculated for individual cities. Therefore, in our analysis, we focus on the two largest cities in Russia, Moscow and St. Petersburg, which themselves are administrative units (federal cities). Both epidemic and SII data sets used in our study are publicly available, anonymized and highly aggregated. They are not considered personal data and thus their use does not require ethical approval.

The time series used in our analyses are displayed in Fig. 1. Points represent daily reported new cases of COVID-19 infections ${N}_{t}$, daily COVID-19 related deaths ${D}_{t}$, and daily values of self-isolation index $SI{I}_{t}$ for Moscow and St. Petersburg. Solid lines represent the smoothed versions of these time series, $R{M}_{7}({N}_{t})$, $R{M}_{7}\left({D}_{t}\right)$ and $R{M}_{7}({SII}_{t})$, respectively, where $R{M}_{7}({\cdot }_{t})$ denotes a 7-day rolling mean (moving average) taken over the days from ($t-6)$ to $t$. The 7-day rolling mean filters out seasonal patterns at weekly time scales (e.g., lower than average numbers of new infections recorded on Tuesdays and Wednesdays and higher values of the SII on weekends).

In the initial stage of the pandemic unfolding in early spring of 2020, only a few new cases have been reported each day, with some days on which no new cases have been recorded at all. Numbers of new cases in double digits or higher (a signal sufficiently strong for the purpose of our analysis) have been consistently reported only after 23 March 2020 for Moscow and 6 April 2020 for St. Petersburg. We therefore take these dates as starting points for the time series of the daily new COVID-19 cases in the respective cities. As $SI{I}_{t}$ exhibit pronounced weekly cycles with index values higher on weekends, we limit our analysis to the period that ends on 19 September to ensure that the time series used for modelling include records of full weeks. The ranges of the time series included in the analysis are marked on Fig. 1A with a gray background.

The data on the daily reported deaths due to COVID-19 required a more substantial trimming to be used in our analysis. As some analysis of the excess mortality in Russia suggests, the number of the fatal cases due to COVID-19 might have been considerably underreported in 2020^50,51. According to Kobak⁵⁰, on 28 December 2020, the Russian authorities admitted that most of the excess mortality between January and November 2020 was related to COVID-19, but the official statistics were not updated. Therefore, we discard the data points on deaths before 2 November 2020 which marks the beginning of the second wave of COVID-19 deaths. The period for which the data on COVID-19 deaths was included in our analysis is marked on Fig. 1B with the grey background.

The SIR-inspired time series model

A visual inspection of Fig. 1 suggests the following relationship between the dynamics of COVID-19 and the self-isolation index: the waves of infections are preceded by the periods of low SII values, while higher SII levels appear to flatten the epidemic curve. If confirmed, such a relationship would justify $SI{I}_{t}$ as an informative indicator for explaining the observed dynamics of COVID-19, reflected by ${N}_{t}$ or ${D}_{t}$. To explore this hypothesis, in this section we propose an autoregressive distributed-lag time series model inspired by the classical compartmental susceptible-infected-removed (SIR) model of infectious disease⁵².

In the SIR model, at any given time $t$ the number of new cases, ${N}_{t}$ is proportional to the number of currently infected individuals ${I}_{t}$. In a discrete time, this relationship can be expressed as

$$N_{t} = \beta I_{t – 1} = \beta \mathop \sum \limits_{i = 1}^{K} N_{t – i}$$

(1)

where $K$ is an average period within which infected individuals remain contagious, and the proportionality coefficient $\beta$ is interpreted as the average number of contacts per person times the probability of virus transmission (See Supplementary Information S1, where we show how the discrete-time Eq. (1) can be derived from the continuous-time SIR model). Applying the logarithmic transformation, Eq. (1) can be written in the additive form

$$\log N_{t} = \log \beta + \log \left( {\mathop \sum \limits_{i = 1}^{K} N_{t – i} } \right)$$

(2)

Equation (2) allows for the separation of the effects related to the intensity of people-to-people contacts and the probability of transmission, jointly captured by parameter $\beta$, and the effects related to the sheer number of infected individuals, captured by the sum of recent cases ${N}_{t-i}$ over the period of the last $K$ days.

It is important to note that Eq. (2) relies on the assumptions of a well-mixed and closed population, inherited from the SIR model. For large cities, such as Moscow and St. Petersburg, the assumption of a well-mixed population is plausible, since multiple public spaces and dense transport infrastructure enables uniform virus spread (i.e., without a tendency to develop spatially isolated clusters of infections). Populations of urban centers, however, cannot be considered closed. To represent the net flow of infected individuals to and from a city in question, we add a constant term $\mu$ to the right-hand side of Eq. (2).

To further improve the realism of model (2) we make two more adjustments. First, we observe that term $\text{log}\left({\sum }_{i=1}^{K}{N}_{t-i}\right)$ implies that the old cases ${N}_{t-i}$ (with $i$ close to $K$) contribute to the current number of cases ${N}_{t}$ with the same weight as the more recent cases (with $i$ close to 1). In other words, model (2) assumes that infected individuals that are close to recovery transmit the virus at the same rate as the newly infected persons. In the case of COVID-19, however, higher transmission rates are observed within the first several days of a COVID-19 infection, and then they gradually decline⁵³. To reflect this reality in the model while maintaining its convenient linear form, we replace $\text{log}\left({\sum }_{i=1}^{K}{N}_{t-i}\right)$ with $\sum_{i=1}^{q}{\gamma }_{i}\text{log}{N}_{t-i}$, where coefficients ${\gamma }_{i}$ represent the effect strengths of the lagged values $\text{log}{N}_{t-i}$ on the current value of $\text{log}{N}_{t}$. Here $q\le K$ is the time horizon within which these effects are not negligible. Importantly, coefficients ${\gamma }_{i}$ should not be confused with transmission rates of individuals in the $i$-th day of infection.

Second, in Eq. (2), parameter $\beta$, i.e., the average number of contacts per person times the transmission probability, is constant. In reality, however, this parameter can vary from day to day, e.g., due to the social distancing measures that reduce the number of physical person-to-person contacts. While daily changes in the average number of contacts per person are impossible to track, we conjecture that they match the changes in the self-isolation index. Accordingly, term $\text{log}\beta$ in Eq. (2), i.e., the effect of the logarithm of the average number of contacts per person (assuming constant transmission probability) on the current value of $\text{log}{N}_{t}$, can be replaced with $\sum_{j=0}^{p}{\beta }_{j}\text{log}SI{I}_{t-j}$, where coefficients ${\beta }_{j}$ represent the effect strengths of lagged values $\text{log}SI{I}_{t-j}$ on $\text{log}{N}_{t}$. In other words, $\text{log}\beta$ can be replaced with a linear model that regresses $\text{log}{N}_{t}$ on past values of $\text{log}SI{I}_{t}$. The time horizon for detectable effects, $p$, should not exceed the length of the period within which newly infected persons can transmit the SARS-COV-2 virus, but it does not need to coincide with $q$.

Considering the abovementioned adjustments to Eq. (2), we propose the autoregressive distributed lag (ARDL) model ($(M_1)$), given by formula (3), to explain the current level of $\text{log}{N}_{t}$ with a linear combination of its past values (the autoregressive part) and past values of $\text{log}SI{I}_{t}$ (the distributed lag part) plus the white noise ${\varepsilon }_{t}$.

$$\log N_{t} = \mu + \mathop \sum \limits_{i = 1}^{q} \gamma_{i} \log N_{t – i} + \mathop \sum \limits_{j = 0}^{p} \beta_{j} \log SII_{t – j} + \varepsilon_{t} ,$$

(3)

If reliable testing and/or registering the new COVID-19 cases is lacking, the daily reported new cases ${N}_{t}$ may not reflect well the true epidemic curve. Yet, the true number of infections could be estimated using the reported numbers of COVID-19 fatalities⁵⁴. While estimating the true numbers of infections based on the reported fatal cases of COVID-19 is beyond the scope of this paper, the abovementioned approach justifies the use of the daily reported deaths ${D}_{t}$ in place of ${N}_{t}$ in our analysis. Working with the time series of reported deaths ${D}_{t}$ poses several problems, however. As visible on Fig. 1B, ${D}_{t}$ exhibits considerable changes in volatility and on some days no deaths were reported, which precludes logarithmic transformation. Therefore, we smoothen the original data ${D}_{t}$ using a 7-day rolling mean. This reduces the variance without obscuring the long-term patterns and allows us to apply logarithmic transformation to the time series (barring cases when no deaths were reported for 7 consecutive days). We also apply a 7-day rolling mean to $\text{log}SI{I}_{t}.$ Thus, as an analog to model ($(M_1)$), we propose model ($(M_2)$) given by formula (4):

$$\log RM_{7} (D_{t} ) = \mu + \mathop \sum \limits_{i = 1}^{q} \gamma_{i} \log RM_{7} (D_{t – i} ) + \mathop \sum \limits_{j = 0}^{p} \beta_{j} \log RM_{7} (SII_{t – j} ) + \varepsilon_{t} .$$

(4)

A reliable estimation of parameters of models ($(M_1)$) and ($(M_2)$) using the ordinary least squares (OLS) method requires that both processes $\text{log}{N}_{t}$ and $\text{log}SI{I}_{t}$, and $\text{log}{R{M}_{7}(D}_{t})$ and $\text{log}R{M}_{7}(SI{I}_{t})$, respectively, are stationary⁵⁵. A suite of stationarity tests (cf. Supplementary Information S2, Supplementary Tables S2.1 and S2.2) compels us, however, to reject the hypotheses of stationarity of $\text{log}{N}_{t}$, $\text{log}SI{I}_{t}$, $\text{log}{R{M}_{7}(D}_{t})$, and $\text{log}R{M}_{7}(SI{I}_{t})$ – for both Moscow and St. Petersburg. Consequently, the OLS method cannot be used to estimate parameters of models (3) and (4) directly. However, the first differences of the abovementioned processes, $\Delta \log N_{t}$, $\Delta \log SII_{t} \Delta \log RM_{7} (D_{t} )$, and $\Delta \log RM_{7} \left( {SII_{t} } \right)$, do pass the stationary tests (see Supplementary Information S2, Supplementary Tables S2.3 and S2.4). Thus, for both cities, the analyzed time series are of type $I(1)$ (integrated of order one, i.e., stationary after applying difference operator once). This opens a possibility for employing the so-called error-corrected forms of ARDL models (M1) and (M2), for which OLS estimators of parameters are reliable.

Cointegration and the error-corrected form of models (M1) and (M2)

While both $\text{log}{N}_{t}$ and $\text{log}SI{I}_{t}$ are non-stationary, there may exist a stable long-run relationship between them, which could be exploited for reliable OLS estimation of parameters. More specifically, there may exist vector $\left(\gamma , \theta \right)\in {\mathbb{R}}^{2}$ such that process $\gamma \text{log}{N}_{t}+\theta \text{log}SI{I}_{t}$ is stationary – in which case processes $\text{log}{N}_{t}$ and $\text{log}SI{I}_{t}$ are said to be cointegrated⁵⁶. Notice that the mean of a stationary process is a constant, which can be included in the intercept term of the model, thus, without loss of generality, we can assume that the average of $\gamma \text{log}{N}_{t}+\theta \text{log}SI{I}_{t}$ is zero. Consequently, cointegration implies that, on average, we expect to observe the long-run equilibrium relationship $\text{log}{N}_{t}=-\frac{\theta }{\gamma }\text{log}SI{I}_{t}$.

If $\text{log}{N}_{t}$ and $\text{log}SI{I}_{t}$ are cointegrated, then model ($(M_1)$) given by formula $(3)$ can be transformed into its so-called error-corrected form^57,58, defined by formula (5) below and denoted as $(M1_{EC})$:

$$\Delta \log N_{t} = \mu + \gamma \log N_{t – 1} + \theta \log SII_{t – 1} + \mathop \sum \limits_{i = 1}^{q – 1} \alpha_{i} \Delta \log N_{t – i} + \mathop \sum \limits_{j = 0}^{p – 1} \phi_{j} \Delta \log SII_{t – j} { } + e_{t} .$$

(5)

Cointegration of $\text{log}{N}_{t}$ and $\text{log}SI{I}_{t}$ implies that both sides of the formula above consist of stationary processes, which offers a possibility for a reliable estimation of parameters of model $(M1_{EC})$. More specifically, OLS estimators of these parameters are consistent if: (A) errors ${e}_{t}$ are serially independent with zero mean and a constant variance; and (B) errors ${e}_{t}$ are uncorrelated with $\Delta \log SII_{t – h}$ for all lags $h \in {\mathbb{Z}}$ (i.e., $\Delta \log SII_{t}$ is exogenous)⁵⁵. Parameters of the original model $(M1)$ can then be recovered using the following relationships:

$$\begin{gathered} \gamma_{1} = \gamma + 1 + \alpha_{1} , \hfill \\ \gamma_{i} = \alpha_{i} – \alpha_{i – 1} , i = 2, \ldots , q – 1, \hfill \\ \gamma_{q} = – \alpha_{q – 1} , \hfill \\ \beta_{0} = \theta_{0} , \hfill \\ \beta_{1} = \theta + \phi_{1} – \phi_{0} \hfill \\ \beta_{j} = \phi_{j} – \phi_{j – 1} , j = 2, \ldots , p – 1, \hfill \\ \beta_{p} = – \phi_{ p – 1} . \hfill \\ \end{gathered}$$

(6)

The error-corrected model $\left(M1_{EC}\right)$ has an important practical interpretation. The ARDL term $\mathop \sum \limits_{i = 1}^{q – 1} \alpha_{i} \Delta \log N_{t – i}$+$\mathop \sum \limits_{j = 0}^{p – 1} \phi_{j} \Delta \log SII_{t – j}$ represents the short-term dynamics of $\Delta \log N_{t}$ driven by the past changes $\Delta \log N_{t}$ and $\Delta \log SII_{t}$. The short-term dynamics is not influenced by the level of $\text{log}SI{I}_{t}$, however, regardless of whether it is low or high. Yet, in the presence of a long-run equilibrium $\text{log}{N}_{t}=-\frac{\theta }{\gamma }\text{log}SI{I}_{t}$, any given level of $\text{log}SI{I}_{t}$ will induce an adjustment of values of $\text{log}{N}_{t}$ towards its new equilibrium value of $-\frac{\theta }{\gamma }\text{log}SI{I}_{t}$. Thus, to properly describe the evolution of $\Delta \log N_{t}$, the short term-dynamics needs to be corrected (hence the name of error correction model) with the term $\gamma \text{log}{N}_{t-1}$+$\theta \text{log}SI{I}_{t-1}$$=-\gamma \left(-\frac{\theta }{\gamma }\text{log}SI{I}_{t-1}-\text{log}{N}_{t-1}\right)$, where the expression in parentheses is the difference between the expected equilibrium value of $\text{log}{N}_{t}$, equal to $-\frac{\theta }{\gamma }\text{log}SI{I}_{t-1}$, and the value of $\text{log}{N}_{t-1}$ observed at the previous time step. The constant $-\gamma$ is interpreted as the rate of convergence to the new equilibrium.

In a similar fashion as above, if $\text{log}{R{M}_{7}(D}_{t})$ and $\text{log}R{M}_{7}(SI{I}_{t})$ are cointegrated, then the model $\left(M2\right)$ can be transformed into its error-corrected form $(M1_{EC})$ given by formula (7) below:

$$\begin{aligned} \Delta \log RM_{7} \left( {D_{t} } \right) = & \mu + \gamma \log RM_{7} \left( {D_{{t – 1}} } \right) \\ & \quad + \theta \log RM_{7} \left( {SII_{{t – 1}} } \right) \\ & \quad + \sum\limits_{{i = 1}}^{{q – 1}} {\alpha _{i} } \Delta \log RM_{7} \left( {D_{{t – i}} } \right) \\ & \quad + \sum\limits_{{j = 0}}^{{p – 1}} {\phi _{j} } \Delta \log RM_{7} \left( {SII_{{t – j}} } \right) + e_{t} . \\ \end{aligned}$$

(7)

The conditions (A) and (B) for reliable estimation of parameters of model $(M{2}_{EC})$ are essentially the same as for the model $(M{1}_{EC})$, with $\Delta \log RM_{7} \left( {SII_{t – h} } \right)$ replacing $\Delta \log SII_{t – h}$ in condition (B). Coefficients of the model $(M2)$ can be recovered using Eq.(6).

Estimation of parameters of error-corrected model and cointegration tests

An error-corrected ARDL model can be employed in testing for cointegration if its parameters can be reliably estimated using the OLS method, i.e., if conditions (A) and (B) above are satisfied⁵⁸. The procedure for testing for cointegration involves: (i) estimation of parameters of the error-corrected ARDL model including the optimal orders ${q}^{*}$ and ${p}^{*}$ of the autoregressive and distributed lag parts of the model, respectively, and (ii) testing the null hypothesis ${H}_{0}:\gamma =\theta =0$^58,59.

In the Results section below, we present the results of cointegration tests between $\text{log}{N}_{t}$ and $\text{log}SI{I}_{t}$, and $\text{log}{R{M}_{7}(D}_{t})$ and $\text{log}R{M}_{7}(SI{I}_{t})$ obtained with use of the R package dLagM, version 1.1.8⁶⁰. To carry out step (i), we use the ardlBoundOrders function. The function performs a search over combinations of orders $q$ and $p$ within the pre-specified limits ${q}_{\text{max}}$ and ${p}_{\text{max}}$. For each combination of $q$ and $p$, the function estimates an error-correction ARDL model and it returns ${q}^{*}$ and ${p}^{*}$ for which the corresponding estimated model minimizes the selected goodness-of-fit criterion. As a goodness-of-fit criterion we use the Bayesian Information Criterion (BIC) since it strongly penalises the number of parameters in a model and thus reduces the risk of overfitting. To carry out step (ii), we use the obtained optimal orders ${q}^{*}$ and ${p}^{*}$ as parameters in the ardlBound function. This function fits an error-corrected ARDL model of the specified orders, performs the bounds test for cointegration as described in⁵⁹ and displays the model’s diagnostics.