The work presented in this post is not peer reviewed. It should not be taken as clinical advice nor be reported as established information without consulting medical experts.

During a disease outbreak, *incidence* (the rate of new cases) and *prevalence* (the fraction of the population currently infected) must be monitored so as to implement and evaluate disease control strategies. These measurements depend on the availability of diagnostic tests as well as the likelihood that each patient would be tested. During an outbreak of a novel agent such as SARS-CoV-2 (in the COVID-19 pandemic), the availability of tests and the likelihood of testing change over time, making it difficult to distinguish the spread of the disease from the increased availability of testing. For example, a greater proportion of test results are likely to be positive early in a disease outbreak because physicians may reserve scarce tests for patients with severe signs and symptoms; however, as tests become more readily available, physicians may also administer tests to patients with less severe signs and symptoms, possibly resulting in a reduced ratio of positive to negative tests.

In this post, we propose a simple probabilistic model that uses the total number of administered tests and the relative proportion of positive and negative test results to estimate cumulative SARS-CoV-2 incidence over time in each state in the United States. The model incorporates beliefs about how testing strategies increase the likelihood of testing infected individuals, and enables us to compare the implications of various prevalence estimates, even for studies that measure very different populations. In this way, we can put studies from very different contexts on similar footing to understand what they imply about the spread of the pandemic. Python code to perform these estimates and daily-updated results are available on GitHub. As more data, including randomized community testing, become available, we expect these analyses to become much more accurate.

## Model

Our model uses existing test data and one free parameter to estimate cumulative incidence. The free parameter is derived from published epidemiologic studies and provides an estimate of the relative propensity, based on patient signs and symptoms, of a test to be completed. Each study suggests a particular value for this parameter, and through the use of this simple model, we can compare and contrast the implications of the studies.

We have \(N\) total individuals in the population. At each time \(t\), we observe the following random variables:

- \(n_p(t)\), the number of individuals who tested positive
- \(n_n(t)\), the number of individuals who tested negative

In addition, we have the unobserved latent variables:

- \(z^i(t)\), whether individual \(i\) has truly been infected
- \(t^i(t)\), whether individual \(i\) has been tested

Finally, let \(T_n(t)\) denote the probability that an uninfected individual will be tested, and \(T_p(t)\) denote the probability that an infected individual will be tested, i.e.

$$T_n(t) = \mathbb{P}(t^i(t)=1 | z^i(t) = 0),$$ and

$$T_p(t) = \mathbb{P}(t^i(t)=1 | z^i(t) = 1).$$

We then define

$$
c(t) = \frac{T_p(t)}{T_n(t)} =
\frac{\mathbb{P}(t^i(t)=1 | z^i(t) = 1)}
{\mathbb{P}(t^i(t)=1 | z^i(t) = 0)},
$$

i.e. an infected individual is a factor of \(c(t)\) more likely to be tested than an uninfected individual.

### Inference of Total Infection Counts

We would like to estimate the number of total infections which have occurred by time \(t\),

$$Z(t) = \sum_{i=1}^N z^i(t).$$

Given all \(z^i(t)\), the expected number of positive tests is

$$
E_p(t) = \mathbb{E} \left[n_p(t) | z^1(t),\ldots,z^N(t) \right]
= Z(t) T_p(t),
$$

and the expected number of negative tests is

$$
E_n(t) = \mathbb{E} \left[n_n(t) | z^1(t),\ldots,z^N(t) \right]
= \left(N - Z(t) \right) T_n(t).
$$

Therefore,

$$
\frac{E_n(t)}{E_p(t)} =
\left(\frac{N}{Z(t)} - 1 \right) \frac{1}{c(t)}
\hspace{1mm}\Leftrightarrow\hspace{1mm}
Z(t) = \frac{N E_p(t)}{c(t) E_n(t) + E_p(t)}.
$$
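The intermediate algebra can be spelled out as follows. This is only a restatement using the definitions of \(E_p(t)\), \(E_n(t)\), and \(c(t)\) given above; it adds no modeling assumptions.

$$
\frac{E_n(t)}{E_p(t)}
= \frac{\left(N - Z(t)\right) T_n(t)}{Z(t)\, T_p(t)}
= \frac{N - Z(t)}{Z(t)} \cdot \frac{1}{c(t)}
$$

$$
\Rightarrow\; c(t)\, E_n(t)\, Z(t) = \left(N - Z(t)\right) E_p(t)
\;\Rightarrow\; Z(t) \left( c(t) E_n(t) + E_p(t) \right) = N E_p(t).
$$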

Based on this, we estimate \(Z(t)\) using

$$
\hat{Z}(t) =
\frac{N \tilde{n}_p(t)}
{c(t) \tilde{n}_n(t) + \tilde{n}_p(t)}, \label{eq:model}
$$

where \(\tilde{n}_p(t)\) is the observed number of positive tests and \(\tilde{n}_n(t)\) is the observed number of negative tests.
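As a minimal sketch, the estimator can be written as a short Python function. The function name, the input checks, and the example numbers below are illustrative assumptions and are not taken from the published code.

```python
def estimate_cumulative_infections(population, n_pos, n_neg, c):
    """Return Z_hat = N * n_p / (c * n_n + n_p).

    population: total population N
    n_pos, n_neg: observed cumulative positive and negative test counts
    c: relative propensity to test infected vs. uninfected individuals
    """
    if population <= 0 or c <= 0:
        raise ValueError("population and c must be positive")
    if n_pos < 0 or n_neg < 0:
        raise ValueError("test counts must be non-negative")
    if n_pos == 0:
        # With no positive tests the formula gives zero (and this also
        # avoids dividing by zero when no tests have been run at all).
        return 0.0
    return population * n_pos / (c * n_neg + n_pos)


# Hypothetical numbers, chosen only to illustrate the formula.
N = 1_000_000
n_p, n_n = 1_000, 9_000
for c in (1, 5, 50):
    print(c, round(estimate_cumulative_infections(N, n_p, n_n, c)))
# prints:
# 1 100000
# 5 21739
# 50 2217
```

With \(c = 1\) the estimate is simply the positive fraction of tests scaled to the whole population; larger values of \(c\) shrink the estimate toward the observed positive count.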

### Intuition

The central assumption underlying our model is that infected individuals are more likely than uninfected individuals to receive diagnostic tests. The parameter \(c(t)\) quantifies this bias toward testing infected patients: large values of \(c(t)\) indicate a large preference for testing infected patients, while a value of \(c(t)=1\) indicates no prejudice between testing infected and non-infected patients.

#### Distinguishing Growing Epidemics from Increased Testing

According to this model, \(\hat{Z}\) increases for three reasons: (1) an increased total population \(N\), (2) an increased ratio of positive to negative tests \(\frac{n_p(t)}{n_n(t)}\), or (3) a decreased \(c(t)\). Intuitively, this third relationship says that for a given number of observed cases, if we have been less effective at rationing tests to likely infected patients, then we expect that the total population contains more infected individuals.

So, increasing the number of positive tests does not necessarily increase the implied total number of infections. This represents the counter-intuitive property that if we observe only total counts of test results, an exponentially-spreading infection is statistically indistinguishable from exponentially-growing testing capacity unless we have extra information about the testing protocol. For instance, we can consider an infection which is fixed in some proportion of the full population (i.e. not spreading)—if testing grows exponentially, then the counts of positive tests would grow exponentially without the infection spreading at all. Thus, if we believe the prevalence is truly increasing while the proportion of positive tests is staying the same, then we must implicitly believe that the testing protocol is changing over time. Our parameter \(c(t)\) provides a way to quantify this change.
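The fixed-prevalence thought experiment can be sketched numerically. The scenario below is hypothetical: the population, the number of infections, the value of \(c\), and the daily doubling of testing are all assumptions chosen for illustration, and the code uses expected counts under the model rather than sampled test results.

```python
def expected_counts(population, infected, t_n, c):
    """Expected (positive, negative) test counts under the model.

    t_n is the probability that an uninfected individual is tested;
    infected individuals are tested with probability c * t_n.
    """
    t_p = c * t_n
    return infected * t_p, (population - infected) * t_n


def implied_infections(population, n_pos, n_neg, c):
    """Z_hat = N * n_p / (c * n_n + n_p)."""
    return population * n_pos / (c * n_neg + n_pos)


# Hypothetical scenario: prevalence is fixed, testing doubles every day.
N, Z, c = 1_000_000, 20_000, 5.0
for day in range(5):
    t_n = 1e-4 * 2 ** day
    e_p, e_n = expected_counts(N, Z, t_n, c)
    print(day, round(e_p), round(implied_infections(N, e_p, e_n, c)))
```

In this sketch the expected positive count doubles each day, while the implied number of infections stays at the fixed value of 20,000 because the ratio of positive to negative tests does not change.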

### Estimating \(c(t)\)

We explore two strategies for estimating the free parameter \(c(t)\). The first method uses seroprevalence as a measure of \(Z(t)\) and calculates the implied \(c(t)\), while the second method attempts to express \(c(t)\) as an interpretable ratio of symptomatic rates. At present, these two methods provide estimates of \(c(t)\) which are within the same order of magnitude; however, it is likely that \(c(t)\) is highly variable over time and location, so these estimates should be improved by collecting more data for more localities.

###### Estimating \(c(t)\) from Seroprevalence Studies

One way to estimate \(c(t)\) is to use seroprevalence studies, which measure antibody prevalence in a randomized population to identify how many individuals have been infected up to a particular date. In contrast to the diagnostic tests, which typically rely on patients seeking treatment for their disease, these seroprevalence studies attempt to get an unbiased view of a region by randomly selecting participants and re-weighting demographic groups to reflect a broader population. Each study provides a view of \(Z(t)\) which implies a particular \(c(t)\):

$$
\hat{c}(t) = \frac{N \tilde{n}_p(t)}{\tilde{Z}(t)\tilde{n}_n(t)} - \frac{\tilde{n}_p(t)}{\tilde{n}_n(t)}
$$
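A minimal Python sketch of this inversion is shown here. The function name and the example numbers are hypothetical; `z_serology` stands for the infection count \(\tilde{Z}(t)\) implied by a study, for example a reported prevalence multiplied by the population.

```python
def implied_c(population, n_pos, n_neg, z_serology):
    """Return c_hat = N*n_p / (Z*n_n) - n_p / n_n.

    z_serology is the cumulative infection count implied by a
    seroprevalence study (prevalence fraction times population).
    """
    if n_neg <= 0 or z_serology <= 0:
        raise ValueError("n_neg and z_serology must be positive")
    # Algebraically the same as the two-term formula above.
    return (population / z_serology - 1.0) * n_pos / n_neg


# Hypothetical round trip: start from c = 5, compute the implied
# infection count, then recover c from that count.
N, n_p, n_n, c_true = 1_000_000, 1_000, 9_000, 5.0
z = N * n_p / (c_true * n_n + n_p)
print(implied_c(N, n_p, n_n, z))
```

The round trip recovers a value equal to 5 up to floating-point error, which is a quick check that the expression for \(\hat{c}(t)\) is the inverse of the estimator for \(\hat{Z}(t)\).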

In Table 1, we show the value of \(\hat{c}(t)\) implied by the reported value of \(Z(t)\) in a few recent seroprevalence studies:

- **Santa Clara County, CA (SC)**: Concerns have been raised [1, 2, 3] regarding improper propagation of uncertainty estimates in the original preprint. In addition to the 95% confidence intervals reported in the original preprint (SC-B), we also report the 95% confidence interval calculated with the PyStan MCMC implementation designed to correctly propagate uncertainties (SC-S).
- **New York, NY (NY1, NY2)**: The state government has reported results of two seroprevalence studies. These results have been shared at press conferences and recorded in news articles; however, the full data is not currently available, so we calculate only a point estimate of \(\hat{c}(t)\).
- **Boise, Idaho (ID)**: An antibody test used in Idaho’s “Crush the Curve” initiative in late April.

| **Study** | **Antibody Prevalence** | **Date** | **\(\hat{c}(t)\)** |
|---|---|---|---|
| SC-B | \(0.0180\)–\(0.0570\) | 4/4 | \(1.74\)–\(5.74\) |
| SC-S | \(0.00094\)–\(0.0167\) | 4/4 | \(6.19\)–\(111.82\) |
| NY1 | \(0.139\) | 4/21 | \(3.92\) |
| NY2 | \(0.123\) | 5/2 | \(3.45\) |
| ID | \(0.0179\) | 4/25 | \(5.86\) |

*Table 1: Implications of recent seroprevalence reports.*

Each of these studies provides a flawed estimate of \(Z(t)\), so we do not view these values as the “correct” values of \(c(t)\); rather, we seek to compare the implications of these studies.

###### Estimating \(c(t)\) from Symptomatic Rates

Another way to estimate \(c(t)\) is to express it in a way that lets us reason about its value. We can do this by assuming that testing is mostly driven by patient symptomatic load. In particular, let the probability an individual is tested be conditionally independent of the individual’s infection status given the individual’s symptoms:

$$
\mathbb{P}\big(t^i(t) = 1 | s^i(t), z^i(t)\big) = \mathbb{P}\big(t^i(t) = 1 | s^i(t)\big)
$$

where \(s^i(t)\) is the symptomatic load of individual \(i\) at time \(t\). This allows us to express \(c(t)\) in terms of symptomatic rates of the infected and non-infected populations. For simplicity, let us further assume that testing is based on exceeding a threshold of symptoms: \(\mathbb{P}\big(t^i(t) = 1 | s^i(t)\big) \propto I\big(s^i(t) > \tau(t)\big)\). Then,

$$

c(t) = \frac{s_1}{s_0} = \frac{\mathbb{P}(s^i(t) > \tau(t) | z^i(t) = 1)}{\mathbb{P}(s^i(t) > \tau(t) | z^i(t) = 0)}

$$

where \(s_1\) is the symptomatic rate of infected individuals, and \(s_0\) is the rate of non-infected individuals developing Covid-19-like symptoms.
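The step from the two assumptions to this ratio can be sketched as follows. The symbol \(\kappa(t)\) is introduced here only for illustration; it denotes the proportionality constant in the threshold assumption, and the sum runs over possible symptomatic loads \(s\).

$$
T_p(t) = \sum_{s} \mathbb{P}\big(t^i(t) = 1 | s^i(t) = s\big)\, \mathbb{P}\big(s^i(t) = s | z^i(t) = 1\big)
= \kappa(t)\, \mathbb{P}\big(s^i(t) > \tau(t) | z^i(t) = 1\big)
$$

$$
T_n(t) = \kappa(t)\, \mathbb{P}\big(s^i(t) > \tau(t) | z^i(t) = 0\big)
\quad\Rightarrow\quad
c(t) = \frac{T_p(t)}{T_n(t)} = \frac{s_1}{s_0}.
$$

The constant \(\kappa(t)\) cancels in the ratio, so under these assumptions \(c(t)\) does not depend on the overall volume of testing, only on how symptoms differ between infected and non-infected individuals.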

To estimate \(s_0\), we use the prevalence of common infections. For example, the Centers for Disease Control and Prevention (CDC) estimates that seasonal influenza viruses resulted in 35.5 million symptomatic cases in the 2018–2019 influenza season, or approximately 3.0 million symptomatic cases per week of the peak influenza season. Thus, if the threshold \(\tau\) for being considered for SARS-CoV-2 testing is equivalent to the symptoms of seasonal influenza, then we set \(s_0=0.01\). In addition, the common cold has a prevalence of approximately 10% during the winter months, so if the threshold for requesting testing for SARS-CoV-2 is equivalent to the symptoms of the common cold, then we set \(s_0=0.11\).

To estimate \(s_1\), we turn to community testing. In Iceland and Italy, studies found that half of SARS-CoV-2 patients had few symptoms requiring medical attention, leading to an estimate of \(s_1 \approx 0.5\). We hope that this estimate will be improved by more widespread testing for SARS-CoV-2 in asymptomatic individuals. The implied values of \(c\) for these parameter settings are shown in Table 2.
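The arithmetic behind these settings can be checked in a few lines of Python. This is a sketch under stated assumptions: the US population figure of roughly 328 million is an approximation supplied here rather than a number from the post, and the function name is hypothetical.

```python
# Assumption: approximate 2019 US population, used only to turn the
# weekly influenza case count quoted above into a rate.
US_POPULATION = 328_000_000
WEEKLY_FLU_CASES = 3_000_000  # peak-season weekly symptomatic cases

s0_flu = WEEKLY_FLU_CASES / US_POPULATION  # about 0.009, rounded to 0.01
s0_cold = 0.11                             # value used in the text
s1 = 0.5                                   # symptomatic rate among infected


def propensity_ratio(s1, s0):
    """c = s1 / s0 under the symptom-threshold assumption."""
    if s0 <= 0:
        raise ValueError("s0 must be positive")
    return s1 / s0


print(round(s0_flu, 3))                       # 0.009
print(round(propensity_ratio(s1, s0_cold), 1))  # 4.5
print(round(propensity_ratio(s1, 0.01), 1))     # 50.0
```

The order-of-magnitude gap between the two thresholds comes entirely from \(s_0\), which is why the choice of symptomatic threshold matters so much for the implied \(c\).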

| **Required Symptomatic Load** | \(s_0\) | \(s_1\) | \(c\) |
|---|---|---|---|
| Cold | \(0.11\) | \(0.5\) | \(4.5\) |
| Flu | \(0.01\) | \(0.5\) | \(50\) |

*Table 2: Implications of Symptomatic Load Required for Testing.*

## Estimating Cumulative Incidence for US States

Figure 2 plots the estimated cumulative infections for a few states with representative types of growth. Daily-updated results for all 50 states are available on GitHub. In each panel, we plot the total state population, the number of positive tests \(\tilde{n}_p(t)\), the number of negative tests \(\tilde{n}_n(t)\), and the estimated number of cumulative infections \(\hat{Z}(t)\) implied by a variety of settings of \(c(t)\). At each day \(t\), \(\hat{Z}(t)\) is re-estimated using only tests conducted before \(t\). While \(\tilde{n}_p(t)\) and \(\tilde{n}_n(t)\) increase monotonically with time, \(\hat{Z}(t)\) may decrease when the number of negative tests rises rapidly.

As expected from the mechanics of epidemic spread, both the number of positive tests and the implied number of infections grow more quickly in states with larger population densities than in states with smaller population densities. In addition, many of the curves of implied infection counts seem to grow at a slower rate than the number of positive test results. This could happen for two reasons which cannot be distinguished by this method: (1) testing availability is expanding at a faster rate than the true epidemic is spreading, or (2) the true value of \(c(t)\) is decreasing over time. These explanations are not mutually exclusive, and in fact, (2) is likely to be a consequence of (1). We expect that the value of \(c(t)\) is decreasing over time because testing availability is increasing, and thus the curve of true infections does not follow any of the curves plotted for fixed \(c(t)\). At each time, there is a maximum feasible \(c(t)\), defined by \(c(t) \leq \frac{N - \tilde{n}_p(t)}{\tilde{n}_n(t)}\), which depends on the observed test counts. This bound follows from requiring that the implied number of infections be at least the number of positive tests, \(\hat{Z}(t) \geq \tilde{n}_p(t)\). When we see the implied number of latent cases for a particular value of \(c(t)\) drop below the number of positive tests, we consider this value of \(c(t)\) to now be too large. As testing availability continues to increase, and more serology studies are performed to clarify the value of \(c(t)\), this picture will come into clearer resolution.
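The feasibility bound on \(c(t)\) can be sketched as a small check. The function names and the example counts are hypothetical and chosen only for illustration.

```python
def max_feasible_c(population, n_pos, n_neg):
    """Largest c for which Z_hat is still at least n_pos.

    Follows from N * n_p / (c * n_n + n_p) >= n_p,
    i.e. c <= (N - n_p) / n_n.
    """
    if n_neg <= 0:
        raise ValueError("n_neg must be positive")
    return (population - n_pos) / n_neg


def is_feasible(population, n_pos, n_neg, c):
    """True if this c does not imply fewer infections than positive tests."""
    return c <= max_feasible_c(population, n_pos, n_neg)


# Hypothetical counts for illustration.
N, n_p, n_n = 1_000_000, 1_000, 9_000
print(max_feasible_c(N, n_p, n_n))    # 111.0
print(is_feasible(N, n_p, n_n, 50))   # True
print(is_feasible(N, n_p, n_n, 200))  # False
```

As negative tests accumulate, the bound tightens, so a value of \(c(t)\) that was plausible early in an outbreak can later be ruled out by the test counts alone.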

## Takeaways

In this post, we’ve investigated using a simple probabilistic model to estimate cumulative incidence for each of the states in the United States. These estimates can be wildly different for different values of \(c(t)\), underscoring the need for public discourse to consider the testing protocols in combination with positive test counts to understand the epidemic spread.

When using this approach to compare the implications of different prevalence studies, we should keep a few limitations in mind:

- First, the value of \(c(t)\) is changing over time, and each estimate of \(c(t)\) was derived from seroprevalence studies which are often of questionable quality. For this reason, we wouldn’t advise using any of the estimates of prevalence in isolation.
- Secondly, this model does not project future spread, such as could be done with a mechanistic model (e.g., SIR model). We are not modeling the effects of policy or behavior changes; we simply aim to put different studies of prevalence on equivalent footing for comparison.
- Finally, this model uses test results, which are dependent on both test availability and usage by physicians and patients. Consequently, these estimates of cumulative incidence may decrease if test usage decreases or propensity to test changes. This wouldn’t imply that the true number of infections decreased, but rather that we would need to update the value of \(c(t)\).

However, this approach also has strengths:

- First, our model has few external inputs, favoring simplicity over predictive accuracy.
- Second, the single free parameter is highly interpretable as the relative propensity to test. This means that a variety of different studies of prevalence can be compared in a straightforward manner.
- Third, our model can be applied repeatedly and at any geographic level to help inform decisions including re-opening different localities.

In conclusion, we have developed a probabilistic model to estimate cumulative incidence. The model uses estimates from external studies which are likely to increase in number and precision, thus improving the accuracy of our results. Inferences derived from this model show that total infection counts can be orders of magnitude greater than the limited diagnostic results, and that these quantities may grow at very different rates. Furthermore, this model makes clear that it is very difficult to distinguish the cause of an increase in positive test counts, as the contributions from increased testing and/or epidemic spread rely on our prior assumptions about testing policy (quantified here as \(c(t)\)). As a result, lawmakers should use estimates of total infection counts to inform public policy rather than solely focusing on diagnostic test counts which can be heavily influenced by testing protocols.

#### Data

Test count data is collected by the Covid Tracking Project. For each state, we set the population \(N\) according to the 2019 state population estimates projected from the 2010 US Census.

**DISCLAIMER:** All opinions expressed in this post are those of the author and do not represent the views of CMU.