Research Article

COVID-19

Estimating The Infection Fatality Rate Among Symptomatic COVID-19 Cases In The United States

Affiliations
  1. Anirban Basu ([email protected]) is the Stergachis Family Endowed Director and Professor of Health Economics at the Comparative Health Outcomes, Policy, and Economics Institute in the School of Pharmacy, University of Washington, in Seattle.
PUBLISHED: Free Access https://doi.org/10.1377/hlthaff.2020.00455

Abstract

Knowing the infection fatality rate (IFR) of novel coronavirus (SARS-CoV-2) infections is essential for the fight against the coronavirus disease (COVID-19) pandemic. Using data through April 20, 2020, I fit a statistical model to COVID-19 case fatality rates over time at the US county level to estimate the COVID-19 IFR among symptomatic cases (IFR-S) as time goes to infinity. The IFR-S in the US was estimated to be 1.3 percent. County-specific rates varied from 0.5 percent to 3.6 percent. The overall IFR for COVID-19 should be lower when I account for cases where patients are asymptomatic and recover without symptoms. When used with other estimating approaches, my model and estimates can help disease and policy modelers obtain more accurate predictions for the epidemiology of the disease and the impact of various policy levers to contain the pandemic. The model could also be used with future pandemics to get an early sense of the magnitude of symptomatic infection at the population level before other direct estimates are available. Substantial variation across patient demographics likely exists and should be the focus of future studies.

Knowing the infection fatality rate (IFR) of novel coronavirus (SARS-CoV-2) infections is essential for the fight against the coronavirus disease (COVID-19) pandemic.1,2 A substantial amount of uncertainty in projecting the effects of the pandemic at the population level and the impact of public policies and directives, such as physical distancing measures, as well as the impact of potential future shortages of health care supply pivots around the uncertainty of this parameter. The IFR is the ratio of two numbers: the number of deaths caused by COVID-19 (numerator) and the total number of people in the population who were genuinely infected by the virus (denominator). However, for many reasons, both the numerator and the denominator of the IFR are measured with error. For example, errors in the denominator arise because patients remain asymptomatic during the first few days of the infection, testing is not universal and is selective at best, and longitudinal data on patients with COVID-19 are unavailable at the national level.3 Measurement errors may also exist in the numerator because of the undercounting of deaths due to social isolation and other factors and because some COVID-19-related deaths are attributed to other factors.4 As a consequence, the reported case fatality rate (CFR) for COVID-19, which is an estimate based on the reported number of COVID-19-related deaths and the reported number of cases that were laboratory confirmed as COVID-19 infections, provides a biased estimate of the IFR. It could be biased upward because the actual number of individuals who are infected is not known. It also could be biased downward because some of those who are currently infected could die in the future or because deaths are undercounted. The upward bias is likely to be much larger during the early phase of testing. Most estimates of the COVID-19 fatality rate currently available around the world suffer from these biases.5

In this article I try to overcome these biases using national US data on counts of reported deaths and detected COVID-19 cases and the temporality of the reported case fatality rate (that is, its variability over time) to make inferences about the infection fatality rate for COVID-19. My method does not account for the fraction of COVID-19 infections in which patients recover without any major symptoms. These asymptomatic patients do not contribute to any of the reported statistics on COVID-19 deaths and cases. A true IFR should include these patients in the denominator. However, in this article, because I try to eliminate measurement errors in reported CFRs based on trends in reported COVID-19 deaths and cases, I am unable to account for this fraction of the population that remains asymptomatic with infections. As a consequence, what I estimate is the IFR among symptomatic COVID-19 cases (IFR-S) or the “true” case fatality rate, where no reporting errors are present.

Study Data And Methods

Assumptions

I make three assumptions for this analysis. First, errors in the numerator and the denominator lead to underreporting of true COVID-19 deaths and cases, respectively, and the error is smaller for deaths than for cases. Second, both errors are declining over time. Finally, the error in the denominator is declining at a faster rate than the error in the numerator.

The first assumption is self-evident: both deaths and actual cases are undercounted during the initial phase of the epidemic.3,4 Because deaths are much more visible events than infections, which, in the case of COVID-19, can be asymptomatic during the first few days of infection, I posit that at any point in time, the errors in the denominator are larger than the errors in the numerator. Hence, this assumption leads to CFR estimates being larger than the IFR-S, which is typically believed to be true, according to observed data.

The second assumption is my central assumption, stating that under some stationary processes of care delivery, health care supply, and reporting, which are all believed to be improving over time, the errors in both the numerator and the denominator are declining. It implies that the measurement of both the numerator and the denominator is improving over time, albeit at different rates in different jurisdictions.

The third assumption posits that the error in the denominator is declining faster than the error in the numerator. This assumption implies that case fatality rates, based on the number of cumulative COVID-19 deaths and the number of cumulative reported COVID-19 cases, decline over time, a pattern that is confirmed by my observed data (described in detail below).

If these simple assumptions hold, these methods allow me to project the IFR at the limit when time goes to infinity and errors reduce to zero. That is not to say that, in practicality, I expect that in the future the US would ever reach a point of universal testing or comprehensive reporting of all COVID-19 deaths. However, as long as improvements occur in identifying all symptomatic cases and reporting all COVID-19 deaths, the errors will be reduced, enabling inferences about the IFR-S by fitting models to the temporality of the CFR and thereby projecting the expected rate at infinite time.

However, this stationary process of declining errors may be disrupted in certain regions and over certain periods by shortages of testing supplies, leading to artificial increases in the CFR after days of decline. I applied specific criteria to identify these regions and periods so that I could exclude these data from my analysis.

The online appendix contains a detailed mathematical formulation of these assumptions and their implications.6 I used an exponential decay formulation within a logit framework to model this decay and to estimate the asymptote parameter. In other words, given the fact that reported case fatality rates are observed to decline over time early in an epidemic, I was able to use statistical techniques to estimate the potential value of those reported rates as they approach their limit when the decline continues over a long time horizon. This rate at the limit presumably reflects the value of the IFR-S. More important, I allowed for heterogeneity in decay across US counties to estimate my target parameter. My estimate of IFR-S is less sensitive to the missing deaths that will occur in the future from the cases detected in the last two weeks of the limit (discussed in more detail in the Limitations section). I validated my predictions against the declining rates observed on later dates that were not used for estimation purposes.

Data

I used publicly reported data, hosted on GitHub, from both the Johns Hopkins Repository7 and the New York Times8 on the total number of cumulative deaths and detected cases by day for each US county. I filled in missing values from one repository using the nonmissing values from the other repository by date and county. Moreover, for any date and county, the maximal value reported for deaths or detected cases in either repository was used. A rate variable was constructed by dividing the cumulative total number of deaths by the cumulative total number of detected cases for each date and county. The first diagnosed case of COVID-19 in the US occurred January 21, 2020, and the first death occurred February 28, 2020, both in Washington State, although new data are showing that earlier cases may have existed in California.9 Because testing was nonexistent during the initial few days, the data showed that the ratio of deaths to cases increased for the first few days for many counties. Therefore, for each county, my analysis started from the day when the first zenith in this rate was reached. It is assumed that declining error rates within each county began from that day forward, driven by better reporting of COVID-19 deaths and cases. Only counties that had reported at least five COVID-19 deaths and thirty cases before April 20, 2020, were retained.
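The data-preparation steps described above can be sketched roughly as follows. This is a minimal illustration, not the author's code: the dictionary layout, the function names, and in particular the reading of "first zenith" as the first positive rate that is followed by a lower rate are all assumptions made for the example.

```python
def pick_max(a, b):
    """Return the larger non-missing value, or None if both are missing."""
    present = [v for v in (a, b) if v is not None]
    return max(present) if present else None


def merge_counts(source_a, source_b):
    """Merge two {(county, date): (deaths, cases)} maps (hypothetical layout).

    Missing entries are filled from the other source; when both sources
    report a value, the larger one is kept.
    """
    merged = {}
    for key in set(source_a) | set(source_b):
        deaths_a, cases_a = source_a.get(key, (None, None))
        deaths_b, cases_b = source_b.get(key, (None, None))
        merged[key] = (pick_max(deaths_a, deaths_b), pick_max(cases_a, cases_b))
    return merged


def cfr_series(daily):
    """Cumulative deaths / cumulative cases per day; None when undefined."""
    return [d / c if d is not None and c else None for d, c in daily]


def first_zenith_index(rates):
    """Index of the first positive rate that is followed by a lower rate.

    This is one possible reading of "first zenith" (an assumption).
    """
    for i in range(len(rates) - 1):
        cur, nxt = rates[i], rates[i + 1]
        if cur is not None and nxt is not None and cur > 0 and nxt < cur:
            return i
    return None
```

For a county whose daily (deaths, cases) pairs are (0, 2), (1, 4), (2, 5), (2, 10), (3, 30), the rate climbs to 0.4 on the third day and falls afterward, so the analysis window in this sketch would begin at index 2.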

Moreover, I was aware that sudden areawide shortages in testing kits could artificially raise the CFR after days of decline, and therefore bias my decay analysis. That is why I also removed counties that reported at least a one-standard-deviation increase in the CFR for seven or more days after reaching the CFR zenith. In addition, among the remaining counties, I removed the last seven days of follow-up if CFRs were found to increase consecutively for three or more days during that week. Last, I retained counties that had at least six follow-up days of reported data after reaching the zenith.
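As an illustration of the trailing-week rule only, the sketch below drops the final seven days of a county's CFR series when that week contains three or more consecutive day-over-day increases. The function names and the choice to count increases only between days inside the final week are assumptions; the text does not spell out these details.

```python
def longest_rising_run(rates):
    """Length of the longest run of consecutive day-over-day increases."""
    longest = run = 0
    for prev, cur in zip(rates, rates[1:]):
        run = run + 1 if cur > prev else 0
        longest = max(longest, run)
    return longest


def trim_last_week(rates, window=7, min_run=3):
    """Drop the final `window` days if they contain a rising run of `min_run` or more."""
    tail = rates[-window:]
    if len(rates) > window and longest_rising_run(tail) >= min_run:
        return rates[:-window]
    return rates
```

A series that declines steadily is returned unchanged, whereas one whose last week shows the CFR rising on three successive days loses that week.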

Statistical Model

I modeled these rates over time for each county, using a binomial model for the counts of deaths over the counts of detected cases; that is, $\text{Deaths}_{jt}$ are distributed as $\text{binomial}(p_{jt}, \text{Detected}_{jt})$, where $j$ denotes the counties and $t$ denotes the number of days from the zenith value of rate within a county (Days). The mean of this binomial model, $p_{jt}$, represents the probability of death and is expressed as a Bayesian random coefficients exponential decay model within a logit link framework, so that the predicted rates remain within 0 and 1. Specifically, my mean model estimated the probability of death in county $j$ at time $t$:

$$p_{jt} = \text{Logit}^{-1}\left(A_{1j} + (A_{2j} - A_{1j}) \times \exp\left(-\exp(A_{3j})\,(\text{Days}_{jt} - 1)\right)\right)$$

The $A_{ij}$ represent the actual death rates for specific counties.

The main feature of the decay model is that as time (Days) goes to infinity, under the assumption that errors in both the numerator and the denominator go to zero, the cumulative reported CFR would approach an estimate of the true IFR-S in the population. Specifically, in this model: $\text{Logit}^{-1}(A_{1j})$ = county-specific IFR-S, as Days goes to infinity; $\text{Logit}^{-1}(A_{2j})$ = county-specific expected zenith rate, when Days = 1; and $-\exp(A_{3j})$ = county-specific exponential decline rate in the CFR, parameterized such that it takes on negative values only.
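The behavior of the mean model at its two ends can be checked numerically. The short sketch below evaluates the decay curve for a single county; the parameter values (an asymptote of 1.3 percent, a zenith of 10 percent, and a decay rate of 0.1 per day) are hypothetical and chosen only for illustration.

```python
import math


def inv_logit(x):
    """Inverse logit: maps the real line to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Logit: maps (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def expected_cfr(day, a1, a2, a3):
    """Mean reported CFR on `day` (day 1 = zenith) for one county.

    a1: logit of the asymptotic rate (the IFR-S)
    a2: logit of the expected zenith rate
    a3: log of the exponential decay rate
    """
    decay = math.exp(-math.exp(a3) * (day - 1))
    return inv_logit(a1 + (a2 - a1) * decay)


# Hypothetical parameter values, for illustration only.
a1, a2, a3 = logit(0.013), logit(0.10), math.log(0.1)
curve = [expected_cfr(day, a1, a2, a3) for day in range(1, 61)]
```

On day 1 the decay term equals 1, so the curve starts at the zenith rate; as the number of days grows the decay term vanishes and the curve settles at the asymptote.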

The overall US-specific IFR-S can be expressed as $\text{Logit}^{-1}(b_1)$, where $A_{1j}$ is distributed as $\text{normal}(b_1, 1)$. Hyperpriors for coefficients were based on Cauchy distributions, as recommended in the Bayesian literature for logistic models.10 Prior sensitivity analyses were carried out based on using normal or uniform distribution for the hyperpriors. Further details about the model are in the appendix.6 I used the Metropolis-Hastings algorithm to estimate this model, using three simultaneous Monte Carlo chains and 10,000 deviates for each chain, 10,000 burn-in runs, and a thinning of 100.
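To give a sense of how a Metropolis-Hastings sampler with burn-in and thinning works, here is a deliberately simplified sketch. It is not the hierarchical model described above: it samples a single logit-scale death probability from binomial data with a Cauchy prior, and the prior scale of 2.5, the proposal step size, the run lengths, and all function names are assumptions made for the example.

```python
import math
import random


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def log_posterior(theta, deaths, cases, prior_scale=2.5):
    """Unnormalized log posterior of a logit-scale death probability."""
    log_p = -math.log1p(math.exp(-theta))    # log(inv_logit(theta))
    log_1mp = -math.log1p(math.exp(theta))   # log(1 - inv_logit(theta))
    log_lik = deaths * log_p + (cases - deaths) * log_1mp
    log_prior = -math.log1p((theta / prior_scale) ** 2)  # Cauchy(0, scale)
    return log_lik + log_prior


def metropolis_hastings(deaths, cases, n_draws=2000, burn_in=1000,
                        thin=5, step=0.3, seed=0):
    """Random-walk Metropolis-Hastings; returns thinned post-burn-in draws."""
    rng = random.Random(seed)
    theta = 0.0
    current = log_posterior(theta, deaths, cases)
    draws = []
    for i in range(burn_in + n_draws * thin):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, deaths, cases)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        if i >= burn_in and (i - burn_in) % thin == 0:
            draws.append(theta)
    return draws


# Hypothetical data: 13 deaths among 1,000 detected cases.
draws = metropolis_hastings(deaths=13, cases=1000)
posterior_mean = sum(inv_logit(t) for t in draws) / len(draws)
```

Running several such chains from different starting points or seeds, as the analysis does with three chains, makes it possible to check whether they settle on the same posterior distribution.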

I used data up to April 20, 2020, for my training sample to estimate my model. Model fit was assessed using posterior predictions from the model against four consecutive follow-up days for each county. For most counties, these days were April 21–24, 2020.

Limitations

There were several limitations to my analysis. First, I acknowledge that my estimate of the IFR-S would be higher than the true overall IFR. This is because my model relied on identified cases that are presumably all symptomatic patients with COVID-19. Therefore, even at the limit, my estimated rate would not include the fraction of patients who may have the infection but who remain asymptomatic and recover. My estimate would, however, include patients who start with an asymptomatic infection but become symptomatic later. The magnitude of the truly asymptomatic fraction in COVID-19 remains unclear: populationwide antibody testing would be needed to establish this statistic. Results from serotesting from the Diamond Princess cruise ship outbreak suggest that about 17.9 percent of infected people never developed symptoms.11 As a consequence, a reasonable estimate of the overall IFR would be about 20 percent lower than my estimated IFR-S.
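The adjustment from the IFR-S to an overall IFR is simple arithmetic. As a rough worked example, assuming the 17.9 percent asymptomatic fraction from the Diamond Princess applies to the US and using the 1.3 percent US estimate reported in the abstract, with $a$ denoting the asymptomatic fraction (a symbol introduced here for convenience):

```latex
\[
\text{IFR} \;=\; \frac{\text{deaths}}{\text{all infections}}
\;=\; \text{IFR-S} \times (1 - a)
\;\approx\; 0.013 \times (1 - 0.179)
\;\approx\; 0.0107 ,
\]
```

that is, an overall IFR of roughly 1.1 percent under these assumptions.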

Second, my estimated COVID-19 IFR-S may be slightly conservative. My approach was to control for the upward bias that would arise in this estimate if raw rates were used. I did not control for the downward bias that may arise because some of the detected cases may become deaths in the future. Recently, Nick Wilson and colleagues attempted to address this downward bias by estimating lagged death rates based on international data; the estimated effect ranged from 0.8 percent in China (excluding Hubei Province) to 4.2 percent in eighty-two other countries and territories.12 That analysis used a time lag of thirteen days,12 based on reported data from China on the time from radiologic confirmation of COVID-19 to death.13 However, as the distribution of time to death varies over this thirteen-day follow-up period, such a correction could give nonsensical results when applied to US data that include less than sixty days of COVID-19 history. The death rate was estimated to be higher than 1 on specific days for many US counties when such a correction was applied. In general, I believe that the downward bias generated as a result of the missing deaths at the limit should be small. This is because my estimate represents the death rate at an asymptote, which is what would happen with many days of accumulated data on the number of detected cases and deaths. At that point, the additional number of deaths from the last two weeks of detected cases would contribute very little to my overall estimate of the IFR-S.
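The way a fixed-lag correction can produce rates above 1 early in an epidemic can be seen in a small sketch. It assumes the lagged rate is computed as cumulative deaths on a given day divided by cumulative cases thirteen days earlier, and it uses synthetic counts that grow 40 percent per day with a constant 2 percent naive CFR; both the formula as written here and the numbers are illustrative assumptions.

```python
def lagged_cfr(cum_deaths, cum_cases, lag=13):
    """Cumulative deaths on day t divided by cumulative cases on day t - lag."""
    rates = []
    for t in range(len(cum_deaths)):
        if t < lag or cum_cases[t - lag] == 0:
            rates.append(None)  # no lagged denominator available yet
        else:
            rates.append(cum_deaths[t] / cum_cases[t - lag])
    return rates


# Synthetic early-epidemic counts (illustrative only).
cum_cases = [round(1000 * 1.4 ** t) for t in range(20)]
cum_deaths = [round(0.02 * c) for c in cum_cases]

naive = cum_deaths[-1] / cum_cases[-1]
lagged = lagged_cfr(cum_deaths, cum_cases)
```

With case counts growing this quickly, the denominator from thirteen days earlier is so small relative to current deaths that the lagged rate on the last day exceeds 1, even though the naive rate stays near 2 percent.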

Third, what I present here are crude IFR-Ss, and not even age-adjusted ones. I did not have any data to assess the distribution of IFR-S across age and comorbidity profiles of patients. One would need, ideally, individual-level data and, at the least, group-specific data to estimate such dispersion; these data are not publicly available.14,15 The Centers for Disease Control and Prevention reports significant variation in fatality rates by age groups.16 Further work is required on this front.

Study Results

Of 3,020 US counties, 1,364 counties reported any confirmed COVID-19 case by April 20, 2020. Of these counties with confirmed cases, 134 reported no COVID-19 deaths until that time; 1,034 counties had any reported COVID-19 deaths by April 20; and 397 counties reported exactly one COVID-19 death by this date. By April 20 there were 753,113 confirmed COVID-19 cases and 41,287 reported COVID-19 deaths. My analysis included 116 counties. Interestingly, I did not include New York County, New York (Federal Information Processing Standards code 36061, which does not represent all of New York City) in my analysis, despite its having the highest number of cases and deaths in the country. The number of deaths in this county was rising at a faster rate than the number of detected cases until April 20, 2020; hence, the case fatality rate had not reached a zenith. Overall, a total of 40,835 confirmed cases and 1,620 confirmed deaths until April 20 were used for my analysis (see the appendix).6

The 116 counties selected spanned 33 states, with Georgia contributing the maximum with 13 counties, followed by Louisiana with 9 and then South Carolina with 8. After reaching their initial zenith, CFRs were found to be declining within each of these retained counties, supporting my assumptions about the differential declining error rate between the numerator and denominator of the CFRs for these counties. The appendix contains a description of growth in COVID-19 reported cases and deaths and the decline in the CFRs.6 At the zenith of the computed rate in each county, the rate variable varied from 1.7 percent to 33.3 percent. By the end of follow-up, the rate varied from 0.9 percent to 19.3 percent across counties. The number of follow-up days ranged from seven to thirty-one (see the appendix).6

The Bayesian model showed good convergence and mixing properties. Gelman-Rubin statistics were below 1.1 for each of the parameters of the model, indicating that the three independent Monte Carlo chains overlapped and converged to similar posterior distributions for the parameters. The appendix presents these results,6 including residual analysis based on fitted posterior means (means predicted by the model for the period before the validation phase) from my prediction model, which appears to fit the county-level data well over time. The posterior mean of the US-specific IFR-S was estimated to be 1.3 percent (median, 1.3 percent; standard deviation: 0.4), with a 95% central credible interval of 0.6–2.1 (exhibit 1).
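For readers unfamiliar with the Gelman-Rubin statistic, the following sketch computes its basic form for one parameter from several chains of equal length. It compares between-chain and within-chain variance: values close to 1 suggest the chains agree, and a common rule of thumb treats values below about 1.1 as acceptable. This is the textbook formula, not necessarily the exact variant used in the analysis, and the function name is hypothetical.

```python
from statistics import mean, variance


def gelman_rubin(chains):
    """Basic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [mean(chain) for chain in chains]
    within = mean(variance(chain) for chain in chains)   # W
    between = n * variance(chain_means)                  # B
    pooled = (n - 1) / n * within + between / n          # pooled variance estimate
    return (pooled / within) ** 0.5
```

Chains drawn from the same distribution give a value near 1, whereas chains stuck around different values give a value well above 1.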

Exhibit 1 Estimated COVID-19 infection fatality rates among symptomatic patients (IFR-S) for the twenty counties examined with the lowest rates plus the US overall

SOURCE Author’s analysis of publicly available data on COVID-19 counts of cases and deaths. NOTE Point estimates are posterior means; bars are 95% central credible intervals.

The posterior means and the 95% central credible intervals of county-specific IFR-Ss for the twenty counties I examined with the lowest rates (0.5–1.4 percent) plus the overall values for the US and the twenty-one counties I examined with the highest rates (2.3–3.6 percent) are shown in exhibits 1 and 2, respectively. The IFR-S for other counties in the middle that are not shown ranged from 1.5 percent to 2.2 percent and are described in the appendix.6 The lowest rate was estimated to be in Putnam County, New York (0.5 percent; 95% central credible interval, 0.1–1.0), whereas the highest was estimated to be in King County, Washington (3.6 percent; 95% central credible interval, 0.5–6.1). Data at the county level are still evolving, and hence considerable uncertainty exists for some counties, especially toward the higher range of IFR-S estimates. Because these estimates represent the crude IFR-S, many factors contribute to their variation across counties, including demographics (especially age distribution), levels of population health, and supply of health care services. In that sense, the IFR-S is a dynamic quantity even within a county, depending on how the case-mix of the infected population shifts over time.

Exhibit 2 Estimated COVID-19 infection fatality rates among symptomatic patients (IFR-S) for the twenty-one counties examined with the highest rates

SOURCE Author’s analysis of publicly available data on COVID-19 counts of cases and deaths. NOTE Point estimates are posterior means; bars are 95% central credible intervals.

To assess the validity of my prediction model, I forecasted county-specific reported case fatality rates based on the posterior predictive mean over the course of four days after the estimation time window for each county and compared those rates with observed rates during these days.17 Exhibit 3 presents the comparison between predicted and observed case fatality rates, with each dot representing a rate for a county on a given day. The exhibit shows that the 95% central credible intervals from the posterior predictive distribution from the model were able to capture the true CFRs (represented by the 45-degree diagonal line) for all counties over these four days. The Bayesian posterior predictive two-sided p values18,19 were not below 0.05 for any of the 116 counties on any of the four days.

Exhibit 3 Predicted COVID-19 case fatality rates by county versus observed rates for the first four consecutive dates that were not used for estimation

SOURCE Author’s analysis of publicly available data on COVID-19 counts of cases and deaths. NOTES Results from Bayesian mixed-effects nonlinear model. Each symbol represents the point estimate of a case fatality rate (CFR) for a county on a given day; bars are 95% central credible intervals.

Discussion

After I modeled the available national data on cumulative deaths and detected COVID-19 cases in the United States, the symptomatic infection fatality rate from COVID-19 was estimated to be 1.3 percent. This estimated rate is substantially higher than the approximate IFR-S of seasonal influenza, which is about 0.1 percent20 (34,200 deaths among 35.5 million patients who got sick with influenza). Influenza is also believed to be completely asymptomatic in 16 percent of the infected population,21 and this fraction is not included in the calculation of its IFR-S.22 My COVID-19 IFR-S estimate is not outside the ballpark of estimates becoming available from other countries, but it is certainly lower, as is expected from addressing the upward bias in those estimates. For example, the COVID-19 fatality rate for China (without correction for the upward bias inherent in looking at observed rates) was initially reported to be 5.6 percent (95% CI, 5.4–5.8).23 By February 20, 2020, however, the crude fatality rate for China was estimated to be 3.8 percent.24 The fatality rate outside China was estimated to be 15.2 percent (95% CI, 12.5–17.9),23 which may be due to the more considerable upward bias during the beginning part of the pandemic within a country. The same patterns occur in the United States, with observed rates being much higher during the initial part of the pandemic. A recent estimate of the CFR using individual-level data from Wuhan residents and from international Wuhan residents who repatriated on six flights found it to range from 0.66 percent to 1.4 percent.25

Consider a thought experiment in which 35.5 million people contract COVID-19 this year in the US (that is, the same number as were infected with influenza last year).20 In the absence of any mitigation strategies or distancing behaviors and with the supply of health care services under typical conditions, my IFR-S estimate predicts that there would be nearly 500,000 COVID-19 deaths in the US in 2020. To the extent that COVID-19 is more infectious than influenza and that we have no protection in the form of a vaccine or treatment, the number of infections, and hence the number of deaths, would be higher compared to influenza. Certainly, with the implementation of mitigation strategies, the death toll will be lower. For example, the March 31 White House Coronavirus Task Force projections of 100,000–200,000 deaths from COVID-19 in 2020 were made using assumptions about the effectiveness of distancing directives and measures currently in place.26
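The figures behind this comparison can be reproduced with back-of-the-envelope arithmetic, using only the numbers quoted in the text (the rounded 1.3 percent IFR-S and the influenza counts):

```latex
\[
\text{influenza IFR-S} \approx \frac{34{,}200}{35{,}500{,}000} \approx 0.00096 \approx 0.1\ \text{percent},
\qquad
\text{COVID-19 deaths} \approx 35{,}500{,}000 \times 0.013 = 461{,}500 .
\]
```

The second figure is the "nearly 500,000" deaths of the thought experiment; it scales linearly with both the assumed number of symptomatic infections and the IFR-S.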

My estimated IFR-S applies under the assumption that the current supply (up until April 20) of health care services, including hospital beds, ventilators, and access to providers, would continue in the future. Constraints in the supply of health care services could surely increase the symptomatic infection fatality rate and the overall fatality rate. I hope that simulations to understand and forecast the effect of such shortages can be improved, using my estimates of IFR-S as the baseline.27

Similarly, my estimates of the COVID-19 IFR-S in the US can help disease and policy modelers obtain more accurate predictions for the epidemiology of the disease and the impact of alternative policy levers to contain this pandemic.

ACKNOWLEDGMENTS

Anirban Basu received compensation from Salutis Consulting LLC. No funding was received for this analysis. The author thanks Varun Gandhay for excellent research assistance for this work. He also thanks six anonymous reviewers and Donald Metz of Health Affairs for their excellent comments. The views expressed do not represent those of the University of Washington or the National Bureau of Economic Research. An unedited version of this article was published online May 7, 2020, as a Fast Track Ahead Of Print article. That version is available in the online appendix.
