Tuesday, June 2, 2020

Three Ways Not To Analyze COVID-19 Statistics

The COVID-19 pandemic and lockdowns continue to cause unprecedented devastation of everyday life in the United States--approximately 100,000 deaths, tens of millions unemployed, and countless plans, activities, and goals put on indefinite hold.

In this context, news outlets, politicians, and consumers are closely following the trends in the COVID-19 statistics, trying to answer the most pressing questions. Are things getting better or worse in the US? Have we succeeded in flattening the curve? Are the reopened states seeing the surge in new cases that many have feared?

These are important questions. Unfortunately, much of the reporting on the COVID-19 data obfuscates the underlying reality. In most cases, the problem is not that the reporting is literally false. But it typically focuses on the wrong metrics and fails to account for the severe limitations in the underlying data. The end result is that readers--and perhaps policymakers--come away with a more optimistic or pessimistic understanding than is actually warranted.

With that in mind, here are three errors to watch out for in discussions on COVID-19 data.

1. Focusing on the number of newly reported positive cases

This problem has become more common, particularly since some states have started to reopen. Here are some examples of recent headlines that commit this error:

Virginia Reports Highest One Day Increase in Coronavirus Cases After Gov. Ralph Northam Criticized For Not Wearing Mask - Newsweek, 5/25/2020

Texas sees highest single-day hike in coronavirus deaths, cases - Texas Statesman, 5/14/2020

Intuitively, it seems like the number and trend of newly confirmed COVID-19 cases must be an important number. But by itself, it doesn't tell us much at all. To properly understand it, we also need to know the number and trend in total COVID-19 tests conducted over the same period.

As an illustrative example, let's consider two random days of test results from Virginia. All results that follow are originally sourced from The Atlantic's COVID Tracking Project:

[Table: positive COVID-19 test results reported in Virginia on April 13 and May 25, 2020]

With these facts alone, it would appear that May 25 was a much worse day than April 13 for Virginia when it comes to the coronavirus. Over three times as many people were confirmed as positive. Surely, this must mean the virus was spreading wider and was more out of control on May 25--after the reopening--than it was in mid-April during the lockdown, right?

Well, not quite. When we add the context of the number of tests performed and the positivity rate (the rate of positive tests out of total test results reported), a very different picture emerges. See below:

[Table: the same two days, with total tests performed and positivity rate added]

From this, we can see a more compelling explanation for why positive tests on April 13 were so much lower--namely, far fewer tests were conducted.

Based on these figures, there's very good reason to believe the virus situation was actually worse on April 13. The high positivity rate suggests that Virginia was unable to test enough people at the time. So if the state had had the resources to test all suspected cases, the number of positives would likely have been much higher.

But if you only focus on the positive cases, this reality gets completely turned on its head.
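
For readers who want to run this kind of check themselves, here's a minimal sketch of the positivity-rate arithmetic. The daily counts below are invented for illustration (the real figures are available from the COVID Tracking Project); the point is only how the calculation reframes a raw positive count:

```python
def positivity_rate(positive: int, total_tests: int) -> float:
    """Share of reported test results that came back positive."""
    return positive / total_tests

# Hypothetical daily counts, not the actual Virginia data.
days = {
    "April 13": {"positive": 400, "total_tests": 2_000},
    "May 25": {"positive": 1_200, "total_tests": 12_000},
}

for day, counts in days.items():
    rate = positivity_rate(counts["positive"], counts["total_tests"])
    print(f"{day}: {counts['positive']} positives / {counts['total_tests']} tests "
          f"-> positivity {rate:.1%}")

# The day with three times the positives has half the positivity rate,
# because it ran six times as many tests -- the opposite of what the raw count implies.
```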

A similar version of this general error can be observed in many reports on record increases in daily cases. Confirmed cases are indeed continuing to rise throughout the US. But the good news is that in most places, the total number of tests is rising at an even faster clip.

2. Focusing on the percentage growth rate (or the doubling rate) of confirmed cases

A related analytical error gets made when media outlets report on the percentage growth rate of confirmed cases. Examples of this error in the wild can routinely be found in Bloomberg Radio news updates. Last week, they were reporting a daily increase in cases of around 1.1%, varying slightly from day to day.

For a print example, I offer this highly neutral take from Oregon's Willamette Week, published May 21 under the headline "A Rise in COVID-19 Cases in Deschutes County Tests Whether the State Will Close Bars, Restaurants Again. (So Far? No.)":
The number of COVID-19 cases in Deschutes County has increased over the last seven days. On Wednesday, the county reported nine cases—more cases than it has on any other single day. 
Those increases raise the question of whether the state will order the county to shut down the bars, restaurants and hair salons that reopened just six days ago...
A 5 percent increase in COVID cases is the benchmark the state set for reviewing the status of a county and possibly shuttering it again. [Health Researcher Numi Lee] Griffith pointed to a 27 percent increase in cases in Deschutes County during the week ending May 20. (emphasis added)
This article is interesting for a couple reasons. First, we see that it actually starts out by committing error #1, reporting a record increase of nine cases without providing information about the number of tests.

(Later on, the article even notes that many of the new cases were identified proactively through contact-tracing rather than simple symptomatic testing. If anything, that's actually a positive indication of the county's preparedness to mitigate the virus, not a cause for alarm.)

But I digress. The key points in the Willamette Week article are that a) Oregon has actually built this metric into its reopening guidelines and b) Deschutes would have violated it with a 27% increase.

The reason people tend to focus on the growth rate (or, in some cases, the days-to-doubling) is that we know the virus naturally spreads at an exponential rate. One person gives it to three people, who each give it to three more people, and so on.

In theory, the growth rate is useful because it could offer a window into how quickly the virus is spreading currently, and whether the curve has been sufficiently flattened.
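
As a point of reference, the relationship between a daily growth rate and a doubling time is simple exponential arithmetic. Here's a small sketch (not drawn from any of the reports above) that converts between the two; it shows that a 1.1% daily increase corresponds to a doubling time of roughly two months, while doubling every two to three days corresponds to daily growth of roughly 26% to 41%:

```python
import math

def doubling_time(daily_growth: float) -> float:
    """Days needed to double at a constant daily growth rate (0.011 means 1.1%)."""
    return math.log(2) / math.log(1 + daily_growth)

def daily_growth_from_doubling(days_to_double: float) -> float:
    """Implied constant daily growth rate if the count doubles every `days_to_double` days."""
    return 2 ** (1 / days_to_double) - 1

print(f"{doubling_time(0.011):.0f} days to double at 1.1% per day")          # ~63 days
print(f"{daily_growth_from_doubling(3):.0%} per day if doubling every 3 days")  # ~26%
print(f"{daily_growth_from_doubling(2):.0%} per day if doubling every 2 days")  # ~41%
```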

But here's the problem. One of the key features that makes COVID-19 harder to deal with is that many people who contract the virus experience no symptoms at all. And while this is not entirely proven, it's generally believed that these asymptomatic individuals are still contagious and thus contribute to the exponential spread of the disease.

The challenge is that testing capacity has been so limited that states have not been able to conduct the kind of widespread random testing that would be needed to identify all of the asymptomatic cases. The other way to plausibly identify all or most asymptomatic cases is through a robust contact-tracing system like that of South Korea or Taiwan. But the US's capabilities here are still limited. Instead, COVID-19 testing around the country has been prioritized for people with symptoms and healthcare workers.

The upshot of all this is that the growth rate is not a useful proxy for the thing we're actually trying to measure. What we want to know is the true rate of spread for the virus, in real-time. But due to testing limitations, the growth rate mostly reflects a) the growth rate in testing capacity and b) the growth rate in symptomatic patients.

This error actually cuts in both directions. Early on in the COVID-19 crisis in February--when the CDC was hard at work developing a faulty test and the FDA was simultaneously preventing others from creating a better one--the nation was testing virtually no one. So most metrics looked good.

Then in mid-March as testing capacity finally got built out, the number of positive cases quickly exploded. Positive cases were doubling every two to three days, as this chart shows:

[Chart: confirmed COVID-19 cases in the US, showing cases doubling every two to three days in March]

And then, starting in mid-April and continuing to the present, the growth rate and doubling rate slowed back down. Perhaps this can be partly explained by the voluntary precautions and the lockdowns. But clearly, the more important driver is this: While the virus may grow exponentially, US testing capacity does not.
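
To make that concrete, here's a toy simulation (every number invented) of an epidemic growing 20% per day while testing capacity grows only 3% per day. Once infections outstrip the number of tests that can be run, the confirmed-case count stops tracking the virus and starts tracking the testing supply:

```python
# Toy model -- all figures are invented for illustration.
true_infections = 1_000    # assumed active infections on day 0
daily_tests = 5_000        # assumed tests that can be run per day on day 0

for day in range(1, 31):
    true_infections *= 1.20   # assume infections grow 20% per day, unchecked
    daily_tests *= 1.03       # assume testing capacity grows only 3% per day
    # Confirmed cases on a given day can never exceed the tests actually run.
    confirmed = min(true_infections, daily_tests)
    if day % 10 == 0:
        print(f"day {day:2d}: true infections {true_infections:9.0f} | "
              f"tests {daily_tests:7.0f} | confirmed {confirmed:7.0f}")

# Once infections exceed testing capacity, the confirmed count grows at
# ~3% per day -- the growth rate of the tests, not of the virus.
```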

At each point in the process, including today, these metrics have not been meaningful in the US. In March, they offered a belated confirmation that the virus was already spreading widely. And now, they suggest that the virus is slowing down, in part because testing capacity can only grow so fast.

3. Citing the case fatality rate as a meaningful statistic

As its name implies, the case fatality rate (CFR) is calculated by taking the total number of deaths attributed to COVID-19 and dividing by the total number of confirmed cases. The calculation is straightforward, but the result is worse than useless in the case of COVID-19, as we'll see.

The most high-profile example of bad reporting on the CFR comes from the World Health Organization, whose director said this on March 3:
Globally, about 3.4% of reported COVID-19 cases have died. By comparison, seasonal flu generally kills far fewer than 1% of those infected.
This shockingly high 3.4% figure was used as one of the reasons to justify widespread lockdowns. And yet, the statement itself offers a clue about the problems with this metric.

In that quote, the WHO is comparing the then-calculated CFR of COVID-19 to the infection fatality rate (IFR) of seasonal influenza. These are not the same metric.

In effect, the CFR is what we can easily observe and calculate. The IFR is what we actually care about, but it's harder to determine. The difference between the two metrics is the denominator. The CFR divides by total confirmed cases, and the IFR divides by total infections.

Since confirmed cases are a subset of total infections, the CFR will always be higher than the IFR. This doesn't mean that COVID-19 is the same as the flu. But it does mean that comparing the CFR of one disease to the IFR of another is unlikely to provide useful information.
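
A quick sketch with invented numbers shows how far apart the two metrics can sit for the same outbreak:

```python
# Invented numbers -- the point is the denominators, not the magnitudes.
deaths = 1_000
confirmed_cases = 30_000      # infections that actually received a positive test
total_infections = 300_000    # assume only ~10% of infections were ever confirmed

cfr = deaths / confirmed_cases     # case fatality rate: the number we can observe
ifr = deaths / total_infections    # infection fatality rate: the number we care about

print(f"CFR = {cfr:.1%}")   # 3.3% -- the alarming headline number
print(f"IFR = {ifr:.1%}")   # 0.3% -- same deaths, same outbreak, much smaller figure
```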

To be fair, it's conceivable that the gap between the CFR and the IFR will not be significant for some diseases. If a disease were well understood and widespread testing were available, the number of confirmed cases would likely approximate the total number of infections, and the CFR would be close to the IFR. However, this is not remotely true for COVID-19 now, and it was even less true at the beginning of March.

For COVID-19, there have been testing shortages all over the world, with a few exceptions. As a practical matter, this meant that tests were generally prioritized for people with severe symptoms and healthcare workers. This prioritization was necessary to try to treat patients more effectively and reduce spread in the hospital environment. But it also compromises the value of a CFR calculated off the resulting data.

The first problem is selection bias. If you're primarily testing patients who already have severe symptoms, then the population of confirmed cases is skewed towards those who are going to have worse health outcomes from the disease. In turn, this will systematically push up the CFR.

A related problem is that limited testing obviously means the number of confirmed cases will be far lower than the total number of infections. By contrast, the COVID-19 death count, while imperfect, should at least be less understated. The reason is that some jurisdictions, like the US, now include "probable" cases of COVID-19 in the death counts, even without a confirmed test. Thus, although limited testing will effectively cap the number of confirmed cases reported, it does not cap the number of deaths reported. This reality will also tend to systematically inflate the CFR.

The final problem with the CFR occurs simply because COVID-19 is a new disease, and there's a significant time lag between when someone contracts the disease and when they might actually pass away as a result. At any given time, some percentage of the confirmed and still-active cases relates to individuals who will eventually die from the disease. This will cause the CFR to be artificially lower than reality (though the effect is diluted as the disease progresses over time).
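
Here's a rough sketch of that lag effect, again with invented numbers. If deaths reported today mostly trace back to the much smaller case count from about three weeks ago, the naive CFR computed from today's totals lands well below the eventual figure:

```python
# Invented numbers illustrating the reporting lag, not real data.
cases_today = 100_000
cases_three_weeks_ago = 25_000            # assume cases quadrupled over three weeks
eventual_fatality_among_confirmed = 0.02  # assume 2% of confirmed cases eventually die

# Deaths reported today mostly trace back to the smaller case count ~3 weeks ago.
deaths_reported_today = eventual_fatality_among_confirmed * cases_three_weeks_ago

naive_cfr = deaths_reported_today / cases_today
print(f"Naive CFR today: {naive_cfr:.2%}")   # 0.50%, a quarter of the eventual 2%
```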

As we can see, the distortions in the CFR are considerable. And while they point in different directions, there's every reason to believe that, on net, the CFR significantly overestimates the true lethality of COVID-19.

The problem is not that the CFR is literally false. The CFR for COVID-19 is being calculated correctly; it's just not a meaningful number.

"Follow the Data"

These days, it seems like we are constantly being told by pundits and politicians that we need to "follow the data" when it comes to COVID-19.

By itself, that's not bad advice. But too often, these people act as though the data provides a script. We just look at the data, put it in our model, and voila! enlightenment rains down upon us.

It would be nice if it worked that way. In reality, "The Data" doesn't tell us anything. People interpreting the data tell us about their conclusions, and they're not always right.
