Thursday, May 14, 2020

COVID-19 data collection: Garbage In, Garbage Out

Introduction

If you recall my 4/21 analysis of the NYC COVID-19 data (COVID-19 NYC Stats – Not What They Seem), you’d remember a graph showing the extent to which missing data has a major impact on the reported incident counts. Here's an updated version:


As you can see, the peak hasn’t moved from April 6th, but we’re still getting data for dates as far back as March! Here's the updated 7 day average report, which shows more clearly that a few days ago additional incidents were reported for every date going back as far as March 25th:


After seeing this analysis, Jon Asmundsson, editor of Bloomberg Markets, asked me if this holds for other regions.

This led me to take a look.

Garbage in

My first stop was the Our World in Data github repository. I forked the repository, imported my analysis code, extracted the historical reports and graphed the histories of the USA cases per day:


It’s nice to see that the 7 day average is dropping, but where are the new reports? There are no updates – no missing data! How can it be that the total cases for a given day are completely known the following day and never need to be updated? Given what we know about the NYC data, and given that the NYC data is a large part of the USA data, it can’t possibly be the case that on a given day they know exactly the number of cases the day before. Something odd must be going on.

So I entered an issue on the OWID COVID-19 github repository. I asked for clarification of how they generate the data. Edouard Mathieu, Data Manager at OWID, responded:
For confirmed cases and deaths, our data comes from the European Centre for Disease Prevention and Control (ECDC). We discuss how and when the ECDC collects and publishes this data here
Importantly, the ECDC follows a general rule of not changing past values in its data. If cases/deaths are reported with a lag—a general lag, as you described, or occasional ’blocks’ of new data–these new cases/deaths will be added on the date that the country reported them to the ECDC.
So OWID gets its data from the ECDC (the European Centre for Disease Prevention and Control), and the ECDC doesn’t record data by incident date; it records it by the date on which it receives the reports.

Further research showed that it’s not just the ECDC that collects COVID-19 data in this way. The Johns Hopkins University COVID-19 repository and the New York Times COVID-19 repository also record incidents by report receipt date instead of by incident date. So these databases, the major sources of data that people use for modeling, for planning disease responses and for reporting, are collecting the data by report date instead of by incident date.

I followed up with Edouard Mathieu. I asked him if he knew of any rationale for why the data was being collected this way. His impression was that governments try to give the most accurate view and record data based on incident date, updating history as needed. On the other hand, aggregators like WHO, ECDC and JHU are more concerned with ease of aggregation and stability of reported numbers, so they instead record data based on the reporting date.

I also contacted Lauren Gardner, Associate Professor, Department of Civil and Systems Engineering, Co-Director of Center for Systems Science and Engineering (CSSE), Johns Hopkins University. Professor Gardner and her team are responsible for the Johns Hopkins COVID-19 Dashboard. She agreed that there are issues with using reporting dates rather than incident dates, but unfortunately, that’s often all that’s available.

Garbage out?

What’s the big deal? Counting by report date instead of incident date essentially takes some percentage of the actual data and moves it later in time. One would expect this to flatten the curve: it should make the infection rate appear lower before the peak, make the peak appear later, and make the infection rate appear to drop off more slowly after the peak. Moreover, since sites will often report several days of data together, it also makes the data jumpier and thus harder to analyze.
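To make that expectation concrete, here is a minimal sketch that spreads each day's incidents over later days according to a fixed reporting delay. The curve and the delay fractions are made-up numbers chosen for illustration, not taken from any real data set, and the function name is hypothetical.

```python
def shift_to_report_dates(incident_counts, delay_weights):
    """Redistribute incident-date counts onto report dates.

    delay_weights[k] is the assumed fraction of a day's incidents
    that gets reported k days after the incident (weights sum to 1).
    """
    n_days = len(incident_counts) + len(delay_weights) - 1
    reported = [0.0] * n_days
    for day, count in enumerate(incident_counts):
        for lag, weight in enumerate(delay_weights):
            reported[day + lag] += count * weight
    return reported


# Toy epidemic curve by incident date (made-up numbers).
incident = [10, 20, 40, 80, 100, 80, 40, 20, 10]
# Assumed delays: 50% reported same day, 30% one day late, 20% two days late.
delays = [0.5, 0.3, 0.2]

reported = shift_to_report_dates(incident, delays)
peak_incident_day = incident.index(max(incident))
peak_report_day = reported.index(max(reported))

print("incident-date peak:", peak_incident_day, max(incident))
print("report-date peak:  ", peak_report_day, max(reported))
```

With a constant delay like this, the total number of cases is unchanged, but the report-date peak is lower and lands later than the incident-date peak. Real reporting delays are neither constant nor smooth, so actual data need not behave this neatly.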

The problem is that scientists are using these numbers to model the disease, the government is using these numbers to plan how to address the risks, and the media is reporting about the numbers. So it potentially reduces the accuracy of the models, interferes with planning and leads to hysterical media reports about irrelevant rising and falling of death counts.

How big is the effect, really? We can determine this for data sets where we have both the report dates and the incident dates. So I did this for the NYC data, backing out what it would look like if it were recorded by report date instead of by incident date. The raw daily cases are:


And the rolling 7 day averages are:


As you can see, the reporting date data is far noisier; so much so that the 7 day cycle I documented in Covid-19 NYC Stats - A Ray of Hope is obscured and the 7 day rolling window still shows substantial noise. For example, the report date based data exhibits a spike on May 11th of about 5,000 cases, far higher than the incident based data shows. The incident based data shows a slight restatement of the data going back to the third week of March.

Presumably NYC received a report a few days ago from a particular site, and that report relayed daily infections back through March that hadn’t yet been recorded. Report date based data then records this as a huge spike which never actually occurred. Because of these underlying data collection mechanisms, report date based data collection often gives a substantially distorted view from day to day.
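The mechanism can be sketched in a few lines. Assume we have a series of dated snapshots, each listing the case count known so far for every incident date; a report-date series is then just the change in the grand total from one snapshot to the next. The snapshot layout, the day labels and the numbers below are hypothetical and purely illustrative; this is not the code behind the graphs above.

```python
def report_date_series(snapshots):
    """Convert incident-date snapshots into a report-date series.

    snapshots: list of (report_day, {incident_day: count known so far}),
    ordered by report_day. Returns {report_day: cases first seen that day}.
    The first snapshot's value includes all backlog known at that point.
    """
    series = {}
    previous_total = 0
    for report_day, counts_by_incident_day in snapshots:
        total = sum(counts_by_incident_day.values())
        series[report_day] = total - previous_total
        previous_total = total
    return series


# Made-up snapshots: on day5 a late batch restates day1 through day3 upward.
snapshots = [
    ("day3", {"day1": 100, "day2": 90}),
    ("day4", {"day1": 105, "day2": 95, "day3": 80}),
    ("day5", {"day1": 150, "day2": 140, "day3": 120, "day4": 70}),
]

by_report_day = report_date_series(snapshots)
latest_by_incident_day = snapshots[-1][1]

print(by_report_day)
print(latest_by_incident_day)
```

In this toy example the report-date series shows 200 cases on day5, even though the latest incident-date view has no single day above 150. The late batch is booked entirely on the day it arrived.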

Another case in point is that the noise makes the peak appear to have occurred much earlier than it actually did.

Even after smoothing with the 7 day rolling window, report date data is overstating the post-peak number of cases by a substantial amount, sometimes by over 50%.
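For anyone who wants to repeat this comparison on their own data, here is a minimal sketch of a trailing 7 day average and of the relative overstatement between two aligned series. The function names are hypothetical, and the trailing (rather than centered) window is an assumption, since the post doesn't say which was used.

```python
def rolling_mean(values, window=7):
    """Trailing mean; early positions average over the shorter history available."""
    averaged = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


def overstatement(report_based, incident_based):
    """Fractional excess of report-date values over incident-date values.

    Returns None for days where the incident-date value is zero.
    """
    return [
        (r - i) / i if i else None
        for r, i in zip(report_based, incident_based)
    ]


# Tiny made-up check: 150 reported against 100 actual is a 50% overstatement.
print(rolling_mean([7, 14, 21], window=2))
print(overstatement([150.0, 90.0], [100.0, 90.0]))
```

Applying `rolling_mean` to both series first and then `overstatement` to the results gives the smoothed comparison described above.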

Surprisingly, the growth rate to the peak is higher rather than lower. This is presumably because reporting delays were greater when the data started being collected, leading to batches of reports coming in together at a later date.

Conclusions

During this pandemic, it’s great to see that organizations like OWID, the WHO and the ECDC, major news outlets, like The New York Times, and major universities, like Johns Hopkins University, are all collecting and aggregating data on COVID-19 cases and deaths and making this data publicly available. Without such aggregation, it would be very difficult to globally understand, analyze, and respond appropriately to the pandemic.

On the other hand, it’s unfortunate that they collect the data in a way that obscures the current state of the disease and makes analysis more difficult than it need be. It’s also disturbing that news agencies are reporting on these numbers as if they actually occurred on the reporting date, and governments may be taking action based on the same misconception.

I find it surprising that epidemiologists who make a career out of analyzing epidemics and pandemics would record the data in such a fashion. But, on the other hand, I suppose such work tends to be on a longer time scale and it’s only in the current pandemic that we needed accurate, up to date infection and death counts.

I’d hope that someone would take it upon themselves to collect and aggregate the data on an incident date basis. This would be a huge undertaking, but the longer this pandemic persists, the more important this becomes.