Tuesday, April 21, 2020

Covid-19 NYC Stats - Not What They Seem

I always say that before developing a model, it's important to look at and understand the data.  It's all too easy to jump into modeling and analysis first, and those analyses are likely to be flawed if one skips the all-important step of studying the data itself.

I'm not immune to this tendency - I fell into that trap in my 4/7 analysis of the NYC Covid-19 data!  Based on the data at that time (collected from the NYC COVID-19 data website), I suggested that the growth rate in the number of cases had already dropped to zero around 3/30.  That wasn't the case.

It's true that NYC posts a full, updated history each day, and they explicitly state that due to "delays in reporting, the most recent data may be incomplete".  But the latest report on its own doesn't show how large those delays are, so one can't correct for them. I had thought the reporting delays only explained the drop-off at the very end of the data, but it turns out they reach much further back than that.

How can we tell, if the earlier reports aren't posted? Well, it would be nice if they were easily accessible, but since the data is updated in a git repository, the previous reports are recorded and available.  It just takes some work to extract them.  So, I forked the above repository and did this work.  Here's one result:


Now you can see that some cases are taking quite a long time to be incorporated.  For example, the 4/15 report updated the number of cases as far back as 3/22!  And, while the 4/7 through 4/11 reports flatten out with a growth rate close to zero from 4/1 to 4/6, the 4/12 report changes that, showing the number of cases per day continuing to grow through 4/7!
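For anyone who wants to repeat this, here is a rough sketch of how the daily reports could be pulled out of the git history and compared. It isn't the exact code I used. The names repo_dir, csv_path, date_col and count_col are placeholders for whatever the repository actually contains, and the sketch assumes the dates in the file are ISO formatted (YYYY-MM-DD), the counts are plain integers, and the file exists at every commit that touched it.

```python
import csv
import io
import subprocess


def parse_log(log_text):
    """Parse `git log --format="%H %cI"` output into (commit, day) pairs, oldest first."""
    pairs = []
    for line in log_text.splitlines():
        if line.strip():
            commit, stamp = line.split(maxsplit=1)
            pairs.append((commit, stamp[:10]))  # keep only YYYY-MM-DD
    return pairs[::-1]  # git log lists the newest commit first


def read_snapshots(repo_dir, csv_path, date_col, count_col):
    """Return {report_day: {case_day: count}} for each committed version of csv_path."""
    def git(*args):
        result = subprocess.run(["git", "-C", repo_dir, *args],
                                capture_output=True, text=True, check=True)
        return result.stdout

    snapshots = {}
    for commit, day in parse_log(git("log", "--format=%H %cI", "--", csv_path)):
        rows = csv.DictReader(io.StringIO(git("show", f"{commit}:{csv_path}")))
        # If there are several commits on one day, the latest one wins.
        snapshots[day] = {row[date_col]: int(row[count_col]) for row in rows}
    return snapshots


def earliest_revision(previous, current):
    """Earliest case day whose count differs between two reports, or None.

    Assumes ISO date keys, so string order is chronological order.
    """
    changed = [day for day, count in previous.items()
               if current.get(day, count) != count]
    return min(changed) if changed else None
```

Calling earliest_revision on each pair of consecutive reports shows how far back each new report reaches.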

So how do we determine if we're past the peak?  If we had the individual reports from each reporter, we could simply look at how the number of cases changes at a given location.  That would be ideal.  Lacking that, we have to try to work around the fact that missing data makes it erroneously appear that the number of cases is dropping.

One possibility is to look at the mortality data instead.  The peak in the deaths per day should be about a week or two later than the peak in the number of people infected per day, so the signal will be late, but one would expect this data to be better reported.  Here it is (with 7-day averaging):

Even in the mortality data, some reports restate as much as two weeks of past data. So here too, it's unclear how long we have to wait before we can rely on the numbers.

Another possibility is to analyze the change in the peaks from report to report.  The idea is that if we've reached the peak, then all the future reports should show that same peak.  So we look at the date of the peak for each report. The longer that date stays the same, the more likely it is that it really was the peak.  So let's look at that for the raw data and the 7-day averages, for both the infections and the mortalities:

We can see here that the raw cases per day peak has been on 4/5 for about a week, and the mortality peak has held fast at 4/6.  But between the weekly cycle in the data and the missing reports, we might not trust that so much.  Looking at the rolling 7-day averages, we get 4/7 for the cases and 4/8 for the mortalities, but the mortality peak hasn't been maintained for very long.
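The peak tracking idea can be sketched roughly as follows. This is an illustration rather than the exact code behind the plots. It assumes each report is a dict mapping a case date to a count, that the reports are keyed by ISO-formatted report dates, and the function names are made up for this example.

```python
def peak_date(series):
    """Date with the highest count in one report (first one wins on ties)."""
    return max(series, key=series.get)


def peak_stability(reports):
    """Return (peak, n): the latest report's peak date and how many
    consecutive reports, counting back from the latest, agree on it.

    `reports` maps report date -> {case date: count}; ISO date keys are
    assumed so that sorting the keys gives chronological order.
    """
    peaks = [peak_date(reports[day]) for day in sorted(reports)]
    latest = peaks[-1]
    run = 0
    for peak in reversed(peaks):
        if peak != latest:
            break
        run += 1
    return latest, run
```

A long run means the peak date has survived many restatements. A short run means it could still move.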

So, maybe we've already reached the peak, at least in reported cases per day.  At least, I hope we have.  But without better data, it's hard to say.  It's unfortunate that the full data set isn't readily available, so we're left guessing.  And, as is usually the case, it's the data...


Tuesday, April 7, 2020

Covid-19 NYC Stats - A Ray of Hope

Here are the reported Covid-19 cases/Day in NYC that can be seen on the NYC COVID-19 data website, updated on 4/7:


To get a better idea of the trend, given that the data appears to have a 7-day seasonality, we take the trailing 7-day average & get:


Now the trends are pretty clear. Up to around 3/16, the data exhibits exponential growth. Then the growth rate starts dropping, as people started working from home and practicing social distancing. At that point, the growth looks like it became linear.
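For reference, the trailing 7-day average is simple to compute. Here's a minimal sketch. The function name is just for illustration, and it assumes the input is a plain list of daily counts with no missing days.

```python
def trailing_average(values, window=7):
    """Average of each day and the (window - 1) days before it.

    The first (window - 1) days are skipped because they don't have a
    full window of history behind them.
    """
    averages = []
    for i in range(window - 1, len(values)):
        averages.append(sum(values[i - window + 1 : i + 1]) / window)
    return averages
```

Because every window covers exactly one full week, a repeating weekly pattern (for example, fewer cases reported on weekends) averages out and only the underlying trend is left.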

Once official social distancing efforts started, the slope dropped. From around 3/30 to 4/4, starting 7 days after NY went on PAUSE, we see the cases per day leveled off. For the last few days the trailing average number of cases per day has been dropping.

So, according to this data, it looks like social distancing broke the exponential growth, making the growth linear, and PAUSE has flattened the growth.
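One rough way to check the exponential versus linear distinction numerically, rather than by eye: under exponential growth the ratio between consecutive days stays about constant, while under linear growth the difference stays about constant. The sketch below is only an illustration (the function name is made up, and it assumes the values are positive, e.g. the smoothed cases per day).

```python
def day_over_day(values):
    """Return (differences, ratios) between consecutive daily values.

    Roughly constant ratios suggest exponential growth; roughly constant
    differences suggest linear growth. Values are assumed to be positive.
    """
    pairs = list(zip(values, values[1:]))
    diffs = [b - a for a, b in pairs]
    ratios = [b / a for a, b in pairs]
    return diffs, ratios
```

For a series that doubles every day, the ratios all come out as 2.0 while the differences keep growing. For a series that adds a fixed amount each day, the differences are constant and the ratios drift down toward 1.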

While the last few days show a welcome drop in the daily number of reported cases, this cannot be trusted because there's a substantial time lag for the data to be reported.  For example, the latest update revised many of the previously reported numbers.  Nonetheless, the trends calculated with the previously reported data look pretty similar to those calculated today.

Of course, it's also possible that the number of reported cases per day is dropping because people know the hospitals are overloaded. If they're not too sick, they might be avoiding the hospitals, and potentially not be counted. Hopefully, this is not the case, and the new cases/day stays low enough for long enough that the hospitals (and everyone else) can recover.

So, there's a ray of hope (at least in NYC) that social distancing is working, but we'll have to keep it up and see how things proceed.