Monday, December 21, 2020

COVID-19 in NYC - Analysis of Latest Data

From my latest NYC analysis (cell 12 in https://github.com/hjstein/coronavirus-data/blob/master/Notebooks/Current.ipynb, updated this morning), it looks like the NYC growth rate slowed substantially on December 5th, about a week and a half after Thanksgiving:


The growth rate is close to zero, but is still positive.  And we now have almost 225 hospitalizations/day and close to 30 deaths/day.

As of this morning, 12/21, the reports only go through 12/17, so the numbers only look accurate up to around 12/11, at which point we were up to ~2,600 infections/day (on a 7-day rolling average basis).  I expect the more recent numbers could get to over 2,800:


Cell 14 shows that we're experiencing over 220 hospitalizations/day now, and will probably breach 225.


Extrapolating based on cell 16, it looks like we might have surpassed 30 deaths/day:


(For an explanation of the above graphs, see previous blog posts.)

It makes sense that after the big events of the fall (Halloween, Thanksgiving, etc.), we'd see a large increase in infections.  After each of these events, I had expected infection rates to come back down about one to two weeks later.  What we've seen instead is that after each of these events, the levels have stabilized but not dropped.

In retrospect, this makes sense.  Consider R, which is the number of people that an infected person infects.  When R>1, the number of cases will rise over time, and when R<1, the number of cases will drop.  Through July & August, the cases per day were flat, indicating an R of about 1.  Each of these events leaves a larger percentage of the population infectious, so if the mitigation actions (lock downs, distancing, masks, ...) aren't changed, then you have a larger infectious population with the same R=1 value.  The numbers stay at the new, higher level, and won't drop until additional actions are taken, like identifying and isolating cases, or stopping the gatherings that lead to the most infections.
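To make this concrete, here's a minimal sketch (in Python, with made-up numbers) of why a one-time jump in infections persists when R stays at 1:

    # Toy discrete-generation model: next generation's cases = R * this generation's.
    # All numbers are made up for illustration.

    R = 1.0          # effective reproduction number, held constant
    cases = 250.0    # cases per generation before the event

    history = []
    for generation in range(10):
        if generation == 3:      # a holiday gathering causes a one-time 60% jump
            cases *= 1.6
        history.append(cases)
        cases *= R               # with R = 1, each new level simply persists

    print([round(c) for c in history])
    # [250, 250, 250, 400, 400, 400, 400, 400, 400, 400]

The bump raises the plateau, and with R = 1 nothing brings it back down; only pushing R below 1 would do that.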

Monday, November 2, 2020

COVID-19 Second Wave in NYC? Yes and No

Since September, NYC COVID-19 infection rates have been mostly increasing.  This has led to a spike in news articles, such as New York City sees 'very worrisome' spike in coronavirus infection rate.  How worrisome is it, really?  Let's look at the data.

Here's the 7-day rolling average of the number of cases per day since the drop-off of the first wave in June:


In the beginning of June, we see the tail end of the first wave's drop-off in the number of cases per day.  This is the impact of the mitigating measures taken at that time, such as the lock-down, social distancing and the use of masks.  This left us with about 325 cases per day.  There was a small increase in the number of cases per day in mid-July, peaking around 375, after which the number of cases per day dropped back down to under 250.

Starting in September, we see a rise in the number of cases per day.  It has steadily increased, peaking at 550 cases per day on October 5th.  Subsequently, it dropped down to 450 cases per day and started rising again, hitting 550 cases per day a second time on October 29th.  Given that this last peak was less than a week ago, we can expect it to rise higher as additional reports for that date arrive (for issues with delayed data, see Covid-19 NYC Stats - Not What They Seem).

People are calling this rise since September a "worrisome second wave", as exemplified in the article cited above.  But, to put this rise in perspective, we need to compare the data leading up to the latest peak to the data leading up to the first peak.

Here's the 7-day rolling average number of new cases per day leading up to the new peak:


And here's the same for the data leading up to the first peak:

As you can see, for the recent peak, it's taken about 4 weeks to double.  We were at 250 cases per day at the beginning of September and didn't hit 550 cases per day until the end of the first week of October.

Compare this to the beginning of the pandemic.  On March 10th, there were almost no cases per day.  Over the course of one week, the number of cases per day jumped up to 1,000.  Two days later we were experiencing 2,000 cases per day.  Three days later we hit 3,000 cases per day, and 5 days after that we were seeing 4,000 cases/day.  The peak was over 5,000 cases per day.
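To put numbers on the comparison, the doubling time can be estimated from two points on the curve, assuming exponential growth between them.  A quick sketch, using the figures above:

    import math

    def doubling_time(days_elapsed, start_count, end_count):
        """Doubling time in days, assuming exponential growth between the two counts."""
        return days_elapsed * math.log(2) / math.log(end_count / start_count)

    # Recent rise: ~250 cases/day at the start of September, ~550 about 5 weeks later.
    print(round(doubling_time(35, 250, 550), 1))   # ~30.8 days

    # First wave: ~1,000 cases/day jumping to ~2,000 two days later.
    print(round(doubling_time(2, 1000, 2000), 1))  # 2.0 days

Roughly a month to double now, versus two days in March.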

So, compared to the first wave, the recent peak is more of a ripple than a wave. The rise through September is probably due to the school openings and easing of general restrictions.  Similarly, it would not be surprising if the rise at the end of October is from people preparing for and celebrating Halloween.  If this is the case, we should see the counts subsequently dropping in November and rising again around Thanksgiving.

While we still have to be careful, it should not be unexpected that between school openings, restaurant openings and holidays, we'd see some additional growth.  But it looks like the social distancing measures are continuing to do their work in keeping the number of cases per day reasonably low.  In general, we should expect the number of cases per day to rise somewhat and then stabilize at new levels as social distancing measures are eased.  And we should expect that rates will rise as we approach commonly celebrated holidays and subsequently fall off.

Should we call this rise a second wave that needs additional mitigating steps?  It looks to me like the recent rise of infections per day in NYC is to be expected, and is a ripple, not a wave.  The only big second wave we're seeing in NYC is in the media's portrayal of the ripple.


Thursday, May 14, 2020

COVID-19 data collection: Garbage In, Garbage Out

Introduction

If you recall my 4/21 analysis of the NYC COVID-19 data (COVID-19 NYC Stats - Not What They Seem), you'll remember a graph showing the extent to which missing data distorts the reported incident counts. Here's an updated version:


As you can see, the peak hasn't moved from April 6th, but we're still getting data for dates as far back as March! Here's the updated 7-day average report, which shows more clearly that a few days ago additional incidents were reported for every date going back as far as March 25th:


After seeing this analysis, Jon Asmundsson, editor of Bloomberg Markets, asked me if this holds for other regions.

This led me to take a look.

Garbage in

My first stop was the Our World in Data github repository. I forked the repository, imported my analysis code, extracted the historical reports and graphed the histories of the USA cases per day:


It's nice to see that the 7-day average is dropping, but where are the new reports? There are no updates – no missing data! How can it be that the total cases for a given day are completely known the following day and never need to be updated? Given what we know about the NYC data, and given that the NYC data is a large part of the USA data, it can't possibly be the case that on a given day they know exactly the number of cases the day before. Something odd must be going on.
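The check for revisions is itself simple: diff two successive snapshots of the data and look for past dates whose counts changed. Here's a minimal sketch of the kind of comparison I mean; the column names ("date", "location", "new_cases") are assumptions about the snapshot layout, not a documented schema:

    import pandas as pd

    def restated_dates(older_csv, newer_csv, location="United States"):
        # Load two successive daily snapshots of the cases-per-day table.
        old = pd.read_csv(older_csv, parse_dates=["date"])
        new = pd.read_csv(newer_csv, parse_dates=["date"])
        old = old[old.location == location].set_index("date").new_cases
        new = new[new.location == location].set_index("date").new_cases
        # Dates present in both snapshots whose counts were revised.
        shared = old.index.intersection(new.index)
        return shared[old.loc[shared] != new.loc[shared]]

For the NYC data, this sort of diff turns up revisions stretching weeks back; for the OWID data, it comes up empty.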

So I entered an issue on the OWID COVID-19 github repository. I asked for clarification of how they generate the data. Edouard Mathieu, Data Manager at OWID, responded:
For confirmed cases and deaths, our data comes from the European Centre for Disease Prevention and Control (ECDC). We discuss how and when the ECDC collects and publishes this data here
Importantly, the ECDC follows a general rule of not changing past values in its data. If cases/deaths are reported with a lag—a general lag, as you described, or occasional 'blocks' of new data—these new cases/deaths will be added on the date that the country reported them to the ECDC.
So, OWID gets their data from the ECDC, and the ECDC doesn't collect data by incident date; it collects the data by the date on which it receives the reports.

Further research showed that it’s not just the ECDC that collects COVID-19 data in this way. The Johns Hopkins University COVID-19 repository, and the New York Times COVID-19 repository also record instances by report receipt date instead of by incident date. So these databases, the major sources of data that people use for modeling, for planning disease responses and for reporting, are collecting the data by report date instead of by incident date.

I followed up with Edouard Mathieu. I asked him if he knew of any rationale for why the data was being collected this way. His impression was that governments try to give the most accurate view and record data based on incident date, updating history as needed. On the other hand, aggregators like WHO, ECDC and JHU are more concerned with ease of aggregation and stability of reported numbers, so they instead record data based on the reporting date.

I also contacted Lauren Gardner, Associate Professor, Department of Civil and Systems Engineering, Co-Director of Center for Systems Science and Engineering (CSSE), Johns Hopkins University. Professor Gardner and her team are responsible for the Johns Hopkins COVID-19 Dashboard. She agreed that there are issues with using reporting dates rather than incident dates, but unfortunately, that’s often all that’s available.

Garbage out?

What's the big deal? Counting by report date instead of by incident date essentially takes some percentage of the actual data and moves it later in time. One would expect this to flatten the curve. As a result, it should make the infection rate appear lower before the peak, make the peak appear later, and make the infection rate appear to drop off more slowly after the peak. Moreover, since sites will report a number of days together, it also makes the data jumpier and thus harder to analyze.
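Here's a toy illustration of that flattening: take a synthetic epidemic curve, push a fraction of each day's incidents to later report dates, and compare. The delay distribution is made up:

    import numpy as np

    days = np.arange(60)
    incidents = 1000 * np.exp(-((days - 30) / 10.0) ** 2)   # synthetic epidemic curve

    delay_probs = [0.5, 0.2, 0.1, 0.1, 0.05, 0.05]          # P(reported d days late), made up
    reported = np.zeros(len(days) + len(delay_probs))
    for d, p in enumerate(delay_probs):
        reported[d : d + len(days)] += p * incidents         # shift fraction p by d days
    reported = reported[: len(days)]

    print(days[np.argmax(incidents)], round(incidents.max()))  # true peak: day 30, 1000
    print(days[np.argmax(reported)], round(reported.max()))    # reported peak: day 31, ~978

The reported curve peaks later and lower, exactly the distortion described above.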

The problem is that scientists are using these numbers to model the disease, the government is using these numbers to plan how to address the risks, and the media is reporting about the numbers. So it potentially reduces the accuracy of the models, interferes with planning and leads to hysterical media reports about irrelevant rising and falling of death counts.

How big is the effect, really? We can determine this for data sets where we have both the report dates as well as the incident dates. So I did this for the NYC data, backing out what it would look like if it were recorded by report date instead of by incident date. The raw daily cases are:


And the rolling 7-day averages are:


As you can see, the report-date data is far noisier; so much so that the 7-day cycle I documented in Covid-19 NYC Stats - A Ray of Hope is obscured, and the 7-day rolling window still shows substantial noise. For example, the report-date-based data exhibits a spike on May 11th of about 5,000 cases, far higher than anything in the incident-based data. The incident-based data shows only a slight restatement of the data going back to the third week of March.

Presumably NYC received a report a few days ago from a particular site, and that report relayed daily infections back through March that hadn't yet been recorded. Report-date-based data then records this as a huge spike which never actually occurred. Because of these underlying data collection mechanisms, report-date-based data often gives a substantially distorted view from day to day.
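The backing-out itself is simple once you have the archived snapshots: the total cases in one day's snapshot, minus the total in the previous day's snapshot, is what a report-date series would have shown for that day. A sketch, assuming each snapshot is a series of case counts indexed by incident date:

    import pandas as pd

    # snapshots: {report_date: Series of cases per day indexed by incident date},
    # one per archived daily report (layout assumed for illustration).
    def report_date_series(snapshots):
        totals = pd.Series(
            {report_date: s.sum() for report_date, s in sorted(snapshots.items())}
        )
        # Cases attributed to each report date = growth in the grand total,
        # regardless of which incident dates the new reports actually cover.
        return totals.diff().dropna()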

The noise also makes the peak appear to have occurred much earlier than it actually did.

Even after smoothing with the 7-day rolling window, the report-date data overstates the post-peak number of cases by a substantial amount, sometimes by over 50%.

Surprisingly, the growth rate to the peak is higher rather than lower. This is presumably because reporting delays were greater when the data started being collected, leading to batches of reports coming in together at a later date.

Conclusions

During this pandemic, it’s great to see that organizations like OWID, the WHO and the ECDC, major news outlets, like The New York Times, and major universities, like Johns Hopkins University, are all collecting and aggregating data on COVID-19 cases and deaths and making this data publicly available. Without such aggregation, it would be very difficult to globally understand, analyze, and respond appropriately to the pandemic.

On the other hand, it's unfortunate that they collect the data in a way that obscures the current state of the disease and makes analysis more difficult than it needs to be. It's also disturbing that news agencies are reporting on these numbers as if the incidents actually occurred on the reporting date, and governments may be taking action based on the same misconception.

I find it surprising that epidemiologists, who make a career out of analyzing epidemics and pandemics, would record the data in such a fashion. But, on the other hand, I suppose such work tends to be on a longer time scale, and it's only in the current pandemic that we've needed accurate, up-to-date infection and death counts.

I’d hope that someone would take it upon themselves to collect and aggregate the data on an incident date basis. This would be a huge undertaking, but the longer this pandemic persists, the more important this becomes.

Tuesday, April 21, 2020

Covid-19 NYC Stats - Not What They Seem

I always say that before developing a model, it's important to look at and understand the data.  It's all too easy to jump into modeling and analyses before doing so, and those analyses are likely to be flawed if one ignores the all important step of studying and understanding the data itself.

I'm not immune to this tendency - I fell into that trap in my 4/7 analysis of the NYC Covid-19 data!  According to the data at that time (collected from the NYC COVID-19 data website), I suggested that the growth rate in the number of cases had already dropped to zero around 3/30.  That wasn't the case.

It's true that NYC posts a full, updated history each day, and they explicitly state that due to "delays in reporting, the most recent data may be incomplete".  But the latest data report doesn't allow one to account for the delays in reporting. I had thought the drop-off at the end of the data was due to reporting delays, but it turns out the delays are far larger than that.

How can we tell, if the history isn't posted? Well, it would be nice if the history was easily accessible, but since the data is updated in a git repository, the previous reports are recorded and available.  It just takes some work to extract them.  So, I forked the above repository and did this work.  Here's one result:


Now you can see that some reports take quite a while to be incorporated.  For example, the 4/15 report updated the number of cases as far back as 3/22!  And, while the 4/7 through 4/11 reports flatten out with a growth rate close to zero from 4/1 to 4/6, the 4/12 report changes that, showing the number of cases per day continuing to grow through 4/7!
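Extracting the old reports is mechanical once you have a clone of the repository: walk the commit history of the data file and read each version.  A sketch of the approach; the file name is an assumption about the repository layout:

    import io
    import subprocess
    import pandas as pd

    def snapshot_history(repo_dir, path="case-hosp-death.csv"):
        # One "%H %cs" pair per commit touching the file: commit hash and date.
        log = subprocess.run(
            ["git", "-C", repo_dir, "log", "--format=%H %cs", "--", path],
            capture_output=True, text=True, check=True,
        ).stdout.split()
        snapshots = {}
        for sha, day in zip(log[::2], log[1::2]):
            # Read the file as it existed at that commit.
            blob = subprocess.run(
                ["git", "-C", repo_dir, "show", f"{sha}:{path}"],
                capture_output=True, text=True, check=True,
            ).stdout
            snapshots[day] = pd.read_csv(io.StringIO(blob))
        return snapshots

Each entry in the result is one day's published report, which is exactly what's needed to see how the history gets restated.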

So how do we determine if we're past the peak?  If we had the individual reports from each reporter, we could simply look at how the number of cases changes at a given location.  That would be ideal.  Lacking that, we have to try to work around the fact that missing data makes it erroneously appear that the number of cases is dropping.

One possibility is to look at the mortality data instead.  The peak in the deaths per day should be about a week or two later than the peak in the number of people infected per day, so the signal will be late, but one would expect this data to be better reported.  Here it is (with 7-day averaging):

Even in the mortality data, some reports restate as much as two weeks of the past. So it's also unclear how long we have to wait before we can rely on the numbers.

Another possibility is to analyze the change in the peaks from report to report.  The idea is that if we've reached the peak, then all the future reports should show that same peak.  So we look at the date of the peak for each report.  The longer that date remains the same, the more likely it is that that date was the peak.   So let's look at that for the raw data and the 7-day averages, for both the infections and the mortalities:

We can see here that the raw cases-per-day peak has been on 4/5 for about a week, and the mortality rate peak has held fast at 4/6.  But between the cyclicity of the data and the missing reports, we might not trust that so much.  Looking at the rolling 7-day averages, we get 4/7 and 4/8, but the mortality peak hasn't been maintained for very long.
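Given the snapshot history extracted above, the peak-tracking itself is only a few lines.  A sketch, with column names that are assumptions about the NYC csv layout:

    # For each archived report, find the incident date with the most cases;
    # the peak becomes credible once that date stops moving in later reports.
    def peak_dates(snapshots, value_col="NEW_COVID_CASE_COUNT",
                   date_col="DATE_OF_INTEREST", window=7):
        peaks = {}
        for report_day, df in sorted(snapshots.items()):
            smoothed = df.set_index(date_col)[value_col].rolling(window).mean()
            peaks[report_day] = smoothed.idxmax()   # incident date of the peak
        return peaks

If the value for the latest reports stops changing, the peak has probably been reached.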

So, maybe we've already reached the peak, at least in reported cases per day.  At least, I hope we did.  But without better data, it's hard to say.  It's unfortunate that the full data set isn't readily available, so we're left guessing.  And, as is usually the case, it's the data...


Tuesday, April 7, 2020

Covid-19 NYC Stats - A Ray of Hope

Here are the reported Covid-19 cases/day in NYC that can be seen on the NYC COVID-19 data website, updated on 4/7:


To get a better idea of the trend, given that the data appears to have seasonality on a 7 day period, we average the last 7 days & get:


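For reference, the smoothing is just a trailing 7-day mean.  A minimal sketch of the computation; the file and column names are assumptions about the NYC data layout:

    import pandas as pd

    cases = pd.read_csv("case-hosp-death.csv", parse_dates=["DATE_OF_INTEREST"])
    smoothed = (cases.set_index("DATE_OF_INTEREST")["NEW_COVID_CASE_COUNT"]
                     .rolling(7).mean())   # each day averaged with the 6 before it
    smoothed.plot(title="NYC cases/day, 7-day trailing average")

A trailing (rather than centered) window is what lets the average run right up to the latest report.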
Now the trends are pretty clear. Up to around March 16th, the data exhibits exponential growth. Then the growth rate starts dropping, as people started working from home and practicing social distancing. At that point, the growth looks like it became linear.

Once official social distancing efforts started, the slope dropped. From around 3/30 to 4/4, starting 7 days after NY went on PAUSE, we see that the cases per day leveled off. For the last few days, the trailing average number of cases per day has been dropping.

So, according to this data, it looks like social distancing broke the exponential growth, making the growth linear, and PAUSE has flattened the growth.

While the last few days show a welcome drop in the daily number of reported cases, this cannot be trusted because there's a substantial time lag for the data to be reported.  For example, the latest update revised many of the previously reported numbers.  Nonetheless, the trends calculated with the previously reported data look pretty similar to those calculated today.

Of course, it's also possible that the number of reported cases per day is dropping because people know the hospitals are overloaded. If they're not too sick, they might be avoiding the hospitals, and potentially not be counted. Hopefully, this is not the case, and the new cases/day stays low enough for long enough that the hospitals (and everyone else) can recover.

So, there's a ray of hope (at least in NYC) that social distancing is working, but we'll have to keep it up and see how things proceed.

Monday, August 10, 2015

College Costs

I just saw the article by Laura Meckler and Josh Mitchell about Clinton's approach to fixing higher education:

http://www.wsj.com/articles/hillary-clinton-proposes-debt-free-tuition-at-public-colleges-1439179200

More details are available at:

http://www.bloomberg.com/politics/articles/2015-08-10/hillary-clinton-to-outline-350-billion-college-affordability-pitch

Clinton follows the usual misguided governmental problem-solving game plan, namely throwing money at the problem. In her case, it's throwing $350 billion of federal funding over 10 years at public institutions, with the requirement that universities control spending. I can't find any details on Clinton's website (https://www.hillaryclinton.com), which is much bigger on collecting emails & donations than on describing her platform.  Maybe it's on her home server. However, Inside Higher Ed has an article with a link to some documents describing the plan in more detail:

https://www.insidehighered.com/news/2015/08/10/clinton-proposes-350-billion-plan-make-college-affordable

In any case, Clinton's not the only candidate taking a crack at education.  Sanders wants to spend on it, Rubio wants to allow federal funds to be spent outside of traditional colleges, and Paul wants to offer tax breaks.

But no one is seriously addressing bloated college costs.  And yes, they're bloated.  Take a look at the numbers:

http://trends.collegeboard.org/college-pricing/figures-tables/tuition-fees-room-board-time-1974-75-2014-15-selected-years

Why should tuition and fees have grown from $17,000 in 1990 to $22,000 in 2000, to $31,000 in 2015?  And that's in inflation-adjusted terms!  Have energy costs sky-rocketed?  Are professors' salaries through the roof?  No, but administrative costs are:

http://www.washingtonmonthly.com/magazine/septemberoctober_2011/features/administrators_ate_my_tuition031641.php?page=all

Here are some specific cases, such as UC management (http://ucbfa.org/2013/01/uc-management-bloat-updated/):


And North Dakota university system bloat (http://watchdog.org/221936/north-dakota-university-2/):


Clinton talks about only giving these funds to colleges that reduce costs.  Take a look at the details and see if you think the proposed federal regulations will effectively reduce costs.  And, for the most part, this still only applies to public institutions.

These institutions of higher learning are non-profit organizations, but they are not run for any sort of public good.  Efficiency and cost control are ignored.  Their operations are opaque and lack oversight.  I challenge you to grab a university financial statement and figure out the exact percentage of the operating budget that actually goes to paying instructors.  I dug for hours and I couldn't find enough information to distinguish between the costs and income of profit centers like food services and university-run hospitals, and direct educational costs such as instructor salaries.

But what I could find, after much digging, were academic vs non-academic head counts.  In the top 10 schools, the percentage of non-academic positions ranges from 63% to 91%.  In other words, only 9%-37% of the employees are actually teaching.  If we assume the extremely low percentages are due to employment at teaching hospitals that I haven't been able to break out separately, it still means that at best, only about 1/3 of the employees actually teach, and 2/3 are doing other things - administration, facilities, student services, ...  How can it take 2 support staff for each professor?  Only through obscenely bloated operations.

Fix the bloat and you'll fix educational costs.  A step forward might just be full and effective financial disclosures.

Tuesday, January 17, 2012

SOPA and PIPA

Wikipedia and a growing list of other websites have gone black to protest two bills that are currently making the rounds in the U.S. Congress: the Stop Online Piracy Act (SOPA) in the House, and the Protect IP Act (PIPA) in the Senate.  In an effort to stop online distribution of copyrighted materials, the bills would require website owners to police user-contributed materials and block sites that are infringing copyrights.  So service providers and search engines would have to inspect all traffic, searching for copyrighted material so they can block it [1, 2].

If Congress can back SOPA and PIPA, what's next?  Will phone companies have to listen in on all phone calls to ensure that people don't commit illegal transactions over their phone lines?  Will UPS and FedEx have to inspect every package to ensure that they don't deliver stolen goods?  Will cities have to make sure that contraband is not transported over their streets?  How can something so absurd in other contexts have a chance of becoming law in the Internet context?

And why does this keep coming up with the Internet and technology?  Are people so obtuse that once the discussion becomes about the technology, they can no longer follow it?  We had movie theaters changing the aspect ratio of their films so they wouldn't fit well on TVs.  Then we had Disney & Universal suing Sony in 1976 when it introduced the VCR to the U.S.  It wasn't settled until 1984, when the Supreme Court concluded that people could tape broadcasts [3].  And in the 1980s, we had the music labels getting a kickback for each blank cassette tape sold (the same holds for blank CDs) [4].  Then it was the RIAA suing people left and right, claiming they were sharing music over the internet.  And don't forget copyright periods that keep getting lengthened.  Everything to protect the industry and nothing for the consumer.

This makes the members of Congress look like corporate shills for Hollywood and the recording industry.  How is it that the book industry is able to change and adapt to the internet era (sort of), but the recording industry and Hollywood can't?  Are the recording industry and Hollywood now too big to fail too?  If not, why is Congress trying to enact SOPA & PIPA?  Of course, the recording industry shouldn't be happy with piracy, but why don't they fight their own battles?  I guess when you're friends with Congress, it's more profitable to try to force someone else to fight for you.

References:

1. https://blacklist.eff.org/

2. http://en.wikipedia.org/wiki/Wikipedia:SOPA_initiative/Learn_more

3. http://www.ce.org/Press/CEA_Pubs/941.asp

4. http://www.pbs.org/wgbh/pages/frontline/shows/music/inside/cron.html