Forecasting Deaths From COVID-19 in North Carolina up to the End of 2020

Using Polynomial Regression with Scikit-learn in Python

8 min readNov 3, 2020

Context

About 40 days ago, on September 23rd to be exact, I was tasked by my linear algebra professor with constructing a model to predict deaths from COVID-19 in our university’s home state of North Carolina — from that point up to the day of the election, November 3rd. The model was a fairly simple 2nd order polynomial regression calculated using the Scikit-learn library in Python. Using daily data obtained from the NC Department of Health and Human Services COVID-19 dashboard the model was “trained” on data ranging from March 25th, the date of the first recorded COVID death in the state, to September 22nd. Extrapolating this model 41 days into the future from that day, the predicted number of deaths in NC on November 3rd came out to be 4,452.

At the time I didn't think much of it. It was an exercise in using regression and the goal of the assignment was to illustrate the different use cases for either linear regression or higher-order polynomial regression depending on the nature of your data.

Today, I decided to take a look at the model to see how accurate its prediction proved to be. To my surprise, the model accurately forecasted the number of deaths from COVID-19 in NC on November 3rd within 5 deaths (4457 actual deaths vs 4452 predicted). Back then, nearly a month and a half ago, I thought the model’s predictions were a grave overestimate but it in fact turned out to be an underestimate, unfortunately.

Matplotlib scatter plot of the actual death data up to 9/22 and a 41-day forecast leading up to the election today

So, I figured I would share my code and findings for those interested as well as update the model with the most recent death figures and attempt to forecast deaths up to the end of the year.

Updating the model with new data

Here’s the entire Jupyter Notebook containing all code and previous model estimations

After downloading the data from the NC DHHS, I cleaned it to only include days after the first death from COVID-19 was recorded in NC on March 25th this year. I then load it in using the Pandas library.

# Loading in most recent data updated as of today Nov. 3rd

NCcovidNew = pd.read_excel('NCcovidNew.xlsx', sheet_name='Sheet 1')

Next, we assign our X and Y variables, in this case, date and deaths, and then reshape our data as needed.

date = NCcovidNew['Date']
y = NCcovidNew['NC Deaths']
date.shape # 224 days of data# Reshpaing the data
 
x=np.linspace(224,1,224)
x=np.reshape(x,(224,1))
y=np.array(y)
y=np.reshape(y,(224,1))

Refitting the model

Scikit-learn makes linear regression simple to implement with the following:

reg = linear_model.LinearRegression() # Constructing SKlearn model
reg.fit()

Yep, that's it. However, since our model will be a 2nd order polynomial regression taking the form h(x)=a+bx+cx², we need to assign a polynomial feature term that takes degree=2 as one of its arguments. The following code encompasses the entirety of the model.

poly_f=PolynomialFeatures(degree=2,include_bias=False) # Instantiating quadratic termX_poly=poly_f.fit_transform(x) # Applying quadratic transformationreg.fit(X_poly,y) # Fitting transformed data to model# Assigning variables to the model's parameter estimatesa=reg.intercept_[0]
b=reg.coef_[0][0]
c=reg.coef_[0][1]print(“a (Intercept term) =”,a)
print(“b (Coefficient 1) =”,b)
print(“c (Coefficient 2) =”,c)

Here are the model’s parameter estimates:

a (Intercept term) = -64.88491059525245 b (Coefficient 1) = 10.06007469864205 c (Coefficient 2) = 0.04522201123331129

Putting it all together

Using Matplotlib once again, let's plot the data for daily deaths as a scatter plot and then overlay the data with our model’s forecast for the next 58 days leading up to December 31st.

plt.figure(figsize=(14,14))
plt.scatter(x,y)x=np.linspace(282,1,282)
x=np.reshape(x,(282,1))
plt.xlim(0, 290)
plt.ylim(0,8000)plt.plot(x, a + b*x+c*np.power(x,2), "r")plt.title('Predicted vs Actual COVID-19 Deaths in NC')
plt.xlabel('Days since first recorded death')
plt.ylabel('Deaths')plt.axvline(x=224+58,color='b') # Blue vertical line corresponds with Dec. 31st
plt.axhline(y=a + b*(224+58) + c*(224+58)**2,color='b') 
plt.axhline(y=a + b*(224) + c*(224)**2,color='g') # Corresponds with today's death total of 4,457

The model, a simple quadratic equation, fits the data quite well. The vertical and horizontal green lines correspond with today's date, November 3rd, and death count of 4457. The blue lines correspond with December 31st’s date and predicted death count. So what is the predicted death count on 2020’s final day?

Prediction

In order to predict deaths by December 31st, we need to plug in the corresponding x-value into our model which takes the form of h(x)=a+bx+cx². The following code accomplishes this:

deathsByNewYears = a + b*(224+58) + c*(224+58)**2 # Forecasting 58 days into the future from todayprint("Predicted number of deaths on Dec. 31st:",int(deathsByNewYears))

And the output:

Predicted number of deaths on Dec. 31st: 6368

The model estimates 1,911 more deaths from COVID-19 will occur in the 58 days between today and the year’s end, culminating in 6,368 deaths on 2020's final day. Some future improvements that I’d like to make include calculating and displaying shaded confidence intervals in order to provide a true probabilistic forecast rather than an extrapolation of a simple curve. Likewise, I’d also like to create a dashboard for the model so that it automatically updates its parameter estimates and forecasts as new data is uploaded to the NC DHHS and then fed into the model.

Take this all with a grain of salt

To be clear, I’m not an epidemiologist nor an expert of any kind for that matter, but I do have a passion for data and storytelling. As for the model’s prediction, my overly optimistic gut feeling is that the true death total won’t eclipse 6000 by year’s end, but I wouldn’t bet the farm on it.

Globally, the virus is on a renewed tear, prompting many European nations to reinstate ‘wave-breaking’ shutdowns upon breaking previous records for new cases. In the U.S. new data are increasingly pointing towards a bleak winter as in the last two weeks records for the daily number of new cases have been set and shattered on successive days, most recently last Friday, when 98,500 infections were reported. President Trump also recently suggested at a campaign rally that he would fire Dr. Fauci after the election.

The key figure to watch for is hospitalizations since hospitals see seasonal increases in patients admitted in the winter due to the seasonality of the Flu. Hospitals already dealing with seasonal increases in patients may become overwhelmed in the face of widespread revived coronavirus outbreaks. Unfortunately, hospitalizations have been marching higher since the end of September and show little sign of letting up.

Graph made in Google Sheets using data from The COVID Tracking Project

Here in North Carolina, the rate of positive tests have also picked back up in recent weeks

Graph made in Google Sheets using data from the NC DHHS

Another important factor to consider is the number of ‘excess deaths’ observed or in other words the number of deaths outside of the expected norm in a given place and time. Statisticians are pretty smart folks as it turns out and have a very good understanding of mortality in the aggregate and as such are able to better estimate the true impact of pandemics and other types of disasters on society. Here is a quote taken from the CDC’s recent report on excess deaths

As of October 15, 216,025 deaths from coronavirus disease 2019 (COVID-19) have been reported in the United States*; however, this number might underestimate the total impact of the pandemic on mortality. Measures of excess deaths have been used to estimate the impact of public health pandemics or disasters, particularly when there are questions about underascertainment of deaths directly attributable to a given event or cause (1–6).†…
Estimates of excess deaths can provide a comprehensive account of mortality related to the COVID-19 pandemic, including deaths that are directly or indirectly attributable to COVID-19. Estimates of the numbers of deaths directly attributable to COVID-19 might be limited by factors such as the availability and use of diagnostic testing (including postmortem testing) and the accurate and complete reporting of cause of death information on the death certificate. Excess death analyses are not subject to these limitations because they examine historical trends in all-cause mortality to determine the degree to which observed numbers of deaths differ from historical norms.

Thus far the data do indicate an underestimation of COVID-related mortality, especially among people of color. However, recently excess death figures have been trending lower. This could suggest that as we have learned more about the virus, care has improved, and subsequently we have begun to adapt to the virus’ pathology in the aggregate.

Final thoughts

One last point I’d like to mention is that historically, the average number of reported deaths drops around the end of the year. Of course, mortality rates don't magically decrease around Christmas, instead, the holidays and its associated institutional delays cause reporting systems to lag before observing a brief spike in early January when things reopen. Therefore, it would not surprise me if the reported number of deaths from COVID-19 drops precipitously towards the latter half of December while the true number of COVID related deaths do not follow suit and instead march right on pace into the new year, only for reporting systems to catch up come early January. The model shown above could end up overshooting its estimated target for this reason but still be technically correct.

In any case, I hope I’m wrong and the true number of deaths start to plummet soon before things snowball uncontrollably into the winter. With concerted efforts on the part of the Federal and state governments, and with individuals practicing safe social distancing, mask-wearing, and handwashing, we can flatten the curve this time for good in the U.S. Indeed, the more inaccurate this model proves to be via overestimation, the better off we’ll all be.