Prediction (out of sample)

In [1]:
%matplotlib inline

import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
# Design matrix: linear trend, sine term, and a shifted quadratic
X = np.column_stack((x1, np.sin(x1), (x1 - 5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
# Observed response: true signal plus Gaussian noise
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.989
Model:                            OLS   Adj. R-squared:                  0.989
Method:                 Least Squares   F-statistic:                     1406.
Date:                Fri, 20 Sep 2019   Prob (F-statistic):           3.13e-45
Time:                        18:23:49   Log-Likelihood:                 12.201
No. Observations:                  50   AIC:                            -16.40
Df Residuals:                      46   BIC:                            -8.754
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9652      0.067     73.703      0.000       4.830       5.101
x1             0.5070      0.010     48.794      0.000       0.486       0.528
x2             0.4904      0.041     12.008      0.000       0.408       0.573
x3            -0.0215      0.001    -23.557      0.000      -0.023      -0.020
==============================================================================
Omnibus:                        3.943   Durbin-Watson:                   2.141
Prob(Omnibus):                  0.139   Jarque-Bera (JB):                3.003
Skew:                          -0.578   Prob(JB):                        0.223
Kurtosis:                       3.325   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
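
The quantities in the coefficient table are also available directly on the results object; a minimal sketch:

print(olsres.params)      # estimated coefficients
print(olsres.bse)         # standard errors of the coefficients
print(olsres.conf_int())  # 95% confidence intervals, matching the table above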

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[ 4.42792065  4.91364528  5.36022582  5.74093321  6.03868475  6.24885078
  6.38001529  6.4525656   6.4953428   6.5409032   6.62016949  6.75735014
  6.9659621   7.24660994  7.58688656  7.96341163  8.3456729   8.7010393
  9.00012212  9.22160348  9.35573881  9.40595787  9.38830121  9.3287848
  9.25912565  9.21153079  9.21340577  9.282852    9.42569272  9.63451627
  9.88989382 10.16357145 10.4231122  10.63722537 10.78090748 10.83955029
 10.81134089 10.70755955 10.55072639 10.37090142 10.20074574 10.07015542
 10.00134967 10.00522027 10.0795421  10.20933869 10.36934285 10.52814799
 10.6533681  10.71695871]
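
For the estimation sample, predict with the original design matrix reproduces the stored fitted values; a quick sketch of that check:

# In-sample predictions coincide with the fitted values (up to floating point)
assert np.allclose(ypred, olsres.fittedvalues)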

Create a new sample of explanatory variables Xnew, predict and plot

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.6838658  10.5203297  10.245584    9.90345959  9.55165335  9.24760193
  9.03441926  8.93034075  8.92425864  8.97844156]
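
When interval estimates are needed alongside the point predictions, get_prediction returns a results object whose summary_frame tabulates confidence and prediction intervals; a sketch at the default 95% level:

pred = olsres.get_prediction(Xnew)
# Columns include mean, mean_ci_lower/upper and obs_ci_lower/upper
print(pred.summary_frame(alpha=0.05))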

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
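
As a variation on the comparison plot, the 95% prediction interval for the new observations can be shaded with fill_between; a self-contained sketch:

frame = olsres.get_prediction(Xnew).summary_frame(alpha=0.05)
fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1n, ynewpred, 'r-', label="OLS prediction")
# Shade the out-of-sample prediction interval
ax.fill_between(x1n, frame["obs_ci_lower"], frame["obs_ci_upper"],
                color="r", alpha=0.2)
ax.legend(loc="best");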

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform. That is, we don't want any formula-operator expansion magic from using **2.
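
The effect is visible in the design matrix that the formula builds; a quick sketch inspecting the column names:

# The fitted model carries the transformed right-hand-side columns
print(res.model.exog_names)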

In [8]:
res.params
Out[8]:
Intercept           4.965155
x1                  0.506958
np.sin(x1)          0.490450
I((x1 - 5) ** 2)   -0.021489
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.683866
1    10.520330
2    10.245584
3     9.903460
4     9.551653
5     9.247602
6     9.034419
7     8.930341
8     8.924259
9     8.978442
dtype: float64
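
The formula interface composes with get_prediction in the same way, so interval estimates for the new points also require only the raw variable; a sketch:

pred_new = res.get_prediction(exog=dict(x1=x1n))
# Confidence and prediction intervals for the out-of-sample points
print(pred_new.summary_frame(alpha=0.05))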