Prediction (out of sample)

In [1]:
%matplotlib inline
In [2]:
from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [3]:
nsample = 50
sig = 0.25   # standard deviation of the noise
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5)**2))
X = sm.add_constant(X)   # prepend a column of ones for the intercept
beta = [5., 0.5, 0.5, -0.02]   # true coefficients
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)   # add Gaussian noise
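
A quick check, not part of the original example: add_constant prepends an intercept column, so the design matrix has four columns.

print(X.shape)   # (50, 4): intercept plus three regressors
print(X[0])      # first row: [1., 0., 0., 25.]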

Estimation

In [4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.981
Model:                            OLS   Adj. R-squared:                  0.980
Method:                 Least Squares   F-statistic:                     804.8
Date:                Tue, 03 Dec 2019   Prob (F-statistic):           9.69e-40
Time:                        21:23:04   Log-Likelihood:                -4.0883
No. Observations:                  50   AIC:                             16.18
Df Residuals:                      46   BIC:                             23.82
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.8153      0.093     51.605      0.000       4.627       5.003
x1             0.5184      0.014     36.025      0.000       0.489       0.547
x2             0.4540      0.057      8.025      0.000       0.340       0.568
x3            -0.0209      0.001    -16.577      0.000      -0.023      -0.018
==============================================================================
Omnibus:                        1.521   Durbin-Watson:                   1.799
Prob(Omnibus):                  0.467   Jarque-Bera (JB):                1.434
Skew:                           0.392   Prob(JB):                        0.488
Kurtosis:                       2.728   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
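
The point estimates can be compared with the true coefficients directly; this check is not part of the original output:

print(beta)            # true values: [5.0, 0.5, 0.5, -0.02]
print(olsres.params)   # estimates, matching the coef column above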

In-sample prediction

In [5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.29164673  4.76544771  5.20266487  5.57855738  5.87731324  6.09464717
  6.23850462  6.32775643  6.38909833  6.45266494  6.54707874  6.69474749
  6.90818266  7.18794386  7.52254688  7.89035055  8.26311226  8.61062821
  8.90569576  9.12858276  9.27026949  9.33393059  9.33441344  9.29579881
  9.24744443  9.21916162  9.23631738  9.31566662  9.46259944  9.6702555
  9.92065117 10.1876342  10.44118046 10.65232678 10.7979296  10.86446729
 10.8502618  10.76575431 10.6317898  10.47619225 10.32919327 10.21846512
 10.16457423 10.17760219 10.25548939 10.3843734  10.54086653 10.69589835
 10.8194919  10.88568879]
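
Since the same design matrix X is used, predict simply returns the stored fitted values; a quick sanity check:

print(np.allclose(ypred, olsres.fittedvalues))   # True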

Create a new sample of explanatory variables Xnew, predict, and plot

In [6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.86356567 10.72017495 10.47331954 10.16356995  9.84433129  9.56876778
  9.37678633  9.28526592  9.28392502  9.33783892]
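
To quantify the uncertainty of these out-of-sample predictions, get_prediction returns an object whose summary_frame contains both confidence intervals for the mean and prediction intervals for new observations. This is a sketch using the standard statsmodels results API; the frame columns are mean, mean_se, mean_ci_lower/upper, and obs_ci_lower/upper:

pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))   # 95% intervals for the 10 new points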

Plot comparison

In [7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [8]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform; i.e., we don't want any formula expansion magic from using **2.
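
Inside a formula, operators such as - and ** carry patsy's formula semantics (term removal and interaction expansion) rather than arithmetic ones, which is why I() is needed to force plain Python evaluation. Any function visible in the calling namespace works as well; here shifted_square is a hypothetical helper, not part of the original example:

def shifted_square(x):
    # plain Python arithmetic, equivalent to I((x1 - 5)**2)
    return (x - 5) ** 2

res2 = ols("y ~ x1 + np.sin(x1) + shifted_square(x1)", data=data).fit()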

In [9]:
res.params
Out[9]:
Intercept           4.815290
x1                  0.518437
np.sin(x1)          0.453968
I((x1 - 5) ** 2)   -0.020946
dtype: float64
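
The formula-based estimates match the array-based ones from olsres; only the parameter labels differ. A quick cross-check, not part of the original example:

print(np.allclose(res.params.values, olsres.params))   # True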

Now we only have to pass the single variable x1, and we get the transformed right-hand-side variables automatically.

In [10]:
res.predict(exog=dict(x1=x1n))
Out[10]:
0    10.863566
1    10.720175
2    10.473320
3    10.163570
4     9.844331
5     9.568768
6     9.376786
7     9.285266
8     9.283925
9     9.337839
dtype: float64
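
The same uncertainty estimates are available through the formula interface; get_prediction applies the formula transform to the new data before computing the intervals (the same API sketch and caveats as above):

pred = res.get_prediction(exog=dict(x1=x1n))
print(pred.summary_frame(alpha=0.05))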