Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import statsmodels.api as sm

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
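
If reproducibility matters, the noise can be drawn from a seeded generator instead; a minimal sketch using NumPy's default_rng (the seed value here is arbitrary):

rng = np.random.default_rng(12345)  # any fixed seed gives a reproducible sample
y = y_true + sig * rng.standard_normal(nsample)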

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.986
Model:                            OLS   Adj. R-squared:                  0.985
Method:                 Least Squares   F-statistic:                     1065.
Date:                Fri, 24 Apr 2020   Prob (F-statistic):           1.73e-42
Time:                        14:17:04   Log-Likelihood:                 4.2343
No. Observations:                  50   AIC:                           -0.4686
Df Residuals:                      46   BIC:                             7.180
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0458      0.079     63.868      0.000       4.887       5.205
x1             0.5012      0.012     41.134      0.000       0.477       0.526
x2             0.5964      0.048     12.452      0.000       0.500       0.693
x3            -0.0200      0.001    -18.693      0.000      -0.022      -0.018
==============================================================================
Omnibus:                        1.977   Durbin-Watson:                   1.969
Prob(Omnibus):                  0.372   Jarque-Bera (JB):                1.174
Skew:                          -0.332   Prob(JB):                        0.556
Kurtosis:                       3.351   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
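
The quantities shown in the summary are also available as attributes of the results object (standard statsmodels results API):

print(olsres.params)      # estimated coefficients
print(olsres.bse)         # their standard errors
print(olsres.rsquared)    # R-squared
print(olsres.conf_int())  # 95% confidence intervals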

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.54585161  5.06544943  5.53948885  5.93546501  6.23260388  6.42527541
  6.52391845  6.55332576  6.54857056  6.54924415  6.59295115  6.70913117
  6.91422195  7.20895873  7.5782534   7.99367346  8.4181134   8.81189104
  9.13926716  9.37431739  9.50519151  9.53606038  9.48643071  9.38794015
  9.2791593   9.1992547   9.18155384  9.24806959  9.40588352  9.64598241
  9.94473913 10.26779449 10.57570235 10.83041052 11.00151263 11.07124381
 11.0373994  10.91369716 10.72752352 10.51543403 10.31714739 10.16901979
 10.0980717  10.11754847 10.22474415 10.40144609 10.61692719 10.83299434
 11.01026337 11.1146298 ]
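
predict returns point predictions only. To also get standard errors and intervals, statsmodels provides get_prediction; a short sketch:

pred = olsres.get_prediction(X)
# summary_frame contains the point prediction, its standard error, and
# confidence and prediction intervals at the requested alpha
print(pred.summary_frame(alpha=0.05).head())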

Create a new sample of explanatory variables Xnew, predict, and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[11.11028937 10.95038634 10.6583104  10.28736367  9.90771053  9.58919888
  9.38425896  9.31506661  9.36811384  9.49751605]
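
The same works out of sample, where the intervals widen as we extrapolate; a sketch (column names as produced by summary_frame for OLS):

prednew = olsres.get_prediction(Xnew)
frame = prednew.summary_frame(alpha=0.05)
print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]])  # prediction intervals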

Plot comparison

[7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
[Figure: data points, true values, and the OLS prediction over the combined in- and out-of-sample range]
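
To visualize how the uncertainty grows outside the estimation range, the prediction interval from the frame computed above can be shaded in; a sketch:

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.fill_between(x1n, frame["obs_ci_lower"], frame["obs_ci_upper"],
                color="r", alpha=0.2, label="95% prediction interval")
ax.legend(loc="best");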

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform; i.e., we do not want any formula expansion magic applied to **2.
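
To verify what I() does, one can build the design matrix directly with patsy (the library statsmodels uses for formulas) and compare the I() column against the plain NumPy computation; a minimal sketch (column order: intercept first, then terms as written):

from patsy import dmatrix

dm = np.asarray(dmatrix("x1 + np.sin(x1) + I((x1-5)**2)", data))
# the I(...) column is the literal arithmetic result
print(np.allclose(dm[:, 3], (x1 - 5) ** 2))  # expected: True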

[9]:
res.params
[9]:
Intercept           5.045799
x1                  0.501192
np.sin(x1)          0.596429
I((x1 - 5) ** 2)   -0.019998
dtype: float64

Now we only have to pass the single variable x1, and we get the transformed right-hand-side variables automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    11.110289
1    10.950386
2    10.658310
3    10.287364
4     9.907711
5     9.589199
6     9.384259
7     9.315067
8     9.368114
9     9.497516
dtype: float64
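
As a consistency check, the formula-based predictions should agree with the array-based out-of-sample predictions computed earlier, up to numerical precision:

print(np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred))  # expected: True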