Prediction (out of sample)¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import statsmodels.api as sm
Artificial data¶
[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
Estimation¶
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.986
Model:                            OLS   Adj. R-squared:                  0.985
Method:                 Least Squares   F-statistic:                     1065.
Date:                Fri, 24 Apr 2020   Prob (F-statistic):           1.73e-42
Time:                        14:17:04   Log-Likelihood:                 4.2343
No. Observations:                  50   AIC:                           -0.4686
Df Residuals:                      46   BIC:                             7.180
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0458      0.079     63.868      0.000       4.887       5.205
x1             0.5012      0.012     41.134      0.000       0.477       0.526
x2             0.5964      0.048     12.452      0.000       0.500       0.693
x3            -0.0200      0.001    -18.693      0.000      -0.022      -0.018
==============================================================================
Omnibus:                        1.977   Durbin-Watson:                   1.969
Prob(Omnibus):                  0.372   Jarque-Bera (JB):                1.174
Skew:                          -0.332   Prob(JB):                        0.556
Kurtosis:                       3.351   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In-sample prediction¶
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.54585161 5.06544943 5.53948885 5.93546501 6.23260388 6.42527541
6.52391845 6.55332576 6.54857056 6.54924415 6.59295115 6.70913117
6.91422195 7.20895873 7.5782534 7.99367346 8.4181134 8.81189104
9.13926716 9.37431739 9.50519151 9.53606038 9.48643071 9.38794015
9.2791593 9.1992547 9.18155384 9.24806959 9.40588352 9.64598241
9.94473913 10.26779449 10.57570235 10.83041052 11.00151263 11.07124381
11.0373994 10.91369716 10.72752352 10.51543403 10.31714739 10.16901979
10.0980717 10.11754847 10.22474415 10.40144609 10.61692719 10.83299434
11.01026337 11.1146298 ]
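For in-sample data, predict(X) coincides with the fitted values stored on the results object; a quick check (a minimal sketch, reusing the ypred and olsres objects from above):

# In-sample prediction should equal the fitted values from the OLS fit.
print(np.allclose(ypred, olsres.fittedvalues))  # expected: True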
Create a new sample of explanatory variables Xnew, predict and plot¶
[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[11.11028937 10.95038634 10.6583104 10.28736367 9.90771053 9.58919888
9.38425896 9.31506661 9.36811384 9.49751605]
Plot comparison¶
[7]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas¶
Using formulas can make both estimation and prediction a lot easier.
[8]:
from statsmodels.formula.api import ols
data = {"x1" : x1, "y" : y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use I() to indicate use of the identity transform, i.e., we do not want any expansion magic from using **2 inside the formula.
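As a quick check, the formula builds exactly the same design matrix as the manually constructed X above (a minimal sketch, assuming the X array and the formula results res defined earlier):

# Columns: Intercept, x1, np.sin(x1), I((x1-5)**2) -- same order as the manual sm.add_constant(X).
print(np.allclose(res.model.exog, X))  # expected: True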
[9]:
res.params
[9]:
Intercept 5.045799
x1 0.501192
np.sin(x1) 0.596429
I((x1 - 5) ** 2) -0.019998
dtype: float64
Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 11.110289
1 10.950386
2 10.658310
3 10.287364
4 9.907711
5 9.589199
6 9.384259
7 9.315067
8 9.368114
9 9.497516
dtype: float64
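Point predictions can be supplemented with uncertainty estimates via get_prediction, which returns standard errors and confidence intervals for the predicted mean (a short sketch, reusing res and x1n from above; alpha sets the interval level):

# Out-of-sample prediction with standard errors and 95% intervals.
pred = res.get_prediction(exog=dict(x1=x1n))
print(pred.summary_frame(alpha=0.05))  # columns include mean, mean_se, mean_ci_lower/upper, obs_ci_lower/upper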