Prediction (out of sample)¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)
Artificial data¶
[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
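The noise is drawn without a fixed seed, so the exact numbers in the output cells below will vary from run to run. A minimal way to make the example reproducible (not part of the original notebook) is to seed NumPy's random number generator before the draw:
# Optional, for reproducibility (seed value is arbitrary): fix the RNG state
# and redo the draw above so every run produces the same y.
np.random.seed(12345)
y = y_true + sig * np.random.normal(size=nsample)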
Estimation¶
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.983
Model:                            OLS   Adj. R-squared:                  0.982
Method:                 Least Squares   F-statistic:                     870.3
Date:                Fri, 04 Aug 2023   Prob (F-statistic):           1.66e-40
Time:                        17:42:09   Log-Likelihood:                0.92009
No. Observations:                  50   AIC:                             6.160
Df Residuals:                      46   BIC:                             13.81
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0993      0.084     60.406      0.000       4.929       5.269
x1             0.4854      0.013     37.281      0.000       0.459       0.512
x2             0.4528      0.051      8.847      0.000       0.350       0.556
x3            -0.0194      0.001    -16.986      0.000      -0.022      -0.017
==============================================================================
Omnibus:                        0.792   Durbin-Watson:                   2.088
Prob(Omnibus):                  0.673   Jarque-Bera (JB):                0.208
Skew:                          -0.004   Prob(JB):                        0.901
Kurtosis:                       3.316   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
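Besides the printed summary, the fitted-result object exposes the same quantities programmatically. A small sketch using standard attributes of the OLS results (values will match the table above):
# Coefficients, standard errors, and R-squared straight from the results object.
print(olsres.params)     # estimates for [const, x1, sin(x1), (x1 - 5)**2]
print(olsres.bse)        # standard errors of the coefficients
print(olsres.rsquared)   # R-squared reported in the summary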
In-sample prediction¶
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.61389117 5.0677465 5.48560305 5.842783 6.12351461 6.32352351
6.45073492 6.52397163 6.56986138 6.61846208 6.69832332 6.83179579
7.03135911 7.29757146 7.61897789 7.97399231 8.33444405 8.6702062
8.9541454 9.16657973 9.29851246 9.35311013 9.34518231 9.29874846
9.24309155 9.207947 9.21861721 9.29181443 9.43291503 9.63507614
9.88035993 10.14268074 10.39209087 10.59970093 10.74242627 10.80677971
10.79108746 10.70576412 10.57160158 10.41635298 10.27017296 10.16066329
10.10833805 10.12325315 10.20335424 10.33481418 10.49430483 10.65282978
10.7804884 10.85138841]
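In-sample predictions from predict(X) are identical to the fitted values stored on the results object, which gives a quick consistency check (not in the original notebook):
# ypred should equal the fitted values computed during estimation.
print(np.allclose(ypred, olsres.fittedvalues))  # True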
Create a new sample of explanatory variables Xnew, predict and plot¶
[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[10.83595979 10.70031051 10.46219813 10.16208988 9.85325486 9.58872194
9.40829639 9.32881422 9.34002008 9.40707819]
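The point predictions above carry no uncertainty information. If confidence or prediction intervals are needed, get_prediction returns them for both in-sample and out-of-sample exog; a short sketch (column names are those produced by statsmodels' summary_frame):
# Standard errors plus confidence and prediction intervals for the new points.
pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))  # mean, mean_se, mean_ci_*, obs_ci_*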
Plot comparison¶
[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7f6e5ab92190>
[Figure: data points, true curve, and OLS prediction over the in-sample and out-of-sample range]
Predicting with Formulas¶
Using formulas can make both estimation and prediction a lot easier.
[8]:
from statsmodels.formula.api import ols
data = {"x1": x1, "y": y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use I() to indicate the identity transform, i.e., we do not want any of the formula's expansion magic applied to **2; the expression inside I() is evaluated as ordinary Python arithmetic.
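A small check of what I() does, assuming patsy (the formula backend used by statsmodels.formula.api) is available: the expression inside I() is evaluated as plain Python/NumPy arithmetic, so the resulting design-matrix column equals the value computed directly.
# Illustrative check, not in the original notebook: build the design matrix
# for the I() term alone and compare its column with the numpy arithmetic.
from patsy import dmatrix
dm = np.asarray(dmatrix("I((x1 - 5) ** 2)", data))
print(np.allclose(dm[:, 1], (x1 - 5) ** 2))  # True (column 0 is the intercept)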
[9]:
res.params
[9]:
Intercept 5.099294
x1 0.485366
np.sin(x1) 0.452812
I((x1 - 5) ** 2) -0.019416
dtype: float64
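As a quick sanity check (not in the original notebook), the formula fit reproduces the array-based estimates, since both fits use the same design matrix:
# The coefficient values agree with olsres.params; only the labels differ.
print(np.allclose(res.params.values, olsres.params))  # True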
Now we only have to pass the single variable x1, and the transformed right-hand-side variables are created for us automatically.
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 10.835960
1 10.700311
2 10.462198
3 10.162090
4 9.853255
5 9.588722
6 9.408296
7 9.328814
8 9.340020
9 9.407078
dtype: float64
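The formula results also work with get_prediction, so out-of-sample intervals can be obtained by passing just the new x1 values, with the formula's transformations applied to exog automatically; a sketch under the same assumptions as above, with exog passed the same way as in predict:
# Confidence and prediction intervals for the out-of-sample x1 values.
pred_new = res.get_prediction(exog=dict(x1=x1n))
print(pred_new.summary_frame(alpha=0.05))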