Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
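
The noise draw above is not seeded, so the realized y (and the estimates below) will differ slightly between runs. If you want reproducible output, a minimal option is to seed NumPy's random number generator before drawing the noise (the seed value here is arbitrary):

np.random.seed(12345)  # any fixed seed makes the simulated noise reproducible
y = y_true + sig * np.random.normal(size=nsample)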

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.983
Model:                            OLS   Adj. R-squared:                  0.982
Method:                 Least Squares   F-statistic:                     870.3
Date:                Fri, 04 Aug 2023   Prob (F-statistic):           1.66e-40
Time:                        17:42:09   Log-Likelihood:                0.92009
No. Observations:                  50   AIC:                             6.160
Df Residuals:                      46   BIC:                             13.81
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0993      0.084     60.406      0.000       4.929       5.269
x1             0.4854      0.013     37.281      0.000       0.459       0.512
x2             0.4528      0.051      8.847      0.000       0.350       0.556
x3            -0.0194      0.001    -16.986      0.000      -0.022      -0.017
==============================================================================
Omnibus:                        0.792   Durbin-Watson:                   2.088
Prob(Omnibus):                  0.673   Jarque-Bera (JB):                0.208
Skew:                          -0.004   Prob(JB):                        0.901
Kurtosis:                       3.316   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
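
The quantities in the summary are also available programmatically as attributes of the results object (these names are part of the statsmodels results API), for example:

print(olsres.params)    # estimated coefficients
print(olsres.bse)       # standard errors of the coefficients
print(olsres.rsquared)  # R-squared shown in the summary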

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.61389117  5.0677465   5.48560305  5.842783    6.12351461  6.32352351
  6.45073492  6.52397163  6.56986138  6.61846208  6.69832332  6.83179579
  7.03135911  7.29757146  7.61897789  7.97399231  8.33444405  8.6702062
  8.9541454   9.16657973  9.29851246  9.35311013  9.34518231  9.29874846
  9.24309155  9.207947    9.21861721  9.29181443  9.43291503  9.63507614
  9.88035993 10.14268074 10.39209087 10.59970093 10.74242627 10.80677971
 10.79108746 10.70576412 10.57160158 10.41635298 10.27017296 10.16066329
 10.10833805 10.12325315 10.20335424 10.33481418 10.49430483 10.65282978
 10.7804884  10.85138841]
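
Because we predicted with the same design matrix used for estimation, this is just the vector of fitted values, so the following check holds:

np.allclose(ypred, olsres.fittedvalues)  # True: in-sample prediction equals the fitted values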

Create a new sample of explanatory variables Xnew, predict, and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.83595979 10.70031051 10.46219813 10.16208988  9.85325486  9.58872194
  9.40829639  9.32881422  9.34002008  9.40707819]
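
predict returns point predictions only. If you also want uncertainty estimates, get_prediction returns an object whose summary_frame includes standard errors and confidence/prediction intervals; a minimal sketch (exact column names may vary across statsmodels versions):

pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))  # mean, its standard error, and 95% confidence/prediction bounds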

Plot comparison

[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7f6e5ab92190>
[figure: scatter of the data with the true curve and the OLS prediction, in-sample and out-of-sample]

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to request the identity transform, i.e., we do not want any of the formula language's expansion magic applied to **2; the expression inside I() is evaluated as ordinary arithmetic.
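
To see what I() does, you can build the design matrix directly with patsy (the formula engine used by statsmodels.formula.api); this small check is only an illustration:

import patsy

demo = {"x1": np.array([0.0, 5.0, 10.0])}
# the I(...) column is the plain arithmetic square (25, 0, 25), next to the intercept
print(np.asarray(patsy.dmatrix("I((x1 - 5) ** 2)", demo)))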

[9]:
res.params
[9]:
Intercept           5.099294
x1                  0.485366
np.sin(x1)          0.452812
I((x1 - 5) ** 2)   -0.019416
dtype: float64
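
These are the same estimates as in the summary above, because the formula builds exactly the design matrix we constructed by hand earlier:

np.allclose(res.params.values, olsres.params)  # True: formula fit matches the hand-built fit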

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    10.835960
1    10.700311
2    10.462198
3    10.162090
4     9.853255
5     9.588722
6     9.408296
7     9.328814
8     9.340020
9     9.407078
dtype: float64
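
The same works if the new data are in a pandas DataFrame, and get_prediction applies the formula transform as well when interval estimates are needed; a sketch, assuming pandas is available:

import pandas as pd

newdf = pd.DataFrame({"x1": x1n})
print(res.predict(newdf))                         # same point predictions as above
print(res.get_prediction(newdf).summary_frame())  # adds standard errors and confidence/prediction intervals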