# predict()¶

## Description¶

Calculates linear predictions, various residuals, leverage statistics, distance statistics, and others. Included in classes which support this method, it’s also available as separate method using -predict()-.

• ols

• anova

## Parameters¶

### Input¶

predict(mdl_data = {}, estimate = None)

• mdl_data : Class object from ols or anova

• estimate : Desired estimate. Available options are:

• “y” or “xb” : Linear prediction

• “residuals”, “res”, or “r” : Residuals

• “standardized_residuals”, “standardized_r”, or “r_std” : Standardized residuals

• “studentized_residuals”, “student_r”, or “r_stud” : Studentized (jackknifed) residuals

• “leverage”, “lev” : Leverage of the observation (diagonal of the H matrix)

### Returns¶

Numpy array with n x 1 dimensions; where n is the number of observations

## Formulas¶

Note

• Y is the dependent variable array/vector

• X is the independent variable design matrix

• H is the hat matrix

• $$^T$$ indicates a transpose, i.e. $$X^T$$ is the transpose of X

### Linear prediction 1¶

The linear prediction is calculated as:

$Y_e = X @ \text{betas}$

### Residuals 1¶

Residuals are calculated as:

$\text{e} = Y_{obs} - Y_e$

### Standardized residuals 1¶

Standardized residuals are calculated using the following formula:

$t_i = \frac{e_i}{\sqrt{\text{MSE} * (1 - H_{ii})}}$

### Studentized (jackknifed) residuals 1¶

Studentized (jackknifed) residuals are calculated using the following formula:

$\begin{split}t_i = r_i * (\frac{n - k -2}{n - k - 1 - r_i^2})^\frac{1}{2} \\\end{split}$

where:

• n = number of observations

• k = number of predictors

• r = standardized residual

### Leverage 1¶

Calculate the leverage of the observation; the leverage is the diagonal of the H (hat) matrix. The H matrix is calculated using:

$X @ (X^TX)^{-1}@X^T$

## Examples¶

First to load required libraries for this example. Below, an example data set will be loaded in using statsmodels.datasets; the data loaded in is a data set available through Stata called ‘systolic’.

import researchpy as rp
import pandas as pd
# Used to load example data #
import statsmodels.datasets

systolic = statsmodels.datasets.webuse('systolic')


Now let’s get some quick information regarding the data set.

systolic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 58 entries, 0 to 57
Data columns (total 3 columns):
#   Column    Non-Null Count  Dtype
---  ------    --------------  -----
0   drug      58 non-null     int16
1   disease   58 non-null     int16
2   systolic  58 non-null     int16


Now to take a look at the descriptive statistics of the univariate data. The output indicates that there are no missing observations and that each variable is stored as an integer.

rp.summarize(systolic["systolic"])

Name N Mean Median Variance SD SE 95% Conf. Interval
0 systolic 58 18.8793 21 163.862 12.8009 1.6808 [15.5135, 22.2451]
rp.crosstab(systolic["disease"], systolic["drug"])

Variable Outcome Count Percent
0 drug 4 16 27.59
1 2 15 25.86
2 1 15 25.86
3 3 12 20.69
4 disease 3 20 34.48
5 2 19 32.76
6 1 19 32.76
Now to conduct the ANOVA; by default Type 3 sum of squares are used. There are a few ways one can conduct an ANOVA using Researchpy, the suggested approach is to assign the ANOVA model to an object that way one can utilize the built-in methods. If one does not want to do that, then running the model with and displaying the results in one-line will work too; the output will be returned as a tuple. The suggested approach will be shown in this example.
m = anova("systolic ~ C(drug) + C(disease) + C(drug):C(disease)", data = systolic, sum_of_squares = 3)

desc, table = m.results()
print(desc, table, sep = "\n"*2)


Note: Effect size values for factors are partial.

Number of obs = 58.0000
Root MSE = 10.5096
R-squared = 0.4560
Adj R-squared = 0.3259
Source Sum of Squares Degrees of Freedom Mean Squares F value p-value Eta squared Omega squared
Model 4,259.3385 11 387.2126 3.5057 0.0013 0.4560 0.3221
drug 2,997.4719 3.0000 999.1573 9.0460 0.0001 0.3711 0.2939
disease 415.8730 2.0000 207.9365 1.8826 0.1637 0.0757 0.0295
drug:disease 707.2663 6.0000 117.8777 1.0672 0.3958 0.1222 0.0069
Residual 5,080.8167 46 110.4525
Total 9,340.1552 57 163.8624
If it's of interest, one can also access the underlying regression table.
m.predict(estimate="r")

12.667
14.667
6.667
-16.333
-10.333
-7.333
4.750
-2.250
4.750
-7.250
10.600
-23.400
4.600
4.600
3.600
0.000
-5.000
6.000
14.000
-15.000
0.500
-0.500
-2.500
2.500
-15.167
7.833
9.833
13.833
-14.167
-2.167
-15.333
12.667
2.667
6.600
4.600
2.600
-3.400
-10.400
12.500
-7.500
0.500
-5.500
10.400
-4.600
8.400
-15.600
1.400
14.167
-0.833
-0.833
-17.833
3.167
2.167
7.800
-7.200
10.800
-9.200
-2.200

## References¶

1(1,2,3,4,5)

Kutner, M. H., Nachtsheim, C. J., Neter, J., and Li, W. Applied Linear Statistical Models. McGraw-Hill/Irwin, fifth edition, 2005. ISBN 0-07-238688-6.