.. _predict: ************************ predict() ************************ Description =========== Calculates linear predictions, various residuals, leverage statistics, distance statistics, and others. Included in classes which support this method, it's also available as separate method using -predict()-. Classes where predict included ------------------------------ * ols * anova Parameters ========== Input ----- **predict(mdl_data = {}, estimate = None)** * **mdl_data** : Class object from ols or anova * **estimate** : Desired estimate. Available options are: * *"y"* or *"xb"* : Linear prediction * *"residuals"*, *"res"*, or *"r"* : Residuals * *"standardized_residuals"*, *"standardized_r"*, or *"rstand"* : Standardized residuals * *"studentized_residuals"*, *"student_r"*, or *"rstud"* : Studentized (jackknifed) residuals * *"leverage"*, *"lev"* : Leverage of the observation (diagonal of the H matrix) Returns ------- Numpy array with n x 1 dimensions; where n is the number of observations Formulas ================== .. note:: * Y is the dependent variable array/vector * X is the independent variable design matrix * H is the hat matrix * :math:`^T` indicates a transpose, i.e. :math:`X^T` is the transpose of X Linear prediction :footcite:p:`kutner2005` ---------------------------------------------- The linear prediction is calculated as: .. math:: Y_e = X @ \text{betas} Residuals :footcite:p:`kutner2005` ---------------------------------------------- Residuals are calculated as: .. math:: \text{e} = Y_{obs} - Y_e Standardized residuals :footcite:p:`kutner2005` -------------------------------------------------- Standardized residuals are calculated using the following formula: .. math:: t_i = \frac{e_i}{\sqrt{\text{MSE} * (1 - H_{ii})}} Studentized (jackknifed) residuals :footcite:p:`kutner2005` ------------------------------------------------------------- Studentized (jackknifed) residuals are calculated using the following formula: .. math:: t_i = r_i * (\frac{n - k -2}{n - k - 1 - r_i^2})^\frac{1}{2} \\ where: * n = number of observations * k = number of predictors * r = standardized residual Leverage :footcite:p:`kutner2005` ------------------------------------------------------------- Calculate the leverage of the observation; the leverage is the diagonal of the H (hat) matrix. The H matrix is calculated using: .. math:: X @ (X^TX)^{-1}@X^T Examples ======== First to load required libraries for this example. Below, an example data set will be loaded in using statsmodels.datasets; the data loaded in is a data set available through Stata called 'systolic'. .. code:: python import researchpy as rp import pandas as pd # Used to load example data # import statsmodels.datasets systolic = statsmodels.datasets.webuse('systolic') Now let's get some quick information regarding the data set. .. code:: python systolic.info() .. parsed-literal:: Int64Index: 58 entries, 0 to 57 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 drug 58 non-null int16 1 disease 58 non-null int16 2 systolic 58 non-null int16 Now to take a look at the descriptive statistics of the univariate data. The output indicates that there are no missing observations and that each variable is stored as an integer. .. code:: python rp.summarize(systolic["systolic"]) .. raw:: html

	Name	N	Mean	Median	Variance	SD	SE	95% Conf. Interval
0	systolic	58	18.8793	21	163.862	12.8009	1.6808	[15.5135, 22.2451]

.. code:: python rp.crosstab(systolic["disease"], systolic["drug"]) .. raw:: html

	Variable	Outcome	Count	Percent
0	drug	4	16	27.59
1		2	15	25.86
2		1	15	25.86
3		3	12	20.69
4	disease	3	20	34.48
5		2	19	32.76
6		1	19	32.76

Now to conduct the ANOVA; by default Type 3 sum of squares are used. There are a few ways one can conduct an ANOVA using Researchpy, the suggested approach is to assign the ANOVA model to an object that way one can utilize the built-in methods. If one does not want to do that, then running the model with and displaying the results in one-line will work too; the output will be returned as a tuple. The suggested approach will be shown in this example. .. code:: python m = anova("systolic ~ C(drug) + C(disease) + C(drug):C(disease)", data = systolic, sum_of_squares = 3) desc, table = m.results() print(desc, table, sep = "\n"*2) .. raw:: html

Note: Effect size values for factors are partial.

Number of obs =	58.0000
Root MSE =	10.5096
R-squared =	0.4560
Adj R-squared =	0.3259

Source	Sum of Squares	Degrees of Freedom	Mean Squares	F value	p-value	Eta squared	Omega squared
Model	4,259.3385	11	387.2126	3.5057	0.0013	0.4560	0.3221

drug	2,997.4719	3.0000	999.1573	9.0460	0.0001	0.3711	0.2939
disease	415.8730	2.0000	207.9365	1.8826	0.1637	0.0757	0.0295
drug:disease	707.2663	6.0000	117.8777	1.0672	0.3958	0.1222	0.0069

Residual	5,080.8167	46	110.4525
Total	9,340.1552	57	163.8624

.. code:: python m.predict(estimate="r")[:10] .. parsed-literal:: array([[ 12.6667], [ 14.6667], [ 6.6667], [-16.3333], [-10.3333], [ -7.3333], [ 4.75 ], [ -2.25 ], [ 4.75 ], [ -7.25 ]]) References =========== .. footbibliography::