ols()

Description

Conducts linear regression using the ordinary least squares approach.

Parameters

Input

ols(formula_like, data = {})

formula_like : A valid formula string, e.g. "systolic ~ C(drug) + C(disease)", which is used to parse the data into a design matrix.

data : The dataframe which contains the data to be analyzed.
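To make the idea of a design matrix concrete, here is a minimal pure-Python sketch of treatment (dummy) coding for a single categorical column. It assumes the first sorted level is the reference, as the "(reference)" rows in the example output on this page suggest. The helper name treatment_code and the column labels are hypothetical; the real parsing is done by the formula machinery behind ols(), not by this code.

```python
def treatment_code(values, name="x"):
    """Dummy-code one categorical column (illustrative helper, not part of researchpy).

    The first sorted level is treated as the reference and gets no column;
    every other level gets a 0/1 indicator column. An intercept column of
    ones is placed first.
    """
    levels = sorted(set(values))
    others = levels[1:]  # levels[0] is the assumed reference level
    names = ["Intercept"] + [f"{name}={lvl}" for lvl in others]
    rows = [[1] + [1 if v == lvl else 0 for lvl in others] for v in values]
    return names, rows


names, rows = treatment_code([1, 2, 4, 2], name="drug")
for row in rows:
    print(dict(zip(names, row)))
```

Each row of the result corresponds to one observation, and the coefficient attached to an indicator column is the difference from the reference level.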

Returns

Returns an object with class “ols”; this object has accessible methods which are described below.

ols methods

results(return_type = "Dataframe", decimals = 4, pretty_format = True, conf_level = 0.95)

return_type : The type of data structure the results should be returned as. Supported options are ‘Dataframe’ which will return a Pandas DataFrame or ‘Dictionary’ which will return a dictionary.

decimals : The number of decimal places the data should be rounded to.

pretty_format : Whether pretty formatting should be applied. This adds extra empty spaces in the returned data structure to make the results easier to read.

conf_level : The confidence level used for the confidence intervals, e.g. 0.95 for 95% intervals.

results() returns three objects: (1) the summary information, (2) the model table, and (3) the regression table.
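As a rough illustration of how conf_level relates to the interval reported in the regression table, the sketch below recomputes the t statistic and the 95% interval for the Intercept row of the example output at the end of this page. The function name is hypothetical, and the critical value 2.0129 (the 97.5th percentile of the t distribution with 46 degrees of freedom) is hardcoded as an assumption to avoid extra dependencies; this is the textbook calculation, not necessarily the exact code path used internally.

```python
def t_and_interval(coef, std_err, t_crit):
    """Return the t statistic and the two-sided interval coef +/- t_crit * std_err."""
    t_stat = coef / std_err
    margin = t_crit * std_err
    return t_stat, (coef - margin, coef + margin)


# Assumed critical value for conf_level = 0.95 and 46 residual degrees of freedom.
T_CRIT_95_DF46 = 2.0129

# Coef. and Std. Err. copied from the Intercept row of the example output.
t_stat, (lower, upper) = t_and_interval(29.3333, 4.2905, T_CRIT_95_DF46)
print(round(t_stat, 3), round(lower, 3), round(upper, 3))
```

Because the inputs are already rounded to four decimals, the recomputed values can differ from the reported ones in the last digit.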

predict(estimate = None)

estimate : Desired estimate. Available options are:

“y” or “xb” : Linear prediction

“residuals”, “res”, or “r” : Residuals

“standardized_residuals”, “standardized_r”, or “r_std” : Standardized residuals

“studentized_residuals”, “student_r”, or “r_stud” : Studentized (jackknifed) residuals

“leverage”, “lev” : Leverage of the observation (diagonal of the H matrix)

See predict() for formula information.
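For orientation, the quantities listed above can be sketched in plain Python for the special case of one predictor plus an intercept. This uses the standard textbook definitions (leverage as the diagonal of the hat matrix, standardized residuals scaled by the overall error variance, studentized residuals scaled by the error variance with the observation deleted). The function name and the toy data are hypothetical, and the formulas are assumed to match predict(), not taken from its source.

```python
import math


def simple_ols_diagnostics(x, y):
    """Diagnostics for y = b0 + b1 * x (illustrative, single predictor only)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    fitted = [intercept + slope * xi for xi in x]        # linear prediction
    resid = [yi - fi for yi, fi in zip(y, fitted)]       # residuals
    # Diagonal of the hat matrix for a model with an intercept and one slope.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    df_error = n - 2                                     # two estimated parameters
    mse = sum(r ** 2 for r in resid) / df_error
    standardized = [r / math.sqrt(mse * (1 - h)) for r, h in zip(resid, leverage)]

    # Jackknifed version: error variance re-estimated without observation i.
    studentized = []
    for r, h in zip(resid, leverage):
        mse_deleted = (mse * df_error - r ** 2 / (1 - h)) / (df_error - 1)
        studentized.append(r / math.sqrt(mse_deleted * (1 - h)))

    return {
        "xb": fitted,
        "residuals": resid,
        "leverage": leverage,
        "standardized_residuals": standardized,
        "studentized_residuals": studentized,
    }


diag = simple_ols_diagnostics([1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, 7.8, 10.1, 12.3])
print([round(h, 4) for h in diag["leverage"]])
```

A quick sanity check is that the leverages sum to the number of estimated parameters (two here) and the residuals sum to zero.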

Effect Size Measures Formulas

By default, this method returns \(R^2\), \(\text{Adj. }R^2\), \(\eta^2\), \(\epsilon^2\), and \(\omega^2\). Note that for the factor terms the reported effect sizes are partial, i.e., \(\eta^2_p\), \(\epsilon^2_p\), and \(\omega^2_p\), respectively. See Olejnik and Algina (2000) [1], Kelley and Preacher (2012) [2], and/or Grissom and Kim (2012) [3].

Eta-squared (\(\eta^2\)) and \(R^2\)

\[\eta^2 = R^2 = \frac{\text{SS}_{model}}{\text{SS}_{total}}\]

Adjusted \(R^2\)

\[\text{Adj. }R^2 = 1 - \frac{\text{df}_{total}}{\text{df}_{error}} \times \frac{\text{SS}_{error}}{\text{SS}_{total}}\]

Omega-squared (\(\omega^2\))

\[\omega^2 = \frac{\text{SS}_{effect} - (\text{df}_{effect} \times \text{MS}_{error})}{\text{SS}_{total} + \text{MS}_{error}}\]
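The three formulas above can be checked by hand against the model table in the example further down this page. The sketch below plugs in the Model, Residual, and Total rows from that output; the function names are hypothetical and exist only for this illustration.

```python
def eta_squared(ss_model, ss_total):
    """Eta-squared (equal to R-squared for the overall model)."""
    return ss_model / ss_total


def adj_r_squared(ss_error, ss_total, df_error, df_total):
    """Adjusted R-squared."""
    return 1 - (df_total / df_error) * (ss_error / ss_total)


def omega_squared(ss_effect, df_effect, ms_error, ss_total):
    """Omega-squared."""
    return (ss_effect - df_effect * ms_error) / (ss_total + ms_error)


# Values copied from the model table in the example output.
ss_model, df_model = 4259.3385, 11
ss_error, df_error, ms_error = 5080.8167, 46, 110.4525
ss_total, df_total = 9340.1552, 57

eta2 = eta_squared(ss_model, ss_total)
adj_r2 = adj_r_squared(ss_error, ss_total, df_error, df_total)
omega2 = omega_squared(ss_model, df_model, ms_error, ss_total)
print(round(eta2, 4), round(adj_r2, 4), round(omega2, 4))
```

Rounded to four decimals, these reproduce the R-squared, Adj R-squared, and Omega squared values reported in the example output.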

Examples

First, load the libraries required for this example. The example data set is loaded using statsmodels.datasets; it is a data set available through Stata called ‘systolic’.

import researchpy as rp
import pandas as pd
# Used to load example data #
import statsmodels.datasets

systolic = statsmodels.datasets.webuse('systolic')

Now let’s get some quick information regarding the data set.

systolic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 58 entries, 0 to 57
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   drug      58 non-null     int16
 1   disease   58 non-null     int16
 2   systolic  58 non-null     int16

The output indicates that there are no missing observations and that each variable is stored as an integer. Now to take a look at the descriptive statistics of the univariate data.

rp.summarize(systolic["systolic"])

	Name	N	Mean	Median	Variance	SD	SE	95% Conf. Interval
0	systolic	58	18.8793	21	163.862	12.8009	1.6808	[15.5135, 22.2451]

rp.summary_cat(systolic[["drug", "disease"]])

	Variable	Outcome	Count	Percent
0	drug	4	16	27.59
1		2	15	25.86
2		1	15	25.86
3		3	12	20.69
4	disease	3	20	34.48
5		2	19	32.76
6		1	19	32.76

Now fit the linear regression model; sample syntax is shown below.

m = rp.ols("systolic ~ C(drug) + C(disease) + C(drug):C(disease)", data = systolic)

desc, mod, table = m.results()
print(desc, mod, table, sep = "\n"*2)

Number of obs =	58.0000
Root MSE =	10.5096
R-squared =	0.4560
Adj R-squared =	0.3259

Source	Sum of Squares	Degrees of Freedom	Mean Squares	F value	p-value	Eta squared	Omega squared
Model	4259.3385	11	387.2126	3.5057	0.0013	0.456	0.3221

Residual	5080.8167	46	110.4525
Total	9340.1552	57	163.8624

systolic	Coef.	Std. Err.	t	p-value	95% Conf. Interval
Intercept	29.3333	4.2905	6.8367	0.0000	[20.6969, 37.9697]
drug
1	(reference)
2	-1.3333	6.3639	-0.2095	0.8350	[-14.1432, 11.4765]
3	-13.0000	7.4314	-1.7493	0.0869	[-27.9587, 1.9587]
4	-15.7333	6.3639	-2.4723	0.0172	[-28.5432, -2.9235]
disease
1	(reference)
2	-1.0833	6.7839	-0.1597	0.8738	[-14.7387, 12.572]
3	-8.9333	6.3639	-1.4038	0.1671	[-21.7432, 3.8765]
drug:disease
2:2	6.5833	9.7839	0.6729	0.5044	[-13.1107, 26.2774]
2:3	-0.9000	8.9999	-0.1000	0.9208	[-19.0159, 17.2159]
3:2	-10.8500	10.2435	-1.0592	0.2950	[-31.4692, 9.7692]
3:3	1.1000	10.2435	0.1074	0.9150	[-19.5192, 21.7192]
4:2	0.3167	9.3017	0.0340	0.9730	[-18.4066, 19.04]
4:3	9.5333	9.2022	1.0360	0.3056	[-8.9897, 28.0564]
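One way to read the regression table is to rebuild the predicted mean for a given drug and disease combination from the coefficients. The sketch below does this in plain Python, assuming treatment coding with level 1 of each factor as the reference (as the "(reference)" rows indicate) and assuming the interaction labels are ordered drug:disease. The dictionary layout and function name are hypothetical.

```python
# Coefficients copied from the regression table above.
coef = {
    "Intercept": 29.3333,
    "drug": {2: -1.3333, 3: -13.0, 4: -15.7333},
    "disease": {2: -1.0833, 3: -8.9333},
    "drug:disease": {
        (2, 2): 6.5833, (2, 3): -0.9,
        (3, 2): -10.85, (3, 3): 1.1,
        (4, 2): 0.3167, (4, 3): 9.5333,
    },
}


def predicted_mean(drug, disease):
    """Sum the intercept and every coefficient that applies to this cell.

    Reference levels (drug 1, disease 1) have no coefficient, so they add 0.
    """
    return (
        coef["Intercept"]
        + coef["drug"].get(drug, 0.0)
        + coef["disease"].get(disease, 0.0)
        + coef["drug:disease"].get((drug, disease), 0.0)
    )


print(round(predicted_mean(1, 1), 4))  # reference cell: the intercept alone
print(round(predicted_mean(4, 1), 4))  # intercept + drug 4 effect
print(round(predicted_mean(2, 2), 4))  # both main effects + their interaction
```

With all main effects and interactions in the model, each predicted value is the fitted mean of systolic for that cell.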