crosstab()

Returns up to 3 DataFrames depending on the options requested. Row, column, or cell percentages can be calculated if requested; otherwise, counts are returned by default.

DataFrame 1 is always the crosstabulation results; whether the other 2 DataFrames are returned depends on the arguments test and expected_freqs. If all 3 are returned, the order is as follows: crosstabulation results, \(\chi^2\) test results with an effect size of Cramer’s Phi or V depending on the size of the table, and the expected frequency table.

Arguments

crosstab(group1, group2, prop= None, test = False, margins= True, correction = None, cramer_correction = None, exact = False, expected_freqs= False)

  • group1 and group2, each must be a Pandas Series

  • prop, can be ‘row’, ‘col’, or ‘cell’. ‘row’ will calculate the row percentages, ‘col’ will calculate the column percentages, and ‘cell’ will calculate the cell percentages based on the entire sample

  • test, can take “chi-square”, “g-test”, “mcnemar”, or “fisher”.
    • If “chi-square”, the chi-square (\(\chi^2\)) test of independence [1] will be calculated and returned in a second DataFrame.

    • If “g-test”, will conduct the G-test (likelihood-ratio \(\chi^2\)) [1] and the results will be returned in a second DataFrame.

    • If “fisher”, will conduct Fisher’s exact test [2].

    • If “mcnemar”, will conduct the McNemar \(\chi^2\) [3] test for paired nominal data.

  • margins, if False will return a crosstabulation table without the total counts for each group. This argument is only supported for counts; the margins will always be returned for the percentages

  • correction, if True, applies the Yates’ correction for continuity. Valid argument for “chi-square”, “g-test”, and “mcnemar”.

  • cramer_correction, if True, applies the bias correction developed by Tschuprow (1925) to Cramer’s V.

  • exact, is only a valid option when the “mcnemar” test is selected. In that case, if exact = True, the binomial distribution will be used. If False (default), the \(\chi^2\) distribution is used.

  • expected_freqs, if True, will return a DataFrame that contains the expected counts for each cell. Not a valid argument for the “mcnemar” test.

Returns
  • Up to 3 Pandas DataFrames as a tuple;
    • First DataFrame is always the crosstab table with either the counts, cell, row, or column percentages

    • Second DataFrame is either the test results or the expected frequencies. If a test is selected and expected frequencies are desired, the second DataFrame will be the test results; otherwise, if just expected frequencies are desired, the second DataFrame will be that and there will not be a third DataFrame returned.

    • Third DataFrame, if returned, is always the expected frequencies

Note

If conducting a McNemar test, make sure the outcomes in both variables are labelled the same.

Effect size measures formulas

Note

If adjusted \(\chi^2\) values are used in the test’s calculation, then those adjusted \(\chi^2\) values are also used to calculate effect size.

Cramer’s Phi (2x2 table)

For analyses where the table is 2x2, the following formula is used to calculate Cramer’s Phi (\(\phi\)) [4]:

\[\phi = \sqrt{\frac{\chi^2}{N}}\]

Where N = total number of observations in the analysis
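As a rough illustration of the formula above, the following sketch computes \(\phi\) by hand from the uncorrected Pearson \(\chi^2\) of a 2x2 table. The helper name cramers_phi is hypothetical and is not part of researchpy; the counts are those of the 2x2 'tx' by 'cured' table used in the Examples section.

```python
import math

def cramers_phi(a, b, c, d):
    """Phi for the 2x2 table [[a, b], [c, d]] (hypothetical helper, not researchpy API)."""
    n = a + b + c + d
    # Uncorrected Pearson chi-square, shortcut formula for a 2x2 table
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return math.sqrt(chi2 / n)

phi = cramers_phi(25, 17, 20, 28)
print(round(phi, 4))
```

Under these assumptions the result agrees with the Cramer's phi value reported in the Fisher's exact test example.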

Cramer’s V (RxC where R or C > 2)

For analyses where the table is larger than 2x2, the following formula is used to calculate Cramer’s V [4]:

\[V = \sqrt{\frac{\chi^2}{N(k - 1)}}\]

Where N is the total number of observations and k is the number of categories for either R or C (whichever has fewer categories)
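The formula for V can be sketched in plain Python as follows. This is only an illustration of the arithmetic, not researchpy's implementation; the helper names and the 2x3 counts are made up for the example.

```python
import math

def pearson_chi2(table):
    """Uncorrected Pearson chi-square and sample size for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def cramers_v(table):
    """V = sqrt(chi2 / (N * (k - 1))), with k the smaller of the row and column counts."""
    chi2, n = pearson_chi2(table)
    k = min(len(table), len(table[0]))
    return math.sqrt(chi2 / (n * (k - 1)))

table = [[10, 20, 30], [20, 20, 20]]  # hypothetical counts
v = cramers_v(table)
print(round(v, 4))
```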

If cramer_correction = True, the bias-corrected version of Cramer’s V is calculated instead:

\[\tilde{V} = \sqrt{\frac{\tilde{\phi}^2}{\text{min}(\tilde{r} - 1, \tilde{c} - 1)}}\]

Where r is the number of rows, c is the number of columns, n is the total number of observations, and

\[\begin{split}\tilde{\phi}^2 = \text{max}(0, \frac{\chi^2}{n} - \frac{(c - 1)(r - 1)}{n - 1}) \\ \tilde{c} = c - \frac{(c - 1)^2}{n - 1} \\ \tilde{r} = r - \frac{(r - 1)^2}{n - 1}\end{split}\]
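A minimal sketch of the bias-corrected formulas above, assuming the \(\chi^2\) statistic, the sample size, and the table dimensions are already known. The function name corrected_cramers_v and the input values are hypothetical and chosen only for illustration.

```python
import math

def corrected_cramers_v(chi2, n, r, c):
    """Bias-corrected V following the formulas above (hypothetical helper)."""
    phi2_tilde = max(0.0, chi2 / n - (c - 1) * (r - 1) / (n - 1))
    c_tilde = c - (c - 1) ** 2 / (n - 1)
    r_tilde = r - (r - 1) ** 2 / (n - 1)
    return math.sqrt(phi2_tilde / min(r_tilde - 1, c_tilde - 1))

# Hypothetical inputs: chi-square of 16/3 from a 2x3 table with 120 observations
v_tilde = corrected_cramers_v(chi2=16 / 3, n=120, r=2, c=3)
v_plain = math.sqrt((16 / 3) / (120 * (2 - 1)))
print(round(v_tilde, 4), round(v_plain, 4))
```

For these inputs the corrected value is smaller than the uncorrected one, which is the intended effect of the correction.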

Examples

import researchpy, pandas, numpy

numpy.random.seed(123)

df = pandas.DataFrame(numpy.random.randint(3, size= (101, 3)),
                  columns= ['disease', 'severity', 'alive'])

df.head()
   disease  severity  alive
0        2         1      2
1        2         0      2
2        2         1      2
3        1         2      1
4        0         1      2

# If only two Series are passed, it will output a crosstabulation with margin totals.
# This is the same as pandas.crosstab(), except that researchpy.crosstab() returns
# a table with hierarchical indexing for a better export format.

researchpy.crosstab(df['disease'], df['alive'])
        alive
            0   1   2  All
disease
0           9  14   7   30
1           7   9  15   31
2           7  17  16   40
All        23  40  38  101
# Demonstration of calculating cell proportions

crosstab = researchpy.crosstab(df['disease'], df['alive'], prop= "cell")

crosstab
         alive
             0      1      2     All
disease
0         8.91  13.86   6.93   29.70
1         6.93   8.91  14.85   30.69
2         6.93  16.83  15.84   39.60
All      22.77  39.60  37.62  100.00
# Demonstration of calculating row proportions

crosstab = researchpy.crosstab(df['disease'], df['alive'], prop= "row")

crosstab
         alive
             0      1      2    All
disease
0        30.00  46.67  23.33  100.0
1        22.58  29.03  48.39  100.0
2        17.50  42.50  40.00  100.0
All      22.77  39.60  37.62  100.0
# Demonstration of calculating column proportions

crosstab = researchpy.crosstab(df['disease'], df['alive'], prop= "col")

crosstab
          alive
              0      1       2     All
disease
0         39.13   35.0   18.42   29.70
1         30.43   22.5   39.47   30.69
2         30.43   42.5   42.11   39.60
All      100.00  100.0  100.00  100.00
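To make the three prop options concrete, the sketch below recomputes the percentages by hand from the counts shown in the first crosstabulation. It uses plain Python rather than researchpy, and the variable names are illustrative only.

```python
# Counts from the disease (rows) by alive (columns) crosstabulation above
counts = [[9, 14, 7], [7, 9, 15], [7, 17, 16]]

row_totals = [sum(row) for row in counts]        # [30, 31, 40]
col_totals = [sum(col) for col in zip(*counts)]  # [23, 40, 38]
n = sum(row_totals)                              # 101

# prop="cell": each count as a percentage of the entire sample
cell_pct = [[round(100 * x / n, 2) for x in row] for row in counts]

# prop="row": each count as a percentage of its row total
row_pct = [[round(100 * x / row_totals[i], 2) for x in row]
           for i, row in enumerate(counts)]

# prop="col": each count as a percentage of its column total
col_pct = [[round(100 * x / col_totals[j], 2) for j, x in enumerate(row)]
           for row in counts]

print(cell_pct[0], row_pct[0], col_pct[0])
```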
# To conduct a Chi-square test of independence, pass "chi-square" in the "test =" argument.
# This will also output an effect size: either Cramer's Phi if it is a 2x2 table, or
# Cramer's V if it is larger than 2x2.

# This will return 2 DataFrames as a tuple, 1 with the crosstabulation and the other with the
# test results. It's rather ugly; the recommended way to output is in the next example.

researchpy.crosstab(df['disease'], df['alive'], test= "chi-square")
(        alive
             0   1   2  All
 disease
 0           9  14   7   30
 1           7   9  15   31
 2           7  17  16   40
 All        23  40  38  101,                 Chi-square test  results
 0  Pearson Chi-square ( 4.0) =    5.1573
 1                    p-value =    0.2715
 2                 Cramer's V =    0.3196)
# To clean up the output, assign each DataFrame to an object. This allows
# for a cleaner view and each DataFrame to be exported

crosstab, res = researchpy.crosstab(df['disease'], df['alive'], test= "chi-square")

crosstab
        alive
            0   1   2  All
disease
0           9  14   7   30
1           7   9  15   31
2           7  17  16   40
All        23  40  38  101

res
                Chi-square test  results
0  Pearson Chi-square ( 4.0) =    5.1573
1                    p-value =    0.2715
2                 Cramer's V =    0.3196
# To get the expected frequencies, pass "True" in "expected_freqs="

crosstab, res, expected = researchpy.crosstab(df['disease'], df['alive'], test= "chi-square", expected_freqs= True)

expected
            alive
                0          1          2
disease
0        6.831683  11.881188  11.287129
1        7.059406  12.277228  11.663366
2        9.108911  15.841584  15.049505
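The expected counts above follow the usual independence rule: row total times column total, divided by the sample size. A minimal hand computation, written in plain Python rather than with researchpy, looks like this:

```python
# Observed counts from the disease (rows) by alive (columns) crosstabulation
counts = [[9, 14, 7], [7, 9, 15], [7, 17, 16]]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
n = sum(row_totals)

# Expected count for cell (i, j) under independence
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print([round(x, 6) for x in row])
```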
# Can also conduct the G-test (likelihood-ratio chi-square)

crosstab, res = researchpy.crosstab(df['disease'], df['alive'], test= "g-test")

res
                        G-test  results
0  Log-likelihood ratio ( 4.0) =   5.3808
1                      p-value =   0.2504
2                   Cramer's V =   0.3264
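For reference, the G statistic can be reproduced by hand as twice the sum of observed counts times the log of observed over expected. The sketch below assumes no correction is applied and that no cell count is zero; it is an illustration in plain Python, not researchpy's code.

```python
import math

# Observed counts from the disease (rows) by alive (columns) crosstabulation
counts = [[9, 14, 7], [7, 9, 15], [7, 17, 16]]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
n = sum(row_totals)

# G = 2 * sum(O * ln(O / E)); assumes every observed count is positive
g = 0.0
for i, row in enumerate(counts):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        g += observed * math.log(observed / expected)
g *= 2

print(round(g, 4))
```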
# Can also conduct Fisher's exact test

# Need 2x2 data for Fisher's test.
numpy.random.seed(345)

df = pandas.DataFrame(numpy.random.randint(2, size= (90, 2)),
                  columns= ['tx', 'cured'])

crosstab, res = researchpy.crosstab(df['tx'], df['cured'], test= "fisher")

crosstab
     cured
         0   1  All
tx
0       25  17   42
1       20  28   48
All     45  45   90

res
   Fisher's exact test  results
0          Odds ratio =   2.0588
1     2 sided p-value =   0.1387
2   Left tail p-value =   0.9717
3  Right tail p-value =   0.0694
4        Cramer's phi =   0.1782
# Lastly, the McNemar test
# Make sure your outcomes are labelled the same in
# both variables
numpy.random.seed(345)

df = pandas.DataFrame(numpy.random.randint(2, size= (90, 2)),
                  columns= ['time1', 'time2'])

crosstab, res = researchpy.crosstab(df['time1'], df['time2'], test= "mcnemar")

crosstab
     time2
         0   1  All
time1
0       25  17   42
1       20  28   48
All     45  45   90

res
                       McNemar  results
0  McNemar's Chi-square ( 1.0) =   0.2432
1                      p-value =   0.6219
2                 Cramer's phi =   0.0520
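The McNemar results above can be checked by hand from the two discordant cells of the table (17 and 20). The sketch below assumes the uncorrected statistic with the \(\chi^2\) distribution on 1 degree of freedom, which corresponds to the defaults correction = None and exact = False; it uses only the standard library.

```python
import math

# Discordant cells: time1=0/time2=1 and time1=1/time2=0
b, c = 17, 20
n = 90  # total number of pairs

# Uncorrected McNemar chi-square statistic
stat = (b - c) ** 2 / (b + c)

# Survival function of the chi-square distribution with 1 degree of freedom
p_value = math.erfc(math.sqrt(stat / 2))

# Effect size from the formula for phi: sqrt(chi2 / N)
phi = math.sqrt(stat / n)

print(round(stat, 4), round(p_value, 4), round(phi, 4))
```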

References

[1] scipy.stats.chi2_contingency. The SciPy community, 2016. Retrieved when last updated on May 12, 2016. URL: http://lagrange.univ-lyon1.fr/docs/scipy/0.17.1/generated/scipy.stats.chi2_contingency.html.

[2] scipy.stats.fisher_exact. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html.

[3] statsmodels.stats.contingency_tables.mcnemar. Statsmodels-developers, 2018. URL: https://www.statsmodels.org/dev/generated/statsmodels.stats.contingency_tables.mcnemar.html.

[4] Harald Cramér. Mathematical methods of statistics (PMS-9). Volume 9. Princeton University Press, 2016.