Welcome to the PySummaries documentation!
PySummaries is a Python package to easily produce table summarizations from pandas dataframes.
Installation
You can install the package with pip directly from this repo:
pip install git+https://github.roche.com/fajardoo/pysummaries@master
QuickStart
Let’s say we have a dataframe with some data we want to summarize. Let’s take a look at the data:
import pandas as pd
from pysummaries import get_table_summary, get_sample_data
df = get_sample_data()
df
| gender | age | region | group | procedures | has_diabetes |
---|---|---|---|---|---|---|
0 | Male | 51.0 | East | Control | 2 | True |
1 | Female | 58.0 | North | Experimental | 1 | False |
2 | Female | 68.0 | South | Experimental | 4 | False |
3 | Male | 71.0 | North | Control | 8 | False |
4 | Male | 51.0 | West | Control | 2 | False |
... | ... | ... | ... | ... | ... | ... |
96 | Male | 45.0 | North | Experimental | 8 | False |
97 | Male | 51.0 | North | Experimental | 1 | True |
98 | Male | 25.0 | South | Experimental | 6 | False |
99 | Male | 51.0 | South | Experimental | 4 | False |
100 | NaN | NaN | NaN | Control | 3 | True |
101 rows × 6 columns
Now let’s build a “Table 1” summary stratified by group.
Two backends are available for the HTML representation: a native PySummaries representation and one based on the popular great_tables package. The backend parameter controls which one is used; if it is not set, the default is ‘native’.
Let’s start with the PySummaries native backend:
summary_table = get_table_summary(df, strata='group', backend='native')
summary_table
| Control (N=50) | Experimental (N=51) | Overall (N=101) | |
---|---|---|---|---|
gender | ||||
Female | 15 (30.0 %) | 22 (43.1 %) | 37 (36.6 %) | |
Male | 34 (68.0 %) | 29 (56.9 %) | 63 (62.4 %) | |
Missing | 1 (2.0 %) | 0 (0%) | 1 (1.0 %) | |
age | ||||
Mean (SD) | 53.9 (17.1) | 49.3 (17.6) | 51.5 (17.4) | |
Median [Q1 ; Q3] | 58.0 [46.0 ; 68.0] | 49.0 [33.0 ; 63.5] | 51.5 [40.5 ; 68.0] | |
Min ; Max | 20.0 ; 78.0 | 20.0 ; 78.0 | 20.0 ; 78.0 | |
Missing | 1 (2.0 %) | 0 (0.0 %) | 1 (1.0 %) | |
region | ||||
East | 7 (14.0 %) | 6 (11.8 %) | 13 (12.9 %) | |
Missing | 1 (2.0 %) | 0 (0%) | 1 (1.0 %) | |
North | 11 (22.0 %) | 11 (21.6 %) | 22 (21.8 %) | |
South | 8 (16.0 %) | 15 (29.4 %) | 23 (22.8 %) | |
West | 23 (46.0 %) | 19 (37.3 %) | 42 (41.6 %) | |
procedures | ||||
Mean (SD) | 5.1 (2.7) | 4.5 (2.6) | 4.8 (2.7) | |
Median [Q1 ; Q3] | 5.0 [3.0 ; 7.0] | 4.0 [3.0 ; 6.5] | 4.0 [3.0 ; 7.0] | |
Min ; Max | 1 ; 9 | 1 ; 9 | 1 ; 9 | |
Missing | 0 (0.0 %) | 0 (0.0 %) | 0 (0.0 %) | |
has_diabetes | ||||
False | 36 (72.0 %) | 39 (76.5 %) | 75 (74.3 %) | |
True | 14 (28.0 %) | 12 (23.5 %) | 26 (25.7 %) |
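As a rough illustration of what the n (%) cells of a stratified table contain, the following sketch recomputes counts and column percentages with plain pandas on a tiny made-up data frame. It does not call PySummaries and is not its implementation; df_small and its values are hypothetical.

```python
import pandas as pd

# Tiny made-up data set (hypothetical); column names mirror the sample data.
df_small = pd.DataFrame({
    "group": ["Control", "Control", "Experimental", "Experimental", "Experimental"],
    "gender": ["Male", "Female", "Female", "Female", "Male"],
})

# Count each gender level within each group (rows: gender, columns: group).
counts = pd.crosstab(df_small["gender"], df_small["group"])

# Column-wise percentages: each count divided by the total of its group.
percents = (100 * counts / counts.sum()).round(1)

print(counts)
print(percents)
```

Each column of percents adds up to roughly 100, which is the same reading as the percentages inside one stratum of the summary table.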
And now, let’s try the great_tables backend:
summary_table = get_table_summary(df, strata='group', backend='gt')
summary_table
| Control (N=50) | Experimental (N=51) | Overall (N=101) |
---|---|---|---|
gender | |||
Female | 15 (30.0 %) | 22 (43.1 %) | 37 (36.6 %) |
Male | 34 (68.0 %) | 29 (56.9 %) | 63 (62.4 %) |
Missing | 1 (2.0 %) | 0 (0%) | 1 (1.0 %) |
age | |||
Mean (SD) | 53.9 (17.1) | 49.3 (17.6) | 51.5 (17.4) |
Median [Q1 ; Q3] | 58.0 [46.0 ; 68.0] | 49.0 [33.0 ; 63.5] | 51.5 [40.5 ; 68.0] |
Min ; Max | 20.0 ; 78.0 | 20.0 ; 78.0 | 20.0 ; 78.0 |
Missing | 1 (2.0 %) | 0 (0.0 %) | 1 (1.0 %) |
region | |||
East | 7 (14.0 %) | 6 (11.8 %) | 13 (12.9 %) |
Missing | 1 (2.0 %) | 0 (0%) | 1 (1.0 %) |
North | 11 (22.0 %) | 11 (21.6 %) | 22 (21.8 %) |
South | 8 (16.0 %) | 15 (29.4 %) | 23 (22.8 %) |
West | 23 (46.0 %) | 19 (37.3 %) | 42 (41.6 %) |
procedures | |||
Mean (SD) | 5.1 (2.7) | 4.5 (2.6) | 4.8 (2.7) |
Median [Q1 ; Q3] | 5.0 [3.0 ; 7.0] | 4.0 [3.0 ; 6.5] | 4.0 [3.0 ; 7.0] |
Min ; Max | 1 ; 9 | 1 ; 9 | 1 ; 9 |
Missing | 0 (0.0 %) | 0 (0.0 %) | 0 (0.0 %) |
has_diabetes | |||
False | 36 (72.0 %) | 39 (76.5 %) | 75 (74.3 %) |
True | 14 (28.0 %) | 12 (23.5 %) | 26 (25.7 %) |
In both cases you can enhance the table with more features. For example, let’s add a title and a footer to the table.
Let’s start with the native backend. To discover what else it can do, see the documentation chapter about the Native Backend. That chapter describes everything in terms of the function pandas_to_report_html, but you can pass any of its arguments to get_table_summary.
summary_table = get_table_summary(df, strata='group', backend='native', caption="<strong>Table 1</strong>", footer="This is the footer")
summary_table
| Control (N=50) | Experimental (N=51) | Overall (N=101) | |
---|---|---|---|---|
gender | ||||
Female | 15 (30.0 %) | 22 (43.1 %) | 37 (36.6 %) | |
Male | 34 (68.0 %) | 29 (56.9 %) | 63 (62.4 %) | |
Missing | 1 (2.0 %) | 0 (0%) | 1 (1.0 %) | |
age | ||||
Mean (SD) | 53.9 (17.1) | 49.3 (17.6) | 51.5 (17.4) | |
Median [Q1 ; Q3] | 58.0 [46.0 ; 68.0] | 49.0 [33.0 ; 63.5] | 51.5 [40.5 ; 68.0] | |
Min ; Max | 20.0 ; 78.0 | 20.0 ; 78.0 | 20.0 ; 78.0 | |
Missing | 1 (2.0 %) | 0 (0.0 %) | 1 (1.0 %) | |
region | ||||
East | 7 (14.0 %) | 6 (11.8 %) | 13 (12.9 %) | |
Missing | 1 (2.0 %) | 0 (0%) | 1 (1.0 %) | |
North | 11 (22.0 %) | 11 (21.6 %) | 22 (21.8 %) | |
South | 8 (16.0 %) | 15 (29.4 %) | 23 (22.8 %) | |
West | 23 (46.0 %) | 19 (37.3 %) | 42 (41.6 %) | |
procedures | ||||
Mean (SD) | 5.1 (2.7) | 4.5 (2.6) | 4.8 (2.7) | |
Median [Q1 ; Q3] | 5.0 [3.0 ; 7.0] | 4.0 [3.0 ; 6.5] | 4.0 [3.0 ; 7.0] | |
Min ; Max | 1 ; 9 | 1 ; 9 | 1 ; 9 | |
Missing | 0 (0.0 %) | 0 (0.0 %) | 0 (0.0 %) | |
has_diabetes | ||||
False | 36 (72.0 %) | 39 (76.5 %) | 75 (74.3 %) | |
True | 14 (28.0 %) | 12 (23.5 %) | 26 (25.7 %) | |
This is the footer |
And now with the great_tables backend. To discover what else you can do with great_tables, please visit the great_tables documentation.
# get_table_summary returns a GT object; chain great_tables methods on it to add features
summary_table = (
    get_table_summary(df, strata='group', backend='gt')
    .tab_header(title="Table 1")
    .tab_source_note(source_note="This is the footer")
    .cols_align('center')
)
summary_table
Table 1 | |||
| Control (N=50) | Experimental (N=51) | Overall (N=101) |
---|---|---|---|
gender | |||
Female | 15 (30.0 %) | 22 (43.1 %) | 37 (36.6 %) |
Male | 34 (68.0 %) | 29 (56.9 %) | 63 (62.4 %) |
Missing | 1 (2.0 %) | 0 (0%) | 1 (1.0 %) |
age | |||
Mean (SD) | 53.9 (17.1) | 49.3 (17.6) | 51.5 (17.4) |
Median [Q1 ; Q3] | 58.0 [46.0 ; 68.0] | 49.0 [33.0 ; 63.5] | 51.5 [40.5 ; 68.0] |
Min ; Max | 20.0 ; 78.0 | 20.0 ; 78.0 | 20.0 ; 78.0 |
Missing | 1 (2.0 %) | 0 (0.0 %) | 1 (1.0 %) |
region | |||
East | 7 (14.0 %) | 6 (11.8 %) | 13 (12.9 %) |
Missing | 1 (2.0 %) | 0 (0%) | 1 (1.0 %) |
North | 11 (22.0 %) | 11 (21.6 %) | 22 (21.8 %) |
South | 8 (16.0 %) | 15 (29.4 %) | 23 (22.8 %) |
West | 23 (46.0 %) | 19 (37.3 %) | 42 (41.6 %) |
procedures | |||
Mean (SD) | 5.1 (2.7) | 4.5 (2.6) | 4.8 (2.7) |
Median [Q1 ; Q3] | 5.0 [3.0 ; 7.0] | 4.0 [3.0 ; 6.5] | 4.0 [3.0 ; 7.0] |
Min ; Max | 1 ; 9 | 1 ; 9 | 1 ; 9 |
Missing | 0 (0.0 %) | 0 (0.0 %) | 0 (0.0 %) |
has_diabetes | |||
False | 36 (72.0 %) | 39 (76.5 %) | 75 (74.3 %) |
True | 14 (28.0 %) | 12 (23.5 %) | 26 (25.7 %) |
This is the footer |
More features
Controlling the order of categorical variables
In the previous examples the levels of the variable region are ordered alphabetically. To display them in a specific order, convert the variable to a pandas categorical and set the order of its categories.
This applies to both the native and great_tables backends.
df2 = df.copy()
df2['region'] = pd.Categorical(df2['region'], categories=['North', 'South', 'East', 'West'], ordered=True)
summary_table = get_table_summary(df2, strata='group')
summary_table
| Control (N=50) | Experimental (N=51) | Overall (N=101) | |
---|---|---|---|---|
gender | ||||
Female | 15 (30.0 %) | 22 (43.1 %) | 37 (36.6 %) | |
Male | 34 (68.0 %) | 29 (56.9 %) | 63 (62.4 %) | |
Missing | 1 (2.0 %) | 0 (0%) | 1 (1.0 %) | |
age | ||||
Mean (SD) | 53.9 (17.1) | 49.3 (17.6) | 51.5 (17.4) | |
Median [Q1 ; Q3] | 58.0 [46.0 ; 68.0] | 49.0 [33.0 ; 63.5] | 51.5 [40.5 ; 68.0] | |
Min ; Max | 20.0 ; 78.0 | 20.0 ; 78.0 | 20.0 ; 78.0 | |
Missing | 1 (2.0 %) | 0 (0.0 %) | 1 (1.0 %) | |
region | ||||
East | 7 (14.0 %) | 6 (11.8 %) | 13 (12.9 %) | |
Missing | 1 (2.0 %) | 0 (0%) | 1 (1.0 %) | |
North | 11 (22.0 %) | 11 (21.6 %) | 22 (21.8 %) | |
South | 8 (16.0 %) | 15 (29.4 %) | 23 (22.8 %) | |
West | 23 (46.0 %) | 19 (37.3 %) | 42 (41.6 %) | |
procedures | ||||
Mean (SD) | 5.1 (2.7) | 4.5 (2.6) | 4.8 (2.7) | |
Median [Q1 ; Q3] | 5.0 [3.0 ; 7.0] | 4.0 [3.0 ; 6.5] | 4.0 [3.0 ; 7.0] | |
Min ; Max | 1 ; 9 | 1 ; 9 | 1 ; 9 | |
Missing | 0 (0.0 %) | 0 (0.0 %) | 0 (0.0 %) | |
has_diabetes | ||||
False | 36 (72.0 %) | 39 (76.5 %) | 75 (74.3 %) | |
True | 14 (28.0 %) | 12 (23.5 %) | 26 (25.7 %) |
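To see how an ordered categorical changes the order in which levels are listed, here is a minimal sketch that uses only pandas and a small made-up series. It does not call PySummaries, so it shows the pandas behaviour the approach relies on rather than the package's own output.

```python
import pandas as pd

# Small made-up sample (hypothetical); not the PySummaries sample data.
region = pd.Series(["West", "North", "East", "South", "North"])

# Plain string values sort alphabetically.
alphabetical = sorted(region.unique())

# An ordered categorical keeps the declared category order instead.
ordered = pd.Categorical(region, categories=["North", "South", "East", "West"], ordered=True)
counts = pd.Series(ordered).value_counts(sort=False)

print(alphabetical)
print(list(counts.index))
```

With sort=False, value_counts lists the levels in category order, so North comes first even though East sorts first alphabetically.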
Rounding
We can round the numbers to a specific number of decimals with the rounding parameter. In this example, let’s increase the number of decimals to 2.
summary_table = get_table_summary(df, strata='group', rounding=2)
summary_table
| Control (N=50) | Experimental (N=51) | Overall (N=101) | |
---|---|---|---|---|
gender | ||||
Female | 15 (30.0 %) | 22 (43.14 %) | 37 (36.63 %) | |
Male | 34 (68.0 %) | 29 (56.86 %) | 63 (62.38 %) | |
Missing | 1 (2.0 %) | 0 (0%) | 1 (0.99 %) | |
age | ||||
Mean (SD) | 53.86 (17.11) | 49.25 (17.62) | 51.51 (17.44) | |
Median [Q1 ; Q3] | 58.0 [46.0 ; 68.0] | 49.0 [33.0 ; 63.5] | 51.5 [40.5 ; 68.0] | |
Min ; Max | 20.0 ; 78.0 | 20.0 ; 78.0 | 20.0 ; 78.0 | |
Missing | 1 (2.0 %) | 0 (0.0 %) | 1 (1.0 %) | |
region | ||||
East | 7 (14.0 %) | 6 (11.76 %) | 13 (12.87 %) | |
Missing | 1 (2.0 %) | 0 (0%) | 1 (0.99 %) | |
North | 11 (22.0 %) | 11 (21.57 %) | 22 (21.78 %) | |
South | 8 (16.0 %) | 15 (29.41 %) | 23 (22.77 %) | |
West | 23 (46.0 %) | 19 (37.25 %) | 42 (41.58 %) | |
procedures | ||||
Mean (SD) | 5.06 (2.71) | 4.45 (2.63) | 4.75 (2.67) | |
Median [Q1 ; Q3] | 5.0 [3.0 ; 7.0] | 4.0 [3.0 ; 6.5] | 4.0 [3.0 ; 7.0] | |
Min ; Max | 1 ; 9 | 1 ; 9 | 1 ; 9 | |
Missing | 0 (0.0 %) | 0 (0.0 %) | 0 (0.0 %) | |
has_diabetes | ||||
False | 36 (72.0 %) | 39 (76.47 %) | 75 (74.26 %) | |
True | 14 (28.0 %) | 12 (23.53 %) | 26 (25.74 %) |
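The effect of the rounding value on a percentage cell can be reproduced with plain Python. The helper below is a hypothetical sketch, not the function PySummaries uses internally; it only assumes that percentages are rounded with the standard round function.

```python
def percent(count: int, total: int, rounding: int = 1) -> str:
    """Format count/total as a percentage rounded to `rounding` decimals."""
    return f"{round(100 * count / total, rounding)} %"

# 22 of the 51 patients in the Experimental group are female.
print(percent(22, 51, rounding=1))  # 43.1 %
print(percent(22, 51, rounding=2))  # 43.14 %

# 15 of the 50 patients in the Control group are female: no extra decimals appear.
print(percent(15, 50, rounding=2))  # 30.0 %
```

This also shows why values such as 30.0 % look the same with either setting: rounding only limits the number of decimals, it does not pad with zeros.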
Variables and Variable names
To display nicer labels for the variables, use the argument columns_labels. To choose which columns to include, use columns_include or columns_exclude.
# Use nicer labels for gender and age, and keep only those two columns
labels = {'gender': 'Birth gender, n (%)',
          'age': 'Age at Index'}
cols_include = ['age', 'gender']
# Alternatively, list the columns to drop with columns_exclude
# cols_exclude = ['region']
summary_table = get_table_summary(df, strata='group', columns_include=cols_include, columns_labels=labels)
summary_table
| Control (N=50) | Experimental (N=51) | Overall (N=101) | |
---|---|---|---|---|
Age at Index | ||||
Mean (SD) | 53.9 (17.1) | 49.3 (17.6) | 51.5 (17.4) | |
Median [Q1 ; Q3] | 58.0 [46.0 ; 68.0] | 49.0 [33.0 ; 63.5] | 51.5 [40.5 ; 68.0] | |
Min ; Max | 20.0 ; 78.0 | 20.0 ; 78.0 | 20.0 ; 78.0 | |
Missing | 1 (2.0 %) | 0 (0.0 %) | 1 (1.0 %) | |
Birth gender, n (%) | ||||
Female | 15 (30.0 %) | 22 (43.1 %) | 37 (36.6 %) | |
Male | 34 (68.0 %) | 29 (56.9 %) | 63 (62.4 %) | |
Missing | 1 (2.0 %) | 0 (0%) | 1 (1.0 %) |
Changing the Overall column name
You can change the name of the overall column with the overall_name parameter.
summary_table = get_table_summary(df, strata='group', overall_name='Total')
summary_table
| Control (N=50) | Experimental (N=51) | Total (N=101) | |
---|---|---|---|---|
gender | ||||
Female | 15 (30.0 %) | 22 (43.1 %) | 37 (36.6 %) | |
Male | 34 (68.0 %) | 29 (56.9 %) | 63 (62.4 %) | |
Missing | 1 (2.0 %) | 0 (0%) | 1 (1.0 %) | |
age | ||||
Mean (SD) | 53.9 (17.1) | 49.3 (17.6) | 51.5 (17.4) | |
Median [Q1 ; Q3] | 58.0 [46.0 ; 68.0] | 49.0 [33.0 ; 63.5] | 51.5 [40.5 ; 68.0] | |
Min ; Max | 20.0 ; 78.0 | 20.0 ; 78.0 | 20.0 ; 78.0 | |
Missing | 1 (2.0 %) | 0 (0.0 %) | 1 (1.0 %) | |
region | ||||
East | 7 (14.0 %) | 6 (11.8 %) | 13 (12.9 %) | |
Missing | 1 (2.0 %) | 0 (0%) | 1 (1.0 %) | |
North | 11 (22.0 %) | 11 (21.6 %) | 22 (21.8 %) | |
South | 8 (16.0 %) | 15 (29.4 %) | 23 (22.8 %) | |
West | 23 (46.0 %) | 19 (37.3 %) | 42 (41.6 %) | |
procedures | ||||
Mean (SD) | 5.1 (2.7) | 4.5 (2.6) | 4.8 (2.7) | |
Median [Q1 ; Q3] | 5.0 [3.0 ; 7.0] | 4.0 [3.0 ; 6.5] | 4.0 [3.0 ; 7.0] | |
Min ; Max | 1 ; 9 | 1 ; 9 | 1 ; 9 | |
Missing | 0 (0.0 %) | 0 (0.0 %) | 0 (0.0 %) | |
has_diabetes | ||||
False | 36 (72.0 %) | 39 (76.5 %) | 75 (74.3 %) | |
True | 14 (28.0 %) | 12 (23.5 %) | 26 (25.7 %) |
Hiding the Overall column or the N counts
You can drop the overall column by setting the show_overall parameter to False. In the same way, you can hide the N counts by setting the show_n parameter to False.
summary_table = get_table_summary(df, strata='group', show_overall=False, show_n=False)
summary_table
| Control | Experimental | |
---|---|---|---|
gender | |||
Female | 15 (30.0 %) | 22 (43.1 %) | |
Male | 34 (68.0 %) | 29 (56.9 %) | |
Missing | 1 (2.0 %) | 0 (0%) | |
age | |||
Mean (SD) | 53.9 (17.1) | 49.3 (17.6) | |
Median [Q1 ; Q3] | 58.0 [46.0 ; 68.0] | 49.0 [33.0 ; 63.5] | |
Min ; Max | 20.0 ; 78.0 | 20.0 ; 78.0 | |
Missing | 1 (2.0 %) | 0 (0.0 %) | |
region | |||
East | 7 (14.0 %) | 6 (11.8 %) | |
Missing | 1 (2.0 %) | 0 (0%) | |
North | 11 (22.0 %) | 11 (21.6 %) | |
South | 8 (16.0 %) | 15 (29.4 %) | |
West | 23 (46.0 %) | 19 (37.3 %) | |
procedures | |||
Mean (SD) | 5.1 (2.7) | 4.5 (2.6) | |
Median [Q1 ; Q3] | 5.0 [3.0 ; 7.0] | 4.0 [3.0 ; 6.5] | |
Min ; Max | 1 ; 9 | 1 ; 9 | |
Missing | 0 (0.0 %) | 0 (0.0 %) | |
has_diabetes | |||
False | 36 (72.0 %) | 39 (76.5 %) | |
True | 14 (28.0 %) | 12 (23.5 %) |
Changing or hiding the categorical missing level
For categorical variables, missing values in the data are reported by default as a new level called “Missing”. You can change this label with the categorical_missing_level parameter. If categorical_missing_level is set to None, the missing values are not replaced and therefore not reported.
In the example below, we focus on the variable gender and change “Missing” to “Unknown”.
summary_table = get_table_summary(df, strata='group', categorical_missing_level="Unknown", columns_include=['gender'])
summary_table
| Control (N=50) | Experimental (N=51) | Overall (N=101) | |
---|---|---|---|---|
gender | ||||
Female | 15 (30.0 %) | 22 (43.1 %) | 37 (36.6 %) | |
Male | 34 (68.0 %) | 29 (56.9 %) | 63 (62.4 %) | |
Unknown | 1 (2.0 %) | 0 (0%) | 1 (1.0 %) |
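The idea of replacing missing values with a named level can be sketched in plain pandas. This is an illustration on a made-up series, not the code PySummaries runs; it only shows why replaced values are counted while unreplaced ones are not.

```python
import pandas as pd

# Made-up series with one missing value (hypothetical data).
gender = pd.Series(["Male", "Female", None, "Male"])

# Replacing missing values with a label turns them into a countable level.
with_level = gender.fillna("Unknown").value_counts()

# Without the replacement, value_counts() drops missing values by default.
without_level = gender.value_counts()

print(with_level)
print(without_level)
```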
Summary presets
There are presets for summarizing categorical and numerical data. Pass them as a string to the categorical_functions or numerical_functions arguments.
The currently available categorical presets are (the default is ‘n_percent’):
‘n_percent’: N (%)
‘n’: N
‘percent’: %
The currently available numerical presets are (each value is shown on a separate row; the default is ‘meansd_medianq1q3_minmax_missing’):
‘meansd_medianq1q3_minmax_missing’: Mean (SD) / Median [Q1 ; Q3] / Min ; Max / Missing
‘meansd_medianiqr_minmax_missing’: Mean (SD) / Median [IQR] / Min ; Max / Missing
Let’s change the presets to see what happens!
summary_table = get_table_summary(df, strata='group',
                                  categorical_functions='percent',
                                  numerical_functions='meansd_medianiqr_minmax_missing',
                                  )
summary_table
| Control (N=50) | Experimental (N=51) | Overall (N=101) | |
---|---|---|---|---|
gender | ||||
Female | 30.0 % | 43.1 % | 36.6 % | |
Male | 68.0 % | 56.9 % | 62.4 % | |
Missing | 2.0 % | 0% | 1.0 % | |
age | ||||
Mean (SD) | 53.9 (17.1) | 49.3 (17.6) | 51.5 (17.4) | |
Median [IQR] | 58.0 [22.0] | 49.0 [30.5] | 51.5 [27.5] | |
Min ; Max | 20.0 ; 78.0 | 20.0 ; 78.0 | 20.0 ; 78.0 | |
Missing | 1 (2.0 %) | 0 (0.0 %) | 1 (1.0 %) | |
region | ||||
East | 14.0 % | 11.8 % | 12.9 % | |
Missing | 2.0 % | 0% | 1.0 % | |
North | 22.0 % | 21.6 % | 21.8 % | |
South | 16.0 % | 29.4 % | 22.8 % | |
West | 46.0 % | 37.3 % | 41.6 % | |
procedures | ||||
Mean (SD) | 5.1 (2.7) | 4.5 (2.6) | 4.8 (2.7) | |
Median [IQR] | 5.0 [4.0] | 4.0 [3.5] | 4.0 [4.0] | |
Min ; Max | 1 ; 9 | 1 ; 9 | 1 ; 9 | |
Missing | 0 (0.0 %) | 0 (0.0 %) | 0 (0.0 %) | |
has_diabetes | ||||
False | 72.0 % | 76.5 % | 74.3 % | |
True | 28.0 % | 23.5 % | 25.7 % |
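The two numerical presets report the same quartiles in different forms: the IQR is the difference Q3 - Q1. The short check below uses the age quartiles printed in the default tables earlier on this page; the dictionary name is only illustrative.

```python
# Quartiles for age as printed with the default preset: (Q1, Q3) per column.
age_quartiles = {
    "Control": (46.0, 68.0),
    "Experimental": (33.0, 63.5),
    "Overall": (40.5, 68.0),
}

# The interquartile range is the distance between the third and first quartile.
iqr = {group: q3 - q1 for group, (q1, q3) in age_quartiles.items()}

print(iqr)  # {'Control': 22.0, 'Experimental': 30.5, 'Overall': 27.5}
```

These values match the Median [IQR] row for age in the table above.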
Other summary functions and numerical missing values
There are built-in functions to summarize the data in different ways. To use them, import them and pass them to the parameters numerical_functions or categorical_functions, which operate on numerical or categorical variables respectively.
For numerical_functions, pass a dictionary in which each key is a string with the label you would like for the row and each value is the function to apply. You can pass multiple key:value pairs.
For categorical_functions, pass a tuple or list with two elements: the first is the function (only one is allowed) and the second is a string or number used to replace NA values for a given category level (0 in this case). If you pass None as the second element, a nan is printed.
In the example below, the categorical summary shows only N, and the numerical summary shows only the median with IQR and the missing values, whose row label we also change to “Unknown”. To leave out the missing values, simply do not include that function.
from pysummaries import (categorical_n, categorical_n_percent, categorical_percent,
                         numerical_mean_sd, numerical_median_iqr, numerical_median_q1q3,
                         numerical_min_max, numerical_missing)
summary_table = get_table_summary(df, strata='group',
                                  categorical_functions=(categorical_n, 0),
                                  numerical_functions={"Median [IQR]": numerical_median_iqr,
                                                       "Unknown": numerical_missing}
                                  )
summary_table
| Control (N=50) | Experimental (N=51) | Overall (N=101) | |
---|---|---|---|---|
gender | ||||
Female | 15 | 22.0 | 37 | |
Male | 34 | 29.0 | 63 | |
Missing | 1 | nan | 1 | |
age | ||||
Median [IQR] | 58.0 [22.0] | 49.0 [30.5] | 51.5 [27.5] | |
Unknown | 1 (2.0 %) | 0 (0.0 %) | 1 (1.0 %) | |
region | ||||
East | 7 | 6.0 | 13 | |
Missing | 1 | nan | 1 | |
North | 11 | 11.0 | 22 | |
South | 8 | 15.0 | 23 | |
West | 23 | 19.0 | 42 | |
procedures | ||||
Median [IQR] | 5.0 [4.0] | 4.0 [3.5] | 4.0 [4.0] | |
Unknown | 0 (0.0 %) | 0 (0.0 %) | 0 (0.0 %) | |
has_diabetes | ||||
False | 36 | 39 | 75 | |
True | 14 | 12 | 26 |
Writing your own summary functions
You can create your own summary functions. To do so, write a function that takes a pandas Series and a rounding parameter. For a categorical variable it should return a Series with a numerical or string value for each level; for a numerical variable it should return a single value.
In this example we define one function for the categorical variables and one for the numerical variables.
def mycategorical(curseries, rounding):
    """
    My own function for the categorical variables
    Let's just count the number of each category.
    """
    return curseries.value_counts()

def mynumerical(curseries, rounding):
    """
    My own function for the numerical variables
    Let's just return the mean.
    """
    return round(curseries.mean(), rounding)

summary_table = get_table_summary(df, strata='group',
                                  categorical_functions=(mycategorical, "0"),
                                  numerical_functions={"Mean": mynumerical}
                                  )
summary_table
| Control (N=50) | Experimental (N=51) | Overall (N=101) | |
---|---|---|---|---|
gender | ||||
Female | 15.0 | 22.0 | 37.0 | |
Male | 34.0 | 29.0 | 63.0 | |
Missing | 1.0 | 0 | 1.0 | |
age | ||||
Mean | 53.9 | 49.3 | 51.5 | |
region | ||||
East | 7.0 | 6.0 | 13.0 | |
Missing | 1.0 | 0 | 1.0 | |
North | 11.0 | 11.0 | 22.0 | |
South | 8.0 | 15.0 | 23.0 | |
West | 23.0 | 19.0 | 42.0 | |
procedures | ||||
Mean | 5.1 | 4.5 | 4.8 | |
has_diabetes | ||||
False | 36.0 | 39 | 75.0 | |
True | 14.0 | 12 | 26.0 |
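Before passing a custom function to get_table_summary, it can help to try it on a small pandas Series. The sketch below defines two further functions with the same (series, rounding) signature described above and runs them on made-up data. The function names and the data are hypothetical, and the functions have not been checked against PySummaries itself.

```python
import pandas as pd

def categorical_share(curseries, rounding):
    """Return the share of each level as a string such as '66.7 %'."""
    shares = curseries.value_counts(normalize=True) * 100
    return shares.round(rounding).astype(str) + " %"

def numerical_range(curseries, rounding):
    """Return the range (max - min) as a single rounded value."""
    return round(curseries.max() - curseries.min(), rounding)

# Made-up inputs to exercise the functions on their own.
levels = pd.Series(["a", "a", "b"])
values = pd.Series([1.0, 4.5, 2.5])

share_result = categorical_share(levels, 1)
range_result = numerical_range(values, 1)

print(share_result)
print(range_result)
```

Following the pattern of the example above, such functions would presumably be passed as categorical_functions=(categorical_share, "0 %") and numerical_functions={"Range": numerical_range}.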