Welcome to PySummaries documentation!

PySummaries is a Python package to easily produce table summarizations from pandas dataframes.

Installation

You can install the package with pip directly from this repo:

pip install git+https://github.roche.com/fajardoo/pysummaries@master

QuickStart

Let’s say we have a dataframe with some data we want to summarize. Let’s take a look at the data:

import pandas as pd

from pysummaries import get_table_summary, get_sample_data

df = get_sample_data()
df

	gender	age	region	group	procedures	has_diabetes
0	Male	51.0	East	Control	2	True
1	Female	58.0	North	Experimental	1	False
2	Female	68.0	South	Experimental	4	False
3	Male	71.0	North	Control	8	False
4	Male	51.0	West	Control	2	False
...	...	...	...	...	...	...
96	Male	45.0	North	Experimental	8	False
97	Male	51.0	North	Experimental	1	True
98	Male	25.0	South	Experimental	6	False
99	Male	51.0	South	Experimental	4	False
100	NaN	NaN	NaN	Control	3	True

101 rows × 6 columns

Now, let’s build a “table one”, stratifying by group.

We can use two backends for the HTML representation: a PySummaries native representation, and one using the popular great_tables package. The backend parameter controls which one is used; if it is not set, the default is ‘native’.

Let’s start first with the PySummaries native backend:

summary_table = get_table_summary(df, strata='group', backend='native')  
summary_table

	Control (N=50)	Experimental (N=51)	Overall (N=101)
gender
Female	15 (30.0 %)	22 (43.1 %)	37 (36.6 %)
Male	34 (68.0 %)	29 (56.9 %)	63 (62.4 %)
Missing	1 (2.0 %)	0 (0%)	1 (1.0 %)
age
Mean (SD)	53.9 (17.1)	49.3 (17.6)	51.5 (17.4)
Median [Q1 ; Q3]	58.0 [46.0 ; 68.0]	49.0 [33.0 ; 63.5]	51.5 [40.5 ; 68.0]
Min ; Max	20.0 ; 78.0	20.0 ; 78.0	20.0 ; 78.0
Missing	1 (2.0 %)	0 (0.0 %)	1 (1.0 %)
region
East	7 (14.0 %)	6 (11.8 %)	13 (12.9 %)
Missing	1 (2.0 %)	0 (0%)	1 (1.0 %)
North	11 (22.0 %)	11 (21.6 %)	22 (21.8 %)
South	8 (16.0 %)	15 (29.4 %)	23 (22.8 %)
West	23 (46.0 %)	19 (37.3 %)	42 (41.6 %)
procedures
Mean (SD)	5.1 (2.7)	4.5 (2.6)	4.8 (2.7)
Median [Q1 ; Q3]	5.0 [3.0 ; 7.0]	4.0 [3.0 ; 6.5]	4.0 [3.0 ; 7.0]
Min ; Max	1 ; 9	1 ; 9	1 ; 9
Missing	0 (0.0 %)	0 (0.0 %)	0 (0.0 %)
has_diabetes
False	36 (72.0 %)	39 (76.5 %)	75 (74.3 %)
True	14 (28.0 %)	12 (23.5 %)	26 (25.7 %)
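
To make the categorical rows above more concrete, the sketch below reproduces an “n (%)” cell per stratum with plain pandas. It is only an illustration under stated assumptions: the helper n_percent_by_group and the toy data are invented here, and this is not necessarily how PySummaries computes its cells.

```python
import pandas as pd

# Toy data, invented for this sketch (not the package's sample data).
toy = pd.DataFrame({
    "group": ["Control", "Control", "Control", "Control", "Experimental", "Experimental"],
    "gender": ["Male", "Female", "Male", None, "Female", "Female"],
})

def n_percent_by_group(df, column, strata, missing_level="Missing", rounding=1):
    """Hypothetical helper: count and percentage of each level within each stratum."""
    # Turn missing values into an explicit level so they are counted too.
    values = df[column].fillna(missing_level)
    counts = pd.crosstab(values, df[strata])
    # Percentages are taken within each stratum (column totals).
    percents = (counts / counts.sum(axis=0) * 100).round(rounding)
    # Combine count and percentage into a single "n (p %)" string per cell.
    return counts.astype(str) + " (" + percents.astype(str) + " %)"

cells = n_percent_by_group(toy, "gender", "group")
print(cells)
```

Percentages here use the stratum size as the denominator, which matches the table above (for example, 15 of the 50 Control rows is 30.0 %).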

And now, let’s try the great tables backend!

summary_table = get_table_summary(df, strata='group', backend='gt')  
summary_table

	Control (N=50)	Experimental (N=51)	Overall (N=101)
gender
Female	15 (30.0 %)	22 (43.1 %)	37 (36.6 %)
Male	34 (68.0 %)	29 (56.9 %)	63 (62.4 %)
Missing	1 (2.0 %)	0 (0%)	1 (1.0 %)
age
Mean (SD)	53.9 (17.1)	49.3 (17.6)	51.5 (17.4)
Median [Q1 ; Q3]	58.0 [46.0 ; 68.0]	49.0 [33.0 ; 63.5]	51.5 [40.5 ; 68.0]
Min ; Max	20.0 ; 78.0	20.0 ; 78.0	20.0 ; 78.0
Missing	1 (2.0 %)	0 (0.0 %)	1 (1.0 %)
region
East	7 (14.0 %)	6 (11.8 %)	13 (12.9 %)
Missing	1 (2.0 %)	0 (0%)	1 (1.0 %)
North	11 (22.0 %)	11 (21.6 %)	22 (21.8 %)
South	8 (16.0 %)	15 (29.4 %)	23 (22.8 %)
West	23 (46.0 %)	19 (37.3 %)	42 (41.6 %)
procedures
Mean (SD)	5.1 (2.7)	4.5 (2.6)	4.8 (2.7)
Median [Q1 ; Q3]	5.0 [3.0 ; 7.0]	4.0 [3.0 ; 6.5]	4.0 [3.0 ; 7.0]
Min ; Max	1 ; 9	1 ; 9	1 ; 9
Missing	0 (0.0 %)	0 (0.0 %)	0 (0.0 %)
has_diabetes
False	36 (72.0 %)	39 (76.5 %)	75 (74.3 %)
True	14 (28.0 %)	12 (23.5 %)	26 (25.7 %)

In both cases you can enhance the table with more features. For example, let’s add a title and footer to the table.

Let’s start with the native backend. To discover what else you can do with it, check the documentation chapter about the Native Backend. That chapter is written around the function pandas_to_report_html, but you can pass any of its arguments to get_table_summary.

summary_table = get_table_summary(df, strata='group', backend='native', caption="<strong>Table 1</strong>", footer="This is the footer")  
summary_table

**Table 1**
	Control (N=50)	Experimental (N=51)	Overall (N=101)
gender
Female	15 (30.0 %)	22 (43.1 %)	37 (36.6 %)
Male	34 (68.0 %)	29 (56.9 %)	63 (62.4 %)
Missing	1 (2.0 %)	0 (0%)	1 (1.0 %)
age
Mean (SD)	53.9 (17.1)	49.3 (17.6)	51.5 (17.4)
Median [Q1 ; Q3]	58.0 [46.0 ; 68.0]	49.0 [33.0 ; 63.5]	51.5 [40.5 ; 68.0]
Min ; Max	20.0 ; 78.0	20.0 ; 78.0	20.0 ; 78.0
Missing	1 (2.0 %)	0 (0.0 %)	1 (1.0 %)
region
East	7 (14.0 %)	6 (11.8 %)	13 (12.9 %)
Missing	1 (2.0 %)	0 (0%)	1 (1.0 %)
North	11 (22.0 %)	11 (21.6 %)	22 (21.8 %)
South	8 (16.0 %)	15 (29.4 %)	23 (22.8 %)
West	23 (46.0 %)	19 (37.3 %)	42 (41.6 %)
procedures
Mean (SD)	5.1 (2.7)	4.5 (2.6)	4.8 (2.7)
Median [Q1 ; Q3]	5.0 [3.0 ; 7.0]	4.0 [3.0 ; 6.5]	4.0 [3.0 ; 7.0]
Min ; Max	1 ; 9	1 ; 9	1 ; 9
Missing	0 (0.0 %)	0 (0.0 %)	0 (0.0 %)
has_diabetes
False	36 (72.0 %)	39 (76.5 %)	75 (74.3 %)
True	14 (28.0 %)	12 (23.5 %)	26 (25.7 %)
This is the footer

And now, with the great_tables backend. To discover what else you can do with great_tables, please visit the great_tables documentation.

# Get the GT object back from the function, then add features to it
summary_table = (
    get_table_summary(df, strata='group', backend='gt')
    .tab_header(title="Table 1")
    .tab_source_note(source_note="This is the footer")
    .cols_align('center')
)
summary_table

	Control (N=50)	Experimental (N=51)	Overall (N=101)
Table 1
gender
Female	15 (30.0 %)	22 (43.1 %)	37 (36.6 %)
Male	34 (68.0 %)	29 (56.9 %)	63 (62.4 %)
Missing	1 (2.0 %)	0 (0%)	1 (1.0 %)
age
Mean (SD)	53.9 (17.1)	49.3 (17.6)	51.5 (17.4)
Median [Q1 ; Q3]	58.0 [46.0 ; 68.0]	49.0 [33.0 ; 63.5]	51.5 [40.5 ; 68.0]
Min ; Max	20.0 ; 78.0	20.0 ; 78.0	20.0 ; 78.0
Missing	1 (2.0 %)	0 (0.0 %)	1 (1.0 %)
region
East	7 (14.0 %)	6 (11.8 %)	13 (12.9 %)
Missing	1 (2.0 %)	0 (0%)	1 (1.0 %)
North	11 (22.0 %)	11 (21.6 %)	22 (21.8 %)
South	8 (16.0 %)	15 (29.4 %)	23 (22.8 %)
West	23 (46.0 %)	19 (37.3 %)	42 (41.6 %)
procedures
Mean (SD)	5.1 (2.7)	4.5 (2.6)	4.8 (2.7)
Median [Q1 ; Q3]	5.0 [3.0 ; 7.0]	4.0 [3.0 ; 6.5]	4.0 [3.0 ; 7.0]
Min ; Max	1 ; 9	1 ; 9	1 ; 9
Missing	0 (0.0 %)	0 (0.0 %)	0 (0.0 %)
has_diabetes
False	36 (72.0 %)	39 (76.5 %)	75 (74.3 %)
True	14 (28.0 %)	12 (23.5 %)	26 (25.7 %)
This is the footer

More features

Controlling the order of categorical variables

In the previous examples the levels of the variable region are ordered alphabetically. To display them in a specific order, convert the variable to a pandas categorical and set the order of its categories.

This applies to both native and great_tables backends.

df2 = df.copy() 
df2['region'] = pd.Categorical(df2['region'], categories=['North', 'South', 'East', 'West'], ordered=True)
summary_table = get_table_summary(df2, strata='group')  
summary_table

	Control (N=50)	Experimental (N=51)	Overall (N=101)
gender
Female	15 (30.0 %)	22 (43.1 %)	37 (36.6 %)
Male	34 (68.0 %)	29 (56.9 %)	63 (62.4 %)
Missing	1 (2.0 %)	0 (0%)	1 (1.0 %)
age
Mean (SD)	53.9 (17.1)	49.3 (17.6)	51.5 (17.4)
Median [Q1 ; Q3]	58.0 [46.0 ; 68.0]	49.0 [33.0 ; 63.5]	51.5 [40.5 ; 68.0]
Min ; Max	20.0 ; 78.0	20.0 ; 78.0	20.0 ; 78.0
Missing	1 (2.0 %)	0 (0.0 %)	1 (1.0 %)
region
East	7 (14.0 %)	6 (11.8 %)	13 (12.9 %)
Missing	1 (2.0 %)	0 (0%)	1 (1.0 %)
North	11 (22.0 %)	11 (21.6 %)	22 (21.8 %)
South	8 (16.0 %)	15 (29.4 %)	23 (22.8 %)
West	23 (46.0 %)	19 (37.3 %)	42 (41.6 %)
procedures
Mean (SD)	5.1 (2.7)	4.5 (2.6)	4.8 (2.7)
Median [Q1 ; Q3]	5.0 [3.0 ; 7.0]	4.0 [3.0 ; 6.5]	4.0 [3.0 ; 7.0]
Min ; Max	1 ; 9	1 ; 9	1 ; 9
Missing	0 (0.0 %)	0 (0.0 %)	0 (0.0 %)
has_diabetes
False	36 (72.0 %)	39 (76.5 %)	75 (74.3 %)
True	14 (28.0 %)	12 (23.5 %)	26 (25.7 %)
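
As a small, pandas-only illustration of the mechanism described in this section, the sketch below shows how declaring the categories changes the sort order of the levels. The values are made up, and how PySummaries uses the category order internally is not shown here.

```python
import pandas as pd

regions = pd.Series(["West", "North", "East", "South", "West", "North"])

# Plain strings: sorting the index gives alphabetical order.
alphabetical = regions.value_counts().sort_index()

# Ordered categorical: sorting the index follows the declared category order.
ordered = pd.Series(
    pd.Categorical(regions, categories=["North", "South", "East", "West"], ordered=True)
)
by_category = ordered.value_counts().sort_index()

print(list(alphabetical.index))
print(list(by_category.index))
```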

Rounding

We can round the numbers to a specific number of decimals with the rounding parameter. In this example, let’s increase the number of decimals to 2.

summary_table = get_table_summary(df, strata='group', rounding=2)
summary_table

	Control (N=50)	Experimental (N=51)	Overall (N=101)
gender
Female	15 (30.0 %)	22 (43.14 %)	37 (36.63 %)
Male	34 (68.0 %)	29 (56.86 %)	63 (62.38 %)
Missing	1 (2.0 %)	0 (0%)	1 (0.99 %)
age
Mean (SD)	53.86 (17.11)	49.25 (17.62)	51.51 (17.44)
Median [Q1 ; Q3]	58.0 [46.0 ; 68.0]	49.0 [33.0 ; 63.5]	51.5 [40.5 ; 68.0]
Min ; Max	20.0 ; 78.0	20.0 ; 78.0	20.0 ; 78.0
Missing	1 (2.0 %)	0 (0.0 %)	1 (1.0 %)
region
East	7 (14.0 %)	6 (11.76 %)	13 (12.87 %)
Missing	1 (2.0 %)	0 (0%)	1 (0.99 %)
North	11 (22.0 %)	11 (21.57 %)	22 (21.78 %)
South	8 (16.0 %)	15 (29.41 %)	23 (22.77 %)
West	23 (46.0 %)	19 (37.25 %)	42 (41.58 %)
procedures
Mean (SD)	5.06 (2.71)	4.45 (2.63)	4.75 (2.67)
Median [Q1 ; Q3]	5.0 [3.0 ; 7.0]	4.0 [3.0 ; 6.5]	4.0 [3.0 ; 7.0]
Min ; Max	1 ; 9	1 ; 9	1 ; 9
Missing	0 (0.0 %)	0 (0.0 %)	0 (0.0 %)
has_diabetes
False	36 (72.0 %)	39 (76.47 %)	75 (74.26 %)
True	14 (28.0 %)	12 (23.53 %)	26 (25.74 %)
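
The effect of the number of decimals can be sketched with the standard library alone. The helper mean_sd and the values below are assumptions made for illustration; whether PySummaries uses the sample standard deviation, or Python’s round, is not stated in this documentation.

```python
import statistics

ages = [51, 58, 68, 71, 51]  # made-up values for illustration

def mean_sd(values, rounding):
    """Hypothetical helper: format 'mean (sd)' with the given number of decimals."""
    mean = round(statistics.mean(values), rounding)
    sd = round(statistics.stdev(values), rounding)  # sample standard deviation
    return f"{mean} ({sd})"

one_decimal = mean_sd(ages, 1)
two_decimals = mean_sd(ages, 2)
print(one_decimal)
print(two_decimals)
```

Note that round does not pad with trailing zeros, so a value such as 59.8 looks the same with one or two decimals. This would be consistent with entries like 30.0 % staying unchanged in the table above.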

Variables and variable names

To display nicer labels for the variables, use the argument columns_labels. To choose which columns appear in the table, use columns_include or columns_exclude.

# Use nicer labels for gender and age, and keep only those two columns
labels = {'gender': 'Birth gender, n (%)',
          'age': 'Age at Index'}
cols_include = ['age', 'gender']
# Alternatively, drop columns with columns_exclude, e.g.:
# cols_exclude = ['region']
summary_table = get_table_summary(df, strata='group', columns_include=cols_include, columns_labels=labels)
summary_table

	Control (N=50)	Experimental (N=51)	Overall (N=101)
Age at Index
Mean (SD)	53.9 (17.1)	49.3 (17.6)	51.5 (17.4)
Median [Q1 ; Q3]	58.0 [46.0 ; 68.0]	49.0 [33.0 ; 63.5]	51.5 [40.5 ; 68.0]
Min ; Max	20.0 ; 78.0	20.0 ; 78.0	20.0 ; 78.0
Missing	1 (2.0 %)	0 (0.0 %)	1 (1.0 %)
Birth gender, n (%)
Female	15 (30.0 %)	22 (43.1 %)	37 (36.6 %)
Male	34 (68.0 %)	29 (56.9 %)	63 (62.4 %)
Missing	1 (2.0 %)	0 (0%)	1 (1.0 %)

Changing the Overall column name

You can change the name of the overall column with the overall_name parameter.

summary_table = get_table_summary(df, strata='group', overall_name='Total')
summary_table

	Control (N=50)	Experimental (N=51)	Total (N=101)
gender
Female	15 (30.0 %)	22 (43.1 %)	37 (36.6 %)
Male	34 (68.0 %)	29 (56.9 %)	63 (62.4 %)
Missing	1 (2.0 %)	0 (0%)	1 (1.0 %)
age
Mean (SD)	53.9 (17.1)	49.3 (17.6)	51.5 (17.4)
Median [Q1 ; Q3]	58.0 [46.0 ; 68.0]	49.0 [33.0 ; 63.5]	51.5 [40.5 ; 68.0]
Min ; Max	20.0 ; 78.0	20.0 ; 78.0	20.0 ; 78.0
Missing	1 (2.0 %)	0 (0.0 %)	1 (1.0 %)
region
East	7 (14.0 %)	6 (11.8 %)	13 (12.9 %)
Missing	1 (2.0 %)	0 (0%)	1 (1.0 %)
North	11 (22.0 %)	11 (21.6 %)	22 (21.8 %)
South	8 (16.0 %)	15 (29.4 %)	23 (22.8 %)
West	23 (46.0 %)	19 (37.3 %)	42 (41.6 %)
procedures
Mean (SD)	5.1 (2.7)	4.5 (2.6)	4.8 (2.7)
Median [Q1 ; Q3]	5.0 [3.0 ; 7.0]	4.0 [3.0 ; 6.5]	4.0 [3.0 ; 7.0]
Min ; Max	1 ; 9	1 ; 9	1 ; 9
Missing	0 (0.0 %)	0 (0.0 %)	0 (0.0 %)
has_diabetes
False	36 (72.0 %)	39 (76.5 %)	75 (74.3 %)
True	14 (28.0 %)	12 (23.5 %)	26 (25.7 %)

Hiding the Overall column or the N counts

You can drop the overall column by setting the show_overall parameter to False. In the same fashion, you can hide the N counts by setting the show_n parameter to False.

summary_table = get_table_summary(df, strata='group', show_overall=False, show_n=False)
summary_table

	Control	Experimental
gender
Female	15 (30.0 %)	22 (43.1 %)
Male	34 (68.0 %)	29 (56.9 %)
Missing	1 (2.0 %)	0 (0%)
age
Mean (SD)	53.9 (17.1)	49.3 (17.6)
Median [Q1 ; Q3]	58.0 [46.0 ; 68.0]	49.0 [33.0 ; 63.5]
Min ; Max	20.0 ; 78.0	20.0 ; 78.0
Missing	1 (2.0 %)	0 (0.0 %)
region
East	7 (14.0 %)	6 (11.8 %)
Missing	1 (2.0 %)	0 (0%)
North	11 (22.0 %)	11 (21.6 %)
South	8 (16.0 %)	15 (29.4 %)
West	23 (46.0 %)	19 (37.3 %)
procedures
Mean (SD)	5.1 (2.7)	4.5 (2.6)
Median [Q1 ; Q3]	5.0 [3.0 ; 7.0]	4.0 [3.0 ; 6.5]
Min ; Max	1 ; 9	1 ; 9
Missing	0 (0.0 %)	0 (0.0 %)
has_diabetes
False	36 (72.0 %)	39 (76.5 %)
True	14 (28.0 %)	12 (23.5 %)

Changing or hiding the categorical missing level

For categorical variables, missing values are by default reported as an additional level called “Missing”. You can change this label with the categorical_missing_level parameter. If categorical_missing_level is set to None, the missing values are not replaced and therefore not reported.

In the example below, we focus on the variable gender and replace “Missing” with “Unknown”.

summary_table = get_table_summary(df, strata='group', categorical_missing_level="Unknown", columns_include=['gender'])
summary_table

	Control (N=50)	Experimental (N=51)	Overall (N=101)
gender
Female	15 (30.0 %)	22 (43.1 %)	37 (36.6 %)
Male	34 (68.0 %)	29 (56.9 %)	63 (62.4 %)
Unknown	1 (2.0 %)	0 (0%)	1 (1.0 %)
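
The difference between replacing and not replacing missing values can be sketched with plain pandas. This only illustrates the idea with made-up data; it is not taken from the PySummaries source.

```python
import pandas as pd

gender = pd.Series(["Male", "Female", None, "Male"])  # made-up values

# By default value_counts drops missing values, so they are not reported.
without_missing = gender.value_counts()

# Replacing NaN with an explicit label turns missingness into its own level.
with_unknown = gender.fillna("Unknown").value_counts()

print(without_missing.to_dict())
print(with_unknown.to_dict())
```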

Summary presets

There are presets for how the categorical and numerical data are summarized. A preset is passed as a string to the numerical_functions or categorical_functions argument.

The currently available categorical presets are (the default is ‘n_percent’):

‘n_percent’: N (%)
‘n’: N
‘percent’: %

The currently available numerical presets are (each value is shown on a separate row; the default is ‘meansd_medianq1q3_minmax_missing’):

‘meansd_medianq1q3_minmax_missing’: Mean (SD) / Median [Q1 ; Q3] / Min ; Max / Missing
‘meansd_medianiqr_minmax_missing’: Mean (SD) / Median [IQR] / Min ; Max / Missing

Let’s change the presets to see what happens!

summary_table = get_table_summary(df, strata='group',
            categorical_functions='percent',
            numerical_functions='meansd_medianiqr_minmax_missing',
)
summary_table

	Control (N=50)	Experimental (N=51)	Overall (N=101)
gender
Female	30.0 %	43.1 %	36.6 %
Male	68.0 %	56.9 %	62.4 %
Missing	2.0 %	0%	1.0 %
age
Mean (SD)	53.9 (17.1)	49.3 (17.6)	51.5 (17.4)
Median [IQR]	58.0 [22.0]	49.0 [30.5]	51.5 [27.5]
Min ; Max	20.0 ; 78.0	20.0 ; 78.0	20.0 ; 78.0
Missing	1 (2.0 %)	0 (0.0 %)	1 (1.0 %)
region
East	14.0 %	11.8 %	12.9 %
Missing	2.0 %	0%	1.0 %
North	22.0 %	21.6 %	21.8 %
South	16.0 %	29.4 %	22.8 %
West	46.0 %	37.3 %	41.6 %
procedures
Mean (SD)	5.1 (2.7)	4.5 (2.6)	4.8 (2.7)
Median [IQR]	5.0 [4.0]	4.0 [3.5]	4.0 [4.0]
Min ; Max	1 ; 9	1 ; 9	1 ; 9
Missing	0 (0.0 %)	0 (0.0 %)	0 (0.0 %)
has_diabetes
False	72.0 %	76.5 %	74.3 %
True	28.0 %	23.5 %	25.7 %
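
The two numerical presets differ only in how the spread around the median is shown. In the tables above the IQR equals Q3 minus Q1 (for age in the Control group, 68.0 - 46.0 = 22.0). The sketch below reproduces both styles with pandas on made-up values; the quantile method PySummaries uses is an assumption here (pandas’ default linear interpolation).

```python
import pandas as pd

procedures = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])  # made-up values

q1 = procedures.quantile(0.25)
median = procedures.median()
q3 = procedures.quantile(0.75)
iqr = q3 - q1

q1q3_style = f"{median} [{q1} ; {q3}]"  # Median [Q1 ; Q3] style
iqr_style = f"{median} [{iqr}]"         # Median [IQR] style
print(q1q3_style)
print(iqr_style)
```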

Other summary functions and numerical missing values

There are built-in functions to summarize the data in different ways. To use them, import them and pass them to the parameters numerical_functions or categorical_functions, which operate on numerical and categorical variables respectively.

For numerical_functions, pass a dictionary in which each key is a string with the label for the row and each value is the function to apply. You can pass multiple key: value pairs.

For categorical_functions, pass a tuple or list with two elements: the first is the function (only one is allowed) and the second is a string or number used to replace NA values for a given category level (0 in this case). If you pass None as the second element, nan is printed.

In the example below, the categorical variables show only the N, and the numerical variables show only the median with IQR and the missing values, whose label we also change to “Unknown”. To leave out the missing row, simply do not include that function.

from pysummaries import (categorical_n, categorical_n_percent, categorical_percent, 
        numerical_mean_sd, numerical_median_iqr, numerical_median_q1q3, numerical_min_max, 
        numerical_missing)

summary_table = get_table_summary(df, strata='group', 
            categorical_functions=(categorical_n, 0),
            numerical_functions={"Median [IQR]": numerical_median_iqr,
                                 "Unknown": numerical_missing}
)
summary_table

	Control (N=50)	Experimental (N=51)	Overall (N=101)
gender
Female	15	22.0	37
Male	34	29.0	63
Missing	1	nan	1
age
Median [IQR]	58.0 [22.0]	49.0 [30.5]	51.5 [27.5]
Unknown	1 (2.0 %)	0 (0.0 %)	1 (1.0 %)
region
East	7	6.0	13
Missing	1	nan	1
North	11	11.0	22
South	8	15.0	23
West	23	19.0	42
procedures
Median [IQR]	5.0 [4.0]	4.0 [3.5]	4.0 [4.0]
Unknown	0 (0.0 %)	0 (0.0 %)	0 (0.0 %)
has_diabetes
False	36	39	75
True	14	12	26
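
The label-to-function dictionary can be pictured as “one row per entry”. The sketch below is a guess at that mechanism, not the package’s code: apply_numerical_functions, median_iqr and n_missing are hypothetical names, written with the (series, rounding) signature that this documentation describes for summary functions.

```python
import pandas as pd

def median_iqr(series, rounding):
    """Hypothetical stand-in: 'median [IQR]' as a string."""
    iqr = series.quantile(0.75) - series.quantile(0.25)
    return f"{round(series.median(), rounding)} [{round(iqr, rounding)}]"

def n_missing(series, rounding):
    """Hypothetical stand-in: number of missing values (rounding is unused)."""
    return int(series.isna().sum())

def apply_numerical_functions(series, functions, rounding=1):
    """Apply each function to the series and key the result by its row label."""
    return {label: func(series, rounding) for label, func in functions.items()}

age = pd.Series([51.0, 58.0, 68.0, 71.0, None])  # made-up values
rows = apply_numerical_functions(
    age, {"Median [IQR]": median_iqr, "Unknown": n_missing}
)
print(rows)
```

Leaving an entry out of the dictionary simply means that row is never produced, which is why omitting the missing-value function removes the row.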

Writing your own summary functions

You can create your own summary functions. To do so, write a function that takes a pandas Series and a rounding parameter. For a categorical variable it should return a Series with a numerical or string value for each level; for a numerical variable it should return a single value.

In this example we will define a function for the categorical variables and one for the numerical variables.

def mycategorical(curseries, rounding):
    """
    My own function for the categorical variables
    Let's just count the number of each category.
    """
    return curseries.value_counts()

def mynumerical(curseries, rounding):
    """
    My own function for the numerical variables
    Let's just return the mean.
    """
    return round(curseries.mean(), rounding)

summary_table = get_table_summary(df, strata='group',
                categorical_functions=(mycategorical, "0"),
                numerical_functions={"Mean": mynumerical}
        )
summary_table

	Control (N=50)	Experimental (N=51)	Overall (N=101)
gender
Female	15.0	22.0	37.0
Male	34.0	29.0	63.0
Missing	1.0	0	1.0
age
Mean	53.9	49.3	51.5
region
East	7.0	6.0	13.0
Missing	1.0	0	1.0
North	11.0	11.0	22.0
South	8.0	15.0	23.0
West	23.0	19.0	42.0
procedures
Mean	5.1	4.5	4.8
has_diabetes
False	36.0	39	75.0
True	14.0	12	26.0
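
Before plugging a custom function into a table, it can help to call it directly on a small Series and check what it returns. The two functions below are hypothetical examples that follow the (series, rounding) signature described in this section; the data is made up.

```python
import pandas as pd

def categorical_share(curseries, rounding):
    """Hypothetical custom function: percentage of each level, as a Series."""
    return (curseries.value_counts(normalize=True) * 100).round(rounding)

def numerical_range(curseries, rounding):
    """Hypothetical custom function: max minus min, as a single value."""
    return round(curseries.max() - curseries.min(), rounding)

# Try the functions on small made-up Series before using them in a table.
regions = pd.Series(["North", "North", "South", "West"])
ages = pd.Series([25.0, 51.0, 45.5])

shares = categorical_share(regions, 1)
age_range = numerical_range(ages, 1)
print(shares.to_dict())
print(age_range)
```

Following the pattern shown above, such functions would then presumably be passed as categorical_functions=(categorical_share, 0) and numerical_functions={"Range": numerical_range}.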