Welcome to PySummaries documentation!#

PySummaries is a Python package for producing summary tables from pandas DataFrames.

Installation#

You can install the package with pip directly from this repo:

pip install git+https://github.roche.com/fajardoo/pysummaries@master

QuickStart#

Let’s say we have a dataframe with some data we want to summarize. Let’s take a look at the data:

import pandas as pd
from IPython.display import display, Markdown

from pysummaries import get_table_summary, get_sample_data

df = get_sample_data()
display(df)
gender age region group
0 Male 51.0 East Control
1 Female 58.0 North Experimental
2 Female 68.0 South Experimental
3 Male 71.0 North Control
4 Male 51.0 West Control
... ... ... ... ...
96 Male 45.0 North Experimental
97 Male 51.0 North Experimental
98 Male 25.0 South Experimental
99 Male 51.0 South Experimental
100 NaN NaN NaN Control

101 rows × 4 columns
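
To run the same examples on your own data, a DataFrame with the same layout should be enough. The sketch below rebuilds a handful of the rows displayed above by hand. The name my_df is hypothetical, and the idea that get_table_summary needs nothing beyond this layout is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for get_sample_data(): rows 0 to 4 and row 100
# of the output above, typed in by hand.
my_df = pd.DataFrame(
    {
        "gender": ["Male", "Female", "Female", "Male", "Male", np.nan],
        "age": [51.0, 58.0, 68.0, 71.0, 51.0, np.nan],
        "region": ["East", "North", "South", "North", "West", np.nan],
        "group": ["Control", "Experimental", "Experimental",
                  "Control", "Control", "Control"],
    }
)

# The last row is missing everything except the stratification column.
print(my_df.isna().sum())
```

A frame like this has one numerical column, two categorical columns, and one column to stratify by, which are the column types the examples on this page work with.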

Now, let’s build a Table One stratified by group.

Two backends are available for the HTML representation: a PySummaries native one, and one built on the popular great_tables package. The backend parameter selects between them: ‘native’ (the default when backend is not given) or ‘gt’.

Let’s start first with the PySummaries native backend:

summary_table = get_table_summary(df, strata='group', backend='native')  
display(summary_table)
Control
(N=50)
Experimental
(N=51)
Overall
(N=101)
gender
Female 15 (30.0 %) 22 (43.1 %) 37 (36.6 %)
Male 34 (68.0 %) 29 (56.9 %) 63 (62.4 %)
Missing 1 (2.0 %) 0 (0%) 1 (1.0 %)
age
Mean (SD) 53.9 (17.1) 49.3 (17.6) 51.5 (17.4)
Median [Q1 ; Q3] 58.0 [46.0 ; 68.0] 49.0 [33.0 ; 63.5] 51.5 [40.5 ; 68.0]
Min ; Max 20.0 ; 78.0 20.0 ; 78.0 20.0 ; 78.0
Missing 1 (2.0 %) 0 (0.0 %) 1 (1.0 %)
region
East 7 (14.0 %) 6 (11.8 %) 13 (12.9 %)
Missing 1 (2.0 %) 0 (0%) 1 (1.0 %)
North 11 (22.0 %) 11 (21.6 %) 22 (21.8 %)
South 8 (16.0 %) 15 (29.4 %) 23 (22.8 %)
West 23 (46.0 %) 19 (37.3 %) 42 (41.6 %)

And now, let’s try the great tables backend!

summary_table = get_table_summary(df, strata='group', backend='gt')  
display(summary_table)
Control
(N=50)
Experimental
(N=51)
Overall
(N=101)
gender
Female 15 (30.0 %) 22 (43.1 %) 37 (36.6 %)
Male 34 (68.0 %) 29 (56.9 %) 63 (62.4 %)
Missing 1 (2.0 %) 0 (0%) 1 (1.0 %)
age
Mean (SD) 53.9 (17.1) 49.3 (17.6) 51.5 (17.4)
Median [Q1 ; Q3] 58.0 [46.0 ; 68.0] 49.0 [33.0 ; 63.5] 51.5 [40.5 ; 68.0]
Min ; Max 20.0 ; 78.0 20.0 ; 78.0 20.0 ; 78.0
Missing 1 (2.0 %) 0 (0.0 %) 1 (1.0 %)
region
East 7 (14.0 %) 6 (11.8 %) 13 (12.9 %)
Missing 1 (2.0 %) 0 (0%) 1 (1.0 %)
North 11 (22.0 %) 11 (21.6 %) 22 (21.8 %)
South 8 (16.0 %) 15 (29.4 %) 23 (22.8 %)
West 23 (46.0 %) 19 (37.3 %) 42 (41.6 %)

In both cases you can enhance the table with more features. For example, let’s add a title and footer to the table.

Let’s start with the native backend. To discover what else you can do with it, check the documentation chapter about the Native Backend. That chapter is written around the function pandas_to_report_html, but you can pass any of its arguments to get_table_summary.

summary_table = get_table_summary(df, strata='group', backend='native', caption="<strong>Table 1</strong>", footer="This is the footer")  
display(summary_table)
Table 1
Control
(N=50)
Experimental
(N=51)
Overall
(N=101)
gender
Female 15 (30.0 %) 22 (43.1 %) 37 (36.6 %)
Male 34 (68.0 %) 29 (56.9 %) 63 (62.4 %)
Missing 1 (2.0 %) 0 (0%) 1 (1.0 %)
age
Mean (SD) 53.9 (17.1) 49.3 (17.6) 51.5 (17.4)
Median [Q1 ; Q3] 58.0 [46.0 ; 68.0] 49.0 [33.0 ; 63.5] 51.5 [40.5 ; 68.0]
Min ; Max 20.0 ; 78.0 20.0 ; 78.0 20.0 ; 78.0
Missing 1 (2.0 %) 0 (0.0 %) 1 (1.0 %)
region
East 7 (14.0 %) 6 (11.8 %) 13 (12.9 %)
Missing 1 (2.0 %) 0 (0%) 1 (1.0 %)
North 11 (22.0 %) 11 (21.6 %) 22 (21.8 %)
South 8 (16.0 %) 15 (29.4 %) 23 (22.8 %)
West 23 (46.0 %) 19 (37.3 %) 42 (41.6 %)
This is the footer

And now, with the great_tables backend. To discover what else you can do with great_tables, please visit the great_tables documentation.

# Get the GT object returned by the function, then add features to it
summary_table = (get_table_summary(df, strata='group', backend='gt')  
                        .tab_header(title="Table 1")
                        .tab_source_note(source_note = "This is the footer")
                        .cols_align('center')
)
display(summary_table)
Table 1
Control
(N=50)
Experimental
(N=51)
Overall
(N=101)
gender
Female 15 (30.0 %) 22 (43.1 %) 37 (36.6 %)
Male 34 (68.0 %) 29 (56.9 %) 63 (62.4 %)
Missing 1 (2.0 %) 0 (0%) 1 (1.0 %)
age
Mean (SD) 53.9 (17.1) 49.3 (17.6) 51.5 (17.4)
Median [Q1 ; Q3] 58.0 [46.0 ; 68.0] 49.0 [33.0 ; 63.5] 51.5 [40.5 ; 68.0]
Min ; Max 20.0 ; 78.0 20.0 ; 78.0 20.0 ; 78.0
Missing 1 (2.0 %) 0 (0.0 %) 1 (1.0 %)
region
East 7 (14.0 %) 6 (11.8 %) 13 (12.9 %)
Missing 1 (2.0 %) 0 (0%) 1 (1.0 %)
North 11 (22.0 %) 11 (21.6 %) 22 (21.8 %)
South 8 (16.0 %) 15 (29.4 %) 23 (22.8 %)
West 23 (46.0 %) 19 (37.3 %) 42 (41.6 %)
This is the footer

More features#

Controlling the order of categorical variables#

In the previous examples the levels of the variable region are ordered alphabetically. To use a specific order instead, we can convert the variable to a categorical and set the order of its categories.

This applies to both native and great_tables backends.

df2 = df.copy() 
df2['region'] = pd.Categorical(df2['region'], categories=['North', 'South', 'East', 'West'], ordered=True)
summary_table = get_table_summary(df2, strata='group')  
summary_table
Control
(N=50)
Experimental
(N=51)
Overall
(N=101)
gender
Female 15 (30.0 %) 22 (43.1 %) 37 (36.6 %)
Male 34 (68.0 %) 29 (56.9 %) 63 (62.4 %)
Missing 1 (2.0 %) 0 (0%) 1 (1.0 %)
age
Mean (SD) 53.9 (17.1) 49.3 (17.6) 51.5 (17.4)
Median [Q1 ; Q3] 58.0 [46.0 ; 68.0] 49.0 [33.0 ; 63.5] 51.5 [40.5 ; 68.0]
Min ; Max 20.0 ; 78.0 20.0 ; 78.0 20.0 ; 78.0
Missing 1 (2.0 %) 0 (0.0 %) 1 (1.0 %)
region
North 11 (22.0 %) 11 (21.6 %) 22 (21.8 %)
South 8 (16.0 %) 15 (29.4 %) 23 (22.8 %)
East 7 (14.0 %) 6 (11.8 %) 13 (12.9 %)
West 23 (46.0 %) 19 (37.3 %) 42 (41.6 %)
Missing 1 (2.0 %) 0 (0.0 %) 1 (1.0 %)
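
The ordering mechanism can be seen in pandas alone, without PySummaries. In this sketch the small regions series is made up for illustration. It only shows that an ordered categorical carries its own level order, and makes no claim about how PySummaries reads that order internally.

```python
import pandas as pd

regions = pd.Series(["West", "North", "East", "South", "North", "West"])

# Plain strings: sorting the levels gives alphabetical order.
alphabetical = regions.value_counts().sort_index()

# Ordered categorical: unsorted counts come back in the declared order.
ordered = pd.Series(
    pd.Categorical(regions, categories=["North", "South", "East", "West"], ordered=True)
)
declared = ordered.value_counts(sort=False)

print(list(alphabetical.index))  # ['East', 'North', 'South', 'West']
print(list(declared.index))      # ['North', 'South', 'East', 'West']
```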

Rounding#

We can round the numbers to a specific number of decimals with the rounding parameter. In this example, let’s increase the number of decimals to 2.

summary_table = get_table_summary(df, strata='group', rounding=2)
summary_table
Control
(N=50)
Experimental
(N=51)
Overall
(N=101)
gender
Female 15 (30.0 %) 22 (43.14 %) 37 (36.63 %)
Male 34 (68.0 %) 29 (56.86 %) 63 (62.38 %)
Missing 1 (2.0 %) 0 (0%) 1 (0.99 %)
age
Mean (SD) 53.9 (17.11) 49.3 (17.62) 51.5 (17.44)
Median [Q1 ; Q3] 58.0 [46.0 ; 68.0] 49.0 [33.0 ; 63.5] 51.5 [40.5 ; 68.0]
Min ; Max 20.0 ; 78.0 20.0 ; 78.0 20.0 ; 78.0
Missing 1 (2.0 %) 0 (0.0 %) 1 (1.0 %)
region
East 7 (14.0 %) 6 (11.76 %) 13 (12.87 %)
Missing 1 (2.0 %) 0 (0%) 1 (0.99 %)
North 11 (22.0 %) 11 (21.57 %) 22 (21.78 %)
South 8 (16.0 %) 15 (29.41 %) 23 (22.77 %)
West 23 (46.0 %) 19 (37.25 %) 42 (41.58 %)
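
The percentages above can be re-derived by hand. This is a minimal sketch, assuming each percentage is the count divided by the N of its column and rounded with Python's built-in round. That is an assumption about the package's internals, although it reproduces the figures shown. The helper name percent is hypothetical.

```python
def percent(count, total, rounding):
    """Share of `count` in `total`, as a percentage with `rounding` decimals."""
    return round(100 * count / total, rounding)

# Female in Experimental: 22 of 51
print(percent(22, 51, 1))   # 43.1
print(percent(22, 51, 2))   # 43.14

# Missing in Overall: 1 of 101
print(percent(1, 101, 1))   # 1.0
print(percent(1, 101, 2))   # 0.99

# Female in Control: 15 of 50, an exact value that gains no extra digits
print(percent(15, 50, 2))   # 30.0
```

Note that the denominator is the full column N, so rows with missing values are counted in it: 15 + 34 + 1 = 50 for the Control column.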

Variables and Variable names#

To display nicer labels for our variables, we can use the argument columns_labels. To choose which columns appear in the table, we can use columns_include or columns_exclude.

# Let's bring nice labels for gender and age and exclude region
labels = {'gender': 'Birth gender, n (%)', 
        "age": 'Age at Index'}
# we could also specify columns_include instead
# columns_include = ['age', 'gender']
cols_exclude = ['region']
summary_table = get_table_summary(df, strata='group', columns_exclude=cols_exclude, columns_labels=labels)  
summary_table
Control
(N=50)
Experimental
(N=51)
Overall
(N=101)
Birth gender, n (%)
Female 15 (30.0 %) 22 (43.1 %) 37 (36.6 %)
Male 34 (68.0 %) 29 (56.9 %) 63 (62.4 %)
Missing 1 (2.0 %) 0 (0%) 1 (1.0 %)
Age at Index
Mean (SD) 53.9 (17.1) 49.3 (17.6) 51.5 (17.4)
Median [Q1 ; Q3] 58.0 [46.0 ; 68.0] 49.0 [33.0 ; 63.5] 51.5 [40.5 ; 68.0]
Min ; Max 20.0 ; 78.0 20.0 ; 78.0 20.0 ; 78.0
Missing 1 (2.0 %) 0 (0.0 %) 1 (1.0 %)

Changing the Overall column name#

You can change the name of the overall column with the overall_name parameter.

summary_table = get_table_summary(df, strata='group', overall_name='Total')
summary_table
Control
(N=50)
Experimental
(N=51)
Total
(N=101)
gender
Female 15 (30.0 %) 22 (43.1 %) 37 (36.6 %)
Male 34 (68.0 %) 29 (56.9 %) 63 (62.4 %)
Missing 1 (2.0 %) 0 (0%) 1 (1.0 %)
age
Mean (SD) 53.9 (17.1) 49.3 (17.6) 51.5 (17.4)
Median [Q1 ; Q3] 58.0 [46.0 ; 68.0] 49.0 [33.0 ; 63.5] 51.5 [40.5 ; 68.0]
Min ; Max 20.0 ; 78.0 20.0 ; 78.0 20.0 ; 78.0
Missing 1 (2.0 %) 0 (0.0 %) 1 (1.0 %)
region
East 7 (14.0 %) 6 (11.8 %) 13 (12.9 %)
Missing 1 (2.0 %) 0 (0%) 1 (1.0 %)
North 11 (22.0 %) 11 (21.6 %) 22 (21.8 %)
South 8 (16.0 %) 15 (29.4 %) 23 (22.8 %)
West 23 (46.0 %) 19 (37.3 %) 42 (41.6 %)

Hiding the Overall column or the N counts#

You can drop the overall column by setting the show_overall parameter to False. In the same way, you can hide the N counts by setting the show_n parameter to False.

summary_table = get_table_summary(df, strata='group', show_overall=False, show_n=False)
summary_table
Control Experimental
gender
Female 15 (30.0 %) 22 (43.1 %)
Male 34 (68.0 %) 29 (56.9 %)
Missing 1 (2.0 %) 0 (0%)
age
Mean (SD) 53.9 (17.1) 49.3 (17.6)
Median [Q1 ; Q3] 58.0 [46.0 ; 68.0] 49.0 [33.0 ; 63.5]
Min ; Max 20.0 ; 78.0 20.0 ; 78.0
Missing 1 (2.0 %) 0 (0.0 %)
region
East 7 (14.0 %) 6 (11.8 %)
Missing 1 (2.0 %) 0 (0%)
North 11 (22.0 %) 11 (21.6 %)
South 8 (16.0 %) 15 (29.4 %)
West 23 (46.0 %) 19 (37.3 %)

Changing or hiding the categorical missing level#

For categorical variables, missing values are by default reported as an extra level labelled “Missing”. You can change this label with the categorical_missing_level parameter. If categorical_missing_level is set to None, the missing values are not replaced and are therefore not reported.

In the example below, we focus on the variable gender and replace “Missing” with “Unknown”.

summary_table = get_table_summary(df, strata='group', categorical_missing_level="Unknown", columns_include=['gender'])
summary_table
Control
(N=50)
Experimental
(N=51)
Overall
(N=101)
gender
Female 15 (30.0 %) 22 (43.1 %) 37 (36.6 %)
Male 34 (68.0 %) 29 (56.9 %) 63 (62.4 %)
Unknown 1 (2.0 %) 0 (0%) 1 (1.0 %)
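
The difference between a named missing level and None can be pictured with plain pandas. The sketch below uses a made-up gender series and is an illustration of the idea only; it is assumed, not confirmed, that PySummaries handles missing values in a comparable way.

```python
import numpy as np
import pandas as pd

gender = pd.Series(["Male", "Female", np.nan, "Male"])

# Replacing missing values with a label turns them into a countable level.
with_level = gender.fillna("Unknown").value_counts()

# Left as NaN, they are dropped by value_counts and never show up.
without_level = gender.value_counts()

print(with_level.to_dict())
print(without_level.to_dict())
```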

Summary presets#

There are some presets on how to summarize the categorical and numerical data. Those can be passed as a string to the numerical_functions or categorical_functions arguments.

The currently available categorical presets are (the default is ‘n_percent’):

  • ‘n_percent’: N (%)

  • ‘n’: N

  • ‘percent’: %

The currently available numerical presets are listed below; each value is reported on a separate row (the default is ‘meansd_medianq1q3_minmax_missing’):

  • ‘meansd_medianq1q3_minmax_missing’: Mean (SD) / Median [Q1 ; Q3] / Min ; Max / Missing

  • ‘meansd_medianiqr_minmax_missing’: Mean (SD) / Median [IQR] / Min ; Max / Missing

Let’s change the presets to see what happens!

summary_table = get_table_summary(df, strata='group', 
            categorical_functions= 'percent',
            numerical_functions='meansd_medianiqr_minmax_missing',
)
summary_table
Control
(N=50)
Experimental
(N=51)
Overall
(N=101)
gender
Female 30.0 % 43.1 % 36.6 %
Male 68.0 % 56.9 % 62.4 %
Missing 2.0 % 0% 1.0 %
age
Mean (SD) 53.9 (17.1) 49.3 (17.6) 51.5 (17.4)
Median [IQR] 58.0 [22.0] 49.0 [30.5] 51.5 [27.5]
Min ; Max 20.0 ; 78.0 20.0 ; 78.0 20.0 ; 78.0
Missing 1 (2.0 %) 0 (0.0 %) 1 (1.0 %)
region
East 14.0 % 11.8 % 12.9 %
Missing 2.0 % 0% 1.0 %
North 22.0 % 21.6 % 21.8 %
South 16.0 % 29.4 % 22.8 %
West 46.0 % 37.3 % 41.6 %
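
The two numerical presets differ only in how the spread around the median is reported. As a quick check, the IQR values above are the differences between the Q3 and Q1 values shown in the earlier tables:

```python
# (Q1, Q3) per column, copied from the 'Median [Q1 ; Q3]' row shown earlier.
quartiles = {
    "Control": (46.0, 68.0),
    "Experimental": (33.0, 63.5),
    "Overall": (40.5, 68.0),
}

# The interquartile range is the width of the middle half of the data.
iqr = {name: q3 - q1 for name, (q1, q3) in quartiles.items()}
print(iqr)  # {'Control': 22.0, 'Experimental': 30.5, 'Overall': 27.5}
```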

Other summary functions and numerical missing values#

There are some built-in functions to summarize the data in different ways. In order to use them, you have to import them and then pass them to the parameters numerical_functions or categorical_functions, which will operate on numerical or categorical variables respectively.

For numerical_functions, pass a dictionary in which each key is a string with the label you would like for the row and each value is the function to apply. You can pass multiple key:value pairs.

For categorical_functions, pass a tuple or list with two elements: the first is the function (only one is allowed) and the second is a string or number used in place of NA values for a given category level (0 in this case). If you pass None as the second element, nan is printed.

In the example below, categorical variables show only the N, and numerical variables show only the median with IQR and the missing values, with the missing row relabelled “Unknown”. To leave out the missing values, simply omit that function.

from pysummaries import (categorical_n, categorical_n_percent, categorical_percent, 
        numerical_mean_sd, numerical_median_iqr, numerical_median_q1q3, numerical_min_max, 
        numerical_missing)

summary_table = get_table_summary(df, strata='group', 
            categorical_functions=(categorical_n, 0),
            numerical_functions={"Median [IQR]": numerical_median_iqr,
                                 "Unknown": numerical_missing}
)
summary_table
Control
(N=50)
Experimental
(N=51)
Overall
(N=101)
gender
Female 15 22.0 37
Male 34 29.0 63
Missing 1 nan 1
age
Median [IQR] 58.0 [22.0] 49.0 [30.5] 51.5 [27.5]
Unknown 1 (2.0 %) 0 (0.0 %) 1 (1.0 %)
region
East 7 6.0 13
Missing 1 nan 1
North 11 11.0 22
South 8 15.0 23
West 23 19.0 42
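
One way to picture the numerical_functions dictionary is that each key becomes a row label and each function is called on the column. The sketch below mimics that with plain pandas. The helpers median_only and min_max and the loop itself are illustrative assumptions, not the package's actual implementation. They take a series and a rounding argument, which is the signature this documentation describes for custom summary functions.

```python
import pandas as pd

def median_only(series, rounding):
    """Hypothetical helper: the median, rounded."""
    return round(float(series.median()), rounding)

def min_max(series, rounding):
    """Hypothetical helper: 'min ; max' as a single string."""
    low = round(float(series.min()), rounding)
    high = round(float(series.max()), rounding)
    return f"{low} ; {high}"

# Keys are the row labels, values are the functions to apply.
numerical_functions = {"Median": median_only, "Min ; Max": min_max}

age = pd.Series([51.0, 58.0, 68.0, 71.0, 51.0])
rows = {label: func(age, 1) for label, func in numerical_functions.items()}
print(rows)  # {'Median': 58.0, 'Min ; Max': '51.0 ; 71.0'}
```

Dropping a key from the dictionary drops the corresponding row, which matches the remark above about omitting the missing-values function.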

Writing your own summary functions#

You can create your own summary functions. To do so, write a function that takes a pandas Series and a rounding parameter. For a categorical variable it should return a Series with a numerical or string value for each level; for a numerical variable it should return a single value.

In this example we will define a function for the categorical variables and one for the numerical variables.

def mycategorical(curseries, rounding):
    """
    My own function for the categorical variables
    Let's just count the number of each category.
    """
    return curseries.value_counts()

def mynumerical(curseries, rounding):
    """
    My own function for the numerical variables
    Let's just return the mean.
    """
    return round(curseries.mean(), rounding)

summary_table = get_table_summary(df, strata='group',
                categorical_functions=(mycategorical, "0"),
                numerical_functions={"Mean": mynumerical}
        )
summary_table
Control
(N=50)
Experimental
(N=51)
Overall
(N=101)
gender
Female 15.0 22.0 37.0
Male 34.0 29.0 63.0
Missing 1.0 0 1.0
age
Mean 53.9 49.3 51.5
region
East 7.0 6.0 13.0
Missing 1.0 0 1.0
North 11.0 11.0 22.0
South 8.0 15.0 23.0
West 23.0 19.0 42.0
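
Custom functions can be checked on their own before they are handed to get_table_summary. The sketch below defines two further hypothetical functions, categorical_share and numerical_range, following the contract described in this section, and calls them directly on small made-up series. How PySummaries prepares the series before calling them (for example, whether missing values are still present) is not covered here and should be treated as unknown.

```python
import pandas as pd

def categorical_share(curseries, rounding):
    """Percentage of each level, formatted as a string such as '75.0 %'."""
    shares = curseries.value_counts(normalize=True) * 100
    return shares.round(rounding).astype(str) + " %"

def numerical_range(curseries, rounding):
    """Distance between the largest and the smallest value."""
    return round(float(curseries.max() - curseries.min()), rounding)

gender = pd.Series(["Male", "Female", "Male", "Male"])
age = pd.Series([51.0, 58.0, 68.0, 71.0])

print(categorical_share(gender, 1).to_dict())  # {'Male': '75.0 %', 'Female': '25.0 %'}
print(numerical_range(age, 1))                 # 20.0
```

Following the pattern of the example above, these would presumably be passed as categorical_functions=(categorical_share, "0 %") and numerical_functions={"Range": numerical_range}.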