API documentation

class pysummaries.Pandas2HTMLSummaryTable(html)

A class to encapsulate the html representation of the summary table

get_raw_html()

Get the raw HTML string representing the table

Returns:

the html as a string

Return type:

str

pysummaries.calculate_table_summary(df, strata=None, show_overall=True, columns_labels=None, overall_name='Overall', columns_include=None, columns_exclude=None, categorical_functions=None, numerical_functions=None, rounding=1, categorical_missing_level='Missing')

Calculates a table summary from a pandas dataframe.

Parameters:
  • df (pandas dataframe, mandatory) – pandas dataframe from which to calculate the table one

  • strata (str, optional) – the name of a column in the dataframe to stratify the table one (columns)

  • show_overall (bool, optional) – Show the Overall column. By default True. If False it will take effect only if strata is defined, otherwise ignored

  • columns_labels (dict, optional) – A dictionary defining labels for the columns. Keys should be the column name and values a string with the label. Non existing columns will be ignored.

  • overall_name (string, optional) – name for the column Overall

  • columns_include (list, optional) – columns to include in the report, also determines the order of columns in the report

  • columns_exclude (list, optional) – columns to exclude from the report

  • categorical_functions (str, list or tuple, optional) – if a string, one of the presets for categorical summarization will be applied, by default n_percent. If a tuple or list, the first element must be a function to apply to categorical columns, and the second element must be a string or number with the value to report if there are no elements in a categorical level (for example 0 (0%) for n_percent). If None, those will be reported as nan.

  • numerical_functions (str or dict, optional) – if a string, one of the presets for numerical summarization will be applied; the default is ‘meansd_medianq1q3_minmax_missing’. If a dictionary, the keys should be labels (as they should appear in the rows index) and the values functions to apply to the numerical columns of the dataframe. Multiple pairs of labels and functions are supported.

  • rounding (int, optional) – number of decimal places to show, by default 1.

  • categorical_missing_level (str, optional) – if a categorical column has NAs, they will be replaced by the string indicated here, by default ‘Missing’. That will create a new level in the category. If set to None, the NAs will not be replaced.
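
The two custom-function parameters above can be sketched as follows. This is a minimal illustration, assuming that custom functions use the same (curseries, rounding) signature as the built-in helpers such as numerical_mean_sd and categorical_n, which this page does not state explicitly. The names numerical_range and categorical_count are hypothetical.

```python
import pandas as pd


def numerical_range(curseries, rounding):
    """Hypothetical numerical summary: range (max - min) as a string."""
    values = curseries.dropna()
    if values.empty:
        return "NA"
    return f"{values.max() - values.min():.{rounding}f}"


def categorical_count(curseries, rounding):
    """Hypothetical categorical summary: count per level, as strings.

    `rounding` is accepted for signature compatibility but unused for counts.
    """
    return curseries.value_counts().astype(str)


# Dictionary form: row label -> function applied to each numerical column.
numerical_functions = {"Range": numerical_range}

# Tuple form: (function, value to report for levels with no elements).
categorical_functions = (categorical_count, "0")

# With pysummaries installed, these would be passed as (assumed usage):
#   calculate_table_summary(df,
#                           numerical_functions=numerical_functions,
#                           categorical_functions=categorical_functions)
```

Keeping the functions free of any pysummaries import means they can be checked on a plain pandas series before being passed to the summary call.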

Returns:

the table summary as a pandas dataframe

Return type:

pandas dataframe

Returns:

the number of observations for each column in the table summary as a dictionary where keys are column names (strata levels) and the values are the counts as integers.

Return type:

dictionary

Example:

>>> from pysummaries import calculate_table_summary
>>> tone = calculate_table_summary(df, strata="age_groups")

pysummaries.categorical_n(curseries, rounding)

Calculates the N for each category in the series.

Parameters:
  • curseries (pandas series) – series to be summarized

  • rounding (int) – number of decimal places used to round the results

Returns:

a series with a numerical or string value per categorical level

Return type:

pandas series

pysummaries.categorical_n_percent(curseries, rounding)

Calculates “N (%)” for each category in the series

Parameters:
  • curseries (pandas series) – series to be summarized

  • rounding (int) – number of decimal places used to round the results

Returns:

a series with a numerical or string value per categorical level

Return type:

pandas series
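
To make the “N (%)” format concrete, here is a minimal sketch of an equivalent computation in plain pandas. It is not the library's implementation, and the exact string formatting produced by categorical_n_percent may differ. The name n_percent_sketch is hypothetical.

```python
import pandas as pd


def n_percent_sketch(curseries, rounding):
    """Illustrative 'N (%)' summary per level; not the library's own code."""
    counts = curseries.value_counts()  # missing values are dropped here
    total = counts.sum()
    return pd.Series(
        {
            level: f"{int(n)} ({n / total * 100:.{rounding}f}%)"
            for level, n in counts.items()
        }
    )


smoking = pd.Series(["never", "never", "current", "never"])
print(n_percent_sketch(smoking, rounding=1))
```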

pysummaries.categorical_percent(curseries, rounding)

Calculates the percentage for each category in the series

Parameters:
  • curseries (pandas series) – series to be summarized

  • rounding (int) – number of decimal places used to round the results

Returns:

a series with a numerical or string value per categorical level

Return type:

pandas series

pysummaries.get_sample_data()

Gets some data for testing and demo purposes

Returns:

sample data

Return type:

pandas dataframe

pysummaries.get_styles(style)

Get the dictionary representing the available internal default styles

Parameters:

style (str) – which style to get. ‘empty’ contains just the keys with empty strings as values, which is useful for overriding the internal styles.

Returns:

dictionary with internal class name as key and a string with css styles as values.

Return type:

dict

Example:

>>> from pysummaries import get_styles
>>> style = get_styles('default')

pysummaries.get_table_summary(df, strata=None, backend='native', show_n=True, show_overall=True, columns_labels=None, overall_name='Overall', columns_include=None, columns_exclude=None, rounding=1, categorical_functions=None, numerical_functions=None, categorical_missing_level='Missing', **kwargs)

Calculates a summary table for the pandas dataframe df and returns an object for nice display.

Parameters:
  • df (pandas dataframe, mandatory) – pandas dataframe from which to calculate the table one

  • strata (str, optional) – the name of a column in the dataframe to stratify the table one (columns)

  • backend (str, optional) – the backend used to display the summary, either ‘native’ or ‘gt’ (great_tables)

  • show_n (bool, optional) – Show the number of observations on the column header

  • show_overall (bool, optional) – Show the Overall column. By default True. If False it will take effect only if strata is defined, otherwise ignored

  • columns_labels (dict, optional) – A dictionary defining labels for the columns. Keys should be the column name and values a string with the label. Non existing columns will be ignored.

  • overall_name (string, optional) – name for the column Overall

  • columns_include (list, optional) – columns to include in the report, also determines the order of columns in the report

  • columns_exclude (list, optional) – columns to exclude from the report

  • rounding (int, optional) – number of decimal places to show, by default 1.

  • categorical_functions (str, list or tuple, optional) – if a string, one of the presets for categorical summarization will be applied, by default n_percent. If a tuple or list, the first element must be a function to apply to categorical columns, and the second element must be a string or number with the value to report if there are no elements in a categorical level (for example ‘(0%)’ for n_percent). If None, those will be reported as nan.

  • numerical_functions (str or dict, optional) – if a string, one of the presets for numerical summarization will be applied; the default is ‘meansd_medianq1q3_minmax_missing’. If a dictionary, the keys should be labels (as they should appear in the rows index) and the values functions to apply to the numerical columns of the dataframe. Multiple pairs of labels and functions are supported.

  • categorical_missing_level (str, optional) – if a categorical column has NAs, they will be replaced by the string indicated here, by default ‘Missing’. That will create a new level in the category. If set to None, the NAs will not be replaced.

  • kwargs – keyword arguments to pass to the pandas_to_report_html function or the great_tables.GT constructor. See the documentation for those for further details.
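
As a sketch of how kwargs might be used with the native backend, the options below are taken from the documented pandas_to_report_html signature (caption, footer). The caption and footer texts are made up for illustration, and the import is guarded so the snippet still runs where pysummaries is not installed.

```python
# Options documented for pandas_to_report_html; with backend="native" they
# are forwarded through **kwargs. The texts themselves are illustrative.
report_options = {
    "caption": "Summary of the sample data",
    "footer": ["Generated with pysummaries.", "Sample data only."],
}

try:
    from pysummaries import get_sample_data, get_table_summary
except ImportError:  # pysummaries not installed; the call below is skipped
    get_table_summary = None

if get_table_summary is not None:
    table = get_table_summary(get_sample_data(), backend="native", **report_options)
    html = table.get_raw_html()  # raw HTML string, e.g. for saving to a file
```

With backend='gt' the same mechanism would forward the options to the great_tables.GT constructor instead, so the accepted keys differ between backends.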

Returns:

An object with the html representation of the table

Return type:

Pandas2HTMLSummaryTable if backend is native or great_tables.GT if backend is gt

Example:

>>> from pysummaries import get_table_summary
>>> tone = get_table_summary(df, strata="age_groups")

pysummaries.get_test_data()

Enriches the sample data with further columns for testing

Returns:

sample data

Return type:

pandas dataframe

pysummaries.numerical_mean_sd(curseries, rounding)

Calculates “Mean (SD)” for the numerical series

Parameters:
  • curseries (pandas series) – series to be summarized

  • rounding (int) – number of decimal places used to round the results

Returns:

a single value with the summary for the series

Return type:

int, float or string

pysummaries.numerical_median_iqr(curseries, rounding)

Calculates “Median [IQR]” for the numerical series

Parameters:
  • curseries (pandas series) – series to be summarized

  • rounding (int) – number of decimal places used to round the results

Returns:

a single value with the summary for the series

Return type:

int, float or string

pysummaries.numerical_median_q1q3(curseries, rounding)

Calculates “Median [Q1 ; Q3]” for the numerical series

Parameters:
  • curseries (pandas series) – series to be summarized

  • rounding (int) – number of decimal places used to round the results

Returns:

a single value with the summary for the series

Return type:

int, float or string
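
The difference between the “Median [IQR]” and “Median [Q1 ; Q3]” summaries can be sketched in plain pandas as follows. This is an illustration only, not the library's code. It assumes that IQR is reported as the single value Q3 - Q1, which this page does not confirm, and the function names are hypothetical.

```python
import pandas as pd


def median_q1q3_sketch(curseries, rounding):
    """Illustrative 'Median [Q1 ; Q3]' string; not the library's own code."""
    q1, median, q3 = curseries.quantile([0.25, 0.5, 0.75])
    return f"{median:.{rounding}f} [{q1:.{rounding}f} ; {q3:.{rounding}f}]"


def median_iqr_sketch(curseries, rounding):
    """Illustrative 'Median [IQR]' string, assuming IQR = Q3 - Q1."""
    q1, median, q3 = curseries.quantile([0.25, 0.5, 0.75])
    return f"{median:.{rounding}f} [{q3 - q1:.{rounding}f}]"


ages = pd.Series([1, 2, 3, 4, 5])
print(median_q1q3_sketch(ages, rounding=1))
print(median_iqr_sketch(ages, rounding=1))
```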

pysummaries.numerical_min_max(curseries, rounding)

Calculates “Min ; Max” for the numerical series

Parameters:
  • curseries (pandas series) – series to be summarized

  • rounding (int) – number of decimal places used to round the results

Returns:

a single value with the summary for the series

Return type:

int, float or string

pysummaries.numerical_missing(curseries, rounding)

Calculates “N (%)” of missing values (na) for the numerical series

Parameters:
  • curseries (pandas series) – series to be summarized

  • rounding (int) – number of decimal places used to round the results

Returns:

a single value with the summary for the series

Return type:

int, float or string
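
A minimal sketch of an “N (%)” missing-value summary in plain pandas is shown below. It illustrates the idea only; the formatting returned by numerical_missing may differ, and missing_sketch is a hypothetical name.

```python
import pandas as pd


def missing_sketch(curseries, rounding):
    """Illustrative 'N (%)' of missing values; not the library's own code."""
    n_missing = int(curseries.isna().sum())
    n_total = len(curseries)
    # Guard against an empty series to avoid dividing by zero.
    percent = n_missing / n_total * 100 if n_total else 0.0
    return f"{n_missing} ({percent:.{rounding}f}%)"


weights = pd.Series([71.5, None, 64.0, None])
print(missing_sketch(weights, rounding=1))
```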

pysummaries.pandas_to_report_html(df, strat_numbers=None, caption=None, footer=None, customstyles=None, customcss=None, value_styles=None, styles='default', table_id=None, show_index=True)

Generate a nice html table from a pandas dataframe.

Parameters:
  • df (pandas dataframe, mandatory) – Pandas dataframe. Can have simple or multi indexes and columns.

  • strat_numbers (dictionary, optional) – a dictionary where the key is a column name and value is a number to be displayed as N=xx below each column on the table header. If columns are multiindex, then the keys should be a tuple with all the levels of the multiindex.

  • caption (str, optional) – caption or title to set for the table.

  • footer (str or list of str, optional) – footer for the table. Can be a string or a list of strings. If a list then every element will appear in a different line.

  • customstyles (dict, optional) – dictionary where the keys are our internal classes and values are css styles. These styles will be appended to the selected styles. You can combine it with the ‘empty’ styles to completely override the defaults with your own styles.

  • customcss (str, optional) – A string with css targeting classes or ids. It will be pre-pended to the table in a html <style> element without any further processing.

  • value_styles (str, list or list of lists, numpy 2d array or pandas dataframe) – css styles to apply to values, it will be appended to the “values” styles from the default styles. If a 2D structure each element represents the style for the corresponding value in the dataframe. If 1D each element is applied to every element on each row. If a string it will be applied to all elements.

  • styles (str, {'default', 'empty'}) – css styles to use. ‘empty’ has no styles and is used to override all styles with your own.

  • table_id (str, optional) – id for the table, and as prefix for ids for every element on the table. If not provided, a random id will be generated internally. It can be an empty string.

  • show_index (bool, optional) – if False, dataframe indexes (row labels) will not be shown on the table.
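
The 2D form of value_styles described above can be sketched as a list of lists with one CSS string per cell, matching the shape of the dataframe. The dataframe and the colour rule here are made up for illustration.

```python
import pandas as pd

# Illustrative data; any dataframe with numeric values would do.
df = pd.DataFrame(
    {"change": [1.2, -0.4], "score": [-3.0, 5.5]},
    index=["site A", "site B"],
)

# One CSS string per cell, same shape as df:
# red text for negative values, no extra style otherwise.
value_styles = [
    ["color: red;" if value < 0 else "" for value in row]
    for row in df.to_numpy()
]

# With pysummaries installed, this would be passed as (assumed usage):
#   pandas_to_report_html(df, value_styles=value_styles)
```

According to the parameter description, a 1D list would instead apply one style per row, and a single string would apply to every value.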

Returns:

an object encapsulating the html representation of the table formatted in a nice way.

Return type:

Pandas2HTMLSummaryTable

Example:

>>> from pysummaries import pandas_to_report_html
>>> html = pandas_to_report_html(df)