Sample product marginals dataset

Usage

sample_marginals(dat, n, seed = NULL)

Arguments

dat

data.frame to sample from; it must contain only covariates.

n

Number of observations to sample.

seed

NULL or seed for exact reproducibility.

Value

data.frame with the same number of columns as dat and n rows.

Details

The product marginals dataset is a grid of values sampled independently for each column (feature) of the original dataset. The aim is to break the correlations between features so that the effect of each feature on the model predictions can be assessed individually. The sampled dataset contains no new values within any column, but it may contain combinations of values that do not occur in the original data, so it also shows how the model behaves on unseen observations (new combinations of feature values). Note that using the product marginals dataset for model sculpting only works if the features are approximately additive in the model predictions. In the quite rare case when they are not, sculpted models built on the product marginals dataset are expected to perform significantly worse, and the conclusions may be misleading.
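
The independent per-column sampling described above can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper name sketch_product_marginals is hypothetical, and sampling with replacement is an assumption made for illustration.

```r
# Hypothetical helper for illustration only; not the package's code.
# Assumption: each column is sampled with replacement.
sketch_product_marginals <- function(dat, n, seed = NULL) {
  stopifnot(is.data.frame(dat), n >= 1)
  if (!is.null(seed)) set.seed(seed)
  # Draw n positions separately for every column, so that values from
  # different original rows can end up in the same sampled row.
  sampled <- lapply(dat, function(col) {
    col[sample.int(length(col), n, replace = TRUE)]
  })
  data.frame(sampled, check.names = FALSE)
}

pm <- sketch_product_marginals(mtcars, n = 5, seed = 1)

# No new values per column: every sampled value exists in the original column.
values_ok <- all(mapply(function(new, old) all(new %in% old), pm, mtcars))
values_ok
```

Because the columns are drawn separately, a sampled row is usually not a row of the original data, which is what produces the new feature combinations mentioned above.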

One can also try using the original data instead of the product marginals for model sculpting and see how the results differ.

Examples

sample_marginals(mtcars, n = 5, seed = 543)
#>    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> 1 14.3   6 145.0 113 3.07 5.250 17.98  0  0    4    1
#> 2 22.8   8 360.0 150 3.92 1.513 16.87  1  0    3    2
#> 3 15.5   8 301.0  97 3.15 1.835 16.70  1  0    3    6
#> 4 14.7   4  75.7 110 4.43 5.250 19.44  0  1    4    3
#> 5 19.7   8 472.0  66 2.76 2.780 17.60  1  0    4    8
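
The decorrelating effect described in Details can be checked with base R alone. The sketch below does not call sample_marginals; it assumes independent sampling with replacement per column and compares one pair of correlated mtcars features before and after.

```r
# Illustration only: independent per-column sampling done inline with base R.
set.seed(42)
n <- 2000
pm_large <- data.frame(lapply(mtcars, function(col) sample(col, n, replace = TRUE)))

# wt and disp are strongly positively correlated in the original data ...
cor_original <- cor(mtcars$wt, mtcars$disp)
# ... while after independent sampling the correlation should be close to zero.
cor_sampled <- cor(pm_large$wt, pm_large$disp)

c(original = cor_original, sampled = cor_sampled)
```

A large n is used here only so that the sampled correlation is estimated with little noise.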