Descriptive statistics by group

Overview

Getting summary statistics for a quantitative variable is a very common task in data analysis. Unfortunately, base R makes it surprisingly cumbersome, especially when several statistics are needed for each level of a grouping variable.
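
As a rough illustration of the difficulty, here is a sketch of one base R approach to getting the sample size, mean, and standard deviation by group. It uses the built-in mtcars data frame as a stand-in (an assumption made for this example; it is not the cardata data frame used in the rest of this document).

```r
# Base R sketch: n, mean, and sd of mpg for each number of cylinders.
# mtcars ships with R and is used here only as a stand-in dataset.
summ <- aggregate(mpg ~ cyl, data = mtcars,
                  FUN = function(x) c(n = length(x), mean = mean(x), sd = sd(x)))

# aggregate() stores the three statistics in a single matrix column,
# so an extra step is needed to get an ordinary data frame
result <- do.call(data.frame, summ)
names(result) <- c("cyl", "n", "mean", "sd")
result
```

The anonymous function, the matrix column, and the renaming step are the kind of overhead a single-call interface is meant to remove.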

The qstats function is an attempt to rectify the situation by making it simple to get any number of descriptive statistics for a numeric variable and to break these statistics down by the levels of one or more categorical variables (groups).

The general format is

qstats(data, variable, grouping variables, statistics, other options)

Note that variable names do not have to be quoted.
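
Functions that accept unquoted names usually capture the argument as an expression instead of evaluating it. The sketch below shows one common way to do this in base R. The helper col_mean is hypothetical and written only for this example; it is not part of qstats and is not a description of how qstats is implemented.

```r
# Hypothetical helper (not part of qstats): mean of a column whose
# name is given either bare or quoted
col_mean <- function(data, variable) {
  # substitute() captures what the caller typed without evaluating it;
  # as.character() turns the symbol mpg or the string "mpg" into "mpg"
  var_name <- as.character(substitute(variable))
  mean(data[[var_name]], na.rm = TRUE)
}

col_mean(mtcars, mpg)     # bare name
col_mean(mtcars, "mpg")   # quoted name gives the same result
```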

Using default statistics

By default the sample size, mean, and standard deviation are provided. Let’s take a look at fuel efficiencies for 11,914 automobiles in the cardata data frame.

# simple summary statistics 
qstats(cardata, highway_mpg)
#>       n  mean   sd
#> 1 11914 26.64 8.86

# summary statistics by vehicle_size
qstats(cardata, highway_mpg, vehicle_size)
#>   vehicle_size    n  mean   sd
#> 1      Compact 4764 28.94 9.58
#> 2        Large 2777 22.42 7.37
#> 3      Midsize 4373 26.80 7.91

# summary statistics by vehicle_size and drive type
qstats(cardata, highway_mpg, vehicle_size, driven_wheels)
#>    vehicle_size     driven_wheels    n  mean    sd
#> 1       Compact   all wheel drive  646 26.88  4.77
#> 2       Compact  four wheel drive  407 20.79  2.90
#> 3       Compact front wheel drive 2491 33.26  9.89
#> 4       Compact  rear wheel drive 1220 23.94  7.50
#> 5         Large   all wheel drive  438 26.00 12.84
#> 6         Large  four wheel drive  737 19.57  2.66
#> 7         Large front wheel drive  389 25.78  2.46
#> 8         Large  rear wheel drive 1213 21.78  6.73
#> 9       Midsize   all wheel drive 1269 25.83  4.41
#> 10      Midsize  four wheel drive  259 18.85  2.51
#> 11      Midsize front wheel drive 1907 30.24  9.46
#> 12      Midsize  rear wheel drive  938 23.32  5.16

Specifying other statistics

Use the stats argument to request other statistics. You can pass a single statistic name, or several names as a character vector.

# single statistic
qstats(cardata, highway_mpg, vehicle_size, stats = "median")
#>   vehicle_size median
#> 1      Compact     28
#> 2        Large     22
#> 3      Midsize     26

# multiple statistics
qstats(cardata, highway_mpg, vehicle_size, 
       stats = c("median", "min", "max"))
#>   vehicle_size median min max
#> 1      Compact     28  12 111
#> 2        Large     22  13 107
#> 3      Midsize     26  12 354

User-defined functions can also be used as statistics. Pass the function's name as a string, just as with the built-in statistics. The only requirement is that the function returns a single number.

# custom statistics
p25 <- function(x) quantile(x, probs = 0.25)
p75 <- function(x) quantile(x, probs = 0.75)

# calling the built-in and custom statistics
qstats(cardata, highway_mpg, vehicle_size, 
       stats = c("min", "p25", "p75", "max"))
#>   vehicle_size min p25 p75 max
#> 1      Compact  12  24  33 111
#> 2        Large  13  19  25 107
#> 3      Midsize  12  23  31 354
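
Before passing a custom function, it can help to confirm that it meets the single-number requirement. The sketch below uses a hypothetical helper, returns_single_number, written only for this example (it is not part of qstats). It also shows how a function that returns several values, such as range, can be wrapped so that it returns one.

```r
# Hypothetical helper (not part of qstats): TRUE if f returns exactly
# one number when applied to a small test vector
returns_single_number <- function(f, x = c(1, 2, 3, 4, 5)) {
  out <- f(x)
  is.numeric(out) && length(out) == 1
}

p25 <- function(x) quantile(x, probs = 0.25)
returns_single_number(p25)      # TRUE
returns_single_number(range)    # FALSE: range() returns two numbers

# wrap a multi-value function so that it returns a single number
spread <- function(x) diff(range(x))
returns_single_number(spread)   # TRUE
```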

Other options

Other options include:

  • na.rm: When TRUE, missing values (NAs) are removed before the statistics are computed. Default is TRUE.
  • digits: The number of decimal places to print. Default is 2.

qstats(cardata, highway_mpg, vehicle_size,
       stats=c("n", "mean","median","sd"),  
       na.rm=FALSE, digits=2)
#>   vehicle_size    n  mean median   sd
#> 1      Compact 4764 28.94     28 9.58
#> 2        Large 2777 22.42     22 7.37
#> 3      Midsize 4373 26.80     26 7.91
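
The results above match the default output, which suggests that highway_mpg contains no missing values. To see what the two options control, here is a small base R sketch on a made-up vector. It assumes qstats applies na.rm and digits in the usual base R sense, which is an assumption of this example.

```r
# made-up values with one missing observation
x <- c(24, 31, NA, 28)

mean(x)                                    # NA: the missing value propagates
mean(x, na.rm = TRUE)                      # 27.66667
round(mean(x, na.rm = TRUE), digits = 2)   # 27.67
```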