Title: | Functions to Facilitate Exploratory Data Analysis |
---|---|
Description: | Functions for descriptive statistics, data management, and data visualization. |
Authors: | Kabacoff Robert [aut, cre], Barich Griffen [ctb], Jamrog Kelly [ctb], Kravchenko Elizaveta [ctb], Kuruvilla Jacob [ctb], Liu Lex [ctb], Nakamura Shota [ctb], Pham Kim [ctb], Rodriguez Belen [ctb], Ross Shane [ctb], Russo Chris [ctb], Corpuz Frederick [ctb], Juradat Nurah [ctb], Karp Harrison [ctb], Koech Kevin [ctb], Peters Anna [ctb], Shah Dhhyey [ctb], Stevenson Kenneth [ctb], Thomas-Franz Kaitlyn [ctb], Zheng Jiner [ctb], Aldarmaki Ahmed [ctb], Alneyadi Mohammed [ctb], Altai Chossis [ctb], Colorado Sofia [ctb], Northrop Blake [ctb], Peretz Shea [ctb], Qin Cher [ctb], Tuhabonye Emma [ctb], Wong Phillip [ctb] |
Maintainer: | Kabacoff Robert <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.3 |
Built: | 2025-01-19 04:16:19 UTC |
Source: | https://github.com/rkabacoff/qacbase |
Create barcharts for all categorical variables in a data frame.
barcharts( data, fill = "deepskyblue2", color = "grey30", labels = TRUE, sort = TRUE, maxcat = 20, abbrev = 20 )
data |
data frame |
fill |
fill color for bars |
color |
color for bar labels |
labels |
if |
sort |
if |
maxcat |
numeric. barcharts with more than this number of bars will not be plotted. |
abbrev |
numeric. Abbreviate bar labels to at most this character length. |
a ggplot graph
barcharts(cars74)
Cars dataset with features including make, model, year, engine, and other properties of the car used to predict its price.
cardata
A data frame with 11914 rows and 16 variables. The variables are as follows:
car brand
model given by its brand
year of manufacture
type of fuel required by its manufacturer
engine horse power
number of cylinders
automatic vs. manual
AWD, FWD, RWD
Number of Doors
Luxury, Performance, Hatchback, etc.
Compact, Midsize, Large
Type of Vehicle: Sedan, SUV, Coupe, etc.
highway miles per gallon
city miles per gallon
Popularity index
manufacturer's suggested retail price
This package contains a detailed car dataset.
Taken from Kaggle https://www.kaggle.com/CooperUnion/cardataset.
summary(cardata)
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).
cars74
A data frame with 32 rows and 11 variables. The variables are as follows:
highway miles per gallon
Miles/(US) gallon
Number of cylinders
Displacement (cu.in.)
Gross horsepower
Rear axle ratio
Weight (1000 lbs)
1/4 mile time
Engine cylinder configuration
Transmission type
Number of forward gears
Number of carburetors
This dataset is the mtcars dataset that comes with base R. However, cyl, vs, am, gear, and carb have been converted to factors and rownames have been converted to the variable auto.
A description of the variables by Soren Heitmann can be found here.
Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391-411.
summary(cars74)
contents provides a comprehensive description of a data frame, including summary statistics for both quantitative and categorical variables.
contents(data, digits = 2, maxcat = 10, label_length = 20)
data |
a data frame |
digits |
number of decimal digits for statistics. |
maxcat |
maximum number of levels of a character/factor variable to print. |
label_length |
maximum length of factor level label to print. Longer labels will be truncated. |
Prints a comprehensive description of a data frame via several tables: a general summary table, plus tables that break down the quantitative and categorical variables.
a list with 6 components:
name of data frame
number of rows
number of columns
data frame of overall dataset characteristics
data frame with summary statistics for quantitative variables
data frame with summary statistics for categorical variables
contents(cars74)
Create a correlation matrix for all quantitative variables in a data frame.
cor_plot( data, method = c("pearson", "kendall", "spearman"), sort = FALSE, axis_text_size = 12, number_text_size = 3, legend = FALSE )
data |
data frame |
method |
a character string indicating which correlation coefficient is to be computed. One of "pearson" (default), "kendall", or "spearman". |
sort |
logical. If |
axis_text_size |
size for axis labels (default=12). |
number_text_size |
size for correlation coefficient labels (default=3). |
legend |
logical, if TRUE the legend is displayed. (default=FALSE) |
The cor_plot function will only select quantitative variables from a data frame. Categorical variables are ignored. The correlation matrix is presented as a lower triangle matrix. Missing values are deleted in listwise fashion.
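The three steps described above (keep numeric columns, delete incomplete rows listwise, show the lower triangle) can be sketched in base R. This is only an illustration of the idea using mtcars; it is not the code cor_plot runs, and the object names num and r are made up for the example.

```r
# Keep only the numeric columns; factors and character columns are dropped.
num <- mtcars[sapply(mtcars, is.numeric)]

# Listwise deletion: drop any row with a missing value in any column.
num <- na.omit(num)

# Pearson correlations (method = "kendall" or "spearman" also work here).
r <- cor(num, method = "pearson")

# Blank out the upper triangle so that only the lower triangle remains.
r[upper.tri(r)] <- NA
round(r[1:4, 1:4], 2)
```

Listwise deletion means a row with a single missing value is excluded from every correlation, so all coefficients are based on the same cases.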
a ggplot graph
This function is a wrapper for the ggcorrplot function.
cor_plot(cars74)
cor_plot(cars74, sort=TRUE)
This function creates a two way frequency table.
crosstab( data, rowvar, colvar, type = c("freq", "percent", "rowpercent", "colpercent"), total = TRUE, na.rm = TRUE, digits = 2, chisquare = FALSE, plot = FALSE )
data |
data frame |
rowvar |
row factor (unquoted) |
colvar |
column factor (unquoted) |
type |
statistics to print. Options are |
total |
logical. if TRUE, includes total percents. |
na.rm |
logical. if TRUE, deletes cases with missing values. |
digits |
number of decimal digits to report for percents. |
chisquare |
logical. If |
plot |
logical. If |
Given a data frame, a row factor, a column factor, and a type (frequencies, cell percents, row percents, or column percents) the function provides the requested cross-tabulation.
If na.rm = FALSE, a level labeled <NA> is added. If total = TRUE, a level labeled Total is added. If chisquare = TRUE, a chi-square test of independence is performed.
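For readers who want to see what these options correspond to, the same quantities can be approximated with base R tools. This is a rough sketch on mtcars, not the crosstab implementation; note that addmargins labels the totals Sum rather than Total, and the object names are illustrative.

```r
# Two-way frequency table. useNA = "ifany" keeps missing values as an <NA>
# level, roughly what na.rm = FALSE is described as doing.
freq <- table(mtcars$cyl, mtcars$gear, useNA = "ifany", dnn = c("cyl", "gear"))

cell_pct <- prop.table(freq) * 100              # cell percents
row_pct  <- prop.table(freq, margin = 1) * 100  # row percents
col_pct  <- prop.table(freq, margin = 2) * 100  # column percents

# Row and column totals (base R labels them "Sum").
addmargins(freq)

# Chi-square test of independence. Warnings about small expected counts
# are suppressed for this small example.
test <- suppressWarnings(chisq.test(freq))
test$p.value
```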
If plot = TRUE, returns a ggplot2 graph. Otherwise the function returns a list with 7 components:
table (table). Table of frequencies or percents
type (character). Type of table to print
total (logical). If TRUE, print row and/or column totals
digits (numeric). Number of digits to print
rowname (character). Row variable name
colname (character). Column variable name
chisquare (character). If chisquare = TRUE, contains the results of the chi-square test. NULL otherwise.
# print frequencies
crosstab(mtcars, cyl, gear)
# print cell percents
crosstab(cardata, vehicle_size, driven_wheels)
crosstab(cardata, vehicle_size, driven_wheels, plot=TRUE)
crosstab(cardata, driven_wheels, vehicle_size, type="colpercent", plot=TRUE, chisquare=TRUE)
Create density plots for all quantitative variables in a data frame.
densities(data, fill = "deepskyblue2", adjust = 1)
data |
data frame |
fill |
fill color for density plots |
adjust |
a factor multiplied by the smoothing bandwidth. See details. |
The densities function will only plot quantitative variables from a data frame. Categorical variables are ignored.
The adjust parameter multiplies the smoothing bandwidth. For example, adjust = 2 will make the density plots twice as smooth, while adjust = 1/2 will make the density plots half as smooth (i.e., twice as spiky).
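The effect of adjust can be checked numerically with stats::density from base R, which has an argument of the same name. The sketch below assumes that densities passes adjust through to a kernel density estimate that behaves the same way; that is an assumption, not something stated in this manual.

```r
x <- mtcars$mpg

# Bandwidth actually used by the kernel density estimate for each setting.
bw_default <- density(x)$bw
bw_smooth  <- density(x, adjust = 2)$bw    # twice the default bandwidth
bw_spiky   <- density(x, adjust = 1/2)$bw  # half the default bandwidth

c(default = bw_default, smooth = bw_smooth, spiky = bw_spiky)
```

A larger bandwidth averages over a wider window, which is why the curve looks smoother.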
a ggplot graph
densities(cars74)
densities(cars74, adjust=2)
densities(cars74, adjust=1/2)
df_plot visualizes the variables in a data frame.
df_plot(data)
data |
a data frame. |
For each variable, the plot displays:
type (numeric, integer, factor, ordered factor, logical, or date)
percent of available (and missing) cases
Variables are sorted by type and the total number of variables and cases are printed in the caption.
a ggplot2 graph
For more descriptive statistics on a data frame see contents.
df_plot(cars74)
One-way analysis (ANOVA or Kruskal-Wallis Test) with post-hoc comparisons and plots
groupdiff( data, y, x, method = c("anova", "kw"), digits = 2, horizontal = FALSE, posthoc = FALSE )
data |
a data frame. |
y |
a numeric response variable |
x |
a categorical explanatory variable. It will be coerced to a factor. |
method |
character. Either |
digits |
Number of significant digits to print. |
horizontal |
logical. If |
posthoc |
logical. If |
The groupdiff function performs one of two analyses:
anova
A one-way analysis of variance, with TukeyHSD post-hoc comparisons.
kw
A Kruskal-Wallis Rank Sum Test, with Conover Test post-hoc comparisons.
In each case, summary statistics and grouped boxplots are provided. In the parametric case, the statistics are n, mean, and standard deviation. In the nonparametric case, the statistics are n, median, and median absolute deviation. If posthoc = TRUE, pairwise comparisons are superimposed on the boxplots. Groups that share a letter are not significantly different (at the .05 level), controlling for multiple comparisons.
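The two analyses and their summary statistics can be reproduced roughly with base R. The sketch below uses mtcars and only functions from the stats package. It omits the Conover post-hoc test and the letter displays, and it is not the groupdiff source; the object names are illustrative.

```r
dat <- transform(mtcars, gear = factor(gear))

# Parametric route: one-way ANOVA followed by Tukey HSD pairwise comparisons.
fit <- aov(hp ~ gear, data = dat)
summary(fit)
tukey <- TukeyHSD(fit)
tukey$gear

# Nonparametric route: Kruskal-Wallis rank sum test (omnibus test only).
kw <- kruskal.test(hp ~ gear, data = dat)
kw$p.value

# Summary statistics matching each route.
aggregate(hp ~ gear, data = dat,
          FUN = function(v) c(n = length(v), mean = mean(v), sd = sd(v)))
aggregate(hp ~ gear, data = dat,
          FUN = function(v) c(n = length(v), median = median(v), mad = mad(v)))
```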
a list with 3 components:
result
omnibus test
summarystats
summary statistics
plot
ggplot2 graph
kwAllPairsConoverTest, multcompLetters.
# parametric analysis
groupdiff(cars74, hp, gear)
# nonparametric analysis
groupdiff(cardata, popularity, vehicle_style, posthoc=TRUE, method="kw", horizontal=TRUE)
Create histograms for all quantitative variables in a data frame.
histograms(data, fill = "deepskyblue2", color = "white", bins = 30)
data |
data frame |
fill |
fill color for histogram bars |
color |
border color for histogram bars |
bins |
number of bins (bars) for the histograms |
The histograms function will only plot quantitative variables from a data frame. Categorical variables are ignored.
a ggplot graph
histograms(cars74)
histograms(cars74, bins=15, fill="darkred")
lso lists object sizes and types.
lso( pos = 1, pattern, order.by = "Size", decreasing = TRUE, head = TRUE, n = 10 )
pos |
a number specifying the environment as a position in the search list. |
pattern |
an optional regular expression. Only names matching pattern are returned. glob2rx can be used to convert wildcard patterns to regular expressions. |
order.by |
column to sort the list by. Values are |
decreasing |
logical. If |
head |
logical. Should output be limited to |
n |
if |
This function lists the sizes and types of all objects in an environment. By default, the list describes the objects in the current environment, presented in descending order by object size and reported in megabytes (Mb).
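The core idea (look up each object in an environment, measure it, and sort by size) can be sketched with base R. The helper list_objects below is hypothetical and much simpler than lso: it reports only type and size, and it assumes 1 Mb = 1024^2 bytes.

```r
# Hypothetical helper: type and size (in Mb) of every object in an environment.
list_objects <- function(env = globalenv()) {
  nms  <- ls(envir = env)
  objs <- mget(nms, envir = env)
  out <- data.frame(
    Type = vapply(objs, function(o) class(o)[1], character(1)),
    Size = vapply(objs, function(o) as.numeric(object.size(o)) / 1024^2,
                  numeric(1)),
    row.names = nms,
    stringsAsFactors = FALSE
  )
  # Largest objects first.
  out[order(out$Size, decreasing = TRUE), , drop = FALSE]
}

# Try it on a throwaway environment so the result does not depend on the session.
demo_env <- new.env()
assign("small", 1:10, envir = demo_env)
assign("big", rnorm(1e5), envir = demo_env)
sizes <- list_objects(demo_env)
sizes
```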
a data.frame with four columns (Type, Size, Rows, Columns) and object names as row names.
Based on postings by Petr Pikal and David Hinds to the r-help list in 2004, and modified by Dirk Eddelbuettel, Patrick McCann, and Rob Kabacoff.
https://stackoverflow.com/questions/1358003/tricks-to-manage-the-available-memory-in-an-r-session/.
data(cardata)
data(cars74)
lso()
Plots group means with error bars. Error bars can be standard deviations, standard errors, or confidence intervals. Optionally, plots can be based on robust statistics.
mean_plot( data, y, x, by, pointsize = 2, dodge = 0.2, lines = TRUE, width = 0.2, error_type = c("se", "sd", "ci"), percent = 0.95, robust = FALSE )
data |
a data frame. |
y |
a numeric response variable. |
x |
a categorical explanatory variable. |
by |
a second categorical explanatory variable (optional). |
pointsize |
numeric. Point size (default = 2). |
dodge |
numeric. If a |
lines |
logical. If |
width |
numeric. Width of the error bars (default = 0.2). Set to 0 to produce pointranges instead of error bars. |
error_type |
character. Error bars represents either standard deviations
|
percent |
numeric. if |
robust |
logical. If |
Robust statistics are based on deciles, the nine values that divide the response variable into 10 equal groups (where each group contains roughly the same fraction of cases). The robust mean is the mean of these nine decile values. The robust standard deviation is the sample standard deviation of the nine decile values. The standard error and confidence interval are calculated in the normal way, but use the robust mean and standard deviation in their calculations. See Abu-Shawiesh et al (2022).
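As a rough illustration of the decile-based statistics described above, the hypothetical helper robust_summary below computes them for a single numeric vector. Several details are assumptions rather than facts from this manual: deciles come from quantile() with its default method, the standard error is the robust standard deviation divided by sqrt(n), and the confidence interval uses a t critical value with n - 1 degrees of freedom.

```r
# Hypothetical helper: decile-based mean, sd, se, and confidence limits.
robust_summary <- function(y, percent = 0.95) {
  y <- y[!is.na(y)]
  n <- length(y)

  # The nine values that split the data into ten roughly equal groups.
  deciles <- quantile(y, probs = seq(0.1, 0.9, by = 0.1), names = FALSE)

  r_mean <- mean(deciles)  # robust mean: mean of the nine deciles
  r_sd   <- sd(deciles)    # robust sd: sample sd of the nine deciles

  # "Normal way" formulas, fed with the robust mean and sd (assumed forms).
  se     <- r_sd / sqrt(n)
  t_crit <- qt(1 - (1 - percent) / 2, df = n - 1)

  c(mean = r_mean, sd = r_sd, se = se,
    lower = r_mean - t_crit * se, upper = r_mean + t_crit * se)
}

res <- robust_summary(mtcars$mpg)
res
```

Because the extreme tails never enter the deciles, a few outlying values have little influence on these estimates.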
a ggplot2 graph.
Ahmed Abu-Shawiesh, M., Sinsomboonthong, J., & Kibria, B. (2022). A modified robust confidence interval for the population mean of distribution based on deciles. Statistics in Transition, vol. 23 (1).
mean_plot(cars74, mpg, cyl)
mean_plot(cars74, mpg, cyl, am)
mean_plot(cars74, mpg, cyl, am, error_type = "ci", percent = 0.9, width = 0, lines = FALSE, robust = TRUE)
Normalize the numeric variables in a data frame
normalize(data, new_min = 0, new_max = 1)
data |
a data frame. |
new_min |
minimum for the transformed variables. |
new_max |
maximum for the transformed variables. |
normalize transforms all the numeric variables in a data frame to have the same minimum and maximum values. By default, this will be a minimum of 0 and maximum of 1. Character variables and factors are left unchanged.
a data frame
Use this function to transform variables into a given range. The default is [0, 1], but [-1, 1], [0, 100], or any other range is permissible.
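Mapping a variable onto a new range is usually done with min-max rescaling. The sketch below shows that arithmetic in base R with a hypothetical helper, rescale_range. It is an assumption that normalize uses this formula, and the helper makes no attempt to handle constant columns, where the divisor would be zero.

```r
# Hypothetical helper: linearly map x onto [new_min, new_max].
rescale_range <- function(x, new_min = 0, new_max = 1) {
  old_min <- min(x, na.rm = TRUE)
  old_max <- max(x, na.rm = TRUE)
  (x - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
}

# Apply it to the numeric columns only, leaving other columns untouched.
num_cols <- sapply(mtcars, is.numeric)
scaled <- mtcars
scaled[num_cols] <- lapply(mtcars[num_cols], rescale_range,
                           new_min = -1, new_max = 1)
sapply(scaled, range)
```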
head(cars74)
cars74_st <- normalize(cars74)
head(cars74_st)
phelp provides help on an installed package.
phelp(pckg)
pckg |
The name of a package |
This function provides help on an installed package. The package does not have to be loaded. The package name does not need to be entered with quotes.
No return value, called for side effects.
phelp(stats)
This function plots the results of a calculated two-way frequency table.
## S3 method for class 'crosstab'
plot(x, size = 3.5, ...)
x |
An object of class |
size |
numeric. Size of bar text labels. |
... |
not currently used. |
a ggplot2 graph
tbl <- crosstab(cars74, cyl, gear, type = "freq")
plot(tbl)
tbl <- crosstab(cars74, cyl, gear, type = "colpercent")
plot(tbl)
Plot a frequency or cumulative frequency table
## S3 method for class 'tab'
plot(x, fill = "deepskyblue2", size = 3.5, ...)
x |
An object of class |
fill |
Fill color for bars |
size |
numeric. Size of bar text labels. |
... |
Parameters passed to a function |
a ggplot2 graph
tbl1 <- tab(cars74, carb)
plot(tbl1)
tbl2 <- tab(cars74, carb, sort = TRUE)
plot(tbl2)
tbl3 <- tab(cars74, carb, cum=TRUE)
plot(tbl3)
print.contents prints the results of the contents function.
## S3 method for class 'contents'
print(x, ...)
x |
An object of class |
... |
not used. |
No return value, called for side effects.
testdata <- data.frame(height=c(4, 5, 3, 2, 100),
                       weight=c(39, 88, NA, 15, -2),
                       names=c("Bill","Dean", "Sam", NA, "Jane"),
                       race=c('b', 'w', 'w', 'o', 'b'))
x <- contents(testdata)
print(x)
This function prints the results of a calculated two-way frequency table.
## S3 method for class 'crosstab'
print(x, ...)
x |
An object of class |
... |
not currently used. |
No return value, called for side effects
mycrosstab <- crosstab(mtcars, cyl, gear, type = "freq", digits = 2)
print(mycrosstab)
mycrosstab <- crosstab(mtcars, cyl, gear, type = "rowpercent", digits = 3)
print(mycrosstab)
Print the results of calculating a frequency table
## S3 method for class 'tab'
print(x, ...)
x |
An object of class |
... |
Parameters passed to the print function |
No return value, called for side effects
frequency <- tab(cardata, make, sort = TRUE, na.rm = FALSE)
print(frequency)
This function provides descriptive statistics for a quantitative variable alone or separately by groups. Any function that returns a single numeric value can be used.
qstats(data, x, ..., stats = c("n", "mean", "sd"), na.rm = TRUE, digits = 2)
data |
data frame |
x |
numeric variable in data (unquoted) |
... |
list of grouping variables |
stats |
statistics to calculate (any function that produces a numeric value), Default: |
na.rm |
if |
digits |
number of decimal digits to print, Default: 2 |
a data frame, where columns are grouping variables (optional) and statistics
# If no keyword arguments are provided, default values are used
qstats(mtcars, mpg, am, gear)
# You can supply as many (or no) grouping variables as needed
qstats(mtcars, mpg)
qstats(mtcars, mpg, am, cyl)
# You can specify your own functions (e.g., median,
# median absolute deviation, minimum, maximum)
qstats(mtcars, mpg, am, gear, stats = c("median", "mad", "min", "max"))
Plot a grid of R colors and their associated names
rcolors(color = NULL, cex = 0.6)
color |
character. A text string used to search for specific color variations (see examples.) |
cex |
numeric. text size for color labels. |
By default, rcolors plots the basic 502 distinct colors provided by the colors function. If a color name or part of a name is provided, only colors with matching names are plotted.
No return value, called for side effects
This function is adapted from code published by Karl W. Broman.
rcolors()
rcolors("blue")
rcolors("red")
rcolors("dark")
recodes recodes the values of one or more variables in a data frame.
recodes(data, vars, from, to)
data |
a data frame. |
vars |
character vector of variable names. |
from |
a vector of values or conditions (see Details). |
to |
a vector of replacement values. |
For each variable in the vars parameter, values are checked against the list of values in the from vector. If a value matches, it is replaced with the corresponding entry in the to vector.
Once a given observation's value matches a from value, it is recoded. That particular observation will not be recoded again by that recodes() statement (i.e., no chaining).
One or more values in the from vector can be an expression, using the dollar sign ($) to represent the variable being recoded. If the expression evaluates to TRUE, the corresponding to value is returned.
If the number of values in the to vector is less than the from vector, the values are recycled. This lets you convert several values to a single outcome value (e.g., NA).
If the to values are numeric, the resulting recoded variable will be numeric. If the variable being recoded is a factor and the to values are character values, the resulting variable will remain a factor. If the variable being recoded is a character variable and the to values are character values, the resulting variable will remain a character variable.
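The "no chaining" and recycling rules are easier to see in a small example. The helper recode_once below is hypothetical and handles only literal values in a single vector (no $ expressions, no factor handling). It is a sketch of the matching logic as described above, not the recodes source.

```r
# Hypothetical helper: replace values in x that match `from` with `to`,
# recoding each element at most once.
recode_once <- function(x, from, to) {
  to   <- rep_len(to, length(from))  # recycle `to` if it is shorter than `from`
  out  <- x
  done <- rep(FALSE, length(x))      # elements that have already been recoded
  for (i in seq_along(from)) {
    # Match against the ORIGINAL values, skipping anything already recoded.
    hit <- !done & !is.na(x) & x == from[i]
    out[hit] <- to[i]
    done <- done | hit
  }
  out
}

# No chaining: the 1 becomes 2 but is not then turned into 3.
recode_once(c(1, 2, 3), from = c(1, 2), to = c(2, 3))

# Recycling: several values are mapped to a single outcome (NA).
recode_once(c(0, 99, 5), from = c(0, 99), to = NA)
```

With chained replacement the first call would return 3, 3, 3. Matching against the original values is what keeps the 1 at 2.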
a data frame
See the vignette for detailed examples.
df <- data.frame(x = c(1, 5, 7, 3, 0),
                 y = c(9, 0, 5, 9, 2),
                 z = c(1, 1, 2, 2, 1))
df <- recodes(df, vars = c("x", "y"), from = 0, to = NA)
df <- recodes(df, vars = "z", from = c(1, 2), to = c("pass", "fail"))
Create a scatter plot between two quantitative variables.
scatter( data, x, y, outlier = 3, alpha = 1, digits = 3, title, margin = "none", stats = TRUE, point_color = "deepskyblue2", outlier_color = "violetred1", line_color = "grey30", margin_color = "deepskyblue2" )
data |
data frame |
x |
quantitative predictor variable |
y |
quantitative response variable |
outlier |
number. Observations with studentized residuals larger than this value are flagged. If set to 0, observations are not flagged. |
alpha |
Transparency of data points. A numeric value between 0 (completely transparent) and 1 (completely opaque). |
digits |
Number of significant digits in displayed statistics. |
title |
Optional title. |
margin |
Marginal plots. If specified, parameter can be
|
stats |
logical. If |
point_color |
Color used for points. |
outlier_color |
Color used to identify outliers (see the |
line_color |
Color for regression line. |
margin_color |
Fill color for margin boxplots, density plots, or histograms. |
The scatter function generates a scatterplot between two quantitative variables, along with a line of best fit and a 95% confidence interval. By default, regression statistics (b, r, r2, p) are printed and outliers (observations with studentized residuals > 3) are flagged. Optionally, variable distributions (histograms, boxplots, violin plots, density plots) can be added to the plot margins.
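The printed statistics and the outlier rule correspond to quantities that can be obtained from an ordinary linear model in base R. The sketch below uses mtcars. It assumes the slope and p-value come from a simple regression of y on x, and that the outlier rule is applied to the absolute value of the externally studentized residual; neither detail is confirmed by this manual.

```r
# Simple linear regression of the response on the predictor.
fit <- lm(mpg ~ hp, data = mtcars)

b  <- unname(coef(fit)["hp"])                       # slope
r  <- cor(mtcars$hp, mtcars$mpg)                    # correlation
r2 <- summary(fit)$r.squared                        # R-squared
p  <- summary(fit)$coefficients["hp", "Pr(>|t|)"]   # p-value for the slope
signif(c(b = b, r = r, r2 = r2, p = p), 3)

# Flag observations whose studentized residual exceeds the cutoff in size.
cutoff  <- 3
flagged <- which(abs(rstudent(fit)) > cutoff)
flagged
```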
a ggplot2 graph
Variable names do not have to be quoted.
scatter(cars74, hp, mpg)
scatter(cars74, wt, hp)
p <- scatter(ggplot2::mpg, displ, hwy, margin="histogram", title="Engine Displacement vs. Highway Mileage")
plot(p)
Calculate the skewness of a numeric variable
skewness(x, na.rm = TRUE)
x |
numeric vector. |
na.rm |
if |
a number
skewness(mtcars$mpg)
Standardize the numeric variables in a data frame
standardize(data, mean = 0, sd = 1, include_dummy = FALSE)
data |
a data frame. |
mean |
mean of the transformed variables. |
sd |
standard deviation of the transformed variables. |
include_dummy |
logical. If |
standardize transforms all the numeric variables in a data frame to have the same mean and standard deviation. By default, this will be a mean of 0 and standard deviation of 1. Character variables and factors are left unchanged. By default, dummy coded variables are also left unchanged. Use include_dummy=TRUE to transform these variables as well.
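The transformation described here is a z-score followed by a shift and scale. The base R sketch below illustrates it with two hypothetical helpers, rescale_z and is_dummy. Treating "dummy coded" as "numeric with only 0/1 values" is an assumption made for the example, and this is not the standardize source.

```r
# Hypothetical helper: rescale x to a chosen mean and standard deviation.
rescale_z <- function(x, new_mean = 0, new_sd = 1) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  z * new_sd + new_mean
}

# Hypothetical helper: TRUE when a column holds only 0/1 (or NA) values.
is_dummy <- function(x) all(x %in% c(0, 1, NA))

num_cols   <- sapply(mtcars, is.numeric)
dummy_cols <- sapply(mtcars, is_dummy)
target     <- num_cols & !dummy_cols   # numeric, but not dummy coded

scaled <- mtcars
scaled[target] <- lapply(mtcars[target], rescale_z, new_mean = 50, new_sd = 10)
round(sapply(scaled[target], mean), 2)
round(sapply(scaled[target], sd), 2)
```

In mtcars the vs and am columns are 0/1 indicators, so this sketch leaves them as they are.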
a data frame
head(cars74)
cars74_st <- standardize(cars74)
head(cars74_st)
Function to calculate frequency distributions for categorical variables
tab( data, x, sort = FALSE, maxcat = NULL, minp = NULL, na.rm = FALSE, total = FALSE, digits = 2, cum = FALSE, plot = FALSE )
data |
A dataframe |
x |
A factor variable in the data frame. |
sort |
logical. Sort levels from high to low. |
maxcat |
Maximum number of categories to be included. Smaller categories will be combined into an "Other" category. |
minp |
Minimum proportion for a category to be included. Categories representing smaller proportions will be combined into an "Other" category. maxcat and minp cannot both be specified. |
na.rm |
logical. Removes missing values when TRUE. |
total |
logical. Include a total category when TRUE. |
digits |
Number of digits the percents should be rounded to. |
cum |
logical. If |
plot |
logical. If |
The function tab will calculate the frequency distribution for a categorical variable and output a data frame with three columns: level, n, percent.
If plot = TRUE, returns a ggplot2 bar chart. Otherwise, returns a data frame.
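A frequency table with these three columns can be built by hand in base R, which may help when checking the output of tab. The sketch below uses mtcars. The sorting step and the cumulative columns (cum_n, cum_percent) are assumptions about what sort = TRUE and cum = TRUE produce, and the column names are illustrative.

```r
counts <- table(mtcars$carb)

# One row per level, with counts and percents rounded to 2 digits.
freq <- data.frame(
  level   = names(counts),
  n       = as.vector(counts),
  percent = round(100 * as.vector(counts) / sum(counts), 2)
)

# Sort levels from high to low frequency.
freq <- freq[order(freq$n, decreasing = TRUE), ]

# Running totals for a cumulative frequency table.
freq$cum_n       <- cumsum(freq$n)
freq$cum_percent <- round(100 * freq$cum_n / sum(freq$n), 2)
freq
```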
tab(cars74, carb)
tab(cars74, carb, plot=TRUE)
tab(cars74, carb, sort=TRUE)
tab(cars74, carb, sort=TRUE, plot=TRUE)
tab(cars74, carb, cum=TRUE)
tab(cars74, carb, cum=TRUE, plot=TRUE)
This is a data set detailing TV usage on days surveyed as determined by the 2017 American Time Use Survey. The data set includes demographic information, as well as details regarding employment and family makeup, where applicable. Information on days surveyed, as well as whether the day is a holiday, is also included.
tv
A data frame with 10,223 rows and 21 variables. The variables are as follows:
ID of respondent
ATUS final weight
Age of the youngest child in the household that is less than 18 years old (if applicable). Range: 1-17; if no child in household: NA
Age of respondent
Sex of respondent
Status of employment of the respondent. Direct transcription from original codebook: 1 = Employed, at work, 2 = Employed, absent, 3 = Unemployed, on layoff, 4 = Unemployed, looking, 5 = Not in the labor force.
The response to question, “in the last seven days did you have more than one job?” Returns NA if no job.
Does the respondent have a full time job or a part time job? (NA if no job)
Are you enrolled in high school, college, or university? (NA if not currently enrolled)
If yes to educ, are you enrolled in high school or upper schooling? (NA if not currently enrolled)
Presence of the respondent's spouse or unmarried partner in the household with 1 = Spouse present 2 = Unmarried partner present 3 = No spouse/unmarried partner present
Answer to the question, “does your partner have a job?” (NA if not applicable)
Weekly earnings at the respondent’s main job, two decimals implied
Number of children under 18 in the household
Part time/full time job status of partner, if applicable (NA if partner unemployed or no partner)
Total hours usually worked per week (-4: Hours vary)
Day of the week about which the respondent was interviewed (Monday through Friday)
Notes if the respondent was interviewed on a holiday
Total time spent providing elder care that day by the respondent, in minutes
Total time spent during diary day providing secondary childcare for household children younger than 13, in minutes
Minutes spent watching TV
For more information on the variable codes, see the codebook at https://www.bls.gov/tus/atusintcodebk17.pdf. The data were retrieved from the American Time Use Survey, made available through the Bureau of Labor Statistics at https://www.bls.gov/tus/datafiles_2017.htm.
summary(tv)
hist(tv$tv, col="skyblue")
Generates a descriptive graph for a quantitative variable.
univariate_plot( data, x, bins = 30, fill = "deepskyblue", pointcolor = "black", density = TRUE, densitycolor = "grey", alpha = 0.2, seed = 1234 )
data |
a data frame. |
x |
a variable name (without quotes). |
bins |
number of histogram bins. |
fill |
fill color for the histogram and boxplot. |
pointcolor |
point color for the jitter plot. |
density |
logical. Plot a filled density curve over the histogram. (default=TRUE) |
densitycolor |
fill color for density curve. |
alpha |
Alpha transparency (0-1) for the density curve and jittered points. |
seed |
pseudorandom number seed for jittered plot. |
univariate_plot generates a plot containing three graphs: a histogram (with an optional density curve), a horizontal jittered point plot, and a horizontal box plot. The subtitle contains descriptive statistics, including the mean, standard deviation, median, minimum, maximum, and skew.
a ggplot2 graph
The graphs are created with ggplot2 and then assembled into a single plot through the patchwork package. Missing values are deleted.
univariate_plot(mtcars, mpg)
univariate_plot(cardata, city_mpg, fill="lightsteelblue", pointcolor="lightsteelblue", densitycolor="lightpink", alpha=.6)