R/broman User Guide

R/broman is an R package with miscellaneous R functions that are useful to me. It includes many functions related to graphics (mostly for base graphics), permutation tests, running mean/median/sum/ratio, and a variety of utilities for programming, for data diagnostics/cleaning, and for writing reproducible data analysis reports.

It has gotten rather large, and the functions do a lot of different, unrelated things, so my purpose here is to organize them, and to explain and perhaps illustrate their use.

Document helpers

There are a bunch of functions to help while writing data analysis reports.

add_commas - This adds commas to large numbers so they are easier to read. If x is 150000, add_commas(x) becomes 150,000.
myround - I’m very particular about sig figs, so I may want myround(3.10, 2) to appear as 3.10 and not 3.1.
spell_out - Spell out an integer as a word; values <10 become words and values >=10 remain in numerals. So if x is 5, spell_out(x) becomes “five” while if it were 10 it would remain “10”.
numbers - This is just a vector of the first twenty positive integers as words, for a similar use as spell_out().
vec2string - Take a vector of numbers and turn it into text; for example, vec2string(problem_ids) might become “123, 125, 128, and 129”.
kbdate - Shows today’s date in a particular format. It seems unnecessary. I had used it for date stamps, but it seems to give the same output as Sys.Date(). I think maybe I knew about Sys.time() but not about Sys.Date() so I made my own.
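
As a rough illustration of the kind of formatting these helpers produce, here is a minimal base-R sketch. It does not call R/broman, and join_with_and() is a hypothetical helper written only for this example; the package's own functions may handle edge cases differently.

```r
# Base-R approximations of the formatting described above
# (illustrative only; not the package's implementation).

# Commas in large numbers, in the spirit of add_commas()
with_commas <- formatC(150000, format = "d", big.mark = ",")

# Keep trailing zeros when rounding, in the spirit of myround()
two_places <- sprintf("%.2f", 3.10)

# Hypothetical helper: join values as text with a final "and"
join_with_and <- function(x) {
  x <- as.character(x)
  n <- length(x)
  if (n <= 1) return(paste(x, collapse = ""))
  if (n == 2) return(paste(x, collapse = " and "))
  paste0(paste(x[-n], collapse = ", "), ", and ", x[n])
}
id_text <- join_with_and(c(123, 125, 128, 129))

print(c(with_commas, two_places, id_text))
```

Values like these are convenient to drop into inline code chunks in a report, so the text always matches the data.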

Statistics stuff

chisq - This is like chisq.test() but using Monte Carlo to get an approximate p-value rather than relying on asymptotics.
fisher - This is like fisher.test() in doing Fisher’s exact test but getting the p-value by Monte Carlo rather than doing an exhaustive calculation.
perm.test - This does a two-sample permutation test using the t statistic (via t.test()), using Monte Carlo to get the p-value rather than referring to the t distribution.
paired.perm.test - This is similar to perm.test but for paired data.
quantileSE - This is like the function quantile() but it also gives an estimated standard error.
runningmean - For a set of values at a given set of positions, this calculates a running mean, sum, or median, with a fixed window width.
runningratio - Like runningmean but with a pair of values at each position, and then we calculate the ratio of the sums within fixed windows.
runningratio2 - Like runningratio but instead of a fixed window, it adapts the size of the window to achieve some minimum number of points before taking the ratio.
rmvn - Simulate data from a multivariate normal distribution.
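
The idea behind a Monte Carlo permutation test with the t statistic can be sketched in base R as follows. This is not the code of perm.test(); perm_pvalue() is a hypothetical function, and the choices of an equal-variance t statistic, a two-sided comparison, and 1000 permutations are assumptions made for the illustration.

```r
# Sketch of a two-sample permutation test using the t statistic.
# Hypothetical helper; assumptions: equal-variance t statistic, two-sided.
perm_pvalue <- function(x, y, n_perm = 1000) {
  observed <- abs(t.test(x, y, var.equal = TRUE)$statistic)
  pooled <- c(x, y)
  n_x <- length(x)
  perm_stats <- replicate(n_perm, {
    # Randomly reassign the group labels
    idx <- sample(length(pooled), n_x)
    abs(t.test(pooled[idx], pooled[-idx], var.equal = TRUE)$statistic)
  })
  # Proportion of permuted statistics at least as extreme as the observed one
  mean(perm_stats >= observed)
}

set.seed(20240101)
x <- c(5.1, 4.8, 6.0, 5.5, 5.9)
y <- c(4.2, 4.9, 4.4, 5.0, 4.1)
p_value <- perm_pvalue(x, y)
print(p_value)
```

The p-value is the fraction of random relabelings that give a statistic at least as large as the one observed, so its resolution is limited by the number of permutations.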

Data diagnostics/cleaning

align_vectors - align two vectors by their names, either expanding them with NAs (if there are names that are unique to one or the other) or reducing the vectors to the names that appear in both.
pick_more_precise - for a quirky situation where two columns of numbers were subjected to weird rounding patterns and I needed to merge them. I hope I don’t need this again.
get_precision - used by pick_more_precise()
compare_rows - for all pairs of rows in a matrix, calculate either the proportion of differences or the RMS difference, creating a distance matrix.
cf - useful for verifying that data columns match when they’re supposed to. It checks whether they are the same, including the pattern of missing data. I use this a lot when doing data diagnostics/cleaning.
winsorize - a way to deal with outliers in a numeric vector: move some proportion of the most extreme values in to the next-most-extreme values.
normalize - quantile-normalize a set of columns of a matrix (or just two vectors), to force them to have the same marginal distribution
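
To illustrate winsorizing, here is a minimal base-R sketch. It is not the package's winsorize(); winsorize_sketch() is a hypothetical function, and pulling values in to the q and 1 - q sample quantiles is an assumption about one common way to do it.

```r
# Sketch of winsorizing: pull the most extreme values in to cutoffs.
# Hypothetical helper; assumption: cutoffs are the q and 1 - q sample quantiles.
winsorize_sketch <- function(x, q = 0.1) {
  limits <- unname(quantile(x, c(q, 1 - q), na.rm = TRUE))
  pmin(pmax(x, limits[1]), limits[2])
}

x <- c(1:9, 100)   # one clear outlier
w <- winsorize_sketch(x, q = 0.1)
print(rbind(original = x, winsorized = w))
```

The outlier is pulled in toward the rest of the data rather than dropped, so the vector keeps its length and ordering.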

System/file helpers

h - open the help document for a function/object in your web browser. I’m usually using R within emacs, and if I type help(h), the help for the h function shows up within emacs. So I use h(h) to see it within my web browser.
make - this runs GNU make within an R package directory, in a similar way to the various tools in devtools like test() and document().
exit - exit from R without saving .RData
objectsizes - get the sizes of all objects in memory, in MB
openfile - open a file using command line open (Mac/Windows) or start (Linux).
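
As a rough idea of what a summary of object sizes looks like, here is a base-R sketch using object.size(). It is not the package's objectsizes(); the environment env and the object names are hypothetical, used here so the example does not depend on what is in your workspace.

```r
# Sketch: sizes of all objects in an environment, in MB
# (hypothetical example objects; not the package's implementation).
env <- new.env()
assign("big_vector", numeric(1e5), envir = env)
assign("small_vector", letters, envir = env)

sizes_mb <- sapply(ls(env), function(name) {
  as.numeric(object.size(get(name, envir = env))) / 1024^2
})
print(sort(sizes_mb, decreasing = TRUE))
```

Replacing env with globalenv() would give the same kind of summary for the objects in the current session.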

Programming

qr2 - the base function qr does a QR decomposition, but the results it produces are not what you might expect. This function gets you the basic Q and R pieces.
simp - Perform numerical integration by Simpson’s rule.
trap - Perform numerical integration by the trapezoidal rule. trap() and simp() are much like integrate() but coded directly in R.
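
The two integration rules, and the Q and R pieces of a QR decomposition, can be sketched in base R as below. These are not the package's functions: trap_sketch() and simp_sketch() are hypothetical, and the use of n equally spaced intervals is an assumption made for the illustration. The QR part uses the base functions qr.Q() and qr.R().

```r
# Trapezoidal rule on n equal intervals (hypothetical helper)
trap_sketch <- function(f, a, b, n = 100) {
  x <- seq(a, b, length.out = n + 1)
  y <- f(x)
  h <- (b - a) / n
  h * (sum(y) - (y[1] + y[n + 1]) / 2)
}

# Simpson's rule on n equal intervals, n even (hypothetical helper)
simp_sketch <- function(f, a, b, n = 100) {
  stopifnot(n >= 2, n %% 2 == 0)
  x <- seq(a, b, length.out = n + 1)
  h <- (b - a) / n
  # Weights 1, 4, 2, 4, ..., 2, 4, 1
  w <- c(1, rep(c(4, 2), length.out = n - 1), 1)
  h / 3 * sum(w * f(x))
}

# The integral of sin from 0 to pi is exactly 2
trap_value <- trap_sketch(sin, 0, pi)
simp_value <- simp_sketch(sin, 0, pi)
print(c(trapezoid = trap_value, simpson = simp_value))

# Extracting Q and R from base R's qr() object
X <- cbind(1, c(1, 2, 3, 4))
decomp <- qr(X)
Q <- qr.Q(decomp)
R <- qr.R(decomp)
print(all.equal(Q %*% R, X))
```

With the same number of intervals, Simpson's rule is typically much more accurate than the trapezoidal rule for smooth functions.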

Programming helpers

%nin%, %win%, and %wnin% - variants of %in%; compare to the recent addition to base R, %notin%.
attrnames - get the names of all attributes of an object.
convert2hex, dec2hex - convert a base 10 number to hexadecimal (mostly for working with RGB color codes).
hex2dec - convert a hexadecimal number to base 10.
fac2num - convert a factor with numeric levels to numeric
lenuniq - the number of unique values in a vector, by just calling length(unique(x))
maxabs - maximum of absolute values of a vector, removing NAs
paste00 - like paste0 but also using collapse=""
paste. - like paste0 but with sep="."
setRNGparallel - set up random number generation for use in parallel calculations, and then unsetRNGparallel to go back to the standard RNG method.
switchv - a vectorized version of the base function switch.
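
Many of these helpers are thin wrappers, and base-R equivalents give a sense of what they do. The sketch below does not call R/broman. The reading of %nin% as the negation of %in%, and of %win% and %wnin% as returning the matching and non-matching values, is an assumption, as are the example values.

```r
x <- c("a", "b", "c")
ref <- c("b", "c", "d")

# Assumed meaning of the %in% variants
not_in  <- !(x %in% ref)      # like %nin%: logical "not in"
kept    <- x[x %in% ref]      # like %win%: values of x found in ref
dropped <- x[!(x %in% ref)]   # like %wnin%: values of x not found in ref

# Decimal <-> hexadecimal
hex <- format(as.hexmode(255))
dec <- strtoi("ff", base = 16L)

# Factor with numeric levels to numeric: go through character,
# since as.numeric() on a factor returns the integer codes instead
f <- factor(c("10", "2.5", "10"))
num <- as.numeric(as.character(f))

# Number of unique values, and paste0 with collapse = ""
n_unique <- length(unique(c(3, 1, 3, 2, 1)))
collapsed <- paste0(c("a", "b", "c"), collapse = "")
dotted <- paste("a", "b", sep = ".")

print(list(kept = kept, dropped = dropped, hex = hex, dec = dec, num = num))
```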

Data visualization

grayplot - a base-graphics-based scatterplot, but with particular styling I learned from Dan Carr: gray background, white gridlines, black box, no tick marks
grayplot_na - a base-graphics-based scatterplot, but including NA values in margins rather than omitted. Use the force argument to control whether the boxes for NAs in the margins are included even if the data have no NAs.
ciplot - plot a set of confidence intervals, with styling like grayplot()
dotplot - a scatterplot where one variable is categorical (so like stripchart()), with styling like grayplot()
timeplot - a scatterplot where one variable is date/time, with styling like grayplot()
mypairs - a scatterplot matrix (like pairs()) but with only the upper triangle (to save some space) and calling grayplot()
manyboxplot - a version of box plots for high-dimensional data, just drawing lines at various quantiles, potentially going farther out in the tails.
qqline2 - this is like qqline() but for general qqplot(). (qqline() is only for use with qqnorm().)
excel_fig - this is a tool for making figures that look like Excel files. I used it to make all of the examples in the “Data organization in spreadsheets” paper (Broman and Woo 2018).
venn - make Venn diagrams (as circles or squares) where the sizes are “to scale”.
histlines - this is a tool for using lines() for making a histogram, so that you can show multiple histograms on top of each other.
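
The styling that grayplot() applies (gray background, white gridlines, black box, no tick marks) can be approximated with plain base graphics, as in the sketch below. This is not the package's code; the colors, the simulated data, and the use of pretty() for gridline positions are assumptions made for the illustration.

```r
# Base-graphics sketch of a gray-background scatterplot
# (illustrative only; not the implementation of grayplot()).
set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100)

# Set up the plot region without drawing points or axes
plot(x, y, type = "n", xaxt = "n", yaxt = "n")

# Gray background covering the plot region
usr <- par("usr")
rect(usr[1], usr[3], usr[2], usr[4], col = "gray80", border = NA)

# White gridlines, then axis labels without tick marks
abline(v = pretty(x), h = pretty(y), col = "white")
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)

# Points on top, and a black box around the plot
points(x, y, pch = 21, bg = "slateblue")
box()
```

The order matters: the background and gridlines are drawn first so that the points sit on top of them.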

A favored data visualization: ternary plots

A ternary plot is a way to visualize trinomial distributions, using the fact that in an equilateral triangle, the sum of the distances from any point to the three sides is a constant. They’re super useful for visualizing genotype frequencies. So I have a bunch of base-graphics-based tools for making these plots.

triplot - creates the basic plot, with no contents but labelling the three vertices.
tripoints - add a scatter of points on a ternary plot.
trilines - same input as tripoints(), but draws lines between the points, like the function lines() vs the function points().
triarrow - draw an arrow on a ternary plot.
tritext - add text to a ternary plot.
trigrid - add grid lines to a ternary plot.

Here’s an example: I simulate a set of trinomial distributions and plot the points, with grid lines behind. (The three gridlines are at 0.25, 0.50, and 0.75.)

n <- 30
m <- 200
x <- replicate(n, table(factor(sample(1:3, m, prob=c(0.25, 0.5, 0.25),
                                      replace=TRUE), levels=1:3)))
x <- t(x)/colSums(x)
par(mar=rep(0,4), bty="n")
triplot(c("AA", "AB", "BB"), gridlines=3)
tripoints(t(x), pch=21, bg=crayons("Purple Mountain's Majesty"))
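
The geometry behind a ternary plot can be checked numerically. The sketch below maps trinomial probabilities to (x, y) coordinates in an equilateral triangle with side length 1 and confirms that the distances to the three sides always sum to the triangle's height. The functions tern_xy() and dist_to_sides() are hypothetical, and which probability goes with which vertex is an assumption; the package may use a different layout.

```r
# Hypothetical helper: map rows (p1, p2, p3) to points in a triangle with
# vertices (0, 0), (1, 0), and (0.5, sqrt(3)/2). Assumed vertex order.
tern_xy <- function(p) {
  height <- sqrt(3) / 2
  cbind(x = p[, 2] + p[, 3] / 2, y = p[, 3] * height)
}

# Hypothetical helper: perpendicular distance from each point to each side
dist_to_sides <- function(xy) {
  height <- sqrt(3) / 2
  cbind(bottom = xy[, "y"],
        left   = height * xy[, "x"] - xy[, "y"] / 2,
        right  = height * (1 - xy[, "x"]) - xy[, "y"] / 2)
}

p <- rbind(c(0.25, 0.50, 0.25),
           c(1.00, 0.00, 0.00),
           c(0.10, 0.20, 0.70))
distances <- dist_to_sides(tern_xy(p))
print(distances)
print(rowSums(distances))   # each equals sqrt(3)/2
```

Each distance is proportional to one of the three probabilities, which is why a single point can represent a full trinomial distribution.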

Data visualization helpers

arrowlocator - use locator() to interactively draw an arrow on a plot.
colwalpha - convert a named color to the same color but with alpha to control transparency.
brocolors - a set of color palettes that I use, including brocolors("web") that are a set of basic colors that look nicer than R’s basic colors.
crayons - colors scraped from the Wikipedia entry for Crayola crayons, so you can use fun colors like crayons("Tickle Me Pink") in your graphs. You can use crayons("pink", notexact=TRUE) to find all the ones with “pink” or “Pink” in the name, such as “Shocking Pink”, “Piggy Pink”, and “Pink Sherbert”.
plot_crayons - plot a set of colors from the crayons palette
jiggle - jiggle points vertically. Similar to jitter() but with method="fixed", which gives results similar to the beeswarm package, for closely packed non-overlapping points. Used in my dotplot().
revgray - reverse grayscale, from white to black rather than black to white.
revrainbow - reverse rainbow color palette, going from blue to red. The use of rainbow color palettes in heatmaps is now frowned upon, though.
twocolorpal - set up a color palette that goes from blue to white to red
theme_karl - I don’t do much ggplot, but when I do, this is my theme. Gray background with white gridlines, but with a box around the plot and without the tick marks.
time_axis - set up a time-based axis; used by timeplot()
xlimlabel - figure out what xlim should be for a plot, to make sure that a set of labels will all fit. You give the range of your points, and then the set of labels, and it gives you what xlim should be to get the labels to fit
strwidth2lines - converts strwidth to units of margin lines, to determine how big the margin should be so that your text fits.
strwidth2xlim - seems the same as xlimlabel
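
A few of the color helpers have close base-R analogues, sketched below. These calls are from grDevices, not from R/broman, and the specific colors and palette lengths are assumptions chosen for the example.

```r
# Named color with transparency, in the spirit of colwalpha()
semi_transparent <- adjustcolor("slateblue", alpha.f = 0.5)

# Palette from blue to white to red, in the spirit of twocolorpal()
blue_white_red <- colorRampPalette(c("blue", "white", "red"))(11)

# Grayscale from white to black, in the spirit of revgray()
white_to_black <- gray(seq(1, 0, length.out = 5))

print(semi_transparent)
print(blue_white_red)
print(white_to_black)
```

With an odd number of colors, the blue-to-red palette has pure white at its midpoint, which suits data centered at zero.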

Karl Broman