R/broman is an R package with miscellaneous R functions that are useful to me. It includes many functions related to graphics (mostly for base graphics), permutation tests, running mean/median/sum/ratio, and a variety of utilities for programming, for data diagnostics/cleaning, and for writing reproducible data analysis reports.
It has gotten rather large, and the functions do a lot of different, unrelated things, so my purpose here is to organize them and explain and perhaps illustrate their use.
There are a bunch of functions to help while writing data analysis reports.
add_commas
- This adds commas to large numbers so they are easier to read. If
x is 150000, add_commas(x) becomes
150,000.
myround
- I’m very particular about sig figs, so I may want
myround(3.10, 2) to appear as 3.10 and not 3.1.
spell_out
- Spell out an integer as a word; values <10 become words and values
>=10 remain in numerals. So if x is 5,
spell_out(x) becomes “five” while if it were 10 it would
remain “10”.
numbers
- This is just a vector of the first twenty positive integers as words,
for a similar use as spell_out().
vec2string
- Take a vector of numbers and turn it into text; for example,
vec2string(problem_ids) might become
123, 125, 128, and 129.
kbdate
- Shows today’s date in a particular format. It seems unnecessary. I had
used it for date stamps, but it seems to give the same output as
Sys.Date(). I think maybe I knew about
Sys.time() but not about Sys.Date() so I made
my own.
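The first few helpers in this list are easy to approximate in base R, which may help clarify what they return. The following is a rough sketch only, not the package's implementation: `vec_to_text` is a hypothetical name, and the base calls (`formatC()`, `sprintf()`) are stand-ins for add_commas() and myround().

```r
# Base-R approximations of three reporting helpers.
# These are illustrative stand-ins, not R/broman's own code.

# Thousands separators, similar in spirit to add_commas()
with_commas <- formatC(150000, format = "d", big.mark = ",")

# Fixed number of decimal places as a string, similar in spirit to myround()
fixed_digits <- sprintf("%.2f", 3.10)

# Hypothetical helper: turn a vector into "a, b, c, and d"
vec_to_text <- function(x) {
  n <- length(x)
  if (n == 0) return("")
  if (n == 1) return(as.character(x))
  if (n == 2) return(paste(x, collapse = " and "))
  paste0(paste(x[-n], collapse = ", "), ", and ", x[n])
}

with_commas                              # "150,000"
fixed_digits                             # "3.10"
vec_to_text(c(123, 125, 128, 129))       # "123, 125, 128, and 129"
```

Returning strings rather than numbers is the point here: a string keeps its trailing zeros and separators when it is dropped into report text.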
chisq
- This is like chisq.test() but using Monte Carlo to get an
approximate p-value rather than relying on asymptotics.
fisher
- This is like fisher.test() in doing Fisher’s exact test
but getting the p-value by Monte Carlo rather than doing an exhaustive
calculation.
perm.test
- This does a two-sample permutation test using the t statistic (via
t.test()), using Monte Carlo to get the p-value rather than
referring to the t distribution.
paired.perm.test
- This is similar to perm.test but for paired data.
quantileSE
- This is like the function quantile() but it also gives an
estimated standard error.
runningmean
- For a set of values at a given set of positions, this calculates a
running mean, sum, or median, with a fixed window width.
runningratio
- Like runningmean but with a pair of values at each
position, and then we calculate the ratio of the sums within fixed
windows.
runningratio2
- Like runningratio but instead of a fixed window, it
adapts the size of the window to achieve some minimum number of points
before taking the ratio.
rmvn
- Simulate data from a multivariate normal distribution.
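To make the idea behind the permutation tests above concrete, here is a minimal base-R sketch of a two-sample permutation test on the t statistic. It assumes a two-sided test and random (Monte Carlo) relabelling; `perm_test_sketch` and its arguments are hypothetical names, and the package's perm.test() may differ in its options and details.

```r
# Two-sample permutation test using the t statistic (illustrative sketch,
# not R/broman's implementation).
perm_test_sketch <- function(x, y, n_perm = 1000) {
  observed <- unname(t.test(x, y)$statistic)
  pooled <- c(x, y)
  n_x <- length(x)

  # Shuffle the group labels and recompute the statistic each time
  permuted <- replicate(n_perm, {
    idx <- sample(length(pooled), n_x)
    unname(t.test(pooled[idx], pooled[-idx])$statistic)
  })

  # Two-sided Monte Carlo p-value: share of permuted statistics
  # at least as extreme as the observed one
  mean(abs(permuted) >= abs(observed))
}

set.seed(20240101)
x <- rnorm(20)
y <- rnorm(20, mean = 3)
p_value <- perm_test_sketch(x, y, n_perm = 500)
p_value
```

The p-value comes from the permutation distribution itself, so its resolution is limited by `n_perm`; with 500 permutations it cannot be resolved more finely than 1/500.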
align_vectors
- align two vectors by their names, either expanding them with NAs (if
there are names that are unique to one or the other) or reducing the
vectors to the names that appear in both.
pick_more_precise
- for a quirky situation where two columns of numbers were subjected to
weird rounding patterns and I needed to merge them. I hope I don’t need
this again.
get_precision
- used by pick_more_precise()
compare_rows
- for all pairs of rows in a matrix, calculate either the proportion of
differences or the RMS difference, creating a distance matrix.
cf
- useful for verifying that data columns match when they’re supposed to.
It checks whether they are the same, including the pattern of missing
data. I use this a lot when doing data diagnostics/cleaning.
winsorize
- a way to deal with outliers in a numeric vector: move some proportion
of the most extreme values in to the next-most-extreme values.
normalize
- quantile-normalize a set of columns of a matrix (or just two vectors),
to force them to have the same marginal distribution
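Two of the data-cleaning ideas above are short enough to sketch in base R. The code below is an approximation under stated assumptions, not the package's code: `same_with_na` is a hypothetical stand-in for the kind of check cf() performs, and `winsorize_sketch` clamps values to quantiles, which is one common way to winsorize and may differ from how winsorize() picks its cutoffs.

```r
# Are two vectors the same, including where the missing values fall?
# (hypothetical helper, illustrative only)
same_with_na <- function(a, b) {
  length(a) == length(b) &&
    all(is.na(a) == is.na(b)) &&
    all(a == b, na.rm = TRUE)
}

# Clamp the lowest and highest proportion q of values to the
# q and 1 - q quantiles (one variant of winsorizing)
winsorize_sketch <- function(x, q = 0.05) {
  lo <- quantile(x, q, na.rm = TRUE, names = FALSE)
  hi <- quantile(x, 1 - q, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, lo), hi)
}

same_with_na(c(1, NA, 3), c(1, NA, 3))   # TRUE
same_with_na(c(1, NA, 3), c(1, 2, 3))    # FALSE: missing-data pattern differs

x <- c(1:19, 1000)                       # one wild outlier
w <- winsorize_sketch(x, q = 0.05)
range(w)
```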
h
- open the help document for a function/object in your web browser. I’m
usually using R within emacs, and if I type help(h), the
help for the h function shows up within emacs. So I use
h(h) to see it within my web browser.
make
- this runs GNU make
within an R package directory, in a similar way to the various tools in
devtools like
test() and document().
exit
- exit from R without saving .RData
objectsizes
- get the sizes of all objects in memory, in MB
openfile
- open a file using command line open (Mac/Windows) or
start (Linux).
qr2
- the base function qr does a QR decomposition, but the
results it produces are not what you might expect. This function gets
you the basic Q and R pieces.
simp
- Perform numerical integration by Simpson’s rule.
trap
- Perform numerical integration by the trapezoidal rule.
trap() and simp() are much like
integrate() but coded directly in R.
%nin%,
%win%,
and %wnin%
- compare to the recent addition to base R,
%notin%.
attrnames
- get the names of all attributes of an object.
convert2hex
dec2hex
- convert a base 10 number to hexadecimal (mostly for working with RGB
color codes).
hex2dec
- convert a hexadecimal number to base 10.
fac2num
- convert a factor with numeric levels to numeric
lenuniq
- the number of unique values in a vector, by just calling
length(unique(x))
maxabs
- maximum of absolute values of a vector, removing NAs
paste00
- like paste0 but also using
collapse=""
paste.
- like paste0 but with sep="."
setRNGparallel
- set up random number generation for use in parallel calculations, and
then unsetRNGparallel
to go back to the standard RNG method.
switchv
- a vectorized version of the base function
switch.
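For the two numerical integration rules mentioned above under simp and trap, a compact base-R sketch may help. It assumes the integrand is passed as a vectorized function and evaluated on an equally spaced grid; `trap_sketch`, `simp_sketch`, and their arguments are hypothetical, and the interfaces of the package's trap() and simp() may well differ.

```r
# Composite trapezoidal rule on n equal-width intervals
# (illustrative sketch, not R/broman's implementation)
trap_sketch <- function(f, a, b, n = 100) {
  x <- seq(a, b, length.out = n + 1)
  y <- f(x)
  h <- (b - a) / n
  h * (sum(y) - (y[1] + y[n + 1]) / 2)
}

# Composite Simpson's rule; n must be even
simp_sketch <- function(f, a, b, n = 100) {
  stopifnot(n >= 2, n %% 2 == 0)
  x <- seq(a, b, length.out = n + 1)
  y <- f(x)
  h <- (b - a) / n
  w <- c(1, rep(c(4, 2), length.out = n - 1), 1)  # weights 1, 4, 2, 4, ..., 4, 1
  h / 3 * sum(w * y)
}

# The integral of sin(x) from 0 to pi is exactly 2
trap_sketch(sin, 0, pi)
simp_sketch(sin, 0, pi)
integrate(sin, 0, pi)$value
```

With the same grid, Simpson's rule is usually much closer to the true value for smooth integrands, at the cost of requiring an even number of intervals.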
grayplot
- a base-graphics-based scatterplot, but with particular styling I
learned from Dan Carr: gray
background, white gridlines, black box, no tick marks
grayplot_na
- a base-graphics-based scatterplot, but showing NA values in the margins
rather than omitting them. Use the force argument to control
whether the boxes for NAs in the margins are included even if the data
have no NAs.
ciplot
- plot a set of confidence intervals, with styling like
grayplot()
dotplot
- a scatterplot where one variable is categorical (so like
stripchart()), with styling like
grayplot()
timeplot
- a scatterplot where one variable is date/time, with styling like
grayplot()
mypairs
- a scatterplot matrix (like pairs()) but with only the
upper triangle (to save some space) and calling
grayplot()
manyboxplot
- a version of box plots for high-dimensional data, just drawing lines
at various quantiles, potentially going farther out in the
tails.
qqline2
- this is like qqline() but for general
qqplot(). (qqline() is only for use with
qqnorm().)
excel_fig
- this is a tool for making figures that look like Excel spreadsheets. I used
it to make all of the examples in the “Data organization in
spreadsheets” paper, Broman and Woo
(2018).
venn
- make Venn diagrams (as circles or squares) where the sizes are “to
scale”.
histlines
- this is a tool for using lines() for making a histogram,
so that you can show multiple histograms on top of each other.
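The idea of drawing a histogram as an outline with lines(), so that several can share one plot, can be sketched in base graphics as follows. This is an illustration only: `outline` is a hypothetical helper, the simulated data and colors are arbitrary, and histlines() itself may take different arguments.

```r
# Overlay two histograms as step outlines (illustrative sketch,
# not R/broman's implementation)
set.seed(1)
a <- rnorm(500)
b <- rnorm(500, mean = 1)

# Shared breaks so the two outlines are comparable
breaks <- pretty(range(c(a, b)), n = 20)
ha <- hist(a, breaks = breaks, plot = FALSE)
hb <- hist(b, breaks = breaks, plot = FALSE)

# Turn bin counts into the corner points of a step outline:
# up from zero, across each bin, and back down to zero
outline <- function(h) {
  list(x = rep(h$breaks, each = 2),
       y = c(0, rep(h$counts, each = 2), 0))
}

plot(0, 0, type = "n", xlim = range(breaks),
     ylim = c(0, max(ha$counts, hb$counts)),
     xlab = "value", ylab = "count")
lines(outline(ha), col = "slateblue", lwd = 2)
lines(outline(hb), col = "violetred", lwd = 2)
```

Because only outlines are drawn, neither histogram hides the other, which is the difficulty with overlaying filled bars.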
A ternary plot is a way to visualize trinomial distributions, using the fact that in an equilateral triangle, the sum of the distances from any point to the three sides is a constant. They’re super useful for visualizing genotype frequencies. So I have a bunch of base-graphics-based tools for making these plots.
triplot
- creates the basic plot, with no contents but labelling the three
vertices.
tripoints
- add a scatter of points on a ternary plot.
trilines
- same input as tripoints(), but draws lines between the
points, like the function lines() vs the function
points().
triarrow
- draw an arrow on a ternary plot.
tritext
- add text to a ternary plot.
trigrid
- add grid lines to a ternary plot.
Here’s an example: I simulate a set of trinomial distributions and plot the points, with grid lines behind. (The three gridlines are at 0.25, 0.50, and 0.75.)
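library(broman)
# simulate n trinomial samples, each of size m
n <- 30
m <- 200
x <- replicate(n, table(factor(sample(1:3, m, prob=c(0.25, 0.5, 0.25),
                                      replace=TRUE), levels=1:3)))
# convert counts to proportions; x is now n x 3, one row per sample
x <- t(x)/colSums(x)
par(mar=rep(0,4), bty="n")
# empty ternary plot with labelled vertices and three gridlines
triplot(c("AA", "AB", "BB"), gridlines=3)
# pass the proportions as a 3 x n matrix
tripoints(t(x), pch=21, bg=crayons("Purple Mountain's Majesty"))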
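The geometric fact behind ternary plots, that the distances from an interior point to the three sides of an equilateral triangle sum to a constant, can be checked directly. The sketch below assumes a triangle of height 1 with the first category at the bottom-left vertex, the second at the top, and the third at the bottom-right; this layout and the name `tri_to_xy` are assumptions for illustration and need not match how triplot() places its vertices.

```r
# Map trinomial probabilities to x/y coordinates in an equilateral
# triangle of height 1 (illustrative sketch, assumed vertex layout)
side <- 2 / sqrt(3)   # side length that gives height 1

tri_to_xy <- function(p) {
  # p: matrix with 3 rows; each column is one trinomial distribution
  p <- as.matrix(p)
  stopifnot(nrow(p) == 3, all(abs(colSums(p) - 1) < 1e-8))
  # weighted average of the vertices (0, 0), (side/2, 1), (side, 0)
  cbind(x = side * (p[2, ] / 2 + p[3, ]), y = p[2, ])
}

p <- cbind(c(0.25, 0.50, 0.25),
           c(1.00, 0.00, 0.00),
           c(0.10, 0.20, 0.70))
xy <- tri_to_xy(p)

# Perpendicular distances to the three sides
d_bottom <- xy[, "y"]                                          # equals p[2, ]
d_left   <- abs(xy[, "x"] - side * xy[, "y"] / 2) / side       # equals p[3, ]
d_right  <- abs(xy[, "x"] + side * xy[, "y"] / 2 - side) / side  # equals p[1, ]

d_bottom + d_left + d_right   # each sum is the triangle's height
```

Each distance is just one of the three probabilities, which is why a point in the triangle encodes a full trinomial distribution.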
arrowlocator
- use locator() to interactively draw an arrow on a
plot.
colwalpha
- convert a named color to the same color but with alpha to control
transparency.
brocolors
- a set of color palettes that I use, including
brocolors("web") that are a set of basic colors that look
nicer than R’s basic colors.
crayons
- colors scraped from the Wikipedia entry for Crayola crayons, so you
can use fun colors like crayons("Tickle Me Pink") in your
graphs. You can use crayons("pink", notexact=TRUE) to find
all the ones with “pink” or “Pink” in the name, such as “Shocking Pink”,
“Piggy Pink”, and “Pink Sherbert”.
plot_crayons
- plot a set of colors from the crayons palette
jiggle
- jiggle points vertically. Similar to jitter() but with
method="fixed", which gives results similar to the beeswarm package, for
closely packed non-overlapping points. Used in my
dotplot().
revgray
- reverse grayscale, from white to black rather than black to
white.
revrainbow
- reverse rainbow color palette, going from blue to red. The use of
rainbow color palettes in heatmaps is now frowned upon, though.
twocolorpal
- set up a color palette that goes from blue to white to red
theme_karl
- I don’t do much ggplot, but when I do, this is my theme. Gray
background with white gridlines, but with a box around the plot and
without the tick marks.
time_axis
- set up a time-based axis; used by timeplot()
xlimlabel
- figure out what xlim should be for a plot, to make sure
that a set of labels will all fit. You give the range of your points,
and then the set of labels, and it gives you what xlim should be to get
the labels to fit
strwidth2lines
- converts strwidth to units of margin lines, to determine
how big the margin should be so that your text fits.
strwidth2xlim
- seems the same as xlimlabel
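Several of the color utilities above have close base-R analogues, sketched below for orientation. These are approximations using grDevices functions, not the package's code: `with_alpha` is a hypothetical helper in the spirit of colwalpha(), and the palette calls only mimic what twocolorpal() and revgray() are described as doing.

```r
# Base-R approximations of some color helpers (illustrative only)

# Add transparency to a named color (hypothetical helper)
with_alpha <- function(col, alpha = 0.5) {
  v <- col2rgb(col)
  rgb(v["red", ], v["green", ], v["blue", ],
      alpha = round(alpha * 255), maxColorValue = 255)
}

# Diverging palette from blue through white to red
blue_white_red <- colorRampPalette(c("blue", "white", "red"))(11)

# Grayscale running from white to black
rev_gray <- rev(gray(seq(0, 1, length.out = 5)))

with_alpha("red", 0.5)
blue_white_red
rev_gray
```

Using an odd number of colors in the diverging palette puts white exactly at the midpoint, which is convenient when zero should be visually neutral.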