Introduction

Spreadsheets, for all of their mundane rectangularness, have been the subject of angst and controversy for decades. Some writers have admonished that “real programmers don’t use spreadsheets” and that we must “stop that subversive spreadsheet” (Casimir 1992; Chadwick 2003). Others have advised researchers on how to use spreadsheets to improve their productivity (Wagner and Keisler 2006). Amid this debate, spreadsheets have continued to play a significant role in researchers’ workflows, and it is clear that they are a valuable tool that researchers are unlikely to abandon completely.

The dangers of spreadsheets are real, however – so much so that the European Spreadsheet Risks Interest Group keeps a public archive of spreadsheet “horror stories” (http://www.eusprig.org/horror-stories.htm). Many researchers have examined error rates in spreadsheets, and Panko (2008) reported that in 13 audits of real-world spreadsheets, an average of 88% contained errors. Popular spreadsheet programs also make certain types of errors easy to commit and difficult to rectify. Microsoft Excel converts some gene names to dates and stores dates differently between operating systems, which can cause problems in downstream analyses (Zeeberg et al. 2004; Woo 2014). Researchers who use spreadsheets should be aware of these common errors and design spreadsheets that are tidy, consistent, and as resistant to mistakes as possible.

Spreadsheets are often used as a multipurpose tool for data entry, storage, analysis, and visualization. Most spreadsheet programs allow users to perform all of these tasks, but we believe that spreadsheets are best suited to data entry and storage, and that analysis and visualization should happen separately. Analyzing and visualizing data in a separate program, or at least in a separate copy of the data file, reduces the risk of contaminating or destroying the raw data in the spreadsheet.

Murrell (2013) contrasts data that are formatted for humans to view by eye with data that are formatted for a computer. He provides an extended example of computer code to extract data from a set of files with complex arrangements. It is important that data analysts be able to work with such complex data files. But if the initial arrangement of the data files is planned with the computer in mind, the later analysis process becomes much simpler.

In this paper we offer practical recommendations for organizing spreadsheet data in a way that both humans and computer programs can read. By following this advice, researchers will create spreadsheets that are less error-prone, easier for computers to process, and easier to share with collaborators and the public. Spreadsheets that adhere to our recommendations will interface easily with the tidy tools and reproducible methods described elsewhere in this collection and will form the basis of a robust and reproducible analytic workflow.

For an existing dataset whose arrangement could be improved, we recommend against applying tedious and potentially error-prone hand-editing to revise the arrangement. Rather, we hope that the reader might apply these principles when designing the layout for future datasets.

Be consistent

The first rule of data organization is be consistent. Whatever you do, do it consistently. Entering and organizing your data in a consistent way from the start will prevent you and your collaborators from having to spend time harmonizing the data later.

Use consistent codes for categorical variables. For a categorical variable like the sex of a mouse in a genetics study, use a single common value for males (e.g. “male”) and a single common value for females (e.g. “female”). Don’t sometimes write “M”, sometimes “male”, and sometimes “Male”. Pick one and stick to it.

Use a consistent fixed code for any missing values. We prefer to have every cell filled in, so that one can distinguish between values that are truly missing and cells that were unintentionally left blank. R users prefer “NA”. You could also use a hyphen. But stick with a single value throughout your data. Definitely don’t use a numeric value like -999 or 999; it is too easy to mistake such a code for a real observation. Also, don’t insert a note in place of the data, explaining why it is missing. Rather, make a separate column with such notes.
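
As a sketch of why a single missing-value code pays off downstream, here is how exported data might be cleaned in Python (the language, file contents, and column names are ours, purely for illustration):

```python
import csv
import io

# Hypothetical CSV export in which "NA" is the agreed-upon missing code.
raw = """id,glucose,weight
101,NA,22.5
102,140,NA
"""

# Read the data and turn the missing code into Python's None, so that
# downstream code cannot mistake it for a real measurement.
rows = [
    {key: (None if value == "NA" else value) for key, value in row.items()}
    for row in csv.DictReader(io.StringIO(raw))
]

assert rows[0]["glucose"] is None
assert rows[1]["weight"] is None
```

Because the code is consistent, one line of cleaning handles every column; with a mixture of “NA”, blanks, and -999, each column would need its own special handling.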

Use consistent variable names. If in one file (say the first batch of subjects), you have a variable called “Glucose_10wk”, then call it exactly that in other files (say for other batches of subjects). If it is variably called “Glucose_10wk”, “gluc_10weeks”, and “10 week glucose”, then downstream the data analyst will have to work out that these are all really the same thing.

Use consistent subject identifiers. If sometimes it is “153” and sometimes “mouse153” and sometimes “mouse-153F” and sometimes “Mouse153”, there is going to be extra work to figure out who is who.

Use a consistent data layout in multiple files. If your data are in multiple files and you use different layouts in different files, it will be extra work for the analyst to combine the files into one dataset for analysis. With a consistent structure, it will be easy to automate this process.
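
To illustrate the payoff, here is a minimal Python sketch of combining consistently laid-out batch files into one dataset (the `combine_batches` helper and the file-naming pattern are hypothetical):

```python
import csv
import glob

def combine_batches(pattern):
    """Stack every CSV matching `pattern` into one list of rows.

    This only works this simply because each batch file uses the
    same column names and the same layout.
    """
    combined = []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                row["source_file"] = path  # keep the provenance of each row
                combined.append(row)
    return combined
```

With inconsistent layouts, each file would instead need hand-written, file-specific code.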

Use consistent file names. Have some system for naming files. If one file is called “Serum_batch1_2015-01-30.csv”, then don’t call the file for the next batch “batch2_serum_52915.csv” but rather use “Serum_batch2_2015-05-29.csv”. Keeping a consistent file naming scheme will help ensure that your files remain well organized, and it will make it easier to batch process the files if you need to.

Use a consistent format for all dates, preferably with the standard format YYYY-MM-DD, for example 2015-08-01. If sometimes you write 8/1/2015 and sometimes 8-1-15, it will be more difficult to use the dates in analyses or data visualizations.
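
One advantage of YYYY-MM-DD is that dates written this way sort correctly even when treated as plain text, and they parse unambiguously. A small Python illustration:

```python
from datetime import date

# YYYY-MM-DD strings sort chronologically as plain text, because the
# largest unit (the year) comes first.
dates = ["2015-08-01", "2014-12-31", "2015-01-30"]
assert sorted(dates) == ["2014-12-31", "2015-01-30", "2015-08-01"]

# They also parse unambiguously with the standard library:
# there is no question of whether 8/1 means August 1 or January 8.
d = date.fromisoformat("2015-08-01")
print(d.year, d.month, d.day)  # prints: 2015 8 1
```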

Use consistent phrases in your notes. If you have a separate column of notes (for example, “dead” or “lo off curve”), be consistent in what you write. Don’t sometimes write “dead” and sometimes “Dead”, or sometimes “lo off curve” and sometimes “off curve lo”.

Be careful about extra spaces within cells. A blank cell is different from a cell that contains a single space. And “male” is different from “ male ” (that is, with spaces at the beginning and end).
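
A quick Python illustration of the problem, and of the usual defensive fix (stripping whitespace during data cleaning):

```python
# Leading/trailing spaces make visually identical values unequal.
assert " male " != "male"

# When cleaning data, strip whitespace from every cell defensively.
cells = ["male", " male ", "female "]
cleaned = [cell.strip() for cell in cells]
assert cleaned == ["male", "male", "female"]
```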

Choose good names for things

It is important to pick good names for things. This can be hard, and so it is worth putting some time and thought into it.

As a general rule, don’t use spaces, either in variable names (that is, the names of the columns in your data) or in file names. They make programming harder: the analyst will need to surround everything in double quotes, like "glucose 6 weeks", rather than just writing glucose_6_weeks. Where you might use spaces, use underscores or perhaps hyphens. But don’t use a mixture of underscores and hyphens; pick one and be consistent.

Be careful about extraneous spaces (say, at the beginning or end of a variable name). “glucose” is different from “glucose ” (with an extra space at the end).

Avoid special characters, too. (Except for underscores and hyphens; they are okay.) Other symbols ($, @, %, #, &, *, (, ), !, /, etc.) often have special meaning in programming languages, and so they can be harder to handle. They are also a bit harder to type.

The main principle in choosing names, whether for variables or for file names, is short, but meaningful. So not too short.

The Data Carpentry lesson on using spreadsheets (see http://www.datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes) has a nice table with good and bad example variable names:

good name         good alternative   avoid
Max_temp_C        MaxTemp            Maximum Temp (°C)
Precipitation_mm  Precipitation      precmm
Mean_year_growth  MeanYearGrowth     Mean growth/year
sex               sex                M/F
weight            weight             w.
cell_type         CellType           Cell type
Observation_01    first_observation  1st Obs.

We agree with all of this, though we would maybe cut down on some of the capitalization. So maybe max_temp, precipitation, and mean_year_growth.

Finally, never include “final” in a file name. You will invariably end up with “final_ver2”. (We can’t say that without referring to the widely-cited PhD comic, http://bit.ly/phdcom_final.)

Write dates like YYYY-MM-DD

When entering dates, we strongly recommend using the global “ISO 8601” standard, YYYY-MM-DD, such as 2013-02-27. (See the related xkcd comic, https://xkcd.com/1179.)

But note that Microsoft Excel does bizarre things with dates (see https://storify.com/kara_woo/excel-date-system-fiasco). It stores them internally as a number, counting the days since 1900-01-01. Wait, that is only for Windows; on Macs, it counts the days since 1904-01-01. So, you may need to manually check that the dates haven’t been mangled when your data come out of Excel.

Excel also has a tendency to turn other things into dates. For example, Roger Peng reported on a conversation between Kasper Hansen and Jeff Leek (http://bit.ly/twitter_oct4a, http://bit.ly/twitter_oct4b):

Kasper: Do you have a favorite transcription factor?

Jeff: Yes, Oct-4

Kasper: ?

Jeff: Oct-4: because Excel turns it into a date and it actually has a cool function.

On this point, Ziemann et al. (2016) studied gene lists contained within the supplementary files from 18 journals for the years 2005-2015, and found that ~20% of gene lists had errors in the gene names, related to the conversion of gene symbols to dates or floating-point numbers.

We often prefer to use a plain text format for the columns in an Excel worksheet that will contain dates, so that Excel doesn’t do anything to them.

However, if you do this on columns that already contain dates, Excel will convert them to a text value of their underlying numeric representation (i.e. days since 1900-01-01 or 1904-01-01).

Another way to force Excel to treat dates as text is to begin the date with an apostrophe, like this: '2014-06-14 (see http://bit.ly/twitter_apos). Excel will treat the cells as text, but the apostrophe will not appear when you view the spreadsheet or export it to other formats. This is a handy trick, but it requires impeccable diligence and consistency. Alternatively, you could create three separate columns with year, month, and day. Those will be ordinary numbers, and so Excel won’t mess them up. Finally, you could represent dates as an 8-digit integer of the form YYYYMMDD, for example 20140614 for 2014-06-14 (see Briney 2017).
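
Both of these alternatives are easy to convert back into proper dates during analysis. A Python sketch (the column names are hypothetical):

```python
from datetime import date

# Reassemble a date from separate year/month/day columns...
row = {"year": 2014, "month": 6, "day": 14}
d1 = date(row["year"], row["month"], row["day"])

# ...or unpack the 8-digit YYYYMMDD integer form with integer arithmetic.
raw = 20140614
d2 = date(raw // 10000, (raw // 100) % 100, raw % 100)

assert d1 == d2 == date(2014, 6, 14)
```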

But the point we most want to emphasize here: be consistent in the way in which you write dates. And really, always use the YYYY-MM-DD format (or put the year, month, and day in separate columns, if you want).

Figure 1 displays a portion of a spreadsheet that we got from a collaborator. We don’t quite remember what those e’s were for, but in any case having different date formats within a column makes it more difficult to use the dates in later analyses or data visualizations.

Use care about dates, and be consistent.

A spreadsheet with inconsistent date formats. This spreadsheet does not adhere to our recommendations for consistency of date format.

No empty cells

Fill in all cells. Use some common code for missing data. Not everyone agrees with us on this point (for example, White et al. (2013) state a preference for leaving cells blank), but we would prefer to have “NA” or even a hyphen in the cells with missing data, to make it clear that the data are known to be missing rather than unintentionally left blank.

Figure 2 contains two examples of spreadsheets with some empty cells. In Figure 2A, cells were left blank when a single value was meant to be repeated multiple times. Please don’t do this! It is additional work for the analyst to determine the implicit values for these cells. Moreover, if the rows are sorted at some point there may be no way to recover the dates that belong in the empty cells.

Examples of spreadsheets that violate the ‘no empty cells’ recommendation. A: A spreadsheet where only the first of several repeated values was included. B: A spreadsheet with a complicated layout and some implicit column headers. For a tidy version of this data, see Figure 3.

The spreadsheet in Figure 2B has a complex layout with information for different treatments. It is perhaps clear that columns B-E all concern the “1 min” treatment, and columns F-I all concern “5 min”, and that columns B, C, F, and G all concern “normal”, while columns D, E, H, and I concern “mutant”. But while it may be easy to see by eye, it can be hard to deal with this in later analyses.

You could fill in some of those cells, to make it more clear. Alternatively, make a “tidy” version of the data (Wickham 2014), with each row being one replicate and with the response values all in one column, as in Figure 3. We will discuss this further, below.
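
As a sketch of what such tidying involves, here is a Python loop that converts a wide layout with compound column names into one row per measurement (the column names and values are invented for illustration):

```python
# Hypothetical wide layout: one column per treatment/genotype combination,
# loosely mirroring the complex spreadsheet described above.
wide = {
    "1min_normal": [150, 152],
    "1min_mutant": [144, 147],
    "5min_normal": [110, 112],
}

# Tidy version: one row per replicate, with the conditions as columns
# and all response values in a single column.
tidy = []
for column, values in wide.items():
    treatment, genotype = column.split("_")
    for replicate, response in enumerate(values, start=1):
        tidy.append({
            "treatment": treatment,
            "genotype": genotype,
            "replicate": replicate,
            "response": response,
        })

assert tidy[0] == {"treatment": "1min", "genotype": "normal",
                   "replicate": 1, "response": 150}
```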

A tidy version of the data in Figure 2B.

Put just one thing in a cell

The cells in your spreadsheet should each contain one piece of data. Don’t put more than one thing in a cell.

For example, you might have a column with “plate position” as “plate-well”, such as “13-A01”. It would be better to separate this into “plate” and “well” columns (containing “13” and “A01”), or even “plate”, “well_row”, and “well_column” (containing “13”, “A”, and “1”).
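
Such a split is a one-liner in most scripting languages; a Python illustration:

```python
# Split a combined "plate-well" value like "13-A01" into its parts.
cell = "13-A01"
plate, well = cell.split("-")
well_row, well_column = well[0], int(well[1:])

assert (plate, well_row, well_column) == ("13", "A", 1)
```

Going the other direction (recombining the pieces for display) is just as easy, which is why storing the parts separately loses nothing.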

Or you might be tempted to include units, such as “45 g”. It is better to write 45 and put the units in the column name, such as body_weight_g. It is even better to leave the column as body_weight and put the units in a separate data dictionary (see below).

Another common situation is to include a note within a cell, with the data, like “0 (below threshold)”. Instead, write “0” and include a separate column with such notes.
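
Separating the value from the note is easy to automate when cells follow a predictable pattern; a hedged Python sketch (the pattern and cell contents are our invention):

```python
import re

# A hypothetical cell mixing a data value with a free-text note.
cell = "0 (below threshold)"

# Pull the number into a data column and the note into its own column.
m = re.match(r"\s*([-\d.]+)\s*(?:\((.*)\))?\s*$", cell)
value, note = m.group(1), m.group(2) or ""

assert (value, note) == ("0", "below threshold")
```

If the value and the note had been entered in separate columns to begin with, no such parsing would be needed.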

Finally, don’t merge cells. It might look pretty, but you end up breaking the rule of no empty cells.

Make it a rectangle

The best layout for your data within a spreadsheet is as a single big rectangle with rows corresponding to subjects and columns corresponding to variables. The first row should contain variable names. (Please don’t use more than one row for the variable names.) An example of a rectangular layout is shown in Figure 4.

An example spreadsheet with a rectangular layout. This layout will aid future analyses.

Some datasets won’t fit nicely into a single rectangle, but they will usually fit into a set of rectangles, in which case you can make a set of Excel files, each with a rectangle of data. It is best to keep each rectangle in its own file; tables scattered around a worksheet are difficult to work with, and they make it hard to export the data to CSV files (which we will discuss shortly). You might also consider having a single Excel file with multiple worksheets. We prefer to have multiple files with one sheet each so we can more easily save the data as CSV files, but if you do use multiple worksheets in a file be sure to use a consistent structure.

Some data don’t even fit into a set of rectangles, but then maybe spreadsheets are not the best format for them, as spreadsheets are inherently rectangular.

The data files that we receive are usually not in rectangular form. More often, there seem to be bits of data sprinkled about. Several examples are shown in Figure 5.