Fill in all cells. Use some common code for missing data. Not everyone agrees with me (for example, White et al. (2013) state a preference for leaving cells blank), but I’d prefer to have “NA” or even a hyphen in the cells with missing data, to make sure it’s clear that the data are known to be missing rather than unintentionally left blank.
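One way to hold yourself to this rule is to scan a file for blank cells before analyzing it. The following is a minimal Python sketch using only the standard library. The function name `find_blank_cells` and the sample data are made up for illustration, and it assumes the file is a CSV whose first row is the header.

```python
import csv
import io

def find_blank_cells(csv_text):
    """Return (row_number, column_name) for every empty cell.

    Row numbers follow spreadsheet convention: the header is row 1,
    so the first data row is row 2.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    blanks = []
    for row_number, row in enumerate(data, start=2):
        for column, value in zip(header, row):
            if value.strip() == "":
                blanks.append((row_number, column))
    return blanks

# Hypothetical sample: row 3 has a blank date, row 4 uses an explicit "NA".
sample = (
    "id,date,glucose\n"
    "101,2015-06-14,149.3\n"
    "102,,95.3\n"
    "103,2015-06-18,NA\n"
)

print(find_blank_cells(sample))
```

The explicit "NA" is not reported, because it records that the value is known to be missing. Only the cell that was left empty is flagged.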

I also often see cells left blank when a single value is meant to be repeated multiple times. For example, one might put the date in only a few cells, like this:

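|   | A | B | C |
|---|---|---|---|
| 1 | id | date | glucose |
| 2 | 101 | 2015-06-14 | 149.3 |
| 3 | 102 |  | 95.3 |
| 4 | 103 | 2015-06-18 | 97.5 |
| 5 | 104 |  | 117.0 |
| 6 | 105 |  | 108.0 |
| 7 | 106 | 2015-06-20 | 149.0 |
| 8 | 107 |  | 169.4 |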

Don’t do that! If the rows were to be sorted at some point, that date column would be completely mangled.
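To see the damage concretely, here is a small Python sketch, offered as an illustration rather than a recommended workflow. It reads the blank dates the way a human would ("same as the row above") before and after sorting by glucose. The helper name `implied_dates` is hypothetical, and the values are the ones from the example above.

```python
# (id, date, glucose); an empty string stands for a cell left blank.
rows = [
    ("101", "2015-06-14", 149.3),
    ("102", "", 95.3),
    ("103", "2015-06-18", 97.5),
    ("104", "", 117.0),
    ("105", "", 108.0),
    ("106", "2015-06-20", 149.0),
    ("107", "", 169.4),
]

def implied_dates(rows):
    """Map each id to the date a reader would infer: a blank means 'same as above'."""
    result = {}
    current = None
    for row_id, date, _glucose in rows:
        if date:
            current = date
        result[row_id] = current
    return result

before = implied_dates(rows)
after = implied_dates(sorted(rows, key=lambda r: r[2]))  # sort by glucose

# ids whose inferred date is different once the rows have been reordered
changed = sorted(i for i in before if before[i] != after[i])
print(changed)
```

Any id that appears in `changed` has silently acquired a different date, or none at all, just because the rows were reordered.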

It’s much better to fill them all in, like this:

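|   | A | B | C |
|---|---|---|---|
| 1 | id | date | glucose |
| 2 | 101 | 2015-06-14 | 149.3 |
| 3 | 102 | 2015-06-14 | 95.3 |
| 4 | 103 | 2015-06-18 | 97.5 |
| 5 | 104 | 2015-06-18 | 117.0 |
| 6 | 105 | 2015-06-18 | 108.0 |
| 7 | 106 | 2015-06-20 | 149.0 |
| 8 | 107 | 2015-06-20 | 169.4 |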
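If you inherit a file with values left blank in this way, the filling-in can be scripted. Below is a minimal Python sketch; the function name `fill_down` is hypothetical. It assumes that every blank really does mean "same as the row above", which is worth confirming first, since a value that is truly missing should get a code such as "NA" instead.

```python
def fill_down(rows, column):
    """Copy the last non-blank value of `column` into each blank cell below it."""
    filled = []
    last_seen = None
    for row in rows:
        row = dict(row)  # copy, so the input rows are left unmodified
        if row[column] == "":
            if last_seen is None:
                raise ValueError(f"blank {column!r} with no value above it")
            row[column] = last_seen
        else:
            last_seen = row[column]
        filled.append(row)
    return filled

# The first few rows of the example, with an empty string for each blank cell.
rows = [
    {"id": "101", "date": "2015-06-14", "glucose": "149.3"},
    {"id": "102", "date": "", "glucose": "95.3"},
    {"id": "103", "date": "2015-06-18", "glucose": "97.5"},
    {"id": "104", "date": "", "glucose": "117.0"},
]

for row in fill_down(rows, "date"):
    print(row["id"], row["date"], row["glucose"])
```

This must be run while the rows are still in their original order. Once the file has been sorted, the information needed to fill in the blanks is gone.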

I also see this sort of thing for information about different treatments. For example, I recently saw a file like the following:

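|   | A | B | C | D | E | F | G | H | I |
|---|---|---|---|---|---|---|---|---|---|
| 1 |  | 1 min |  |  |  | 5 min |  |  |  |
| 2 | strain | normal |  | mutant |  | normal |  | mutant |  |
| 3 | A | 147 | 139 | 166 | 179 | 334 | 354 | 451 | 474 |
| 4 | B | 246 | 240 | 178 | 172 | 514 | 611 | 412 | 447 |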

We’ll talk more about layout shortly. While a human can more or less see what’s intended above, a computer will have a hard time with it.

You could fill in some of those cells, to make it more clear, but even better would be to make a “tidy” version of the data (more on what is meant by “tidy” later, as part of the discussion of layout), with each row being one replicate, as follows:

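|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 1 | strain | genotype | min | replicate | response |
| 2 | A | normal | 1 | 1 | 147 |
| 3 | A | normal | 1 | 2 | 139 |
| 4 | B | normal | 1 | 1 | 246 |
| 5 | B | normal | 1 | 2 | 240 |
| 6 | A | mutant | 1 | 1 | 166 |
| 7 | A | mutant | 1 | 2 | 179 |
| 8 | B | mutant | 1 | 1 | 178 |
| 9 | B | mutant | 1 | 2 | 172 |
| 10 | A | normal | 5 | 1 | 334 |
| 11 | A | normal | 5 | 2 | 354 |
| 12 | B | normal | 5 | 1 | 514 |
| 13 | B | normal | 5 | 2 | 611 |
| 14 | A | mutant | 5 | 1 | 451 |
| 15 | A | mutant | 5 | 2 | 474 |
| 16 | B | mutant | 5 | 1 | 412 |
| 17 | B | mutant | 5 | 2 | 447 |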

No empty cells!
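The conversion from the wide treatment layout to one row per replicate can also be scripted. The Python sketch below is one possible approach, not the only one. The variable names are made up for illustration, and the meaning of each data column (time, genotype, replicate) is written out by hand, because that is exactly the information the original header left implicit.

```python
# Wide layout: for each strain, the eight responses in spreadsheet column order.
wide = {
    "A": [147, 139, 166, 179, 334, 354, 451, 474],
    "B": [246, 240, 178, 172, 514, 611, 412, 447],
}

# What each of the eight data columns means, spelled out explicitly:
# (minutes, genotype, replicate) for the columns from left to right.
columns = [
    (minutes, genotype, replicate)
    for minutes in (1, 5)
    for genotype in ("normal", "mutant")
    for replicate in (1, 2)
]

# One dictionary per replicate, with every cell filled in.
tidy = [
    {"strain": strain, "genotype": genotype, "min": minutes,
     "replicate": replicate, "response": response}
    for strain, responses in wide.items()
    for (minutes, genotype, replicate), response in zip(columns, responses)
]

# Order the rows by time, genotype (normal first), strain, then replicate.
tidy.sort(key=lambda r: (r["min"], r["genotype"] != "normal",
                         r["strain"], r["replicate"]))

for row in tidy[:4]:
    print(row)
```

Because every row now carries its own strain, genotype, time, and replicate, the rows can be sorted or filtered in any way without losing track of which measurement is which.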


Next up: Put just one thing in a cell.

(Previous: Write dates as YYYY-MM-DD.)