The best layout for your data within in a spreadsheet is as a big rectangle with rows corresponding to subjects and columns corresponding to variables. The first row should contain variable names. (Please don’t use more than one row for the variable names.) Here’s an example:

215.7 0.73 105.2 Female 105 6 142.4 1.16 83.1 Male 104 5 297.6 1.23 124.8 Male 103 4 243.6 1.18 120.0 Female 102 3 273.4 0.60 134.1 Male 101 2 triglyc insulin glucose sex id 1 E D C B A

Some data sets won’t fit nicely into a rectangle, but they’ll usually fit into a set of rectangles, in which case you can make a set of Excel files, each with a rectangle of data. Some data don’t even fit into a set of rectangles, but then maybe spreadsheets are not the best format for them, as spreadsheets seem inherently rectangular.

The data files that I receive are usually not in rectangular form. More often, there seem to be bits of data sprinkled about. Like this:

0.73 1.16 1.23 1.18 0.60 insulin 9 105 104 103 102 101 8 7 105.2 83.1 124.8 120.0 134.1 glucose 6 105 104 103 102 101 5 4 Female Male Male Female Male sex 3 105 104 103 102 101 2 1 F E D C B A

Or maybe like this:

446.5 447.4 412.1 597.9 610.6 514.4 BTBR 9 423.8 474.4 450.6 408.8 353.6 333.6 B6 8 Mutant Normal 7 5MIN 6 5 188.1 171.6 177.8 243.1 240 245.7 BTBR 4 186.9 179.3 166 155.6 138.6 146.6 B6 3 Mutant Normal 2 1MIN 1 G F E D C B A

In both of these cases, the data analyst has to study the layout, figure out what everything means, and then spend some time rearranging things.

Another recent example: a separate worksheet for each subject, each in a complicated format like this:

2.17 20.83 23.24 19.05 20.19 treatment B 13 6.65 7.64 4.64 3.02 15.26 treatment A 12 SD mean values fold change 11 10 1.05 10.12 11.296 9.259 9.811 treatment B 9 3.23 3.71 2.254 1.468 7.414 treatment A 8 0.52 0.49 1.081 0.191 0.186 control 7 SD mean values experiment 6 f sex 5 43 Mouse # 4 126 Days on diet 3 11/3/14 Date 2 1 G F E D C B A

If all of the worksheets have exactly the same layout, then it’s not too hard to pull out the relevant information and combine it into a rectangle. (I’d write a python or ruby script.) But I prefer to not have means and SDs and fold change calculations cluttering things up, and it seems that even for data entry, it’d be easier to have all of the measurements on one worksheet.

Sometimes it’s hard to see how to reorganize things as a rectangle. Consider this example:

off curve lo 530.6 5 15 0.151 198.5 0 24.7 2/9/15 323 14 0.523 260.7 120 13 0.5 232.7 60 12 0.775 362.3 30 11 2.078 439 15 10 2.228 297.4 5 9 0.251 185.8 0 18.9 2/9/15 322 8 lo off curve 217.9 120 7 0.122 99.9 60 6 0.175 312 30 5 0.129 286.1 15 4 0.205 349.3 5 3 lo off curve 99.2 0 24.5 2/9/15 321 2 insulin ng/ml glucose mg/dl time GTT weight GTT date 1 F E D C B A

Well, it’s sort of a rectangle; we could just fill in the empty cells, like this:

off curve lo 530.6 5 24.7 2/9/15 323 15 0.151 198.5 0 24.7 2/9/15 323 14 0.523 260.7 120 18.9 2/9/15 322 13 0.5 232.7 60 18.9 2/9/15 322 12 0.775 362.3 30 18.9 2/9/15 322 11 2.078 439 15 18.9 2/9/15 322 10 2.228 297.4 5 18.9 2/9/15 322 9 0.251 185.8 0 18.9 2/9/15 322 8 lo off curve 217.9 120 24.5 2/9/15 321 7 0.122 99.9 60 24.5 2/9/15 321 6 0.175 312 30 24.5 2/9/15 321 5 0.129 286.1 15 24.5 2/9/15 321 4 0.205 349.3 5 24.5 2/9/15 321 3 lo off curve 99.2 0 24.5 2/9/15 321 2 insulin ng/ml glucose mg/dl time GTT weight GTT date id 1 F E D C B A

But it seems wrong to repeat the weights like that. It’s better to make two separate tables, one with the weights, and one with these other measurements (which are for an in vivo assay, the glucose tolerance test: give a mouse some glucose and measure serum glucose and insulin levels at different times afterwards).

So you’d have one table with the weights, like this:

24.7 2/9/15 323 4 18.9 2/9/15 322 3 24.5 2/9/15 321 2 GTT weight GTT date id 1 C B A

And then another rectangle with the GTT results, like this:

off curve lo 530.6 5 323 15 0.151 198.5 0 323 14 0.523 260.7 120 322 13 0.5 232.7 60 322 12 0.775 362.3 30 322 11 2.078 439 15 322 10 2.228 297.4 5 322 9 0.251 185.8 0 322 8 lo off curve 217.9 120 321 7 0.122 99.9 60 321 6 0.175 312 30 321 5 0.129 286.1 15 321 4 0.205 349.3 5 321 3 lo off curve 99.2 0 321 2 insulin ng/ml glucose mg/dl GTT time id 1 D C B A

I can’t leave those “lo off curve” and “off curve lo” notes in there, though. Best to put “NA” and add a “note” column.

insulin below curve NA 530.6 5 323 15 0.151 198.5 0 323 14 0.523 260.7 120 322 13 0.5 232.7 60 322 12 0.775 362.3 30 322 11 2.078 439 15 322 10 2.228 297.4 5 322 9 0.251 185.8 0 322 8 insulin below curve NA 217.9 120 321 7 0.122 99.9 60 321 6 0.175 312 30 321 5 0.129 286.1 15 321 4 0.205 349.3 5 321 3 insulin below curve NA 99.2 0 321 2 note insulin ng/ml glucose mg/dl GTT time id 1 E D C B A

This is an example of what’s called “tidy” data (see the paper by Hadley Wickham): each row is an experimental unit, which is usually just a subject but in this case is a single assay measurement on a subject. Reorganizing the data into a “tidy” format can simplify later analysis. But the rectangular aspect is the most important part.

Another issue I often see is the use of two rows of header names, like this:

446 43.6 2/2/2007 556.9 36.6 1/19/2007 220.2 19.8 1/5/2007 F 3499 7 182.5 56.7 2/2/2007 191.3 42.9 1/19/2007 121 27.5 1/5/2007 M 3449 6 409.8 56.2 12/22/2006 378 45.9 12/6/2006 238.9 26.6 11/22/2006 F 3434 5 458.7 57.2 11/3/2006 384.7 45.1 10/19/2006 202.4 25.9 10/6/2006 M 3017 4 530.2 39.6 4/27/2007 460.7 31 4/11/2007 635 19.3 3/30/2007 M 3005 3 glucose weight date glucose weight date glucose weight date SEX Mouse ID 2 week 8 week 6 week 4 1 K J I H G F E D C B A

This is the sort of situation where I’ll see merged cells: merging the “week 4” cell with the two cells following, so that the text is centered above the three columns with “date”, “weight”, and “glucose”.

I’d prefer to have the week information within the variable name, like this:

446 43.6 2/2/2007 556.9 36.6 1/19/2007 220.2 19.8 1/5/2007 F 3499 6 182.5 56.7 2/2/2007 191.3 42.9 1/19/2007 121 27.5 1/5/2007 M 3449 5 409.8 56.2 12/22/2006 378 45.9 12/6/2006 238.9 26.6 11/22/2006 F 3434 4 458.7 57.2 11/3/2006 384.7 45.1 10/19/2006 202.4 25.9 10/6/2006 M 3017 3 530.2 39.6 4/27/2007 460.7 31 4/11/2007 635 19.3 3/30/2007 M 3005 2 glucose_8 weight_8 date_8 glucose_6 weight_6 date_6 glucose_4 weight_4 date_4 SEX Mouse ID 1 K J I H G F E D C B A

Even better would be to make it a “tidy” data set with each row being a subject on a particular day, like this:

446 43.6 2/2/2007 8 F 3499 16 556.9 36.6 1/19/2007 6 F 3499 15 220.2 19.8 1/5/2007 4 F 3499 14 182.5 56.7 2/2/2007 8 M 3449 13 191.3 42.9 1/19/2007 6 M 3449 12 121 27.5 1/5/2007 4 M 3449 11 409.8 56.2 12/22/2006 8 F 3434 10 378 45.9 12/6/2006 6 F 3434 9 238.9 26.6 11/22/2006 4 F 3434 8 458.7 57.2 11/3/2006 8 M 3017 7 384.7 45.1 10/19/2006 6 M 3017 6 202.4 25.9 10/6/2006 4 M 3017 5 530.2 39.6 4/27/2007 8 M 3005 4 460.7 31 4/11/2007 6 M 3005 3 635 19.3 3/30/2007 4 M 3005 2 weight glucose date week sex mouse_id 1 F E D C B A

Make it a rectangle! (Or, if necessary, make it a set of rectangles, but that doesn’t have quite the same punch.)


Next up: Create a data dictionary.

(Previous: Put just one thing in a cell.)