Be consistent
The first rule of data organization is to be consistent.
- Use consistent codes for categorical variables. For a categorical variable like sex, use a single common value for males (e.g. “male”) and a single common value for females (e.g. “female”). Don’t sometimes write “M”, sometimes “male”, and sometimes “Male”. Pick one and stick to it.
- Use a single fixed code for any missing values. I prefer to have every cell filled in (more discussion here), so that one can distinguish between truly missing values and unintentionally missing values. R users prefer “NA”. You could also use a hyphen. But stick with a single value throughout your data. Definitely don’t use a numeric value like -999 or 999; it’s easy to miss that it’s intended to be missing. Also, don’t insert a note in place of the data, explaining why it’s missing. Rather, make a separate column with such notes.
- Use consistent variable names. If in one file (say the first batch of subjects), you have a variable called “Glucose_10wk”, then call it exactly that in other files (say for other batches of subjects). If it’s variably called “Glucose_10wk”, “gluc_10weeks”, and “10 week glucose”, then downstream the data analyst will have to work out that these are all really the same thing. (More on naming variables here.)
- Use consistent subject IDs. If sometimes it’s “153” and sometimes “mouse153” and sometimes “mouse-153F” and sometimes “Mouse153”, there’s going to be extra work to figure out who’s who.
- Use a common data layout in multiple files. If your data are in multiple files, use the same layout in all files. (More on layout here.)
- Use consistent file names. Have some system for naming files. If one file is called “Serum_batch1_2015-01-30.csv”, then don’t call the file for the next batch “batch2_serum_52915.csv” but rather use “Serum_batch2_2015-05-29.csv”. (More on naming files here.)
- Use a single common format for all dates, preferably YYYY-MM-DD, like 2015-08-01. If sometimes you write 8/1/2015 and sometimes 8-1-15, you’re asking for trouble. (More on dates next.)
- Use consistent phrases in your notes. If you have a separate column of notes (for example, “dead” or “lo off curve”), be consistent in what you write. Don’t sometimes write “dead” and sometimes “Dead”, or sometimes “lo off curve” and sometimes “off curve lo”.
- Be careful about extra spaces within cells. A blank cell is different than a cell that contains a single space. And “male” is different from “ male ” (that is, with spaces at the beginning and end). These can be a headache later on.
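
Some of these rules can be checked or repaired mechanically when data that were entered inconsistently have to be cleaned up. The following is a minimal Python sketch of that idea, not a definitive implementation. The helper names (`clean_cell`, `normalize_sex`, `normalize_date`), the `SEX_CODES` mapping, and the list of accepted date formats are all hypothetical, and the date handling assumes that slash- and hyphen-separated dates were typed month first.

```python
from datetime import datetime

MISSING = "NA"  # the single fixed code used for missing values

# Hypothetical mapping from variants found in the data to one code each.
SEX_CODES = {"m": "male", "male": "male", "f": "female", "female": "female"}

# Assumed input date formats (month first); adjust to match the real data.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m-%d-%y")


def clean_cell(value):
    """Strip stray spaces and map empty or hyphen cells to the missing code."""
    text = value.strip()
    if text in ("", "-", MISSING):
        return MISSING
    return text


def normalize_sex(value):
    """Return 'male', 'female', or the missing code; reject unknown codes."""
    text = clean_cell(value)
    if text == MISSING:
        return MISSING
    try:
        return SEX_CODES[text.lower()]
    except KeyError:
        raise ValueError(f"unrecognized sex code: {value!r}") from None


def normalize_date(value):
    """Rewrite a date given in any of the assumed formats as YYYY-MM-DD."""
    text = clean_cell(value)
    if text == MISSING:
        return MISSING
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next assumed format
    raise ValueError(f"unrecognized date: {value!r}")


if __name__ == "__main__":
    print(normalize_sex(" Male "))     # variants collapse to one code
    print(normalize_date("8/1/2015"))  # rewritten as YYYY-MM-DD
    print(normalize_date("8-1-15"))
    print(repr(clean_cell(" ")))       # a lone space becomes the missing code
```

Note that the sketch has to guess whether 8/1/2015 means August 1 or January 8. That guess is exactly the kind of downstream work that entering the data consistently in the first place avoids.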
Next up: Write dates as YYYY-MM-DD.