It’s helpful to have a separate file that explains what all of the variables are. It’s super helpful if this is laid out in rectangular form, so that the data analyst can make use of it in analyses.

Such a “data dictionary” might contain:

  • The exact variable name as in the data file.
  • A version of the variable name that might be used in data visualizations.
  • A longer explanation of what the variable means.
  • The measurement units.
  • Expected minimum and maximum values, perhaps.

This is part of the metadata that you’ll want to prepare: information about the data. You’ll also want a ReadMe file that includes an overview of the project and data.

Here’s an example data dictionary.

| name              | plot_name         | group       | description                                            |
|-------------------|-------------------|-------------|--------------------------------------------------------|
| mouse             | Mouse             | demographic | Animal identifier                                      |
| sex               | Sex               | demographic | Male (M) or Female (F)                                 |
| sac_date          | Date of sac       | demographic | Date mouse was sacrificed                              |
| partial_inflation | Partial inflation | clinical    | Indicates if mouse showed partial pancreatic inflation |
| coat_color        | Coat color        | demographic | Coat color, by visual inspection                       |
| crumblers         | Crumblers         | clinical    | Indicates if mouse stored food in their bedding        |
| diet_days         | Days on diet      | clinical    | Number of days on high-fat diet                        |

Note that this is a rectangular data set, like any other. The first column contains the variable names. The second column is a more readable version of each name, as might be used in data visualizations. The third column groups the variables into categories, which might also be used in data visualizations. The last column is a description.
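Because the dictionary is rectangular, it can be read by a program just like the data. Here is a minimal Python sketch, assuming the example dictionary has been saved as CSV text; the function names (`load_dictionary`, `plot_labels`, `variables_by_group`) are made up for illustration, and the text is kept inline only so the sketch is self-contained.

```python
import csv
import io

# The example data dictionary as CSV text. In practice this would live in
# its own file; it is inlined here so the sketch runs on its own.
DICTIONARY_CSV = """\
name,plot_name,group,description
mouse,Mouse,demographic,Animal identifier
sex,Sex,demographic,Male (M) or Female (F)
sac_date,Date of sac,demographic,Date mouse was sacrificed
partial_inflation,Partial inflation,clinical,Indicates if mouse showed partial pancreatic inflation
coat_color,Coat color,demographic,"Coat color, by visual inspection"
crumblers,Crumblers,clinical,Indicates if mouse stored food in their bedding
diet_days,Days on diet,clinical,Number of days on high-fat diet
"""


def load_dictionary(text):
    """Parse the dictionary into a list of dicts, one per variable."""
    return list(csv.DictReader(io.StringIO(text)))


def plot_labels(rows):
    """Map each variable name to its readable label for plots."""
    return {row["name"]: row["plot_name"] for row in rows}


def variables_by_group(rows):
    """Collect variable names under each group, in file order."""
    groups = {}
    for row in rows:
        groups.setdefault(row["group"], []).append(row["name"])
    return groups


rows = load_dictionary(DICTIONARY_CSV)
labels = plot_labels(rows)
groups = variables_by_group(rows)

print(labels["diet_days"])   # Days on diet
print(groups["clinical"])    # ['partial_inflation', 'crumblers', 'diet_days']
```

The label lookup could then supply axis titles, and the groups could decide which variables share a panel, without retyping any names by hand.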

Lots of other information could be included. For example, information about the allowed values for the variables would be helpful in identifying data entry errors.
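As a rough sketch of how allowed values might be used to catch data entry errors, the Python below checks some records against a small set of rules. Everything here is hypothetical: the `RULES` structure, the `find_problems` function, the 0 to 365 day range, and the records themselves are made up for illustration and are not part of the example dictionary.

```python
# Hypothetical rules that a data dictionary might carry: allowed values for
# a categorical variable and an expected range for a numeric one.
# The 0 to 365 limits are invented for this sketch.
RULES = {
    "sex": {"allowed": {"M", "F"}},
    "diet_days": {"min": 0, "max": 365},
}


def find_problems(records, rules):
    """Return (row_number, variable, value) for each value that breaks a rule."""
    problems = []
    for row_number, record in enumerate(records, start=1):
        for name, rule in rules.items():
            value = record.get(name)
            if value is None:
                continue  # missing values are not checked in this sketch
            if "allowed" in rule and value not in rule["allowed"]:
                problems.append((row_number, name, value))
            elif "min" in rule and value < rule["min"]:
                problems.append((row_number, name, value))
            elif "max" in rule and value > rule["max"]:
                problems.append((row_number, name, value))
    return problems


# Made-up records, not real data.
records = [
    {"mouse": "m01", "sex": "M", "diet_days": 120},
    {"mouse": "m02", "sex": "f", "diet_days": 118},   # lowercase sex code
    {"mouse": "m03", "sex": "F", "diet_days": 1180},  # likely an extra digit
]

print(find_problems(records, RULES))
# [(2, 'sex', 'f'), (3, 'diet_days', 1180)]
```

Keeping the rules in the dictionary rather than in the analysis code means the same checks can be rerun each time new data are entered.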
