data organization organizing data in spreadsheets
My collaborators sometimes ask me, “In what form would you like the data?” My response is always, “In its current form!” If the data need to be reformatted, it’s much better for me to write a script than for them to do a bunch of cut-and-paste. I’m a strong proponent of data analysts being able to handle any data files they might receive.
But in many cases, I have to spend a lot of time writing scripts to rearrange the layout of the data. And how would you like your data analysts to spend their time? Reorganizing data, or really analyzing data?
Most of my collaborators enter and store their data in spreadsheets, and mostly Microsoft Excel. Before starting to enter data into a spreadsheet, it’s good to spend some time thinking about the layout. The way that you organize the data in spreadsheets can have a big impact on your data analyst’s quality of life.
This is a tutorial on that topic: how to organize data in spreadsheets. For complex, high-dimensional data, it may be better to use a formal database. But for many projects, spreadsheets are perfectly fine. But data in spreadsheets can be pretty and easy to work with, or they can be a sloppy mess requiring serious downstream reorganization efforts. We want to avoid the latter.
I don’t think these ideas come naturally to anyone. So if you’re not happy with the structure of your current data files, don’t despair! And also don’t apply tedious and potentially error-prone hand-editing to revise the arrangement. Rather, apply these principles when designing the layout for your next dataset, to help make analyses easier.
- Be consistent.
- Write dates as
YYYY-MM-DD
. - Fill in all of the cells.
- Put just one thing in a cell.
- Make it a rectangle.
- Create a data dictionary.
- No calculations in the raw data files.
- Don’t use font color or highlighting as data.
- Choose good names for things.
- Make backups.
- Use data validation to avoid data entry mistakes.
- Save the data in plain text files.
- Other resources
In collaboration with Kara Woo, this website is now a proper article:
Broman KW, Woo KH (2018) Data organization in spreadsheets. The American Statistician 78:2–10 (doi:10/gdz6cm)
The article is also available at GitHub and PeerJ Preprints, and note that it’s part of a full Data Science issue of The American Statistician, organized by Jenny Bryan and Hadley Wickham, which itself is also available at PeerJ Preprints.
The source for this tutorial is on github. If you have suggestions or corrections, please submit an issue.
Also see my tutorials on git/github, knitr, GNU make, R packages, making a web site with GitHub Pages, and reproducible research.