How to clean your data: best practices
When I first heard the term “cleaning data” I thought it was funny – what is that even supposed to mean? I soon learned that cleaning data can be a long and painful process if you're not prepared for it. Hours and hours of changing variable names by hand, clicking on any and all the options in SPSS to get the dataset to a structure that can actually be analyzed, sifting through hundreds of text entries for the variable “age” and correcting them into a format that the analysis software can recognize as a number... and I kept telling myself: there HAS to be an easier way. And indeed there is. Actually, indeed there ARE. Here's what I learned the hard way.
Your survey questions should be clear and specific enough for the answer you seek
That is, if you ask “what is your weight?”, which you may think is a simple and straightforward question, you might get responses ranging from “60 kg”, “135 lbs”, “10 stones”, “67,5” (this one you may assume to be European for 67.5 probably kilogram), etc. And you know what all these entries have in common? Your data analysis software will see them as text entries, and when you attempt to convert them to numbers, the values will disappear. Gone. And at that point you may have to go through them manually and change them, cleaning your data points one by one, into a format that your program will recognize.
Find extreme outliers for each variable – Frequencies and Ranges
If you have a small dataset, you can probably eyeball your data and see if anything looks abnormal (for example, someone who reported their age to be 155 years; it's possible that this person really meant 15.5 years old). If you have a large dataset, chances are you won't be able to eyeball it – or if you try, you might miss some of these. So run some frequencies on your data and also get the ranges of the values for each variable you're interested in. This will take some time, but less time than going through a hundred thousand entries for 200 variables.
Coding across raters is consistent
If multiple raters have contributed to your dataset, you need to check that they all use the same coding scheme. You can look specifically at the ranges of values for different raters (for example, in SPSS you can use the “Explore” option to look at descriptive statistics within groups) to make sure that they don't look substantially different from each other.
Rename and label variables
If your data comes in a format that looks like this: VAR0001, VAR0002, etc., you will need to rename each variable so you can later know what you're analyzing. DO NOT do this by hand. Use syntax and code whenever you can, and save your syntax and code so you can replicate your cleaning process should you need to (trust me, you will need to). Another important part at this data cleaning stage is labeling your variables – it's not enough that you know your variable is the first question of survey X (e.g., X_001), but you should also know what that question was (e.g. “What is your weight in lbs?”).
Label your values within variables
If your data is organized by categories (e.g. 1 = Male, 2 = Female), you should label these in your dataset if your software permits it.
Identify duplicate entries
Often enough you will have duplicates in your dataset, and these can mess up your analyses. SPSS has an easy way of identifying duplicates, but I am sure that any of the other statistical analysis tools out there also have similar functions.
Use tools like Google Refine
Check out the intro video here: http://www.youtube.com/watch?v=B70J_H_zAWM This can be a lifesaver.
Know what format your data needs to be in for analysis
Not all research questions are created equal, so when it comes to cleaning, a lot depends on the question you are looking to answer. You do not need to clean your entire dataset. Just focus on the variables that are relevant to your particular question. Of course, if you have time and some OCD traits on your side, by all means go ahead and clean it all. However, if you are looking to analyze how people label their weight when given free choice of expression, converting all text entries into numbers won't help you answer your question.
Finally, DOCUMENT YOUR DATA CLEANING.
Keep track of all your cleaning work, either via syntax, code, or if you have to, just a readme.txt file. Document why you've made those decisions, and which cases were removed, and when. Also, always keep the original, untouched dataset and work on a copy. This is to CYA in case anyone questions your approach, and to CYA in case you mess up a data cleaning or analysis step and have to retrace. Having all your syntax ready for a re-run will make you really, really happy.
© 2015 The Fru Gal