ArtsAutosBooksBusinessEducationEntertainmentFamilyFashionFoodGamesGenderHealthHolidaysHomeHubPagesPersonal FinancePetsPoliticsReligionSportsTechnologyTravel

How to clean your data: best practices

Updated on June 27, 2015

When I first heard the term “cleaning data” I thought it was funny – what is that even supposed to mean? I soon learned that cleaning data can be a long and painful process if you're not prepared for it. Hours and hours of changing variable names by hand, clicking on any and all the options in SPSS to get the dataset to a structure that can actually be analyzed, sifting through hundreds of text entries for the variable “age” and correcting them into a format that the analysis software can recognize as a number... and I kept telling myself: there HAS to be an easier way. And indeed there is. Actually, indeed there ARE. Here's what I learned the hard way.

Your survey questions should be clear and specific enough for the answer you seek

That is, if you ask “what is your weight?”, which you may think is a simple and straightforward question, you might get responses ranging from “60 kg”, “135 lbs”, “10 stones”, “67,5” (this one you may assume to be European for 67.5 probably kilogram), etc. And you know what all these entries have in common? Your data analysis software will see them as text entries, and when you attempt to convert them to numbers, the values will disappear. Gone. And at that point you may have to go through them manually and change them, cleaning your data points one by one, into a format that your program will recognize.

Find extreme outliers for each variable – Frequencies and Ranges

If you have a small dataset, you can probably eyeball your data and see if anything looks abnormal (for example, someone who reported their age to be 155 years; it's possible that this person really meant 15.5 years old). If you have a large dataset, chances are you won't be able to eyeball it – or if you try, you might miss some of these. So run some frequencies on your data and also get the ranges of the values for each variable you're interested in. This will take some time, but less time than going through a hundred thousand entries for 200 variables.

Coding across raters is consistent

If multiple raters have contributed to your dataset, you need to check that they all use the same coding scheme. You can look specifically at the ranges of values for different raters (for example, in SPSS you can use the “Explore” option to look at descriptive statistics within groups) to make sure that they don't look substantially different from each other.

Rename and label variables

If your data comes in a format that looks like this: VAR0001, VAR0002, etc., you will need to rename each variable so you can later know what you're analyzing. DO NOT do this by hand. Use syntax and code whenever you can, and save your syntax and code so you can replicate your cleaning process should you need to (trust me, you will need to). Another important part at this data cleaning stage is labeling your variables – it's not enough that you know your variable is the first question of survey X (e.g., X_001), but you should also know what that question was (e.g. “What is your weight in lbs?”).

Label your values within variables

If your data is organized by categories (e.g. 1 = Male, 2 = Female), you should label these in your dataset if your software permits it.

Identify duplicate entries

Often enough you will have duplicates in your dataset, and these can mess up your analyses. SPSS has an easy way of identifying duplicates, but I am sure that any of the other statistical analysis tools out there also have similar functions.

Use tools like Google Refine

Check out the intro video here: This can be a lifesaver.

Know what format your data needs to be in for analysis

Not all research questions are created equal, so when it comes to cleaning, a lot depends on the question you are looking to answer. You do not need to clean your entire dataset. Just focus on the variables that are relevant to your particular question. Of course, if you have time and some OCD traits on your side, by all means go ahead and clean it all. However, if you are looking to analyze how people label their weight when given free choice of expression, converting all text entries into numbers won't help you answer your question.


Keep track of all your cleaning work, either via syntax, code, or if you have to, just a readme.txt file. Document why you've made those decisions, and which cases were removed, and when. Also, always keep the original, untouched dataset and work on a copy. This is to CYA in case anyone questions your approach, and to CYA in case you mess up a data cleaning or analysis step and have to retrace. Having all your syntax ready for a re-run will make you really, really happy.

© 2015 The Fru Gal


    0 of 8192 characters used
    Post Comment

    No comments yet.


    This website uses cookies

    As a user in the EEA, your approval is needed on a few things. To provide a better website experience, uses cookies (and other similar technologies) and may collect, process, and share personal data. Please choose which areas of our service you consent to our doing so.

    For more information on managing or withdrawing consents and how we handle data, visit our Privacy Policy at:

    Show Details
    HubPages Device IDThis is used to identify particular browsers or devices when the access the service, and is used for security reasons.
    LoginThis is necessary to sign in to the HubPages Service.
    Google RecaptchaThis is used to prevent bots and spam. (Privacy Policy)
    AkismetThis is used to detect comment spam. (Privacy Policy)
    HubPages Google AnalyticsThis is used to provide data on traffic to our website, all personally identifyable data is anonymized. (Privacy Policy)
    HubPages Traffic PixelThis is used to collect data on traffic to articles and other pages on our site. Unless you are signed in to a HubPages account, all personally identifiable information is anonymized.
    Amazon Web ServicesThis is a cloud services platform that we used to host our service. (Privacy Policy)
    CloudflareThis is a cloud CDN service that we use to efficiently deliver files required for our service to operate such as javascript, cascading style sheets, images, and videos. (Privacy Policy)
    Google Hosted LibrariesJavascript software libraries such as jQuery are loaded at endpoints on the or domains, for performance and efficiency reasons. (Privacy Policy)
    Google Custom SearchThis is feature allows you to search the site. (Privacy Policy)
    Google MapsSome articles have Google Maps embedded in them. (Privacy Policy)
    Google ChartsThis is used to display charts and graphs on articles and the author center. (Privacy Policy)
    Google AdSense Host APIThis service allows you to sign up for or associate a Google AdSense account with HubPages, so that you can earn money from ads on your articles. No data is shared unless you engage with this feature. (Privacy Policy)
    Google YouTubeSome articles have YouTube videos embedded in them. (Privacy Policy)
    VimeoSome articles have Vimeo videos embedded in them. (Privacy Policy)
    PaypalThis is used for a registered author who enrolls in the HubPages Earnings program and requests to be paid via PayPal. No data is shared with Paypal unless you engage with this feature. (Privacy Policy)
    Facebook LoginYou can use this to streamline signing up for, or signing in to your Hubpages account. No data is shared with Facebook unless you engage with this feature. (Privacy Policy)
    MavenThis supports the Maven widget and search functionality. (Privacy Policy)
    Google AdSenseThis is an ad network. (Privacy Policy)
    Google DoubleClickGoogle provides ad serving technology and runs an ad network. (Privacy Policy)
    Index ExchangeThis is an ad network. (Privacy Policy)
    SovrnThis is an ad network. (Privacy Policy)
    Facebook AdsThis is an ad network. (Privacy Policy)
    Amazon Unified Ad MarketplaceThis is an ad network. (Privacy Policy)
    AppNexusThis is an ad network. (Privacy Policy)
    OpenxThis is an ad network. (Privacy Policy)
    Rubicon ProjectThis is an ad network. (Privacy Policy)
    TripleLiftThis is an ad network. (Privacy Policy)
    Say MediaWe partner with Say Media to deliver ad campaigns on our sites. (Privacy Policy)
    Remarketing PixelsWe may use remarketing pixels from advertising networks such as Google AdWords, Bing Ads, and Facebook in order to advertise the HubPages Service to people that have visited our sites.
    Conversion Tracking PixelsWe may use conversion tracking pixels from advertising networks such as Google AdWords, Bing Ads, and Facebook in order to identify when an advertisement has successfully resulted in the desired action, such as signing up for the HubPages Service or publishing an article on the HubPages Service.
    Author Google AnalyticsThis is used to provide traffic data and reports to the authors of articles on the HubPages Service. (Privacy Policy)
    ComscoreComScore is a media measurement and analytics company providing marketing data and analytics to enterprises, media and advertising agencies, and publishers. Non-consent will result in ComScore only processing obfuscated personal data. (Privacy Policy)
    Amazon Tracking PixelSome articles display amazon products as part of the Amazon Affiliate program, this pixel provides traffic statistics for those products (Privacy Policy)