
Machine Learning: Model Selection & Cross Validation

Updated on July 3, 2015

Model selection with cross-validation

In this post we will go over why cross-validation is important, understand how it works, and see how it can be applied in different ways.

Suppose we need to build a machine learning system for the following problem: given a photograph, we would like to predict whether it shows a person or a bomb. Clearly this is an important problem as well as a public safety issue. As machine learning scientists, we represent the input as X and the output as Y. Y takes one value for a person and another for a bomb, for example 0 for a person and 1 for a bomb. In order to build our system, we need to collect data from the real world to learn from.

Our dataset consists of many pairs of photographs and labels, either people or bombs. Once we've collected our data, we will train our system and then put it to the test in the real world, protecting our nation's shopping malls, schools and airports. Let's look at the training process in more detail. As it turns out, we know many ways of training machine learning systems, each with different parameters and settings. For example, we can learn a one-nearest-neighbor system (1-NN), 3-NN, 5-NN, a kernel regression system with a sigma of one, a kernel regression system with a sigma of two, Naive Bayes, support vector machines and many others.

The problem of choosing which method to use from a pool of possible methods is known as model selection. We want to choose the model that will work best at test time in the real world, when all we have is a fixed dataset. One way to choose is to train each method on our data and then test it on that same data. This is a terrible idea: you can't give your students the answer key before giving them an exam.
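To see why testing on the training data is misleading, here is a minimal sketch. It assumes a made-up one-dimensional feature and labels that are pure coin flips, so there is nothing real to learn; the function names and the data are illustrative, not part of the bomb-detection system described here.

```python
import random

def nearest_neighbor_predict(train_x, train_y, query):
    """Return the label of the training point closest to `query` (1-NN)."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return train_y[best]

def error_rate(train_x, train_y, test_x, test_y):
    """Fraction of test points that 1-NN, trained on the training set, gets wrong."""
    wrong = sum(
        nearest_neighbor_predict(train_x, train_y, x) != y
        for x, y in zip(test_x, test_y)
    )
    return wrong / len(test_x)

rng = random.Random(0)
# Hypothetical data: a random feature and coin-flip labels (nothing to learn).
xs = [rng.random() for _ in range(200)]
ys = [rng.choice([0, 1]) for _ in range(200)]

train_x, train_y = xs[:100], ys[:100]
fresh_x, fresh_y = xs[100:], ys[100:]

# Testing on the training data: every point is its own nearest neighbor.
same_data_error = error_rate(train_x, train_y, train_x, train_y)
# Testing on data the method has never seen.
fresh_data_error = error_rate(train_x, train_y, fresh_x, fresh_y)
print(same_data_error, fresh_data_error)
```

The error on the training data comes out as zero because 1-NN simply memorizes the answers, while the error on fresh data should sit near one half, which is all that coin-flip labels allow.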

Instead, we will split our data into sections. Each section is called a fold. In this example we have four folds.
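One simple way to form folds is to deal the example indices out in turn, like cards. The sketch below assumes ten examples and a hypothetical helper called `make_folds`; in practice the data would usually be shuffled first so that each fold is a fair sample.

```python
def make_folds(n_examples, n_folds):
    """Assign example indices 0..n_examples-1 to n_folds nearly equal folds."""
    folds = [[] for _ in range(n_folds)]
    for i in range(n_examples):
        # Deal indices round-robin so fold sizes differ by at most one.
        folds[i % n_folds].append(i)
    return folds

folds = make_folds(10, 4)
print(folds)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

Every example lands in exactly one fold, which is what lets each example be used for testing exactly once.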

Next we iterate over the folds as follows. In the first iteration we train on folds 2, 3 and 4, and then we test our method on fold 1. In this case the algorithm has never seen fold 1 before, just as it will not have seen the photographs our bomb detector meets in the real world. We measure the error of our method on this fold. We then swap the roles of folds 1 and 2: now we train on folds 1, 3 and 4 and test on fold 2. We repeat this process for each fold, withholding that fold from training and then computing the error on it at test time. Some folds are easier to learn than others.

Finally we combine the four errors into a single average. This average is known as the cross-validation error. For any single method, the cross-validation error is an estimate of how the method would perform, provided the data we collected is an accurate representation of the real world.
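The loop over folds and the final average can be sketched as follows. This is an illustration under assumptions: the stand-in method just predicts the most common training label, the eight labels are made up, and `train_fn` / `predict_fn` are hypothetical hooks for plugging in any method.

```python
def cross_validation_error(xs, ys, n_folds, train_fn, predict_fn):
    """Return (average held-out error, list of per-fold errors)."""
    n = len(xs)
    fold_errors = []
    for fold in range(n_folds):
        # Withhold one fold for testing and train on all the others.
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        model = train_fn([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        wrong = sum(predict_fn(model, xs[i]) != ys[i] for i in test_idx)
        fold_errors.append(wrong / len(test_idx))
    return sum(fold_errors) / n_folds, fold_errors

# Hypothetical stand-in method: always predict the most common training label.
def train_majority(train_x, train_y):
    return max(set(train_y), key=train_y.count)

def predict_majority(model, x):
    return model

xs = list(range(8))
ys = [0, 0, 0, 1, 0, 0, 1, 0]  # made-up labels
cv_error, fold_errors = cross_validation_error(
    xs, ys, 4, train_majority, predict_majority
)
print(fold_errors, cv_error)  # [0.0, 0.0, 0.5, 0.5] 0.25
```

The per-fold errors differ even for the same method, which is why the four numbers are averaged into one.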

We repeat the cross-validation procedure for each method we might select during training. Then we can select the model with the minimum cross-validation error.
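As a rough sketch of this selection step, the code below compares 1-NN, 3-NN and 5-NN by cross-validation error on synthetic data. The data set (a single feature with 20 percent of labels flipped) and all function names are assumptions made for illustration, so the winner here says nothing about which method wins on real photographs.

```python
import random

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points nearest to `query`."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > k else 0

def cv_error(xs, ys, k, n_folds=4):
    """Average held-out error of k-NN across the folds."""
    n = len(xs)
    errors = []
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        test = [i for i in range(n) if i % n_folds == fold]
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        wrong = sum(knn_predict(tx, ty, xs[i], k) != ys[i] for i in test)
        errors.append(wrong / len(test))
    return sum(errors) / n_folds

rng = random.Random(0)
# Hypothetical data: label is 1 when x > 0.5, with about 20% of labels flipped.
xs = [rng.random() for _ in range(120)]
ys = [int(x > 0.5) ^ int(rng.random() < 0.2) for x in xs]

candidates = [1, 3, 5]
scores = {k: cv_error(xs, ys, k) for k in candidates}
# Model selection: keep the candidate with the lowest cross-validation error.
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

Every candidate is scored on exactly the same folds, so the comparison between methods is like for like.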

In this case, 5 nearest neighbors is our best guess for which model will be the best bomb detector in the world. Now that we have chosen our model, we can evaluate it in the real world. But what sort of performance should we expect? Do we expect exactly the same performance as the cross-validation estimate? Maybe our estimate was optimistic, or maybe it was too conservative. In fact, the 17 percent error we found during the model selection process is almost certainly optimistic. This is because model selection has biased our estimate: we chose the best cross-validation error out of many possibilities.

Even if we had a pool of 1 million random classifiers, we would still expect at least one of them to have a low cross-validation error due purely to random chance. So we need to take another look at our data. We will still use cross-validation, but this time we apply it twice.
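A small simulation makes the random-classifier argument concrete. It is a sketch under assumptions: 2,000 classifiers instead of a million (to keep it fast), 40 labeled examples, and "classifiers" that ignore their input and guess. Because such a classifier learns nothing from training, its cross-validation error is simply its error rate on the held-out examples.

```python
import random

rng = random.Random(0)
n_examples, n_classifiers, n_fresh = 40, 2000, 1000

def error(predictions, labels):
    """Fraction of predictions that disagree with the labels."""
    return sum(p != y for p, y in zip(predictions, labels)) / len(labels)

# Made-up labels for the data used to choose among the classifiers.
selection_labels = [rng.choice([0, 1]) for _ in range(n_examples)]

# Score every random classifier and keep the lowest error seen.
best_error = 1.0
for _ in range(n_classifiers):
    guesses = [rng.choice([0, 1]) for _ in range(n_examples)]
    best_error = min(best_error, error(guesses, selection_labels))

# The "winner" still only guesses, so on fresh data it is back to chance.
fresh_labels = [rng.choice([0, 1]) for _ in range(n_fresh)]
fresh_guesses = [rng.choice([0, 1]) for _ in range(n_fresh)]
best_fresh_error = error(fresh_guesses, fresh_labels)

print(best_error, best_fresh_error)
```

The selected classifier looks much better than chance on the data used to pick it, yet it is no better than a coin flip on new data. The gap between the two numbers is the optimism introduced by selection.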

First we separate our data into two parts. The first part will be used for model selection. The second will be used for testing, to represent the unseen world. The important point is that the world data is never touched by our model selection procedure. To perform model selection, we divide the first part into folds just like before.

In this case we have six folds. We then perform cross-validation for each of our methods to determine an error rate. In this case, three nearest neighbors is the method with the lowest cross-validation error. Now we can evaluate the result of model selection on our held-out test data. This time we use all folds of the training data during training.

Now what does this final number estimate? In fact, it is an estimate of our entire learning process. We took data, we trained multiple methods, and then we selected the best according to cross-validation. Finally we tested on held-out data not seen by the algorithm. In other words, we obtained an estimate of how our entire learning procedure, which includes model selection as part of training, will perform on unseen data. Again, this assumes the world is well represented by our dataset. This time our estimate of 16 percent is most likely conservative, since we're using only a portion of the data we have to train the model.
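The whole procedure can be sketched end to end. As before, this is an illustration under assumptions: synthetic one-dimensional data with noisy labels, k-NN candidates with k of 1, 3 and 5, six folds, and a hypothetical 120/60 split between selection data and held-out "world" data.

```python
import random

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points nearest to `query`."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > k else 0

def error_rate(train_x, train_y, test_x, test_y, k):
    """Error of k-NN trained on one set and evaluated on another."""
    wrong = sum(
        knn_predict(train_x, train_y, x, k) != y for x, y in zip(test_x, test_y)
    )
    return wrong / len(test_x)

def cv_error(xs, ys, k, n_folds):
    """Average held-out error of k-NN across the folds."""
    n = len(xs)
    total = 0.0
    for fold in range(n_folds):
        tr = [i for i in range(n) if i % n_folds != fold]
        te = [i for i in range(n) if i % n_folds == fold]
        total += error_rate(
            [xs[i] for i in tr], [ys[i] for i in tr],
            [xs[i] for i in te], [ys[i] for i in te], k,
        )
    return total / n_folds

rng = random.Random(1)
# Hypothetical data: label is 1 when x > 0.5, with about 20% of labels flipped.
xs = [rng.random() for _ in range(180)]
ys = [int(x > 0.5) ^ int(rng.random() < 0.2) for x in xs]

# Part 1 is used for model selection; part 2 is the untouched "world" data.
select_x, select_y = xs[:120], ys[:120]
world_x, world_y = xs[120:], ys[120:]

# Model selection by six-fold cross-validation on part 1 only.
scores = {k: cv_error(select_x, select_y, k, n_folds=6) for k in (1, 3, 5)}
best_k = min(scores, key=scores.get)

# Retrain on all six folds, then test once on the held-out data.
final_error = error_rate(select_x, select_y, world_x, world_y, best_k)
print(best_k, scores[best_k], final_error)
```

The held-out data plays no part in choosing `best_k`, so `final_error` measures the whole pipeline, selection included, rather than any one method in isolation.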

In conclusion, here is what we learned today. Cross-validation is a simple and useful method of model selection. More importantly, cross-validation is also necessary to obtain an estimate of the error of our model selection method.

