
R tutorial: Cross-validation

Learn more about machine learning with R: https://www.datacamp.com/courses/machine-learning-toolbox


In the last video, we manually split our data into a single test set, and evaluated out-of-sample error once. However, this process is a little fragile: the presence or absence of a single outlier can vastly change our out-of-sample RMSE.

A better approach than a simple train/test split is using multiple test sets and averaging out-of-sample error, which gives us a more precise estimate of true out-of-sample error. One of the most common approaches for multiple test sets is known as "cross-validation", in which we split our data into ten "folds" or train/test splits. We create these folds in such a way that each point in our dataset occurs in exactly one test set.

This gives us 10 test sets and, better yet, means that every single point in our dataset occurs in a test set exactly once. In other words, we get a test set that is the same size as our training set, but is composed of out-of-sample predictions! We assign each row to its test set randomly, to avoid any kind of systematic bias in our data. This is one of the best ways to estimate out-of-sample error for predictive models.
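To make the fold assignment concrete, here is a minimal sketch using caret's createFolds function; the seed value and the use of mtcars are illustrative choices, not part of the original example.

# Minimal sketch: split mtcars into 10 folds and confirm that every row
# lands in exactly one test set (seed and dataset chosen for illustration).
library(caret)

set.seed(42)                              # make the random fold assignment reproducible
folds <- createFolds(mtcars$mpg, k = 10)  # list of held-out row indices, one element per fold

length(folds)        # 10 test sets
sort(unlist(folds))  # row indices 1 through 32, each appearing exactly once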

One important note: after doing cross-validation, you throw all resampled models away and start over! Cross-validation is only used to estimate the out-of-sample error for your model. Once you know this, you re-fit your model on the full training dataset, so as to fully exploit the information in that dataset. This, by definition, makes cross-validation very expensive: it inherently takes 11 times as long as fitting a single model (10 cross-validation models plus the final model).

By default, the train function in caret does a different kind of resampling known as bootstrap validation, but it is also capable of doing cross-validation, and in practice the two methods yield similar results.
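For reference, a small sketch of how the resampling scheme is selected via trainControl; the object names here are arbitrary.

library(caret)

boot_control <- trainControl(method = "boot", number = 25)  # train's default: bootstrap resampling
cv_control   <- trainControl(method = "cv",   number = 10)  # 10-fold cross-validation instead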

Let's fit a cross-validated model to the mtcars dataset. First, we set the random seed, since cross-validation randomly assigns rows to each fold and we want to be able to reproduce our model exactly.

The train function has a formula interface, which is identical to the formula interface for the lm function in base R. However, it supports fitting hundreds of different models, which are easily specified with the "method" argument. In this case, we fit a linear regression model, but we could just as easily specify method = 'rf' and fit a random forest model, without changing any of our code. This is the second most useful feature of the caret package, behind cross-validation of models: it provides a common interface to hundreds of different predictive models.

The trControl argument controls the parameters caret uses for cross-validation. In this course, we will mostly use 10-fold cross-validation, but this flexible function supports many other cross-validation schemes. Additionally, we provide the verboseIter = TRUE argument, which gives us a progress log as the model is being fit and lets us know if we have time to get coffee while the models run.
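Putting these pieces together, a sketch of the call described above might look like the following; the seed value and the formula mpg ~ hp + wt are illustrative choices, not fixed by the transcript.

library(caret)

set.seed(42)                      # reproduce the random assignment of rows to folds

model <- train(
  mpg ~ hp + wt,                  # formula interface, just like lm()
  data = mtcars,
  method = "lm",                  # swap in "rf" for a random forest with no other changes
  trControl = trainControl(
    method = "cv",                # 10-fold cross-validation
    number = 10,
    verboseIter = TRUE            # print progress as each fold is fit
  )
)

model                             # cross-validated RMSE and other summary statistics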

Let's practice cross-validating some models.


