R tutorial: Introducing out-of-sample error measures

Learn more about machine learning with R: https://www.datacamp.com/courses/machine-learning-toolbox

Hi! I'm Zach Deane-Mayer, and I'm one of the co-authors of the caret package. I have a passion for data science and spend most of my time working on and thinking about problems in machine learning.

This course focuses on predictive, rather than explanatory, modeling. We want models that do not overfit the training data and that generalize well. In other words, our primary concern when modeling is "do the models perform well on new data?"

The best way to answer this question is to test the models on new data. This simulates real-world experience, in which you fit on one dataset and then predict on new data, where you do not actually know the outcome.

Simulating this experience with a train/test split helps you make an honest assessment of yourself as a modeler.

This is one of the key insights of machine learning: error metrics should be computed on new data, because in-sample validation (or predicting on your training data) essentially guarantees overfitting.

Out-of-sample validation helps you choose models that will continue to perform well in the future.

This is the primary goal of the caret package in general and this course specifically: don’t overfit. Pick models that perform well on new data.

Let's walk through a simple example of out-of-sample validation: We start with a linear regression model, fit on the first 20 rows of the mtcars dataset.
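
A minimal sketch of this first step, assuming an illustrative formula (mpg predicted from horsepower; the exact formula used in the course isn't shown here):

```r
# Fit a linear regression on the first 20 rows of mtcars only.
# The formula mpg ~ hp is an assumption for illustration; any formula works.
data(mtcars)
train <- mtcars[1:20, ]
model <- lm(mpg ~ hp, data = train)
summary(model)
```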

Next, we make predictions with this model on a NEW dataset: the last 12 observations of the mtcars dataset. The 12 cars in this test set were not used to determine the coefficients of the linear regression model, and are therefore a good test of how well we can predict on new data.
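
Continuing the sketch above (reusing the `model` object from the previous block), the held-out rows are passed to predict() via the newdata argument:

```r
# The last 12 rows of mtcars were never seen during fitting
test <- mtcars[21:32, ]

# Generate out-of-sample predictions for the test set
predicted <- predict(model, newdata = test)
```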

In practice, rather than manually splitting the dataset, we'd use the createResample or createFolds function in caret, but the manual split keeps this example simple.
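
For reference, a sketch of how caret's resampling helpers generate splits automatically (the argument values here are illustrative, not prescribed by the course):

```r
library(caret)
set.seed(42)

# createFolds returns a list of held-out row indices, one element per fold
folds <- createFolds(mtcars$mpg, k = 5)

# createResample returns bootstrap resamples of the row indices instead
boots <- createResample(mtcars$mpg, times = 5)
str(folds)
```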

Finally, we calculate the root-mean-squared error (RMSE) on the test set by comparing the predictions from our model to the actual mpg values for the test set.
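
Continuing the same sketch, the test-set RMSE reduces to a few lines of base R:

```r
# Compare predictions to the actual mpg values in the test set
actual <- test$mpg
error  <- predicted - actual

# RMSE: square the errors, average them, then take the square root
rmse_out_of_sample <- sqrt(mean(error^2))
rmse_out_of_sample
```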

RMSE is a measure of the model's average prediction error. It has the same units as the outcome we're predicting, so in this case our model is off by 5 to 6 miles per gallon, on average.

Compared to the in-sample RMSE from a model fit on the full dataset, our out-of-sample error is significantly worse.
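
For comparison, a sketch of that in-sample RMSE from a model fit on, and evaluated against, the full dataset, again assuming the illustrative mpg ~ hp formula:

```r
# Fit on all 32 rows and evaluate on those same rows (in-sample)
full_model <- lm(mpg ~ hp, data = mtcars)

# In-sample RMSE is computed from the model's own residuals,
# so it tends to be optimistic compared to the out-of-sample RMSE above
rmse_in_sample <- sqrt(mean(residuals(full_model)^2))
rmse_in_sample
```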

If we had used in-sample error, we would have fooled ourselves into thinking our model was much better than it actually is.

It's hard to make predictions on new data, as this example shows. Out-of-sample error helps account for this fact, so we can focus on models that predict things we don't already know.

Let's practice this concept on some example data.
