A Data Science Central Community
In my last post, I explained MSE. Today I will explain the variance-bias trade-off and the precision-recall trade-off, both of which matter when assessing model accuracy.
Variance refers to the amount by which our estimate of f would change if we fitted it using a different training dataset. Since the training data is used to fit the statistical learning method, different training sets will result in different estimates of f.
Ideally, the estimate should not vary much between training sets.
Bias refers to the error that is introduced by approximating a complicated problem by a simpler model.
For example, consider the distribution of the dataset (black curve). If a simple linear regression (orange curve) is fitted to a dataset that actually needs a much more flexible model (blue curve), the linear regression introduces bias.
Fig: linear regression provides a very poor fit to the data
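The expected test MSE at a point x0 can be decomposed into three quantities: the variance of the estimate, the squared bias of the estimate, and the variance of the error term.
$$E\left[(y_0 - \hat{f}(x_0))^2\right] = \mathrm{Var}(\hat{f}(x_0)) + [\mathrm{Bias}(\hat{f}(x_0))]^2 + \mathrm{Var}(\varepsilon)$$
Here the left-hand side is the expected test MSE. On the right-hand side, the first term is the variance, the second term is the squared bias, and the third term is the variance of the error ε, which is irreducible: no choice of model can remove it.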
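The decomposition can be checked numerically. The sketch below is only an illustration under assumptions that are not from the post: the true function is taken to be sin(x), the noise is Gaussian with a chosen standard deviation, and the model is a straight line fitted by least squares. The names true_f, fit_line and decompose are hypothetical helpers. It fits the line on many simulated training sets and measures the variance and squared bias of the prediction at one point x0.

```python
import math
import random
import statistics


def true_f(x):
    # Assumed ground-truth function, chosen only for illustration.
    return math.sin(x)


def fit_line(xs, ys):
    # Closed-form ordinary least squares for y = intercept + slope * x.
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def decompose(x0=1.5, sigma=0.3, n_train=30, n_repeats=2000, seed=0):
    # Refit the line on many independent training sets and record f_hat(x0).
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_repeats):
        xs = [rng.uniform(0.0, math.pi) for _ in range(n_train)]
        ys = [true_f(x) + rng.gauss(0.0, sigma) for x in xs]
        intercept, slope = fit_line(xs, ys)
        predictions.append(intercept + slope * x0)

    target = true_f(x0)
    variance = statistics.pvariance(predictions)               # Var(f_hat(x0))
    bias_sq = (statistics.fmean(predictions) - target) ** 2    # [Bias(f_hat(x0))]^2
    # Average squared distance to the true value: the reducible part of the error.
    reducible = statistics.fmean([(p - target) ** 2 for p in predictions])
    return {
        "variance": variance,
        "bias_sq": bias_sq,
        "noise_variance": sigma ** 2,                          # Var(eps)
        "reducible_error": reducible,
        "expected_test_mse": variance + bias_sq + sigma ** 2,
    }


result = decompose()
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

In this setup the straight line cannot follow the curve, so the squared bias is the dominant term. The reducible error equals variance plus squared bias, and adding the noise variance gives the expected test MSE.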
To develop the best model for any analysis, we need to select a statistical method that achieves both low variance and low bias. Because lowering one usually raises the other, this is called the variance-bias trade-off.
Fig: Variance-bias trade-off. The explanation below gives the details.
As a general rule, as the complexity (flexibility) of the model increases, the variance increases and the bias decreases. The relative rate of change of these two quantities determines whether the test MSE increases or decreases. Initially, as flexibility increases, the bias decreases faster than the variance increases, so the test MSE decreases. However, at some point further flexibility has little effect on the bias, and beyond this point the variance tends to increase significantly, so the test MSE rises again. The flexibility at which the test MSE is lowest can be treated as the optimal point for model selection. In short, while assessing model accuracy we need to take the variance-bias trade-off into consideration.
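The pattern described above can be reproduced with a small simulation. The sketch below swaps in k-nearest-neighbours regression, which the post does not mention, because its flexibility is easy to control: a small k gives a very flexible fit, and k equal to the training set size gives a constant prediction. The true function sin(x), the noise level, and the helper names knn_predict and sweep_flexibility are assumptions made for illustration.

```python
import math
import random
import statistics


def knn_predict(xs, ys, x0, k):
    # Average the responses of the k training points closest to x0.
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))[:k]
    return statistics.fmean([y for _, y in nearest])


def sweep_flexibility(ks=(1, 5, 10, 30), x0=math.pi / 2, sigma=0.3,
                      n_train=30, n_repeats=500, seed=0):
    # Assumed truth: y = sin(x) + Gaussian noise with standard deviation sigma.
    rng = random.Random(seed)
    target = math.sin(x0)
    predictions = {k: [] for k in ks}
    for _ in range(n_repeats):
        xs = [rng.uniform(0.0, math.pi) for _ in range(n_train)]
        ys = [math.sin(x) + rng.gauss(0.0, sigma) for x in xs]
        for k in ks:
            predictions[k].append(knn_predict(xs, ys, x0, k))

    results = {}
    for k, preds in predictions.items():
        variance = statistics.pvariance(preds)
        bias_sq = (statistics.fmean(preds) - target) ** 2
        results[k] = {
            "variance": variance,
            "bias_sq": bias_sq,
            "test_mse": variance + bias_sq + sigma ** 2,
        }
    return results


results = sweep_flexibility()
# Smaller k means a more flexible model, so print from least to most flexible.
for k in sorted(results, reverse=True):
    r = results[k]
    print(f"k={k:2d}  variance={r['variance']:.4f}  "
          f"bias^2={r['bias_sq']:.4f}  test MSE={r['test_mse']:.4f}")
```

Reading the output from k=30 down to k=1 follows the direction of increasing flexibility: the squared bias shrinks while the variance grows, and the lowest test MSE is reached at an intermediate k.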
For a classification problem, we assess the quality of the model using precision and recall.
Consider the below confusion matrix, one of the best ways to represent the results of a classifier:
| | ACTUAL POSITIVE | ACTUAL NEGATIVE |
|---|---|---|
| PREDICTED POSITIVE | TRUE POSITIVE | FALSE POSITIVE |
| PREDICTED NEGATIVE | FALSE NEGATIVE | TRUE NEGATIVE |
Let’s understand the confusion matrix:
TRUE POSITIVE: The actual class is positive and the model predicted positive.
FALSE POSITIVE: The actual class is negative but the model predicted positive (a false alarm).
FALSE NEGATIVE: The actual class is positive but the model predicted negative (a miss).
TRUE NEGATIVE: The actual class is negative and the model predicted negative.
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
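As a small worked example of these two formulas, the sketch below counts the four confusion-matrix cells from lists of actual and predicted labels and then applies the definitions. The labels are made up for illustration, and confusion_counts and precision_recall are hypothetical helper names.

```python
def confusion_counts(actual, predicted):
    # Labels are assumed to be 1 for positive and 0 for negative.
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    # Return 0.0 when a denominator is zero (a convention chosen for this sketch).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up example data.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)  # TP=2, FP=1, FN=2, TN=3
precision, recall = precision_recall(tp, fp, fn)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Here the model raises one false alarm among its three positive predictions (precision 2/3) and misses two of the four actual positives (recall 1/2).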
Precision and recall typically pull in opposite directions: for a given model, changing the decision threshold to increase precision tends to lower recall, and vice versa. The right trade-off varies from problem to problem. For example, in a legal document classification problem the model needs high recall, because it must retrieve as many of the relevant documents as possible.
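One common way to see this trade-off is to vary the score threshold above which a classifier predicts the positive class. The sketch below uses made-up scores and labels, and a hypothetical helper precision_recall_at; it is an illustration under those assumptions, not output from a real model.

```python
def precision_recall_at(scores, labels, threshold):
    # Predict positive when the score is at or above the threshold.
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up classifier scores with their actual labels (1 = positive).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

for threshold in (0.25, 0.55, 0.75):
    precision, recall = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  precision={precision:.3f}  recall={recall:.3f}")
```

On this data, raising the threshold from 0.25 to 0.75 moves precision from 0.625 to 1.0 while recall drops from 1.0 to 0.6. A high-recall application such as the legal document example would favour the lower threshold.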
© 2020 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC