In my last post, I explained MSE; today I will explain the bias-variance trade-off and the precision-recall trade-off in the context of assessing model accuracy.

What are the variance and bias of a statistical learning method?

Variance refers to the amount by which the estimate f̂ would change if we computed it using a different training dataset. Since the training data are used to fit the statistical learning method, different training sets will result in different estimates f̂.

Ideally, the estimate should not vary much between training sets.

Bias refers to the error that is introduced by approximating a complicated problem by a simpler model.

For example, consider a dataset whose true relationship (black curve) is non-linear. If a simple linear regression (orange curve) is fitted to a dataset that actually needs a much more flexible model (blue curve), the linear regression introduces bias into the model.

Fig: Linear regression provides a very poor fit to the data.

Explanation:

The expected test MSE can be decomposed into three components: the variance of the estimate, the squared bias of the estimate, and the variance of the irreducible error.

E[(y₀ − f̂(x₀))²] = Var(f̂(x₀)) + [Bias(f̂(x₀))]² + Var(ε)

Here E[(y₀ − f̂(x₀))²] is the expected test MSE at the point x₀: the first term on the right is the variance of the estimate, the second is its squared bias, and the third is the variance of the irreducible error ε.
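To make the decomposition concrete, here is a minimal simulation sketch. It assumes a toy data-generating process (y = sin(x) plus Gaussian noise, entirely invented for illustration), refits the same simple linear model on many independent training sets, and estimates all three terms at a single test point x₀:

    import numpy as np

    rng = np.random.default_rng(0)
    true_f = np.sin          # assumed "true" function f
    noise_sd = 0.3           # sd of the irreducible error epsilon
    x0 = 1.5                 # fixed test point

    preds = []
    for _ in range(2000):    # many independent training sets
        x = rng.uniform(0, 3, size=30)
        y = true_f(x) + rng.normal(0, noise_sd, size=30)
        coefs = np.polyfit(x, y, deg=1)        # simple linear fit (high bias)
        preds.append(np.polyval(coefs, x0))    # prediction f_hat(x0)

    preds = np.array(preds)
    variance = preds.var()                     # Var(f_hat(x0))
    bias_sq = (preds.mean() - true_f(x0)) ** 2 # [Bias(f_hat(x0))]^2
    print(variance, bias_sq, noise_sd ** 2)    # the three terms; their sum
                                               # approximates the test MSE at x0

Swapping deg=1 for a higher polynomial degree shows the variance term growing as the squared-bias term shrinks.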

To develop the best model for any analysis, we need to select a statistical method that achieves both low variance and low bias. Balancing these two sources of error is called the bias-variance trade-off.

Fig: Bias-variance trade-off (explained in detail below).

As a general rule, as the flexibility of the model increases, the variance increases and the bias decreases. The relative rate of change of these two quantities determines whether the test MSE increases or decreases. Initially, as flexibility increases, the bias decreases faster than the variance increases, so the test MSE decreases. At some point, however, increasing flexibility has little effect on the bias, and beyond this point the variance tends to increase significantly, driving the test MSE back up. That point can be treated as the optimal level of flexibility for model selection. As an end note, while assessing model accuracy we need to take the bias-variance trade-off into consideration; the sketch below illustrates the resulting U-shape.
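The following sketch uses the same invented sin-plus-noise setup as above, with polynomial degree standing in for flexibility; it typically reproduces the U-shaped test MSE curve:

    import numpy as np

    rng = np.random.default_rng(1)
    x_train = rng.uniform(0, 3, 50)
    y_train = np.sin(x_train) + rng.normal(0, 0.3, 50)
    x_test = rng.uniform(0, 3, 500)
    y_test = np.sin(x_test) + rng.normal(0, 0.3, 500)

    for degree in [1, 2, 4, 8, 12]:            # increasing flexibility
        coefs = np.polyfit(x_train, y_train, degree)
        test_mse = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
        print(degree, round(test_mse, 3))
    # Test MSE usually falls at first (bias shrinks faster than variance
    # grows), then rises again once variance dominates.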

What is the precision-recall trade-off of a statistical learning method?

When dealing with a classification problem, we validate the relevance of the model's predictions using precision and recall.

Consider the confusion matrix below, one of the best ways to represent the results of a classifier:

                        ACTUAL POSITIVE    ACTUAL NEGATIVE
PREDICTED POSITIVE      TRUE POSITIVE      FALSE POSITIVE
PREDICTED NEGATIVE      FALSE NEGATIVE     TRUE NEGATIVE

Let’s understand the confusion matrix:

TRUE POSITIVE: The actual value is positive and the model predicted positive.

FALSE POSITIVE: The actual value is negative but the model predicted positive (a false alarm).

FALSE NEGATIVE: The actual value is positive but the model predicted negative (a miss).

TRUE NEGATIVE: The actual value is negative and the model predicted negative.
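As a small illustration, here is how these four counts can be computed with scikit-learn (the labels below are made up, and I am assuming scikit-learn is available):

    from sklearn.metrics import confusion_matrix

    y_actual    = [1, 1, 0, 1, 0, 0, 1, 0]   # 1 = positive, 0 = negative
    y_predicted = [1, 0, 0, 1, 1, 0, 1, 0]

    # labels=[1, 0] orders rows/columns positive-first, matching the
    # table above: [[TP, FN], [FP, TN]]
    cm = confusion_matrix(y_actual, y_predicted, labels=[1, 0])
    tp, fn = cm[0]
    fp, tn = cm[1]
    print(tp, fp, fn, tn)   # 3 1 1 3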

Precision: the percentage of predicted positives that are actually positive.

Precision = TP/(TP+FP)

Recall: the percentage of actual positives that the model correctly identifies.

Recall = TP/(TP+FN)
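Plugging in the invented counts from the confusion matrix example above (TP = 3, FP = 1, FN = 1):

    from sklearn.metrics import precision_score, recall_score

    y_actual    = [1, 1, 0, 1, 0, 0, 1, 0]
    y_predicted = [1, 0, 0, 1, 1, 0, 1, 0]

    tp, fp, fn = 3, 1, 1            # counts from the example above
    print(tp / (tp + fp))           # precision = 0.75
    print(tp / (tp + fn))           # recall    = 0.75

    # Same numbers via scikit-learn:
    print(precision_score(y_actual, y_predicted))   # 0.75
    print(recall_score(y_actual, y_predicted))      # 0.75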

Precision and recall are typically inversely related: as precision increases, recall tends to fall, and vice versa. The right trade-off between precision and recall varies from problem to problem. For example, in a legal document classification problem, the model needs high recall, as it must retrieve as many of the relevant documents as possible, even at the cost of some false alarms.
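One way to see the trade-off in action is to sweep the decision threshold of a probabilistic classifier. The sketch below uses a simulated dataset and logistic regression (both chosen arbitrarily for illustration); raising the threshold typically raises precision and lowers recall:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score, recall_score

    X, y = make_classification(n_samples=1000, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X, y)
    scores = model.predict_proba(X)[:, 1]    # P(class = positive)

    for threshold in [0.3, 0.5, 0.7, 0.9]:   # stricter and stricter
        y_pred = (scores >= threshold).astype(int)
        print(threshold,
              round(precision_score(y, y_pred), 2),
              round(recall_score(y, y_pred), 2))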
