A Data Science Central Community
Underfitting : If our algorithm performs badly even on the points in our own data set, then the algorithm is underfitting the data set. This can be checked easily through the cost function. The cost function in linear regression is half the mean squared error; e.g., if the mean squared error is c, the cost is 0.5c. If in an experiment the cost ends up high even after many iterations, then chances are we have an underfitting problem. We can say that the learning algorithm is not a good match for the problem. Underfitting is also known as high bias (a strong bias towards its hypothesis). In other words, the hypothesis space the learning algorithm explores is too small to properly represent the data.
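As a minimal sketch of the cost described above, the snippet below computes the mean squared error and then halves it. The function names and the toy numbers are assumptions made up for illustration; they are not from the original post.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    m = len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / m


def cost(predictions, targets):
    """Linear-regression cost as described in the text: half the MSE."""
    return 0.5 * mean_squared_error(predictions, targets)


# Toy data (hypothetical): a constant hypothesis that always predicts 2.0.
targets = [1.0, 2.0, 3.0, 4.0]
predictions = [2.0, 2.0, 2.0, 2.0]

mse = mean_squared_error(predictions, targets)
J = cost(predictions, targets)
print(mse, J)
```

A cost that stays high on the training data itself, as it does for this constant hypothesis, is the symptom of underfitting the paragraph describes.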
How to avoid underfitting :
More data will not generally help. It will, in fact, likely increase the training error. Instead, we should add more features, because that expands the hypothesis space. This includes making new features from existing features. In the same way, more parameters may also expand the hypothesis space.
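The effect of adding a feature can be sketched with plain gradient descent on a toy data set. Everything here is an assumption chosen for illustration: the data (y = x^2), the learning rate, the iteration count, and the helper names. A straight-line hypothesis keeps a high cost no matter how long it trains, while adding the derived feature x^2 lets the cost fall to nearly zero.

```python
def predict(theta, row):
    """Linear hypothesis: dot product of parameters and features."""
    return sum(t * f for t, f in zip(theta, row))


def cost(theta, X, y):
    """Half the mean squared error over the data set."""
    m = len(y)
    return sum((predict(theta, row) - yi) ** 2 for row, yi in zip(X, y)) / (2 * m)


def gradient_descent(X, y, lr=0.05, iterations=2000):
    """Batch gradient descent on the half-MSE cost, starting from zeros."""
    m, n = len(y), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        errors = [predict(theta, row) - yi for row, yi in zip(X, y)]
        grad = [sum(e * row[j] for e, row in zip(errors, X)) / m for j in range(n)]
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


# Hypothetical toy data: the true relationship is quadratic.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x ** 2 for x in xs]

X_line = [[1.0, x] for x in xs]          # features: intercept, x
X_quad = [[1.0, x, x ** 2] for x in xs]  # adds a new feature built from x

cost_line = cost(gradient_descent(X_line, ys), X_line, ys)
cost_quad = cost(gradient_descent(X_quad, ys), X_quad, ys)
print(cost_line, cost_quad)
```

The line model settles at a cost of about 6 because the best it can do is predict the mean of y, whereas the model with the extra feature can represent the data exactly.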
Overfitting : If our algorithm works well on the points in our data set, but not on new points, then the algorithm is overfitting the data set. Overfitting can be checked easily by splitting the data set so that 90% of the data is in a training set and 10% is in a cross-validation set. Train on the training set, then measure the cost on the cross-validation set. If the cross-validation cost is much higher than the training cost, then chances are we have an overfitting problem. In other words, the hypothesis space is too large, and perhaps some features are misleading the learning algorithm.
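The 90/10 check above could look roughly like the following sketch. The helper names, the random seeds, the noisy toy data, and the use of a nearest-neighbour "memorizing" model are all assumptions for illustration; the post does not prescribe a particular model. A model that memorizes its training points is used only because it makes the gap between training cost and cross-validation cost easy to see.

```python
import random


def train_cv_split(examples, cv_fraction=0.1, seed=0):
    """Shuffle a copy of the data and hold out cv_fraction for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_cv = max(1, round(len(shuffled) * cv_fraction))
    return shuffled[n_cv:], shuffled[:n_cv]


def half_mse(model, examples):
    """Half the mean squared error of model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in examples) / (2 * len(examples))


# Hypothetical noisy data: y = 2x plus Gaussian noise.
rng = random.Random(42)
data = [(float(x), 2.0 * x + rng.gauss(0.0, 1.0)) for x in range(100)]
train_set, cv_set = train_cv_split(data)


def memorizing_model(x):
    """Return the target of the closest training point (1-nearest neighbour)."""
    nearest = min(train_set, key=lambda point: abs(point[0] - x))
    return nearest[1]


train_cost = half_mse(memorizing_model, train_set)
cv_cost = half_mse(memorizing_model, cv_set)
print(len(train_set), len(cv_set), train_cost, cv_cost)
```

Here the training cost is exactly zero because every training point is its own nearest neighbour, while the cross-validation cost is not, which is the pattern the paragraph treats as a sign of overfitting. How large a gap counts as "much higher" is a judgment call that the post leaves open.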
How to avoid overfitting :
To avoid overfitting, add regularization if there are many features. Regularization forces the magnitudes of the parameters to be smaller (shrinking the hypothesis space). To do this, add a new term to the cost function that penalizes the magnitudes of the parameters.
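One common form of such a penalty is the squared (L2) penalty, sketched below. The exact formula in the original post is not available, so this is an assumption: the cost is taken to be (squared error + lam * sum of squared parameters) / (2m), with the intercept left out of the penalty. The name `lam` for the regularization strength and the toy data are likewise hypothetical.

```python
def regularized_cost(theta, X, y, lam):
    """Half-MSE cost plus an L2 penalty on every parameter except the intercept.

    Assumed form: (sum of squared errors + lam * sum(theta_j ** 2, j >= 1)) / (2m)
    """
    m = len(y)
    squared_error = sum(
        (sum(t * f for t, f in zip(theta, row)) - yi) ** 2
        for row, yi in zip(X, y)
    )
    penalty = lam * sum(t ** 2 for t in theta[1:])  # theta[0] is the intercept
    return (squared_error + penalty) / (2 * m)


# Hypothetical toy data that theta = [0, 1] fits exactly.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 3.0]
theta = [0.0, 1.0]

unpenalized = regularized_cost(theta, X, y, lam=0.0)
penalized = regularized_cost(theta, X, y, lam=3.0)
print(unpenalized, penalized)
```

Even a perfect fit pays a price once `lam` is positive, so minimizing this cost pushes the parameters toward smaller magnitudes. Choosing `lam` is usually done by comparing the cross-validation cost across several candidate values.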
Comment
The elephant in the room - and the dirty little secret that most data science people would rather disown. It's an absolute certainty that there are organizations using models that are nothing more than curve fits. Same applies to the use of stats - read http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0... for further enlightenment.
I like this post a lot, as it exposes the Achilles heel of the #BigData giant. Regularization might be the biggest challenge not directly addressed or solved in the #BigData and #Hadoop frenzied hype bubble. #DataScience extends and builds on #ComputerScience, #MathematicalStatistics, and #ComputationalMathematics, as opposed to partially replacing them. Fast Fourier Transform and Nyquist–Shannon sampling theorem should not be lost or forgotten in the growing #BigData pile of #Hadoop.
© 2019 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC