Sunday, June 7, 2020

Bias Variance Tradeoffs - Classic

  • More training examples fix high variance but not high bias.
  • Fewer features fix high variance but not high bias.
  • Additional features fix high bias but not high variance.
  • The addition of polynomial and interaction features fixes high bias but not high variance.
  • When using regularization, decreasing lambda can fix high bias and increasing lambda can fix high variance (lambda is the regularization parameter; see the sketch after this list).
  • When using neural networks, small networks are more prone to under-fitting and big networks are more prone to over-fitting. Using a cross-validation set to compare network sizes is one way to choose between them.
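As a rough illustration of how to diagnose bias versus variance (a minimal sketch, with made-up synthetic data; scikit-learn's Ridge is used here, where alpha plays the role of lambda), compare training and cross-validation error as lambda varies:

# Sketch: reading off high bias vs. high variance from train / cross-validation
# error at different regularization strengths lambda. Synthetic data only;
# assumes NumPy and scikit-learn are available.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=200)
X = np.hstack([x ** d for d in range(1, 9)])  # polynomial features add capacity

X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.4, random_state=0)

for lam in (1e-4, 1e-2, 1.0, 100.0):
    model = Ridge(alpha=lam).fit(X_train, y_train)
    err_train = mean_squared_error(y_train, model.predict(X_train))
    err_cv = mean_squared_error(y_cv, model.predict(X_cv))
    # Train error low but CV error much higher -> high variance (raise lambda,
    # use fewer features, or get more data). Both errors high and close
    # together -> high bias (lower lambda or add features).
    print(f"lambda={lam:g}: train MSE={err_train:.3f}, cv MSE={err_cv:.3f}")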


Error Metrics for Skewed Classes

It is sometimes difficult to tell whether a reduction in error is actually an improvement of the algorithm.
  • For example: in predicting a cancer diagnosis where only 0.5% of the examples have cancer, we find our learning algorithm has a 1% error. However, if we were to simply classify every single example as 0, our error would drop to 0.5% even though we have not actually improved the algorithm.
This usually happens with skewed classes; that is, when our class is very rare in the entire data set.
Put another way, we have many more examples from one class than from the other.
For this we can use Precision/Recall.
  • Predicted: 1, Actual: 1 --- True positive
  • Predicted: 0, Actual: 0 --- True negative
  • Predicted: 0, Actual: 1 --- False negative
  • Predicted: 1, Actual: 0 --- False positive
Precision: of all the patients for whom we predicted y=1, what fraction actually has cancer?
\dfrac{\text{True Positives}}{\text{Total number of predicted positives}} = \dfrac{\text{True Positives}}{\text{True Positives}+\text{False positives}}
Recall: Of all the patients that actually have cancer, what fraction did we correctly detect as having cancer?
\dfrac{\text{True Positives}}{\text{Total number of actual positives}}= \dfrac{\text{True Positives}}{\text{True Positives}+\text{False negatives}}
These two metrics give us a better sense of how our classifier is doing. We want both precision and recall to be high.
In the example at the beginning of the section, if we classify all patients as 0, then our recall will be \dfrac{0}{0 + \text{False negatives}} = 0, so despite having a lower error percentage, we can quickly see the classifier has worse recall.
\text{Accuracy} = \dfrac{\text{True positives} + \text{True negatives}}{\text{Total population}}
Note: if an algorithm predicts only negatives, as it does in one of the exercises, precision is undefined because it would require dividing by zero; in that case the F1 score is undefined as well.
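A minimal sketch of these definitions (assuming NumPy; the 1000-example arrays below are made up to mirror the 0.5%-positives scenario above):

import numpy as np

def precision_recall_accuracy(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
    tn = np.sum((y_pred == 0) & (y_true == 0))   # true negatives
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
    # Precision is undefined when there are no predicted positives (see the note above).
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, accuracy

# A classifier that labels everything 0 on data with 0.5% positives:
y_true = np.zeros(1000, dtype=int)
y_true[:5] = 1
y_pred = np.zeros(1000, dtype=int)
print(precision_recall_accuracy(y_true, y_pred))   # (None, 0.0, 0.995): 99.5% accuracy, 0 recall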

Trading Off Precision and Recall

We might want to predict y=1 (cancer) only when we are quite confident. With logistic regression, one way to do this is to increase our threshold:
  • Predict 1 if: h_\theta(x) \geq 0.7
  • Predict 0 if: h_\theta(x) < 0.7
This way, we only predict cancer if we estimate at least a 70% probability that the patient has it.
Doing this, we will have higher precision but lower recall (refer to the definitions in the previous section).
In the opposite example, we can lower our threshold:
  • Predict 1 if: h_\theta(x) \geq 0.3
  • Predict 0 if: h_\theta(x) < 0.3
That way, we rarely miss patients who actually have cancer. This will give higher recall but lower precision.
The greater the threshold, the greater the precision and the lower the recall.
The lower the threshold, the greater the recall and the lower the precision.
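As a rough illustration of this trade-off (a sketch only: the synthetic imbalanced dataset and logistic regression model below are made up, not from the course exercises), sweep a few thresholds over predicted probabilities and watch precision and recall move in opposite directions:

# Sketch: sweeping the decision threshold to see precision rise and recall fall.
# Assumes NumPy and scikit-learn; the data here is synthetic and imbalanced.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
probs = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

for threshold in (0.3, 0.5, 0.7, 0.9):
    y_pred = (probs >= threshold).astype(int)        # predict 1 only above the threshold
    p = precision_score(y, y_pred, zero_division=0)  # report undefined precision as 0
    r = recall_score(y, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")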
In order to turn these two metrics into one single number, we need a way to combine them.
One way is to take the average:
\dfrac{P+R}{2}
This does not work well. A classifier that almost always predicts y=0 has near-zero recall, yet its high precision can still pull the average up; if we instead predict all examples as y=1, the recall of 1 brings the average up despite a very low precision.
A better way is to compute the F Score (or F1 score):
\text{F Score} = 2\dfrac{PR}{P + R}
In order for the F Score to be large, both precision and recall must be large.
We should choose the threshold using precision and recall measured on the cross-validation set, so as not to bias our test set.
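A sketch of that selection step (the function names and threshold grid here are my own, assuming predicted probabilities probs_cv and labels y_cv for the cross-validation set as NumPy arrays):

# Sketch: pick the threshold with the best F score on the cross-validation set.
import numpy as np

def f_score(p, r):
    # F score = 2PR / (P + R); treated as 0 when precision and recall are both 0.
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def pick_threshold(probs_cv, y_cv, thresholds=np.linspace(0.05, 0.95, 19)):
    best_t, best_f = None, -1.0
    for t in thresholds:
        y_pred = (probs_cv >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_cv == 1))
        fp = np.sum((y_pred == 1) & (y_cv == 0))
        fn = np.sum((y_pred == 0) & (y_cv == 1))
        p = tp / (tp + fp) if (tp + fp) > 0 else 0.0  # undefined precision treated as 0
        r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        if f_score(p, r) > best_f:
            best_t, best_f = t, f_score(p, r)
    return best_t, best_f  # report final performance on the test set afterwards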

