Bias - Variance
- High Bias - doesn't do well even on the training set. Fixes are:
- Bigger network
- Training Longer
- More complicated NN architecture
- High Variance - does well on the training set but not on the test set, i.e. performance doesn't generalize well. Fixes are:
- More data
- Data augmentation
- Regularization
- In deep learning, the classical bias/variance tradeoff largely goes away
- Getting a bigger network almost always reduces your bias without hurting your variance, as long as you regularize appropriately
- Getting more data almost always reduces variance
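The diagnosis above can be sketched as a small helper. This is only an illustration: the function name `diagnose`, the `base_error` parameter, and the `gap` threshold of 0.02 are assumptions, not values from these notes.

```python
def diagnose(train_error, test_error, base_error=0.0, gap=0.02):
    """Return the likely problems given error rates in [0, 1].

    base_error (the error an ideal predictor would get) and gap (how
    large a difference counts as a problem) are hypothetical knobs.
    """
    problems = []
    # Doing poorly on the training set itself suggests high bias.
    if train_error - base_error > gap:
        problems.append("high bias")
    # Doing much worse on the test set than on the training set
    # suggests high variance.
    if test_error - train_error > gap:
        problems.append("high variance")
    return problems


if __name__ == "__main__":
    print(diagnose(0.15, 0.16))  # poor fit on the training set
    print(diagnose(0.01, 0.11))  # good fit that does not generalize
```

A model can show both problems at once, in which case the bias fixes (bigger network, longer training) are usually tried first.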
Regularization fixes high variance
- L2 norm - With a high lambda, many weights are pushed close to zero, which reduces the impact of a lot of hidden units and makes the NN behave more like a simpler model such as logistic regression
- Inverted Dropout
- Units are randomly eliminated in dropout, so a unit can't rely on any one input feature. It needs to spread out the weights
- Different keep_prob values can be used for different layers. A lower keep_prob on layers with more inputs means more regularization where overfitting is more likely
- Since a fraction (1 - keep_prob) of the units is dropped, the inversion step (dividing the remaining activations by keep_prob) ensures that the expected values stay the same
- Dropout first became popular in the CV field, which has a lot of inputs. That doesn't mean dropout should be used everywhere blindly
- Early stopping
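The inversion step in inverted dropout can be sketched in plain Python. This is a minimal illustration for one layer's activations, assuming a flat list of floats; the function name and the optional `rng` parameter are hypothetical, and a real implementation would use vectorized arrays.

```python
import random


def inverted_dropout(activations, keep_prob, rng=random):
    """Apply inverted dropout to one layer's activations (training only)."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    out = []
    for a in activations:
        if rng.random() < keep_prob:
            # Kept unit: scale up so the expected value is unchanged.
            out.append(a / keep_prob)
        else:
            # Dropped unit contributes nothing on this pass.
            out.append(0.0)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    dropped = inverted_dropout([1.0] * 10, keep_prob=0.8, rng=rng)
    print(dropped)
```

Dropout is applied only during training. At test time the activations are used as they are, which is why the scaling is done during training.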
Speed up training
- Data normalization - if all your features are on a similar scale, it will help your learning algorithm converge faster
- Vanishing/Exploding gradients - in very deep networks the gradients can become very small or very large. If the gradients become too small, gradient descent will take a long time
- Weight Initialization
- Xavier Initialization
- The variance scaling used in the initialization could also be treated as a hyperparameter, and tuning it can help you train your NN much more quickly
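A minimal sketch of Xavier-style initialization follows. It assumes the common variant that draws weights from a Gaussian with variance 1 / n_in (the original Glorot formulation uses 2 / (n_in + n_out)); the function name and the `rng` parameter are hypothetical.

```python
import math
import random


def xavier_init(n_in, n_out, rng=random):
    """Return an n_out x n_in weight matrix with variance 1 / n_in.

    Scaling by the number of inputs keeps each unit's weighted sum on
    a similar scale regardless of layer width, which helps against
    vanishing and exploding gradients.
    """
    std = math.sqrt(1.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)]
            for _ in range(n_out)]


if __name__ == "__main__":
    W = xavier_init(n_in=4, n_out=2, rng=random.Random(1))
    for row in W:
        print(row)
```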
Optimization Algorithms
They enable you to train your algorithms much faster
- Mini-batch gradient descent
- Batch gradient descent has to process the entire training set before it can take a single step. This is slow when m = 5 million or 50 million
- It would be much better if the updates could start earlier. Mini-batches = baby training sets having, say, 1000 examples each
- If mini batch size = m, we have batch gradient descent
- If mini batch size = 1, we have stochastic gradient descent. This is very noisy
- Exponentially weighted moving averages
- Gradient Descent with momentum
- Normal GD will take a lot of steps and slowly oscillate towards the minimum
- On the vertical axis, you want learning to be slower. The positive and negative oscillations average out
- On the horizontal axis, you want learning to be faster, and it is, because the updates gain momentum
- RMSProp and Adam generalize well across several architectures. Most other proposed algorithms can't reliably better gradient descent
- RMS Prop
- Adam Optimization Algorithms
- It combines momentum and RMSProp
- Learning rate decay
- In the beginning iterations, you can take larger steps and as you approach the minima you can start taking smaller steps
- Otherwise, your algorithm may never really converge and can keep wandering around the minimum
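How momentum, RMSProp, and learning rate decay fit together in Adam can be sketched as below. This is an illustration on a toy two-parameter problem, not a production optimizer. The function name `adam_minimize`, the decay schedule alpha / (1 + decay_rate * t), and the default values are assumptions.

```python
import math


def adam_minimize(grad_fn, params, alpha=0.1, beta1=0.9, beta2=0.999,
                  epsilon=1e-8, decay_rate=0.0, steps=500):
    """Minimize a function with Adam, given a function for its gradient."""
    params = list(params)
    v = [0.0] * len(params)  # momentum: moving average of gradients
    s = [0.0] * len(params)  # RMSProp: moving average of squared gradients
    for t in range(1, steps + 1):
        grads = grad_fn(params)
        # Learning rate decay: smaller steps in later iterations.
        lr = alpha / (1.0 + decay_rate * t)
        for i, g in enumerate(grads):
            v[i] = beta1 * v[i] + (1.0 - beta1) * g
            s[i] = beta2 * s[i] + (1.0 - beta2) * g * g
            # Bias correction for the zero-initialized averages.
            v_hat = v[i] / (1.0 - beta1 ** t)
            s_hat = s[i] / (1.0 - beta2 ** t)
            params[i] -= lr * v_hat / (math.sqrt(s_hat) + epsilon)
    return params


if __name__ == "__main__":
    # Toy bowl f(x, y) = x^2 + 10 y^2, much steeper along y than x.
    grad = lambda p: [2.0 * p[0], 20.0 * p[1]]
    print(adam_minimize(grad, [5.0, -3.0], decay_rate=0.01))
```

The `v` update alone gives gradient descent with momentum, and the `s` update alone gives RMSProp. In mini-batch training, `grad_fn` would compute the gradient on one mini-batch per step.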
Hyperparameters
Listed in order of importance
- Learning rate (Alpha)
- Momentum (Beta)
- Number of hidden units
- mini-batch size
- Number of layers
- Learning rate decay
- Beta1, Beta2, Epsilon - adam algorithm
Tuning Process
- In classical ML, grid search was used to choose hyperparams. In DL, it is hard to know in advance which of the hyperparams matter most for your problem. Hence you should sample the hyperparams at random and not use grid search. For example, if you have 2 hyperparams and use a 5 x 5 grid, you train 25 models but try only 5 different values of the most important hyperparam. With random sampling, the same 25 models try 25 different values
- Coarse to fine sampling scheme
- Search the learning rate on a log scale: sample the exponent uniformly at random rather than sampling the value itself
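Random search with log-scale sampling can be sketched as follows. The search ranges (learning rate between 1e-4 and 1, hidden units between 50 and 100) and the function names are assumptions chosen for illustration.

```python
import random


def sample_learning_rate(low_exp=-4.0, high_exp=0.0, rng=random):
    """Sample alpha so each decade in the range is equally likely."""
    r = rng.uniform(low_exp, high_exp)  # uniform in the exponent
    return 10 ** r


def random_search_points(n, rng=random):
    """Draw n random (alpha, hidden_units) pairs instead of a fixed grid."""
    points = []
    for _ in range(n):
        alpha = sample_learning_rate(rng=rng)
        hidden_units = rng.randint(50, 100)  # a linear scale is fine here
        points.append((alpha, hidden_units))
    return points


if __name__ == "__main__":
    for alpha, hidden_units in random_search_points(5, random.Random(0)):
        print(f"alpha={alpha:.5f} hidden_units={hidden_units}")
```

Sampling alpha uniformly between 0.0001 and 1 would instead spend about 90% of the trials between 0.1 and 1. For the coarse to fine scheme, the same functions can be rerun with a narrower exponent range around the best points found.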