Wednesday, April 8, 2020

Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization - Andrew Ng course

Bias - Variance

      • High Bias - doesn't do well on the training set. Fixes are:
        • Bigger network
        • Training longer
        • More complicated NN architecture
      • High Variance - does well on the training set but not on the dev/test set, i.e. the performance doesn't generalize well. Fixes are:
        • More data
        • Data augmentation
        • Regularization
      • The classical bias/variance tradeoff is much less of an issue in deep learning
        • Getting a bigger network almost always reduces bias without hurting variance (as long as you regularize properly)
        • Getting more data almost always reduces variance without hurting bias

Regularization fixes high variance

      • L2 regularization - with a large lambda, the weights are pushed close to zero, so many hidden units have little effect and the network behaves more like a simpler, nearly linear model (closer to logistic regression)
      • Inverted dropout (see the sketch after this list)
        • Units are randomly eliminated in dropout, so the network can't rely on any one feature and has to spread out the weights
        • Different keep_prob values can be used for different layers. A lower keep_prob on layers with more units means more regularization
        • Since a fraction (1 - keep_prob) of the units are dropped, dividing the remaining activations by keep_prob (the inversion step) keeps their expected value the same
        • Dropout became popular in computer vision, where inputs are high-dimensional and data is often limited. That doesn't mean dropout should be used everywhere blindly
      • Early stopping
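
A minimal numpy sketch of the inverted-dropout step described above; the function name, layer shape and keep_prob value are illustrative assumptions, not code from the course.

```python
import numpy as np

def inverted_dropout(a, keep_prob=0.8):
    """Apply inverted dropout to a layer's activations `a` at training time."""
    mask = np.random.rand(*a.shape) < keep_prob  # 1 = keep the unit, 0 = drop it
    a = a * mask          # randomly eliminate a fraction (1 - keep_prob) of the units
    a = a / keep_prob     # inversion step: rescale so the expected value of a is unchanged
    return a

# Illustrative usage: a layer of 50 hidden units over a batch of 10 examples
a3 = np.random.randn(50, 10)
a3 = inverted_dropout(a3, keep_prob=0.8)  # no dropout (and no rescaling) at test time
```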

Speed up training

      • Data normalization - if all the input features are on a similar scale, the cost surface is more symmetric and gradient descent converges faster
      • Vanishing/exploding gradients - in very deep networks the activations and gradients can shrink or grow exponentially with depth, making gradient descent very slow or unstable
      • Weight initialization (see the sketch below)
        • Xavier initialization - scale the initial weights of each layer by the number of inputs feeding into it, which partially mitigates vanishing/exploding gradients
        • The variance scaling factor can also be treated as a hyperparameter; a good choice helps you train your NN much more quickly
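
A small sketch of the fan-in-scaled initialization mentioned above, assuming Xavier scaling (variance 1/n_in, commonly used with tanh) or the related He scaling (2/n_in, commonly used with ReLU); the function name and layer sizes are made up for illustration.

```python
import numpy as np

def initialize_weights(layer_dims, scheme="xavier"):
    """Initialize each layer's weights with variance scaled by its fan-in.

    layer_dims: sizes of the layers, e.g. [n_x, n_h1, n_h2, n_y].
    scheme: "xavier" uses Var(W) = 1/n_in, "he" uses Var(W) = 2/n_in.
    """
    params = {}
    for l in range(1, len(layer_dims)):
        fan_in = layer_dims[l - 1]
        factor = 2.0 if scheme == "he" else 1.0
        params["W" + str(l)] = np.random.randn(layer_dims[l], fan_in) * np.sqrt(factor / fan_in)
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

# Illustrative 3-layer network: 4 inputs, two hidden layers, 1 output unit
params = initialize_weights([4, 10, 5, 1], scheme="he")
```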

Optimization Algorithms

They enable you to train your models much faster
  • Mini-batch gradient descent
    • Gradient descent processes the entire training set before taking a single step. That becomes very slow when m = 5 million or 50 million examples
    • It is much better to start making progress before processing all the data. Mini-batches are small "baby" training sets of, say, 1,000 examples each (see the sketch after this list)
    • If the mini-batch size = m, we have batch gradient descent
    • If the mini-batch size = 1, we have stochastic gradient descent, which is very noisy and loses the speedup from vectorization
  • Exponentially weighted moving averages: v_t = beta * v_(t-1) + (1 - beta) * theta_t, which roughly averages over the last 1/(1 - beta) values
    • Gradient Descent with momentum
      • Normal GD takes a lot of steps and slowly oscillates towards the minimum
      • In the vertical (oscillating) direction we want learning to be slower - averaging the gradients cancels out the positive and negative steps
      • In the horizontal direction (towards the minimum) we want learning to be faster - the averaged gradients build up momentum
    • RMSProp and Adam are among the few optimization algorithms that generalize well across many architectures; most proposed algorithms fail to beat well-tuned gradient descent
    • RMSProp - divides the update by a moving average of the squared gradients, which damps the oscillations
    • Adam optimization algorithm (see the update-rule sketch after this list)
      • It takes momentum and RMSProp and puts them together
    • Learning rate decay
      • In the early iterations you can take larger steps, and as you approach the minimum you start taking smaller steps
      • Otherwise, your algorithm may not really converge and can keep wandering around the minimum
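
A minimal sketch of splitting the training set into mini-batches, as described in the mini-batch bullet above; the helper name, array shapes and batch size are illustrative assumptions.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size=1000, seed=0):
    """Shuffle (X, Y) and split them into mini-batches of `batch_size` columns.

    X: (n_features, m) inputs, Y: (1, m) labels, with one example per column.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)              # shuffle the examples
    X, Y = X[:, perm], Y[:, perm]
    batches = []
    for start in range(0, m, batch_size):  # the last mini-batch may be smaller
        batches.append((X[:, start:start + batch_size],
                        Y[:, start:start + batch_size]))
    return batches

# Illustrative usage: 5,000 examples with 3 features each
X = np.random.randn(3, 5000)
Y = np.random.randint(0, 2, (1, 5000))
mini_batches = make_mini_batches(X, Y, batch_size=1000)
```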
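
A sketch of the update rules discussed above for a single parameter matrix, combining momentum and RMSProp into Adam, plus one common learning-rate decay schedule; the function names are my own and the hyperparameter values are the usual defaults.

```python
import numpy as np

def adam_update(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for parameter `w` given its gradient `dw`.

    v: exponentially weighted average of the gradients (momentum term)
    s: exponentially weighted average of the squared gradients (RMSProp term)
    t: iteration count (starting at 1), used for bias correction
    """
    v = beta1 * v + (1 - beta1) * dw        # momentum: smooth the gradients
    s = beta2 * s + (1 - beta2) * dw ** 2   # RMSProp: track the squared gradients
    v_corr = v / (1 - beta1 ** t)           # bias correction for the early iterations
    s_corr = s / (1 - beta2 ** t)
    w = w - alpha * v_corr / (np.sqrt(s_corr) + eps)
    return w, v, s

def decayed_alpha(alpha0, decay_rate, epoch):
    """One common learning-rate decay schedule: alpha shrinks as the epoch number grows."""
    return alpha0 / (1 + decay_rate * epoch)
```

The bias-correction terms matter mainly in the first few iterations, when v and s are still close to their zero initialization.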

Hyperparameters

Listed in order of importance
  • Learning rate (Alpha)
  • Momentum (Beta)
  • Number of hidden units
  • mini-batch size
  • Number of layers
  • Learning rate decay
  • Beta1, Beta2, Epsilon - Adam algorithm (rarely tuned; the defaults of 0.9, 0.999 and 1e-8 usually work well)

Tuning Process

  • In classical ML, grid search was used to choose hyperparameters. In deep learning it is hard to know in advance which hyperparameters matter most for your problem, so sample values at random rather than on a grid. For example, with 2 hyperparameters, grid search over 25 models tries only 5 distinct values of the most important hyperparameter, whereas random sampling tries 25
  • Coarse-to-fine sampling scheme - zoom in on the region of hyperparameter space that worked best and sample more densely there
  • Sample the learning rate on a log scale, i.e. pick an exponent uniformly at random and set alpha = 10^exponent (see the sketch below)
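
A tiny sketch of the log-scale sampling idea above, using the example ranges of 10^-4 to 10^0 for the learning rate and 0.9 to 0.999 for the momentum parameter; the exact ranges are assumptions you would adapt to your problem.

```python
import numpy as np

# Learning rate: sample uniformly on a log scale between 1e-4 and 1
r = -4 * np.random.rand()        # r is uniform in [-4, 0]
alpha = 10 ** r                  # alpha is log-uniform in [1e-4, 1]

# Momentum beta in [0.9, 0.999]: sample 1 - beta on a log scale instead
r = -3 + 2 * np.random.rand()    # r is uniform in [-3, -1]
beta = 1 - 10 ** r               # 1 - beta is log-uniform in [0.001, 0.1], so beta is in [0.9, 0.999]
```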
