Sunday, May 10, 2020

The future of NLP

This is a summary of Hugging Face's talk on "The future of NLP".

  1. The overall trend is that model sizes are growing too fast. Current state-of-the-art models have 1-10 billion parameters, so these NLP models can no longer be run on commodity hardware, creating a huge gap between academia and industry.
  2. A three-pronged approach to shrinking models: distillation, pruning, and quantization.
  3. Model distillation reduces the size of a teacher model by training a smaller student model to mimic it, without losing much of the teacher's predictive power. Knowledge distillation cuts inference cost and improves power efficiency (see the distillation-loss sketch after this list).
  4. Pruning works directly on the teacher model, removing weights to make it smaller (a magnitude-pruning sketch follows the list):
    1. Head pruning
    2. Weights pruning
    3. Layer pruning - transformer layers are repetitions of the same module with shortcut (residual) connections, so removing a whole layer is less damaging than it would be in architectures without shortcut connections.
    4. GPUs are optimized for dense matrix multiplications, so these sparse (pruned) models can run 3-4 times slower on GPUs. Graphcore chips are specially designed for sparse computation.
  5. Quantization - neural networks work well on int8 values, not just floats: rescale all the values from float to integer using a scale factor and a zero-point conversion (see the quantization sketch after this list).
  6. A comparison study of XLNet and BERT with large models.
  7. RoBERTa is BERT trained on more data, and it beat XLNet. BERT: 137 billion tokens (13 GB corpus); RoBERTa: 2.2 trillion tokens (160 GB corpus).
  8. Winograd Schema Challenge - you can apply heuristics to Wikipedia to create artificially augmented datasets for the model to learn from, then fine-tune the model on them.
  9. Scaling laws for neural language models 
  10. Power laws fit well if you exclude the embedding parameters when counting model size. As you scale up parameters, dataset size, and compute, the model's loss improves predictably - a straight line on a log-log plot (see the power-law sketch after this list).
  11. One of the goals of transfer learning is to make models work well on small datasets
  12. Sample efficiency: how much better your model gets with each additional example
  13. The metric used to measure this is online codelength, which looks at learning through the lens of compression (a rough sketch follows the list)
  14. For SQuAD, in terms of sample efficiency: BERT pretrained on a QA dataset > BERT pretrained on Wikipedia > randomly initialized BERT
  15. "oLMpics - On what Language Model Pre-training Captures"
  16. What we would like is "out-of-domain" generalization; what we have is "in-domain" generalization
  17. Compositionality 
    1. SCAN dataset
    2. PCFG dataset
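
Below is a minimal sketch of the knowledge-distillation loss from item 3, assuming a generic PyTorch setup; the temperature T and mixing weight alpha are illustrative choices, not values from the talk.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Hard-label term: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between temperature-softened
    # student and teacher distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so the soft term keeps a comparable gradient magnitude
    return alpha * hard + (1 - alpha) * soft
```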
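
A rough illustration of weight pruning from item 4, using simple magnitude pruning on a single tensor (the 30% sparsity level is an arbitrary example). The result stays a dense tensor full of zeros, which is why plain GPUs see little speedup, as noted in item 4.4.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.3) -> torch.Tensor:
    """Zero out the smallest-magnitude entries of a weight tensor."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    # Threshold at the k-th smallest absolute value, keep everything above it.
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)
```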
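
A small sketch of the scale/zero-point conversion described in item 5, quantizing a float array to int8 with NumPy; this is the generic affine-quantization recipe, not code shown in the talk.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map floats to int8: q = round(x / scale) + zero_point."""
    qmin, qmax = -128, 127
    scale = max((x.max() - x.min()) / (qmax - qmin), 1e-8)  # guard against constant input
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale
```

Storing q as int8 plus a single scale and zero point per tensor is what gives roughly a 4x memory saving over float32.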
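
To make item 10 concrete, here is a toy version of the parameter-count power law from "Scaling Laws for Neural Language Models" (item 9). The constants are roughly the values reported in that paper for non-embedding parameters; treat them as illustrative only.

```python
# L(N) ~= (N_c / N) ** alpha, where N is the non-embedding parameter count.
N_c, alpha = 8.8e13, 0.076  # illustrative constants, roughly as reported in the paper

def predicted_loss(n_params: float) -> float:
    return (N_c / n_params) ** alpha

for n in [1e8, 1e9, 1e10]:
    # log(loss) drops by the same amount for every 10x increase in parameters,
    # i.e. a straight line on a log-log plot.
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
```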
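
Finally, a rough sketch of the online codelength metric from item 13, assuming it is computed by training on growing prefixes of the data and charging the model for its loss on each next block; make_model, train, and eval_nll are hypothetical stand-ins for a real training loop.

```python
def online_codelength(chunks, make_model, train, eval_nll):
    """chunks: the dataset split into consecutive blocks (helpers are hypothetical).
    A more sample-efficient model pays fewer bits for the later blocks."""
    total_bits = 0.0
    for i in range(1, len(chunks)):
        model = make_model()                      # fresh model each step
        train(model, chunks[:i])                  # fit on everything seen so far
        total_bits += eval_nll(model, chunks[i])  # cost of encoding the next block
    return total_bits
```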