This is a summary of Hugging Face's talk on "The future of NLP".
- The overall trend is that model sizes are growing too fast. Current state-of-the-art models have 1-10 billion parameters. NLP models can no longer be run on commodity hardware, which is causing a huge gap between academia and industry
- Three-pronged approach: distillation, pruning and quantization
- Model distillation trains a smaller student model to mimic a larger teacher model, without losing much of the teacher's predictive power. Knowledge distillation reduces inference cost and power consumption.
- Pruning works directly on your teacher model and removes weights from it to make it smaller
- Head pruning
- Weight pruning
- Layer pruning - the layers are repetitions of the same module, and each has a shortcut connection. So removing a layer is less disruptive than it would be in architectures without shortcut connections.
- GPUs are optimized for dense matrix multiplications. If you run these sparse models on GPUs, they can be 3-4 times slower. Graphcore chips are specially designed for sparse models
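To make weight pruning concrete, here is a minimal sketch in plain Python. It assumes a magnitude criterion (drop the weights with the smallest absolute value), which is one common choice and not necessarily the one used in the talk. The function name `magnitude_prune` and the example matrix are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out roughly the `sparsity` fraction of weights with the smallest magnitude.

    Assumption: magnitude is the pruning criterion; other criteria exist.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)  # number of weights to remove
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]  # largest magnitude that still gets pruned
    # Ties at the threshold are pruned too, so slightly more than k weights may go.
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


# Hypothetical 2x3 weight matrix
W = [[0.9, -0.05, 0.3],
     [0.01, -0.7, 0.2]]
pruned = magnitude_prune(W, sparsity=0.5)
print(pruned)
```

The result keeps the same shape but holds zeros, which is why it only pays off on hardware or kernels that can skip the zeros.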
- Quantization - neural networks work well with int8 values, not just floats. Scale all the values from float to int using a scale factor and a zero-point offset.
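The float-to-int8 mapping with a zero point can be sketched as follows. This is a minimal illustration of affine quantization in plain Python, assuming a signed int8 range of [-128, 127] and a per-tensor scale. The helper names are hypothetical and do not come from any particular library.

```python
def quantize_params(x_min, x_max, q_min=-128, q_max=127):
    """Compute scale and zero point for mapping [x_min, x_max] onto [q_min, q_max]."""
    # Make sure the float range contains 0 so that 0.0 is represented exactly.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (q_max - q_min)
    if scale == 0.0:  # all values are zero; any positive scale works
        scale = 1.0
    zero_point = int(round(q_min - x_min / scale))
    return scale, zero_point


def quantize(values, scale, zero_point, q_min=-128, q_max=127):
    """Map floats to integers, clamping to the int8 range."""
    return [max(q_min, min(q_max, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate floats."""
    return [scale * (q - zero_point) for q in q_values]


# Hypothetical weights
weights = [-1.0, -0.3, 0.0, 0.5, 1.5]
scale, zp = quantize_params(min(weights), max(weights))
q = quantize(weights, scale, zp)
restored = dequantize(q, scale, zp)
print(q, restored)
```

The round trip is lossy: each restored value differs from the original by at most half a quantization step (`scale / 2`) as long as nothing is clamped.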
- A comparison study of XLNet and BERT with large models
- RoBERTa is BERT trained on more data, and it beats XLNet. BERT - 137 billion tokens (13 GB). RoBERTa - 2.2 trillion tokens (160 GB corpus).
- Winograd Schema Challenge - you can apply heuristics to Wikipedia to create artificially augmented datasets that the model can learn from. Training the model further on such data is called fine-tuning.
- Scaling laws for neural language models
- Power laws fit well if you exclude the embedding parameters when counting the size of the model. As you double the parameters, dataset size and compute, the model loss improves along a straight line on a log-log plot
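What "a straight line on a log-log plot" means can be shown with a small sketch. It assumes a loss of the form L(N) = (N_c / N)^alpha, where N is the non-embedding parameter count. The constants `n_c` and `alpha` below are placeholders chosen for illustration, not values taken from the talk.

```python
import math


def power_law_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Loss under an assumed power law in the non-embedding parameter count."""
    return (n_c / n_params) ** alpha


# Double the (hypothetical) model size four times, starting at 1M parameters.
sizes = [1e6 * 2 ** i for i in range(5)]
losses = [power_law_loss(n) for n in sizes]

# Each doubling multiplies the loss by the same factor, 2 ** -alpha.
ratios = [losses[i + 1] / losses[i] for i in range(len(losses) - 1)]

# On log-log axes the points lie on a line with slope -alpha.
slope = (math.log(losses[-1]) - math.log(losses[0])) / (
    math.log(sizes[-1]) - math.log(sizes[0])
)
print(ratios, slope)
```

So the improvement is constant per doubling in relative terms, which is linear only after taking logs of both the size and the loss.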
- One of the goals of transfer learning is to make the model work on small datasets
- Sample efficiency: how much better your model gets with one additional example
- The metric used to measure this is called "online code length". It views learning through the lens of compression: a model that learns faster needs fewer bits to encode the data
- For SQuAD, in terms of sample efficiency: BERT trained on a QA dataset > BERT trained on Wikipedia > randomly initialized BERT
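The idea behind online code length can be sketched in a few lines. The talk only names the metric, so the block-wise formulation here is an assumption: the labels are sent in blocks, each block is encoded with a model fitted on the earlier blocks, and the first block uses a uniform code. The "model" is a toy stand-in that only tracks label frequencies and ignores the inputs. A real setup would retrain the actual network after every block.

```python
import math
from collections import Counter


def online_code_length(labels, num_classes, block_size):
    """Bits needed to encode `labels` block by block with a model that keeps learning.

    Toy assumption: the model is a Laplace-smoothed label-frequency estimate.
    """
    counts = Counter()
    total_bits = 0.0
    for start in range(0, len(labels), block_size):
        block = labels[start:start + block_size]
        seen = sum(counts.values())
        for y in block:
            # Probability from earlier blocks only; uniform when nothing has been seen.
            p = (counts[y] + 1) / (seen + num_classes)
            total_bits += -math.log2(p)
        counts.update(block)  # "retrain" on the block that was just encoded
    return total_bits


# Hypothetical binary labels, skewed towards class 0
labels = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
bits = online_code_length(labels, num_classes=2, block_size=4)
uniform_bits = len(labels) * math.log2(2)
print(bits, uniform_bits)
```

A model that picks up the pattern from fewer examples assigns higher probability to later blocks, so its total code length is shorter. That is how the metric ties sample efficiency to compression.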
- On what language model pretraining captures - oLMpics
- What we would like is "out-of-domain" generalization; what we have is "in-domain" generalization
- Compositionality
- SCAN
- PCFG dataset