Screenshot of Andrew Ng's explanation of attention model from the deeplearning.ai course
The problem with regular encoder decoder architectures arise when we have long sentences because RNNs dont do well in these scenarios. For eg : while translating a long sentence humans probably dont read the whole sentence and then translate it. The human mind probably reads parts of the sentence and then processes the translation for that part. This leads us to attention models. While translating a word, it weighs in the inputs to the word differently.