I have no background in DL, so this is my attempt to understand what “Attention” actually is, to extract its absolute core essence so that even a middle schooler can understand it, and, at the end, to implement it in Python to simulate various scenarios.
I will try to lay out my thought process like an engineer reading a spec sheet to figure out how something works and where to use it. I am not going to emphasize the history or the reason why it came into existence.
Well, the first step is to know what “attention” literally means; I wouldn’t be moving forward without that.

The third example uses the verb “draw” (to “draw” someone’s attention); it seems as if this will be useful.
Next, I searched for “attention mechanism ELI5 site:reddit.com”, hoping to find an explanation that a five-year-old would understand. I was not disappointed:

With this, I have an intuition now. It is probably something where I have a lot of inputs, but not all of them are important. So I somehow manage to “focus” on the parts that matter, and those would implicitly turn out to be the ones that affect the output most. The rest would be noise, at least for that particular context.
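To make that intuition concrete, here is a minimal sketch of “focusing” as a weighted average. Everything in it is an assumption for illustration: the `focus` function, the input values, and the hand-picked importance numbers are made up, and a real attention mechanism would compute the importance values instead of having them typed in.

```python
def focus(inputs, importance):
    """Return a weighted average of inputs and the normalized weights.

    The importance values are normalized so the weights sum to 1.
    """
    total = sum(importance)
    weights = [w / total for w in importance]
    output = sum(w * x for w, x in zip(weights, inputs))
    return output, weights


# Hypothetical inputs: the middle one is the part that "matters".
inputs = [10.0, 200.0, 30.0]
importance = [1.0, 8.0, 1.0]  # hand-picked for illustration only

output, weights = focus(inputs, importance)
print("weights:", weights)
print("output:", output)
```

The output ends up dominated by the input with the largest weight, while the other inputs still contribute a little instead of being thrown away.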
Something interesting I found: https://assets.cambridge.org/97813165/13293/excerpt/9781316513293_excerpt.pdf

With this makeshift, half-baked intuition, I think it’s time to take a look at some published papers (I have no idea what jargon I am about to tackle).
Alright, so contrary to what you might expect, the first paper won’t be “Attention is All You Need”, but instead this one, which is listed in the references section of that very paper:

https://arxiv.org/pdf/1409.0473

After reading the introduction, I realized that there is a way to represent “sentences” as “vectors”. That process is done using an “encoder”, and the opposite is done using a “decoder” (the first highlighted section).
The next paragraph essentially hints at what happens with actual humans (refer to image.png), and there seems to be an analogy.
Now, I believe that the last paragraph is very important. They are stating that their proposed new model somehow
(soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated. (emphasis mine)
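To get a feel for what a “soft” search could mean compared to a “hard” one, here is a small sketch. It is my own illustration under stated assumptions, not the paper’s model: the source words and relevance scores are made up and hard-coded, whereas a real model would have to compute the scores. A hard search picks exactly one position; a soft search spreads weights that sum to 1 over all positions, here using a softmax.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    highest = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical source sentence and made-up relevance scores per position.
source = ["the", "cat", "sat"]
scores = [0.1, 2.0, 0.3]

# Hard search: commit to the single best position.
hard_choice = source[scores.index(max(scores))]

# Soft search: every position gets some weight, the relevant one gets most.
soft_weights = softmax(scores)

print("hard:", hard_choice)
for word, weight in zip(source, soft_weights):
    print(f"soft: {word} -> {weight:.3f}")
```

The soft version never fully discards a position; it only turns its influence down, which fits the “focus on what matters, treat the rest as noise” intuition from earlier.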