Over the last couple of weeks I’ve been trying to learn the details of the transformer architecture. Some resources I found useful:
In the spirit of the saying “A picture is worth a thousand words”, I decided to try to draw a visualization of the parts I found hardest to understand:
Each line in the diagram below depicts one (usually floating-point) number passing through the model. Note: some parts are left out to de-clutter the picture (for example layer normalization).
For comparison/reference, here are the hypothetical model parameters (see the code sketch after the list):
residual stream/model dimension: 3
internal head dimension: 2
number of heads: 2
number of [Attention + MLP] layers: 2
context length (in tokens): 2
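To make those numbers concrete, here is a minimal NumPy sketch of a forward pass with exactly these dimensions. It uses random weights and, like the diagram, leaves out layer normalization (and also causal masking), so it only illustrates the shapes flowing through the model, not the author’s actual code.

```python
# Minimal sketch of the toy model above: residual stream of width 3,
# two heads with internal dimension 2, two [Attention + MLP] layers,
# and a context of two tokens. Weights are random, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_heads, n_layers, n_tokens = 3, 2, 2, 2, 2

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(x, W_q, W_k, W_v, W_o):
    # x: (n_tokens, d_model) -> queries/keys/values: (n_tokens, d_head)
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = softmax(q @ k.T / np.sqrt(d_head), axis=-1)  # (n_tokens, n_tokens)
    return (scores @ v) @ W_o                              # back to (n_tokens, d_model)

def mlp(x, W_in, W_out):
    # Simple feed-forward block: expand, ReLU, project back.
    return np.maximum(x @ W_in, 0.0) @ W_out

# Residual stream: one row of d_model numbers per token.
x = rng.normal(size=(n_tokens, d_model))

for _ in range(n_layers):
    # Each head reads from the residual stream and adds its output back in.
    for _ in range(n_heads):
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        W_o = rng.normal(size=(d_head, d_model))
        x = x + attention_head(x, W_q, W_k, W_v, W_o)
    # MLP with a (conventional, assumed) 4x expansion of the residual width.
    W_in = rng.normal(size=(d_model, 4 * d_model))
    W_out = rng.normal(size=(4 * d_model, d_model))
    x = x + mlp(x, W_in, W_out)

print(x)  # final residual stream: 2 tokens x 3 numbers
```

Every number printed at the end corresponds to one of the lines leaving the top of the diagram.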
Hope this helps someone! If you have any questions or spot an error, send me an email.