Understanding Decoder-Only Transformers Part 1: Masked Self-Attention · Dev.to