PRACTICUM: http://bit.ly/DLSP20-12-3
We introduce attention, focusing on self-attention and the hidden-layer representations of the inputs that it produces. Then, we introduce the key-value store paradigm and discuss how to represent queries, keys, and values as rotations of an input; a minimal self-attention sketch follows below. Finally, we use attention to interpret the transformer architecture, taking a forward pass through a basic transformer (sketched after the timestamps below), and comparing the encoder-decoder paradigm to sequential architectures.
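
The snippet below is a minimal sketch of single-head self-attention, showing queries, keys, and values as learned linear transformations (the "rotations") of the same input; the class name `SelfAttention` and the sizes `d_model`/`d_head` are illustrative assumptions, not the notebook's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention(nn.Module):
    """Single-head self-attention: Q, K, V are linear maps of the same input."""

    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        # Queries, keys, and values are learned linear transformations of the input.
        self.W_q = nn.Linear(d_model, d_head, bias=False)
        self.W_k = nn.Linear(d_model, d_head, bias=False)
        self.W_v = nn.Linear(d_model, d_head, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        # Scaled dot-product: compare every query against every key.
        scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
        weights = F.softmax(scores, dim=-1)   # attention distribution over positions
        return weights @ v                    # weighted sum of the values


x = torch.randn(2, 5, 16)                     # 2 sequences, 5 tokens, 16-dim embeddings
out = SelfAttention(d_model=16, d_head=16)(x)
print(out.shape)                              # torch.Size([2, 5, 16])
```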
0:01:09 – Attention
0:17:36 – Key-value store
0:35:14 – Transformer and PyTorch implementation
0:54:00 – Q&A
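
For the "Transformer and PyTorch implementation" segment, here is a minimal sketch of a forward pass through a basic transformer encoder built from `torch.nn` layers rather than the practicum's own implementation; the sizes (`d_model=64`, 4 heads, 2 layers) are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 64, 4, 2        # illustrative sizes, not the lecture's
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

tokens = torch.randn(8, 10, d_model)         # (batch, seq_len, d_model) token embeddings
hidden = encoder(tokens)                     # self-attention + feed-forward in each layer
print(hidden.shape)                          # torch.Size([8, 10, 64])
```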