Linear Layers and Activation Functions in Transformer Models
This post is divided into three parts; they are: • Why Linear Layers and Activations are Needed in Transformers • Typical Design of the Feed-Forward Network • Variations of the Activation...