Freedom Preetham
1 min read · Jan 26, 2024


Thank you for your thoughtful review and kind words.

You've correctly noted that the authors observed a decline in performance when the normalization component was removed. It's important to contextualize that observation, though: the degradation shows up specifically in the evaluation cross-entropy loss over the course of training.

Referring to Figure 23 in the original paper, Pre-Layer Normalization (Pre-LN) appears to achieve the best results, which points back to the effectiveness of the traditional Transformer architecture.

The proposed simplifications to the Transformer block, guided by signal propagation theory, aim to address the 'scaling challenge' inherent to Transformers at very large parameter and token counts, a challenge characterized by training instability as models scale up.

It is important to recognize that models like ViT-22B are not at the scale where you typically encounter these scaling difficulties, which is why Pre-LN seems more effective in this instance.

Theoretically, normalization can be unnecessary if you carefully downscale the residual branches of the skip connections, or if you shape the attention matrices closer to the identity and make the MLP nonlinearities more linear.
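For intuition, here is a minimal PyTorch sketch of a normalization-free block along those lines: residual branches downscaled by a depth-dependent factor, the attention matrix mixed toward the identity, and a nearly linear MLP nonlinearity. The specific choices here (the `branch_scale` of 1/sqrt(2·depth), the `alpha_id` mixing weight, the LeakyReLU slope) are illustrative assumptions on my part, not the exact recipe from the paper.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class NormFreeBlock(nn.Module):
    """Transformer block with no LayerNorm, illustrating two ideas:
    (1) residual branches downscaled by a depth-dependent factor, and
    (2) an attention matrix pulled toward the identity plus a nearly linear MLP.
    The scaling choices are assumptions for illustration only."""

    def __init__(self, d_model: int, n_heads: int, depth: int, mlp_ratio: int = 4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, mlp_ratio * d_model),
            nn.LeakyReLU(negative_slope=0.9),  # close-to-linear nonlinearity
            nn.Linear(mlp_ratio * d_model, d_model),
        )
        # (1) Downscale each residual branch so variance stays bounded over `depth` blocks.
        self.branch_scale = 1.0 / math.sqrt(2.0 * depth)
        # (2) Convex mixing weight pulling the attention matrix toward the identity.
        self.alpha_id = nn.Parameter(torch.tensor(0.8))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, tokens, d_model)
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        attn = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_head), dim=-1)
        eye = torch.eye(t, device=x.device).expand_as(attn)
        attn = self.alpha_id * eye + (1.0 - self.alpha_id) * attn  # shaped toward identity

        out = (attn @ v).transpose(1, 2).reshape(b, t, d)
        x = x + self.branch_scale * self.proj(out)  # no LayerNorm anywhere
        x = x + self.branch_scale * self.mlp(x)
        return x


# Quick check: activations should stay well-scaled through a deep, norm-free stack.
depth = 12
blocks = nn.Sequential(*[NormFreeBlock(256, 8, depth) for _ in range(depth)])
h = torch.randn(2, 16, 256)
print(blocks(h).std())  # should stay close to 1 rather than blowing up with depth
```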

I expect continued research on larger-scale models to further demonstrate that normalization can be made redundant in most contexts.
