LLMs at Scale
Optimizing the Model Architecture
In the previous post, we saw how to optimize a generic training loop for large deep learning models. In this post, we will implement a GPT-style decoder-only transformer (the most common large language model architecture) and explore some optimizations specific to this architecture. Although Large Language Models (LLMs) come with millions