LLMs at Scale
Optimizing the Model Architecture
In the previous post, we saw how to optimize a generic training loop for large deep learning models. In this post, we will implement a GPT-style decoder-only transformer (the most common large language model architecture) and explore some optimizations specific to this architecture. Although Large Language Models (LLMs) come with millions