Multi-GPU Distributed Training
In the previous post, we used a single GPU to train a small, 117-million-parameter model (GPT-2-small) with a batch size of 32 and a subset of optimizations: asynchronous I/O, four gradient accumulation steps, and torch.compile with default parameters. In this post, we