The NaN Sense Blog

Step-by-Step Guide to Building Advanced LLM Agents

A Large Language Model, at its core, is just a next-token predictor. While powerful, this fundamental architecture does not natively support the complex, multi-step reasoning and tool execution that characterize advanced agentic AI systems like ChatGPT Agent mode, Manus or Perplexity. The transformation from a raw probabilistic model into an

Multi-GPU Distributed Training

In the previous post, we used a single GPU to train a small, 117-million-parameter model (GPT-2-small) with a batch size of 32 and a subset of optimizations: (1) asynchronous I/O, (2) four gradient accumulation steps, and (3) torch.compile with default parameters. In this post, we

Optimizing the Model Architecture

In the previous post, we saw how to optimize a generic training loop for large deep learning models. In this post, we shall implement a GPT-style decoder-only transformer model (the most common large language model architecture) and explore some architecture-specific optimizations. Although Large Language Models (LLMs) come with millions

Inside the PyTorch Compiler

There are two types of deep learning frameworks: eager mode and graph mode. PyTorch is an example of an eager-mode framework, while TensorFlow (at least through the 1.x versions) is an example of a graph-mode framework. In graph-mode frameworks, we define a static computation graph of tensors and

Optimizing the Training Loop

In previous posts, we built a data collection pipeline and trained a byte pair encoder tailored to our data for our custom LLM training. Before we define the LLM architecture, let’s dive into some optimization techniques that will help us save (a lot of) time and money. The diagram

The NaN Sense Blog © 2026