Alan · Product Manager

I want to learn how to train an AI model like ChatGPT.

Course

Neural Networks to LLMs

This course covers the full arc from first principles to the modern LLM stack: neural network foundations, the mechanics of training, the Transformer architecture in depth, large-scale language model pretraining, fine-tuning and adaptation techniques, and finally RLHF-based alignment. Every topic pairs conceptual rigor and full mathematical treatment with working PyTorch or HuggingFace code, so theory and implementation reinforce each other throughout.

Expected Outcome

Upon completing this course, you will be able to explain end-to-end how GPT-style models are built and aligned, train a small language model from scratch, fine-tune a pretrained model using parameter-efficient methods, and read ML research papers with genuine comprehension of the mathematics and design choices involved.

Course Syllabus

Topic 0: Course Introduction

Orientation to the full course arc: what ground will be covered, why the sequence is ordered the way it is, and how each major block connects to the next. Sets expectations for the balance of math, intuition, and code throughout.

0.1
Roadmap introduction: from perceptrons to ChatGPT
What you will learn, why the order matters, and how each topic sets up the next.

Topic 1: ML Foundations and the Python/PyTorch Environment

Before any architecture can be understood, you need a clear mental model of what machine learning actually is and a working environment to experiment in. This topic bridges math background into the ML framing and establishes the PyTorch primitives that every later topic will use.

1.1
What machine learning actually is
Distinguishing ML from classical programming: functions learned from data rather than written by hand.
1.2
The core ML framing: data, model, loss, optimizer
A unified mental model that applies from linear regression all the way to GPT.
1.3
Setting up a Python ML environment
Installing PyTorch, Jupyter or VS Code, GPU considerations, and verifying your setup with a minimal tensor program.
1.4
Tensors: the fundamental data structure
How PyTorch tensors generalize NumPy arrays: shapes, dtypes, broadcasting, and device placement across CPU and GPU.
1.5
Automatic differentiation in PyTorch
How autograd tracks operations and computes gradients automatically: the engine under every training loop.
1.6
Your first end-to-end ML model in PyTorch
Fitting a linear regression from scratch: forward pass, loss computation, gradient step, and inspecting what changed.
1.7
Supervised vs. unsupervised vs. self-supervised learning
Why the distinction matters, and why self-supervised learning is the paradigm LLMs live in.
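The loop sketched in 1.4 through 1.6 can be seen in one self-contained example. This is a minimal sketch, not course material: the true line y = 3x + 1, the learning rate, and the step count are all arbitrary illustrative choices.

```python
# Fit y = 3x + 1 with a hand-rolled forward pass, MSE loss, and gradient steps,
# letting autograd compute the gradients (subsections 1.4-1.6).
import torch

torch.manual_seed(0)
x = torch.linspace(-1, 1, 64).unsqueeze(1)   # inputs, shape (64, 1)
y = 3 * x + 1                                # targets from the true line

w = torch.zeros(1, requires_grad=True)       # learnable slope
b = torch.zeros(1, requires_grad=True)       # learnable intercept

for step in range(200):
    pred = x * w + b                         # forward pass
    loss = ((pred - y) ** 2).mean()          # mean squared error
    loss.backward()                          # autograd fills w.grad and b.grad
    with torch.no_grad():                    # update outside the autograd graph
        w -= 0.1 * w.grad
        b -= 0.1 * b.grad
        w.grad.zero_()                       # gradients accumulate unless cleared
        b.grad.zero_()

print(w.item(), b.item())                    # approaches 3.0 and 1.0
```

Everything later in the course, up to and including LLM pretraining, is an elaboration of this same forward / loss / backward / update cycle.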

Topic 2: Neural Networks: Architecture and Activations

This topic builds the core structural vocabulary of deep learning: perceptrons, layers, depth, and the nonlinearities that make neural networks universal approximators. Understanding architecture before training is crucial for reading network diagrams and reasoning about what each piece does.

2.1
The perceptron: a single learned linear boundary
Weights, bias, dot product, and the geometric interpretation of a decision boundary.
2.2
Why depth matters: stacking layers into a network
How composing linear transformations with nonlinearities creates hierarchical feature detectors.
2.3
Activation functions and why they exist
Sigmoid, tanh, ReLU, and GELU: what each one does, where it saturates, and why GELU dominates modern LLMs.
2.4
The fully connected layer in math and code
Wx + b as a matrix operation: implementing nn.Linear and understanding what parameters are being learned.
2.5
Network capacity and the universal approximation theorem
What it guarantees, what it does not guarantee, and why depth is more practical than width alone.
2.6
Building a feedforward network with nn.Module
Defining forward(), registering parameters, and the design pattern that all PyTorch models follow.
2.7
Softmax and output representations
Turning raw scores into probabilities, and why softmax is the natural output layer for classification and language models.
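The nn.Module pattern from 2.6, the GELU nonlinearity from 2.3, and the softmax output from 2.7 fit together in a few lines. A minimal sketch; the layer sizes (4 → 16 → 3) are placeholder choices for illustration.

```python
# A tiny feedforward classifier following the standard PyTorch design pattern:
# define layers in __init__, compose them in forward().
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    def __init__(self, d_in=4, d_hidden=16, n_classes=3):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)    # Wx + b; parameters auto-registered
        self.act = nn.GELU()                    # the nonlinearity used in modern LLMs
        self.fc2 = nn.Linear(d_hidden, n_classes)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))  # raw scores ("logits")

model = TinyMLP()
logits = model(torch.randn(8, 4))               # a batch of 8 inputs
probs = torch.softmax(logits, dim=-1)           # scores -> probabilities
print(probs.shape, probs.sum(dim=-1))           # (8, 3); each row sums to 1
```

Note that softmax is applied outside the model: in training, the logits go straight into the cross-entropy loss, which applies it internally.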

Topic 3: Training Neural Networks

This is the mechanical core of the entire course. Every model is trained by the same fundamental loop: compute a loss, backpropagate gradients, and update parameters. This topic develops that loop rigorously, covering both the mathematics of backpropagation and the engineering of a stable training run.

3.1
Loss functions: measuring how wrong the model is
Mean squared error for regression and cross-entropy for classification: deriving each from first principles and understanding why they are the right choices.
3.2
Gradient descent: the core optimization algorithm
The gradient as a direction of steepest ascent, why we subtract it, and the intuition of descending a loss landscape.
3.3
Backpropagation: the chain rule on computation graphs
Deriving gradients layer by layer through a concrete two-layer example by hand before letting autograd do it.
3.4
The full PyTorch training loop
zero_grad, forward, loss, backward, and optimizer.step: why each call is necessary and what happens if you skip one.
3.5
Batch gradient descent, SGD, and mini-batches
Why full-batch gradient descent is impractical, how stochasticity helps escape local minima, and the sweet spot of mini-batch SGD.
3.6
Momentum and the Adam optimizer
How adaptive learning rates work, why Adam is the near-universal default for LLM training, and when SGD is still preferred.
3.7
Learning rate scheduling
Warmup, cosine decay, and why the trajectory of the learning rate over training matters as much as its initial value.
3.8
Overfitting and the bias-variance trade-off
Training vs. validation loss curves: diagnosing underfitting, overfitting, and knowing when to stop.
3.9
Regularization techniques
L2 weight decay, dropout, and early stopping: what each one does to the loss landscape and when to apply each.
3.10
Batch normalization and layer normalization
Why normalizing activations stabilizes training, and why LayerNorm rather than BatchNorm is used in Transformers.
3.11
Training a small MLP on a real dataset end-to-end
Putting it all together: data loading, training loop, validation, learning curves, and diagnosing what went wrong.
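The pieces from 3.4 through 3.7 combine into the loop below: Adam with weight decay, linear warmup into cosine decay, and the canonical zero_grad / forward / backward / step sequence. A sketch under placeholder assumptions: the task (noisy linear data) and every hyperparameter are illustrative, not recommendations.

```python
# A complete miniature training run: optimizer, LR schedule, and the full loop.
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(256, 10)
y = x @ torch.randn(10, 1) + 0.01 * torch.randn(256, 1)   # noisy linear targets

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=1e-4)
loss_fn = nn.MSELoss()

total_steps, warmup = 300, 30
def lr_scale(step):                      # linear warmup, then cosine decay to zero
    if step < warmup:
        return step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_scale)

for step in range(total_steps):
    opt.zero_grad()                      # clear stale gradients
    loss = loss_fn(model(x), y)          # forward pass + loss
    loss.backward()                      # backpropagation
    opt.step()                           # Adam parameter update
    sched.step()                         # advance the learning-rate schedule

print(loss.item())                       # far below the initial loss
```

Skipping zero_grad makes gradients accumulate across steps; skipping sched.step() silently freezes the learning rate at its warmup value, which is exactly the kind of bug 3.11 teaches you to diagnose from the loss curve.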

Topic 4: The Transformer Architecture

The Transformer is the architectural backbone of every modern LLM. This topic unpacks it component by component: attention, positional encoding, layer structure, and a working implementation.

4.1
The sequence modeling problem: why RNNs fell short
Vanishing gradients, inability to parallelize, and the key limitations that motivated the Transformer design.
4.2
Attention as a soft lookup: the intuition
The query-key-value metaphor: attending to the most relevant parts of a sequence rather than compressing everything into one vector.
4.3
Scaled dot-product attention: the math
Deriving QK^T / sqrt(d_k), applying softmax, weighting values, and understanding why the scaling factor keeps the softmax out of its saturated, vanishing-gradient regime.
4.4
Implementing scaled dot-product attention in PyTorch
Writing the attention function from scratch, inspecting attention weight matrices, and verifying with a small example.
4.5
Multi-head attention
How splitting into heads and projecting back allows the model to capture different kinds of token relationships in parallel.
4.6
Positional encoding: injecting sequence order
Why attention is inherently position-agnostic, and how sinusoidal encodings and later rotary embeddings give the model a sense of position.
4.7
The feed-forward sublayer inside each Transformer block
The two-layer MLP applied position-wise: its role as a key-value memory and why its hidden dimension is typically 4x the model dimension.
4.8
Residual connections and layer normalization
Why skip connections enable training of very deep networks, and the pre-norm vs. post-norm distinction in modern LLMs.
4.9
Encoder vs. decoder vs. encoder-decoder architectures
BERT vs. GPT vs. T5: causal masking, bidirectional attention, and which architecture is right for which task.
4.10
The causal decoder-only Transformer: GPT architecture
Masking future tokens during training, autoregressive generation at inference, and the full block diagram of a GPT-style model.
4.11
Implementing a GPT-style Transformer block
Assembling multi-head attention, feed-forward sublayer, residual connections, and LayerNorm into a single composable block in PyTorch.
4.12
Assembling and running a full mini-GPT
Stacking N decoder blocks into a complete model, counting parameters, and doing a forward pass on a token sequence.
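The attention math of 4.3 and the causal masking of 4.10 can be sketched together in a single function. This is the single-head form only; the multi-head splitting of 4.5 and all projection matrices are omitted, and the shapes are illustrative.

```python
# Scaled dot-product attention with a causal mask, written from scratch.
import math
import torch

def causal_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # QK^T / sqrt(d_k)
    T = scores.size(-1)
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))    # hide future tokens
    weights = torch.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ v, weights                         # weighted sum of values

x = torch.randn(2, 5, 8)                                # (batch, seq, d_k)
out, w = causal_attention(x, x, x)
print(out.shape)    # (2, 5, 8)
print(w[0, 0])      # the first token can only attend to itself: [1, 0, 0, 0, 0]
```

Inspecting the weight matrix directly, as in 4.4, makes the causal structure visible: it is lower-triangular, so position t never receives information from positions after t.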

Topic 5: Language Modeling and Pretraining

Having built the architecture, this topic covers how LLMs are actually trained on text at scale: the objective, the data, the tokenization, and what emerges from training. This stage produces a raw base model.

5.1
The language modeling objective: next-token prediction
Why predicting the next token is self-supervised, and how it drives the model to internalize grammar, facts, and reasoning patterns.
5.2
Tokenization: from raw text to token IDs
Byte-pair encoding and WordPiece: why character-level is too granular, word-level is too sparse, and subword tokenization splits the difference.
5.3
Implementing a BPE tokenizer
Walking through the BPE merge algorithm, encoding and decoding sequences, and the effect of vocabulary size on model capacity and compute.
5.4
Text datasets and data pipelines
Common Crawl, The Pile, and other pretraining corpora: how text is collected, filtered, deduplicated, and streamed into training.
5.5
The pretraining compute budget
Scaling laws and the Chinchilla intuition: how model size, dataset size, and compute interact, and what they imply about how much to train.
5.6
Training a small character-level language model
The classic Karpathy-style nanoGPT exercise: training on Shakespeare or similar, watching loss drop, and sampling coherent text.
5.7
What a base language model actually learns
In-context learning, few-shot prompting, emergent capabilities, and the distinction between a raw base model and an instruction-following model.
5.8
Positional embeddings at scale: RoPE and ALiBi
How modern LLMs handle long contexts beyond the original sinusoidal approach, and why position encoding remains an open research area.
5.9
Mixed-precision training and memory efficiency
FP16, BF16, gradient checkpointing, and why training large models requires careful memory management even on high-end hardware.
5.10
Evaluating a language model: perplexity and beyond
What perplexity measures, its relationship to cross-entropy loss, and why benchmark suites like HellaSwag and MMLU are used for more meaningful evaluation.
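The relationship between perplexity and cross-entropy from 5.10 is short enough to verify numerically. A minimal sketch: the logits below are random noise standing in for a real model's output, and the vocabulary size of 50 is arbitrary.

```python
# Perplexity is exp(mean cross-entropy) over next-token predictions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, seq_len = 50, 20
logits = torch.randn(seq_len, vocab_size)           # one row of scores per position
targets = torch.randint(0, vocab_size, (seq_len,))  # the actual next tokens

ce = F.cross_entropy(logits, targets)               # mean negative log-likelihood, in nats
perplexity = torch.exp(ce)                          # the "effective branching factor"
print(ce.item(), perplexity.item())

# A model guessing uniformly scores ln(vocab_size) nats of cross-entropy,
# i.e. a perplexity equal to the vocabulary size itself:
uniform_ppl = torch.exp(torch.log(torch.tensor(float(vocab_size))))
print(uniform_ppl.item())                           # 50.0
```

This is why a falling training loss translates directly into a falling perplexity, and why perplexity alone says nothing about downstream ability, motivating benchmarks like HellaSwag and MMLU.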

Topic 6: Fine-Tuning and Adaptation

A pretrained base model is powerful but raw: it predicts text, not answers. This topic covers the landscape of fine-tuning techniques that transform a base model into a useful task-specific system and sets up the alignment work that follows.

6.1
Why fine-tuning works
The pretrain-then-finetune paradigm: how pretrained representations transfer to downstream tasks, and why starting from a strong base is far more efficient than training from scratch.
6.2
Supervised fine-tuning on instruction data
Formatting prompt-response pairs, computing cross-entropy loss only on response tokens, and the structure of instruction-tuning datasets.
6.3
Full fine-tuning vs. parameter-efficient fine-tuning
Why updating all parameters is expensive and often unnecessary: the motivation for PEFT methods.
6.4
LoRA: Low-Rank Adaptation
Decomposing weight updates into low-rank matrices, why this preserves pretrained knowledge while enabling task-specific adaptation, and the math behind the rank decomposition.
6.5
Implementing LoRA with HuggingFace PEFT
Applying LoRA adapters to a real model, training only the adapter parameters, and merging them back into the base model.
6.6
Other PEFT approaches: adapters and prompt tuning
Bottleneck adapters, prefix tuning, and soft prompt tuning: when each is appropriate and how they compare to LoRA.
6.7
The HuggingFace Transformers ecosystem
AutoTokenizer, AutoModelForCausalLM, Trainer API, and datasets library: the practical toolkit for fine-tuning without reinventing infrastructure.
6.8
Instruction tuning datasets
FLAN, Alpaca, and OpenHermes: how instruction-response datasets are curated, the role of data quality vs. quantity, and how to evaluate whether fine-tuning improved task performance.
6.9
Fine-tuning a small open-weight LLM for a specific task
A hands-on project: choosing a task, preparing a dataset, applying LoRA, training, and evaluating the result against the base model.
6.10
Catastrophic forgetting and how to mitigate it
Why fine-tuning can degrade general capabilities, and techniques like replay and regularization that preserve base model knowledge.
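The low-rank decomposition at the heart of 6.4 fits in a short module. This is a pure-PyTorch sketch of the idea, not the HuggingFace PEFT API from 6.5; the rank r=4 and alpha=8 are placeholder hyperparameters.

```python
# LoRA in miniature: freeze the pretrained weight W and learn a low-rank
# update (alpha / r) * B @ A on top of it.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=4, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # pretrained weights stay frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(d_out, r))        # up-projection, zero-init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(16, 16))
x = torch.randn(2, 16)
# Because B starts at zero, the adapted layer is initially identical to the base:
print(torch.allclose(layer(x), layer.base(x)))   # True
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)                                 # 2 * 4 * 16 = 128 adapter parameters
```

The zero initialization of B is the reason fine-tuning starts from exactly the pretrained behavior, and the small trainable-parameter count (128 here vs. 272 for the full layer) is the entire point of PEFT.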

Topic 7: RLHF and Alignment

This is the final transformation: from an instruction-following model to one that is helpful, harmless, and honest. RLHF integrates pretraining, fine-tuning, and reinforcement learning into the specific process that produced ChatGPT from a base model.

7.1
Why alignment is necessary
Base-model failure modes: sycophancy, harmful outputs, and the gap between predicting plausible text and being genuinely helpful.
7.2
The RLHF pipeline overview
The three-stage process: supervised fine-tuning, reward model training, and RL optimization, and how each stage builds on the previous one.
7.3
Human preference data
How labelers rank model outputs, the design of preference datasets, and why pairwise comparisons are more reliable than absolute ratings.
7.4
Reward models
Training a model to score outputs using the Bradley-Terry model and binary cross-entropy loss, plus the risks of reward hacking.
7.5
Reinforcement learning basics for the LLM context
Policy, reward, return, and the policy gradient theorem: just enough RL theory to understand why PPO is used here.
7.6
PPO applied to language model fine-tuning
The clipped surrogate objective, the KL penalty that prevents policy drift, and the four-model setup used in practice.
7.7
The KL divergence constraint
Why unconstrained RL on a reward model leads to degenerate outputs, and how the KL penalty keeps the model coherent.
7.8
Constitutional AI and RLAIF
Anthropic-style AI feedback instead of human labelers: how written principles can replace human preference comparisons at scale.
7.9
Direct Preference Optimization
RLHF without RL: how DPO reformulates the RLHF objective into a supervised loss, why it is simpler and increasingly preferred, and when PPO still wins.
7.10
Putting it all together: the ChatGPT training recipe
Tracing the complete production pipeline from raw pretraining data through SFT, reward modeling, PPO, and deployment.
7.11
Open problems in alignment
Scalable oversight, superalignment, specification gaming, and why alignment remains an active research frontier despite current progress.
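The "RLHF without RL" claim of 7.9 becomes concrete once the DPO loss is written down. A sketch on made-up numbers: the four log-probabilities below are arbitrary scalars standing in for summed per-token log-probs of real responses, and beta = 0.1 is an illustrative choice.

```python
# The DPO loss: push the policy to prefer the chosen response over the
# rejected one by more than the frozen reference model does.
import torch
import torch.nn.functional as F

beta = 0.1   # strength of the implicit KL constraint back to the reference

# Summed log-probabilities of the chosen and rejected responses:
policy_chosen, policy_rejected = torch.tensor(-12.0), torch.tensor(-15.0)
ref_chosen, ref_rejected = torch.tensor(-13.0), torch.tensor(-14.0)

# How much more the policy prefers chosen over rejected, relative to the reference:
margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
loss = -F.logsigmoid(beta * margin)   # a plain supervised loss: no sampling, no PPO
print(margin.item(), loss.item())     # margin = 2.0
```

Compare this with the PPO setup of 7.6: no reward model is queried at training time and no rollouts are sampled, which is why DPO is so much simpler to run, while the beta-weighted margin still plays the role of the KL-constrained reward from 7.7.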

Topic 8: Reading ML Research Papers

This topic develops the skill of reading primary literature explicitly: not as an afterthought, but as a structured practice built on the technical foundation established in all prior topics.

8.1
The anatomy of an ML paper
Abstract, introduction, related work, method, experiments, ablations, and conclusion: what each section is trying to communicate and how to extract it efficiently.
8.2
How to read a paper without getting lost in the math
The three-pass reading method: skimming, understanding claims, then verifying proofs, and deciding when the math actually matters.
8.3
Reading Attention Is All You Need
Applying architectural knowledge to the original Transformer paper: spotting what the paper introduced, what it assumed, and what it left open.
8.4
Reading the GPT and GPT-2 papers
How language modeling papers established the pretrain-then-finetune paradigm and demonstrated scaling in text generation.
8.5
Reading the InstructGPT paper
The RLHF blueprint: understanding each design decision now that you have the full technical background.
8.6
Reading a PEFT or scaling laws paper
Applying fine-tuning and pretraining knowledge to a current research paper such as LoRA, Chinchilla, or a similar landmark, and evaluating its claims critically.
8.7
Building a personal reading habit
Staying current in a fast-moving field through arXiv, Papers With Code, research social feeds, and triage habits that preserve depth.

Topic 9: Capstone Project

The capstone synthesizes every major skill developed across the course into a concrete deliverable. It asks you to make real design decisions at every stage: from data to training to evaluation.

9.1
Choosing your capstone direction
Three project tracks: train a small LM from scratch on a custom corpus, fine-tune an open-weight model for a specific task, or implement and ablate a component from a recent paper.
9.2
Project scoping and experimental design
Writing a short proposal: what you will build, what baseline you will compare against, how you will know if it worked, and what could go wrong.
9.3
Implementation sprint
Building the core system: data pipeline, model setup, training loop, and logging.
9.4
Evaluation and analysis
Choosing the right metrics, running ablations to understand what each design choice contributed, and interpreting what the model actually learned.
9.5
Writing up your results in paper style
Structuring findings as a short technical report, practicing the same format used in research papers, and consolidating understanding through explanation.