How a neural network actually works

A tour for curious 13-year-olds, with things you can play with. Read in order. Drag the sliders. Press the buttons.

1. The big picture

A neural network is a very large pile of tiny math equations. Together they turn an input (like a picture, or a sentence) into an output (like "this is a cat" or the next word in a story).

Each tiny equation has some adjustable numbers called weights. At the start the weights are random and the network is terrible. You show it lots of examples where you already know the right answer. Each time it gets one wrong, you nudge every weight a tiny bit in the direction that would have made it slightly less wrong. (We don't yet know how the network figures out the right direction. That's what chapter 5 and chapter 8 are for — hold the question.)

Do this millions or billions of times and the weights end up encoding a surprising amount of knowledge about the world.

That's the main idea. The rest of the book shows how far that idea can stretch.

If you like watching things move while you read, the 3Blue1Brown neural networks series on YouTube animates almost every idea in this book. It's the single best companion to what you're reading now.

Challenge

Before moving on, name three problems that could be turned into "input goes in, prediction comes out." Try one from school, one from a game, and one from real life.

2. One tiny neuron

Imagine a little machine. It has some inputs and one output. Each input is a number — could be the brightness of a pixel, or "did I practice piano? 1 for yes, 0 for no".

The machine does three things in order:

  1. Multiply each input by its own "importance number" (a weight)
  2. Add them all together (plus one extra fudge factor called a bias)
  3. Squish the answer into a number between 0 and 1 using a curve called sigmoid
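[Diagram: input 1 and input 2 are multiplied by the weights w₁ and w₂, added together with the bias (Σ), then passed through sigmoid to give an output between 0 and 1.]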

That output between 0 and 1 is how strongly the neuron "fires." It's basically how confident the neuron is that the answer is yes. That's one neuron. Yes, that's the whole thing.
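
If you'd like to see those three steps as code, here is a minimal sketch in plain Python. The function names (sigmoid, neuron) are made up for this example and are not taken from the book's scripts.

```python
import math

def sigmoid(x):
    # Squish any number into the range 0 to 1.
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    # Step 1: multiply each input by its own weight.
    # Step 2: add everything together, plus the bias.
    weighted_sum = sum(i * w for i, w in zip(inputs, weights)) + bias
    # Step 3: squish the answer with sigmoid.
    return sigmoid(weighted_sum)

output = neuron(inputs=[1.0, 0.0], weights=[1.0, 1.0], bias=-0.5)
print(round(output, 2))  # 0.62
```

With every weight and the bias set to 0, the weighted sum is 0 and the neuron outputs exactly 0.5, which is the "gives up" case.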

Why sigmoid and not some other shape? Hold that question — we'll meet sigmoid's modern replacement (ReLU) in a few chapters and show why the choice of "squisher" actually matters. If you want the full zoo right now, Wikipedia's activation function page lists every common one.

▸ try it

Play with a neuron

Drag the sliders. The output bar reacts live. Try setting both weights to +1 and the bias to -0.5 — now the neuron acts like an "OR" gate. Try setting them all to 0 — the neuron gives up and just outputs 0.5.

[Interactive widget. Slider values shown: 1.00, 0.00, 1.00, 1.00, -0.50. Readout: weighted sum 0.50, output 0.62, decision YES.]

Challenge

Make the neuron behave like AND instead of OR. Hint: both inputs should need to be high before the output crosses 0.5.

3. Training the neuron

Let's teach our one neuron to decide "can I have screen time?"

Rule: screen time is allowed if you practiced piano OR you finished your homework. Here are the examples we'll feed it:

piano done?    homework done?    screen time allowed?
0 (no)         0 (no)            0 (no)
1 (yes)        0 (no)            1 (yes)
0 (no)         1 (yes)           1 (yes)
1 (yes)        1 (yes)           1 (yes)

The training loop is the same idea every time:

  1. Start with random weights (the neuron has no clue)
  2. Run all 4 examples through the neuron, see what it predicts
  3. For each one, figure out how wrong it was
  4. Nudge every weight a tiny bit in whichever direction would have made the answer slightly more right
  5. Repeat 10,000 times
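
Here is one way those five steps could look in plain Python. Treat it as a sketch, not the book's neuro1.py: the variable names, the random seed, the squared-error measure of "how wrong", and the learning rate of 1.0 are all assumptions made for this example.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# The four screen-time examples: (piano, homework) -> allowed?
examples = [((0, 0), 0), ((1, 0), 1), ((0, 1), 1), ((1, 1), 1)]

# Step 1: start with random weights.
random.seed(0)
w1, w2, bias = (random.uniform(-1, 1) for _ in range(3))
learning_rate = 1.0  # assumed value for this sketch

for epoch in range(10_000):  # Step 5: repeat
    grad_w1 = grad_w2 = grad_bias = 0.0
    for (piano, homework), target in examples:
        # Step 2: run the example through the neuron.
        prediction = sigmoid(w1 * piano + w2 * homework + bias)
        # Step 3: how wrong was it?
        error = prediction - target
        # Slope of the squared error, passed back through the sigmoid.
        blame = error * prediction * (1 - prediction)
        grad_w1 += blame * piano
        grad_w2 += blame * homework
        grad_bias += blame
    # Step 4: nudge every weight against its slope.
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
    bias -= learning_rate * grad_bias

predictions = [sigmoid(w1 * p + w2 * h + bias) for (p, h), _ in examples]
print([round(p, 2) for p in predictions])
```

After training, the first prediction should sit near 0 and the other three near 1.
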
▸ try it

Train the neuron yourself

Press "step once" to do a single round of training. Or "train 1000" to fast-forward. Watch the predictions on each of the four examples crawl toward the right answer.

epoch: 0 w1: ? w2: ? bias: ?

After a few thousand nudges, the neuron has learned the rule. Nobody told it "if piano OR homework is done then yes". It figured that out by being slightly less wrong over and over.

That's the trick. Everything from here on is "do this same thing, but much bigger." If you want a parallel walkthrough where every step is animated, the second video in 3Blue1Brown's neural networks series covers gradient descent with pictures, on a bigger network than our single neuron.

▸ try it

Draw the line yourself

The neuron's whole job is to draw one straight line that puts YES dots on one side and NO dots on the other. Drag the sliders. The dashed line is the neuron's current decision boundary. The number tells you how many of the 4 dots are on the right side.

Try to get all 4 correct. Hint: weight 1 = 1.0, weight 2 = 1.0, bias = -0.5 is one good answer.

▸ run real python

Run the actual script — in your browser

This is neuro1.py from the folder, real Python with real numpy. First click downloads Python (takes a few seconds). Edit anything you like, then click run.

neuro1.py Python loads on first run
(click "run" to execute)
Challenge

Edit the code so screen time is allowed only when both piano and homework are done. Then predict what the weights should do.

4. When one neuron isn't enough

Let's try a harder rule. Imagine a secret door with two buttons:

The door opens if EITHER the left button OR the right button is pressed. But if both buttons are pressed at the same time, the door stays shut.

This rule is called "exactly one." Computer scientists call it XOR, short for "exclusive or." Here's the data:

left button?    right button?    door opens?
0               0                0 (no buttons)
1               0                1 (left only)
0               1                1 (right only)
1               1                0 (both buttons)

Try to teach a single neuron this. It will fail. Forever. Doesn't matter how many times you train it.

Why? A single neuron is like drawing one straight line on a piece of graph paper to separate the "yes" answers from the "no" answers. Look at our problem on paper:

                  left=0    left=1
right button=1    YES       NO
right button=0    NO        YES

Can you draw ONE straight line that puts both YESes on one side and both NOs on the other? Try it on paper. You can't. The YESes are on opposite corners. A single neuron can only draw one line, so it has no shape that fits this problem.

The math word for "can be split by one straight line" is linearly separable. The screen-time rule from chapter 3 is linearly separable; this XOR rule is not. Anything not linearly separable is impossible for one neuron. If you remember one piece of jargon from this book, make it that one — it'll come up everywhere.
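
If you don't trust the paper argument, you can let a computer try a lot of lines for you. This sketch tests every combination of two weights and a bias on a small grid and reports the best score it finds. The grid range (-3 to 3 in steps of 0.5) and the rule "say YES when the weighted sum is above 0" are choices made for this example. A grid search is not a proof, but it matches what the math says.

```python
from itertools import product

corners = [(0, 0), (1, 0), (0, 1), (1, 1)]
or_labels = [0, 1, 1, 1]   # the screen-time rule
xor_labels = [0, 1, 1, 0]  # the secret-door rule

def best_score(labels):
    # Candidate values: -3.0, -2.5, ..., 2.5, 3.0
    steps = [x / 2 for x in range(-6, 7)]
    best = 0
    for w1, w2, bias in product(steps, repeat=3):
        # Count how many corners land on the correct side of this line.
        correct = sum(
            int(w1 * a + w2 * b + bias > 0) == label
            for (a, b), label in zip(corners, labels)
        )
        best = max(best, correct)
    return best

print("OR: ", best_score(or_labels), "out of 4")
print("XOR:", best_score(xor_labels), "out of 4")
```

OR reaches 4 out of 4. XOR tops out at 3 out of 4, no matter which line on the grid you pick.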

▸ try it

Watch a single neuron fail

Same trainer as before, new data. Click "train 1,000 steps" and watch — the predictions get stuck at 0.5 forever. The neuron literally gives up and guesses the average.

epoch: 0 avg error: ?

This isn't a small detail. This is one of the most famous moments in the whole history of AI.

In 1969, two MIT researchers, Marvin Minsky and Seymour Papert, wrote a book called Perceptrons. Its most famous result was a proof, in math, that a single-layer neural network (called a perceptron at the time) cannot solve XOR. The problem you just played with for 5 minutes.

This landed badly. People had been hyping perceptrons as the future of artificial intelligence. The book made the field look like a dead end. Funding dried up. Most researchers moved to other things. We now call this period the first AI winter: about 15 years where neural networks were considered a curiosity, not a serious tool.

Here's the twist: Minsky and Papert knew a multi-layer network could probably solve XOR. The actual problem was that nobody had a good way to train multi-layer networks yet. That training algorithm (called backpropagation — we'll explain it in chapter 8) wasn't worked out clearly until 1986, when Rumelhart, Hinton, and Williams published a paper showing how to do it. That paper is essentially why everything you read about today — ChatGPT, self-driving cars, image generators — exists. Almost two decades of progress had been stalled by one missing algorithm.

▸ try it

Try to draw a line — you can't

Same line-drawing widget as before, but with the secret-door labels. Drag the sliders. The YES dots are on opposite corners now. The best you can do is 3 out of 4 correct. Try as long as you want — no straight line will get all 4 right.

▸ run real python

Watch a real single-layer perceptron give up

This is neuro2.py. Same code as neuro1.py, just XOR data. The output will plateau at MSE ≈ 0.25 and every prediction ends up ≈ 0.5.

neuro2.py Python loads on first run
(click "run" to execute)
Challenge

Try to beat XOR with one line anyway. Change the sliders, then explain why your best answer still misclassifies at least one corner.

5. Hidden neurons fix it

The fix turned out to be: use MORE neurons, in layers.

Imagine you have a few neurons sitting in the middle of the network. Each one draws its own line on the graph paper. Now an output neuron looks at what those middle neurons said and combines their answers. Two lines, combined cleverly, can carve out the diagonal pattern.

[Diagram: two inputs (in 1, in 2) feed a hidden layer of three neurons (hid 1, hid 2, hid 3), which feed one output neuron.]
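
To see that two lines really are enough for the secret door, here is a tiny network with the weights picked by hand. These numbers are made up for illustration. A trained network would find its own weights, and they would probably look different.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def door_opens(left, right):
    # Hidden neuron 1 draws the "at least one button" line (OR).
    at_least_one = sigmoid(20 * left + 20 * right - 10)
    # Hidden neuron 2 draws the "both buttons" line (AND).
    both = sigmoid(20 * left + 20 * right - 30)
    # Output neuron: say yes to "at least one", but a strong no to "both".
    return sigmoid(20 * at_least_one - 20 * both - 10)

for left, right in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(left, right, round(door_opens(left, right), 3))
```

The two middle rows come out close to 1 and the other two close to 0, which is exactly the "exactly one" rule.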

We call the middle neurons hidden because we don't tell them what they should mean. They figure out their own job during training. After training, you might peek inside and find:

Nobody told the network those rules. It discovered them by being a tiny bit less wrong over and over.

▸ try it

Now watch it succeed

Same XOR data as last time. New network: 4 hidden neurons + 1 output neuron. Click "train" and watch the predictions actually converge.

epoch: 0

This is the central trick of deep learning. Layers of neurons build up more and more sophisticated ideas without anyone programming those ideas in directly.

The piece of math that makes this work: the chain rule

You might be wondering: how does the network know which way to nudge the hidden-layer weights? With one neuron it was kind of obvious — there's only one output, you can see which direction makes it more right. But the hidden neurons are buried in the middle. They don't talk to the output directly. How do you blame them for a mistake?

The trick is a piece of calculus called the chain rule. The kid-friendly version: imagine the error at the very end of the network, and then ask "how did each layer contribute to that error?" You start at the output, where you can see the error directly. You compute how the output layer's weights affected the error — same as in chapter 3. Then you push that blame backward through the network: each hidden neuron gets credit (or blame) proportional to how strongly it was connected to the output neurons that mattered. Then you can nudge the hidden weights too.

This whole "compute the error, then pass blame backward layer by layer" idea is called backpropagation. It's the algorithm that came out of that 1986 paper we mentioned in chapter 4 — the one that ended the first AI winter. The interactive above is running backprop on every "step + animate" click. You're literally watching the algorithm that unlocked modern AI.
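
Here is a small sketch of "passing blame backward" for one hidden weight, with a sanity check. It assumes a 2-input, 2-hidden-neuron, 1-output network with sigmoid everywhere, a squared-error loss, and made-up starting weights. The check wiggles the weight a tiny bit and measures how the loss changes. If the chain-rule answer is right, the two numbers should agree.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, hidden_w, hidden_b, out_w, out_b):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(hidden_w, hidden_b)]
    output = sigmoid(sum(w * h for w, h in zip(out_w, hidden)) + out_b)
    return hidden, output

def loss(x, target, hidden_w, hidden_b, out_w, out_b):
    _, output = forward(x, hidden_w, hidden_b, out_w, out_b)
    return 0.5 * (output - target) ** 2

# Made-up example and weights, just for this sketch.
x, target = [1.0, 0.0], 1.0
hidden_w = [[0.5, -0.3], [0.8, 0.2]]
hidden_b = [0.1, -0.1]
out_w = [0.7, -0.6]
out_b = 0.05

hidden, output = forward(x, hidden_w, hidden_b, out_w, out_b)

# Blame at the output, where we can see the error directly.
output_blame = (output - target) * output * (1 - output)
# Push the blame back to hidden neuron 0 through its output wire.
hidden_blame = output_blame * out_w[0] * hidden[0] * (1 - hidden[0])
# Gradient for the wire from input 0 to hidden neuron 0.
backprop_gradient = hidden_blame * x[0]

# Check: wiggle that one weight up and down and measure the slope.
eps = 1e-5
up = [row[:] for row in hidden_w]
down = [row[:] for row in hidden_w]
up[0][0] += eps
down[0][0] -= eps
numeric_gradient = (loss(x, target, up, hidden_b, out_w, out_b)
                    - loss(x, target, down, hidden_b, out_w, out_b)) / (2 * eps)

print(backprop_gradient, numeric_gradient)
```

The wiggle method needs two full runs of the network for every single weight. Backprop gets the slopes for all the weights from one backward sweep, which is why it scales to big networks.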

If you want to see the chain rule drawn out beautifully, Chris Olah wrote "Calculus on Computational Graphs: Backpropagation" — the clearest gentle-but-real explainer of backprop on the open web.

▸ try it

Watch the wires light up

Same XOR network, but now you can SEE the neurons. Pick which example to feed in. The lines are the weights — thick blue = strong positive, thick red = strong negative, thin = weak. Click "step" to train for one round and watch the forward pass animate.

left=0, right=0 → want NO
left=1, right=0 → want YES
left=0, right=1 → want YES
left=1, right=1 → want NO
prediction: ? want: ? epoch: 0

Output math

How to read it: blue wires are positive weights and red wires are negative weights. Red does not mean wrong. A hidden neuron can still light up because of its bias and all incoming weights together. Then the right-side wires act like votes: active hidden neurons with blue output wires push the final answer toward YES; active hidden neurons with red output wires push it toward NO. The output is the combined vote from all hidden neurons. (The "YES/NO vote" picture works here because there is one output neuron deciding yes-or-no. When we add more output classes in chapter 6, the votes turn into a list of scores instead.)
▸ run real python

Run a real 2-layer network on XOR

This is neuro3.py. Same XOR data as before, but now with a hidden layer of 4 neurons. The predictions will converge to ~0.99 for YES and ~0.01 for NO.

Heads up: the first time you click "run" on any Python block in this book, your browser downloads Python itself (about 10 megabytes). That takes 10–20 seconds and requires an internet connection. After the first run on this page, every later run is instant.

neuro3.py Python loads on first run
(click "run" to execute)
Challenge

In the Python code, change the hidden layer from 4 neurons to 2. Does it still learn XOR? What happens if you try 1?

What does that "squish" actually do?

You've been hearing about sigmoid — the function that squishes any number into the range 0-1. Time to actually look at what it does, and meet its modern replacement, ReLU.

Here's what these functions look like when you plot them on graph paper. The horizontal axis is what goes INTO the neuron (the weighted sum). The vertical axis is what comes OUT.

▸ try it

Drag the dot. Watch the curve.

Slide the input left/right. The red dot is where the neuron sits right now. The orange dashed line is the slope at that exact point — the derivative.

[Interactive widget: toggle between sigmoid and ReLU. Sample reading with the input at 0.00 and sigmoid selected: output (y) 0.50, slope (derivative) 0.25.]

What does the derivative tell us?

The derivative is the slope of the curve at that point. Why this matters: during training, the network uses the slope to figure out which way to nudge the weights.

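Look at sigmoid's slope by dragging the slider: it peaks at 0.25 when the input is 0, and it shrinks toward 0 as the input moves far to the left or right (the flat tails).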

Now switch to ReLU and compare

Toggle "ReLU" above. The curve becomes two straight pieces: flat at zero below x=0, then a 45° ramp upward. Its slope is either exactly 1 (positive input) or exactly 0 (negative input). No squishing, and no flat tail for positive inputs.
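
If you can't use the widget, this short sketch prints the same slopes. It uses the standard formulas: sigmoid's slope is s × (1 − s), where s is sigmoid's output, and ReLU's slope is 1 for positive inputs and 0 otherwise. The function names are made up for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_slope(x):
    s = sigmoid(x)
    return s * (1 - s)

def relu(x):
    return max(0.0, x)

def relu_slope(x):
    return 1.0 if x > 0 else 0.0

print("input  sigmoid slope  ReLU slope")
for x in [-6, -2, 0, 2, 6]:
    print(x, round(sigmoid_slope(x), 4), relu_slope(x))
```

Sigmoid's slope is biggest in the middle (0.25 at input 0) and nearly 0 out at the tails. ReLU's slope stays at 1 for every positive input.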

Why we replaced sigmoid in deep networks

The slope of sigmoid is at best 0.25, usually way less. When you stack 80 layers of sigmoid, backpropagation multiplies all the slopes together as the error travels backward. Picture it:

0.25 × 0.25 × 0.25 × ... (80 times) ≈ 0.0000000... (about 10⁻⁴⁹)

That's a 1 with 49 zeros in front of it. Even one atom out of all the atoms in a teaspoon of water is a far bigger fraction than that. The signal vanishes long before it reaches the early layers. Those layers stop learning. This is the famous vanishing gradient problem, a major reason deep networks didn't work for years even after backpropagation was rediscovered.
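
You can check that multiplication yourself. This sketch uses sigmoid's best-case slope of 0.25 at every layer, which is generous because real slopes are usually smaller, and compares it with ReLU's slope of 1.

```python
layers = 80

sigmoid_signal = 0.25 ** layers  # best case for sigmoid at every layer
relu_signal = 1.0 ** layers      # ReLU on its positive side

print(sigmoid_signal)  # a number with "e-49" at the end
print(relu_signal)
```

Python prints tiny numbers in scientific notation, so "e-49" means "move the decimal point 49 places to the left."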

ReLU fixed it. Its slope of 1 keeps the signal alive through many layers. You'll find ReLU or one of its modern relatives inside nearly every recent neural network: image recognition, voice assistants, ChatGPT, all of them. The relatives include GELU (used in BERT and the GPT family), SiLU / Swish (used in some Google models), and LeakyReLU (which lets a tiny bit of signal through on the negative side to avoid "dead" neurons). Wikipedia's activation function page has them all in one place.

▸ run real python

Run the ReLU version

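This is neuro5.py. It uses the snack picker (you'll meet it properly in chapter 6) but swaps the hidden activation from sigmoid to ReLU. The learning rate is smaller because ReLU is more sensitive to big steps: its output isn't capped at 1 the way sigmoid's is.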

neuro5.py Python loads on first run
(click "run" to execute)
Challenge

In the ReLU script, try changing lr = 0.1 to 0.5. If training gets worse, you have found why learning rate matters.

6. Picking from a list

So far the neuron has just answered yes or no. What if the question is "which snack should I grab?" with several options?

hungry?    after school?    weekend?    snack
0          0                0           fruit
1          0                0           fruit
0          1                0           chips
1          1                0           chips
0          0                1           cookie
1          0                1           cookie
0          1                1           cookie
1          1                1           cookie

Three options now: fruit, chips, cookie. The fix is small: have one output neuron per option. Each outputs a score for its option. A function called softmax turns the scores into probabilities that always add up to 1.

The math is small enough you could do it on paper. For three scores z₁, z₂, z₃, softmax is:

softmax_i = e^(z_i) / (e^(z_1) + e^(z_2) + e^(z_3))

The e^ part makes everything positive and exaggerates the differences (a bigger score becomes way more likely than a slightly smaller one). The dividing makes them add up to 1, so they're proper probabilities. Quick example:

softmax([2.0, 1.0, 0.1])  →  [0.66, 0.24, 0.10]

The model might output [0.10, 0.85, 0.05] meaning "10% chance fruit, 85% chance chips, 5% chance cookie". You pick the highest one (chips).
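
Here is the softmax formula as a few lines of Python, so you can check the quick example above. One detail is an addition for this sketch: subtracting the biggest score before taking e^ is a common trick that keeps the numbers from getting huge, and it doesn't change the answer.

```python
import math

def softmax(scores):
    biggest = max(scores)
    # e^ makes everything positive and exaggerates the differences.
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    # Dividing by the total makes the results add up to 1.
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 2) for p in probs])  # [0.66, 0.24, 0.1]

# Picking the winner from a list of probabilities:
snacks = ["fruit", "chips", "cookie"]
model_output = [0.10, 0.85, 0.05]
pick = snacks[model_output.index(max(model_output))]
print(pick)  # chips
```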

The labels we feed in for training look like [1,0,0] for fruit, [0,1,0] for chips, [0,0,1] for cookie. Exactly one slot is 1 (the right answer) and the rest are 0. That's called one-hot encoding — we'll meet it again in the next chapter and use the name more often.

▸ try it

The trained snack picker

This network was actually trained for you when this page loaded. Toggle the inputs and watch the probabilities update.

[Interactive widget: toggles for hungry, after school, and weekend. Sample reading: fruit 0.80, chips 0.15, cookie 0.05. Recommended: fruit.]

This same pattern — output one neuron per class, softmax to get probabilities — is how a network can recognize handwritten digits (10 classes: 0 through 9), pick a song to recommend, or pick the next word in a sentence (about 100,000 possible classes — one per possible word/word-piece). Same idea, just more output neurons. If you want the deeper "why" of softmax (where it comes from in statistics), Wikipedia's softmax function page is a solid first stop.

▸ run real python

Run the real multi-class snack picker

This is neuro4.py. 3 inputs, 6 hidden neurons, 3 output classes with softmax. Trains in a couple of seconds and prints a table with the right snack for each combination.

neuro4.py Python loads on first run
(click "run" to execute)
Challenge

Add a fourth snack class in your notebook: smoothie. What new output neuron and labels would the network need?

7. Animal guesser

The snack picker had three choices. Real classifiers often have more classes and messier inputs. Let's train a tiny network to guess an animal from clues.

You met one-hot encoding in the previous chapter (the snack labels). This time we'll do it on the inputs too. A neural network wants numbers, not words. So a word choice like size = large becomes a few yes/no inputs:

size_small = 0
size_medium = 0
size_large = 1

Same for place, food, feet, and color. The network does not receive the word "orange"; it receives color_orange = 1 and the other color inputs set to 0. None of these clues is enough by itself; the network has to combine them.

animal    size      place    food      feet      color
dog       medium    home     both      paws      brown
cat       small     home     meat      paws      orange
rabbit    small     home     plants    paws      brown
pig       medium    farm     both      hooves    pink
horse     large     farm     plants    hooves    brown
cow       large     farm     plants    hooves    black-white
lion      large     wild     meat      paws      brown
tiger     large     wild     meat      paws      orange
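
Here is a sketch of how one row of that table could be turned into numbers. The trait values come from the table. The order of the choices inside each list and the names TRAITS, one_hot, and encode are assumptions made for this example, so the book's own script may lay things out differently.

```python
TRAITS = {
    "size": ["small", "medium", "large"],
    "place": ["home", "farm", "wild"],
    "food": ["meat", "plants", "both"],
    "feet": ["paws", "hooves"],
    "color": ["brown", "orange", "pink", "black-white"],
}

def one_hot(value, choices):
    # 1 in the slot that matches, 0 everywhere else.
    return [1 if choice == value else 0 for choice in choices]

def encode(animal):
    # Glue the one-hot lists for every trait into one long input list.
    numbers = []
    for trait, choices in TRAITS.items():
        numbers += one_hot(animal[trait], choices)
    return numbers

tiger = {"size": "large", "place": "wild", "food": "meat",
         "feet": "paws", "color": "orange"}

print(one_hot("large", TRAITS["size"]))  # [0, 0, 1]
print(encode(tiger))
```

With these trait lists, every animal becomes 15 numbers (3 + 3 + 3 + 2 + 4), and exactly five of them are 1.
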
▸ try it

Train an animal guesser

Click train, then change the traits. The network will show probability bars for each animal. Try normal combinations, then weird ones like a large home animal that eats meat and has hooves.

Traits

Network shape

epoch: 0 train accuracy: ?

Prediction

How the clues become numbers

What training saved

The trained model is just learned numbers. This preview shows a few weights from the input-to-hidden layer and hidden-to-output layer.

Why this is useful: the network can learn clue-combinations. One hidden neuron might become useful for farm + hooves, another for wild + meat + paws, another for small + home. We did not name those hidden neurons ourselves. Training nudged the weights until useful patterns appeared. (When researchers try to read what hidden neurons learned in real models, the field they're working in is called interpretability. It's hard, and it's some of the most interesting research in AI right now.)

How many hidden neurons? That number is a choice the builder makes before training. Training changes the weights, biases, and probabilities, but it does not grow new neurons in this tiny model. Try 2, 4, 8, and 12 hidden neurons: too few can run out of pattern space, while more gives the model more room to memorize the 8 training animals exactly instead of learning the rule. That's called overfitting, and it's a problem real machine learning has to fight constantly. We'll come back to it when we move to real datasets.

One more thing while you're here: hit the "show learned numbers" button further down. The trained model is just a bag of numbers, and that button literally prints them out. This is what's saved when someone says "the model weighs 10 GB" or "I downloaded the weights." Once the network's shape is fixed, a model is its weights and nothing else.

▸ run real python

Run the animal classifier

This is neuro6.py. It trains the same idea in numpy and prints predictions for the training animals plus one mixed-up custom animal.

neuro6.py Python loads on first run
(click "run" to execute)
Challenge

Add sheep in your notebook. Which existing traits would it share with cow and horse? Which new trait, like wool, would make it easier?

8. How learning actually works

You've been hearing "nudge the weights in the right direction" for eight chapters. Time to actually pin that down. This is the math that powers every other chapter in this book.

The loss landscape

Imagine every weight in the network is a knob you can turn. A tiny network might have 20 knobs; ChatGPT has hundreds of billions. For each combination of knob positions, you can measure how wrong the network is on your training data. That single number is called the loss (sometimes "cost" or "error" — same idea).

If you could plot loss against every possible knob setting, you'd get a landscape — bumpy, hilly terrain stretching out in millions of dimensions. High places are bad (the network gets stuff wrong); low places are good. Training a network is the search for a low spot in that landscape.

Humans can't picture a 175-billion-dimensional landscape. Nobody can. But the rule for getting downhill is the same whether you're in 2 dimensions or 2 billion: at your current spot, figure out which direction goes downward fastest, take a small step that way, repeat.

The gradient is just the slope

The word gradient sounds fancy. It just means "slope, in every direction at once." If you're standing on a hillside, the gradient is the arrow that points straight uphill from where you're standing, along with how steep that direction is. To go downhill, you walk the opposite way.

For each weight in the network, the gradient says: "if you increase this weight a tiny bit, the loss goes up (or down) by this much." Calculus (specifically, the chain rule we met in chapter 5) is the tool that computes the gradient for every weight at once. The algorithm that does this efficiently for a deep network is — you guessed it — backpropagation.

Once you have the gradient, the update rule is comically simple:

new_weight = old_weight  −  (learning_rate × gradient)

That's it. Every line of "training" in every script in this book is doing some version of that. The minus sign is what makes it descent — you're moving the weight against the gradient, toward smaller loss.
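
Here is the update rule running on the simplest landscape possible: one weight and a bowl-shaped loss. The loss (w − 3)², the starting point, and the learning rate are all made up for this sketch. The bottom of the bowl is at w = 3, so that's where the weight should end up.

```python
def loss(w):
    return (w - 3) ** 2

def gradient(w):
    # Slope of the loss at w (from calculus: the derivative of (w - 3)^2).
    return 2 * (w - 3)

learning_rate = 0.1
weight = 0.0
step_sizes = []

for step in range(50):
    new_weight = weight - learning_rate * gradient(weight)
    step_sizes.append(abs(new_weight - weight))
    weight = new_weight

print(round(weight, 4), round(loss(weight), 8))
print([round(s, 4) for s in step_sizes[:5]])
```

The second line of output shows the first five step sizes. Each one is smaller than the one before, because the slope gets flatter as the weight gets closer to the bottom.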

▸ try it

Roll the ball downhill

The curve is the loss. Horizontal position is the value of one weight. The red dot is where the network currently sits. Click "step" to take one gradient-descent step. Watch the dot find the bottom — and watch the steps get smaller as the slope flattens out, because the gradient itself shrinks near the minimum.

Why we take small steps (learning rate)

That learning_rate in the update is the size of each step. It's one of the most important knobs in deep learning — and it's a knob you pick, not the network.

In neuro1.py the learning rate was implicitly 1 (we didn't write it — the toy data was friendly). In neuro3.py we wrote lr = 0.5. In neuro5.py we used lr = 0.1 because ReLU is more sensitive. In real training people use schedules that shrink the learning rate over time — start big to make fast progress, end small to settle into a good minimum.
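
To get a feel for what the learning rate does, here is a sketch that runs the same 20 steps with three different values. It uses a made-up bowl-shaped loss, (w − 3)², whose bottom is at w = 3. The three learning rates were picked just to show the three behaviors.

```python
def descend(learning_rate, steps=20, start=0.0):
    weight = start
    for _ in range(steps):
        slope = 2 * (weight - 3)  # gradient of (w - 3)^2
        weight = weight - learning_rate * slope
    return weight

for lr in [0.01, 0.1, 1.1]:
    final = descend(lr)
    print("lr =", lr, " final weight =", round(final, 3),
          " distance from bottom =", round(abs(final - 3), 3))
```

The tiny learning rate barely moves in 20 steps. The middle one gets close to 3. The big one jumps over the bottom every time and lands farther away than it started, so the loss gets worse instead of better.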

Local minima — why we don't usually care

The landscape isn't a simple bowl. It has lots of dips, valleys, plateaus, ridges. A "local minimum" is a dip that's lower than everything around it but not the lowest place in the whole landscape. For decades, people worried that gradient descent would get stuck in bad local minima.

Here's the surprising empirical finding: in really high-dimensional landscapes (the kind real networks live in), most local minima turn out to be roughly as good as the global minimum. Bad minima are rare. There's also a lot of geometry where the network can wiggle around until it finds a path out. Modern networks just train, find a low spot, and the low spot is good enough. Nobody has a clean theory for why this works as well as it does — but it does.

Mini-batches: noisy descent

So far we've described "compute the loss on all your training data, then take a step." That's called full-batch gradient descent. With 60,000 training examples it's wildly expensive — every step requires running the entire network on the entire dataset.

The fix everyone uses: shuffle the data, take a tiny chunk (say 32 or 128 examples — a mini-batch), compute the gradient on just that chunk, take a step, grab the next chunk, repeat. This is called stochastic gradient descent (SGD), or really mini-batch SGD.

Picture-wise: instead of a smooth ball rolling cleanly downhill, imagine a slightly drunk ball wobbling downhill. Each step's direction is approximately correct but jittery, because it was computed from a small sample. The surprising thing is the noise actually helps — it shakes the ball out of small bad pockets and tends to find broader, flatter minima, which generalize better to new data. So mini-batching is faster AND finds better solutions. Win.
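
Here is the mini-batch loop as a sketch, learning a secret rule y = 2x with a single weight. The data, the batch size of 4, the learning rate, and the seed are all made up for this example. With only one weight and perfectly clean data there's no real "wobble" in direction, so this only shows the shuffle, chunk, step, repeat pattern.

```python
import random

random.seed(0)

# 32 made-up training examples that follow the secret rule y = 2x.
data = [(i / 32, 2 * i / 32) for i in range(1, 33)]

weight = 0.0
learning_rate = 0.1
batch_size = 4

for epoch in range(200):
    random.shuffle(data)  # new order every pass
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Gradient of the average squared error, on this chunk only.
        gradient = sum(2 * (weight * x - y) * x for x, y in batch) / len(batch)
        weight -= learning_rate * gradient

print(round(weight, 4))
```

Each pass through the data takes 8 small steps instead of 1 big one, and the weight still ends up at the secret value of 2.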

That's gradient descent. Every optimizer you'll meet (momentum, RMSprop, Adam, AdamW) is the same loop with smarter rules about how much to step and which direction to actually go (often a smoothed blend of recent gradients). The core idea, "compute the slope, step against it," has not changed since 1847.

Challenge

Click "step downhill" slowly and watch the steps shrink near the bottom. Why do they shrink? (Hint: step = lr × gradient, and the gradient is the slope — what's the slope of a flat line?)

9. Scaling this up

Everything so far has been with maybe a few dozen neurons total. Real networks are gigantic:

Network                                                    Year         Roughly how many weights
The tiny examples above                                                 ~50
Animal guesser (this book, ch 7)                                        ~200
MNIST classifier (this book, ch 15)                                     ~110,000
LeNet-5 (digit recognizer)                                 1998         ~60,000
ResNet-50 (image classifier)                               2015         ~25 million
GPT-3 (the base model under early ChatGPT)                 2020         175 billion
Frontier LLMs today (Claude, GPT-4, Gemini, Llama-405B)    2023–2026    Hundreds of billions to trillions (exact numbers usually not public)

What stays the same when you scale up

The math. Genuinely. The forward pass of a 175-billion-parameter model is still: multiply inputs by weights, add bias, apply a non-linearity, repeat. Backprop still computes gradients the same way. SGD still steps against the gradient. If you understand neuro4.py, you understand the inner loop of GPT.

You'll see this for yourself in neuro9.py a few chapters from here — the MNIST classifier is the same shape as the iris classifier in neuro8.py, just with 784 inputs instead of 4 and two hidden layers instead of one.

What changes when you scale up

More layers. Toy nets have 2 layers (input → hidden → output). Image models like ResNet-50 have 50. Modern LLMs have dozens of layers stacked on top of each other — GPT-2 has between 12 and 48 depending on size, GPT-3 has 96, Llama-2-70B has 80, Llama-3-405B has 126. (GPT-4 and Claude depths aren't public.) Why does depth help? Each layer can build a more sophisticated representation on top of the last. Early layers might learn edges or syllables; middle layers learn shapes or words; late layers learn objects or whole concepts. This is called compositionality and it's the whole reason "deep" learning is called "deep."

More parameters. Going from 60,000 weights to 175 billion is a factor of 3 million. That's not a tweak. Each weight is a number stored in memory; each is updated every training step.
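
For a plain network where every neuron connects to every neuron in the next layer, counting parameters is simple arithmetic. Here is a sketch. The function name and the layer sizes are assumptions made for this example. In particular, 784 → 128 → 64 → 10 is a hypothetical digit-classifier shape and is not necessarily the one this book uses.

```python
def count_parameters(layer_sizes):
    total = 0
    # Walk through each pair of neighboring layers.
    for inputs, outputs in zip(layer_sizes, layer_sizes[1:]):
        total += inputs * outputs  # one weight per connection
        total += outputs           # one bias per neuron
    return total

print(count_parameters([2, 4, 1]))           # 17
print(count_parameters([784, 128, 64, 10]))  # 109386
```

Notice how quickly the count grows: almost all of the second network's parameters sit in the first layer, where 784 inputs each connect to 128 neurons.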

More data. Iris has 150 rows. MNIST has 60,000 images. ImageNet has 1.2 million images. GPT-3 was trained on roughly 300 billion tokens — about 500 million pages of text. Modern frontier models train on trillions of tokens. The rough empirical rule is: bigger models need proportionally more data to learn well. The 2020 Kaplan paper and the 2022 "Chinchilla" paper worked out what those proportions are; people call those rules scaling laws.

More compute. Training GPT-3 took thousands of GPUs running for months and reportedly cost millions of dollars. A GPU (Graphics Processing Unit) is a chip originally built for video games, but it's spectacularly good at doing thousands of multiply-and-add operations in parallel — which is exactly what neural network forward and backward passes are. Specialized AI chips like Google's TPUs and NVIDIA's H100 / B200 take that idea further. The math is unchanged; the hardware that does the math billions of times per second is what changed.

What gets harder when you scale up

Vanishing/exploding gradients. Remember the 10⁻⁴⁹ from the activations chapter? Stacking 100 layers of anything will misbehave unless you're careful. Modern networks use ReLU-family activations, careful weight initialization, and a trick called residual connections (shortcuts that let signal skip past layers) to keep gradients alive.

Overfitting. With 175 billion parameters, your network has plenty of capacity to memorize its training data instead of learning rules. People fight this with bigger datasets, dropout, weight decay, and early stopping.

Engineering. Training a model that doesn't fit on one GPU means splitting it across thousands. Failure becomes a daily fact of life. Saving checkpoints, restarting, monitoring loss curves, debugging numerical issues — most of a frontier ML team's work is engineering, not algorithm design.

Why bigger keeps working

Here's the unreasonably-effective fact at the heart of modern AI: when you make a model bigger and train it on more data, it just keeps getting better. Smoothly. Predictably. People expected diminishing returns around 1 billion parameters, then 10 billion, then 100 billion. Each time, scale kept paying off. Rich Sutton's "The Bitter Lesson" is the most-cited essay on why: clever human-engineered tricks tend to get beaten by simple, general algorithms that just consume more compute. It's bitter because researchers like cleverness. It's a lesson because it keeps happening.

Challenge

Estimate the weights in a tiny network with 3 inputs, 5 hidden neurons, and 2 outputs. Count weights first, then remember each neuron also has a bias. Then work out how many years you'd need to type out all 175 billion of GPT-3's weights if you typed one per second. (Hint: that's about 5,500 years.)

10. How AI chatbots actually write

Here's the leap from "neuron picks a snack" to "writes a whole story." The underlying model class is called a transformer, introduced in the 2017 paper "Attention Is All You Need". Every modern LLM is some descendant of that paper.

The tokens trick

ChatGPT doesn't see letters or words. Before anything happens, your message gets chopped into chunks called tokens. A token is usually a piece of a word (sometimes a whole short word, sometimes a single character). The chopping algorithm is called byte-pair encoding (BPE) — it learns which letter combinations are common in text and bundles them into single tokens. Common words like "the" become one token; rare words get split into pieces.
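
Here is a heavily simplified sketch of the "bundle common letter combinations" idea. It starts from single letters, finds the pair that sits side by side most often, glues it into one chunk, and repeats. Real BPE tokenizers work on bytes and much larger text with many more merges. The five-word corpus and the function names are made up for this example.

```python
from collections import Counter

def most_common_pair(words):
    # Count every pair of neighboring chunks across all words.
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    # Glue every occurrence of the pair into a single chunk.
    merged_words = []
    for symbols in words:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        merged_words.append(merged)
    return merged_words

words = [list(w) for w in ["the", "the", "the", "then", "thin"]]
for _ in range(2):
    words = merge(words, most_common_pair(words))

print(words)
```

After two merges, the common word "the" has become a single token, while the rarer "thin" is still split into pieces: "th", "i", "n".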

Each token has an ID number from a fixed list. The size of that list varies by model: GPT-2 had about 50,000 tokens; GPT-4's tokenizer has about 100,000; Llama-3 has 128,000. So when this chapter says "about 100,000 tokens," picture a frontier-model-sized vocabulary.

Each token becomes a long list of numbers

The model looks each token's ID up in a giant table and grabs a vector of numbers — the token's embedding. How long that vector is depends on the model: GPT-2 small used 768 numbers per token; GPT-3 (175B) used 12,288; smaller models use less, frontier models use more. Call it "thousands of numbers per token" and you're in the right ballpark.

Words with similar meanings have similar embeddings, because the model figured that out from reading huge amounts of text. The embedding for "dragon" is closer to the one for "monster" than to the one for "broccoli."

Here's the connection back to everything you've read so far: the embedding table is just learned weights. Same as the weights in chapter 2's neuron. Same as the weights in neuro4.py. It's one giant matrix — rows indexed by token ID, columns are the embedding numbers. During training, the same gradient descent that learns the OR rule also slowly nudges the embedding numbers until "king − man + woman" lands near "queen" and "dragon" lands near "monster." Nothing new in the algorithm; just a much bigger matrix.
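Here is a small sketch of the lookup and of what "closer" means. The token IDs and the three-number embeddings are hand-written assumptions for this example; a real table is learned during training and has thousands of columns. Closeness is measured with cosine similarity, one common choice.

```python
import math

# Hypothetical token IDs and a tiny hand-written embedding table.
token_ids = {"dragon": 0, "monster": 1, "broccoli": 2}
embedding_table = [
    [0.9, 0.8, 0.1],   # row 0: dragon
    [0.8, 0.9, 0.2],   # row 1: monster
    [0.1, 0.2, 0.9],   # row 2: broccoli
]

def embed(token):
    """Look the token's ID up and return that row of the table."""
    return embedding_table[token_ids[token]]

def cosine_similarity(a, b):
    """1.0 means pointing the same way, 0.0 means unrelated directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

dragon_monster = cosine_similarity(embed("dragon"), embed("monster"))
dragon_broccoli = cosine_similarity(embed("dragon"), embed("broccoli"))
print(dragon_monster > dragon_broccoli)  # True
```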

The model's only job (during pre-training): predict the next token

Given the tokens so far, what's the most likely next token? That's literally the entire pre-training task. It's the snack-picker from chapter 6, but with about 100,000 classes instead of 3 — every possible token is one of the choices.

(We say "pre-training" because for products like ChatGPT and Claude there's a second training stage after this — covered below — that turns the raw next-token predictor into a polite assistant. The next-token prediction is the foundation everything else is built on, but it's not the whole story.)

▸ try it

Walk through the full pipeline

This shows what happens inside the model when it reads your prompt: "tell me a story about a dragon who likes pizza". Click "next stage" to step through it.

1. Tokenize
2. Embed
3. Attention
4. Predict

Stage 1 — chop the text into tokens

The model can't read letters. Your sentence gets chopped into chunks called tokens. Each one has an ID number from a list of about 100,000.

Once the model has picked the first token, it sticks it on the end of your prompt and runs the whole pipeline AGAIN to pick the next one. And again. And again. That's how a whole story comes out one token at a time.
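The loop itself is short. In this sketch the "model" is a hypothetical lookup table that only looks at the last token, and it always takes the top guess. A real LLM looks at the whole context and scores every token in its vocabulary, but the outer loop has the same shape.

```python
# Hypothetical stand-in for a trained model: last token -> next-token probabilities.
NEXT_TOKEN_PROBS = {
    "<start>": {"Once": 0.6, "In": 0.4},
    "Once": {"upon": 0.9, "there": 0.1},
    "upon": {"a": 1.0},
    "a": {"time": 0.7, "dragon": 0.3},
    "time": {"<end>": 1.0},
    "In": {"a": 1.0},
    "there": {"<end>": 1.0},
    "dragon": {"<end>": 1.0},
}

def predict_next(tokens):
    """Pick the most likely next token given the tokens so far."""
    probs = NEXT_TOKEN_PROBS[tokens[-1]]
    return max(probs, key=probs.get)

def generate(max_tokens=10):
    tokens = ["<start>"]
    for _ in range(max_tokens):
        next_token = predict_next(tokens)   # run the "pipeline" once
        if next_token == "<end>":
            break
        tokens.append(next_token)           # stick it on the end, then go again
    return tokens[1:]

story = generate()
print(" ".join(story))  # Once upon a time
```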

▸ try it

Watch a story get written one token at a time

Your message: "tell me a story about a dragon who likes pizza". Click "next token" to see what the model picks. Each time you'll see its top 5 guesses with probabilities — then it picks one and adds it to the story.

(The probabilities below are illustrative — they show roughly what a real LLM's distribution looks like at each step, but they're hand-written for this widget, not live model output. Also, the top 5 you see is a slice from the full distribution over ~100,000 tokens; the rest of the mass is spread across thousands of tokens with tiny individual probabilities.)

The model does not keep a human-style outline with a fixed ending. At each step it picks a next token from the context so far, very fast, using patterns it learned from huge amounts of text, code, and conversations.

The surprising thing is that this simple-looking loop — "pick the next token, then pick the one after that" — can produce essays, code, math steps, and full conversations. To be genuinely good at "what's the next token", the model has to absorb deep patterns in language and the world.

Sampling: why the same prompt gives different answers

You may have noticed: if you ask ChatGPT the same question twice, you don't always get the same answer. That's on purpose. The model produces a probability for every possible next token, and then it samples — picks one at random, with higher-probability tokens being more likely. It doesn't always pick the top one.

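The sampling has a few knobs. The main one is temperature: at 0 the model always picks its top guess, and higher values flatten the probabilities so less likely tokens get picked more often.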

So when the pipeline above showed "model picks Once (highest probability)" — that's a simplification. With temperature 0 it would always pick Once. With temperature 0.7 it would usually pick Once but sometimes pick Sure, In, or There. This is also why "regenerate" produces different answers and why the same prompt twice doesn't have to match.
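Here is a minimal sketch of temperature sampling. The four raw scores are made up, and the function names are just for this example. Dividing the scores by the temperature before the softmax is the standard trick; treating temperature 0 as "always take the top one" is a special case because you can't divide by zero.

```python
import math
import random

def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities; higher temperature = flatter."""
    scaled = [s / temperature for s in scores]
    biggest = max(scaled)                       # subtract the max for numerical safety
    exps = [math.exp(s - biggest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, scores, temperature, rng):
    if temperature == 0:
        return tokens[scores.index(max(scores))]   # always the top guess
    probs = softmax_with_temperature(scores, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

tokens = ["Once", "Sure", "In", "There"]
scores = [3.0, 1.5, 1.0, 0.5]   # made-up raw scores for this example

rng = random.Random(0)
print(sample_token(tokens, scores, 0, rng))  # Once

for temperature in (0.7, 2.0):
    probs = softmax_with_temperature(scores, temperature)
    print(temperature, [round(p, 2) for p in probs])
```

At temperature 2.0 the printed probabilities are noticeably more even than at 0.7, which is why high temperatures give stranger stories.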

Pre-training vs. ChatGPT-the-product (RLHF)

One more important wrinkle. Everything above describes a model trained to predict the next token in random text from the internet. That's the pre-trained base model. If you talked to a raw base model — and you can, lots of them are public on Hugging Face — it wouldn't act like an assistant. You'd type "What's 2+2?" and it might continue your text as if it were a math worksheet ("What's 2+2? ___ What's 3+3? ___") instead of answering you.

ChatGPT acts like an assistant because of a second training stage on top of pre-training. The most common technique is called RLHF — Reinforcement Learning from Human Feedback. The recipe:

  1. Show the base model a prompt. Have it generate several different answers.
  2. Have humans rate which answer was better.
  3. Train a smaller "reward model" to predict the human ratings.
  4. Use that reward model as a stand-in for human preferences, and nudge the original model's weights to produce answers the reward model rates highly.

RLHF is why ChatGPT acts polite, follows instructions, refuses harmful requests, and structures its replies. Without RLHF the underlying math is the same — it's still next-token prediction — but the model is "shaped" by examples of what humans liked into a conversational assistant. Anthropic's work on Claude's character is a good read on what this shaping looks like in practice.

If you want to see this whole pipeline built from scratch in real code, Andrej Karpathy's "Let's build GPT" video (about 2 hours) is the gold standard. He starts from a very simple next-token predictor written in PyTorch and ends with a working tiny transformer. For the longer view, Stephen Wolfram's "What Is ChatGPT Doing… and Why Does It Work?" is the most-respected long-form pop-sci explainer of LLMs.

Challenge

Use the candidate bars to make a different story in your head. What would change if the model picked "In" instead of "Once" as the first token? Then think about what would happen at temperature 0 (always pick the top one) vs temperature 2 (almost flat probabilities).

11. What to remember so far

If you remember nothing else from the first ten chapters, remember this:

  1. A neuron is a tiny math equation: multiply, add, squish.
  2. A network is a bunch of neurons stacked into layers.
  3. Weights are the adjustable numbers inside each neuron. A trained model is its weights and nothing else.
  4. Training means showing examples and nudging weights (via gradient descent + backpropagation) to reduce mistakes, often millions or billions of times.
  5. Deep networks (lots of layers) can learn complicated patterns that single neurons can't. The XOR fail from chapter 4 is the small version of why.
  6. LLMs like ChatGPT use the same training idea at enormous scale, with a transformer architecture (covered in the "going deeper" chapters) built around next-token prediction plus a second RLHF stage that shapes the model into an assistant.
  7. There is no single magic ingredient. The power comes from simple math, lots of examples, lots of computer power, and clever engineering working together.

The same training idea that solved the screen-time puzzle in chapter 3 is part of what powers ChatGPT. The real systems are much bigger and add specialized architecture and engineering.

Don't stop here. There's more. The next three chapters are the "going deeper" sections. They cover training on real datasets (the iris flowers), using PyTorch (the framework every real ML project uses), and the architectures behind image models and LLMs (CNNs and transformers). They include some of the strongest material in the book. Keep going.

Challenge

Explain neural networks to someone else using only these words: input, weight, mistake, nudge, layer, prediction.

If you want to build the real toy version in Python, look at neuro1.py through neuro9.py in this folder. Each one adds one new idea on top of the previous. The later chapters cover the more advanced topics — real datasets, frameworks, and networks for images and language. For a free, deeper textbook when you're ready, the Goodfellow / Bengio / Courville Deep Learning book is available online at deeplearningbook.org.

Going deeper — training on a real dataset

Everything up to chapter 11 used hand-picked toy examples: 4 rows for the screen-time rule, 8 for the snack picker. Real machine learning trains on much bigger datasets. A classic benchmark called iris (measured by the botanist Edgar Anderson and made famous by the statistician Ronald Fisher in a 1936 paper) has 150 flowers with 4 measurements each. MNIST (handwritten digits, curated by Yann LeCun and colleagues) has 60,000 training images. ImageNet has 1.2 million labelled training images (14 million across all the variants). The text used to train GPT-3 was over 300 billion tokens, somewhere around 500 million pages of writing.

Once you stop being your own dataset author, four disciplines kick in. None are hard, but each matters.

1. Train / test split

You can't measure how well a network learned by testing it on the same data you trained on. That's like marking your own homework. Hold back, say, 20% of the data, train on the other 80%, and only check accuracy on the held-out portion. That number is the only honest measure.

If your network gets 100% on training data and 60% on held-out test data, you're overfitting — it memorized the answers instead of learning the rule.
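A split like that takes only a few lines. This is a sketch with a made-up helper name and plain Python lists; neuro7.py itself may do it differently (for example with scikit-learn's own splitter). The rows here are just the numbers 0 to 149 standing in for 150 flowers.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then hold back a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # shuffle so the split isn't biased by order
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(150))          # stand-ins for 150 flowers
train, test = train_test_split(rows)
print(len(train), len(test))     # 120 30
```

The important property is that no row ends up on both sides, so the test score really is measured on flowers the network has never seen.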

2. Feature scaling

Iris petal length is in centimeters (roughly 1–7 cm). Petal width is also in centimeters but covers a much smaller range (roughly 0.1–2.5 cm). If one input column has values 10× bigger than another, its weight updates will dominate training. Standardize each column to mean 0 and standard deviation 1: compute the average and spread on the training set only, then apply that same shift and scale to both train and test.
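Here is a sketch of standardizing one column. The measurements are made-up numbers in a plausible range, and the helper names are assumptions for this example. Notice that the mean and spread come from the training column only and are then reused on the test column.

```python
import statistics

def fit_scaler(train_column):
    """Learn the shift (mean) and scale (standard deviation) from training data."""
    return statistics.fmean(train_column), statistics.pstdev(train_column)

def standardize(column, mean, spread):
    """Apply an already-learned shift and scale to any column."""
    return [(x - mean) / spread for x in column]

train_petal_length = [1.4, 4.7, 6.0, 1.3, 5.1]   # made-up training values, in cm
test_petal_length = [1.5, 4.5]                   # made-up held-out values

mean, spread = fit_scaler(train_petal_length)
train_scaled = standardize(train_petal_length, mean, spread)
test_scaled = standardize(test_petal_length, mean, spread)   # same mean and spread

print(round(statistics.fmean(train_scaled), 6))   # 0.0 (or -0.0)
print(round(statistics.pstdev(train_scaled), 6))  # 1.0
```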

3. Mini-batches

Toy networks could feed all 4 (or 8) examples through at once each epoch. With 60,000 examples (or 60 million) that gets slow and can run out of memory. Solution: slice the dataset into small batches (say, 16 or 128 rows). Run forward + backward + weight update on one batch. Move to the next batch. After all batches, that's one epoch.

Mini-batches also make training smoother — many small updates beat one huge swing.
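The slicing can be sketched like this. The generator name is made up, and the rows are stand-in numbers. Reshuffling the order each epoch is a common choice that this sketch assumes.

```python
import random

def iterate_minibatches(rows, batch_size, rng):
    """Yield the rows in shuffled order, batch_size at a time."""
    order = list(range(len(rows)))
    rng.shuffle(order)                       # new order each time it is called
    for start in range(0, len(order), batch_size):
        yield [rows[i] for i in order[start:start + batch_size]]

rows = list(range(120))                      # stand-ins for 120 training flowers
rng = random.Random(0)

batch_sizes = [len(batch) for batch in iterate_minibatches(rows, 16, rng)]
print(batch_sizes)  # [16, 16, 16, 16, 16, 16, 16, 8]
```

One pass through all eight batches is one epoch. The last batch is smaller because 120 doesn't divide evenly by 16.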

4. Watch loss AND accuracy

Loss is what the math optimizes. Accuracy is what humans care about. Print both. When training accuracy keeps climbing but test accuracy plateaus or drops — that's overfitting happening in real time.
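To see why you want both numbers, here is a sketch that scores two imaginary models on three flowers. The probabilities are made up and the helper names are assumptions. Both models get every answer right, but one is far more sure of itself, and only the loss can tell them apart.

```python
import math

def cross_entropy(probabilities, correct_class):
    """Loss for one example: small when the correct class gets high probability."""
    return -math.log(probabilities[correct_class])

def evaluate(predictions, labels):
    """Return (average loss, accuracy) over a set of examples."""
    losses = [cross_entropy(p, y) for p, y in zip(predictions, labels)]
    correct = sum(1 for p, y in zip(predictions, labels) if p.index(max(p)) == y)
    return sum(losses) / len(losses), correct / len(labels)

labels = [0, 1, 2]
confident = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
hesitant = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]

confident_loss, confident_acc = evaluate(confident, labels)
hesitant_loss, hesitant_acc = evaluate(hesitant, labels)
print(confident_acc, hesitant_acc)                        # 1.0 1.0
print(round(confident_loss, 3), round(hesitant_loss, 3))  # 0.105 0.916
```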

Run neuro7.py in this folder to train iris for real (requires scikit-learn — install with pip install scikit-learn). After 200 epochs of mini-batch training, expect ~97% accuracy on the held-out test set. Same network shape as the snack picker (just 4 → 16 → 3 instead of 3 → 6 → 3), same math you already know, now eating real flower measurements. Iris is also a very easy dataset — pretty much any model gets above 95%, so don't read this as evidence you've built something amazing. It's the warm-up.

▸ run real python

Run iris training

This is neuro7.py. It uses numpy plus scikit-learn's built-in iris dataset.

Heads up: in the browser, the first time you click "run" here, the page downloads scikit-learn through Pyodide. That's roughly 10 megabytes and takes 20–40 seconds on a typical home connection. Later runs are fast.

Challenge

Change test_size=0.2 to 0.5. Does the test accuracy become noisier when the model has fewer training examples?

Going deeper — using a framework (PyTorch)

So far you've written every line yourself: x @ w1 + b1, sigmoid, manual gradients, weight updates. That's the right way to learn. But nobody works that way in real life. They use a framework called PyTorch (or its cousin TensorFlow; for production work on Google hardware, also JAX). The framework does the bookkeeping. You just describe the network shape.

The whole iris training loop in PyTorch looks like this:

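import torch
from torch import nn

# 4 flower measurements in, 16 hidden neurons, 3 species scores out
model = nn.Sequential(
    nn.Linear(4, 16),
    nn.ReLU(),
    nn.Linear(16, 3),
)
loss_fn = nn.CrossEntropyLoss()   # softmax + cross-entropy in one step
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# batches: an iterable of (inputs, labels) tensor pairs built elsewhere,
# for example a torch.utils.data.DataLoader over the training set
for epoch in range(200):
    for xb, yb in batches:
        logits = model(xb)            # forward pass
        loss = loss_fn(logits, yb)    # how wrong was this batch?

        optimizer.zero_grad()         # clear gradients left from the last batch
        loss.backward()               # autograd computes every gradient
        optimizer.step()              # applies them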

Every piece replaces something you wrote by hand:

Hand-rolled                         | PyTorch
np.maximum(0, x)                    | nn.ReLU()
softmax + cross-entropy by hand     | nn.CrossEntropyLoss()
Manual forward (h = ..., o = ...)   | model(x)
Manual backward (d_o, d_h)          | loss.backward() (autograd)
w -= lr * grad                      | optimizer.step()
Plain SGD                           | torch.optim.Adam (smarter)

Autograd is the powerful bookkeeping part. PyTorch tracks every operation you do during the forward pass. When you call loss.backward(), it walks the chain backward and computes every single gradient, exactly as you did by hand in chapter 5. You never write a chain-rule expression again.
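To take some of the magic out of it, here is a toy version of that bookkeeping in plain Python. This is not how PyTorch is implemented. The Value class is a made-up, stripped-down sketch that only knows addition and multiplication, in the spirit of Karpathy's micrograd. Each result remembers which values it came from and the local slope toward each one, and backward() walks that record in reverse.

```python
class Value:
    """A number that remembers how it was computed."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # the Values this one was built from
        self._local_grads = local_grads  # slope of this Value w.r.t. each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # List every Value so that parents come before the results built from them.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk backward, passing each gradient on to the parents (the chain rule).
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad

w, x, b = Value(2.0), Value(3.0), Value(1.0)
y = w * x + b
y.backward()
print(y.data, w.grad, x.grad, b.grad)  # 7.0 3.0 2.0 1.0
```

The gradient of y with respect to w is x (3.0) and with respect to x is w (2.0), which is exactly what you would get doing the chain rule by hand.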

Adam is the optimizer used in almost every modern model. It's gradient descent with two upgrades: momentum (keep moving in directions that worked recently) and per-weight learning rates (slow down weights that bounce around, speed up ones that crawl). For training transformers specifically, people now usually use a tweaked version called AdamW (the W stands for "weight decay") — but it's still recognizably the same algorithm.
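Here is a sketch of the Adam update for a single weight, using the commonly quoted default settings for beta1, beta2, and eps. The function name and the toy problem (find the w that minimizes (w − 3)², whose gradient is 2(w − 3)) are assumptions for this example. m is the momentum part, and v is the running average of squared gradients that gives each weight its own step size.

```python
import math

def adam_minimize(gradient, w, steps=500, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = 0.0   # running average of gradients (momentum)
    v = 0.0   # running average of squared gradients (per-weight scale)
    for t in range(1, steps + 1):
        g = gradient(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)   # correct the bias from starting at 0
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy problem: the minimum of (w - 3)^2 is at w = 3.
best_w = adam_minimize(lambda w: 2 * (w - 3), w=0.0)
print(best_w)  # very close to 3
```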

Run neuro8.py locally (after pip install torch scikit-learn) to see this exact loop train iris in seconds. The official PyTorch 60-minute blitz is the most-recommended next tutorial. After that, the natural next step for everyone reading this book is Andrej Karpathy's "Neural Networks: Zero to Hero" YouTube series. He rebuilds backprop (in a tiny library called micrograd) and a working GPT from scratch in real code. It's basically a more advanced version of this book.

▸ run locally

Run the PyTorch version

This is neuro8.py. PyTorch is too large for this single-file browser runner, so this panel keeps the code synced and gives the exact local command.

pip install torch scikit-learn
python3 neuro8.py

Challenge

Change the hidden layer from 16 neurons to 4. Does PyTorch still train fast? Then try 64 and compare train accuracy to test accuracy.

Going deeper — convnets, transformers, and beyond

Same machinery scales up. Two big architectural ideas you'll meet from here:

Convolutional networks (CNNs) — for images

The "hello world" of deep learning is MNIST: 70,000 handwritten digit images (60,000 for training and 10,000 for testing, each 28×28 pixels, 10 classes). A plain fully-connected network like the snack picker can hit ~97% accuracy on MNIST. State of the art with convolutional networks is ~99.8%.

The change: replace nn.Linear in the early layers with nn.Conv2d. Each neuron only looks at a small spatial patch of the image (say, 3×3 pixels) and shares its weights across all positions. This means the same edge detector gets applied everywhere in the image. That's a great fit for pictures, where what something is doesn't depend on where it sits (a cat is a cat whether it's in the top-left or bottom-right of the photo).
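Here is what "slide one small set of weights over the whole image" looks like in plain Python. The 5×5 image and the hand-picked 3×3 edge filter are made up for this example; in a real CNN the filter values are learned. Like nn.Conv2d, this slides the filter without flipping it.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image and record each patch's weighted sum."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            for i in range(k):
                for j in range(k):
                    # The same 9 weights are reused at every position.
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# Dark on the left (0), bright on the right (1): a vertical edge.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hand-picked filter that responds where brightness rises from left to right.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

feature_map = convolve2d(image, vertical_edge)
print(feature_map)  # [[3, 3, 0], [3, 3, 0], [3, 3, 0]]
```

The output is large wherever the patch straddles the edge and zero where the patch is all one brightness, no matter which row it's in.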

Pattern: Conv → ReLU → Pool → Conv → ReLU → Pool → ... → Linear. CNNs are everywhere images are: medical scans, face recognition, self-driving car perception, photo classification. The 1998 paper that started it all is LeCun et al.'s LeNet-5. The 2012 paper that proved CNNs could win ImageNet, the big open competition for general image recognition, is AlexNet. If you want to see what conv filters actually look like, Chris Olah's distill.pub feature visualization is a masterpiece.

Transformers — for sequences (and now everything)

You already met the basic idea in chapter 10. For sequences (text, audio, code), replace fully-connected hidden layers with attention blocks. Each token gets to look at the other tokens (in a chatbot, only the ones that came before it) and pull a weighted blend of their values into its own representation. The output is still a softmax over a vocabulary, one token at a time.
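The "weighted blend" can be sketched in a few lines. This is a heavily simplified, assumption-laden version: one attention head, no learned projections (the token vectors are used directly as queries, keys, and values), and no mask hiding later tokens. The division by the square root of the vector length follows the scaling used in "Attention Is All You Need".

```python
import math

def softmax(scores):
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """For each query, blend the values, weighted by how well each key matches."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # How strongly does this token match every token (including itself)?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted blend of the value vectors.
        blended = [sum(w * v[j] for w, v in zip(weights, values))
                   for j in range(len(values[0]))]
        outputs.append(blended)
    return outputs

# Two made-up token vectors pointing in different directions.
tokens = [[1.0, 0.0], [0.0, 1.0]]
outputs = attention(tokens, tokens, tokens)
print([round(x, 2) for x in outputs[0]])
```

Each output is mostly the token's own vector with some of the other token mixed in, because a token matches itself best in this toy setup.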

Every modern LLM (GPT, Claude, Gemini, Llama) is a stack of transformer blocks. There are a few practical tricks on top (positional embeddings, layer normalization, and in some models mixture of experts), but the core is just stacked attention + feed-forward layers, with softmax at the end. The original paper is Vaswani et al., "Attention Is All You Need" (2017). The kid-friendly walkthrough is Jay Alammar's "The Illustrated Transformer".

The honest summary

Whether you're building a snack classifier, a digit recognizer, or a frontier LLM, the same primitives keep showing up: weights, a forward pass, a loss, and gradient descent with backpropagation to nudge the weights.

The math you wrote by hand in chapter 5 still shows up inside GPT-class models. The advanced systems add a lot of scale, data, architecture, and engineering, but the basic training idea should feel recognizable now.

▸ try it — comparison: how a not-neural-network does it

Draw a digit (template matcher, not a neural net)

Important: this widget is not a trained neural network. It's a "template matcher" — a much older, simpler trick where we hand-make a few example pictures of each digit and find the closest match to your drawing using cosine similarity. We've left it here as a comparison: drawing a few digits will show you exactly how brittle hand-coded matching is. The real MNIST neural network is the next script (neuro9.py) — to run it you'll need PyTorch locally.

Try drawing the same digit skinny, wide, tilted, or off-center. The score changes because the template matcher is brittle. A trained neural network learns many versions of each digit.
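You can see the brittleness in a tiny sketch. The 3×3 templates below are made up for this example (the widget's own templates are bigger and are not shown here). A "1" drawn in the middle matches its template perfectly, but the very same stroke shifted one column to the left gets matched to "0".

```python
import math

# Hypothetical 3x3 templates, flattened row by row (1 = ink, 0 = blank).
TEMPLATES = {
    "0": [1, 1, 1,
          1, 0, 1,
          1, 1, 1],
    "1": [0, 1, 0,
          0, 1, 0,
          0, 1, 0],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def classify(drawing):
    """Return the digit whose template is most similar to the drawing."""
    return max(TEMPLATES, key=lambda digit: cosine_similarity(drawing, TEMPLATES[digit]))

centered_one = [0, 1, 0,
                0, 1, 0,
                0, 1, 0]
shifted_one = [1, 0, 0,
               1, 0, 0,
               1, 0, 0]

print(classify(centered_one))  # 1
print(classify(shifted_one))   # 0
```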

To run the bigger example yourself: neuro9.py in this folder trains a real MNIST classifier with PyTorch. Needs pip install torch torchvision and a few minutes of CPU (or seconds on a GPU). After 5 epochs you'll see ~98% test accuracy. That's your network classifying handwritten digits well — which, when you remember that we started this story with a single neuron deciding "screen time: yes or no", is a pretty good place to stop.

If you finished this book and want to go further

The short list of places to go next:

▸ run locally

Train the real MNIST model

This is neuro9.py. It downloads MNIST on first run and trains a real PyTorch digit classifier.

pip install torch torchvision
python3 neuro9.py

Challenge

Run neuro9.py, then reduce the first hidden layer from 128 neurons to 16. Watch how the test accuracy changes.