How a neural network actually works

A tour for curious 13-year-olds, with things you can play with. Read in order. Drag the sliders. Press the buttons.

1. Tiny math map

Before the neural-network stuff starts, here is the tiny bit of math language this book uses over and over. None of it is advanced; it is just a map for reading the examples.

What you see	What it means
`0`	no, off, false, absent
`1`	yes, on, true, present
`[1, 0]`	two inputs: the first one is on, the second one is off
`(1, 0)`	the same two numbers, drawn as a dot on graph paper
`>`	"is bigger than?" — the question that turns a number into a yes/no answer

One more thing before we start: this book keeps coming back to a single running example called the screen-time rule. The house rule is: you get screen time if you practiced piano OR finished your homework. In chapter 4 we'll teach a neuron that rule. For now, just notice how one example from it can be written three ways.

A table row, a graph dot, and a line of code can all describe the same example. For the screen-time rule, this row:

piano = 1, homework = 0, answer = YES

can also be written as this input and label:

x = [1, 0]
y = 1

and it can also be drawn as the dot (1, 0) with a YES label. Same information, three costumes.

A weight is just "how much this input matters." A bias is just "how easy it is for the neuron to say yes before it sees the inputs." The neuron adds everything up, then asks one yes/no question: is the total big enough?

Reading tip: whenever a formula looks scary, translate it back to words. weight1 × input1 + weight2 × input2 + bias > 0 means "input 1 gets a vote, input 2 gets a vote, the bias adds a built-in lean, and the neuron asks whether the total crosses the yes line."
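If it helps to see that translation run, here is a small Python sketch of the same yes/no question. The weights (1 and 1) and the bias (-0.5) are example values picked for illustration, and `says_yes` is a made-up helper name, not something from the book's scripts.

```python
# A sketch of the yes/no question from the reading tip.
# The weights and bias are example values, not learned ones.
weight1, weight2, bias = 1.0, 1.0, -0.5

def says_yes(inputs):
    """Return True when the weighted total crosses the yes line (0)."""
    input1, input2 = inputs
    total = weight1 * input1 + weight2 * input2 + bias
    return total > 0

# The example from the table row: piano = 1, homework = 0.
x = [1, 0]
y = 1  # the right answer: YES

print(says_yes(x))       # True, because 1.0 + 0.0 - 0.5 = 0.5, which is above 0
print(says_yes([0, 0]))  # False, because the total is -0.5
```

Each input gets its vote, the bias adds its lean, and the last line of the function asks whether the total crossed the yes line.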

Challenge

Write the input [0, 1] in words for the screen-time example. Which thing is off? Which thing is on?

2. The big picture

A neural network is a very large pile of tiny math equations. Together they turn an input (like a picture, or a sentence) into an output (like "this is a cat" or the next word in a story).

Why is it called a neural network? The name is borrowed from the brain. Your brain is a web of cells called neurons that pass tiny signals to each other, and each one "fires" when the signals arriving at it add up to enough. The math version you're about to meet was loosely inspired by that — loosely is the key word. What's inside ChatGPT is not a simulated brain; it's arithmetic. But the name stuck.

Each tiny equation has some adjustable numbers called weights. At the start the weights are random and the network is terrible. You show it lots of examples where you already know the right answer. Each time it gets one wrong, you nudge every weight a tiny bit in the direction that would have made it slightly less wrong. (How does it know which direction is "less wrong"? In chapter 4 you'll do one of these nudges yourself, with real numbers — and chapters 8 and 12 explain how the network works the direction out for millions of weights at once.)

Do this millions or billions of times and the weights end up encoding a surprising amount of knowledge about the world.

That's the main idea. The rest of the book shows how far that idea can stretch.

If you like watching things move while you read, the 3Blue1Brown neural networks series on YouTube animates almost every idea in this book. It's the single best companion to what you're reading now.

Challenge

Before moving on, name three problems that could be turned into "input goes in, prediction comes out." Try one from school, one from a game, and one from real life.

3. One tiny neuron

Imagine a little machine. It has some inputs and one output. Each input is a number — could be the brightness of a pixel, or "did I practice piano? 1 for yes, 0 for no".

The machine does three things in order:

Multiply each input by its own "importance number" (a weight)
Add them all together (plus one extra fudge factor called a bias)
Squish the answer into a number between 0 and 1 using a curve called sigmoid

That output between 0 and 1 is how strongly the neuron "fires." It's basically how confident the neuron is that the answer is yes. That's one neuron. Yes, that's the whole thing.
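The three steps can be sketched in a few lines of Python. This is only an illustration: the function names are made up, and the weights and bias are example values rather than anything a network has learned.

```python
import math

def sigmoid(total):
    """Squish any number into the range 0 to 1."""
    return 1 / (1 + math.exp(-total))

def neuron(inputs, weights, bias):
    # Step 1: multiply each input by its own weight.
    weighted = [w * x for w, x in zip(weights, inputs)]
    # Step 2: add them all together, plus the bias.
    total = sum(weighted) + bias
    # Step 3: squish the total into a number between 0 and 1.
    return sigmoid(total)

# Example values: piano done (1), homework not done (0).
output = neuron(inputs=[1, 0], weights=[1.0, 1.0], bias=-0.5)
print(round(output, 2))  # 0.62
```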

What does sigmoid actually look like?

Sigmoid is just a curve that turns "any number" into "a number between 0 and 1". Feed it a very negative number and out comes nearly 0. Feed it a very positive number and out comes nearly 1. Feed it exactly 0 and out comes exactly 0.5 — the perfectly-unsure middle.

That middle point clears up something from chapter 1. There, the neuron's question was "is the total bigger than 0?" In this chapter, the decision is "is the output bigger than 0.5?" Those are the same question, because sigmoid turns a total of exactly 0 into an output of exactly 0.5 — a total above 0 always lands above 0.5, and a total below 0 always lands below it. Two costumes, one rule.
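You can check the "two costumes, one rule" claim yourself. The sketch below feeds a handful of example totals through sigmoid and asks both questions side by side; the list of totals is arbitrary.

```python
import math

def sigmoid(total):
    return 1 / (1 + math.exp(-total))

totals = [-6, -1, -0.1, 0, 0.1, 1, 6]   # example totals, chosen arbitrarily

for total in totals:
    output = sigmoid(total)
    print(f"total {total:>5}  ->  output {output:.3f}   "
          f"total > 0? {total > 0}   output > 0.5? {output > 0.5}")

# True when the two questions agree for every total in the list.
same_answer = all((t > 0) == (sigmoid(t) > 0.5) for t in totals)
print(same_answer)  # True
```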

For the playground below, we'll use two simple yes/no inputs: piano done? and homework done? A value of 1 means yes, and 0 means no. The neuron is pretending to answer: "should screen time be allowed?" Chapter 4 will teach it that rule for real; here we are just moving the parts by hand so you can see what a neuron does.

So what does the bias actually do? The weights decide how much each input matters; the bias is the neuron's built-in lean: how eager it is to fire before any input arrives. Picture it as the neuron's mood. A big positive bias is an optimist: it leans YES and needs the inputs to talk it out of firing. A negative bias is a skeptic: it leans NO and needs the inputs to push it over the line. In fact, if every input is 0, the bias is the only thing left, so it alone decides which way the neuron leans. That's why setting the bias to -0.5 below (with both weights at +1) turns the neuron into an OR gate: it starts off leaning NO, and a single input switching on is enough to drag the output past the 0.5 line.
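To see the "mood" idea in numbers, here is a sketch that switches every input off and tries three biases. The bias values (-2, 0, and 2) and the weights are examples chosen for illustration.

```python
import math

def sigmoid(total):
    return 1 / (1 + math.exp(-total))

# With every input at 0, the weights contribute nothing, whatever they are.
inputs = [0, 0]
weights = [1.0, 1.0]  # example values

lean = {}
for bias in [-2.0, 0.0, 2.0]:   # skeptic, neutral, optimist (example values)
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    lean[bias] = sigmoid(total)
    print(f"bias {bias:+.1f}  ->  output {lean[bias]:.2f}")
```

The skeptic lands below 0.5, the neutral neuron sits at exactly 0.5, and the optimist lands above it, all before a single input has switched on.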

One more thing the sliders below let you do: make a weight negative. A positive weight means "this input is evidence for YES." A negative weight means "this input argues against YES" — when that input switches on, it drags the total down, toward NO. The screen-time rule doesn't need any negative weights, but keep the idea in your pocket: it becomes the secret ingredient when we hit the XOR puzzle in chapter 7.
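Here is a small sketch of a negative weight at work. The numbers are made up for the example: input 1 argues for YES with a weight of 1, and input 2 argues against it with a weight of -1.

```python
import math

def sigmoid(total):
    return 1 / (1 + math.exp(-total))

def neuron(inputs, weights, bias):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

# Example values: input 1 argues for YES, input 2 argues against it.
weights = [1.0, -1.0]
bias = 0.5

without_input2 = neuron([1, 0], weights, bias)  # total = 1.5
with_input2 = neuron([1, 1], weights, bias)     # total = 0.5
print(f"{without_input2:.2f} -> {with_input2:.2f}")
```

Switching input 2 on lowers the output, which is what "argues against YES" means.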

Why sigmoid and not some other shape? Hold that question — we'll meet sigmoid's modern replacement (ReLU) in a few chapters and show why the choice of "squisher" actually matters. If you want the full zoo right now, Wikipedia's activation function page lists every common one.

▸ try it

Play with a neuron

Drag the sliders. The output bar reacts live. Try setting both weights to +1 and the bias to -0.5 — now the neuron acts like an "OR" gate. Try setting them all to 0 — the neuron gives up and just outputs 0.5.

input 1 (piano done): 1.00
input 2 (homework done): 0.00
weight 1: 1.00
weight 2: 1.00
bias: -0.50
output: 0.62
weighted sum: 0.50, decision: YES

Challenge

Make the neuron behave like AND instead of OR. Hint: both inputs should need to be high before the output crosses 0.5.

4. Training the neuron

Let's teach our one neuron to decide "can I have screen time?"

Rule: screen time is allowed if you practiced piano OR you finished your homework. Here are the examples we'll feed it:

piano done?	homework done?	screen time allowed?
0 (no)	0 (no)	0 (no)
1 (yes)	0 (no)	1 (yes)
0 (no)	1 (yes)	1 (yes)
1 (yes)	1 (yes)	1 (yes)

The training loop is the same idea every time:

Start with random weights (the neuron has no clue)
Run all 4 examples through the neuron, see what it predicts
For each one, figure out how wrong it was
Nudge every weight a tiny bit in whichever direction would have made the answer slightly more right
Repeat 10,000 times

Watch one nudge happen, with real numbers

"Nudge every weight in the right direction" sounds like magic until you watch a single nudge happen. So let's do one, slowly, by hand.

Say the neuron just started, with random numbers: weight 1 (piano) = −0.5, weight 2 (homework) = 0.2, bias = −0.1. We show it the example piano = 1, homework = 0, right answer = YES:

total      = (−0.5 × 1) + (0.2 × 0) + (−0.1)  =  −0.6
prediction = sigmoid(−0.6)                    ≈   0.35
truth      = 1
error      = 1 − 0.35                         =  +0.65   (too low!)

The prediction is too low, so we want the total to come out bigger next time. Go through the knobs one at a time:

Weight 1 (piano): piano was ON (1), so this weight took part in the total. Prediction too low → nudge it UP. (It was −0.5, dragging the answer down — exactly the wrong job for it. The nudge starts fixing that.)
Weight 2 (homework): homework was OFF (0), so this weight got multiplied by zero — it had no effect at all on this prediction. It gets no nudge. You can't blame a weight that wasn't involved.
Bias: the bias is always involved (it gets added every single time, like an input that's stuck at 1). Prediction too low → nudge it UP.

And if the prediction had been too high? Same logic, flipped: every weight whose input was on gets nudged DOWN. That's the whole secret. The recipe for one weight is just:

nudge = small step × error × that weight's input

The error sets the direction (too low → up, too high → down). The input decides who participates (an input of 0 wipes its weight's nudge out). And the small step keeps each change tiny, so no single example can yank the network around. The real scripts in this book multiply in one extra factor — how "squishy" sigmoid is at the current value, which chapter 12 explains — but the direction logic is exactly what you just did.
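The hand-worked nudge can be replayed in Python. This sketch uses the simple recipe exactly as written above (without the extra "squishy" factor), and the small step of 0.1 is an assumed value, since the text does not fix one.

```python
import math

def sigmoid(total):
    return 1 / (1 + math.exp(-total))

# Starting numbers from the worked example above.
w1, w2, bias = -0.5, 0.2, -0.1
piano, homework, truth = 1, 0, 1

small_step = 0.1  # assumed value, only for illustration

total = w1 * piano + w2 * homework + bias   # -0.6
prediction = sigmoid(total)                 # about 0.35
error = truth - prediction                  # about +0.65 (too low)

# nudge = small step x error x that weight's input
new_w1 = w1 + small_step * error * piano       # input was 1: nudged up
new_w2 = w2 + small_step * error * homework    # input was 0: no nudge
new_bias = bias + small_step * error * 1       # bias acts like an input stuck at 1

print(f"w1:   {w1:+.3f} -> {new_w1:+.3f}")
print(f"w2:   {w2:+.3f} -> {new_w2:+.3f}")
print(f"bias: {bias:+.3f} -> {new_bias:+.3f}")
```

Weight 1 and the bias both move up a little, and weight 2 stays exactly where it was because its input was 0.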

▸ try it

Do one nudge yourself

This is the same neuron, starting with the same numbers as above. Pick a training example, read the error, then press "nudge" and watch each weight move — or refuse to move — for exactly the reasons listed. Keep nudging (and switch examples between nudges, which is what real training does) and watch all four predictions crawl toward the truth. Keep an eye on the dark panel below the buttons too: that's the model itself — the actual data structure the computer stores — being rewritten live.

piano=0, homework=0 → NO

piano=1, homework=0 → YES

piano=0, homework=1 → YES

piano=1, homework=1 → YES

What the neuron says right now

The nudge this example asks for

The model itself, as the computer stores it

This is the entire model — three numbers in a data structure. Every nudge rewrites it in place; freshly changed numbers light up. When someone says "I downloaded a model," they mean a file holding a much, much longer list like this.

nudges so far: 0 w1: -0.50 w2: 0.20 bias: -0.10

The trainer below does exactly what you just did — it just applies the nudge for all four examples at once, then repeats thousands of times.
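As a rough sketch of that idea in plain Python: the loop below adds up the simple nudge from all four examples, applies it, and repeats. It is not the book's neuro1.py. It keeps the three numbers from the hand-worked example, uses the simple recipe without the extra sigmoid factor, and the step size (0.5) and number of rounds (2000) are assumptions.

```python
import math

def sigmoid(total):
    return 1 / (1 + math.exp(-total))

# The four screen-time examples: (piano, homework) -> right answer.
examples = [((0, 0), 0), ((1, 0), 1), ((0, 1), 1), ((1, 1), 1)]

w1, w2, bias = -0.5, 0.2, -0.1   # same starting numbers as the hand-worked nudge
small_step = 0.5                 # assumed value
rounds = 2000                    # assumed value

for _ in range(rounds):
    nudge_w1 = nudge_w2 = nudge_bias = 0.0
    for (piano, homework), truth in examples:
        prediction = sigmoid(w1 * piano + w2 * homework + bias)
        error = truth - prediction
        nudge_w1 += small_step * error * piano
        nudge_w2 += small_step * error * homework
        nudge_bias += small_step * error      # the bias input is always 1
    # Apply the nudges from all four examples at once.
    w1, w2, bias = w1 + nudge_w1, w2 + nudge_w2, bias + nudge_bias

predictions = [sigmoid(w1 * p + w2 * h + bias) for (p, h), _ in examples]
for ((p, h), truth), pred in zip(examples, predictions):
    print(f"piano={p} homework={h}  prediction={pred:.3f}  truth={truth}")
```

Only `w1`, `w2`, and `bias` are ever rewritten; the `examples` list is read on every round and never changed.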

▸ try it

Train the neuron yourself

Press "step once" to do a single round of training. Or "train 1000" to fast-forward. Watch the predictions on each of the four examples crawl toward the right answer.

The model being rewritten

Training never touches the four examples — it only rewrites these three stored numbers, over and over, until they encode the rule.

epoch: 0 w1: ? w2: ? bias: ?

After a few thousand nudges, the neuron has learned the rule. Nobody told it "if piano OR homework is done then yes". It figured that out by being slightly less wrong over and over.

That's the trick. Everything from here on is "do this same thing, but much bigger." If you want a parallel walkthrough where every step is animated, the second video in 3Blue1Brown's neural networks series covers the same idea, gradient descent, with pictures (it uses a whole network rather than a single neuron).

▸ try it

Draw the line yourself

The neuron's whole job is to draw one straight line that puts YES dots on one side and NO dots on the other. Drag the sliders. The dashed line is the neuron's current decision boundary. The number tells you how many of the 4 dots are on the right side.

Try to get all 4 correct. Hint: weight 1 = 1.0, weight 2 = 1.0, bias = -0.5 is one good answer.
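The score the widget shows can be sketched as a short function. `count_correct` is a made-up name, and the second set of slider values is just an example of a line that does worse.

```python
# The four dots for the screen-time rule, with their labels.
dots = [((0, 0), "NO"), ((1, 0), "YES"), ((0, 1), "YES"), ((1, 1), "YES")]

def count_correct(weight1, weight2, bias):
    """Count how many dots land on the correct side of the neuron's line."""
    correct = 0
    for (piano, homework), label in dots:
        total = weight1 * piano + weight2 * homework + bias
        guess = "YES" if total > 0 else "NO"
        if guess == label:
            correct += 1
    return correct

print(count_correct(1.0, 1.0, -0.5))  # 4: the hint's line gets every dot right
print(count_correct(1.0, 0.0, -0.5))  # 3: ignoring homework misses the dot (0, 1)
```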

▸ run real python

Run the actual script — in your browser

This is neuro1.py from the folder, real Python with real numpy. First click downloads Python (takes a few seconds). Edit anything you like, then click run.

neuro1.py Python loads on first run

import numpy as np

# Step 1: single-layer perceptron, OR rule.
# Linearly separable -> one weighted sum + sigmoid is enough.
#
# Rule we're teaching: "Can I have screen time?"
#   Yes if I practiced piano OR I finished my homework.
#
# x: shape (4, 2). 4 samples, 2 binary features.
#   columns = [practiced_piano, finished_homework]
# y: shape (4, 1). 1 = screen time allowed, 0 = no screen time.
x = np.array([[0,0],[1,0],[0,1],[1,1]])
y = np.array([[0],[1],[1],[1]])

def sigmoid(z):
    # z is the raw weighted sum (any number); the result is between 0 and 1.
    return 1/(1 + np.exp(-z))

def sigmoid_derivative(s):
    # NOTE: this is sigmoid's derivative *evaluated at the post-sigmoid value*.
    # If `s = sigmoid(z)` then sigmoid'(z) = s * (1 - s). Below we always call
    # `sigmoid_derivative(outputs)` where outputs is already sigmoid'd, so the
    # math works out. If you ever call this on a raw pre-activation value you
    # will get the wrong answer.
    return s*(1-s)

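np.random.seed(1)
# Two weights, one per input, each drawn at random from [-1, 1).
# NOTE: this script has no bias term, so the input [0, 0] always gives a
# weighted sum of 0 and a prediction of exactly 0.5, however long we train.
# The decision test at the bottom (pred > 0.5) still reports NO for that row.
weights = 2 * np.random.random((2, 1)) - 1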

# NOTE: there's no explicit learning rate on the update below — we're
# effectively using lr = 1. The toy OR data is friendly enough that this
# works. Starting in neuro3.py we write the learning rate out explicitly
# (lr = 0.5) so you can see what it does.
for epoch in range(10000):
    input_layer = x
    weighted_sum = np.dot(input_layer, weights)
    outputs = sigmoid(weighted_sum)

    error = y - outputs
    adjustments = error * sigmoid_derivative(outputs)
    weights += np.dot(input_layer.T, adjustments)

print('Trained network — "Can I have screen time?"')
print('  rule learned: screen time allowed if piano OR homework is done')
print()
print(f'  {"piano":>7}  {"homework":>8}  {"prediction":>10}  decision  truth')
for (piano, homework), pred, truth in zip(x, outputs, y):
    decision = "YES" if pred[0] > 0.5 else "NO "
    want     = "YES" if truth[0] > 0.5 else "NO "
    print(f'  {piano:>7}  {homework:>8}  {pred[0]:>10.3f}  {decision:>8}  {want}')

(click "run" to execute)

Challenge

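Edit the code so screen time is allowed only when both piano and homework are done. Then predict what the weights should do. Heads up: neuro1.py as written has no bias, and AND cannot be learned without one (each input alone must give NO, which needs both weights at or below 0, but then both inputs together can never give YES). So you will also need to add a bias. One way: give x a third column that is always 1 and make weights three numbers long.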

5. What training actually changes

Before we make the network harder, it is worth being very clear about what training is allowed to change.

Training does not change the examples. The input table stays the same. The right answers stay the same. If the row says [1, 0] → YES, training is not allowed to rewrite that row.

Training also does not grow new neurons in these tiny programs. If the code says there is one neuron, training has to work with one neuron. If the code says there are 4 hidden neurons, training has to work with those 4. The network does not secretly add a fifth neuron because it gets stuck.

The only things training changes are the weights and biases — the adjustable numbers inside the network. You can literally watch this: the dark "model" panels under chapter 4's trainers show the stored numbers being rewritten — and nothing else — while training runs.

Thing	Who changes it?
Input examples, like `[1, 0]`	The human writing the dataset
Correct answers, like `YES`	The human writing the dataset
Number of neurons	The human designing the network
Weights and biases	Training

So when a later exercise says to change hidden = 4 to hidden = 2, that is you changing the network design before training starts. After that, training tries to find weights and biases that make that chosen design work.

Short version: the human chooses the shape of the network. Training fills that shape with useful numbers.

Challenge

If a one-neuron network cannot solve a puzzle, what is training allowed to try changing? What is it not allowed to change?

6. Tables, dots, and lines

The next chapter is about XOR, which is where lots of people first get confused. The trick is to slow down and see how a truth table turns into dots on graph paper.

When there are two yes/no inputs, there are only four possible input pairs:

input 1	input 2	dot on graph paper
0	0	`(0, 0)`
1	0	`(1, 0)`
0	1	`(0, 1)`
1	1	`(1, 1)`

A rule like OR, AND, or XOR just labels each of those four dots as YES or NO.

A single neuron can only ask one question: "is this dot on this side of my line?" That works for OR and AND because one line can separate their YES dots from their NO dots.

But XOR needs a different kind of question: "is this dot in the top-left OR the bottom-right?" One straight line cannot ask an OR-shaped question about two separate corners. That is why XOR is the puzzle that forces us to add more neurons.

The phrase to keep: a rule is linearly separable when one straight line can split YES examples from NO examples. OR is linearly separable. AND is linearly separable. XOR is not.
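One way to get a feel for this is to let the computer try lots of lines. The sketch below tries every combination of two weights and a bias from a small grid and reports the best score it finds for each rule. The grid is an arbitrary choice, and a search like this is a demonstration, not a proof.

```python
from itertools import product

dots = [(0, 0), (1, 0), (0, 1), (1, 1)]
rules = {
    "OR":  [0, 1, 1, 1],
    "AND": [0, 0, 0, 1],
    "XOR": [0, 1, 1, 0],
}

# Candidate values for weight 1, weight 2, and the bias: -2.0, -1.5, ..., 2.0.
grid = [step / 2 for step in range(-4, 5)]

def best_score(labels):
    """Best number of correctly labeled dots over every line in the grid."""
    best = 0
    for w1, w2, bias in product(grid, repeat=3):
        correct = sum(
            (w1 * a + w2 * b + bias > 0) == bool(label)
            for (a, b), label in zip(dots, labels)
        )
        best = max(best, correct)
    return best

scores = {name: best_score(labels) for name, labels in rules.items()}
print(scores)  # {'OR': 4, 'AND': 4, 'XOR': 3}
```

OR and AND each have at least one line that gets all four dots right. For XOR, the best any line in the grid manages is three.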

Challenge

Cover the labels on the three mini-graphs. Which one must be XOR? Hint: look for the two YES dots sitting on opposite corners.

7. When one neuron isn't enough

Let's try a harder rule. Imagine a secret door with two buttons:

The door opens if EITHER the left button OR the right button is pressed. But if both buttons are pressed at the same time, the door stays shut.

This rule is called "exactly one." Here's the data:

left button?	right button?	door opens?
0	0	0 (no buttons)
1	0	1 (left only)
0	1	1 (right only)
1	1	0 (both buttons)

Try to teach a single neuron this. It will fail. Forever. Doesn't matter how many times you train it.

Why? A single neuron is like drawing one straight line on a piece of graph paper to separate the "yes" answers from the "no" answers. Look at our problem on paper:

Why only one line?

A single neuron always does the same kind of math: it takes the inputs, multiplies them by weights, adds a bias, and then asks, "is the answer big enough?"

For two inputs, that question looks like this:

weight1 × left + weight2 × right + bias > 0

The place where the answer changes from NO to YES is always a straight line. The neuron can move the line around, tilt it, or flip which side means YES, but it still only gets one straight line.
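To see why the changeover is a straight line, you can solve the formula for the spot where the total is exactly 0. The sketch below does that for example weights (1, 1) and bias (-0.5); the helper names are made up, and it assumes weight 2 is not 0.

```python
weight1, weight2, bias = 1.0, 1.0, -0.5   # example values

def total(left, right):
    return weight1 * left + weight2 * right + bias

def boundary_right(left):
    """The value of `right` where the total is exactly 0 (needs weight2 != 0)."""
    return (-bias - weight1 * left) / weight2

for left in [0.0, 0.25, 0.5]:
    right = boundary_right(left)
    print(f"left = {left:.2f}, right = {right:.2f}, total = {total(left, right):.2f}")
```

Every point it prints sits on the same straight line, right = 0.5 - left. Dots above that line get YES and dots below it get NO.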

So linearly separable just means: "Can one straight line split the YES examples from the NO examples?"

OR can. AND can. XOR can't, because the YES answers are on opposite corners. No matter where you put one line, at least one corner ends up on the wrong side.

Can you draw ONE straight line that puts both YESes on one side and both NOs on the other? Try it on paper. You can't. A single neuron can only draw one line, so it has no shape that fits this problem. If you remember one piece of jargon from this book, make it linearly separable — it'll come up everywhere.

▸ try it

Watch a single neuron fail

Same trainer as before, new data. Click "train 1,000 steps" and watch — the predictions get stuck at 0.5 forever. The neuron literally gives up and guesses the average.

The model that can't win

Same three stored numbers as the screen-time model. Watch them wander and then sink toward zero — that's the neuron giving up and settling for "answer 0.5 to everything." No values of these three numbers can solve XOR.

epoch: 0 avg error: ?

This isn't a small detail. This is one of the most famous moments in the whole history of AI.

In 1969, two MIT researchers, Marvin Minsky and Seymour Papert, wrote a book called Perceptrons. One of its best-known results was a proof, in math, that a single-layer neural network (called a perceptron at the time) cannot solve XOR. The problem you just played with for 5 minutes.

This landed badly. People had been hyping perceptrons as the future of artificial intelligence, and the book made the whole field look like a dead end. Funding dried up. Most researchers moved to other things. This period is often called the first AI winter: about 15 years where neural networks were considered a curiosity, not a serious tool.

Here's the twist: Minsky and Papert knew a multi-layer network could probably solve XOR. The actual problem was that nobody had a good way to train multi-layer networks yet. That training algorithm (called backpropagation; we'll explain it in chapter 12) didn't catch on until 1986, when Rumelhart, Hinton, and Williams published a paper showing how to do it. That paper is a big part of why everything you read about today (ChatGPT, self-driving cars, image generators) exists. Almost two decades of progress had been stalled by one missing algorithm.

▸ try it

Try to draw a line — you can't

Same line-drawing widget as before, but with the secret-door labels. Drag the sliders. The YES dots are on opposite corners now. The best you can do is 3 out of 4 correct. Try as long as you want — no straight line will get all 4 right.

▸ run real python

Watch a real single-layer perceptron give up

This is neuro2.py. Same code as neuro1.py, just XOR data. The output will plateau at MSE ≈ 0.25 and every prediction ends up ≈ 0.5. (MSE is mean squared error: square each example's error, then average them. It's the script's single score for "how wrong am I overall?" Training is supposed to shrink it — and here it refuses to shrink.)

neuro2.py Python loads on first run

import numpy as np

# Step 2: same single-layer perceptron as neuro1.py, but XOR rule.
# This is INTENDED TO FAIL. The point is to see the failure mode
# and understand why a hidden layer is needed (see neuro3.py).
#
# Rule we're trying to teach: "Secret door" (XOR).
#   The door opens if the left button OR the right button is pressed,
#   but NOT if both are pressed at the same time.
#
# x: shape (4, 2). 4 samples, 2 binary features.
#   columns = [left_button, right_button]
# y: shape (4, 1). 1 = door opens, 0 = door stays shut.
#   Pattern: left_button XOR right_button.
# Diagonal corners share a class -> no single straight line separates
# them. A 1-layer net can only draw a single straight line, so it
# CANNOT solve this. Outputs collapse to ~0.5 for every input and
# error never shrinks.
x = np.array([[0,0],[1,0],[0,1],[1,1]])
y = np.array([[0],[1],[1],[0]])

def sigmoid(z):
    return 1/(1 + np.exp(-z))

def sigmoid_derivative(s):
    # s must already be a sigmoid output: sigmoid'(z) = s * (1 - s).
    return s*(1-s)

np.random.seed(1)
weights = 2 * np.random.random((2, 1)) - 1

# Same effective learning rate (lr = 1) as neuro1.py. From neuro3.py
# onward we write the learning rate out explicitly.
for epoch in range(10000):
    input_layer = x
    weighted_sum = np.dot(input_layer, weights)
    outputs = sigmoid(weighted_sum)

    error = y - outputs
    adjustments = error * sigmoid_derivative(outputs)
    weights += np.dot(input_layer.T, adjustments)

    if epoch % 2000 == 0:
        mse = float(np.mean(error**2))
        print(f"epoch {epoch}: mse {mse:.4f}")

print()
print('After 10000 epochs — "Secret door (XOR)":')
print(f'  {"left":>7}  {"right":>7}  {"prediction":>10}  guess  truth  verdict')
for (left, right), pred, truth in zip(x, outputs, y):
    want    = "YES" if truth[0] > 0.5 else "NO "
    guess   = "YES" if pred[0]  > 0.5 else "NO "
    verdict = "OK   " if (pred[0] > 0.5) == (truth[0] > 0.5) else "WRONG"
    print(f'  {left:>7}  {right:>7}  {pred[0]:>10.3f}  {guess:>5}  {want:>5}  {verdict}')

print()
print("  Output is ~0.5 for every input — single-layer perceptron")
print("  cannot solve XOR. This is the failure that motivates hidden")
print("  layers. See neuro3.py for the fix.")

(click "run" to execute)
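The MSE ≈ 0.25 plateau can be checked by hand. This sketch assumes the neuron has fully given up and answers exactly 0.5 to everything, then computes the score the way the text describes it: square each error, then average.

```python
truths = [0, 1, 1, 0]                 # the XOR answers
predictions = [0.5, 0.5, 0.5, 0.5]    # a neuron that answers 0.5 to everything

squared_errors = [(t - p) ** 2 for t, p in zip(truths, predictions)]
mse = sum(squared_errors) / len(squared_errors)

print(squared_errors)  # [0.25, 0.25, 0.25, 0.25]
print(mse)             # 0.25
```

Every example is off by exactly one half, and one half squared is 0.25, so the average is 0.25 too.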

Challenge

Try to beat XOR with one line anyway. Change the sliders, then explain why your best answer still misclassifies at least one corner.

8. Hidden neurons fix it

The fix turned out to be: use MORE neurons, in layers.

Imagine you have a few neurons sitting in the middle of the network. Each one draws its own line on the graph paper. Now an output neuron looks at what those middle neurons said and combines their answers. Two lines, combined cleverly, can carve out the diagonal pattern.

Here is the trick on the same graph paper as before. One line couldn't split the XOR corners — but two lines together can fence off exactly the diagonal stripe where the YES dots live. Each hidden neuron draws one of the lines:

Hidden neuron 1 asks "is at least one button pressed?" — everything above the blue line. Hidden neuron 2 asks "is it NOT both buttons?" — everything below the green line. The output neuron then asks the easiest question of all: "did BOTH hidden neurons say yes?" The only dots that pass both tests are the two YES corners, sitting in the stripe between the lines. Three easy straight-line questions, chained together, make a shape that no single line could ever make.

We call the middle neurons hidden because we don't tell them what they should mean. They figure out their own job during training. After training, you might peek inside and find:

Hidden neuron 1 learned to mean "left button AND NOT right button"
Hidden neuron 2 learned to mean "right button AND NOT left button"
The output neuron learned: "fire if either hidden neuron fires"

Nobody told the network those rules. It discovered them by being a tiny bit less wrong over and over. (Notice that's a different two-line solution from the stripe picture above — there, the lines fenced off the middle; here, each hidden neuron fenced off one YES corner and the output ORs them together. There are several ways to carve out those corners, and training can land on any of them.)
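Both two-line solutions can be wired up by hand to check that they really produce XOR. This is a sketch with assumptions: the weights and biases are hand-picked rather than learned, and each neuron gives a hard yes/no (total above 0) instead of a sigmoid output, to keep the arithmetic easy to follow.

```python
def fires(inputs, weights, bias):
    """A yes/no neuron: 1 if the weighted total is above 0, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def stripe_network(left, right):
    h1 = fires([left, right], [1, 1], -0.5)     # at least one button pressed?
    h2 = fires([left, right], [-1, -1], 1.5)    # NOT both buttons?
    return fires([h1, h2], [1, 1], -1.5)        # did BOTH hidden neurons say yes?

def corner_network(left, right):
    h1 = fires([left, right], [1, -1], -0.5)    # left AND NOT right
    h2 = fires([left, right], [-1, 1], -0.5)    # right AND NOT left
    return fires([h1, h2], [1, 1], -0.5)        # did EITHER hidden neuron fire?

for left, right in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(left, right, stripe_network(left, right), corner_network(left, right))
```

The two networks use different hidden questions but open the door for the same two rows. Notice the negative weights doing the "NOT" jobs.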

▸ try it

Now watch it succeed

Same XOR data as last time. New network: 4 hidden neurons + 1 output neuron. Click "train" and watch the predictions actually converge.

The bigger model

Solving XOR costs 17 stored numbers instead of 3: 8 input→hidden weights, 4 hidden biases, 4 hidden→output weights, 1 output bias. More neurons just means a longer list — it's still nothing but numbers being rewritten.
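The count of 17 falls straight out of the network's shape. The variable names in this sketch are made up for illustration.

```python
inputs, hidden, outputs = 2, 4, 1

input_to_hidden_weights = inputs * hidden     # 8: one wire from each input to each hidden neuron
hidden_biases = hidden                        # 4: one per hidden neuron
hidden_to_output_weights = hidden * outputs   # 4: one wire from each hidden neuron to the output
output_biases = outputs                       # 1

total = (input_to_hidden_weights + hidden_biases
         + hidden_to_output_weights + output_biases)
print(total)  # 17
```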

epoch: 0

This is the central trick of deep learning. Layers of neurons build up more and more sophisticated ideas without anyone programming those ideas in directly.

The piece of math that makes this work: the chain rule

You might be wondering: how does the network know which way to nudge the hidden-layer weights? With one neuron it was kind of obvious — there's only one output, you can see which direction makes it more right. But the hidden neurons are buried in the middle. They don't talk to the output directly. How do you blame them for a mistake?

The trick is a piece of calculus called the chain rule. The kid-friendly version: imagine the error at the very end of the network, and then ask "how did each layer contribute to that error?" You start at the output, where you can see the error directly. You compute how the output layer's weights affected the error — same as in chapter 4. Then you push that blame backward through the network: each hidden neuron gets credit (or blame) proportional to how strongly it was connected to the output neurons that mattered. Then you can nudge the hidden weights too.
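Here is a sketch of that blame-passing on the smallest chain possible: one input, one hidden neuron, one output neuron, and no biases. All the numbers are example values. It works out the hidden weight's slope two ways: with the chain rule, and by the slow method of wiggling the weight a tiny bit and measuring how the error changes.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def forward(x, w_hidden, w_out):
    h = sigmoid(w_hidden * x)   # hidden neuron
    o = sigmoid(w_out * h)      # output neuron
    return h, o

def error_score(x, y, w_hidden, w_out):
    _, o = forward(x, w_hidden, w_out)
    return 0.5 * (y - o) ** 2

x, y = 1.0, 1.0                 # one example: input 1, right answer 1
w_hidden, w_out = 0.3, -0.4     # example starting weights

# Blame passed backward with the chain rule.
h, o = forward(x, w_hidden, w_out)
blame_out = (o - y) * o * (1 - o)                # blame at the output neuron
blame_hidden = blame_out * w_out * h * (1 - h)   # passed back along the wire w_out
slope_chain = blame_hidden * x

# The slow way: wiggle the hidden weight a tiny bit and measure the change.
tiny = 1e-6
slope_wiggle = (error_score(x, y, w_hidden + tiny, w_out)
                - error_score(x, y, w_hidden - tiny, w_out)) / (2 * tiny)

print(f"chain rule: {slope_chain:.6f}")
print(f"wiggle:     {slope_wiggle:.6f}")
```

The two numbers agree. The chain rule gets there with one backward pass, while the wiggle method would need fresh forward passes for every single weight in the network.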

This whole "compute the error, then pass blame backward layer by layer" idea is called backpropagation. It's the algorithm that came out of that 1986 paper we mentioned in chapter 7 — the one that ended the first AI winter. The interactive above is running backprop on every "step + animate" click. You're literally watching the algorithm that unlocked modern AI.

If you want to see the chain rule drawn out beautifully, Chris Olah wrote "Calculus on Computational Graphs: Backpropagation" — the clearest gentle-but-real explainer of backprop on the open web.

▸ try it

Watch the wires light up

Same XOR network, but now you can SEE the neurons. Pick which example to feed in. The lines are the weights — thick blue = strong positive, thick red = strong negative, thin = weak. Click "step" to train for one round and watch the forward pass animate.

left=0, right=0 → want NO

left=1, right=0 → want YES

left=0, right=1 → want YES

left=1, right=1 → want NO

prediction: ? want: ? epoch: 0

Output math

How to read it: blue wires are positive weights and red wires are negative weights. Red does not mean wrong. A hidden neuron can still light up because of its bias and all incoming weights together. Then the right-side wires act like votes: active hidden neurons with blue output wires push the final answer toward YES; active hidden neurons with red output wires push it toward NO. The output is the combined vote from all hidden neurons. (The "YES/NO vote" picture works here because there is one output neuron deciding yes-or-no. When we add more output classes in chapter 9, the votes turn into a list of scores instead.)

▸ run real python

Run a real 2-layer network on XOR

This is neuro3.py. Same XOR data as before, but now with a hidden layer of 4 neurons. The predictions will converge to ~0.99 for YES and ~0.01 for NO.

Heads up: the first time you click "run" on any Python block in this book, your browser downloads Python itself (about 10 megabytes). That takes 10–20 seconds and requires an internet connection. After the first run on this page, every later run is instant.

neuro3.py Python loads on first run

import numpy as np

# Step 3: 2-layer net (input -> hidden -> output). Same XOR rule as
# neuro2.py, but the hidden layer + non-linearity solve what the
# single-layer perceptron couldn't.
#
# Rule: "Secret door" (XOR).
#   The door opens if the left button OR the right button is pressed,
#   but NOT if both are pressed at the same time.
#
# x: shape (4, 2). columns = [left_button, right_button]
# y: shape (4, 1). 1 = door opens, 0 = shut. Pattern is XOR.
x = np.array([[0,0],[1,0],[0,1],[1,1]])
y = np.array([[0],[1],[1],[0]])

def sigmoid(z):
    return 1/(1+np.exp(-z))

def dsigmoid(s):
    # s must already be a sigmoid output: sigmoid'(z) = s * (1 - s).
    return s*(1-s)

np.random.seed(1)
hidden = 4   # try 4, then 2, then 1
w1 = 2*np.random.random((2, hidden)) - 1   # 2 inputs -> hidden
b1 = np.zeros((1, hidden))
w2 = 2*np.random.random((hidden, 1)) - 1   # hidden -> 1 out
b2 = np.zeros((1,1))
lr = 0.5

for epoch in range(20000):
    # forward
    h = sigmoid(x @ w1 + b1)
    o = sigmoid(h @ w2 + b2)

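    # backward: start at the output, then pass the blame back to the hidden layer
    err   = y - o                          # how wrong each prediction was
    d_o   = err * dsigmoid(o)              # blame at the output neuron
    d_h   = (d_o @ w2.T) * dsigmoid(h)     # blame shared out to each hidden neuron

    # nudge every weight and bias: learning rate x blame x what fed into it
    w2 += lr * h.T @ d_o
    b2 += lr * d_o.sum(axis=0, keepdims=True)
    w1 += lr * x.T @ d_h
    b1 += lr * d_h.sum(axis=0, keepdims=True)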

print('Trained network — "Secret door (XOR)":')
print(f'  {"left":>7}  {"right":>7}  {"prediction":>10}  guess  truth  verdict')
for (left, right), pred, truth in zip(x, o, y):
    want    = "YES" if truth[0] > 0.5 else "NO "
    guess   = "YES" if pred[0]  > 0.5 else "NO "
    verdict = "OK   " if (pred[0] > 0.5) == (truth[0] > 0.5) else "WRONG"
    print(f'  {left:>7}  {right:>7}  {pred[0]:>10.3f}  {guess:>5}  {want:>5}  {verdict}')

(click "run" to execute)

Challenge

In the Python code, change the hidden layer from 4 neurons to 2. Does it still learn XOR? What happens if you try 1?

9. Picking from a list

So far the neuron has just answered yes or no. What if the question is "which snack should I grab?" with several options?

hungry?	after school?	weekend?	snack
0	0	0	fruit
1	0	0	fruit
0	1	0	chips
1	1	0	chips
0	0	1	cookie
1	0	1	cookie
0	1	1	cookie
1	1	1	cookie

Three options now: fruit, chips, cookie. The fix is small: have one output neuron per option. Each outputs a score for its option. A function called softmax turns the scores into probabilities that always add up to 1.

The math is small enough you could do it on paper. For three scores z₁, z₂, z₃, softmax is:

softmax_i = e^(z_i) / (e^(z_1) + e^(z_2) + e^(z_3))

(Haven't met e yet? It's just a famous fixed number, about 2.718 — math's favorite constant after π. And e^z means "e raised to the power z": e^2 = e × e ≈ 7.4, e^0 = 1, and e to a negative power is a small positive fraction. Only two things matter here: e^z is always positive, and it grows fast as z grows.)

The e^ part makes everything positive and exaggerates the differences (a bigger score becomes way more likely than a slightly smaller one). The dividing makes them add up to 1, so they're proper probabilities. Quick example:

softmax([2.0, 1.0, 0.1])  →  [0.66, 0.24, 0.10]
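If you'd like to check that quick example yourself, here is a minimal sketch in plain Python. The function name softmax_by_hand is made up for this illustration; it just follows the formula above one step at a time.

```python
import math

def softmax_by_hand(scores):
    # Step 1: raise e to each score (always positive, grows fast).
    powers = [math.exp(z) for z in scores]
    # Step 2: divide each by the total so the results add up to 1.
    total = sum(powers)
    return [p / total for p in powers]

probs = softmax_by_hand([2.0, 1.0, 0.1])
print([round(p, 2) for p in probs])  # [0.66, 0.24, 0.1]
print(round(sum(probs), 6))          # 1.0
```

The scripts in this book use a NumPy version that also subtracts the biggest score first. That changes nothing in the answer; it only keeps e^z from getting too large for the computer.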

The model might output [0.10, 0.85, 0.05] meaning "10% chance fruit, 85% chance chips, 5% chance cookie". You pick the highest one (chips).

The labels we feed in for training look like [1,0,0] for fruit, [0,1,0] for chips, [0,0,1] for cookie. Exactly one slot is 1 (the right answer) and the rest are 0. That's called one-hot encoding — we'll meet it again in chapter 11's meal guesser and use the name more often.
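Here is a small sketch of both directions: turning a snack name into a one-hot label, and turning the model's probabilities back into a snack name. The helper names one_hot and pick_highest are invented for this example; only CLASS_NAMES matches the script further down.

```python
CLASS_NAMES = ["fruit", "chips", "cookie"]

def one_hot(name, names=CLASS_NAMES):
    # All zeros, then flip the slot that matches the right answer to 1.
    label = [0] * len(names)
    label[names.index(name)] = 1
    return label

def pick_highest(probs, names=CLASS_NAMES):
    # The recommended class is the one with the biggest probability.
    return names[probs.index(max(probs))]

print(one_hot("fruit"))                  # [1, 0, 0]
print(one_hot("cookie"))                 # [0, 0, 1]
print(pick_highest([0.10, 0.85, 0.05]))  # chips
```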

▸ try it

The trained snack picker

This network was actually trained for you when this page loaded. Toggle the inputs and watch the probabilities update.

hungry

after school

weekend

🍎 fruit

0.80

🥔 chips

0.15

🍪 cookie

recommended: fruit

This same pattern — output one neuron per class, softmax to get probabilities — is how a network can recognize handwritten digits (10 classes: 0 through 9), pick a song to recommend, or pick the next word in a sentence (about 100,000 possible classes — one per possible word/word-piece). Same idea, just more output neurons. If you want the deeper "why" of softmax (where it comes from in statistics), Wikipedia's softmax function page is a solid first stop.

▸ run real python

Run the real multi-class snack picker

This is neuro4.py. 3 inputs, 6 hidden neurons, 3 output classes with softmax. Trains in a couple of seconds and prints a table with the right snack for each combination.

neuro4.py Python loads on first run

import numpy as np

# Step 4: multi-class classifier. 3 binary inputs -> 3 mutually
# exclusive output classes (fruit / chips / cookie).
# New concepts vs step 3: softmax output, cross-entropy loss,
# one-hot labels.
#
# Rule we're teaching: "Which snack should I grab?"
#   weekend                 -> cookie  (weekend treat)
#   after_school (not wknd) -> chips   (after-school energy)
#   otherwise               -> fruit   (boring-day default)
#   (the "hungry" input is intentionally noise — the network must
#    figure out by itself that it doesn't change the decision.)
#
# x: shape (8, 3). 8 samples, 3 binary features per sample.
#   columns = [hungry, after_school, weekend]
#   rows    = all 8 combinations.
#
# y: shape (8, 3). One-hot labels.
#   columns = [fruit, chips, cookie]   (exactly one column is 1 per row)
#
# Row-by-row decoding of (x, y):
#   [0,0,0] plain weekday         -> [1,0,0] fruit
#   [1,0,0] hungry weekday        -> [1,0,0] fruit
#   [0,1,0] after school          -> [0,1,0] chips
#   [1,1,0] hungry, after school  -> [0,1,0] chips
#   [0,0,1] weekend               -> [0,0,1] cookie
#   [1,0,1] hungry weekend        -> [0,0,1] cookie
#   [0,1,1] weekend + after-school-> [0,0,1] cookie  (weekend wins)
#   [1,1,1] hungry, both          -> [0,0,1] cookie
x = np.array([
    [0,0,0],[1,0,0],
    [0,1,0],[1,1,0],
    [0,0,1],[1,0,1],
    [0,1,1],[1,1,1],
])
y = np.array([
    [1,0,0],[1,0,0],
    [0,1,0],[0,1,0],
    [0,0,1],[0,0,1],
    [0,0,1],[0,0,1],
])

CLASS_NAMES = ["fruit", "chips", "cookie"]

def sigmoid(x): return 1/(1+np.exp(-x))
# Takes the sigmoid's OUTPUT (not its input): slope = out * (1 - out).
def dsigmoid(out): return out*(1-out)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

np.random.seed(1)
w1 = 2*np.random.random((3,6)) - 1   # 3 inputs -> 6 hidden
b1 = np.zeros((1,6))
w2 = 2*np.random.random((6,3)) - 1   # 6 hidden -> 3 classes
b2 = np.zeros((1,3))
lr = 0.5

for epoch in range(20000):
    # forward
    h = sigmoid(x @ w1 + b1)
    o = softmax(h @ w2 + b2)

    # backward
    d_o = (o - y) / x.shape[0]            # softmax + cross-entropy combined gradient
    d_h = (d_o @ w2.T) * dsigmoid(h)

    # update
    w2 -= lr * h.T @ d_o
    b2 -= lr * d_o.sum(axis=0, keepdims=True)
    w1 -= lr * x.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

    if epoch % 2000 == 0:
        loss = -np.sum(y * np.log(o + 1e-9)) / x.shape[0]
        print(f"epoch {epoch}: loss {loss:.4f}")

print()
print('Trained — "Which snack should I grab?"  (fruit / chips / cookie)')
header = (
    f'  {"hungry":>6} {"after_school":>13} {"weekend":>8}  '
    f'{"fruit":>6} {"chips":>6} {"cookie":>7}   '
    f'{"predicted":>9}   {"truth":>6}'
)
print(header)
for (hu, sc, we), probs, truth in zip(x, o, y):
    pred_idx  = int(probs.argmax())
    truth_idx = int(truth.argmax())
    pred_name  = CLASS_NAMES[pred_idx]
    truth_name = CLASS_NAMES[truth_idx]
    print(
        f'  {hu:>6} {sc:>13} {we:>8}  '
        f'{probs[0]:>6.2f} {probs[1]:>6.2f} {probs[2]:>7.2f}   '
        f'{pred_name:>9}   {truth_name:>6}'
    )

(click "run" to execute)

Challenge

Add a fourth snack class in your notebook: smoothie. What new output neuron and labels would the network need?

10. What does that "squish" actually do?

You've been hearing about sigmoid — the function that squishes any number into the range 0-1. Time to actually look at what it does, and meet its modern replacement, ReLU.

Here's what these functions look like when you plot them on graph paper. The horizontal axis is what goes INTO the neuron (the weighted sum). The vertical axis is what comes OUT.

▸ try it

Drag the dot. Watch the curve.

Slide the input left/right. The red dot is where the neuron sits right now. The orange dashed line is the slope at that exact point — the derivative.

sigmoid

ReLU

input 0.00

output (y): 0.50 slope (derivative): 0.25

What does the derivative tell us?

The derivative is the slope of the curve at that point. Why this matters: during training, the network uses the slope to figure out which way to nudge the weights.

Steep slope = "big push, learn fast here"
Flat slope (close to 0) = "stuck, weights barely move"

Look at sigmoid's slope by dragging the slider:

At x = 0, slope is 0.25 — the biggest it ever gets. The neuron is most "unsure" here and learns fastest.
At x = +5 or x = −5, slope is almost zero. The neuron is super confident — learning has basically stopped.
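You can check those slider readings without the slider. This is a minimal sketch; the function name sigmoid_slope is made up here, and it uses the standard shortcut that sigmoid's slope equals output × (1 − output).

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_slope(x):
    # Derivative of sigmoid, written in terms of its own output.
    s = sigmoid(x)
    return s * (1 - s)

for x in (-5, 0, 5):
    print(f"x = {x:+d}  output = {sigmoid(x):.3f}  slope = {sigmoid_slope(x):.4f}")
```

The middle line should show a slope of 0.2500, and the two outer lines a slope well under 0.01.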

Now switch to ReLU and compare

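Toggle "ReLU" above. The curve becomes two straight pieces joined at x = 0: flat at zero on the left, then a 45° ramp upward on the right. Its slope is either exactly 1 (positive input) or exactly 0 (negative input). No squishing, and no flat tail on the positive side: however big the input gets, the slope stays 1.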
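ReLU is short enough to write in two lines. A minimal sketch in plain Python follows; the name relu_slope is invented for this example, and treating the slope at exactly 0 as 0 is a common convention, not a law.

```python
def relu(x):
    # Negative inputs become 0; positive inputs pass through unchanged.
    return max(0.0, x)

def relu_slope(x):
    # 1 on the ramp, 0 on the flat part (we pick 0 at x = 0 by convention).
    return 1.0 if x > 0 else 0.0

for x in (-5.0, -0.1, 0.1, 5.0, 500.0):
    print(f"input {x:>6}  output {relu(x):>6}  slope {relu_slope(x)}")
```

Notice that the slope at 500 is the same as the slope at 0.1. Sigmoid's slope would be practically zero that far out.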

Why we replaced sigmoid in deep networks

The slope of sigmoid is at best 0.25, usually way less. When you stack 80 layers of sigmoid, backpropagation multiplies all the slopes together as the error travels backward. Picture it:

0.25 × 0.25 × 0.25 × ... (80 times) ≈ 0.0000000... (about 10⁻⁴⁹)

That's a decimal point followed by 48 zeros before the first non-zero digit shows up. For scale: a teaspoon of water holds roughly 500,000,000,000,000,000,000,000 atoms, and even 1 divided by that giant number is still vastly bigger than 10⁻⁴⁹. The signal vanishes long before it reaches the early layers. Those layers stop learning. This is the famous vanishing gradient problem, a major reason deep networks didn't work for years even after backpropagation was rediscovered.
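The multiplication is easy to run yourself. This sketch assumes the friendliest possible case for sigmoid, where every one of the 80 layers sits at its best slope of 0.25; real networks usually do worse than that.

```python
best_sigmoid_slope = 0.25
layers = 80

signal = 1.0
for _ in range(layers):
    signal *= best_sigmoid_slope  # each layer shrinks the signal at least this much

print(f"after {layers} sigmoid layers: {signal:.1e}")  # 6.8e-49
print(f"after {layers} slope-1 layers: {1.0 ** layers}")  # 1.0
```

The second line is the ReLU comparison: multiplying by a slope of 1 any number of times leaves the signal untouched.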

ReLU was a big part of the fix. Wherever its input is positive, its slope of 1 passes the signal back unshrunk, so the signal can survive many layers. You'll find ReLU or one of its modern relatives (GELU, used in BERT and the GPT models; SiLU / Swish, used in the Llama models and some Google models; and LeakyReLU, which lets a tiny bit of signal through on the negative side to avoid "dead" neurons) inside nearly every recent neural network: image recognition, voice assistants, ChatGPT, all of them. Wikipedia's activation function page has them all in one place.

▸ run real python

Run the ReLU version

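This is neuro5.py. It keeps the snack picker but swaps the hidden activation from sigmoid to ReLU. The learning rate is smaller (0.1 instead of 0.5) because ReLU doesn't cap its output the way sigmoid does, so steps that are too big can blow up the weighted sums.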

neuro5.py Python loads on first run

import numpy as np

# Step 5: same snack classifier as neuro4.py, but ReLU hidden
# activation instead of sigmoid. ONE change to forward + ONE change
# to backward. Output layer stays softmax + cross-entropy.
#
# Why this matters:
#   Sigmoid saturates (slope -> 0) when input is very + or very -.
#   In deep nets, sigmoid gradients shrink layer by layer until they
#   vanish. ReLU = max(0, x). Slope is exactly 1 wherever x > 0,
#   so gradients pass through cleanly. Most modern deep nets use
#   ReLU or a close relative (GELU, SiLU, LeakyReLU) in hidden layers.
#
# Side note: ReLU is sensitive to learning rate. Sigmoid was forgiving
# at lr=0.5. With ReLU we drop to lr=0.1 to avoid blowing up the
# weighted sums on the first few steps.
#
# Same data as neuro4.py: "Which snack should I grab?"
#   inputs  = [hungry, after_school, weekend]
#   classes = [fruit, chips, cookie]
x = np.array([
    [0,0,0],[1,0,0],
    [0,1,0],[1,1,0],
    [0,0,1],[1,0,1],
    [0,1,1],[1,1,1],
])
y = np.array([
    [1,0,0],[1,0,0],
    [0,1,0],[0,1,0],
    [0,0,1],[0,0,1],
    [0,0,1],[0,0,1],
])

CLASS_NAMES = ["fruit", "chips", "cookie"]

def relu(x):  return np.maximum(0, x)
def drelu(x): return (x > 0).astype(float)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

np.random.seed(1)
# He init would be sqrt(2/fan_in) scaling. Skipping it here, the small
# uniform init still converges on this toy problem.
w1 = 2*np.random.random((3,6)) - 1   # 3 inputs -> 6 hidden
b1 = np.zeros((1,6))
w2 = 2*np.random.random((6,3)) - 1   # 6 hidden -> 3 classes
b2 = np.zeros((1,3))
lr = 0.1

for epoch in range(20000):
    # forward
    h = relu(x @ w1 + b1)          # <- was sigmoid in neuro4.py
    o = softmax(h @ w2 + b2)

    # backward
    d_o = (o - y) / x.shape[0]     # softmax + cross-entropy combined gradient
    d_h = (d_o @ w2.T) * drelu(h)  # <- was * dsigmoid(h) in neuro4.py

    # update
    w2 -= lr * h.T @ d_o
    b2 -= lr * d_o.sum(axis=0, keepdims=True)
    w1 -= lr * x.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

    if epoch % 2000 == 0:
        loss = -np.sum(y * np.log(o + 1e-9)) / x.shape[0]
        print(f"epoch {epoch}: loss {loss:.4f}")

print()
print('Trained — "Which snack should I grab?"  (fruit / chips / cookie)')
header = (
    f'  {"hungry":>6} {"after_school":>13} {"weekend":>8}  '
    f'{"fruit":>6} {"chips":>6} {"cookie":>7}   '
    f'{"predicted":>9}   {"truth":>6}'
)
print(header)
for (hu, sc, we), probs, truth in zip(x, o, y):
    pred_idx  = int(probs.argmax())
    truth_idx = int(truth.argmax())
    pred_name  = CLASS_NAMES[pred_idx]
    truth_name = CLASS_NAMES[truth_idx]
    print(
        f'  {hu:>6} {sc:>13} {we:>8}  '
        f'{probs[0]:>6.2f} {probs[1]:>6.2f} {probs[2]:>7.2f}   '
        f'{pred_name:>9}   {truth_name:>6}'
    )

(click "run" to execute)

Challenge

In the ReLU script, try changing lr = 0.1 to 0.5. (lr is the learning rate — the "small step" from chapter 4's nudge recipe. Chapter 12 digs into it properly.) If training gets worse, you have found why the size of that step matters.

11. Meal guesser

The snack picker had three choices. Real classifiers often have more classes and messier inputs. So here is the snack picker's big sibling: a tiny network that guesses what meal you're describing from menu-style clues.

You met one-hot encoding back in chapter 9 (the snack labels). This time we'll do it on the inputs too. A neural network wants numbers, not words. So a word choice like course = dessert becomes a few yes/no inputs:

course_main = 0
course_snack = 0
course_dessert = 1

Same for temperature, taste, diet, and base. The network does not receive the word "pastry"; it receives base_pastry = 1 and the other base inputs set to 0. Most of these clues are shared by several meals, so one clue is rarely enough by itself; the network has to combine them.
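Here is one way the word-to-numbers step could look in code. It's a sketch: TRAIT_GROUPS and encode are names invented for this example, and the column order is chosen to match the neuro6.py script further down.

```python
TRAIT_GROUPS = {
    "course": ["main", "snack", "dessert"],
    "temp":   ["hot", "cold"],
    "taste":  ["sweet", "savory"],
    "diet":   ["veg", "meat"],
    "base":   ["pastry", "potato", "pasta", "dairy", "fruit"],
}

def encode(meal):
    # Walk the groups in a fixed order; each choice becomes
    # a run of 0s with a single 1 in the matching slot.
    row = []
    for group, options in TRAIT_GROUPS.items():
        row += [1 if meal[group] == option else 0 for option in options]
    return row

apple_pie = {"course": "dessert", "temp": "hot", "taste": "sweet",
             "diet": "veg", "base": "pastry"}
print(encode(apple_pie))
# [0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0]
```

Five word choices went in and 14 numbers came out, with exactly five of them set to 1 (one per trait group).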

meal	course	temperature	taste	diet	base
apple pie	dessert	hot	sweet	vegetarian	pastry
sausage roll	snack	hot	savory	meat	pastry
cottage pie	main	hot	savory	meat	potato
baked potato	main	hot	savory	vegetarian	potato
spaghetti	main	hot	savory	meat	pasta
mac & cheese	main	hot	savory	vegetarian	pasta
ice cream	dessert	cold	sweet	vegetarian	dairy
fruit salad	dessert	cold	sweet	vegetarian	fruit

Look closely at the table before training: it's full of near-misses. Cottage pie and baked potato match on everything except diet. Spaghetti and mac & cheese are the same story: only diet differs (the spaghetti is bolognese, that's where the meat is). Ice cream and fruit salad differ only on base. And apple pie and sausage roll share a pastry crust and are both served hot, yet disagree on course, taste, and diet. With so many shared clues, the network must combine them to tell the meals apart.
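You don't have to hunt for near-misses by eye. The sketch below copies the table into a dictionary and lists every pair of meals that disagrees on exactly one trait. The names MEAL_TRAITS, differing_traits, and near_misses are made up for this example.

```python
from itertools import combinations

TRAITS = ("course", "temperature", "taste", "diet", "base")
MEAL_TRAITS = {
    "apple pie":    ("dessert", "hot",  "sweet",  "vegetarian", "pastry"),
    "sausage roll": ("snack",   "hot",  "savory", "meat",       "pastry"),
    "cottage pie":  ("main",    "hot",  "savory", "meat",       "potato"),
    "baked potato": ("main",    "hot",  "savory", "vegetarian", "potato"),
    "spaghetti":    ("main",    "hot",  "savory", "meat",       "pasta"),
    "mac & cheese": ("main",    "hot",  "savory", "vegetarian", "pasta"),
    "ice cream":    ("dessert", "cold", "sweet",  "vegetarian", "dairy"),
    "fruit salad":  ("dessert", "cold", "sweet",  "vegetarian", "fruit"),
}

def differing_traits(a, b):
    # Names of the trait columns where the two meals disagree.
    pairs = zip(TRAITS, MEAL_TRAITS[a], MEAL_TRAITS[b])
    return [trait for trait, va, vb in pairs if va != vb]

near_misses = []
for a, b in combinations(MEAL_TRAITS, 2):
    diff = differing_traits(a, b)
    if len(diff) == 1:
        near_misses.append((a, b, diff[0]))
        print(f"{a} vs {b}: only {diff[0]} differs")
```

Running it finds five one-trait pairs, including a couple that are easy to miss by eye, such as cottage pie vs spaghetti (same everything except base).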

▸ try it

Train a meal guesser

Click train, then change the traits. The network will show probability bars for each meal. Try normal combinations, then weird ones like a cold sweet pastry that contains meat.

Best trick: set course = main, temperature = hot, taste = savory, base = pasta — and leave diet unknown. Watch spaghetti and mac & cheese split the vote almost 50/50. Then pick a diet and watch the tie break instantly. That's a network weighing evidence.

Traits

course

temperature

taste

diet

base

Network shape

hidden

epoch: 0 train accuracy: ?

Prediction

How the clues become numbers

What training saved — the model itself, live

The trained model is just learned numbers — and this is all of them: 200 numbers for the current network shape. Press the train buttons and watch them get rewritten; freshly changed numbers light up. The grey // notes aren't part of the model — they're labels so you can find your way around the grid.

Why this is useful: the network can learn clue-combinations. One hidden neuron might become useful for cold + sweet, another for pastry, another for main + pasta. We did not name those hidden neurons ourselves. Training nudged the weights until useful patterns appeared. (When researchers try to read what hidden neurons learned in real models, the field they're working in is called interpretability. It's hard, and it's some of the most interesting research in AI right now.)

How many hidden neurons? That number is a choice the builder makes before training. Training changes the weights, biases, and probabilities, but it does not grow new neurons in this tiny model. Try 2, 4, 8, and 12 hidden neurons: too few can run out of pattern space, while more gives the model more room to memorize the 8 training meals exactly instead of learning the rule. That's called overfitting, and it's a problem real machine learning has to fight constantly. We'll come back to it when we move to real datasets.
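To get a feel for how much "room" each choice gives the model, you can count its learned numbers. This sketch assumes a network with 14 inputs, one hidden layer, and 8 outputs, which matches the meal table; the interactive panel's exact shape is a guess here, and count_numbers is a made-up helper name.

```python
def count_numbers(n_inputs, n_hidden, n_outputs):
    # Every input connects to every hidden neuron, and every hidden
    # neuron connects to every output. Each connection is one weight.
    weights = n_inputs * n_hidden + n_hidden * n_outputs
    # Each hidden neuron and each output neuron also has one bias.
    biases = n_hidden + n_outputs
    return weights + biases

for hidden in (2, 4, 8, 12):
    total = count_numbers(14, hidden, 8)
    print(f"{hidden:>2} hidden neurons -> {total} learned numbers")
```

With only 8 training meals, even the smallest of these networks has more knobs than examples, which is part of why memorizing is so easy for it.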

One more thing while you're here: hit the "show learned numbers" button further down. The trained model is just a bag of numbers, and that button literally prints them out. It's the grown-up sibling of the dark "model" panels you've been watching since chapter 4. This is what's saved when someone says "the model weighs 10 GB" or "I downloaded the weights." Once you know the network's shape (how many layers, how many neurons), a model is its weights and nothing else.

▸ run real python

Run the meal guesser

This is neuro6.py. It trains the same idea in numpy and prints predictions for the training meals plus one mixed-up custom meal.

neuro6.py Python loads on first run

import numpy as np

# Step 6: meal guesser. This is the snack picker idea with a more
# interesting dataset: guess a meal from menu-style clues.
#
# New concept: categorical features become one-hot inputs.
#   course = main / snack / dessert
# becomes three input columns:
#   course_main, course_snack, course_dessert
#
# The traits below are intentionally shared by several meals, so the
# network has to combine clues instead of relying on one giveaway.
# Several pairs differ by exactly ONE trait (near-miss pairs), e.g.:
#   cottage pie vs baked potato  -> only diet differs
#   spaghetti   vs mac & cheese  -> only diet differs
#   ice cream   vs fruit salad   -> only base differs
# ("spaghetti" here is spaghetti bolognese — that's where the meat is.)
#
# Inputs:
#   course_main, course_snack, course_dessert,
#   temp_hot, temp_cold,
#   taste_sweet, taste_savory,
#   diet_veg, diet_meat,
#   base_pastry, base_potato, base_pasta, base_dairy, base_fruit
#
# Outputs:
#   apple pie, sausage roll, cottage pie, baked potato,
#   spaghetti, mac & cheese, ice cream, fruit salad

MEALS = [
    "apple pie", "sausage roll", "cottage pie", "baked potato",
    "spaghetti", "mac & cheese", "ice cream", "fruit salad",
]
FEATURES = [
    "course_main", "course_snack", "course_dessert",
    "temp_hot", "temp_cold",
    "taste_sweet", "taste_savory",
    "diet_veg", "diet_meat",
    "base_pastry", "base_potato", "base_pasta", "base_dairy", "base_fruit",
]

x = np.array([
    # main snk des  hot cold  swt sav  veg meat  pastry pot pasta dairy fruit
    [0,   0,  1,   1,  0,    1,  0,   1,  0,    1,     0,  0,    0,    0],  # apple pie
    [0,   1,  0,   1,  0,    0,  1,   0,  1,    1,     0,  0,    0,    0],  # sausage roll
    [1,   0,  0,   1,  0,    0,  1,   0,  1,    0,     1,  0,    0,    0],  # cottage pie
    [1,   0,  0,   1,  0,    0,  1,   1,  0,    0,     1,  0,    0,    0],  # baked potato
    [1,   0,  0,   1,  0,    0,  1,   0,  1,    0,     0,  1,    0,    0],  # spaghetti
    [1,   0,  0,   1,  0,    0,  1,   1,  0,    0,     0,  1,    0,    0],  # mac & cheese
    [0,   0,  1,   0,  1,    1,  0,   1,  0,    0,     0,  0,    1,    0],  # ice cream
    [0,   0,  1,   0,  1,    1,  0,   1,  0,    0,     0,  0,    0,    1],  # fruit salad
], dtype=float)

y = np.eye(len(MEALS))

# Silence cosmetic Accelerate/NumPy warnings seen on some macOS builds
# during tiny matrix multiplies. The values remain finite and training
# converges; the warning is not useful for this lesson.
np.seterr(divide="ignore", over="ignore", invalid="ignore")

def relu(z):
    return np.maximum(0, z)

def drelu(z):
    return (z > 0).astype(float)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

np.random.seed(2)
w1 = np.random.randn(len(FEATURES), 10) * np.sqrt(2 / len(FEATURES))
b1 = np.zeros((1, 10))
w2 = np.random.randn(10, len(MEALS)) * np.sqrt(2 / 10)
b2 = np.zeros((1, len(MEALS)))
lr = 0.04

for epoch in range(6000):
    hidden_raw = x @ w1 + b1
    hidden = relu(hidden_raw)
    probs = softmax(hidden @ w2 + b2)

    d_out = (probs - y) / x.shape[0]
    d_hidden = (d_out @ w2.T) * drelu(hidden_raw)

    w2 -= lr * hidden.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    w1 -= lr * x.T @ d_hidden
    b1 -= lr * d_hidden.sum(axis=0, keepdims=True)

    if epoch % 1000 == 0:
        loss = -np.sum(y * np.log(probs + 1e-9)) / x.shape[0]
        acc = (probs.argmax(axis=1) == y.argmax(axis=1)).mean()
        print(f"epoch {epoch}: loss {loss:.4f}  accuracy {acc:.3f}")

print()
print("Trained meal guesser")
print(f'  {"meal":>12}  {"prediction":>12}  confidence')
# Use a fresh name for each row so the probs matrix isn't overwritten.
for meal, row in zip(MEALS, probs):
    best = int(row.argmax())
    print(f'  {meal:>12}  {MEALS[best]:>12}  {row[best]:.3f}')

print()
print("Try a mixed-up meal: snack + cold + sweet + meat + pastry")
weird = np.array([[0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0]], dtype=float)
h = relu(weird @ w1 + b1)
p = softmax(h @ w2 + b2)[0]
for idx in p.argsort()[::-1][:4]:
    print(f"  {MEALS[idx]:>12}: {p[idx]:.3f}")

(click "run" to execute)

Challenge

Add lasagna in your notebook. Write out its traits — they come out identical to spaghetti's row. What happens to a network when two different answers have the exact same clues? What new trait group would fix it?

12. How learning actually works

You've been hearing "nudge the weights in the right direction" for eleven chapters — and back in chapter 4 you even performed a nudge by hand. Time to pin down the full rule behind it. This is the math that powers every other chapter in this book.

The loss landscape

Imagine every weight in the network is a knob you can turn. A tiny network might have 20 knobs; ChatGPT has hundreds of billions. For each combination of knob positions, you can measure how wrong the network is on your training data. That single number is called the loss (sometimes "cost" or "error" — same idea).

If you could plot loss against every possible knob setting, you'd get a landscape — bumpy, hilly terrain stretching out in millions of dimensions. High places are bad (the network gets stuff wrong); low places are good. Training a network is the search for a low spot in that landscape.

Humans can't picture a 175-billion-dimensional landscape. Nobody can. But the rule for getting downhill is the same whether you're in 2 dimensions or 2 billion: at your current spot, figure out which direction goes downward fastest, take a small step that way, repeat.

The gradient is just the slope

The word gradient sounds fancy. It just means "slope, in every direction at once." If you're standing on a hillside, the gradient is the arrow that points straight uphill from where you're standing, along with how steep that direction is. To get downhill fastest, you walk the exact opposite way.

For each weight in the network, the gradient says: "if you increase this weight a tiny bit, the loss goes up (or down) by this much." Calculus (specifically, the chain rule we met in chapter 8) is the tool that computes the gradient for every weight at once. The algorithm that does this efficiently for a deep network is — you guessed it — backpropagation.

Once you have the gradient, the update rule is comically simple:

new_weight = old_weight  −  (learning_rate × gradient)

That's it. Every line of "training" in every script in this book is doing some version of that. The minus sign is what makes it descent: you're moving the weight against the gradient, toward smaller loss. And compare it to chapter 4's hand-made recipe, nudge = small step × error × input: that was this exact formula in disguise. The "small step" is the learning rate, and "error × input" is the downhill direction (the gradient with its sign flipped) for a one-neuron network, which is why that recipe adds the nudge instead of subtracting it.
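Here is the update rule and the nudge recipe side by side for a single weight. This is a sketch under two assumptions that are not spelled out above: error means truth minus guess, and the loss is half the squared error (picked so the gradient comes out as exactly minus error × input).

```python
# One neuron, one weight, no bias: guess = weight * input.
weight = 0.2
x, truth = 1.0, 1.0
learning_rate = 0.5

guess = weight * x
error = truth - guess      # positive means "the guess was too low"
gradient = -error * x      # slope of the loss with respect to the weight

new_weight_descent = weight - learning_rate * gradient  # this chapter's rule
new_weight_nudge = weight + learning_rate * error * x   # nudge-style recipe

print(new_weight_descent, new_weight_nudge)  # the same number, twice
```

The guess was too low, so the gradient is negative, and subtracting a negative pushes the weight up. Both routes land on the same new weight.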

▸ try it

Roll the ball downhill

The curve is the loss. Horizontal position is the value of one weight. The red dot is where the network currently sits. Click "step" to take one gradient-descent step. Watch the dot find the bottom — and watch the steps get smaller as the slope flattens out, because the gradient itself shrinks near the minimum.

Why we take small steps (learning rate)

That learning_rate in the update is the size of each step. It's one of the most important knobs in deep learning — and it's a knob you pick, not the network.

Too small (say 0.0001): the network learns, but excruciatingly slowly. It might take a million steps to get where a good learning rate would get in a thousand.
Too big (say 10): the network overshoots the minimum. Imagine rolling a ball down a hill so hard it flies up the other side, then back across, then up again. The loss bounces around and never settles. Sometimes it even explodes to infinity.
Just right (often 0.001 to 0.1, depending on the model): steady downhill progress.
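The three cases are easy to see on a toy problem. In this sketch the loss is simply weight², a bowl with its bottom at weight = 0; that loss, the starting point of 4.0, and the function name descend are all made up for the illustration.

```python
def descend(learning_rate, steps=20, start=4.0):
    # Toy loss: loss = weight**2, so the slope (gradient) is 2 * weight
    # and the lowest point sits at weight = 0.
    weight = start
    for _ in range(steps):
        gradient = 2 * weight
        weight = weight - learning_rate * gradient
    return weight

for lr in (0.0001, 0.1, 10):
    print(f"lr = {lr}: weight after 20 steps = {descend(lr):.4g}")
```

With lr = 0.0001 the weight has barely moved from 4. With lr = 0.1 it is close to 0. With lr = 10 every step overshoots so badly that the weight ends up astronomically far from the bottom.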

In neuro1.py the learning rate was implicitly 1 (we didn't write it — the toy data was friendly). In neuro3.py we wrote lr = 0.5. In neuro5.py we used lr = 0.1 because ReLU is more sensitive. In real training people use schedules that shrink the learning rate over time — start big to make fast progress, end small to settle into a good minimum.

Local minima — why we don't usually care

The landscape isn't a simple bowl. It has lots of dips, valleys, plateaus, ridges. A "local minimum" is a dip that's lower than everything around it but not the lowest place in the whole landscape. For decades, people worried that gradient descent would get stuck in bad local minima.

Here's the surprising empirical finding: in really high-dimensional landscapes (the kind real networks live in), most local minima turn out to be roughly as good as the global minimum. Bad minima are rare. There's also a lot of geometry where the network can wiggle around until it finds a path out. Modern networks just train, find a low spot, and the low spot is good enough. Nobody has a clean theory for why this works as well as it does — but it does.

Mini-batches: noisy descent

So far we've described "compute the loss on all your training data, then take a step." That's called full-batch gradient descent. With 60,000 training examples it's wildly expensive — every step requires running the entire network on the entire dataset.

The fix everyone uses: shuffle the data, take a tiny chunk (say 32 or 128 examples — a mini-batch), compute the gradient on just that chunk, take a step, grab the next chunk, repeat. This is called stochastic gradient descent (SGD), or really mini-batch SGD.

Picture-wise: instead of a smooth ball rolling cleanly downhill, imagine a slightly drunk ball wobbling downhill. Each step's direction is approximately correct but jittery, because it was computed from a small sample. The surprising thing is the noise actually helps — it shakes the ball out of small bad pockets and tends to find broader, flatter minima, which generalize better to new data. So mini-batching is faster AND finds better solutions. Win.
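The shuffle, chunk, step loop can be sketched in a few lines of NumPy. Everything about the data here is invented for the example: 1,000 noisy points following the rule y = 3 × x, a single weight to learn, and a batch size of 32.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up dataset: 1,000 examples of the rule y = 3 * x, plus a little noise.
x = rng.uniform(-1, 1, size=1000)
y = 3 * x + rng.normal(0, 0.1, size=1000)

weight = 0.0
learning_rate = 0.1
batch_size = 32

for epoch in range(5):
    order = rng.permutation(len(x))              # shuffle once per pass
    for start in range(0, len(x), batch_size):
        batch = order[start:start + batch_size]  # grab the next small chunk
        guess = weight * x[batch]
        # Gradient of the mean squared error, computed on this chunk only.
        gradient = np.mean(2 * (guess - y[batch]) * x[batch])
        weight -= learning_rate * gradient
    print(f"epoch {epoch}: weight = {weight:.3f}")
```

The weight should wobble its way toward 3. It never looks at all 1,000 examples in a single step, yet it still gets there.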

That's gradient descent. Every optimizer you'll meet (Adam, RMSprop, momentum, AdamW) is the same loop with smarter rules about how much to step and which direction to actually go (often a smoothed blend of recent gradients). The core idea, "compute the slope, step against it," has not changed since Cauchy described it in 1847.

Challenge

Click "step downhill" slowly and watch the steps shrink near the bottom. Why do they shrink? (Hint: step = lr × gradient, and the gradient is the slope — what's the slope of a flat line?)

13. Scaling this up

Everything so far has been with maybe a few dozen neurons total. Real networks are gigantic:

Network	Year	Roughly how many weights
The tiny examples above	—	~50
Meal guesser (this book, ch 11)	—	~200
MNIST classifier (this book, the MNIST script)	—	~110,000
LeNet-5 (digit recognizer)	1998	~60,000
ResNet-50 (image classifier)	2015	~25 million
GPT-3 (the base model under early ChatGPT)	2020	175 billion
Frontier LLMs today (Claude, GPT-4, Gemini, Llama-405B)	2023–2026	Hundreds of billions to trillions (exact numbers usually not public)

What stays the same when you scale up

The math. Genuinely. The forward pass of a 175-billion-parameter model is still: multiply inputs by weights, add bias, apply a non-linearity, repeat. Backprop still computes gradients the same way. SGD still steps against the gradient. If you understand neuro4.py, you understand the inner loop of GPT.
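That "multiply, add bias, apply a non-linearity, repeat" recipe fits in a short loop. The sketch below uses a made-up shape (3 inputs, two hidden layers of 6, 2 outputs) and random, untrained weights, so the output means nothing; the point is only that one loop handles any number of layers.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def forward(x, layers):
    # layers is a list of (weights, bias) pairs, one pair per layer.
    for weights, bias in layers:
        x = relu(x @ weights + bias)  # multiply, add bias, non-linearity
    return x

rng = np.random.default_rng(0)
sizes = [3, 6, 6, 2]  # hypothetical shape, chosen only for this example
layers = [(rng.normal(size=(a, b)), np.zeros((1, b)))
          for a, b in zip(sizes, sizes[1:])]

out = forward(np.array([[1.0, 0.0, 1.0]]), layers)
print(out.shape)  # (1, 2)
```

Real networks usually finish with a different last step (softmax, as in neuro4.py), and transformers add more moving parts, but this loop is the backbone.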

You'll see this for yourself in neuro9.py a few chapters from here — the MNIST classifier is the same shape as the iris classifier in neuro8.py, just with 784 inputs instead of 4 and two hidden layers instead of one.

What changes when you scale up

More layers. Toy nets have 2 layers (input → hidden → output). Image models like ResNet-50 have 50. Modern LLMs have dozens of layers stacked on top of each other — GPT-2 has between 12 and 48 depending on size, GPT-3 has 96, Llama-2-70B has 80, Llama-3-405B has 126. (GPT-4 and Claude depths aren't public.) Why does depth help? Each layer can build a more sophisticated representation on top of the last. Early layers might learn edges or syllables; middle layers learn shapes or words; late layers learn objects or whole concepts. This is called compositionality and it's the whole reason "deep" learning is called "deep." (The "why depth" chapter near the end of this book lets you watch it happen, on a puzzle that defeats any one-hidden-layer network of similar size.)

More parameters. Going from 60,000 weights to 175 billion is a factor of 3 million. That's not a tweak. Each weight is a number stored in memory; each is updated every training step.

More data. Iris has 150 rows. MNIST has 60,000 images. ImageNet has 1.2 million images. GPT-3 was trained on roughly 300 billion tokens — about 500 million pages of text. Modern frontier models train on trillions of tokens. The rough empirical rule is: bigger models need proportionally more data to learn well. The 2020 Kaplan paper and the 2022 "Chinchilla" paper worked out what those proportions are; people call those rules scaling laws.

More compute. Training GPT-3 took thousands of GPUs running for months and reportedly cost millions of dollars. A GPU (Graphics Processing Unit) is a chip originally built for video games, but it's spectacularly good at doing thousands of multiply-and-add operations in parallel — which is exactly what neural network forward and backward passes are. Specialized AI chips like Google's TPUs and NVIDIA's H100 / B200 take that idea further. The math is unchanged; the hardware that does the math billions of times per second is what changed.

What gets harder when you scale up

Vanishing/exploding gradients. Remember the 10⁻⁴⁹ from the activations chapter? Stacking 100 layers of anything will misbehave unless you're careful. Modern networks use ReLU-family activations, careful weight initialization, and a trick called residual connections (shortcuts that let signal skip past layers) to keep gradients alive.

Overfitting. With 175 billion parameters, your network has plenty of capacity to memorize its training data instead of learning rules. People fight this with bigger datasets, dropout, weight decay, and early stopping.

Engineering. Training a model that doesn't fit on one GPU means splitting it across thousands. Failure becomes a daily fact of life. Saving checkpoints, restarting, monitoring loss curves, debugging numerical issues — most of a frontier ML team's work is engineering, not algorithm design.

Why bigger keeps working

Here's the unreasonably-effective fact at the heart of modern AI: when you make a model bigger and train it on more data, it just keeps getting better. Smoothly. Predictably. People expected diminishing returns around 1 billion parameters, then 10 billion, then 100 billion. Each time, scale kept paying off. Rich Sutton's "The Bitter Lesson" is the most-cited essay on why: clever human-engineered tricks tend to get beaten by simple, general algorithms that just consume more compute. It's bitter because researchers like cleverness. It's a lesson because it keeps happening.

Challenge

Estimate the weights in a tiny network with 3 inputs, 5 hidden neurons, and 2 outputs. Count weights first, then remember each neuron also has a bias. Then work out how long you'd need to type out all 175 billion of GPT-3's weights if you typed one per second. (Hint: that's about 5,500 years.)

14. How AI chatbots actually write

Here's the leap from "neuron picks a snack" to "writes a whole story." The underlying model class is called a transformer, introduced in the 2017 paper "Attention Is All You Need". Every modern LLM is some descendant of that paper.

The tokens trick

ChatGPT doesn't see letters or words. Before anything happens, your message gets chopped into chunks called tokens. A token is usually a piece of a word (sometimes a whole short word, sometimes a single character). The chopping algorithm is called byte-pair encoding (BPE) — it learns which letter combinations are common in text and bundles them into single tokens. Common words like "the" become one token; rare words get split into pieces.

Each token has an ID number from a fixed list. The size of that list varies by model: GPT-2 had about 50,000 tokens; GPT-4's tokenizer has about 100,000; Llama-3 has 128,000. So when this chapter says "about 100,000 tokens," picture a frontier-model-sized vocabulary.

So… is there a lookup table? (Yes — two.)

The tokenizer is, at heart, a giant two-way lookup table plus a chopping rule. Going in: your text gets chopped into chunks the table knows, and each chunk is swapped for its id number. Going out: the model hands back an id number, and the very same table is read in reverse to turn it back into text. This table is built once, before training ever starts, and never changes.

The second lookup is the embedding table (next heading): the id picks a row, and that row of numbers is what the network actually computes with. So the full input path is text → chunks → ids → rows of numbers, and the output path is the mirror image: a score for every id → pick one → look its text up backwards → glue it onto the story. The model never sees letters at any point. It lives entirely in id-land.

▸ try it

Tokenize anything — watch text become numbers (and come back)

Type anything. This toy tokenizer has a vocabulary of only 70 chunks (a real model's has ~100,000), and it uses a simplified chopping rule: at every spot it grabs the longest chunk it knows. (Real BPE tokenizers merge pieces in a learned order instead, but the result looks much the same.) Common words are a single token; anything unusual gets built out of smaller pieces. Try playing, happiest, or pterodactyl.
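The "grab the longest chunk you know" rule can be sketched in a few lines. The vocabulary below is invented for this example (the interactive toy's real 70 chunks aren't listed on this page), and unlike the toy it keeps the space as its own token so that decoding gives back the exact text.

```python
# Hypothetical mini vocabulary, made up for this sketch.
VOCAB = ["the", "play", "ing", "happ", "i", "est", "cat", "s",
         "a", "e", "h", "l", "n", "p", "t", "y", " "]
TOKEN_TO_ID = {chunk: i for i, chunk in enumerate(VOCAB)}
ID_TO_TOKEN = {i: chunk for chunk, i in TOKEN_TO_ID.items()}
LONGEST = max(len(chunk) for chunk in VOCAB)

def encode(text):
    ids, pos = [], 0
    text = text.lower()
    while pos < len(text):
        # Try the longest possible chunk first, then shorter ones.
        for size in range(min(LONGEST, len(text) - pos), 0, -1):
            chunk = text[pos:pos + size]
            if chunk in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[chunk])
                pos += size
                break
        else:
            raise ValueError(f"no chunk in the table for {text[pos]!r}")
    return ids

def decode(ids):
    # Same table, read backwards.
    return "".join(ID_TO_TOKEN[i] for i in ids)

ids = encode("the cats playing")
print(ids)          # [0, 16, 6, 7, 16, 1, 2]
print(decode(ids))  # the cats playing
```

Try encode("happiest"): it comes out as three pieces, happ + i + est. A letter that isn't in the table (like "z" here) raises an error, which is the gap a real tokenizer's byte-level fallback closes.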

Lookup #1 — what the model actually receives

Lookup #2 — each id grabs its row of the embedding table

Each id is a row number in a giant table of learned numbers. Showing the first 16 numbers of each row as colors: blue = positive, red = negative. (These rows are random in this toy; in a real model they're weights, learned by the same nudging as everything else.)

And backwards — output is the same table, read the other way

When the model writes, it never outputs letters. It outputs one id, and the tokenizer table turns it back into text. Pretend you're the model — output an id:

Toy shortcuts to know about: we lowercase everything and treat spaces as separators, while real tokenizers bake capital letters and spaces right into the tokens. And a real tokenizer has a byte-level fallback, so nothing is ever truly "not in the table" — type an emoji or a digit here to see what our toy can't handle.

Each token becomes a long list of numbers

The model looks each token's ID up in a giant table and grabs a vector of numbers. ("Vector" is just math-speak for a list of numbers, like [0.2, −1.4, 0.7, …].) That list is the token's embedding. How long the vector is depends on the model: GPT-2 small used 768 numbers per token; GPT-3 (175B) used 12,288; smaller models use fewer, frontier models use more. Call it "thousands of numbers per token" and you're in the right ballpark.

Words with similar meanings have similar embeddings, because the model figured that out from reading huge amounts of text. The fingerprint for "dragon" is closer to "monster" than to "broccoli."

Here's the connection back to everything you've read so far: the embedding table is just learned weights. Same as the weights in chapter 3's neuron. Same as the weights in neuro4.py. It's one giant matrix — and a "matrix" is just a grid of numbers, a list of lists — with one row per token ID, each row holding that token's embedding numbers. During training, the same gradient descent that learns the OR rule also slowly nudges the embedding numbers until "king − man + woman" lands near "queen" and "dragon" lands near "monster." Nothing new in the algorithm; just a much bigger matrix.
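A small numpy sketch of lookup #2. The table below is random and tiny (10 rows of 4 numbers), and the three "meaning" vectors at the bottom are hand-written for illustration, not taken from any real model. It shows two things: an id simply picks a row, and "similar" can be measured with cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, width = 10, 4                     # toy sizes chosen for this sketch
embedding_table = rng.normal(size=(vocab_size, width))   # random here, learned in a real model

token_ids = np.array([3, 7, 3])               # three tokens, the first and last are the same
vectors = embedding_table[token_ids]          # each id grabs its row
print(vectors.shape)                          # (3, 4)

def cosine_similarity(a, b):
    """1.0 = pointing the same way, 0.0 = unrelated directions."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up embeddings, only to show how "closer" is computed.
dragon = np.array([0.9, 0.8, 0.1])
monster = np.array([0.8, 0.9, 0.2])
broccoli = np.array([0.1, 0.0, 0.9])

print(cosine_similarity(dragon, monster) > cosine_similarity(dragon, broccoli))  # True
```

The same id always grabs the same row, which is why the first and third vectors above are identical.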

The model's only job (during pre-training): predict the next token

Given the tokens so far, what's the most likely next token? That's literally the entire pre-training task. It's the snack-picker from chapter 9, but with about 100,000 classes instead of 3 — every possible token is one of the choices.

(We say "pre-training" because for products like ChatGPT and Claude there's a second training stage after this — covered below — that turns the raw next-token predictor into a polite assistant. The next-token prediction is the foundation everything else is built on, but it's not the whole story.)

▸ try it

Walk through the full pipeline

This shows what happens inside the model when it reads your prompt: "tell me a story about a dragon who likes pizza". Click "next stage" to step through it.

1. Tokenize

2. Embed

3. Attention

4. Predict

Stage 1 — chop the text into tokens

The model can't read letters. Your sentence gets chopped into chunks called tokens. Each one has an ID number from a list of about 100,000.

Once the model has picked the first token, it sticks it on the end of your prompt and runs the whole pipeline AGAIN to pick the next one. And again. And again. That's how a whole story comes out one token at a time.
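The loop itself is short enough to sketch. In the code below, fake_model is a hypothetical stand-in for the whole network: it is just a hand-written lookup that peeks at the last token, whereas a real model looks at every token so far. The part to notice is generate, which keeps gluing the new token onto the end and going around again.

```python
# Hypothetical stand-in for the real network (hand-written, not learned).
FAKE_NEXT = {"<start>": "Once", "Once": "upon", "upon": "a", "a": "time", "time": "<end>"}

def fake_model(tokens):
    """Pretend pipeline: returns the next token given the tokens so far."""
    return FAKE_NEXT.get(tokens[-1], "<end>")

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = fake_model(tokens)   # run the whole pipeline on everything so far
        if next_token == "<end>":         # the model can also decide it is finished
            break
        tokens.append(next_token)         # stick it on the end, then go around again
    return tokens

story = generate(["<start>"])
print(story)   # ['<start>', 'Once', 'upon', 'a', 'time']
```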

▸ try it

Watch a story get written one token at a time

Your message: "tell me a story about a dragon who likes pizza". Click "next token" to see what the model picks. Each time you'll see its top 5 guesses with probabilities — then it picks one and adds it to the story.

(The probabilities below are illustrative — they show roughly what a real LLM's distribution looks like at each step, but they're hand-written for this widget, not live model output. Also, the top 5 you see is a slice from the full distribution over ~100,000 tokens; the rest of the mass is spread across thousands of tokens with tiny individual probabilities.)

The model does not keep a human-style outline with a fixed ending. At each step it picks a next token from the context so far, very fast, using patterns it learned from huge amounts of text, code, and conversations.

The surprising thing is that this simple-looking loop — "pick the next token, then pick the one after that" — can produce essays, code, math steps, and full conversations. To be genuinely good at "what's the next token", the model has to absorb deep patterns in language and the world.

Sampling: why the same prompt gives different answers

You may have noticed: if you ask ChatGPT the same question twice, you don't always get the same answer. That's on purpose. The model produces a probability for every possible next token, and then it samples — picks one at random, with higher-probability tokens being more likely. It doesn't always pick the top one.

The sampling has a few knobs:

Temperature. A number, usually 0 to 2. Low temperature (near 0) makes the model very predictable — it almost always picks the most likely token. High temperature (above 1) flattens out the probabilities, so it's willing to pick unusual tokens. Default for chat is often around 0.7 — a balance between coherent and creative.
Top-k. Only consider the top k candidates and ignore the rest. Top-5 means "only ever pick from the 5 most likely next tokens, no matter what" — useful for stopping the model from suddenly veering into a low-probability weird token.
Top-p (also called nucleus sampling). Like top-k but adaptive: keep the most likely tokens until their probabilities add up to p (say 0.9), then sample from that group. This adjusts to context — sometimes the model is sure (one token gets 99% of the mass) and sometimes it's spread out (it considers many).

So when the pipeline above showed "model picks Once (highest probability)" — that's a simplification. With temperature 0 it would always pick Once. With temperature 0.7 it would usually pick Once but sometimes pick Sure, In, or There. This is also why "regenerate" produces different answers and why the same prompt twice doesn't have to match.
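The three knobs can be sketched in a few lines of numpy. The scores below are invented for five candidate tokens, and the function name next_token_probs is made up for this example. Real systems differ in details (for instance, the order in which they apply the knobs), so treat this as one reasonable way to do it rather than how any particular product works.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def next_token_probs(scores, temperature=1.0, top_k=None, top_p=None):
    scores = np.asarray(scores, dtype=float)
    if temperature == 0:                      # special case: always the top token
        probs = np.zeros_like(scores)
        probs[np.argmax(scores)] = 1.0
        return probs
    probs = softmax(scores / temperature)     # low T sharpens, high T flattens
    order = np.argsort(-probs)                # token positions, most likely first
    keep = np.ones(len(probs), dtype=bool)
    if top_k is not None:
        keep[order[top_k:]] = False           # drop everything after the k best
    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        # Keep tokens up to and including the one that pushes the total past top_p.
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        keep[order[cutoff:]] = False
    probs = np.where(keep, probs, 0.0)
    return probs / probs.sum()                # rescale so the survivors add up to 1

tokens = ["Once", "Sure", "In", "There", "Pizza"]
scores = [3.0, 1.5, 1.0, 0.5, -1.0]           # made-up raw scores

rng = np.random.default_rng(0)
probs = next_token_probs(scores, temperature=0.7, top_k=4)
print(dict(zip(tokens, probs.round(3))))
print(rng.choice(tokens, p=probs))            # the actual random pick
```

Try changing temperature to 0.2 and then to 2.0 and compare how much probability "Once" keeps.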

Pre-training vs. ChatGPT-the-product (RLHF)

One more important wrinkle. Everything above describes a model trained to predict the next token in random text from the internet. That's the pre-trained base model. If you talked to a raw base model — and you can, lots of them are public on Hugging Face — it wouldn't act like an assistant. You'd type "What's 2+2?" and it might continue your text as if it were a math worksheet ("What's 2+2? ___ What's 3+3? ___") instead of answering you.

ChatGPT acts like an assistant because of a second training stage on top of pre-training. The most common technique is called RLHF — Reinforcement Learning from Human Feedback. The recipe:

Show the base model a prompt. Have it generate several different answers.
Have humans rate which answer was better.
Train a smaller "reward model" to predict the human ratings.
Use that reward model as a stand-in for human preferences, and nudge the original model's weights to produce answers the reward model rates highly.

RLHF is why ChatGPT acts polite, follows instructions, refuses harmful requests, and structures its replies. Without RLHF the underlying math is the same — it's still next-token prediction — but the model is "shaped" by examples of what humans liked into a conversational assistant. Anthropic's work on Claude's character is a good read on what this shaping looks like in practice.

If you want to see this whole pipeline built from scratch in real code, Andrej Karpathy's "Let's build GPT" video (about 2 hours) is the gold standard. He starts from a few lines of PyTorch (the framework covered in the "going deeper" chapters) and ends with a working tiny transformer. For the longer view, Stephen Wolfram's "What Is ChatGPT Doing… and Why Does It Work?" is a widely read long-form explainer of LLMs.

Challenge

Use the candidate bars to make a different story in your head. What would change if the model picked "In" instead of "Once" as the first token? Then think about what would happen at temperature 0 (always pick the top one) vs temperature 2 (almost flat probabilities).

15. What to remember so far

If you remember nothing else from the first fourteen chapters, remember this:

A neuron is a tiny math equation: multiply, add, squish.
A network is a bunch of neurons stacked into layers.
Weights are the adjustable numbers inside each neuron. A trained model is its weights and nothing else.
Training means showing examples and nudging weights (via gradient descent + backpropagation) to reduce mistakes, often millions or billions of times.
Deep networks (lots of layers) can learn complicated patterns that single neurons can't. The XOR fail from chapter 7 is the small version of why.
LLMs like ChatGPT use the same training idea at enormous scale, with a transformer architecture (covered in the "going deeper" chapters) built around next-token prediction plus a second RLHF stage that shapes the model into an assistant.
There is no single magic ingredient. The power comes from simple math, lots of examples, lots of computer power, and clever engineering working together.

The same training idea that solved the screen-time puzzle in chapter 4 is part of what powers ChatGPT. The real systems are much bigger and add specialized architecture and engineering.

Don't stop here — there's more. The next four chapters are the "going deeper" sections. They cover training on real datasets (the iris flowers), using PyTorch (the framework every real ML project uses), stacking hidden layers (watch a deep network crack a puzzle a shallow one can't), and the architectures behind image models and LLMs (CNNs and transformers). They include some of the strongest material in the book. Keep going.

Challenge

Explain neural networks to someone else using only these words: input, weight, mistake, nudge, layer, prediction.

If you want to build the real toy version in Python, look at neuro1.py through neuro9.py in this folder. Each one adds one new idea on top of the previous. The later chapters cover the more advanced topics — real datasets, frameworks, and networks for images and language. For a free, deeper textbook when you're ready, the Goodfellow / Bengio / Courville Deep Learning book is available online at deeplearningbook.org.

Going deeper — training on a real dataset

Everything up to chapter 15 used hand-picked toy examples: 4 rows for the screen-time rule, 8 for the snack picker. Real machine learning trains on much bigger datasets. A classic benchmark called iris (published by the statistician Ronald Fisher in 1936, from measurements taken by the botanist Edgar Anderson) has 150 flowers with 4 measurements each. MNIST (handwritten digits, curated by Yann LeCun and colleagues) has 60,000 training images. ImageNet has 1.2 million labelled training images (14 million across all the variants). GPT-3 was trained on about 300 billion tokens of text, somewhere around 500 million pages of writing.

Once you stop being your own dataset author, four disciplines kick in. None are hard, but each matters.

1. Train / test split

You can't measure how well a network learned by testing it on the same data you trained on. That's like marking your own homework. Hold back, say, 20% of the data, train on the other 80%, and only check accuracy on the held-out portion. That number is the only honest measure.

If your network gets 100% on training data and 60% on held-out test data, you're overfitting — it memorized the answers instead of learning the rule.
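A minimal sketch of the split using only numpy. The data here is random filler with iris-like shapes (150 rows, 4 columns), not real flower measurements, and the variable names are chosen for this example. The later script does the same job with scikit-learn's train_test_split.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 150                                   # iris-sized dataset
x = rng.normal(size=(n, 4))               # placeholder features
y = rng.integers(0, 3, size=n)            # placeholder labels 0, 1, 2

order = rng.permutation(n)                # shuffle first, so the split is not biased
n_test = n // 5                           # hold back 20%
test_idx, train_idx = order[:n_test], order[n_test:]

x_train, y_train = x[train_idx], y[train_idx]
x_test, y_test = x[test_idx], y[test_idx]   # never used for weight updates
print(x_train.shape, x_test.shape)        # (120, 4) (30, 4)
```

The shuffle matters: many datasets (iris included) are stored sorted by class, so taking "the last 20%" without shuffling would hold back only one kind of flower.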

2. Feature scaling

Iris petal length is measured in centimeters and runs from about 1 to 7 cm. Petal width is also in centimeters but covers a much smaller range (about 0.1 to 2.5 cm). If one input column has values 10× bigger than another, its weight updates will dominate training. Standardize each column to mean 0, standard deviation 1: compute the average and spread on the training set only, then apply that same shift and scale to both train and test.
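Standardizing is two lines of arithmetic. The numbers below are invented centimeter measurements for two columns, just to make the effect visible; the key detail is that the test rows are scaled with the training set's mean and spread.

```python
import numpy as np

# Made-up measurements in cm: column 0 spans a wide range, column 1 a narrow one.
x_train = np.array([[1.4, 0.2], [4.7, 1.4], [6.0, 2.5], [5.1, 1.8]])
x_test = np.array([[1.5, 0.3], [5.9, 2.1]])

mean = x_train.mean(axis=0)      # one average per column, from TRAIN only
std = x_train.std(axis=0)        # one spread per column, from TRAIN only

x_train_scaled = (x_train - mean) / std
x_test_scaled = (x_test - mean) / std    # reuse the train numbers; never recompute on test

print(x_train_scaled.mean(axis=0).round(6))   # each column now averages 0
print(x_train_scaled.std(axis=0).round(6))    # each column now has spread 1
```

The scaled test columns will usually not have mean exactly 0, and that is fine. The test set is supposed to look like new, unseen data.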

3. Mini-batches

Toy networks could feed all 4 (or 8) examples through at once each epoch. With 60,000 examples, or the millions that bigger datasets have, one giant pass gets slow and can run out of memory. Solution: slice the dataset into small batches (say, 16 or 128 rows). Run forward + backward + weight update on one batch. Move to the next batch. Once every batch has had a turn, that's one epoch.

Mini-batches also make training smoother — many small updates beat one huge swing.
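The slicing can be sketched as a small generator. The helper name minibatches is made up for this example; it hands out row numbers, and the training loop would then use them to pull rows out of the data.

```python
import numpy as np

def minibatches(n_rows, batch_size, rng):
    """Yield one array of row numbers per batch, in a fresh shuffled order."""
    order = rng.permutation(n_rows)
    for start in range(0, n_rows, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
batches = list(minibatches(n_rows=120, batch_size=16, rng=rng))

print(len(batches))                    # 8 batches make up one epoch
print([len(b) for b in batches])       # [16, 16, 16, 16, 16, 16, 16, 8]
```

The last batch is smaller because 120 does not divide evenly by 16. That is normal, and the training loop does not need to treat it specially.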

4. Watch loss AND accuracy

Loss is what the math optimizes. Accuracy is what humans care about. Print both. When training accuracy keeps climbing but test accuracy plateaus or drops — that's overfitting happening in real time.
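To make the difference concrete, here is a sketch that computes both numbers from the same predictions. The probabilities are invented for three examples; cross_entropy is the usual "how surprised was the model by the right answer" loss, averaged over rows.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Average of -log(probability the model gave to the correct class)."""
    rows = np.arange(len(labels))
    return float(-np.log(probs[rows, labels]).mean())

def accuracy(probs, labels):
    """Fraction of rows where the top guess matches the label."""
    return float((probs.argmax(axis=1) == labels).mean())

# Made-up softmax outputs for 3 examples and 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.4, 0.5, 0.1]])
labels = np.array([0, 1, 0])    # third row: model's top guess is class 1, truth is class 0

print(f"loss     {cross_entropy(probs, labels):.3f}")   # loss     0.499
print(f"accuracy {accuracy(probs, labels):.3f}")        # accuracy 0.667
```

If the third row changed to [0.45, 0.5, 0.05], accuracy would stay exactly the same while the loss dropped a little. Loss notices "almost right"; accuracy only counts hits and misses.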

Run neuro7.py in this folder to train iris for real (requires scikit-learn — install with pip install scikit-learn). After 200 epochs of mini-batch training, expect ~97% accuracy on the held-out test set. Same network shape as the snack picker (just 4 → 16 → 3 instead of 3 → 6 → 3), same math you already know, now eating real flower measurements. Iris is also a very easy dataset — pretty much any model gets above 95%, so don't read this as evidence you've built something amazing. It's the warm-up.

▸ run real python

Run iris training

This is neuro7.py. It uses numpy plus scikit-learn's built-in iris dataset.

Heads up: in the browser, the first time you click "run" here, the page downloads scikit-learn through Pyodide. That's roughly 10 megabytes and takes 20–40 seconds on a typical home connection. Later runs are fast.

neuro7.py Python loads on first run

# Requires: pip install numpy scikit-learn
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Step 7: real dataset, still hand-rolled backprop in numpy.
# Iris flowers: 150 samples, 4 numeric features, 3 species.
#
# New concepts vs the toy examples:
#   - Real-valued inputs (not just 0/1) -> need feature scaling
#   - Train/test split -> measure generalization, not memorization
#   - Mini-batches -> update weights more often, smoother training
#   - Accuracy metric -> easier to interpret than raw loss
#
# Dataset details:
#   x_raw: shape (150, 4). Columns = [sepal_len, sepal_wid, petal_len, petal_wid] in cm.
#   y_raw: shape (150,).   Values = 0 (setosa), 1 (versicolor), 2 (virginica).
#
# After preprocessing:
#   x_train: (120, 4) standardized features (mean 0, std 1).
#   y_train: (120, 3) one-hot labels.
#   x_test:  (30, 4)  held-out samples we NEVER train on.
#   y_test:  (30, 3)  held-out one-hot labels.

iris = load_iris()
x_raw = iris.data            # (150, 4)
y_raw = iris.target          # (150,)

# One-hot encode labels: 0 -> [1,0,0], 1 -> [0,1,0], 2 -> [0,0,1].
y_onehot = np.zeros((y_raw.shape[0], 3))
y_onehot[np.arange(y_raw.shape[0]), y_raw] = 1

# Train/test split. Hold out 20% of samples to measure generalization.
x_train, x_test, y_train, y_test = train_test_split(
    x_raw, y_onehot, test_size=0.2, random_state=1, stratify=y_raw,
)

# Feature scaling. Each feature now mean=0, std=1.
# Compute stats on TRAIN only, apply to both, so we don't leak test info.
mean = x_train.mean(axis=0)
std  = x_train.std(axis=0)
x_train = (x_train - mean) / std
x_test  = (x_test  - mean) / std

# Silence a cosmetic numpy warning. On Apple Silicon / Accelerate, the
# small `d_o @ w2.T` matmul in the backward pass can trigger spurious
# "divide/overflow/invalid" FP flags from inside SIMD code even though
# every input and output is finite and training converges fine.
np.seterr(divide="ignore", over="ignore", invalid="ignore")

def relu(x):  return np.maximum(0, x)
def drelu(x): return (x > 0).astype(float)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def accuracy(x, y):
    h = relu(x @ w1 + b1)
    o = softmax(h @ w2 + b2)
    return (o.argmax(axis=1) == y.argmax(axis=1)).mean()

np.random.seed(1)
# Wider hidden layer: 4 inputs -> 16 hidden -> 3 classes.
# He initialization: scale by sqrt(2/fan_in) for ReLU layers. Keeps
# activations from exploding or collapsing on the very first forward
# pass. Without it, the first batch can NaN out before training settles.
w1 = np.random.randn(4, 16) * np.sqrt(2.0 / 4)
b1 = np.zeros((1, 16))
w2 = np.random.randn(16, 3) * np.sqrt(2.0 / 16)
b2 = np.zeros((1, 3))
lr = 0.02            # ReLU + He init can amplify early gradients; small lr avoids NaN spikes
batch_size = 16
epochs = 200

n = x_train.shape[0]
for epoch in range(epochs):
    # Shuffle indices each epoch so mini-batches differ.
    idx = np.random.permutation(n)
    for start in range(0, n, batch_size):
        batch = idx[start:start+batch_size]
        xb = x_train[batch]
        yb = y_train[batch]

        # forward
        h = relu(xb @ w1 + b1)
        o = softmax(h @ w2 + b2)

        # backward
        d_o = (o - yb) / xb.shape[0]
        d_h = (d_o @ w2.T) * drelu(h)

        # update
        w2 -= lr * h.T @ d_o
        b2 -= lr * d_o.sum(axis=0, keepdims=True)
        w1 -= lr * xb.T @ d_h
        b1 -= lr * d_h.sum(axis=0, keepdims=True)

    if epoch % 20 == 0:
        train_acc = accuracy(x_train, y_train)
        test_acc  = accuracy(x_test,  y_test)
        print(f"epoch {epoch:3d}: train_acc {train_acc:.3f}  test_acc {test_acc:.3f}")

print()
print(f"final train accuracy: {accuracy(x_train, y_train):.3f}")
print(f"final test  accuracy: {accuracy(x_test,  y_test):.3f}")

Challenge

Change test_size=0.2 to 0.5. Does the test accuracy become noisier when the model has fewer training examples?

Going deeper — using a framework (PyTorch)

So far you've written every line yourself: x @ w1 + b1, sigmoid, manual gradients, weight updates. That's the right way to learn. But nobody works that way in real life. They use a framework called PyTorch (or its cousin TensorFlow; for production work on Google hardware, also JAX). The framework does the bookkeeping. You just describe the network shape.

The whole iris training loop in PyTorch looks like this:

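import torch
import torch.nn as nn

hidden = 16
model = nn.Sequential(
    nn.Linear(4, hidden),
    nn.ReLU(),
    nn.Linear(hidden, 3),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# `batches` stands for the (inputs, labels) mini-batches of iris data;
# neuro8.py below builds them by slicing shuffled row indices.
for epoch in range(200):
    for xb, yb in batches:
        logits = model(xb)    # forward pass: raw class scores
        loss = loss_fn(logits, yb)

        optimizer.zero_grad() # clear gradients left over from the last batch
        loss.backward()       # autograd computes every gradient
        optimizer.step()      # applies them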

Every piece replaces something you wrote by hand:

Hand-rolled	PyTorch
`np.maximum(0, x)`	`nn.ReLU()`
softmax + cross-entropy by hand	`nn.CrossEntropyLoss()`
Manual forward (`h = ...`, `o = ...`)	`model(x)`
Manual backward (`d_o`, `d_h`)	`loss.backward()` (autograd)
`w -= lr * grad`	`optimizer.step()`
Plain SGD	`torch.optim.Adam` (smarter)

Autograd is the powerful bookkeeping part. PyTorch tracks every operation you do during the forward pass. When you call loss.backward(), it walks the chain backward and computes every single gradient, exactly as you did by hand in chapter 8. You never write a chain-rule expression again.
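To take the mystery out of that bookkeeping, here is a tiny sketch of the idea in plain Python. It is not how PyTorch is implemented (PyTorch works on whole tensors and supports many more operations); the Value class below is a made-up teaching toy that only knows + and ×. Each result remembers its parents and a small rule for passing gradient back to them.

```python
class Value:
    """A number that remembers how it was made, so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None      # leaf values have nothing to pass back

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad          # d(a+b)/da = 1
            other.grad += out.grad         # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # List every value so that parents come before their results...
        order, seen = [], set()
        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        # ...then walk the chain backward from the final result.
        self.grad = 1.0
        for node in reversed(order):
            node._backward()

# One neuron, one example: loss = (w*x + b - 5)^2
w, x, b = Value(3.0), Value(2.0), Value(1.0)
error = w * x + b + Value(-5.0)
loss = error * error
loss.backward()

print(loss.data)   # 4.0
print(w.grad)      # 8.0, which matches 2 * error * x done by hand
print(b.grad)      # 4.0, which matches 2 * error done by hand
```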

Adam is the optimizer used in almost every modern model. It's gradient descent with two upgrades: momentum (keep moving in directions that worked recently) and per-weight learning rates (slow down weights that bounce around, speed up ones that crawl). For training transformers specifically, people now usually use a tweaked version called AdamW (the W stands for "weight decay") — but it's still recognizably the same algorithm.
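Here is a sketch of one Adam update in numpy, following the usual textbook form of the algorithm. The function name adam_step and the toy loss are made up for this example; in real projects you would call torch.optim.Adam instead of writing this yourself. The toy loss is a bowl that is 100 times steeper in one direction than the other, which is exactly the situation where per-weight step sizes help.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # momentum: running average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2       # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # correct the too-small early averages
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # each weight gets its own step size
    return w, m, v

# Toy loss: 1 * (w0 - 3)^2 + 100 * (w1 + 1)^2, lowest at w = [3, -1].
target = np.array([3.0, -1.0])
steepness = np.array([1.0, 100.0])

w = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 2001):
    grad = 2 * steepness * (w - target)           # gradient of the toy loss
    w, m, v = adam_step(w, grad, m, v, t)

print(w.round(2))   # should land close to [3, -1]
```

Dividing by the square root of v is the "per-weight learning rate" part: a weight whose gradients are always huge gets its steps shrunk, and a weight with tiny gradients gets its steps boosted.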

Run neuro8.py locally (after pip install torch) to see this exact loop train iris in seconds. The official PyTorch 60-minute blitz is the most-recommended next tutorial. After that, the natural next step for everyone reading this book is Andrej Karpathy's "Neural Networks: Zero to Hero" YouTube series — he rebuilds backprop, micrograd, and a working GPT from scratch in real code. It's basically a more advanced version of this book.

▸ run locally

Run the PyTorch version

This is neuro8.py. PyTorch is too large for this single-file browser runner, so this panel keeps the code synced and gives the exact local command.

pip install torch scikit-learn
python3 neuro8.py

neuro8.py local Python script

"""
Step 8: same iris problem as neuro7.py, rewritten in PyTorch.

Goal: see what a framework hides. Every line below replaces something
you wrote by hand in neuro7.py.

  numpy version (neuro7.py)            PyTorch version (this file)
  ---------------------------          -----------------------------
  np.maximum(0, x)                ->   nn.ReLU
  softmax + cross-entropy by hand ->   nn.CrossEntropyLoss
  manual forward (h = ..., o = ..)->   model(x)
  manual backward (d_o, d_h)      ->   loss.backward()    (autograd)
  manual weight updates           ->   optimizer.step()
  plain SGD                       ->   torch.optim.Adam   (smarter)
  CPU only                        ->   model.to('cuda') if available

The math is identical. PyTorch just stops you typing it.

Run:
    pip install torch scikit-learn
    python3 neuro8.py
"""

import numpy as np
import torch
import torch.nn as nn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# ---- data ----
iris = load_iris()
x_raw = iris.data           # (150, 4)
y_raw = iris.target         # (150,)  integer labels 0/1/2

x_train, x_test, y_train, y_test = train_test_split(
    x_raw, y_raw, test_size=0.2, random_state=1, stratify=y_raw,
)

# Standardize features using train stats only.
mean = x_train.mean(axis=0)
std  = x_train.std(axis=0)
x_train = (x_train - mean) / std
x_test  = (x_test  - mean) / std

# Convert numpy arrays to torch tensors.
# Note: CrossEntropyLoss expects integer class labels, NOT one-hot.
x_train_t = torch.tensor(x_train, dtype=torch.float32)
y_train_t = torch.tensor(y_train, dtype=torch.long)
x_test_t  = torch.tensor(x_test,  dtype=torch.float32)
y_test_t  = torch.tensor(y_test,  dtype=torch.long)

# ---- model ----
# nn.Sequential = stack of layers, runs them in order.
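# nn.Linear(in, out) = a weight matrix plus a bias vector of length out.
# PyTorch stores the weights with shape (out, in) and computes
# x @ W.T + b, which is the same math as the `x @ w + b` you wrote by hand.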
hidden = 16   # try 16, then 4, then 64
model = nn.Sequential(
    nn.Linear(4, hidden),
    nn.ReLU(),
    nn.Linear(hidden, 3),
)

# CrossEntropyLoss applies log-softmax internally, then computes
# cross-entropy. Don't put a softmax in the model. Tensors fed in are
# raw scores ("logits"), labels are integer class indices.
loss_fn = nn.CrossEntropyLoss()

# Adam = SGD with momentum + per-parameter learning rate. Almost
# always the default for new projects.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# ---- training ----
batch_size = 16
epochs = 200
n = x_train_t.shape[0]

for epoch in range(epochs):
    # Shuffle indices each epoch.
    idx = torch.randperm(n)
    for start in range(0, n, batch_size):
        batch = idx[start:start+batch_size]
        xb = x_train_t[batch]
        yb = y_train_t[batch]

        logits = model(xb)            # forward
        loss = loss_fn(logits, yb)

        optimizer.zero_grad()         # clear leftover grads
        loss.backward()               # autograd: compute every gradient
        optimizer.step()              # apply gradients

    if epoch % 20 == 0:
        with torch.no_grad():
            train_pred = model(x_train_t).argmax(dim=1)
            test_pred  = model(x_test_t).argmax(dim=1)
            train_acc = (train_pred == y_train_t).float().mean().item()
            test_acc  = (test_pred  == y_test_t).float().mean().item()
            print(f"epoch {epoch:3d}: train_acc {train_acc:.3f}  test_acc {test_acc:.3f}")

# ---- final eval ----
with torch.no_grad():
    train_acc = (model(x_train_t).argmax(dim=1) == y_train_t).float().mean().item()
    test_acc  = (model(x_test_t ).argmax(dim=1) == y_test_t ).float().mean().item()

print()
print(f"final train accuracy: {train_acc:.3f}")
print(f"final test  accuracy: {test_acc:.3f}")

Challenge

Change the hidden layer from 16 neurons to 4. Does PyTorch still train fast? Then try 64 and compare train accuracy to test accuracy.

Going deeper — stacking hidden layers

Every network in this book so far has had exactly one hidden layer: input → hidden → output. Real networks stack more: input → hidden → hidden → output, and the big ones go dozens of layers deep. The wiring rule doesn't change at all — the outputs of hidden layer 1 simply become the inputs of hidden layer 2. Every neuron in layer 2 does the same multiply-add-squish from chapter 3; it just chews on layer 1's answers instead of on the raw input.
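In code, stacking is just calling the same step twice. The sketch below uses random, untrained weights and a 2 → 8 → 8 → 1 shape picked for illustration, with tanh as the squish. The only point is the wiring: what comes out of one layer goes into the next.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(inputs, weights, bias):
    return np.tanh(inputs @ weights + bias)    # multiply, add, squish

# Random, untrained weights (toy sizes chosen for this sketch).
w1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
w2, b2 = rng.normal(size=(8, 8)), np.zeros(8)
w3, b3 = rng.normal(size=(8, 1)), np.zeros(1)

points = np.array([[0.5, -1.0], [2.0, 0.3]])   # two dots on the graph paper
h1 = layer(points, w1, b1)     # hidden layer 1 looks at the raw inputs
h2 = layer(h1, w2, b2)         # hidden layer 2 looks at layer 1's answers
out = h2 @ w3 + b3             # one output score per dot

n_stored = sum(a.size for a in (w1, b1, w2, b2, w3, b3))
print(h1.shape, h2.shape, out.shape)   # (2, 8) (2, 8) (2, 1)
print(n_stored)                        # 105 numbers to learn
```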

Why does that help? Chapter 8's network built its shape out of straight lines, because every hidden neuron looks at the raw inputs, and a single neuron's question is always a line. A second hidden layer builds shapes out of those shapes: layer 1 makes lines, layer 2 bends and combines them into curves and pockets. That's the "edges → shapes → objects" idea from chapter 13, shrunk down small enough to watch.

Here's a puzzle that makes depth visible: two spirals, tangled around each other. Every dot is a training example — its position on the graph paper is the input (just two numbers, x and y), and its color is the right answer. This is chapter 6's dots-on-graph-paper all grown up. XOR needed one hidden layer; the spiral is XOR's final boss.

▸ try it

One hidden layer vs two — fight

Pick a brain, then train it. The background shows the network's current opinion about every point on the map; the dots are the training examples it has to match. The shallow brains get as many (or more) neurons as the deep one — watch who actually learns the spiral.

1 hidden layer × 8

1 hidden layer × 16

2 hidden layers × 8 + 8

What each hidden neuron is looking for

Each little map shows where on the graph paper one hidden neuron gets excited (blue) or pushes back (red). After training, layer 1's maps are mostly stripes — straight-line questions. Layer 2's maps are bent: shapes built out of layer 1's stripes.

The model, as stored

Train each brain for 5,000 epochs and compare. The shallow brains get stuck near coin-flip accuracy. More neurons in the same single layer barely help, because one layer of lines, squished once, can't follow a curve that keeps curling. The two-layer brain finds the spiral. Then check the neuron maps: layer 1's stay stripey, while layer 2's are bent. Each layer-2 neuron is a combination of layer-1 stripes, and bent stripes are exactly what a spiral is made of.

One honesty note: in theory a single hidden layer can fit almost any shape if you make it absurdly wide — a famous result called the universal approximation theorem. But "possible in theory" and "actually findable by training" are very different things. Depth makes good solutions reachable with fewer neurons and less training, and that gap grows as problems get harder. That's the practical reason every serious network is deep.

Two footnotes for the curious. First, the hidden squisher here is tanh — sigmoid's cousin that outputs −1 to +1 instead of 0 to 1. It trains more smoothly on this kind of puzzle, but it's the same "squish" idea. Second, you already own a real two-hidden-layer network: neuro9.py (784 → 128 → 64 → 10) is this widget's architecture scaled up to read handwritten digits. The spiral is the classic depth demo from TensorFlow Playground, where you can build networks up to six layers deep in your browser.

Challenge

What's the smallest brain that gets above 95% on the spiral? And give the one-layer × 16 brain a truly fair shot: press "train 5,000" four times in a row. Does it ever escape?

Going deeper — convnets, transformers, and beyond

Same machinery scales up. Two big architectural ideas you'll meet from here:

Convolutional networks (CNNs) — for images

The "hello world" of deep learning is MNIST: 60,000 training images of handwritten digits plus 10,000 test images (28×28 pixels, 10 classes). A plain fully-connected network like the snack picker can hit ~97% accuracy on MNIST. State of the art with convolutional networks is ~99.8%.

The change: replace nn.Linear in the early layers with nn.Conv2d. Each neuron only looks at a small spatial patch of the image (say, 3×3 pixels) and shares its weights across all positions. This means the same edge detector gets applied everywhere in the image. That is a great fit for pictures, because what something is doesn't depend on where it sits: a cat is a cat whether it's in the top-left or bottom-right of the photo. (The technical name for this is translation invariance.)
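Here is what "one small patch, same weights everywhere" looks like as a numpy sketch. The 3×3 filter is hand-picked to respond to vertical edges (a trained CNN would learn its own filters), and the loop is written for clarity, not speed. It slides the filter without flipping it, which is also what deep learning libraries do under the name "convolution".

```python
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            patch = image[r:r + kh, c:c + kw]      # one small window of pixels
            out[r, c] = np.sum(patch * kernel)     # the same 9 weights at every position
    return out

# 6x6 toy image: dark on the left half, bright on the right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-picked vertical-edge detector: "bright on my right, dark on my left?"
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

response = conv2d(image, kernel)
print(response)    # every row is [0. 3. 3. 0.]: it fires only where the edge is
```

Move the bright region one pixel to the right (image[:, 4:] = 1.0 on a fresh image) and the response simply shifts with it. Nothing had to be relearned for the new position.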

Pattern: Conv → ReLU → Pool → Conv → ReLU → Pool → ... → Linear. CNNs are everywhere images are: medical scans, face recognition, self-driving car perception, photo classification. The 1998 paper that started it all is LeCun et al.'s "Gradient-Based Learning Applied to Document Recognition", which introduced LeNet-5. The 2012 paper that proved CNNs could win the ImageNet competition for general image recognition is Krizhevsky, Sutskever, and Hinton's AlexNet paper. If you want to see what conv filters actually look like, the distill.pub article "Feature Visualization" by Chris Olah and co-authors is a masterpiece.

Transformers — for sequences (and now everything)

You already met the basic idea in chapter 14. For sequences (text, audio, code), replace fully-connected hidden layers with attention blocks. Each token gets to look at the other tokens (in GPT-style models, only the ones that came before it) and pull a weighted blend of their values into its own representation. The output is still a softmax over a vocabulary, one token at a time.
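A single attention step fits in a short numpy sketch. The weights here are random and untrained, the sizes (4 tokens, 8 numbers each) are picked for illustration, and the names wq, wk, wv are just labels chosen for this example. Real transformers run many of these side by side ("heads") and stack them with feed-forward layers, which this sketch leaves out.

```python
import numpy as np

def softmax_rows(scores):
    scores = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def attention(x, wq, wk, wv, causal=True):
    q, k, v = x @ wq, x @ wk, x @ wv            # each token makes a query, a key, a value
    scores = q @ k.T / np.sqrt(k.shape[1])      # how well each query matches each key
    if causal:
        # GPT-style rule: a token may not look at tokens that come after it.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax_rows(scores)              # each row adds up to 1
    return weights @ v, weights                 # weighted blend of the values

rng = np.random.default_rng(0)
n_tokens, width = 4, 8
x = rng.normal(size=(n_tokens, width))          # stand-in for 4 token embeddings
wq, wk, wv = (rng.normal(size=(width, width)) for _ in range(3))

blended, weights = attention(x, wq, wk, wv)
print(weights.round(2))     # lower-triangular: zeros above the diagonal
print(blended.shape)        # (4, 8): one updated vector per token
```

Row 0 of the weights is always [1, 0, 0, 0] with the causal rule on, because the first token has nothing before it to look at except itself.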

Every modern LLM (GPT, Claude, Gemini, Llama) is a stack of transformer blocks. Plus a few practical tricks (positional embeddings, layer normalization, mixture of experts), but the core is just stacked attention + feed-forward layers, with softmax at the end. The original paper is Vaswani et al., "Attention Is All You Need" (2017). The kid-friendly walkthrough is Jay Alammar's "The Illustrated Transformer".

The honest summary

Whether you're building a snack classifier, a digit recognizer, or a frontier LLM, the same primitives keep showing up:

Weighted sums (matrix multiplies)
Non-linearities (ReLU and its cousins)
Softmax at the output
Backpropagation through everything

The math you wrote by hand in chapter 8 still shows up inside GPT-class models. The advanced systems add a lot of scale, data, architecture, and engineering, but the basic training idea should feel recognizable now.

▸ try it — comparison: how a not-neural-network does it

Draw a digit (template matcher, not a neural net)

Important: this widget is not a trained neural network. It's a "template matcher" — a much older, simpler trick where we hand-make a few example pictures of each digit and find the closest match to your drawing using cosine similarity. We've left it here as a comparison: drawing a few digits will show you exactly how brittle hand-coded matching is. The real MNIST neural network is the next script (neuro9.py) — to run it you'll need PyTorch locally.

Try drawing the same digit skinny, wide, tilted, or off-center. The score changes because the template matcher is brittle. A trained neural network learns many versions of each digit.
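The brittleness is easy to reproduce in a few lines. This sketch is not the widget's actual code: the 5×5 templates are made up for this example, and there are only two of them. It scores a drawing against each template with cosine similarity and picks the best match.

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = a.ravel(), b.ravel()                 # flatten the grids into plain lists
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made 5x5 templates (invented for this sketch): 1 = ink, 0 = blank.
TEMPLATES = {
    "1": np.array([[0, 0, 1, 0, 0]] * 5, dtype=float),
    "0": np.array([[0, 1, 1, 1, 0],
                   [1, 0, 0, 0, 1],
                   [1, 0, 0, 0, 1],
                   [1, 0, 0, 0, 1],
                   [0, 1, 1, 1, 0]], dtype=float),
}

def classify(drawing):
    scores = {digit: cosine_similarity(drawing, t) for digit, t in TEMPLATES.items()}
    return max(scores, key=scores.get), scores

centered_one = np.array([[0, 0, 1, 0, 0]] * 5, dtype=float)   # a "1" down the middle
shifted_one = np.array([[1, 0, 0, 0, 0]] * 5, dtype=float)    # the same "1", slid left

print(classify(centered_one)[0])   # 1
print(classify(shifted_one)[0])    # 0  (wrong: the shifted stroke overlaps the ring instead)
```

The shifted "1" shares no pixels with the "1" template, so its score there is exactly zero. Nothing in a template matcher knows that a stroke is still the same stroke after it moves.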

To run the bigger example yourself: neuro9.py in this folder trains a real MNIST classifier with PyTorch. Needs pip install torch torchvision and a few minutes of CPU (or seconds on a GPU). After 5 epochs you'll see ~98% test accuracy. That's your network classifying handwritten digits well — which, when you remember that we started this story with a single neuron deciding "screen time: yes or no", is a pretty good place to stop.

If you finished this book and want to go further

The short list of places to go next:

Watch 3Blue1Brown's neural networks series. Same ideas, animated beautifully.
Do Andrej Karpathy's "Neural Networks: Zero to Hero". From-scratch backprop in Python, building up to a working GPT. Direct continuation of this book.
Read Karpathy's older essay "The Unreasonable Effectiveness of Recurrent Neural Networks" — written before GPT existed but it's the essay that made a lot of people fall in love with the field.
If you want a real textbook: Goodfellow, Bengio, and Courville's Deep Learning is free online.
If you want a course: fast.ai's "Practical Deep Learning for Coders" is the most-recommended hands-on next step.

▸ run locally

Train the real MNIST model

This is neuro9.py. It downloads MNIST on first run and trains a real PyTorch digit classifier.

pip install torch torchvision
python3 neuro9.py

neuro9.py local Python script

"""
Step 9: scale up to MNIST. 60k handwritten digits for training (plus 10k
held back for testing), 28x28 pixels each, 10 classes (0-9).
Real-size problem, real benchmark, real dataset.

Same shape as neuro8.py, just bigger:
  - input layer: 784  (28 * 28 flattened pixels)
  - hidden 1:    128
  - hidden 2:    64
  - output:      10   (one logit per digit class)

New compared to neuro8.py:
  - torchvision for dataset download / preprocessing
  - DataLoader for batching + shuffling + parallel loading
  - Two hidden layers (not one) -> our first "deep" net
  - Optional GPU via .to(device)
  - Train accuracy AND test accuracy each epoch -> gap = overfitting signal

This is the canonical "hello world" of deep learning.
Expect ~97-98% test accuracy after a few epochs.

Run:
    pip install torch torchvision
    python3 neuro9.py
"""

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# ---- device ----
# Picks GPU when available, else CPU. PyTorch moves any tensor or
# module with `.to(device)` to that hardware.
device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)
print(f"using device: {device}")

# ---- data ----
# transforms.ToTensor() converts PIL image -> float tensor in [0, 1].
# Normalize subtracts mean and divides by std (MNIST conventional values).
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])

# Downloads to ./data on first run (~12MB). After that it's cached.
train_set = datasets.MNIST(root="./data", train=True,  download=True, transform=transform)
test_set  = datasets.MNIST(root="./data", train=False, download=True, transform=transform)

# DataLoader handles batching, shuffling, parallel CPU loading.
# num_workers=0 keeps everything in the main process. With
# num_workers>0, systems that start workers by "spawning" (macOS,
# Windows) re-import this file in every worker. This script has no
# `if __name__ == "__main__":` guard, so that can crash.
# 0 is slower but safe everywhere.
train_loader = DataLoader(train_set, batch_size=128, shuffle=True,  num_workers=0)
test_loader  = DataLoader(test_set,  batch_size=512, shuffle=False, num_workers=0)

# ---- model ----
# nn.Flatten turns (batch, 1, 28, 28) into (batch, 784).
hidden1 = 128   # try 128, then 16
hidden2 = 64
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, hidden1),
    nn.ReLU(),
    nn.Linear(hidden1, hidden2),
    nn.ReLU(),
    nn.Linear(hidden2, 10),
).to(device)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# ---- helpers ----
def evaluate(loader):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            pred = model(x).argmax(dim=1)
            correct += (pred == y).sum().item()
            total   += y.size(0)
    return correct / total

# ---- training ----
epochs = 5
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        logits = model(x)
        loss = loss_fn(logits, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    train_acc = evaluate(train_loader)
    test_acc  = evaluate(test_loader)
    avg_loss  = running_loss / len(train_loader)
    print(
        f"epoch {epoch+1}: "
        f"loss {avg_loss:.4f}  "
        f"train_acc {train_acc:.4f}  "
        f"test_acc {test_acc:.4f}"
    )

# ---- inspect a few predictions ----
print("\nsample predictions on test set:")
model.eval()
with torch.no_grad():
    x, y = next(iter(test_loader))
    x, y = x.to(device), y.to(device)
    pred = model(x).argmax(dim=1)
    for i in range(10):
        ok = "OK " if pred[i] == y[i] else "BAD"
        print(f"  {ok}  true={y[i].item()}  predicted={pred[i].item()}")

Run locally with: python3 neuro9.py
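If you'd like to check a few of the script's numbers by hand before installing anything, here is a small sketch in plain Python (no PyTorch needed). The mean, std, batch size, epoch count, and training-set size come from neuro9.py. The helper names and the list of ten logits are made up for illustration, and the real script does these steps with PyTorch tensors rather than Python lists.

```python
import math

# Values copied from neuro9.py.
MEAN = 0.1307
STD = 0.3081
BATCH_SIZE = 128
EPOCHS = 5
TRAIN_IMAGES = 60_000

def normalize(pixel):
    """What transforms.Normalize does to one pixel: subtract the mean, divide by the std."""
    return (pixel - MEAN) / STD

def argmax(scores):
    """Return the position of the biggest score, like .argmax(dim=1) does for each row."""
    return max(range(len(scores)), key=lambda i: scores[i])

# 1. Normalizing: a blank background pixel (0.0) and a full-ink pixel (1.0).
background = normalize(0.0)
ink = normalize(1.0)
print(f"background pixel becomes {background:.4f}")
print(f"full-ink pixel becomes   {ink:.4f}")

# 2. Batches: how many weight updates happen during training.
batches_per_epoch = math.ceil(TRAIN_IMAGES / BATCH_SIZE)  # the last batch is smaller
total_updates = batches_per_epoch * EPOCHS
print(f"{batches_per_epoch} batches per epoch, {total_updates} updates in total")

# 3. Reading the answer: ten made-up logits, one per digit 0-9.
logits = [-1.2, 0.3, 0.1, 4.7, -0.5, 0.0, -2.0, 1.1, 0.9, -0.3]
predicted_digit = argmax(logits)
print(f"predicted digit: {predicted_digit}")
```

The biggest logit sits at position 3, so this pretend network is guessing "3". Notice that the network never outputs the digit directly; it outputs ten scores and we pick the winner.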

Challenge

Run neuro9.py, then reduce the first hidden layer from 128 neurons to 16. Watch how the test accuracy changes.
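Before you run the challenge, it can help to see how much smaller the network gets. This sketch counts the weights and biases using the layer sizes from neuro9.py; the function name count_parameters is made up for illustration. It relies on one fact about fully connected layers: a layer with n_in inputs and n_out neurons has n_in × n_out weights plus n_out biases.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a stack of fully connected layers.

    layer_sizes lists the width of each layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = n_in * n_out  # one weight per input, per neuron
        biases = n_out          # one bias per neuron
        total += weights + biases
    return total

original = count_parameters([784, 128, 64, 10])  # hidden1 = 128
smaller = count_parameters([784, 16, 64, 10])    # hidden1 = 16

print(f"hidden1 = 128: {original} numbers to learn")
print(f"hidden1 = 16:  {smaller} numbers to learn")
```

With 128 neurons in the first hidden layer there are 109,386 numbers to learn; with 16 there are 14,298. Almost all of the difference is in the very first layer, because that is the one connected to all 784 pixels. Run the script both ways to see what that does to the accuracy.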

How a neural network actually works

1. Tiny math map

2. The big picture

3. One tiny neuron

What does sigmoid actually look like?

Play with a neuron

4. Training the neuron

Watch one nudge happen, with real numbers

Do one nudge yourself

What the neuron says right now

The nudge this example asks for

The model itself, as the computer stores it

Train the neuron yourself

The model being rewritten

Draw the line yourself

Run the actual script — in your browser

5. What training actually changes

6. Tables, dots, and lines

7. When one neuron isn't enough

Why only one line?

Watch a single neuron fail

The model that can't win

Try to draw a line — you can't

Watch a real single-layer perceptron give up

8. Hidden neurons fix it

Now watch it succeed

The bigger model

The piece of math that makes this work: the chain rule

Watch the wires light up

Output math

Run a real 2-layer network on XOR

9. Picking from a list

The trained snack picker

Run the real multi-class snack picker

10. What does that "squish" actually do?

Drag the dot. Watch the curve.

What does the derivative tell us?

Now switch to ReLU and compare

Why we replaced sigmoid in deep networks

Run the ReLU version

11. Meal guesser

Train a meal guesser

Traits

Network shape

Prediction

How the clues become numbers

What training saved — the model itself, live

Run the meal guesser

12. How learning actually works

The loss landscape

The gradient is just the slope

Roll the ball downhill

Why we take small steps (learning rate)

Local minima — why we don't usually care

Mini-batches: noisy descent

13. Scaling this up

What stays the same when you scale up

What changes when you scale up

What gets harder when you scale up

Why bigger keeps working

14. How AI chatbots actually write

The tokens trick

So… is there a lookup table? (Yes — two.)

Tokenize anything — watch text become numbers (and come back)

Lookup #1 — what the model actually receives

Lookup #2 — each id grabs its row of the embedding table

And backwards — output is the same table, read the other way

Each token becomes a long list of numbers

The model's only job (during pre-training): predict the next token

Walk through the full pipeline

Stage 1 — chop the text into tokens

Stage 2 — look up each token's "fingerprint"

Stage 3 — tokens look at each other (attention)

Stage 4 — score every possible next token

Watch a story get written one token at a time

Sampling: why the same prompt gives different answers

Pre-training vs. ChatGPT-the-product (RLHF)

15. What to remember so far

Going deeper — training on a real dataset

1. Train / test split