
Explanation for Vectorized Implementation

Table of contents

  1. Introduction
  2. The Core Question
  3. Step-by-Step Derivation
  4. Matrix Form: Stacking Examples
  5. Adding Back the Bias
  6. Generalizing to All Layers
  7. Complete Recap
  8. Symmetry in Neural Network Layers
  9. Key Takeaways

Introduction

In the previous lesson, we saw how to vectorize forward propagation by stacking training examples as columns in matrix $X$. Now let’s provide a mathematical justification for why this vectorization is correct.

The Core Question

Why does this equation work for multiple examples?

\[Z^{[1]} = W^{[1]} X + b^{[1]}\]

Let’s prove it step by step.

Step-by-Step Derivation

Individual Example Computations

For individual training examples, we compute:

Example 1:

\[z^{[1](1)} = W^{[1]} x^{(1)} + b^{[1]}\]

Example 2:

\[z^{[1](2)} = W^{[1]} x^{(2)} + b^{[1]}\]

Example 3:

\[z^{[1](3)} = W^{[1]} x^{(3)} + b^{[1]}\]

Simplification: Ignoring Bias (Temporarily)

To make the explanation clearer, let’s temporarily assume $b^{[1]} = 0$. We’ll add it back later:

\[z^{[1](1)} = W^{[1]} x^{(1)}\] \[z^{[1](2)} = W^{[1]} x^{(2)}\] \[z^{[1](3)} = W^{[1]} x^{(3)}\]

Understanding the Matrix Multiplication

Each computation produces a column vector:

\[W^{[1]} x^{(1)} = \text{column vector}\] \[W^{[1]} x^{(2)} = \text{column vector}\] \[W^{[1]} x^{(3)} = \text{column vector}\]

For example, if $W^{[1]}$ has shape $(4, 3)$ and $x^{(i)}$ has shape $(3, 1)$, then each result is a $(4, 1)$ column vector.
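
As a quick check of these shapes, here is a minimal NumPy sketch (the concrete sizes and names such as W1 and x1 are illustrative, not from the lesson):

import numpy as np

W1 = np.random.randn(4, 3)   # weights for layer 1: 4 units, 3 input features
x1 = np.random.randn(3, 1)   # one training example as a column vector

z1 = np.dot(W1, x1)          # matrix-vector product
print(z1.shape)              # (4, 1) -- a column vector, one entry per hidden unit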

Matrix Form: Stacking Examples

Building Matrix $X$

Now, create matrix $X$ by stacking examples as columns:

\[X = \begin{bmatrix} | & | & | \\ x^{(1)} & x^{(2)} & x^{(3)} \\ | & | & | \end{bmatrix}\]

Shape: $(n_x, 3)$ for 3 examples with $n_x$ features.

Computing $W^{[1]} X$

When we multiply $W^{[1]} X$, matrix multiplication produces:

\[W^{[1]} X = W^{[1]} \begin{bmatrix} | & | & | \\ x^{(1)} & x^{(2)} & x^{(3)} \\ | & | & | \end{bmatrix} = \begin{bmatrix} | & | & | \\ W^{[1]} x^{(1)} & W^{[1]} x^{(2)} & W^{[1]} x^{(3)} \\ | & | & | \end{bmatrix}\]

Key insight: Matrix multiplication applies $W^{[1]}$ to each column of $X$ independently!

Result is $Z^{[1]}$

The result is exactly the matrix $Z^{[1]}$:

\[Z^{[1]} = \begin{bmatrix} | & | & | \\ z^{[1](1)} & z^{[1](2)} & z^{[1](3)} \\ | & | & | \end{bmatrix}\]

where each column is the result for one training example.
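
To see this column-by-column behavior concretely, the following sketch (with illustrative names, not from the lesson) stacks three example vectors into X and checks that each column of W1 @ X matches the per-example product:

import numpy as np

W1 = np.random.randn(4, 3)
x1, x2, x3 = (np.random.randn(3, 1) for _ in range(3))

X = np.hstack([x1, x2, x3])   # shape (3, 3): one example per column
Z1 = np.dot(W1, X)            # shape (4, 3): one result per column

# Column i of Z1 equals W1 @ x_i computed individually
print(np.allclose(Z1[:, 0:1], np.dot(W1, x1)))  # True
print(np.allclose(Z1[:, 1:2], np.dot(W1, x2)))  # True
print(np.allclose(Z1[:, 2:3], np.dot(W1, x3)))  # True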

Adding Back the Bias

Python Broadcasting

When we add $b^{[1]}$ back:

\[Z^{[1]} = W^{[1]} X + b^{[1]}\]

Python’s broadcasting automatically adds $b^{[1]}$ to each column of $W^{[1]} X$.

How Broadcasting Works

If $b^{[1]}$ has shape $(4, 1)$ and $W^{[1]} X$ has shape $(4, 3)$:

\[\begin{bmatrix} w_{11}x^{(1)}_1 + \cdots & w_{11}x^{(2)}_1 + \cdots & w_{11}x^{(3)}_1 + \cdots \\ w_{21}x^{(1)}_1 + \cdots & w_{21}x^{(2)}_1 + \cdots & w_{21}x^{(3)}_1 + \cdots \\ w_{31}x^{(1)}_1 + \cdots & w_{31}x^{(2)}_1 + \cdots & w_{31}x^{(3)}_1 + \cdots \\ w_{41}x^{(1)}_1 + \cdots & w_{41}x^{(2)}_1 + \cdots & w_{41}x^{(3)}_1 + \cdots \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix}\]

becomes:

\[\begin{bmatrix} (\cdots) + b_1 & (\cdots) + b_1 & (\cdots) + b_1 \\ (\cdots) + b_2 & (\cdots) + b_2 & (\cdots) + b_2 \\ (\cdots) + b_3 & (\cdots) + b_3 & (\cdots) + b_3 \\ (\cdots) + b_4 & (\cdots) + b_4 & (\cdots) + b_4 \end{bmatrix}\]

Broadcasting adds $b^{[1]}$ to every column automatically!
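
Here is a small sketch of that broadcasting step, with shapes chosen to match the example above (the names WX and b1 are illustrative):

import numpy as np

WX = np.random.randn(4, 3)   # stands in for W1 @ X, shape (4, 3)
b1 = np.random.randn(4, 1)   # bias column vector, shape (4, 1)

Z1 = WX + b1                 # b1 is broadcast across the 3 columns
print(Z1.shape)              # (4, 3)
print(np.allclose(Z1[:, 1:2], WX[:, 1:2] + b1))  # True: same bias added to every column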

Generalizing to All Layers

The same logic applies to all four equations:

For-Loop Version (Non-Vectorized)

for i in range(m):
    x_i = X[:, i:i+1]                 # i-th example as a column vector
    z1_i = np.dot(W1, x_i) + b1
    a1_i = sigmoid(z1_i)
    z2_i = np.dot(W2, a1_i) + b2
    a2_i = sigmoid(z2_i)

Vectorized Version

Z1 = np.dot(W1, X) + b1   # one column per example
A1 = sigmoid(Z1)
Z2 = np.dot(W2, A1) + b2
A2 = sigmoid(Z2)

Each equation works by the same principle:

  1. Stack inputs as columns in a matrix
  2. Matrix multiplication processes all columns simultaneously
  3. Broadcasting adds bias to all columns
  4. Element-wise operations (like sigmoid) apply to each element
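
To confirm numerically that the for-loop and vectorized versions above agree, here is a hedged end-to-end check; the shapes, the random initialization, and the sigmoid helper are assumptions made for the sketch:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_x, n1, n2, m = 3, 4, 1, 5
rng = np.random.default_rng(0)
X = rng.standard_normal((n_x, m))
W1, b1 = rng.standard_normal((n1, n_x)), rng.standard_normal((n1, 1))
W2, b2 = rng.standard_normal((n2, n1)), rng.standard_normal((n2, 1))

# Vectorized forward pass
A2 = sigmoid(np.dot(W2, sigmoid(np.dot(W1, X) + b1)) + b2)

# For-loop forward pass, one column at a time
A2_loop = np.hstack([
    sigmoid(np.dot(W2, sigmoid(np.dot(W1, X[:, i:i+1]) + b1)) + b2)
    for i in range(m)
])

print(np.allclose(A2, A2_loop))  # True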

Complete Recap

Figure: a 2-layer neural network with three input nodes $x_1, x_2, x_3$ feeding four hidden units, which connect to a single output $\hat{y}$; each connection carries a weight, and inputs flow layer by layer through weighted sums and activation functions to the final prediction.

Non-Vectorized (One Example at a Time)

\[\text{for } i = 1 \text{ to } m:\] \[z^{[1](i)} = W^{[1]} x^{(i)} + b^{[1]}\] \[a^{[1](i)} = \sigma(z^{[1](i)})\] \[z^{[2](i)} = W^{[2]} a^{[1](i)} + b^{[2]}\] \[a^{[2](i)} = \sigma(z^{[2](i)})\]

Vectorized (All Examples Simultaneously)

Stack examples as columns:

\[X = [x^{(1)} \ x^{(2)} \ \cdots \ x^{(m)}]\] \[Z^{[1]} = [z^{[1](1)} \ z^{[1](2)} \ \cdots \ z^{[1](m)}]\] \[A^{[1]} = [a^{[1](1)} \ a^{[1](2)} \ \cdots \ a^{[1](m)}]\] \[Z^{[2]} = [z^{[2](1)} \ z^{[2](2)} \ \cdots \ z^{[2](m)}]\] \[A^{[2]} = [a^{[2](1)} \ a^{[2](2)} \ \cdots \ a^{[2](m)}]\]

Then compute:

\[Z^{[1]} = W^{[1]} X + b^{[1]}\] \[A^{[1]} = \sigma(Z^{[1]})\] \[Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}\] \[A^{[2]} = \sigma(Z^{[2]})\]

Symmetry in Neural Network Layers

Notation: $X = A^{[0]}$

Recall that the input features can be written as:

\[X = A^{[0]}\]

This means $x^{(i)} = a^{[0](i)}$ for each training example.

Layer 1 Equations

\[Z^{[1]} = W^{[1]} A^{[0]} + b^{[1]}\] \[A^{[1]} = \sigma(Z^{[1]})\]

Layer 2 Equations

\[Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}\] \[A^{[2]} = \sigma(Z^{[2]})\]

Pattern Recognition

Notice the repeating pattern:

\[Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}\] \[A^{[l]} = \sigma(Z^{[l]})\]

Each layer performs the same computation:

  1. Linear transformation: $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$
  2. Non-linear activation: $A^{[l]} = \sigma(Z^{[l]})$

This pattern extends to deep neural networks with many layers - each layer just repeats these two steps!
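
As a sketch of how this pattern generalizes, a forward pass over any number of layers is just the same two lines inside a loop; the function name forward and the parameter lists Ws and bs are assumptions made for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, Ws, bs):
    # Ws and bs hold the per-layer weights and biases; X plays the role of A^[0].
    A = X
    for W, b in zip(Ws, bs):
        Z = np.dot(W, A) + b   # linear step: Z^[l] = W^[l] A^[l-1] + b^[l]
        A = sigmoid(Z)         # non-linear activation: A^[l] = sigma(Z^[l])
    return A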

Key Takeaways

  1. Matrix multiplication naturally implements the “loop” over training examples
  2. Broadcasting adds bias vectors to all columns automatically
  3. Stacking as columns: $X = [x^{(1)} \ x^{(2)} \ \cdots \ x^{(m)}]$ enables vectorization
  4. All four equations vectorize the same way - stack inputs, apply operations
  5. Neural network layers follow a repeating pattern: $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$, then $A^{[l]} = \sigma(Z^{[l]})$
  6. Deep networks just repeat this pattern more times
  7. Vectorization is mathematically equivalent to for-loops but much faster