
Vectorizing Across Multiple Examples

Table of contents

  1. Introduction
  2. Review: Single Example Forward Propagation
  3. The Problem: Multiple Examples
  4. Non-Vectorized Implementation (Slow)
  5. Vectorized Implementation (Fast)
  6. Implementation
  7. Understanding Matrix Dimensions
  8. How to Think About Matrix Layout
  9. Comparison: For-Loop vs Vectorized
  10. Why This Works
  11. Key Takeaways

Introduction

In the previous lesson, we learned how to compute predictions for a single training example. Now we’ll vectorize across multiple training examples to process the entire dataset simultaneously - similar to what we did for logistic regression.

By stacking training examples as columns in a matrix, we can reuse the same 4 equations, with minimal changes, to compute outputs for all examples at once.

Figure: a two-layer neural network with three input features $x_1, x_2, x_3$ feeding a hidden layer of four units, which connects to a single output node producing $\hat{y}$; every input connects to every hidden unit, and every hidden unit connects to the output.

Review: Single Example Forward Propagation

For a single training example $x$, we computed:

\[z^{[1]} = W^{[1]} x + b^{[1]}\] \[a^{[1]} = \sigma(z^{[1]})\] \[z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}\] \[a^{[2]} = \sigma(z^{[2]}) = \hat{y}\]
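
As a concrete reference point, here is a minimal NumPy sketch of these four equations for one example. The sizes (3 input features, 4 hidden units, 1 output) match the network in the figure; the variable names and the random initialization are illustrative assumptions, not a prescribed setup.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative sizes: n_x = 3 features, n1 = 4 hidden units, n2 = 1 output unit
n_x, n1, n2 = 3, 4, 1
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((n1, n_x)), np.zeros((n1, 1))
W2, b2 = rng.standard_normal((n2, n1)), np.zeros((n2, 1))

x = rng.standard_normal((n_x, 1))   # a single training example, shape (3, 1)

z1 = np.dot(W1, x) + b1             # shape (4, 1)
a1 = sigmoid(z1)                    # shape (4, 1)
z2 = np.dot(W2, a1) + b2            # shape (1, 1)
a2 = sigmoid(z2)                    # y-hat for this single example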

The Problem: Multiple Examples

With $m$ training examples, we need predictions for each:

\[x^{(1)} \rightarrow \hat{y}^{(1)} = a^{[2](1)}\] \[x^{(2)} \rightarrow \hat{y}^{(2)} = a^{[2](2)}\] \[\vdots\] \[x^{(m)} \rightarrow \hat{y}^{(m)} = a^{[2](m)}\]

Notation Clarification

\[a^{[l](i)}\]
  • Square brackets $[l]$: layer number
  • Round brackets $(i)$: training example number

So $a^{[2](3)}$ means “activation from layer 2 for training example 3”.

Non-Vectorized Implementation (Slow)

The naive approach uses a for-loop:

# Naive approach: loop over the m training examples one at a time
for i in range(m):
    x_i = X[:, i:i+1]              # x^(i): the i-th example as a column vector

    # Hidden layer: z^[1](i) and a^[1](i)
    z1_i = np.dot(W1, x_i) + b1
    a1_i = sigmoid(z1_i)

    # Output layer: z^[2](i) and a^[2](i)
    z2_i = np.dot(W2, a1_i) + b2
    a2_i = sigmoid(z2_i)

Problem: Processing one example at a time is inefficient. Let’s vectorize!

Vectorized Implementation (Fast)

Step 1: Stack Training Examples as Columns

Create the input matrix $X$ by stacking examples horizontally:

\[X = \begin{bmatrix} | & | & & | \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ | & | & & | \end{bmatrix}\]

Shape: $(n_x, m)$ where $n_x$ = number of features, $m$ = number of examples
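
For example, a small sketch of building $X$ by stacking three hypothetical examples (each a column vector of shape $(n_x, 1)$) with NumPy; the values are made up purely for illustration:

import numpy as np

# Three hypothetical examples with n_x = 3 features each, as column vectors
x1 = np.array([[1.0], [2.0], [3.0]])
x2 = np.array([[4.0], [5.0], [6.0]])
x3 = np.array([[7.0], [8.0], [9.0]])

X = np.hstack([x1, x2, x3])    # shape (3, 3): features down the rows, examples across the columns
print(X[:, 1])                 # the second training example, x^(2)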

Step 2: Use Capital Letter Matrices

Replace lowercase vectors with uppercase matrices:

Single example notation:

\[x, \quad z^{[1]}, \quad a^{[1]}, \quad z^{[2]}, \quad a^{[2]}\]

Multiple examples notation (stacked as columns):

\[X = [x^{(1)} \ x^{(2)} \ \cdots \ x^{(m)}], \quad \text{shape: } (n_x, m)\] \[Z^{[1]} = [z^{[1](1)} \ z^{[1](2)} \ \cdots \ z^{[1](m)}], \quad \text{shape: } (n^{[1]}, m)\] \[A^{[1]} = [a^{[1](1)} \ a^{[1](2)} \ \cdots \ a^{[1](m)}], \quad \text{shape: } (n^{[1]}, m)\] \[Z^{[2]} = [z^{[2](1)} \ z^{[2](2)} \ \cdots \ z^{[2](m)}], \quad \text{shape: } (n^{[2]}, m)\] \[A^{[2]} = [a^{[2](1)} \ a^{[2](2)} \ \cdots \ a^{[2](m)}], \quad \text{shape: } (n^{[2]}, m)\]

Step 3: Vectorized Forward Propagation

Simply replace lowercase with uppercase:

\[Z^{[1]} = W^{[1]} X + b^{[1]}\] \[A^{[1]} = \sigma(Z^{[1]})\] \[Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}\] \[A^{[2]} = \sigma(Z^{[2]})\]

That’s it! The same 4 equations, just with capital letters.

Implementation

# Vectorized forward propagation for m examples
# Hidden layer
Z1 = np.dot(W1, X) + b1      # Shape: (4, m)
A1 = sigmoid(Z1)              # Shape: (4, m)

# Output layer
Z2 = np.dot(W2, A1) + b2     # Shape: (1, m)
A2 = sigmoid(Z2)              # Shape: (1, m)

# A2 contains predictions for all m examples
y_hat = A2
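
If the sigmoid outputs are interpreted as probabilities for binary classification (as in logistic regression), a common follow-up step is to threshold them at 0.5; this is an assumption about the task, not part of forward propagation itself:

# Convert probabilities to 0/1 predictions, shape (1, m)
predictions = (A2 > 0.5).astype(int)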

Understanding Matrix Dimensions

Let’s verify with $n_x = 3$ features, $n^{[1]} = 4$ hidden units, and $m = 100$ examples:

| Computation | Dimensions | Result |
| --- | --- | --- |
| $W^{[1]} X$ | $(4 \times 3) \cdot (3 \times 100)$ | $(4 \times 100)$ |
| $Z^{[1]} = W^{[1]} X + b^{[1]}$ | $(4 \times 100) + (4 \times 1)$ | $(4 \times 100)$ |
| $A^{[1]} = \sigma(Z^{[1]})$ | $\sigma((4 \times 100))$ | $(4 \times 100)$ |
| $W^{[2]} A^{[1]}$ | $(1 \times 4) \cdot (4 \times 100)$ | $(1 \times 100)$ |
| $Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$ | $(1 \times 100) + (1 \times 1)$ | $(1 \times 100)$ |
| $A^{[2]} = \sigma(Z^{[2]})$ | $\sigma((1 \times 100))$ | $(1 \times 100)$ |

Note: Broadcasting automatically handles adding $b^{[1]}$ (shape $(4,1)$) to each column of the $(4, 100)$ matrix.
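
A tiny sketch of what that broadcasting does: adding a $(4, 1)$ column vector to a $(4, 100)$ matrix copies the vector into every column (the placeholder values below are only for illustration).

import numpy as np

M = np.zeros((4, 100))                        # stands in for W1 @ X, shape (4, 100)
b = np.array([[1.0], [2.0], [3.0], [4.0]])    # bias column vector, shape (4, 1)

out = M + b                          # b is broadcast across all 100 columns
print(out.shape)                     # (4, 100)
print(np.allclose(out[:, 0:1], b))   # True: every column received the same bias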

How to Think About Matrix Layout

Horizontal Axis: Training Examples

Moving left to right across columns = scanning through training examples.

Vertical Axis: Nodes/Features

Moving top to bottom down rows = scanning through nodes (or features).

Example: Matrix $A^{[1]}$ with 4 Hidden Units and 5 Examples

\[A^{[1]} = \begin{bmatrix} a^{[1](1)}_1 & a^{[1](2)}_1 & a^{[1](3)}_1 & a^{[1](4)}_1 & a^{[1](5)}_1 \\ a^{[1](1)}_2 & a^{[1](2)}_2 & a^{[1](3)}_2 & a^{[1](4)}_2 & a^{[1](5)}_2 \\ a^{[1](1)}_3 & a^{[1](2)}_3 & a^{[1](3)}_3 & a^{[1](4)}_3 & a^{[1](5)}_3 \\ a^{[1](1)}_4 & a^{[1](2)}_4 & a^{[1](3)}_4 & a^{[1](4)}_4 & a^{[1](5)}_4 \end{bmatrix}\]

Interpretation:

  • Top-left $a^{[1](1)}_1$: Activation of hidden unit 1 on example 1
  • Moving down: Hidden units 2, 3, 4 on example 1
  • Moving right: Hidden unit 1 on examples 2, 3, 4, 5
  • Bottom-right $a^{[1](5)}_4$: Activation of hidden unit 4 on example 5
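
These interpretations map directly onto NumPy indexing. A small sketch, assuming A1 is an array holding this 4-by-5 matrix (remember that NumPy indices start at 0):

A1[0, 0]    # hidden unit 1 on example 1 (top-left entry)
A1[:, 0]    # all 4 hidden-unit activations for example 1 (one column)
A1[0, :]    # hidden unit 1 across all 5 examples (one row)
A1[3, 4]    # hidden unit 4 on example 5 (bottom-right entry)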

Same Pattern for All Matrices

| Matrix | Rows (vertical) | Columns (horizontal) |
| --- | --- | --- |
| $X$ | Input features ($x_1, x_2, \ldots, x_{n_x}$) | Training examples |
| $Z^{[1]}, A^{[1]}$ | Hidden units (nodes 1, 2, 3, 4) | Training examples |
| $Z^{[2]}, A^{[2]}$ | Output units | Training examples |

Comparison: For-Loop vs Vectorized

Non-Vectorized (Slow)

# Process one example at a time
for i in range(m):
    x_i = X[:, i:i+1]            # i-th example as a column vector
    z1_i = np.dot(W1, x_i) + b1
    a1_i = sigmoid(z1_i)
    z2_i = np.dot(W2, a1_i) + b2
    a2_i = sigmoid(z2_i)

Time: $O(m)$ iterations

Vectorized (Fast)

# Process all examples simultaneously
Z1 = np.dot(W1, X) + b1
A1 = sigmoid(Z1)
Z2 = np.dot(W2, A1) + b2
A2 = sigmoid(Z2)

Time: a fixed number of matrix operations; the per-example work still happens, but inside optimized, parallelized linear-algebra routines rather than an explicit Python loop
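
As a rough illustration of the difference, the sketch below times both versions on randomly generated data. The sizes and variable names are illustrative assumptions, and the measured speedup depends on hardware and the linear-algebra library backing NumPy.

import time
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative sizes: 3 features, 4 hidden units, 1 output, 100,000 examples
n_x, n1, m = 3, 4, 100_000
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((n1, n_x)), np.zeros((n1, 1))
W2, b2 = rng.standard_normal((1, n1)), np.zeros((1, 1))
X = rng.standard_normal((n_x, m))

# For-loop version
t0 = time.perf_counter()
preds_loop = np.zeros((1, m))
for i in range(m):
    x_i = X[:, i:i+1]
    a1_i = sigmoid(np.dot(W1, x_i) + b1)
    preds_loop[:, i:i+1] = sigmoid(np.dot(W2, a1_i) + b2)
t_loop = time.perf_counter() - t0

# Vectorized version
t0 = time.perf_counter()
A1 = sigmoid(np.dot(W1, X) + b1)
preds_vec = sigmoid(np.dot(W2, A1) + b2)
t_vec = time.perf_counter() - t0

print(f"loop: {t_loop:.3f}s  vectorized: {t_vec:.3f}s")
print("same predictions:", np.allclose(preds_loop, preds_vec))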

Why This Works

The vectorization works because matrix multiplication naturally implements the “loop” over training examples:

\[W^{[1]} X = W^{[1]} \begin{bmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(m)} \end{bmatrix} = \begin{bmatrix} W^{[1]} x^{(1)} & W^{[1]} x^{(2)} & \cdots & W^{[1]} x^{(m)} \end{bmatrix}\]

Each column of the result corresponds to processing one training example - exactly what the for-loop did, but computed in parallel!
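
Here is a quick numerical check of that identity, using small random matrices chosen purely for illustration: each column of $W^{[1]} X$ equals $W^{[1]}$ applied to the corresponding example.

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))       # (n1, n_x)
X = rng.standard_normal((3, 5))        # 5 examples with 3 features each

all_at_once = np.dot(W1, X)            # shape (4, 5)
one_by_one = np.hstack([np.dot(W1, X[:, i:i+1]) for i in range(X.shape[1])])

print(np.allclose(all_at_once, one_by_one))   # True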

Key Takeaways

  1. Stack training examples as columns to create matrices $X$, $Z^{[l]}$, $A^{[l]}$
  2. Replace lowercase with uppercase: $x \rightarrow X$, $z^{[l]} \rightarrow Z^{[l]}$, $a^{[l]} \rightarrow A^{[l]}$
  3. Same 4 equations work for both single and multiple examples (just change case)
  4. Matrix dimensions: $(n^{[l]}, m)$ where $n^{[l]}$ = units in layer $l$, $m$ = examples
  5. Horizontal (columns): Different training examples
  6. Vertical (rows): Different nodes/features in the layer
  7. Broadcasting handles adding bias vectors to matrices automatically
  8. Vectorization eliminates for-loops and leverages parallel computation