
Forward Propagation in a Deep Network

Table of contents

  1. Introduction
  2. Forward Propagation for a Single Example
  3. Vectorized Forward Propagation
  4. The For Loop is Necessary!
  5. Comparison: Single vs Vectorized
  6. Dimension Check Example
  7. Relationship to Shallow Networks
  8. Implementation Best Practices
  9. Complete Example
  10. Key Takeaways

Introduction

In the previous lesson, we defined deep L-layer neural networks and established notation for describing them. Now we’ll see how to implement forward propagation in deep networks.

We’ll cover forward propagation in two steps:

  1. Single training example: Compute activations for one example $x$
  2. Vectorized version: Process entire training set simultaneously

Key Insight: Forward propagation in deep networks is just repeating the same computation layer by layer!

Forward Propagation for a Single Example

Figure: A 4-layer fully connected network with three input features x₁, x₂, x₃ (layer 0), hidden layers of 5, 5, and 3 units (layers 1 to 3), and a single output ŷ (layer 4). Layer superscripts [0] through [4] and unit counts n⁰ through n⁴ label the diagram; the input layer is counted as layer 0.

Layer-by-Layer Computation

Given a single training example $x$, we compute activations layer by layer:

Layer 1 (First hidden layer):

\[z^{[1]} = W^{[1]} x + b^{[1]}\] \[a^{[1]} = g^{[1]}(z^{[1]})\]

where:

  • $W^{[1]}, b^{[1]}$ are the parameters for layer 1
  • $g^{[1]}$ is the activation function for layer 1

Layer 2 (Second hidden layer):

\[z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}\] \[a^{[2]} = g^{[2]}(z^{[2]})\]

Notice: Layer 2 uses the activations from layer 1 as input!

Layer 3 (Third hidden layer):

\[z^{[3]} = W^{[3]} a^{[2]} + b^{[3]}\] \[a^{[3]} = g^{[3]}(z^{[3]})\]

Layer 4 (Output layer):

\[z^{[4]} = W^{[4]} a^{[3]} + b^{[4]}\] \[a^{[4]} = g^{[4]}(z^{[4]}) = \hat{y}\]

The final layer activation $a^{[4]}$ is our prediction $\hat{y}$!

Key Observation: Replacing $x$ with $a^{[0]}$

Notice that $x$ is the input to the first layer. We can write:

\[a^{[0]} = x\]

Now all equations look uniform:

\[\boxed{\begin{align} z^{[1]} &= W^{[1]} a^{[0]} + b^{[1]} \\ z^{[2]} &= W^{[2]} a^{[1]} + b^{[2]} \\ z^{[3]} &= W^{[3]} a^{[2]} + b^{[3]} \\ z^{[4]} &= W^{[4]} a^{[3]} + b^{[4]} \end{align}}\]

General Forward Propagation Equations

For any layer $l$ (where $l = 1, 2, \ldots, L$):

\[\boxed{\begin{align} z^{[l]} &= W^{[l]} a^{[l-1]} + b^{[l]} \\ a^{[l]} &= g^{[l]}(z^{[l]}) \end{align}}\]

Interpretation:

  • Take activations from previous layer: $a^{[l-1]}$
  • Apply linear transformation: $W^{[l]} a^{[l-1]} + b^{[l]}$
  • Apply activation function: $g^{[l]}(z^{[l]})$
  • Get activations for current layer: $a^{[l]}$

Complete Algorithm: Single Example
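The listings in this lesson assume NumPy plus two element-wise activation helpers, relu and sigmoid. A minimal sketch of those helpers (the course provides its own implementations):

import numpy as np

def sigmoid(z):
    # Element-wise sigmoid: 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Element-wise ReLU: max(0, z)
    return np.maximum(0, z)
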

def forward_propagation_single_example(x, parameters, L):
    """
    Forward propagation for a single training example

    Args:
        x: input features (n^[0], 1) vector
        parameters: dict with W1, b1, W2, b2, ..., WL, bL
        L: number of layers

    Returns:
        y_hat: prediction, array of shape (n^[L], 1)
        cache: dict with all z and a values for backprop
    """
    cache = {}

    # Initialize: input is activation of layer 0
    a = x
    cache['a0'] = x

    # Loop through layers
    for l in range(1, L + 1):
        # Get parameters for this layer
        W = parameters[f'W{l}']
        b = parameters[f'b{l}']

        # Linear step
        z = np.dot(W, a) + b

        # Activation step
        if l == L:
            # Output layer: use sigmoid for binary classification
            a = sigmoid(z)
        else:
            # Hidden layers: use ReLU
            a = relu(z)

        # Store for backpropagation
        cache[f'z{l}'] = z
        cache[f'a{l}'] = a

    y_hat = a
    return y_hat, cache
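
A quick usage sketch for the 4-layer architecture in the figure above (hidden layers of 5, 5, and 3 units); the parameter values are random and only the shapes matter here:

np.random.seed(1)
layer_dims = [3, 5, 5, 3, 1]                  # n^[0] .. n^[4]
parameters = {}
for l in range(1, len(layer_dims)):
    parameters[f'W{l}'] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
    parameters[f'b{l}'] = np.zeros((layer_dims[l], 1))

x = np.random.randn(3, 1)                     # a single training example
y_hat, cache = forward_propagation_single_example(x, parameters, L=4)
print(y_hat.shape)                            # (1, 1)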

Vectorized Forward Propagation

Processing All Training Examples

Now let’s extend to handle $m$ training examples simultaneously using vectorization.

Notation reminder:

  • Lowercase: single example ($z^{[l]}, a^{[l]}$)
  • Uppercase: all examples ($Z^{[l]}, A^{[l]}$)

Matrix Form

For the entire training set:

Layer 1:

\[Z^{[1]} = W^{[1]} X + b^{[1]}\] \[A^{[1]} = g^{[1]}(Z^{[1]})\]

where:

  • $X = A^{[0]}$ has shape $(n^{[0]}, m)$ - all training examples stacked as columns
  • $Z^{[1]}$ has shape $(n^{[1]}, m)$
  • $A^{[1]}$ has shape $(n^{[1]}, m)$

Layer 2:

\[Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}\] \[A^{[2]} = g^{[2]}(Z^{[2]})\]

Layer 3:

\[Z^{[3]} = W^{[3]} A^{[2]} + b^{[3]}\] \[A^{[3]} = g^{[3]}(Z^{[3]})\]

Layer 4 (Output):

\[Z^{[4]} = W^{[4]} A^{[3]} + b^{[4]}\] \[A^{[4]} = g^{[4]}(Z^{[4]}) = \hat{Y}\]

where $\hat{Y}$ is the $(n^{[4]}, m)$ matrix of predictions for all examples.

Understanding the Matrices

Matrix $Z^{[l]}$ stacks the $z$ vectors for all examples:

\[Z^{[l]} = \begin{bmatrix} | & | & & | \\ z^{[l](1)} & z^{[l](2)} & \cdots & z^{[l](m)} \\ | & | & & | \end{bmatrix}\]

Each column is the $z$ vector for one training example.

Matrix $A^{[l]}$ stacks the activation vectors:

\[A^{[l]} = \begin{bmatrix} | & | & & | \\ a^{[l](1)} & a^{[l](2)} & \cdots & a^{[l](m)} \\ | & | & & | \end{bmatrix}\]
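
A small NumPy check (with made-up sizes) that stacking the per-example $z^{[l](i)}$ vectors as columns produces exactly the matrix computed by the single vectorized product $W^{[l]} A^{[l-1]} + b^{[l]}$:

import numpy as np

np.random.seed(0)
n_prev, n_curr, m = 3, 5, 4            # small, made-up sizes
W = np.random.randn(n_curr, n_prev)
b = np.random.randn(n_curr, 1)
X = np.random.randn(n_prev, m)         # A^[0]: examples stacked as columns

# Column i is the z vector for training example i
Z_stacked = np.hstack([np.dot(W, X[:, i:i+1]) + b for i in range(m)])

# The vectorized product builds the same matrix in one step
Z_vectorized = np.dot(W, X) + b

print(np.allclose(Z_stacked, Z_vectorized))   # True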

General Vectorized Equations

For any layer $l$ (where $l = 1, 2, \ldots, L$):

\[\boxed{\begin{align} Z^{[l]} &= W^{[l]} A^{[l-1]} + b^{[l]} \\ A^{[l]} &= g^{[l]}(Z^{[l]}) \end{align}}\]

where $A^{[0]} = X$ (the input data matrix).

Complete Algorithm: Vectorized

def forward_propagation_vectorized(X, parameters, L):
    """
    Vectorized forward propagation for all training examples

    Args:
        X: input data (n^[0], m) - m training examples
        parameters: dict with W1, b1, W2, b2, ..., WL, bL
        L: number of layers

    Returns:
        Y_hat: predictions (n^[L], m)
        caches: list of (Z, A_prev, W, b) for each layer
    """
    caches = []

    # Initialize: input is activation of layer 0
    A = X

    # Loop through layers
    for l in range(1, L + 1):
        A_prev = A

        # Get parameters for this layer
        W = parameters[f'W{l}']
        b = parameters[f'b{l}']

        # Linear step: Z = WA + b
        Z = np.dot(W, A_prev) + b

        # Activation step
        if l == L:
            # Output layer: sigmoid for binary classification
            A = sigmoid(Z)
        else:
            # Hidden layers: ReLU
            A = relu(Z)

        # Cache for backpropagation
        cache = (Z, A_prev, W, b)
        caches.append(cache)

    Y_hat = A
    return Y_hat, caches

The For Loop is Necessary!

Iterating Through Layers

Looking at the vectorized implementation, you might notice:

for l in range(1, L + 1):
    # Compute layer l activations

There’s a for loop! Don’t we want to avoid for loops in neural networks?

When For Loops Are Acceptable

Answer: This is one case where a for loop is perfectly fine!

Why?

  • The loop iterates over layers (1 through $L$), not training examples
  • Number of layers is typically small (2-100)
  • There’s no way to avoid this - layers must be computed sequentially
  • Layer $l$ depends on layer $l-1$, so we can’t parallelize across layers

Vectorization achieved where it matters:

  • ✅ Across training examples (the $m$ dimension)
  • ✅ Within each layer (matrix operations)
  • ❌ Across layers (inherently sequential)

What We Vectorize

Vectorized (good for performance):

Z = np.dot(W, A_prev) + b  # Processes all m examples at once

Sequential (unavoidable):

for l in range(1, L + 1):  # Must compute layer 1, then 2, then 3...

Summary

Forward propagation = For loop over layers (sequential) + Matrix operations within each layer (vectorized)
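
A rough timing sketch of the same linear step computed example by example versus vectorized across the $m$ dimension; exact numbers depend on your hardware and BLAS library, but the vectorized version is typically far faster:

import numpy as np
import time

n_prev, n_curr, m = 512, 256, 10000
W = np.random.randn(n_curr, n_prev)
b = np.random.randn(n_curr, 1)
A_prev = np.random.randn(n_prev, m)

# Example-by-example loop over the m dimension (what vectorization removes)
start = time.perf_counter()
Z_loop = np.zeros((n_curr, m))
for i in range(m):
    Z_loop[:, i:i+1] = np.dot(W, A_prev[:, i:i+1]) + b
loop_seconds = time.perf_counter() - start

# One matrix product handles all m examples at once
start = time.perf_counter()
Z_vec = np.dot(W, A_prev) + b
vec_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.3f}s   vectorized: {vec_seconds:.3f}s")
print("same result:", np.allclose(Z_loop, Z_vec))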

Comparison: Single vs Vectorized

| Aspect | Single Example | Vectorized (All Examples) |
| --- | --- | --- |
| Input | $x$ (vector) | $X$ (matrix) |
| Notation | Lowercase: $z^{[l]}, a^{[l]}$ | Uppercase: $Z^{[l]}, A^{[l]}$ |
| Dimensions | $(n^{[l]}, 1)$ | $(n^{[l]}, m)$ |
| Computation | One example at a time | All $m$ examples at once |
| Speed | Slow ($m$ iterations needed) | Fast (vectorized) |
| Use case | Understanding, debugging | Training, prediction |

Dimension Check Example

Let’s verify dimensions for a 4-layer network with $m = 100$ examples:

Architecture:

  • $n^{[0]} = 3$ (input)
  • $n^{[1]} = 5$ (hidden)
  • $n^{[2]} = 4$ (hidden)
  • $n^{[3]} = 3$ (hidden)
  • $n^{[4]} = 1$ (output)

Layer 1

\[Z^{[1]} = W^{[1]} A^{[0]} + b^{[1]}\]

Dimensions:

  • $W^{[1]}$: $(5, 3)$
  • $A^{[0]} = X$: $(3, 100)$
  • $b^{[1]}$: $(5, 1)$ (broadcasts to $(5, 100)$)
  • $Z^{[1]}$: $(5, 100)$ ✓

\[A^{[1]} = g^{[1]}(Z^{[1]})\]

  • $A^{[1]}$: $(5, 100)$ ✓

Layer 2

\[Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}\]

Dimensions:

  • $W^{[2]}$: $(4, 5)$
  • $A^{[1]}$: $(5, 100)$
  • $b^{[2]}$: $(4, 1)$ (broadcasts to $(4, 100)$)
  • $Z^{[2]}$: $(4, 100)$ ✓

Complete Dimension Summary

| Layer | $W^{[l]}$ | $b^{[l]}$ | $A^{[l-1]}$ | $Z^{[l]}$ | $A^{[l]}$ |
| --- | --- | --- | --- | --- | --- |
| 1 | $(5, 3)$ | $(5, 1)$ | $(3, 100)$ | $(5, 100)$ | $(5, 100)$ |
| 2 | $(4, 5)$ | $(4, 1)$ | $(5, 100)$ | $(4, 100)$ | $(4, 100)$ |
| 3 | $(3, 4)$ | $(3, 1)$ | $(4, 100)$ | $(3, 100)$ | $(3, 100)$ |
| 4 | $(1, 3)$ | $(1, 1)$ | $(3, 100)$ | $(1, 100)$ | $(1, 100)$ |

Pattern:

  • $W^{[l]}$ is $(n^{[l]}, n^{[l-1]})$
  • $A^{[l]}$ is $(n^{[l]}, m)$
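
A small NumPy sanity check of this pattern for the architecture above (a sketch; the weight values are random and only the shapes matter):

import numpy as np

layer_dims = [3, 5, 4, 3, 1]   # n^[0] .. n^[4]
m = 100

for l in range(1, len(layer_dims)):
    W = np.random.randn(layer_dims[l], layer_dims[l - 1])   # (n^[l], n^[l-1])
    b = np.zeros((layer_dims[l], 1))                         # (n^[l], 1)
    assert W.shape == (layer_dims[l], layer_dims[l - 1])
    assert b.shape == (layer_dims[l], 1)
    print(f"Layer {l}: W {W.shape}, b {b.shape}, Z/A ({layer_dims[l]}, {m})")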

Relationship to Shallow Networks

It’s Just Repetition!

If forward propagation in deep networks looks familiar, that’s because it is!

What you learned in Week 3 (2-layer network):

\[\begin{align} Z^{[1]} &= W^{[1]} X + b^{[1]} \\ A^{[1]} &= g^{[1]}(Z^{[1]}) \\ Z^{[2]} &= W^{[2]} A^{[1]} + b^{[2]} \\ A^{[2]} &= g^{[2]}(Z^{[2]}) \end{align}\]

Deep networks (L-layer network):

Just repeat the same pattern $L$ times!

Modularity

Think of each layer as a reusable module:

Input → [Layer Module] → [Layer Module] → [Layer Module] → Output
        (Layer 1)         (Layer 2)         (Layer 3)

Each module does:

  1. Linear transformation: $Z = WA + b$
  2. Activation: $A = g(Z)$
  3. Pass output to next module
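
As a sketch, that module is a single reusable function (the name layer_forward and the activation argument g are illustrative, not course API):

import numpy as np

def layer_forward(A_prev, W, b, g):
    """One layer module: linear transformation followed by activation g."""
    Z = np.dot(W, A_prev) + b   # step 1: linear transformation
    A = g(Z)                    # step 2: element-wise activation
    return A, Z                 # step 3: A is passed to the next module

Chaining this function $L$ times, feeding each returned $A$ into the next call, reproduces the full forward pass.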

Implementation Best Practices

1. Always Check Dimensions

When debugging, verify matrix dimensions at each step:

print(f"W{l} shape:", W.shape)
print(f"A{l-1} shape:", A_prev.shape)
print(f"Z{l} shape:", Z.shape)
print(f"A{l} shape:", A.shape)

Common bugs:

  • ❌ Wrong dimension order: $(m, n)$ instead of $(n, m)$
  • ❌ Missing transpose
  • ❌ Shape mismatch in matrix multiplication

2. Cache Values for Backpropagation

Store all intermediate values:

cache = {
    'Z1': Z1, 'A1': A1,
    'Z2': Z2, 'A2': A2,
    # ... for all layers
}

You’ll need these for backpropagation!

3. Use Consistent Notation

Match your code to the mathematical notation:

# Good: Clear and matches notation
Z1 = np.dot(W1, A0) + b1
A1 = relu(Z1)

# Bad: Hard to follow
temp = weights[0] @ inputs + biases[0]
output = activation(temp)

Complete Example

Let’s trace through forward propagation for a 4-layer network:

import numpy as np
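
# `relu` and `sigmoid` are the helpers defined in the earlier listings;
# `initialize_parameters` is a course helper from a previous lesson. A minimal
# stand-in for it (an assumption, so this example runs on its own) could be:
def initialize_parameters(layer_dims):
    params = {}
    for l in range(1, len(layer_dims)):
        params[f'W{l}'] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        params[f'b{l}'] = np.zeros((layer_dims[l], 1))
    return params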

# Network architecture
layer_dims = [3, 5, 4, 3, 1]  # n^[0] through n^[4]
m = 100  # training examples
L = 4    # layers

# Initialize random parameters
parameters = initialize_parameters(layer_dims)

# Create random input data
X = np.random.randn(3, 100)

# Forward propagation
A = X  # A^[0]

print(f"Input X (A^[0]) shape: {A.shape}")  # (3, 100)

for l in range(1, L + 1):
    W = parameters[f'W{l}']
    b = parameters[f'b{l}']

    Z = np.dot(W, A) + b

    if l < L:
        A = relu(Z)  # Hidden layers
    else:
        A = sigmoid(Z)  # Output layer

    print(f"Layer {l}:")
    print(f"  W^[{l}] shape: {W.shape}")
    print(f"  Z^[{l}] shape: {Z.shape}")
    print(f"  A^[{l}] shape: {A.shape}")

Y_hat = A
print(f"\nFinal predictions shape: {Y_hat.shape}")  # (1, 100)

Output:

Input X (A^[0]) shape: (3, 100)
Layer 1:
  W^[1] shape: (5, 3)
  Z^[1] shape: (5, 100)
  A^[1] shape: (5, 100)
Layer 2:
  W^[2] shape: (4, 5)
  Z^[2] shape: (4, 100)
  A^[2] shape: (4, 100)
Layer 3:
  W^[3] shape: (3, 4)
  Z^[3] shape: (3, 100)
  A^[3] shape: (3, 100)
Layer 4:
  W^[4] shape: (1, 3)
  Z^[4] shape: (1, 100)
  A^[4] shape: (1, 100)

Final predictions shape: (1, 100)

Key Takeaways

  1. Forward propagation formula: $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$, then $A^{[l]} = g^{[l]}(Z^{[l]})$
  2. Single example: Use lowercase $z^{[l]}, a^{[l]}$ (vectors)
  3. Vectorized: Use uppercase $Z^{[l]}, A^{[l]}$ (matrices with $m$ columns)
  4. Input notation: $x = a^{[0]}$ or $X = A^{[0]}$
  5. Output notation: $\hat{y} = a^{[L]}$ or $\hat{Y} = A^{[L]}$
  6. For loop is OK: Iterating through layers is unavoidable and acceptable
  7. Vectorize training examples: Process all $m$ examples simultaneously (fast!)
  8. Sequential layers: Layers must be computed in order (inherently sequential)
  9. Same pattern repeated: Deep networks just repeat the 2-layer logic many times
  10. Dimension checking: Always verify matrix shapes to catch bugs
  11. Cache intermediate values: Store $Z^{[l]}, A^{[l]}$ for backpropagation
  12. Matrix dimensions: $W^{[l]}$ is $(n^{[l]}, n^{[l-1]})$, $A^{[l]}$ is $(n^{[l]}, m)$
  13. Broadcasting: $b^{[l]}$ with shape $(n^{[l]}, 1)$ broadcasts to $(n^{[l]}, m)$
  14. Modular thinking: Each layer is a reusable computation unit
  15. Implementation matches math: Keep code notation consistent with equations