Forward and Backward Propagation
Table of contents
- Introduction
- Forward Propagation Implementation
- Backward Propagation Implementation
- Complete Example: 3-Layer Neural Network
- Summary: Forward and Backward Functions
- Practical Advice
- Key Takeaways
Introduction
You’ve learned about the building blocks of deep neural networks: the forward and backward propagation steps for each layer. Now let’s walk through the complete implementation, with all the equations you need.
This lesson provides:
- Complete forward propagation equations (single example + vectorized)
- Complete backward propagation equations (single example + vectorized)
- How to initialize the forward and backward passes
- A concrete 3-layer network example
Forward Propagation Implementation
Single Layer Forward Function
Recall the forward function from the previous lesson:
Input: $a^{[l-1]}$ (activations from previous layer)
Output: $a^{[l]}$ (activations for this layer), cache
Cache: $z^{[l]}, W^{[l]}, b^{[l]}, a^{[l-1]}$ (needed for backprop)
Forward Propagation Equations
Single Example
For a single training example:
\[z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}\]
\[a^{[l]} = g^{[l]}(z^{[l]})\]
where $g^{[l]}$ is the activation function for layer $l$.
Vectorized Implementation
For all $m$ training examples:
\[Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}\]
\[A^{[l]} = g^{[l]}(Z^{[l]})\]
Note: $b^{[l]}$ uses Python broadcasting to add to each column of $W^{[l]} A^{[l-1]}$.
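To see the broadcasting concretely, here is a small sketch (the shapes are illustrative): a $(n^{[l]}, 1)$ bias added to a $(n^{[l]}, m)$ matrix is replicated across all $m$ columns.

```python
import numpy as np

# Illustrative shapes: n[l] = 3 units, m = 4 examples
WA = np.ones((3, 4))               # stands in for W^[l] A^[l-1], shape (n[l], m)
b = np.array([[1.], [2.], [3.]])   # bias vector, shape (n[l], 1)

Z = WA + b                         # b is broadcast across the m columns
print(Z.shape)                     # (3, 4) -- every column gets the same bias added
```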
Forward Propagation Algorithm
```python
import numpy as np

def forward_propagation_layer_l(A_prev, W, b, activation):
    """
    Implement forward propagation for layer l

    Args:
        A_prev: activations from previous layer, shape (n[l-1], m)
        W: weight matrix, shape (n[l], n[l-1])
        b: bias vector, shape (n[l], 1)
        activation: "relu" or "sigmoid"

    Returns:
        A: activations for this layer, shape (n[l], m)
        cache: tuple containing (Z, A_prev, W, b)
    """
    # Linear step
    Z = np.dot(W, A_prev) + b

    # Activation step
    if activation == "relu":
        A = relu(Z)
    elif activation == "sigmoid":
        A = sigmoid(Z)

    # Cache for backpropagation
    cache = (Z, A_prev, W, b)

    return A, cache
```
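The forward function above, and the backward function later in this lesson, call activation helpers (`relu`, `sigmoid`, and their derivatives) that aren't shown here. A minimal sketch of what they might look like:

```python
import numpy as np

def relu(Z):
    # ReLU: max(0, z), applied element-wise
    return np.maximum(0, Z)

def relu_derivative(Z):
    # Derivative of ReLU: 1 where z > 0, else 0
    return (Z > 0).astype(float)

def sigmoid(Z):
    # Sigmoid: 1 / (1 + e^(-z)), applied element-wise
    return 1 / (1 + np.exp(-Z))

def sigmoid_derivative(Z):
    # Derivative of sigmoid: sigma(z) * (1 - sigma(z))
    s = sigmoid(Z)
    return s * (1 - s)
```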
Initializing Forward Propagation
Starting point: $a^{[0]} = X$
- For a single example: $a^{[0]} = x$ (input features)
- For vectorized: $A^{[0]} = X$ (entire training set)
Chain of computations:
```
X = A^[0] → [Layer 1] → A^[1] → [Layer 2] → A^[2] → ... → A^[L] = Ŷ
```
This left-to-right chain computes predictions by repeatedly applying the forward function.
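As a sketch of how this chain might look in code for a general $L$-layer network (the `params` dictionary keyed by `"W1"`, `"b1"`, … is an illustrative assumption, as is the ReLU-hidden / sigmoid-output choice used later in this lesson):

```python
def full_forward_propagation(X, params, L):
    """Run the forward chain A^[0] -> A^[1] -> ... -> A^[L], keeping every cache."""
    caches = []
    A = X  # A^[0] = X

    # Hidden layers 1 .. L-1 use ReLU
    for l in range(1, L):
        A, cache = forward_propagation_layer_l(A, params["W" + str(l)], params["b" + str(l)], "relu")
        caches.append(cache)

    # Output layer L uses sigmoid (binary classification)
    AL, cache = forward_propagation_layer_l(A, params["W" + str(L)], params["b" + str(L)], "sigmoid")
    caches.append(cache)

    return AL, caches
```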
Backward Propagation Implementation
Single Layer Backward Function
Input: $dA^{[l]}$ (gradient of cost w.r.t. activations)
Output: $dA^{[l-1]}$ (gradient to pass back), $dW^{[l]}$, $db^{[l]}$ (gradients for this layer’s parameters)
Backward Propagation Equations
The Four Key Equations (Single Example)
These four equations implement backpropagation for layer $l$:
Equation 1: Gradient through activation function
\[dz^{[l]} = da^{[l]} \odot g'^{[l]}(z^{[l]})\]
Equation 2: Gradient w.r.t. weights
\[dW^{[l]} = dz^{[l]} \cdot (a^{[l-1]})^T\]
Equation 3: Gradient w.r.t. bias
\[db^{[l]} = dz^{[l]}\]
Equation 4: Gradient w.r.t. previous activations
\[da^{[l-1]} = (W^{[l]})^T \cdot dz^{[l]}\]
Note: $\odot$ denotes element-wise multiplication.
Connection to Previous Formula
If you take Equation 4 written for layer $l+1$ (which gives $da^{[l]}$) and substitute it into Equation 1, you get:
\[dz^{[l]} = (W^{[l+1]})^T dz^{[l+1]} \odot g'^{[l]}(z^{[l]})\]
This matches the backpropagation equations from Week 3!
Vectorized Implementation
For all $m$ training examples:
Equation 1: Activation gradient
\[dZ^{[l]} = dA^{[l]} \odot g'^{[l]}(Z^{[l]})\]
Equation 2: Weight gradient
\[dW^{[l]} = \frac{1}{m} dZ^{[l]} (A^{[l-1]})^T\]
Equation 3: Bias gradient
\[db^{[l]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)}\]
(sum over training examples, i.e., over the columns, keeping dimensions)
Equation 4: Previous activation gradient
\[dA^{[l-1]} = (W^{[l]})^T dZ^{[l]}\]
Backward Propagation Algorithm
```python
def backward_propagation_layer_l(dA, cache, activation):
    """
    Implement backward propagation for layer l

    Args:
        dA: gradient of cost w.r.t. activations, shape (n[l], m)
        cache: tuple (Z, A_prev, W, b) from forward prop
        activation: "relu" or "sigmoid"

    Returns:
        dA_prev: gradient w.r.t. previous activations, shape (n[l-1], m)
        dW: gradient w.r.t. weights, shape (n[l], n[l-1])
        db: gradient w.r.t. bias, shape (n[l], 1)
    """
    # Unpack cache
    Z, A_prev, W, b = cache
    m = A_prev.shape[1]

    # Equation 1: Activation gradient
    if activation == "relu":
        dZ = dA * relu_derivative(Z)
    elif activation == "sigmoid":
        dZ = dA * sigmoid_derivative(Z)

    # Equation 2: Weight gradient
    dW = (1/m) * np.dot(dZ, A_prev.T)

    # Equation 3: Bias gradient
    db = (1/m) * np.sum(dZ, axis=1, keepdims=True)

    # Equation 4: Previous activation gradient
    dA_prev = np.dot(W.T, dZ)

    return dA_prev, dW, db
```
NumPy implementation detail:
```python
db = (1/m) * np.sum(dZ, axis=1, keepdims=True)
```
- `axis=1`: sum across training examples (the columns)
- `keepdims=True`: keep the result as $(n^{[l]}, 1)$ instead of $(n^{[l]},)$
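A quick sketch of why `keepdims=True` matters (the array values are illustrative):

```python
import numpy as np

dZ = np.array([[1., 2., 3.],
               [4., 5., 6.]])                   # shape (n[l], m) = (2, 3)

print(np.sum(dZ, axis=1).shape)                 # (2,)   -- rank-1 array, awkward to broadcast
print(np.sum(dZ, axis=1, keepdims=True).shape)  # (2, 1) -- matches the shape of b
```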
Initializing Backward Propagation
Starting point: $dA^{[L]}$ (gradient of loss w.r.t. final predictions)
For binary classification with logistic loss:
\[\frac{\partial \mathcal{L}}{\partial a^{[L]}} = -\frac{y}{a^{[L]}} + \frac{1-y}{1-a^{[L]}}\]
Single Example
\[da^{[L]} = -\frac{y}{a^{[L]}} + \frac{1-y}{1-a^{[L]}}\]
Vectorized (All Examples)
\[dA^{[L]} = \begin{bmatrix} -\frac{y^{(1)}}{a^{[L](1)}} + \frac{1-y^{(1)}}{1-a^{[L](1)}} & \cdots & -\frac{y^{(m)}}{a^{[L](m)}} + \frac{1-y^{(m)}}{1-a^{[L](m)}} \end{bmatrix}\]
Or more concisely in Python:
```python
dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
```
Chain of computations:
```
dA^[0] ← [Layer 1] ← dA^[1] ← [Layer 2] ← ... ← dA^[L-1] ← [Layer L] ← dA^[L]
(not used)    ↓                    ↓                             ↓
        dW^[1], db^[1]       dW^[2], db^[2]               dW^[L], db^[L]
```
This right-to-left chain computes all gradients by repeatedly applying the backward function.
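Mirroring the forward loop sketched earlier, here is a hedged sketch of the backward chain for a general $L$-layer network (it assumes the caches were collected in forward order and that NumPy is imported as `np`; the `grads` dictionary layout is an illustrative choice, not from this lesson):

```python
def full_backward_propagation(AL, Y, caches):
    """Run the backward chain dA^[L] -> dA^[L-1] -> ... -> dA^[1], collecting gradients."""
    grads = {}
    L = len(caches)  # number of layers

    # Initialize backward propagation at the output layer
    dA = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

    # Output layer L uses sigmoid
    dA, grads["dW" + str(L)], grads["db" + str(L)] = backward_propagation_layer_l(dA, caches[L - 1], "sigmoid")

    # Hidden layers L-1 .. 1 use ReLU
    for l in range(L - 1, 0, -1):
        dA, grads["dW" + str(l)], grads["db" + str(l)] = backward_propagation_layer_l(dA, caches[l - 1], "relu")

    return grads
```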
Complete Example: 3-Layer Neural Network
Let’s see how forward and backward propagation work together in a complete network.
Network Architecture
```
Input X (features)
        ↓
Layer 1 (ReLU)
        ↓
Layer 2 (ReLU)
        ↓
Layer 3 (Sigmoid) → binary classification
        ↓
Output Ŷ (predictions)
```
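The training-iteration code later in this example assumes parameters `W1, b1, …, W3, b3` already exist. A minimal initialization sketch, assuming illustrative layer sizes and the common small-random-weights scheme (neither is specified in this lesson):

```python
import numpy as np

np.random.seed(1)

# Assumed layer sizes: n_x input features, two hidden layers, one output unit
n_x, n_1, n_2, n_3 = 4, 5, 3, 1

W1 = np.random.randn(n_1, n_x) * 0.01   # shape (n[1], n[0])
b1 = np.zeros((n_1, 1))                 # shape (n[1], 1)
W2 = np.random.randn(n_2, n_1) * 0.01   # shape (n[2], n[1])
b2 = np.zeros((n_2, 1))                 # shape (n[2], 1)
W3 = np.random.randn(n_3, n_2) * 0.01   # shape (n[3], n[2])
b3 = np.zeros((n_3, 1))                 # shape (n[3], 1)
```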
Forward Propagation (Left to Right)
Layer 1:
\[Z^{[1]} = W^{[1]} X + b^{[1]}\]
\[A^{[1]} = \text{ReLU}(Z^{[1]})\]
Cache: $(Z^{[1]}, X, W^{[1]}, b^{[1]})$
Layer 2:
\[Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}\]
\[A^{[2]} = \text{ReLU}(Z^{[2]})\]
Cache: $(Z^{[2]}, A^{[1]}, W^{[2]}, b^{[2]})$
Layer 3:
\[Z^{[3]} = W^{[3]} A^{[2]} + b^{[3]}\]
\[A^{[3]} = \sigma(Z^{[3]}) = \hat{Y}\]
Cache: $(Z^{[3]}, A^{[2]}, W^{[3]}, b^{[3]})$
Compute Loss:
\[\mathcal{L} = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(\hat{y}^{(i)}) + (1-y^{(i)}) \log(1-\hat{y}^{(i)})]\]
Backward Propagation (Right to Left)
Initialize:
\[dA^{[3]} = -\frac{Y}{A^{[3]}} + \frac{1-Y}{1-A^{[3]}}\]
Layer 3 Backward:
```python
# dA3 comes from the initialization above; cache3 was stored during forward prop
dA2, dW3, db3 = backward_propagation_layer_l(dA3, cache3, "sigmoid")
```
Computes: $dW^{[3]}, db^{[3]}, dA^{[2]}$
Layer 2 Backward:
```python
dA1, dW2, db2 = backward_propagation_layer_l(dA2, cache2, "relu")
```
Computes: $dW^{[2]}, db^{[2]}, dA^{[1]}$
Layer 1 Backward:
```python
dA0, dW1, db1 = backward_propagation_layer_l(dA1, cache1, "relu")
```
Computes: $dW^{[1]}, db^{[1]}, dA^{[0]}$
Note: We don’t use $dA^{[0]}$ (gradient w.r.t. input features) for training, so we can discard it.
Complete Training Iteration
```python
# Forward propagation
A1, cache1 = forward_propagation_layer_l(X, W1, b1, "relu")
A2, cache2 = forward_propagation_layer_l(A1, W2, b2, "relu")
A3, cache3 = forward_propagation_layer_l(A2, W3, b3, "sigmoid")

# Compute loss
cost = compute_cost(A3, Y)

# Initialize backward propagation
dA3 = - (np.divide(Y, A3) - np.divide(1 - Y, 1 - A3))

# Backward propagation
dA2, dW3, db3 = backward_propagation_layer_l(dA3, cache3, "sigmoid")
dA1, dW2, db2 = backward_propagation_layer_l(dA2, cache2, "relu")
dA0, dW1, db1 = backward_propagation_layer_l(dA1, cache1, "relu")

# Update parameters (alpha is the learning rate)
W3 = W3 - alpha * dW3
b3 = b3 - alpha * db3
W2 = W2 - alpha * dW2
b2 = b2 - alpha * db2
W1 = W1 - alpha * dW1
b1 = b1 - alpha * db1
```
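The `compute_cost` helper used above isn't defined in this lesson; a minimal sketch, assuming the binary cross-entropy cost written earlier:

```python
import numpy as np

def compute_cost(AL, Y):
    """Binary cross-entropy cost averaged over m examples (AL and Y have shape (1, m))."""
    m = Y.shape[1]
    cost = -(1/m) * np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))
    return float(np.squeeze(cost))  # return a plain scalar rather than a (1, 1) array
```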
Visual Flow Diagram
```
FORWARD PROPAGATION:
X ──→ [Layer 1: ReLU] ──→ [Layer 2: ReLU] ──→ [Layer 3: Sigmoid] ──→ Ŷ ──→ Loss
           ↓ cache Z^[1]       ↓ cache Z^[2]         ↓ cache Z^[3]

BACKWARD PROPAGATION:
dW^[1], db^[1] ←── dW^[2], db^[2] ←── dW^[3], db^[3] ←── dA^[3]
     ↑ use cache        ↑ use cache        ↑ use cache
```
Summary: Forward and Backward Functions
![Diagram showing forward and backward propagation flow through a 3-layer neural network. The top row shows forward propagation from left to right: Input X flows through ReLU layer with cached z[1], then second ReLU layer with cached z[2], then Sigmoid layer with cached z[3], producing output y-hat. The loss calculation shows L(y-hat, y) equals negative sum over m examples. The bottom row shows backward propagation from right to left with red arrows: Starting from da[3], gradients flow backward through each layer computing dW and db at each stage. Mathematical formulas on the right show: da[3] = -y/a + (1-y)/(1-a), and the backpropagation formula dA[l] equals element-wise product of activation derivative terms. Additional notation shows y-hat as y(m)/a(m) over (1-a(m)). The diagram illustrates the symmetric forward-backward flow of information through the network during training.](/assets/images/deep-learning/neural-networks/week-4/forward_backward_propagation.png)
Forward Function Summary
| Component | Equation | Purpose |
|---|---|---|
| Linear | $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$ | Compute pre-activation |
| Activation | $A^{[l]} = g^{[l]}(Z^{[l]})$ | Apply activation function |
| Cache | Store $(Z^{[l]}, A^{[l-1]}, W^{[l]}, b^{[l]})$ | Save for backprop |
Backward Function Summary
| Component | Equation | Purpose |
|---|---|---|
| Activation Gradient | $dZ^{[l]} = dA^{[l]} \odot g'^{[l]}(Z^{[l]})$ | Gradient through activation |
| Weight Gradient | $dW^{[l]} = \frac{1}{m} dZ^{[l]} (A^{[l-1]})^T$ | Parameter update for $W^{[l]}$ |
| Bias Gradient | $db^{[l]} = \frac{1}{m} \sum dZ^{[l]}$ | Parameter update for $b^{[l]}$ |
| Previous Gradient | $dA^{[l-1]} = (W^{[l]})^T dZ^{[l]}$ | Pass gradient back |
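A sanity check that follows directly from these tables: every gradient has the same shape as the quantity it corresponds to. A small sketch using the layer functions above (the layer sizes and random inputs are illustrative):

```python
import numpy as np

np.random.seed(0)
n_prev, n_l, m = 4, 3, 5                      # illustrative layer sizes and batch size

A_prev = np.random.randn(n_prev, m)
W = np.random.randn(n_l, n_prev) * 0.01
b = np.zeros((n_l, 1))

A, cache = forward_propagation_layer_l(A_prev, W, b, "relu")
dA = np.random.randn(*A.shape)                # stand-in gradient from the next layer

dA_prev, dW, db = backward_propagation_layer_l(dA, cache, "relu")

assert dW.shape == W.shape                    # (n[l], n[l-1])
assert db.shape == b.shape                    # (n[l], 1)
assert dA_prev.shape == A_prev.shape          # (n[l-1], m)
```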
Practical Advice
Don’t Worry If It Seems Complex!
Important: If these equations feel abstract or confusing, that’s completely normal!
Why it will become clearer:
- Programming exercise: Implementing these equations yourself makes them concrete
- Working code: Seeing the equations actually work is enlightening
- Practice: The more you implement, the more intuitive it becomes
The “Magic” of Deep Learning
Even experienced practitioners are sometimes surprised when deep learning works! Here’s why:
Complexity comes from data, not code:
- Deep learning code is often just 100-500 lines
- Not 10,000 or 100,000 lines of complex logic
- The data does most of the heavy lifting
The equations are calculus:
- Forward prop: Chain function compositions
- Backward prop: Chain rule for derivatives
- The derivation is one of the harder ones in machine learning
It’s okay to not derive everything:
- Focus on implementing correctly
- Understand conceptually what each step does
- Trust the math (it’s been thoroughly verified)
Key Takeaways
- Forward propagation: $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$, then $A^{[l]} = g^{[l]}(Z^{[l]})$
- Vectorized forward prop: Same equations work for all $m$ examples simultaneously
- Initialize forward prop: Start with $A^{[0]} = X$ (input data)
- Four backward equations: $dZ^{[l]}, dW^{[l]}, db^{[l]}, dA^{[l-1]}$ (in that order)
- Vectorized backward prop: Add $\frac{1}{m}$ factor for $dW^{[l]}$ and $db^{[l]}$
- Initialize backward prop: Start with $dA^{[L]} = -\frac{Y}{A^{[L]}} + \frac{1-Y}{1-A^{[L]}}$
- Cache is essential: Store $Z^{[l]}, A^{[l-1]}, W^{[l]}, b^{[l]}$ during forward prop
- Element-wise multiplication: $\odot$ in $dZ^{[l]} = dA^{[l]} \odot g'^{[l]}(Z^{[l]})$
- Gradient flow: Forward goes left-to-right, backward goes right-to-left
- Don’t need $dA^{[0]}$: Gradient w.r.t. input features is not used for training
- Complete iteration: Forward → Loss → Backward → Update
- NumPy details: Use `keepdims=True` with `np.sum` to maintain dimensions
- Activation functions: ReLU for hidden layers, sigmoid for binary classification output
- The math works: These are standard calculus equations (chain rule + matrix derivatives)
- Practice makes perfect: Implementing in code makes abstract equations concrete!