Building Blocks of Deep Neural Networks
Table of contents
- Introduction
- The Single Layer Building Block
- Building the Complete Deep Network
- One Training Iteration
- Implementation Details: The Cache
- What’s Next
- Key Takeaways
Introduction
You’ve already learned the fundamentals of forward propagation and backpropagation from previous lessons. Now let’s see how to modularize these components to build a complete deep neural network.
The key insight: Think of each layer as a reusable building block with:
- A forward function (compute activations)
- A backward function (compute gradients)
- A cache (store values for backpropagation)
Design principle: Building modular, composable functions makes your code clean, debuggable, and reusable!
![Neural network architecture diagram showing forward and backward propagation through layers. Left side shows a multi-layer network with input features x1 through x4 flowing through hidden layers to output ŷ. Annotations indicate layer l with parameters W[l] and b[l], forward propagation with input a[l-1] and output a[l], and backward propagation with gradients da[l] flowing back through cached values z[l]. Right side shows a detailed computation flow diagram for two layers: layer l computes W[l], b[l] from input a[l-1] to produce output a[l] using cache z[l], while layer l-1 below it shows backpropagation computing gradients dW[l], db[l], and da[l-1] from da[l] using cached z[l]. Red arrows indicate backward gradient flow between layers. The diagram illustrates the modular building block concept where each layer performs forward computation and caches values needed for backward gradient computation.](/assets/images/deep-learning/neural-networks/week-4/forward_backward_propagation_diagram.png)
The Single Layer Building Block
Focusing on One Layer
Let’s zoom in on a single layer $l$ in a deep network and understand what it needs to do.
... → [Layer l-1] → [Layer l] → [Layer l+1] → ...
           ↓            ↓            ↓
        a^[l-1]       a^[l]        a^[l+1]
Layer $l$ has:
- Parameters: $W^{[l]}, b^{[l]}$
- Input: $a^{[l-1]}$ (activations from previous layer)
- Output: $a^{[l]}$ (activations for this layer)
Forward Propagation: Single Layer
The Forward Function
Function signature:
def forward_propagation_layer_l(A_prev, W, b, activation):
    """
    Forward propagation for a single layer

    Args:
        A_prev: activations from previous layer (a^[l-1])
        W: weight matrix for this layer (W^[l])
        b: bias vector for this layer (b^[l])
        activation: activation function ("relu" or "sigmoid")

    Returns:
        A: activations for this layer (a^[l])
        cache: stored values needed for backprop
    """
Forward Computation Steps
Step 1: Linear transformation
\[z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}\]
Step 2: Activation function
\[a^{[l]} = g^{[l]}(z^{[l]})\]
Step 3: Cache values for backprop
Store $z^{[l]}$ (and optionally $W^{[l]}, b^{[l]}, a^{[l-1]}$) for later use.
Why Cache $z^{[l]}$?
Critical: We need $z^{[l]}$ during backpropagation to compute gradients!
Reason: The gradient of the activation function depends on $z^{[l]}$:
\[\frac{\partial a^{[l]}}{\partial z^{[l]}} = g'^{[l]}(z^{[l]})\]
Complete Forward Function
def forward_propagation_layer_l(A_prev, W, b, activation):
    """
    Implement forward propagation for a single layer
    """
    # Step 1: Linear transformation
    Z = np.dot(W, A_prev) + b

    # Step 2: Activation function
    if activation == "relu":
        A = relu(Z)
    elif activation == "sigmoid":
        A = sigmoid(Z)

    # Step 3: Cache values for backprop
    cache = {
        'Z': Z,
        'A_prev': A_prev,
        'W': W,
        'b': b
    }

    return A, cache
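Note that `relu` and `sigmoid` are assumed helper functions here (they are not defined in this lesson); a minimal NumPy sketch of what they might look like:

import numpy as np

def relu(Z):
    # Element-wise max(0, z)
    return np.maximum(0, Z)

def sigmoid(Z):
    # Element-wise 1 / (1 + exp(-z))
    return 1 / (1 + np.exp(-Z))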
Backward Propagation: Single Layer
The Backward Function
Function signature:
def backward_propagation_layer_l(dA, cache, activation):
    """
    Backward propagation for a single layer

    Args:
        dA: gradient of cost w.r.t. activations (da^[l])
        cache: values stored during forward prop (Z, A_prev, W, b)
        activation: activation function used ("relu" or "sigmoid")

    Returns:
        dA_prev: gradient w.r.t. previous activations (da^[l-1])
        dW: gradient w.r.t. weights (dW^[l])
        db: gradient w.r.t. bias (db^[l])
    """
Backward Computation Steps
Given: $\frac{\partial \mathcal{L}}{\partial a^{[l]}}$ (gradient flowing back from next layer)
Step 1: Gradient of activation
\[\frac{\partial \mathcal{L}}{\partial z^{[l]}} = \frac{\partial \mathcal{L}}{\partial a^{[l]}} \cdot \frac{\partial a^{[l]}}{\partial z^{[l]}} = da^{[l]} \cdot g'^{[l]}(z^{[l]})\]
Step 2: Gradient w.r.t. weights
\[\frac{\partial \mathcal{L}}{\partial W^{[l]}} = \frac{\partial \mathcal{L}}{\partial z^{[l]}} \cdot (a^{[l-1]})^T\]
Step 3: Gradient w.r.t. bias
\[\frac{\partial \mathcal{L}}{\partial b^{[l]}} = \frac{\partial \mathcal{L}}{\partial z^{[l]}}\]
(sum across training examples for vectorized implementation)
Step 4: Gradient w.r.t. previous activations
\[\frac{\partial \mathcal{L}}{\partial a^{[l-1]}} = (W^{[l]})^T \cdot \frac{\partial \mathcal{L}}{\partial z^{[l]}}\]
This gradient flows back to the previous layer!
Complete Backward Function
def backward_propagation_layer_l(dA, cache, activation):
    """
    Implement backward propagation for a single layer
    """
    # Retrieve cached values
    Z = cache['Z']
    A_prev = cache['A_prev']
    W = cache['W']
    b = cache['b']
    m = A_prev.shape[1]  # number of examples

    # Step 1: Gradient of activation function
    if activation == "relu":
        dZ = dA * relu_derivative(Z)
    elif activation == "sigmoid":
        dZ = dA * sigmoid_derivative(Z)

    # Step 2: Gradient w.r.t. weights
    dW = (1/m) * np.dot(dZ, A_prev.T)

    # Step 3: Gradient w.r.t. bias
    db = (1/m) * np.sum(dZ, axis=1, keepdims=True)

    # Step 4: Gradient w.r.t. previous activations
    dA_prev = np.dot(W.T, dZ)

    return dA_prev, dW, db
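Likewise, `relu_derivative` and `sigmoid_derivative` are assumed helpers that compute $g'^{[l]}(z^{[l]})$ element-wise; a minimal sketch:

def relu_derivative(Z):
    # g'(z) = 1 where z > 0, else 0
    return (Z > 0).astype(float)

def sigmoid_derivative(Z):
    # g'(z) = sigmoid(z) * (1 - sigmoid(z))
    s = 1 / (1 + np.exp(-Z))
    return s * (1 - s)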
Summary: Single Layer Functions
Forward Function
| Input | Processing | Output |
|---|---|---|
| $a^{[l-1]}$ | $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$ | $a^{[l]}$ |
| $W^{[l]}, b^{[l]}$ | $a^{[l]} = g^{[l]}(z^{[l]})$ | cache |
Cache contains: $z^{[l]}, a^{[l-1]}, W^{[l]}, b^{[l]}$
Backward Function
| Input | Processing | Output |
|---|---|---|
| $da^{[l]}$ | $dz^{[l]} = da^{[l]} \cdot g'^{[l]}(z^{[l]})$ | $da^{[l-1]}$ |
| cache | $dW^{[l]} = \frac{1}{m} dz^{[l]} (a^{[l-1]})^T$ | $dW^{[l]}$ |
| | $db^{[l]} = \frac{1}{m} \sum dz^{[l]}$ | $db^{[l]}$ |
| | $da^{[l-1]} = (W^{[l]})^T dz^{[l]}$ | |
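As a sanity check on these formulas, the shapes work out (using the $(n^{[l]}, m)$ shape convention from earlier lessons):
\[dz^{[l]}: (n^{[l]}, m), \quad dW^{[l]}: (n^{[l]}, n^{[l-1]}), \quad db^{[l]}: (n^{[l]}, 1), \quad da^{[l-1]}: (n^{[l-1]}, m)\]
In particular, $dW^{[l]}$ and $db^{[l]}$ have exactly the same shapes as $W^{[l]}$ and $b^{[l]}$, which is what the gradient descent update requires.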
Building the Complete Deep Network
Forward Propagation: Full Network
Now let’s chain these layer building blocks together!
Forward Pass Through All Layers
Starting point: $a^{[0]} = X$ (input features)
X (a^[0])
↓
[Layer 1: forward] → a^[1], cache₁
↓
[Layer 2: forward] → a^[2], cache₂
↓
[Layer 3: forward] → a^[3], cache₃
↓
...
↓
[Layer L: forward] → a^[L] = ŷ, cache_L
Result: Predictions $\hat{Y} = a^{[L]}$ and caches for all layers
Forward Propagation Algorithm
def L_layer_forward(X, parameters):
    """
    Forward propagation through all L layers

    Args:
        X: input data (n^[0], m)
        parameters: dict with W1, b1, W2, b2, ..., WL, bL

    Returns:
        AL: final predictions (n^[L], m)
        caches: list of caches for all layers
    """
    caches = []
    A = X
    L = len(parameters) // 2  # number of layers

    # Forward through layers 1 to L-1 (ReLU activation)
    for l in range(1, L):
        A_prev = A
        W = parameters[f'W{l}']
        b = parameters[f'b{l}']
        A, cache = forward_propagation_layer_l(A_prev, W, b, "relu")
        caches.append(cache)

    # Forward through layer L (sigmoid activation for binary classification)
    WL = parameters[f'W{L}']
    bL = parameters[f'b{L}']
    AL, cache = forward_propagation_layer_l(A, WL, bL, "sigmoid")
    caches.append(cache)

    return AL, caches
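As a quick sanity check, you can run this on a hypothetical toy network (the shapes below are made up for illustration) and confirm the output has one prediction per example:

import numpy as np

np.random.seed(1)
layer_dims = [4, 5, 3, 1]          # toy 3-layer network: 4 -> 5 -> 3 -> 1
X = np.random.randn(4, 8)          # 8 training examples

parameters = {}
for l in range(1, len(layer_dims)):
    parameters[f'W{l}'] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01
    parameters[f'b{l}'] = np.zeros((layer_dims[l], 1))

AL, caches = L_layer_forward(X, parameters)
print(AL.shape)      # (1, 8) -- one prediction per example
print(len(caches))   # 3 -- one cache per layer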
What Gets Cached?
For each layer $l = 1, 2, \ldots, L$, we store:
\[\text{cache}_l = \{z^{[l]}, a^{[l-1]}, W^{[l]}, b^{[l]}\}\]
Why cache all these values?
| Value | Why Cache It? |
|---|---|
| $z^{[l]}$ | Needed to compute $dz^{[l]} = da^{[l]} \cdot g'^{[l]}(z^{[l]})$ |
| $a^{[l-1]}$ | Needed to compute $dW^{[l]} = dz^{[l]} (a^{[l-1]})^T$ |
| $W^{[l]}$ | Needed to compute $da^{[l-1]} = (W^{[l]})^T dz^{[l]}$ |
| $b^{[l]}$ | Convenient for implementation (optional) |
Backward Propagation: Full Network
After forward propagation, we have predictions $\hat{Y} = A^{[L]}$. Now we compute gradients!
Backward Pass Through All Layers
Starting point: $\frac{\partial \mathcal{L}}{\partial A^{[L]}}$ (gradient of loss w.r.t. predictions)
dA^[L] (from loss function)
↓
[Layer L: backward] → dA^[L-1], dW^[L], db^[L]
↓
[Layer L-1: backward] → dA^[L-2], dW^[L-1], db^[L-1]
↓
[Layer L-2: backward] → dA^[L-3], dW^[L-2], db^[L-2]
↓
...
↓
[Layer 1: backward] → dA^[0], dW^[1], db^[1]
                         ↑
              (not used for training)
Result: Gradients $dW^{[l]}, db^{[l]}$ for all layers
Note: We don’t actually need $dA^{[0]}$ (gradient w.r.t. input features) for supervised learning, so we can stop at layer 1.
Backward Propagation Algorithm
def L_layer_backward(AL, Y, caches):
    """
    Backward propagation through all L layers

    Args:
        AL: final predictions (n^[L], m)
        Y: true labels (n^[L], m)
        caches: list of caches from forward prop

    Returns:
        grads: dict with dW1, db1, dW2, db2, ..., dWL, dbL
    """
    grads = {}
    L = len(caches)
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)

    # Gradient of loss w.r.t. AL (for binary cross-entropy)
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

    # Backward through layer L (sigmoid activation)
    current_cache = caches[L-1]
    dA_prev, dW, db = backward_propagation_layer_l(dAL, current_cache, "sigmoid")
    grads[f'dW{L}'] = dW
    grads[f'db{L}'] = db

    # Backward through layers L-1 to 1 (ReLU activation)
    for l in reversed(range(L-1)):
        current_cache = caches[l]
        dA_prev, dW, db = backward_propagation_layer_l(dA_prev, current_cache, "relu")
        grads[f'dW{l+1}'] = dW
        grads[f'db{l+1}'] = db

    return grads
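Reusing `X`, `parameters`, and `layer_dims` from the toy forward-pass check above (a hypothetical setup), you can verify that every gradient has the same shape as the parameter it will update:

Y = (np.random.rand(1, 8) > 0.5).astype(float)   # random binary labels for the toy example

AL, caches = L_layer_forward(X, parameters)
grads = L_layer_backward(AL, Y, caches)

for l in range(1, len(layer_dims)):
    assert grads[f'dW{l}'].shape == parameters[f'W{l}'].shape
    assert grads[f'db{l}'].shape == parameters[f'b{l}'].shape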
Complete Network Diagram
Forward and Backward Flow
![Neural network architecture diagram showing forward and backward propagation through layers. Left side shows a multi-layer network with input features x1 through x4 flowing through hidden layers to output ŷ. Annotations indicate layer l with parameters W[l] and b[l], forward propagation with input a[l-1] and output a[l], and backward propagation with gradients da[l] flowing back through cached values z[l]. Right side shows a detailed computation flow diagram for two layers: layer l computes W[l], b[l] from input a[l-1] to produce output a[l] using cache z[l], while layer l-1 below it shows backpropagation computing gradients dW[l], db[l], and da[l-1] from da[l] using cached z[l]. Red arrows indicate backward gradient flow between layers. The diagram illustrates the modular building block concept where each layer performs forward computation and caches values needed for backward gradient computation.](/assets/images/deep-learning/neural-networks/week-4/forward-backward-propagation.png)
Data flow:
- Forward: Activations flow left-to-right, caches stored
- Backward: Gradients flow right-to-left, using cached values
One Training Iteration
Complete Training Step
One iteration of gradient descent involves:
Step 1: Forward Propagation
# Compute predictions
AL, caches = L_layer_forward(X, parameters)
Step 2: Compute Loss
# Binary cross-entropy loss
cost = compute_cost(AL, Y)
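`compute_cost` isn't implemented in this lesson; for binary cross-entropy it might look like this minimal sketch (name and signature assumed from the call above):

def compute_cost(AL, Y):
    # Binary cross-entropy, averaged over the m examples
    m = Y.shape[1]
    cost = -(1 / m) * np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))
    return np.squeeze(cost)   # make sure a scalar is returned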
Step 3: Backward Propagation
# Compute gradients
grads = L_layer_backward(AL, Y, caches)
Step 4: Update Parameters
# Gradient descent update
for l in range(1, L + 1):
    parameters[f'W{l}'] = parameters[f'W{l}'] - alpha * grads[f'dW{l}']
    parameters[f'b{l}'] = parameters[f'b{l}'] - alpha * grads[f'db{l}']
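The training loop in the next section calls an `update_parameters` helper; wrapping the Step 4 update rule in a function could look like this sketch (the function name is assumed from that call):

def update_parameters(parameters, grads, learning_rate):
    # One gradient descent step on every W^[l] and b^[l]
    L = len(parameters) // 2
    for l in range(1, L + 1):
        parameters[f'W{l}'] = parameters[f'W{l}'] - learning_rate * grads[f'dW{l}']
        parameters[f'b{l}'] = parameters[f'b{l}'] - learning_rate * grads[f'db{l}']
    return parameters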
Training Loop
def train_deep_network(X, Y, layer_dims, learning_rate=0.01, num_iterations=3000):
    """
    Train a deep neural network

    Args:
        X: training data (n^[0], m)
        Y: labels (1, m)
        layer_dims: list [n^[0], n^[1], ..., n^[L]]
        learning_rate: alpha for gradient descent
        num_iterations: number of training iterations

    Returns:
        parameters: trained weights and biases
        costs: cost recorded every 100 iterations
    """
    # Initialize parameters
    parameters = initialize_parameters(layer_dims)
    costs = []

    # Training loop
    for i in range(num_iterations):
        # Forward propagation
        AL, caches = L_layer_forward(X, parameters)

        # Compute cost
        cost = compute_cost(AL, Y)

        # Backward propagation
        grads = L_layer_backward(AL, Y, caches)

        # Update parameters
        parameters = update_parameters(parameters, grads, learning_rate)

        # Print progress
        if i % 100 == 0:
            costs.append(cost)
            print(f"Cost after iteration {i}: {cost}")

    return parameters, costs
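`initialize_parameters` is also assumed here; one common (but not the only) choice is small random weights and zero biases, as in this sketch:

def initialize_parameters(layer_dims):
    # Random-initialize W^[l] and zero-initialize b^[l] for l = 1..L
    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters[f'W{l}'] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01
        parameters[f'b{l}'] = np.zeros((layer_dims[l], 1))
    return parameters

With that in place, a call like `train_deep_network(X, Y, layer_dims=[4, 5, 1])` would run the full forward → loss → backward → update loop on toy data of shape (4, m).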
Implementation Details: The Cache
What Actually Goes in the Cache?
Conceptually: We said the cache stores $z^{[l]}$ for backpropagation.
In practice: It’s convenient to store more!
Minimal Cache (Conceptual)
cache = {
    'Z': Z  # Just z^[l]
}
This is enough mathematically, but you’d need to pass parameters separately.
Practical Cache (Implementation)
cache = {
    'Z': Z,            # z^[l] - needed for gradient of activation
    'A_prev': A_prev,  # a^[l-1] - needed for dW calculation
    'W': W,            # W^[l] - needed for dA_prev calculation
    'b': b             # b^[l] - convenient for consistency
}
Why include $W^{[l]}$ and $b^{[l]}$?
- Convenience: Don’t need to pass parameters separately to backward function
- Cleaner API: Backward function is self-contained
- Minimal overhead: Storing references to existing arrays
Implementation tip: In your programming exercise, you’ll see this extended cache structure—it’s just a practical design choice!
Cache Flow Example
# Forward propagation
def forward_layer(A_prev, W, b):
    Z = np.dot(W, A_prev) + b
    A = relu(Z)
    # Store everything we might need later
    cache = (Z, A_prev, W, b)
    return A, cache

# Backward propagation
def backward_layer(dA, cache):
    # Unpack cache
    Z, A_prev, W, b = cache
    # Now we have everything we need!
    dZ = dA * relu_derivative(Z)
    dW = np.dot(dZ, A_prev.T)    # (the 1/m averaging is omitted here for brevity)
    db = np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)
    return dA_prev, dW, db
Module Summary
Each layer is a self-contained module:
| Component | Input | Output | Uses |
|---|---|---|---|
| Forward | $a^{[l-1]}$ | $a^{[l]}$, cache | $W^{[l]}, b^{[l]}$ |
| Backward | $da^{[l]}$, cache | $da^{[l-1]}, dW^{[l]}, db^{[l]}$ | cache |
| Cache | - | $z^{[l]}, a^{[l-1]}, W^{[l]}, b^{[l]}$ | - |
What’s Next
You now understand the modular architecture of deep neural networks! Each layer is a building block with:
- Forward function
- Backward function
- Cache for communication
In the next lesson, we’ll see the actual implementation of these building blocks with real code examples.
Preview: You’ll implement a complete deep neural network from scratch, putting all these pieces together!
Key Takeaways
- Modular design: Each layer is a reusable forward-backward block
- Forward function: Takes $a^{[l-1]}$, outputs $a^{[l]}$ and cache
- Backward function: Takes $da^{[l]}$ and cache, outputs $da^{[l-1]}, dW^{[l]}, db^{[l]}$
- Cache is crucial: Stores $z^{[l]}$ (and often $a^{[l-1]}, W^{[l]}, b^{[l]}$) for backprop
- Why cache $z^{[l]}$: Needed to compute activation function gradient $g'^{[l]}(z^{[l]})$
- Forward propagation: Chain layers left-to-right, accumulate caches
- Backward propagation: Chain layers right-to-left, use cached values
- Complete iteration: Forward → Loss → Backward → Update
- Parameter updates: $W^{[l]} \leftarrow W^{[l]} - \alpha \, dW^{[l]}$, $b^{[l]} \leftarrow b^{[l]} - \alpha \, db^{[l]}$
- No $da^{[0]}$ needed: Gradient w.r.t. input features not used in training
- Cache contents: Minimal = $z^{[l]}$; Practical = $z^{[l]}, a^{[l-1]}, W^{[l]}, b^{[l]}$
- Self-contained modules: Forward and backward functions have clean interfaces
- Composability: Stack any number of layers using the same building blocks
- Gradient flow: Activations forward, gradients backward
- Implementation convenience: Extended cache makes backward function self-contained