
Random Initialization

Table of contents

  1. Introduction
  2. Why Not Initialize to Zero?
  3. The Solution: Random Initialization
  4. Why Small Random Values?
  5. Complete Initialization Example
  6. Comparison: Zero vs Random Initialization
  7. Advanced: Better Constants Than 0.01
  8. Visualization: Symmetry Breaking
  9. Implementation Checklist
  10. What’s Next
  11. Key Takeaways

Introduction

When training a neural network, weight initialization is critical for successful learning. Unlike logistic regression, where initializing the weights to zero works fine, a neural network requires random weight initialization. This lesson explains why zero initialization fails and how to initialize weights properly.

Key Point: Biases can be initialized to zero, but weights must be random!

Why Not Initialize to Zero?

The Problem: Symmetry

Let’s examine what happens with zero initialization using a simple example:

[Figure: A two-layer network ($n^{[1]} = 2$, $n^{[2]} = 1$) with inputs $x_1, x_2$, hidden units $a_1^{[1]}, a_2^{[1]}$, and output $a_1^{[2]} = \hat{y}$, where $W^{[1]}$ and $W^{[2]}$ are zero matrices. Annotations show that $a_1^{[1]} = a_2^{[1]}$, that $dW^{[1]}$ has identical rows, and that the update $W^{[1]} := W^{[1]} - \alpha \, dW^{[1]}$ leaves the hidden units symmetric, so $W^{[1]}$ stays a matrix of identical rows even after updates.]

Network architecture:

  • Input features: $n^{[0]} = 2$
  • Hidden units: $n^{[1]} = 2$
  • Output units: $n^{[2]} = 1$

Zero initialization:

\[W^{[1]} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}, \quad b^{[1]} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}\] \[W^{[2]} = \begin{bmatrix} 0 & 0 \end{bmatrix}, \quad b^{[2]} = 0\]

What Goes Wrong?

Forward propagation:

\[Z^{[1]} = W^{[1]} X + b^{[1]} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}\] \[A^{[1]} = g(Z^{[1]}) = \begin{bmatrix} g(0) \\ g(0) \end{bmatrix}\]

Result: Both hidden units compute identical activations!

\[a_1^{[1]} = a_2^{[1]}\]

This means:

  • Hidden unit 1 and hidden unit 2 compute the same function
  • They have the same influence on the output
  • They are completely symmetric
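A minimal sketch, assuming a tanh hidden activation, confirms that both hidden units produce identical activations on every input:

import numpy as np

# Zero initialization for a 2-input, 2-hidden-unit layer
W1 = np.zeros((2, 2))
b1 = np.zeros((2, 1))

X = np.random.randn(2, 5)         # 5 random example inputs
Z1 = W1 @ X + b1                  # every entry is 0
A1 = np.tanh(Z1)                  # tanh(0) = 0 for both units
print(np.allclose(A1[0], A1[1]))  # True: the two hidden units match on every example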

The Symmetry Persists During Training

Backpropagation:

Because the hidden units are symmetric, their gradients are also identical:

\[dZ_1^{[1]} = dZ_2^{[1]}\]

This means the gradient matrix $dW^{[1]}$ has identical rows:

\[dW^{[1]} = \begin{bmatrix} \text{same values} \\ \text{same values} \end{bmatrix}\]

Weight update:

\[W^{[1]} := W^{[1]} - \alpha \, dW^{[1]}\]

After the update, $W^{[1]}$ still has identical rows!

Proof by Induction

We can prove that symmetry persists forever:

Base case (iteration 0):

  • Both hidden units are identical: $w_1 = w_2 = 0$

Inductive step:

  • If hidden units are identical at iteration $t$, then:
    • They compute the same function
    • They produce the same gradients
    • Weight updates keep them identical
    • They remain identical at iteration $t+1$

Conclusion: No matter how long you train, hidden units remain symmetric!
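The induction can also be checked empirically. The sketch below, assuming a tanh hidden layer, a sigmoid output, and arbitrary synthetic labels, runs 1000 gradient-descent steps from zero initialization and confirms that the rows of $W^{[1]}$ never separate:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

np.random.seed(0)
X = np.random.randn(2, 200)                   # synthetic inputs
Y = (X[0:1] * X[1:2] > 0).astype(float)       # arbitrary synthetic labels

W1 = np.zeros((2, 2)); b1 = np.zeros((2, 1))  # zero initialization everywhere
W2 = np.zeros((1, 2)); b2 = np.zeros((1, 1))
alpha = 0.5
m = X.shape[1]

for _ in range(1000):
    # Forward propagation
    A1 = np.tanh(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    # Backpropagation for the cross-entropy loss
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.mean(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.mean(axis=1, keepdims=True)
    # Gradient descent updates
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2

print(np.allclose(W1[0], W1[1]))  # True: rows of W1 remain identical
                                  # (here they even stay exactly zero)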

Why This Is Bad

If all hidden units compute the same function, then having $n^{[1]}$ hidden units is no better than having just 1 hidden unit!

\[\text{Multiple identical units} = \text{Wasted computation}\]

The network cannot learn diverse features, which is the whole point of having multiple hidden units.

The Solution: Random Initialization

[Figure: The same two-layer network with random initialization: $W^{[1]}$ = np.random.randn(2, 2) * 0.01, $b^{[1]}$ = np.zeros((2, 1)), $W^{[2]}$ = np.random.randn(1, 2) * 0.01, $b^{[2]} = 0$, alongside the forward-propagation equations $z^{[1]} = W^{[1]} X + b^{[1]}$ and $a^{[2]} = g^{[2]}(z^{[2]})$, and a sigmoid curve illustrating the nonlinear activation.]

Breaking Symmetry

To make different hidden units learn different functions, initialize weights randomly:

import numpy as np

# Layer sizes: n0 inputs, n1 hidden units, n2 outputs
n0, n1, n2 = 2, 2, 1

# Initialize weights randomly (small values)
W1 = np.random.randn(n1, n0) * 0.01
b1 = np.zeros((n1, 1))  # Biases can be zero

W2 = np.random.randn(n2, n1) * 0.01
b2 = np.zeros((n2, 1))  # Biases can be zero

Why This Works

Random weights → Different initial values → Different computations → Symmetry broken!

\[W^{[1]} = \begin{bmatrix} 0.0053 & -0.0023 \\ 0.0097 & 0.0041 \end{bmatrix}\]

Now:

  • $w_1 \neq w_2$ → Different weights for each unit
  • $a_1^{[1]} \neq a_2^{[1]}$ → Different activations
  • $dW_1 \neq dW_2$ → Different gradients
  • Units evolve differently during training ✅

Why Biases Can Be Zero

Important distinction:

\[b^{[1]} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \quad \text{← This is okay!}\]

Why? As long as $W^{[1]}$ is random, the hidden units compute different functions:

\[z_1^{[1]} = w_1^T x + 0 \neq w_2^T x + 0 = z_2^{[1]}\]

The symmetry is already broken by different $w_1$ and $w_2$, so bias initialization doesn’t matter.
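A quick numeric check, using an arbitrary input, shows the pre-activations already differ with random weights and zero biases:

import numpy as np

np.random.seed(1)
W1 = np.random.randn(2, 2) * 0.01  # random weights
b1 = np.zeros((2, 1))              # zero biases
x = np.random.randn(2, 1)

z = W1 @ x + b1
print(z[0], z[1])                  # different values: symmetry is already broken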

Why Small Random Values?

The Scaling Factor: 0.01

You might wonder: why multiply by 0.01? Why not 100 or 1000?

W1 = np.random.randn(n1, n0) * 0.01  # Why 0.01?

Reason 1: Avoiding Saturation

For sigmoid and tanh activation functions:

Forward propagation:

\[Z^{[1]} = W^{[1]} X + b^{[1]}\] \[A^{[1]} = g(Z^{[1]})\]

If $W$ is too large → $Z$ takes on very large positive or negative values → Activations saturate!

Sigmoid Saturation

[Figure: Sigmoid curve showing saturation in both tails]

\[\sigma(z) = \frac{1}{1 + e^{-z}}\]

Problem regions:

  • When $z \gg 0$: $\sigma(z) \approx 1$, $\sigma'(z) \approx 0$
  • When $z \ll 0$: $\sigma(z) \approx 0$, $\sigma'(z) \approx 0$

Gradient in saturated region:

\[\sigma'(z) = \sigma(z)(1 - \sigma(z)) \approx 0\]

Result: Vanishing gradients → Very slow learning!

Tanh Saturation

\[\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\]

Problem regions:

  • When $z \gg 0$: $\tanh(z) \approx 1$, $\tanh'(z) \approx 0$
  • When $z \ll 0$: $\tanh(z) \approx -1$, $\tanh'(z) \approx 0$

Gradient in saturated region:

\[\tanh'(z) = 1 - \tanh^2(z) \approx 0\]
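Evaluating both derivatives numerically makes the saturation concrete (a small check using only the formulas above):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for z in (0.0, 2.0, 5.0, 10.0):
    sg = sigmoid(z) * (1 - sigmoid(z))  # sigma'(z)
    tg = 1 - np.tanh(z) ** 2            # tanh'(z)
    print(f"z={z:5.1f}  sigmoid'={sg:.6f}  tanh'={tg:.9f}")

# z=0 gives the largest gradients (0.25 and 1.0);
# by z=10 both are effectively zero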

Summary: Large Weights → Slow Learning

Chain of events:

\[\text{Large } W \rightarrow \text{Large } |Z| \rightarrow \text{Saturated activations} \rightarrow \text{Small gradients} \rightarrow \text{Slow learning}\]

Solution: Initialize with small values (like 0.01) to keep $Z$ in the responsive region.
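The following sketch, using random inputs and arbitrary layer sizes, compares initialization scales and shows how large weights push the sigmoid into its flat regions:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

np.random.seed(2)
X = np.random.randn(100, 1000)            # 100 features, 1000 examples

for scale in (0.01, 1.0, 100.0):
    W = np.random.randn(50, 100) * scale  # 50 hidden units
    A = sigmoid(W @ X)
    grad = A * (1 - A)                    # sigma'(z) at each pre-activation
    print(f"scale={scale}: mean sigma'(z) = {grad.mean():.4f}")

# scale=0.01 keeps gradients near the 0.25 maximum;
# scale=100 drives them toward zero (saturation)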

Reason 2: Output Layer Concerns

For binary classification, the output uses sigmoid:

\[\hat{y} = \sigma(z^{[L]}) = \sigma(W^{[L]} a^{[L-1]} + b^{[L]})\]

If $W^{[L]}$ is large → $z^{[L]}$ is large → Output saturates at 0 or 1 immediately → No learning!

When Is This Less Critical?

For ReLU activation functions, saturation is less of an issue:

\[\text{ReLU}(z) = \max(0, z)\]

Why?

  • No saturation for $z > 0$ (gradient is always 1)
  • Only “dies” for $z < 0$ (gradient is 0)

But small initialization is still generally recommended!
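To see the contrast numerically, here is a quick sketch using the gradient formulas above:

import numpy as np

z = np.array([-5.0, -0.1, 0.1, 5.0, 50.0])
relu_grad = (z > 0).astype(float)  # ReLU'(z): 0 for z < 0, 1 for z > 0
tanh_grad = 1 - np.tanh(z) ** 2    # decays toward 0 as |z| grows
print(relu_grad)                   # [0. 0. 1. 1. 1.] - no decay for large positive z
print(tanh_grad)                   # entries for |z| >= 5 are near zero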

Complete Initialization Example

import numpy as np

def initialize_parameters(n_x, n_h, n_y):
    """
    Initialize parameters with random weights and zero biases
    
    Args:
        n_x: size of input layer
        n_h: size of hidden layer
        n_y: size of output layer
    
    Returns:
        parameters: dictionary containing W1, b1, W2, b2
    """
    # Random initialization for weights (small values)
    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))
    
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    
    parameters = {
        "W1": W1,
        "b1": b1,
        "W2": W2,
        "b2": b2
    }
    
    return parameters

# Example usage
n_x = 3  # input features
n_h = 4  # hidden units
n_y = 1  # output units

params = initialize_parameters(n_x, n_h, n_y)

print("W1 shape:", params["W1"].shape)  # (4, 3)
print("b1 shape:", params["b1"].shape)  # (4, 1)
print("W2 shape:", params["W2"].shape)  # (1, 4)
print("b2 shape:", params["b2"].shape)  # (1, 1)

print("\nW1 sample values:")
print(params["W1"])  # Small random values around 0

Comparison: Zero vs Random Initialization

| Aspect | Zero Initialization | Random Initialization |
| --- | --- | --- |
| Symmetry | All hidden units identical | All hidden units different |
| Learning | No learning (units stay the same) | Successful learning |
| Gradients | All rows identical | Different for each unit |
| Effective units | Only 1 (others redundant) | All $n^{[1]}$ units useful |
| Feature diversity | ❌ No diverse features | ✅ Learns diverse features |
| Use case | ❌ Never use for NN | ✅ Always use for NN |

Advanced: Better Constants Than 0.01

For Shallow Networks

For networks with one hidden layer (shallow networks), 0.01 works well:

W = np.random.randn(n_out, n_in) * 0.01  # Good for shallow networks

For Deep Networks

For very deep networks (many layers), you might need different initialization strategies:

Xavier Initialization (for sigmoid/tanh):

\[W^{[l]} = \text{np.random.randn}(n^{[l]}, n^{[l-1]}) \times \sqrt{\frac{1}{n^{[l-1]}}}\]

He Initialization (for ReLU):

\[W^{[l]} = \text{np.random.randn}(n^{[l]}, n^{[l-1]}) \times \sqrt{\frac{2}{n^{[l-1]}}}\]

These scale the initialization based on layer size to maintain stable gradients.
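As a sketch, all three scalings can be wrapped in one helper (initialize_layer is a hypothetical name, not a standard API):

import numpy as np

def initialize_layer(n_out, n_in, method="he"):
    """Hypothetical helper: weight matrix of shape (n_out, n_in)."""
    if method == "xavier":       # suited to sigmoid/tanh
        scale = np.sqrt(1.0 / n_in)
    elif method == "he":         # suited to ReLU
        scale = np.sqrt(2.0 / n_in)
    else:                        # fixed small constant
        scale = 0.01
    return np.random.randn(n_out, n_in) * scale

W = initialize_layer(4, 3, method="he")
print(W.std())  # roughly sqrt(2/3) ≈ 0.82 for He initialization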

Note: We’ll cover advanced initialization strategies in next week’s material on deep neural networks!

Visualization: Symmetry Breaking

Zero Initialization (Bad)

Before training:
Hidden Unit 1: [0, 0] → same function
Hidden Unit 2: [0, 0] → same function

After 1000 iterations:
Hidden Unit 1: [0.5, 0.3] → STILL same function
Hidden Unit 2: [0.5, 0.3] → STILL same function

Result: Wasted capacity! ❌

Random Initialization (Good)

Before training:
Hidden Unit 1: [0.01, -0.02] → different function
Hidden Unit 2: [0.03,  0.01] → different function

After 1000 iterations:
Hidden Unit 1: [0.8, -0.4] → detects feature A
Hidden Unit 2: [0.2,  0.9] → detects feature B

Result: Diverse features learned! ✅

Implementation Checklist

When initializing your neural network:

  • Initialize weights randomly using np.random.randn()
  • Multiply by small constant (typically 0.01)
  • Initialize biases to zero using np.zeros()
  • Verify dimensions match network architecture (see the sanity-check sketch after this list)
  • For shallow networks: use 0.01
  • For deep networks: consider Xavier/He initialization (coming later)
  • Never initialize all weights to zero
  • Never initialize all weights to the same value
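The last few checks can be mechanized; below is a minimal sanity-check sketch (check_initialization is a hypothetical helper) for the parameters dictionary built earlier:

import numpy as np

def check_initialization(parameters):
    """Hypothetical sanity check for a parameters dictionary."""
    for name, p in parameters.items():
        if name.startswith("W"):
            # Weights must not all share one value (symmetry!)
            assert not np.allclose(p, p.flat[0]), f"{name}: identical entries"
            # Weights should start small to avoid saturation
            assert np.abs(p).max() < 1.0, f"{name}: values look too large"
        else:
            assert np.allclose(p, 0), f"{name}: biases should start at zero"
    print("Initialization looks OK")

check_initialization(params)  # params from initialize_parameters above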

What’s Next

Congratulations! You now understand:

  • How to set up a neural network with one hidden layer
  • How to initialize parameters correctly
  • How to compute predictions using forward propagation
  • How to compute derivatives using backpropagation
  • How to implement gradient descent

You’re ready for:

  • This week’s quizzes
  • Programming exercises
  • Week 4 material on deep neural networks

Key Takeaways

  1. Zero initialization fails: All hidden units become identical (symmetry problem)
  2. Random initialization required: Breaks symmetry so units learn different features
  3. Biases can be zero: Only weights need to be random
  4. Use small values: Multiply by 0.01 to avoid saturation
  5. Saturation problem: Large weights → Large $Z$ → Small gradients → Slow learning
  6. Proof by induction: Symmetry persists forever with zero initialization
  7. No redundant units: Random init ensures all hidden units contribute
  8. Shallow vs deep: Deep networks may need different initialization constants
  9. Xavier/He initialization: Better strategies for deep networks (coming later)
  10. Always verify dimensions: Check parameter shapes match architecture
  11. Logistic regression exception: Zero init works for logistic regression (no hidden layers)
  12. ReLU less sensitive: But small initialization still recommended
  13. Binary classification: Especially important to keep weights small for output sigmoid
  14. Gaussian distribution: np.random.randn() samples from $\mathcal{N}(0, 1)$
  15. Feature diversity: The whole point of multiple hidden units is learning different features!