Why Do You Need Non-Linear Activation Functions?
Table of contents
- Introduction
- The Experiment: What if We Use Linear Activation?
- The Problem: Collapsing to Linear Regression
- What About Deep Networks?
- Mixed Activation Functions
- Why Non-Linearity is Essential
- The One Exception: Regression Output Layer
- Summary of Rules
- Complete Example: Housing Price Prediction
- Key Takeaways
Introduction
We’ve learned about various activation functions like ReLU, tanh, and sigmoid. But why do we need them at all? Why not just use linear functions (or no activation function)? Let’s explore why non-linear activation functions are essential for neural networks to work.
The Experiment: What if We Use Linear Activation?

Linear (Identity) Activation Function
Consider using a linear activation function:
\[g(z) = z\]
This is also called the identity activation function because it simply outputs whatever is input.
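As a quick illustration (the sample values below are arbitrary, not from the text), here is the identity activation next to ReLU so the contrast is visible:
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 1.5])

def linear(z):
    # Identity activation: output equals input
    return z

def relu(z):
    # Shown only for contrast: clips negative values to zero
    return np.maximum(0.0, z)

print(linear(z))  # [-2.  -0.5  0.   1.5]  (passed through unchanged)
print(relu(z))    # [0.  0.  0.  1.5]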
Let’s apply this to our 2-layer neural network:
\[z^{[1]} = W^{[1]} x + b^{[1]}\]
\[a^{[1]} = g(z^{[1]}) = z^{[1]} \quad \text{(linear activation)}\]
\[z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}\]
\[a^{[2]} = g(z^{[2]}) = z^{[2]} \quad \text{(linear activation)}\]
\[\hat{y} = a^{[2]}\]
Seems reasonable, right? Let’s see what actually happens…
The Problem: Collapsing to Linear Regression
Mathematical Proof
Let’s substitute $a^{[1]} = z^{[1]}$ into the equation for $z^{[2]}$:
Step 1: Hidden layer output
\[a^{[1]} = z^{[1]} = W^{[1]} x + b^{[1]}\]
Step 2: Output layer computation
\[a^{[2]} = z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}\]
Step 3: Substitute $a^{[1]}$ from Step 1
\[a^{[2]} = W^{[2]} (W^{[1]} x + b^{[1]}) + b^{[2]}\]
Step 4: Distribute $W^{[2]}$
\[a^{[2]} = W^{[2]} W^{[1]} x + W^{[2]} b^{[1]} + b^{[2]}\]
Step 5: Simplify by defining new parameters
Let:
- $W' = W^{[2]} W^{[1]}$ (matrix multiplication)
- $b' = W^{[2]} b^{[1]} + b^{[2]}$
Then:
\[a^{[2]} = W' x + b'\]
The Devastating Result
\[\hat{y} = W' x + b'\]
This is just linear regression! The neural network with a hidden layer is computing exactly the same function as a model with no hidden layers at all.
Key insight: The composition of two linear functions is itself a linear function. No matter how many hidden layers you add, if all activations are linear, the entire network is equivalent to a single linear transformation.
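A minimal NumPy check makes this concrete (the layer sizes and random weights here are illustrative assumptions, not values from the text): the two-layer linear network and the collapsed model $W' x + b'$ produce identical outputs.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3)); b1 = rng.standard_normal((4, 1))  # hidden layer
W2 = rng.standard_normal((1, 4)); b2 = rng.standard_normal((1, 1))  # output layer
x = rng.standard_normal((3, 1))                                     # one input example

# Two-layer network with linear (identity) activations
a1 = np.dot(W1, x) + b1
a2 = np.dot(W2, a1) + b2

# Collapsed single-layer model: W' = W2 W1, b' = W2 b1 + b2
W_prime = np.dot(W2, W1)
b_prime = np.dot(W2, b1) + b2

print(np.allclose(a2, np.dot(W_prime, x) + b_prime))  # True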
What About Deep Networks?
Many Layers, Same Problem
Even with 10, 100, or 1000 hidden layers, if all use linear activation:
\[a^{[L]} = W^{[L]} W^{[L-1]} \cdots W^{[2]} W^{[1]} x + b'\]
This simplifies to:
\[a^{[L]} = W_{\text{combined}} x + b_{\text{combined}}\]
Result: Still just linear regression, regardless of depth!
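The same check extends to arbitrary depth. In this sketch (the number and width of layers are arbitrary assumptions), any stack of linear layers folds into a single $W_{\text{combined}}$ and $b_{\text{combined}}$:
import numpy as np

rng = np.random.default_rng(1)
sizes = [5, 8, 8, 8, 1]   # input, three hidden layers, output (arbitrary)
Ws = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
bs = [rng.standard_normal((sizes[i + 1], 1)) for i in range(len(sizes) - 1)]
x = rng.standard_normal((sizes[0], 1))

# Forward pass with linear activations in every layer
a = x
for W, b in zip(Ws, bs):
    a = np.dot(W, a) + b

# Fold all layers into one matrix and one bias
W_combined = np.eye(sizes[0])
b_combined = np.zeros((sizes[0], 1))
for W, b in zip(Ws, bs):
    W_combined = np.dot(W, W_combined)
    b_combined = np.dot(W, b_combined) + b

print(np.allclose(a, np.dot(W_combined, x) + b_combined))  # True, regardless of depth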
Hidden Layers Become Useless
With linear activations:
- Adding hidden layers provides no benefit
- The network cannot learn complex patterns
- You might as well use simple linear regression
Mixed Activation Functions
Linear Hidden + Sigmoid Output
What if we use linear activation in hidden layers but sigmoid at the output?
\[a^{[1]} = W^{[1]} x + b^{[1]} \quad \text{(linear)}\]
\[a^{[2]} = \sigma(W^{[2]} a^{[1]} + b^{[2]}) = \sigma(W' x + b')\]
Result: This is just logistic regression without any hidden layer! The model is no more expressive than standard logistic regression.
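Again, a small sketch (the shapes and random weights are illustrative) shows the prediction is identical to logistic regression applied directly to $x$:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
W1 = rng.standard_normal((4, 3)); b1 = rng.standard_normal((4, 1))
W2 = rng.standard_normal((1, 4)); b2 = rng.standard_normal((1, 1))
x = rng.standard_normal((3, 1))

# Linear hidden layer followed by a sigmoid output
a1 = np.dot(W1, x) + b1
a2 = sigmoid(np.dot(W2, a1) + b2)

# Plain logistic regression with W' = W2 W1 and b' = W2 b1 + b2
y_logreg = sigmoid(np.dot(np.dot(W2, W1), x) + np.dot(W2, b1) + b2)

print(np.allclose(a2, y_logreg))  # True: the hidden layer added nothing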
Why Non-Linearity is Essential
Breaking the Linear Composition
Non-linear activation functions like ReLU, tanh, or sigmoid break the chain of linear compositions:
\[a^{[1]} = \text{ReLU}(W^{[1]} x + b^{[1]})\]
\[a^{[2]} = \text{ReLU}(W^{[2]} a^{[1]} + b^{[2]})\]
Now the composition is non-linear, allowing the network to do all of the following (see the sketch after this list):
- Learn complex decision boundaries
- Approximate any function (universal approximation theorem)
- Extract hierarchical features
- Benefit from depth
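To see what even a tiny non-linearity buys, here is a hand-constructed two-unit ReLU layer (the weights are chosen by hand purely for illustration) that computes $|x|$ exactly, something no purely linear network can represent:
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# |x| = relu(x) + relu(-x): two hidden units with hand-picked weights
W1 = np.array([[1.0], [-1.0]]); b1 = np.zeros((2, 1))
W2 = np.array([[1.0, 1.0]]);    b2 = np.zeros((1, 1))

x = np.array([[-3.0, -1.0, 0.0, 2.0, 5.0]])     # a row of sample inputs
y_hat = np.dot(W2, relu(np.dot(W1, x) + b1)) + b2

print(y_hat)  # [[3. 1. 0. 2. 5.]]  matches |x| at every input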
Visual Intuition
Linear Network (useless):
Input → Linear → Linear → Linear → Output
x → Wx+b → Wx+b → Wx+b → W'x+b' (collapsed!)
Non-Linear Network (powerful):
Input → ReLU → ReLU → ReLU → Output
x → max(0,Wx+b) → (non-linear) → complex function
The One Exception: Regression Output Layer
When Linear Activation is Acceptable
There is one place where linear activation makes sense: the output layer for regression problems.
Scenario: Predicting continuous real values (e.g., housing prices)
\[y \in (-\infty, \infty) \quad \text{or} \quad y \in [0, \infty)\]
Example Architecture:
# Hidden layers: Use non-linear activations
# (Here relu(z) = np.maximum(0, z); the weights W1..W3, biases b1..b3, and input X
#  are assumed to be already-defined NumPy arrays.)
Z1 = np.dot(W1, X) + b1
A1 = relu(Z1) # Non-linear!
Z2 = np.dot(W2, A1) + b2
A2 = relu(Z2) # Non-linear!
# Output layer: Linear activation for regression
Z3 = np.dot(W3, A2) + b3
A3 = Z3 # Linear (or identity)
Why this works:
- Hidden layers use non-linear activations (ReLU, tanh, etc.)
- They extract complex features
- Output layer uses linear activation to produce any real number
Alternative for Non-Negative Outputs
For predictions that should be non-negative (e.g., housing prices ≥ 0):
# Output layer: ReLU for non-negative values
Z3 = np.dot(W3, A2) + b3
A3 = relu(Z3) # Ensures y_hat ≥ 0
Summary of Rules
Hidden Layers
| Activation | Use in Hidden Layers? |
|---|---|
| Linear (identity) | ❌ Almost never (except rare compression cases) |
| ReLU | ✅ Default choice |
| Leaky ReLU | ✅ Good alternative |
| tanh | ✅ Sometimes useful |
| Sigmoid | ❌ Rarely recommended |
Output Layer
| Problem Type | Output Activation |
|---|---|
| Binary classification | Sigmoid |
| Multi-class classification | Softmax |
| Regression (any real value) | Linear (identity) |
| Regression (non-negative) | ReLU or Linear |
Complete Example: Housing Price Prediction
Problem Setup
- Input: Features like square footage, bedrooms, location
- Output: Price (continuous, non-negative)
Correct Architecture
# Layer 1: Non-linear activation
Z1 = np.dot(W1, X) + b1
A1 = relu(Z1) # ✅ Non-linear
# Layer 2: Non-linear activation
Z2 = np.dot(W2, A1) + b2
A2 = relu(Z2) # ✅ Non-linear
# Output layer: ReLU for non-negative prices
Z3 = np.dot(W3, A2) + b3
A3 = relu(Z3) # ✅ Ensures price ≥ 0
y_hat = A3
Wrong Architecture (Don’t Do This!)
# Layer 1: Linear activation
Z1 = np.dot(W1, X) + b1
A1 = Z1 # ❌ Linear (useless!)
# Layer 2: Linear activation
Z2 = np.dot(W2, A1) + b2
A2 = Z2 # ❌ Linear (useless!)
# Output layer: Linear
Z3 = np.dot(W3, A2) + b3
A3 = Z3 # ❌ Entire network collapses to W'x + b'
y_hat = A3 # This is just linear regression!
Key Takeaways
- Linear activations are useless in hidden layers - they collapse the network to linear regression
- Composition of linear functions is linear - $W^{[2]}(W^{[1]}x + b^{[1]}) + b^{[2]} = W'x + b'$
- Non-linearity is essential for neural networks to learn complex functions
- Hidden layers must use non-linear activations (ReLU, tanh, Leaky ReLU, etc.)
- Output layer can use linear activation for regression problems with unbounded outputs
- Even with 1000 layers, all-linear activation = simple linear regression
- Breaking the linear chain with non-linearity enables deep learning’s power
- Universal approximation theorem requires non-linear activations to work
- Use ReLU for non-negative regression outputs (e.g., prices, quantities)
- Exception is rare: Linear activations in hidden layers are only used in special compression scenarios