Last modified: Jan 31, 2026

Why Deep Representations?

Table of contents

  1. Introduction
  2. Intuition 1: Hierarchical Feature Learning
  3. Connection to Neuroscience
  4. Intuition 2: Circuit Theory
  5. Practical Considerations
  6. Summary: Why Depth Matters
  7. Key Takeaways

[Figure: hierarchical feature learning for face recognition (pixels → edges → facial parts → faces) on the left; deep vs. shallow efficiency for computing XOR (O(log N) depth vs. O(2^N) width) on the right, with a brain icon marking the loose biological inspiration.]

Introduction

Deep neural networks consistently outperform shallow networks across many applications. But why does depth matter so much? It’s not just about having more parameters—there’s something special about having many layers.

In this lesson, we’ll explore:

  1. Hierarchical feature learning: How deep networks build complex features from simple ones
  2. Circuit theory perspective: Mathematical reasons for preferring depth
  3. Practical insights: When and why to use deep architectures

Key Question: Why can’t we just use a shallow network with more hidden units instead of a deep network?

Intuition 1: Hierarchical Feature Learning

Example: Face Recognition

Let’s understand what a deep neural network computes when performing face recognition or detection.

Layer-by-Layer Feature Hierarchy

Layer 1: Edge Detection

The first layer acts as an edge detector:

Input: Raw image pixels (face photo)
         ↓
Layer 1: Detect edges (20 hidden units)
         • Vertical edges: |
         • Horizontal edges: —
         • Diagonal edges: / \
         • Various orientations

Each hidden unit learns to detect edges at different orientations in small regions of the image.

Note: When we study convolutional neural networks in a later course, this visualization will make even more sense!

How it works:

  • First layer groups pixels → edges (see the sketch below)
  • Looks at small local regions of the image
  • Each hidden unit specializes in one edge orientation
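
As a rough illustration of what one first-layer unit might compute, here is a minimal NumPy sketch with hand-set weights (a hypothetical vertical-edge filter; a trained first layer would learn filters like this on its own rather than having them written by hand):

import numpy as np

# Hypothetical hand-set weights mimicking one "vertical edge" hidden unit.
vertical_edge_filter = np.array([
    [-1.0, 0.0, 1.0],
    [-1.0, 0.0, 1.0],
    [-1.0, 0.0, 1.0],
])

def relu(z):
    return np.maximum(0.0, z)

def edge_unit_response(patch, weights, bias=0.0):
    """Response of one hidden unit to a 3x3 grayscale image patch."""
    return relu(np.sum(patch * weights) + bias)

# A patch with a dark-to-bright vertical transition activates the unit;
# a flat patch does not.
edge_patch = np.array([
    [0.0, 0.5, 1.0],
    [0.0, 0.5, 1.0],
    [0.0, 0.5, 1.0],
])
flat_patch = np.full((3, 3), 0.5)

print(edge_unit_response(edge_patch, vertical_edge_filter))  # positive response
print(edge_unit_response(flat_patch, vertical_edge_filter))  # 0.0

Other first-layer units would carry different weight patterns (horizontal, diagonal), which is exactly the variety of edge detectors described above.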

Layer 2: Facial Parts Detection

The second layer combines edges to detect facial parts:

Edges (from Layer 1)
         ↓
Layer 2: Detect facial parts
         • Eyes
         • Nose
         • Mouth
         • Ears
         • Eyebrows

By combining multiple edges, the network learns to recognize meaningful parts of a face.

Layer 3 & 4: Complete Face Recognition

Deeper layers combine facial parts to recognize complete faces:

Facial parts (from Layer 2)
         ↓
Layer 3 & 4: Detect faces
         • Different face shapes
         • Various expressions
         • Different people
         • Face orientations

By composing eyes, nose, ears, and chin together, the network recognizes complete faces and identifies individuals.

The Hierarchical Pattern

Deep Network Feature Hierarchy:

\[\text{Pixels} \xrightarrow{\text{Layer 1}} \text{Edges} \xrightarrow{\text{Layer 2}} \text{Parts} \xrightarrow{\text{Layers 3-4}} \text{Faces}\]

Key insight: Earlier layers detect simple features, later layers compose them into complex features.

| Layer | Complexity | What it Detects | Receptive Field Size |
|-------|------------|-----------------|----------------------|
| 1 | Simple | Edges (vertical, horizontal, diagonal) | Small (local regions) |
| 2 | Medium | Facial parts (eyes, nose, mouth) | Medium |
| 3-4 | Complex | Complete faces, identities | Large (whole image) |

Technical Detail: Early layers look at small regions (edges), while deeper layers look at progressively larger areas of the image.
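
As a rough sketch of the "progressively larger areas" point, assuming a stack of 3×3, stride-1 convolution-style layers (a detail this lesson leaves informal), the region a single unit can see grows with every layer:

# Receptive field of one unit after stacking 3x3, stride-1 layers.
# This assumes a simple convolution-style stack; the trend "deeper sees more"
# is the point, not the exact numbers.
def receptive_field(num_layers, kernel_size=3, stride=1):
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

for layers in [1, 2, 4, 8]:
    print(layers, "layers ->", receptive_field(layers), "x", receptive_field(layers), "pixels")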

Example: Speech Recognition

The same hierarchical pattern applies to audio data!

Audio Feature Hierarchy

Layer 1: Low-Level Waveform Features

Input: Audio waveform
         ↓
Layer 1: Detect audio features
         • Tone going up? ↗
         • Tone going down? ↘
         • Pitch (high/low)
         • White noise
         • Specific sounds (sniffing, breathing)

Layer 2: Phonemes (Basic Sound Units)

Waveform features (from Layer 1)
         ↓
Layer 2: Detect phonemes
         • "C" sound in "cat"
         • "A" sound in "cat"
         • "T" sound in "cat"
         • Other basic speech sounds

Phoneme: The smallest unit of sound that distinguishes one word from another in a language. The word “cat” has 3 phonemes: /k/, /æ/, /t/.

Layer 3: Words

Phonemes (from Layer 2)
         ↓
Layer 3: Detect words
         • "cat"
         • "dog"
         • "hello"
         • Other vocabulary

Layer 4+: Phrases and Sentences

Words (from Layer 3)
         ↓
Layer 4+: Detect phrases/sentences
         • "Hello, how are you?"
         • "The cat sat on the mat"
         • Complete utterances

Speech Recognition Hierarchy

\[\text{Waveform} \xrightarrow{\text{L1}} \text{Audio Features} \xrightarrow{\text{L2}} \text{Phonemes} \xrightarrow{\text{L3}} \text{Words} \xrightarrow{\text{L4+}} \text{Sentences}\]

The Power of Composition

Early layers: Compute seemingly simple functions

  • “Where are the edges?”
  • “What audio features are present?”

Deep layers: Compute surprisingly complex functions

  • “Is this person’s face in the image?”
  • “What sentence is being spoken?”

Magic of deep learning: By stacking many simple operations, we can compute incredibly complex functions!

Compositional Representation

This simple-to-complex hierarchical representation is also called compositional representation:

Key properties:

  1. Modularity: Each layer builds on the previous one
  2. Reusability: Low-level features (edges) are reused by high-level features (faces)
  3. Abstraction: Each layer operates at a different level of abstraction
  4. Efficiency: Complex features are built by combining simpler ones

Mathematical view:

\[\text{Complex Function} = f_L \circ f_{L-1} \circ \cdots \circ f_2 \circ f_1(\text{input})\]

Each layer $f_l$ performs a relatively simple transformation, but their composition creates powerful representations!
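
As a minimal sketch of this composition (random weights and made-up layer sizes, purely to show the structure), the whole "complex function" below is nothing more than simple layers applied one after another:

import numpy as np

rng = np.random.default_rng(0)

def layer(a, W, b):
    # One simple transformation f_l: affine map followed by ReLU.
    return np.maximum(0.0, W @ a + b)

# Hypothetical layer sizes: 16 input features -> three hidden layers -> 1 output.
sizes = [16, 12, 8, 4, 1]
params = [(rng.standard_normal((n_out, n_in)) * 0.1, np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def deep_function(x):
    # f_L(... f_2(f_1(x)) ...): the network is just repeated composition.
    a = x
    for W, b in params:
        a = layer(a, W, b)
    return a

print(deep_function(rng.standard_normal(16)))

Each `layer` call is trivial on its own; the expressive power comes from chaining them, which is the compositional idea in the equation above.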

Connection to Neuroscience

Human Brain Analogy

Many people draw analogies between deep neural networks and the human visual system:

How the brain processes vision (neuroscientist hypothesis):

Retina (eyes)
    ↓
V1 (Primary Visual Cortex): Detect edges and orientations
    ↓
V2: Detect contours, textures
    ↓
V4: Detect object parts
    ↓
IT (Inferotemporal Cortex): Recognize complete objects, faces

This hierarchical processing mirrors deep neural networks!

A Word of Caution

Important: While the analogy to the brain is inspiring, we should be careful not to take it too far.

Why the analogy is useful:

  • ✅ Brain does process information hierarchically
  • ✅ Simple features → Complex features pattern exists in biology
  • ✅ Provides intuition for why depth helps
  • ✅ Serves as loose inspiration for architecture design

Why we should be cautious:

  • ⚠️ We don’t fully understand how the brain works
  • ⚠️ Neural networks are simplified models
  • ⚠️ Biological neurons are far more complex than artificial ones
  • ⚠️ Learning algorithms differ significantly

Bottom line: The brain analogy is a helpful starting point, but deep learning is its own field with its own principles.

Intuition 2: Circuit Theory

Mathematical Perspective

Beyond intuitive examples, there’s a theoretical reason why deep networks are powerful, coming from circuit theory.

Circuit theory: Studies what functions can be computed with logic gates (AND, OR, NOT).

Key Result from Circuit Theory

Theorem (informal): Some functions that are computable with a small but deep network require an exponentially large shallow network.

Translation: Depth provides exponential advantages for certain functions!

Example: Computing XOR (Parity Function)

Let’s see a concrete example: computing the XOR (exclusive OR) of $n$ input bits.

Problem Statement

Compute the parity function:

\[y = x_1 \oplus x_2 \oplus x_3 \oplus \cdots \oplus x_n\]

where $\oplus$ is the XOR operation.

XOR truth table (for 2 inputs):

| $x_1$ | $x_2$ | $x_1 \oplus x_2$ |
|-------|-------|------------------|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

Goal: Compute this for $n$ inputs efficiently.
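
For reference, the target function itself is easy to state directly in code (this is just the definition of parity, independent of any network):

from functools import reduce
from operator import xor

def parity(bits):
    # y = x1 XOR x2 XOR ... XOR xn; equals 1 iff an odd number of bits are 1.
    return reduce(xor, bits, 0)

print(parity([0, 1, 1, 0]))  # 0 (even number of 1s)
print(parity([1, 1, 1, 0]))  # 1 (odd number of 1s)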

Solution 1: Deep Network (XOR Tree)

Architecture: Build an XOR tree with $O(\log n)$ layers:

Layer 1:  x₁⊕x₂    x₃⊕x₄    x₅⊕x₆    x₇⊕x₈
              \       /          \       /
Layer 2:      (··)⊕(··)          (··)⊕(··)
                    \               /
Layer 3:              (····)⊕(····)
                           |
                           ŷ

Properties:

\[\begin{aligned} \text{Number of layers} &: O(\log n) \\ \text{Total gates needed} &: O(n) \\ \text{Depth} &: \log_2 n \end{aligned}\]

Note: Technically, each XOR gate might require a few AND/OR/NOT gates, so each “layer” might be 2-3 actual layers, but the depth is still $O(\log n)$.

Efficiency: Very efficient! Logarithmic depth, linear number of gates.
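
Here is a small plain-Python sketch of the tree construction, treating each XOR gate as one unit, so you can check the $O(n)$ gates / $O(\log n)$ depth claims for yourself:

import math

def xor_tree(bits):
    """Pairwise-XOR reduction: O(n) gates arranged in O(log n) layers."""
    gates, depth = 0, 0
    layer = list(bits)
    while len(layer) > 1:
        nxt = []
        for i in range(0, len(layer) - 1, 2):
            nxt.append(layer[i] ^ layer[i + 1])  # one XOR gate per pair
            gates += 1
        if len(layer) % 2 == 1:                  # odd element passes through
            nxt.append(layer[-1])
        layer = nxt
        depth += 1
    return layer[0], gates, depth

y, gates, depth = xor_tree([1, 0, 1, 1, 0, 0, 1, 0])
print(y, gates, depth)           # parity of 8 bits: 0, using 7 gates in 3 layers
print(math.ceil(math.log2(8)))   # log2(n) = 3, matching the depth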

Solution 2: Shallow Network (Single Hidden Layer)

Architecture: Forced to use just one hidden layer:

x₁, x₂, x₃, ..., xₙ
       ↓ (all inputs)
  [Hidden Layer]
       ↓
       ŷ

Problem: The hidden layer must be exponentially large!

Why? To compute XOR with one hidden layer, you need to enumerate all possible input configurations:

\[\text{Number of hidden units needed} \approx 2^{n-1} = O(2^n)\]

Reason: With $n$ binary inputs, there are $2^n$ possible configurations, and a shallow network must essentially memorize which of them give XOR = 1 versus XOR = 0 (roughly one hidden unit per odd-parity pattern, as sketched below).
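
One way to see the blow-up (a counting sketch under the memorization argument above, not a formal proof) is to enumerate the odd-parity patterns such a network would have to single out:

from itertools import product

def odd_parity_patterns(n):
    # The input patterns with XOR = 1 that a "memorizing" shallow network
    # must single out, roughly one hidden unit per pattern: 2^(n-1) of them.
    return [bits for bits in product([0, 1], repeat=n) if sum(bits) % 2 == 1]

for n in [2, 4, 8, 16]:
    print(n, "inputs ->", len(odd_parity_patterns(n)), "hidden units (= 2^(n-1))")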

Comparison: Deep vs Shallow

| Approach | Network Depth | Hidden Units Needed | Total Parameters |
|----------|---------------|---------------------|------------------|
| Deep (XOR tree) | $O(\log n)$ | $O(n)$ | $O(n)$ |
| Shallow (1 layer) | $O(1)$ (fixed) | $O(2^n)$ | $O(n \cdot 2^n)$ |

Exponential savings: Deep network is exponentially more efficient!
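
To get a feel for the gap, plugging in an illustrative $n = 64$:

\[\begin{aligned} \text{Deep XOR tree:} \quad & \text{depth} = \log_2 64 = 6, \qquad \text{gates} = 64 - 1 = 63 \\ \text{Shallow network:} \quad & \text{hidden units} \approx 2^{63} \approx 9.2 \times 10^{18} \end{aligned}\]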

Mathematical Insight

General principle: There exist functions $f: \{0,1\}^n \to \{0,1\}$ such that:

Deep network: Uses $O(\text{poly}(n))$ units

Shallow network: Requires $O(2^n)$ units

Conclusion: For some problems, depth provides exponential representational efficiency.

Practical Caveat

Andrew Ng’s note: “Personally, I find the circuit theory result less useful for gaining intuition, but it’s one of the results people often cite when explaining the value of deep representations.”

Why less useful?

  • Most real-world functions aren’t worst-case circuit theory problems
  • Empirical success often precedes theoretical understanding
  • XOR trees are somewhat artificial examples

But still valuable because:

  • Provides mathematical justification for depth
  • Shows depth isn’t just about having more parameters
  • Explains why we can’t always compensate with wider shallow networks

Practical Considerations

The “Deep Learning” Brand

Let’s be honest about terminology:

Marketing reality: Part of why “deep learning” took off is simply great branding!

Before: “Neural networks with many hidden layers” (mouthful)

After: “Deep learning” (concise, evocative, cool! 🎯)

The term “deep” sounds:

  • Profound
  • Sophisticated
  • Advanced
  • Mysterious

Result: The rebranding helped capture popular imagination and research funding!

But: Regardless of marketing, deep networks genuinely do work well! The performance backs up the hype.

How Deep Should You Go?

Common beginner mistake: Immediately using very deep networks for every problem.

Start simple, then increase complexity:

Step 1: Logistic Regression (0 hidden layers)
        ↓ (if not good enough)
Step 2: Shallow Network (1-2 hidden layers)
        ↓ (if not good enough)
Step 3: Deeper Networks (3-5 hidden layers)
        ↓ (if not good enough)
Step 4: Very Deep Networks (10+ hidden layers)

Treat depth as a hyperparameter: Tune it like learning rate or regularization strength!

Depth as Hyperparameter

Hyperparameter tuning process:

# Pseudocode for tuning depth
# (build_network and evaluate are placeholders for your own model
#  constructor and validation-set metric)
depths_to_try = [0, 1, 2, 3, 5, 10, 20]

best_depth = None
best_performance = 0.0

for depth in depths_to_try:
    model = build_network(depth=depth)             # train a model with this many hidden layers
    performance = evaluate(model, validation_set)  # e.g., validation accuracy

    if performance > best_performance:
        best_performance = performance
        best_depth = depth

print(f"Optimal depth: {best_depth} layers")

Factors affecting optimal depth:

  • Dataset size (more data → can use deeper networks)
  • Problem complexity (harder problems → might need more depth)
  • Computational budget (deeper → slower training)
  • Regularization (deeper → more prone to overfitting)

Over the last several years, there’s been a trend toward very deep networks:

Examples of successful deep architectures:

  • AlexNet (2012): 8 layers
  • VGG (2014): 16-19 layers
  • ResNet (2015): 50-152 layers
  • Modern transformers: Often dozens of layers

When very deep networks excel:

  • Large datasets (millions of examples)
  • Complex tasks (image recognition, language modeling)
  • When computational resources are available
  • When proper regularization is used

But: Don’t assume “deeper is always better”—it depends on your specific problem!

Practical Guidelines

| Scenario | Recommended Depth | Rationale |
|----------|-------------------|-----------|
| Small dataset (<1K examples) | 0-2 layers | Avoid overfitting |
| Medium dataset (1K-100K) | 2-5 layers | Balance capacity and generalization |
| Large dataset (>100K) | 5-20+ layers | Leverage data to learn complex features |
| Starting new problem | 0-2 layers | Establish baseline, then increase |
| Production system | Tune as hyperparameter | Find optimal depth empirically |

Summary: Why Depth Matters

Three Key Reasons

1. Hierarchical Feature Learning

  • Simple features (edges) → Medium features (parts) → Complex features (objects)
  • Mirrors how brains process information
  • Natural for many real-world problems

2. Mathematical Efficiency

  • Circuit theory shows exponential advantages for certain functions
  • Deep networks can compute some functions with exponentially fewer parameters
  • Depth provides representational power beyond just parameter count

3. Empirical Success

  • Deep networks consistently outperform shallow networks on complex tasks
  • Enabled breakthroughs in computer vision, speech, NLP
  • Scale well with large datasets

When to Use Deep Networks

Good candidates for deep networks:

  • Large datasets available
  • Complex hierarchical structure in data (images, audio, text)
  • Sufficient computational resources
  • Tasks where shallow networks plateau

When simpler might be better:

  • Small datasets (risk of overfitting)
  • Simple problems (don’t need the complexity)
  • Limited computational budget
  • Need interpretability (shallow models easier to understand)

Key Takeaways

  1. Depth is special: It’s not just about having more parameters—layer structure matters
  2. Hierarchical learning: Deep networks naturally learn features from simple to complex
  3. Compositional representation: Complex features are compositions of simpler ones
  4. Face recognition example: Pixels → Edges → Facial parts → Complete faces
  5. Speech recognition example: Waveforms → Audio features → Phonemes → Words → Sentences
  6. Brain inspiration: Human visual cortex also processes hierarchically (but analogy has limits)
  7. Circuit theory: Deep networks can be exponentially more efficient than shallow networks
  8. XOR example: Computing n-way XOR takes $O(\log n)$ depth or $O(2^n)$ width
  9. Exponential savings: Some functions need exponentially fewer parameters with depth
  10. Branding matters: “Deep learning” is partly successful due to great naming!
  11. Start simple: Begin with shallow networks, increase depth as needed
  12. Depth is a hyperparameter: Tune it empirically for your specific problem
  13. Recent trend: Very deep networks (dozens of layers) work well for many applications
  14. Not always deeper: More layers aren’t always better—depends on data and problem
  15. Empirical success: Despite theoretical limitations, deep networks work amazingly well in practice!