Gradient Descent
Table of contents
- Introduction
- Recap: Cost Function
- Key Insight: Averaging Derivatives
- Algorithm: Gradient Descent for $m$ Examples
- Understanding the Variables
- Complete One-Step Algorithm
- Problem: Two For-Loops
- Why For-Loops Are a Problem
- Performance Comparison
- What’s Next: Vectorization
- Key Takeaways
- Summary Formula
Introduction
Previously, we learned gradient descent for one training example. Now we’ll extend this to $m$ training examples (the entire training set).
Recap: Cost Function
The cost function for logistic regression:
\[J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(a^{(i)}, y^{(i)})\]
Where:
- $m$ = number of training examples
- $a^{(i)} = \sigma(z^{(i)}) = \sigma(w^T x^{(i)} + b)$ = prediction for example $i$
- $\mathcal{L}(a^{(i)}, y^{(i)})$ = loss for example $i$
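For reference, here is a minimal sketch of $\sigma$ and $\mathcal{L}$ in code (assuming NumPy; the helper names sigmoid and loss are illustrative, not from the original):
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z)), maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def loss(a, y):
    # Cross-entropy loss L(a, y) for a single example
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))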
Key Insight: Averaging Derivatives
Since the cost function is an average of individual losses, the derivative is also an average:
\[\frac{\partial J}{\partial w_1} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial \mathcal{L}^{(i)}}{\partial w_1}\]
What this means:
- Compute derivative for each training example
- Sum them up
- Divide by $m$ to get the average
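For example (with made-up numbers): if $m = 2$ and the two per-example derivatives are $\frac{\partial \mathcal{L}^{(1)}}{\partial w_1} = 0.4$ and $\frac{\partial \mathcal{L}^{(2)}}{\partial w_1} = -0.2$, then
\[\frac{\partial J}{\partial w_1} = \frac{1}{2}\big(0.4 + (-0.2)\big) = 0.1\]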
Algorithm: Gradient Descent for $m$ Examples
Step 1: Initialize Accumulators
J = 0 # Cost accumulator
dw1 = 0 # Gradient accumulator for w1
dw2 = 0 # Gradient accumulator for w2
db = 0 # Gradient accumulator for b
Step 2: Loop Over Training Examples
for i in range(1, m+1):
    # Forward propagation
    z_i = w1*x1_i + w2*x2_i + b
    a_i = sigmoid(z_i)
    # Accumulate cost
    J += -(y_i * log(a_i) + (1-y_i) * log(1-a_i))
    # Backward propagation
    dz_i = a_i - y_i
    # Accumulate gradients
    dw1 += x1_i * dz_i
    dw2 += x2_i * dz_i
    db += dz_i
Note: This example uses 2 features. For $n$ features, you’d have dw1, dw2, …, dwn.
Step 3: Average the Gradients
J = J / m # Average cost
dw1 = dw1 / m # Average gradient for w1
dw2 = dw2 / m # Average gradient for w2
db = db / m # Average gradient for b
Step 4: Update Parameters
w1 = w1 - alpha * dw1
w2 = w2 - alpha * dw2
b = b - alpha * db
Where $\alpha$ is the learning rate.
Understanding the Variables
Accumulators vs Per-Example Derivatives
Accumulators (no superscript):
- dw1, dw2, db: sum across all examples
- Used to compute the final gradients
Per-example derivatives (with superscript $i$):
- $dz^{(i)}$ - Derivative for example $i$ only
- Computed inside the loop
Why the distinction?
- Accumulators collect information from all examples
- Per-example derivatives are temporary calculations
Complete One-Step Algorithm
This is one iteration of gradient descent:
# Initialize
J, dw1, dw2, db = 0, 0, 0, 0
# Loop over training examples
for i in range(1, m+1):
    # Forward pass
    z_i = w1*x1_i + w2*x2_i + b
    a_i = sigmoid(z_i)
    J += -(y_i * log(a_i) + (1-y_i) * log(1-a_i))
    # Backward pass
    dz_i = a_i - y_i
    dw1 += x1_i * dz_i
    dw2 += x2_i * dz_i
    db += dz_i
# Average
J /= m
dw1 /= m
dw2 /= m
db /= m
# Update
w1 -= alpha * dw1
w2 -= alpha * dw2
b -= alpha * db
To train: Repeat this entire process many times until convergence.
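A minimal sketch of how this one step might sit inside a full training loop (my own illustration, not the original code, assuming NumPy and data stored in arrays x1, x2, y of length m; the names train, alpha, and num_iterations are hypothetical):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(x1, x2, y, alpha=0.01, num_iterations=1000):
    m = len(y)
    w1, w2, b = 0.0, 0.0, 0.0
    for _ in range(num_iterations):              # repeat the one-step procedure
        J, dw1, dw2, db = 0.0, 0.0, 0.0, 0.0
        for i in range(m):                       # loop over the m training examples
            z_i = w1 * x1[i] + w2 * x2[i] + b    # forward pass
            a_i = sigmoid(z_i)
            J += -(y[i] * np.log(a_i) + (1 - y[i]) * np.log(1 - a_i))
            dz_i = a_i - y[i]                    # backward pass
            dw1 += x1[i] * dz_i
            dw2 += x2[i] * dz_i
            db += dz_i
        J, dw1, dw2, db = J / m, dw1 / m, dw2 / m, db / m   # average
        w1 -= alpha * dw1                        # update parameters
        w2 -= alpha * dw2
        b -= alpha * db
    return w1, w2, b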
Problem: Two For-Loops
This implementation relies on two explicit for-loops:
Loop 1: Over Training Examples
for i in range(1, m+1):  # Loop over m examples
    ...
Loop 2: Over Features
For $n$ features, you need:
dw1 += x1_i * dz_i
dw2 += x2_i * dz_i
# ...
dwn += xn_i * dz_i
This is essentially another loop over features.
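To make that hidden loop visible, here is a rough sketch (my own, not from the original) with both loops written out explicitly, assuming the features are stored in a matrix X of shape (n, m) with one column per example:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, m = 3, 5                           # illustrative sizes
X = np.random.rand(n, m)              # features: one column per example
y = np.random.randint(0, 2, size=m)   # labels
w, b = np.zeros(n), 0.0

dw, db = np.zeros(n), 0.0
for i in range(m):                    # Loop 1: over training examples
    z_i = b
    for j in range(n):                # Loop 2: over features
        z_i += w[j] * X[j, i]
    dz_i = sigmoid(z_i) - y[i]
    for j in range(n):                # Loop 2 again: per-feature gradient accumulation
        dw[j] += X[j, i] * dz_i
    db += dz_i
dw, db = dw / m, db / m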
Why For-Loops Are a Problem
Inefficiency in Deep Learning:
Pre-Deep Learning Era:
- For-loops were acceptable
- Vectorization was a “nice to have”
- Small datasets made speed less critical
Deep Learning Era:
- Massive datasets (millions of examples)
- For-loops are too slow
- Vectorization is essential
The solution is vectorization: a set of techniques that eliminate explicit for-loops.
Performance Comparison
| Approach | Speed | Scalability |
|---|---|---|
| Two for-loops | Slow | Poor (doesn’t scale) |
| Vectorized | Fast | Excellent (scales to millions) |
Why it matters:
- Modern datasets: millions of examples
- Need to process quickly
- For-loops don’t scale
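As a rough illustration of the gap (a sketch you can run yourself, not a benchmark from the original), you could time an explicit for-loop against NumPy's vectorized np.dot:
import time
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

start = time.time()
c = 0.0
for i in range(len(a)):               # explicit for-loop version
    c += a[i] * b[i]
loop_time = time.time() - start

start = time.time()
c_vec = np.dot(a, b)                  # vectorized version
vec_time = time.time() - start

print(f"for-loop: {loop_time:.4f}s   vectorized: {vec_time:.6f}s")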
What’s Next: Vectorization
In the next videos, we’ll learn vectorization techniques to:
- Eliminate the loop over training examples
- Eliminate the loop over features
- Implement gradient descent with no explicit loops
Benefits of vectorization:
- Much faster execution
- Cleaner code
- Scales to massive datasets
- Takes advantage of modern hardware (GPUs)
Key Takeaways
- Gradient for multiple examples: Average of individual gradients
- Accumulators: Sum gradients across examples, then divide by $m$
- One iteration: Forward pass → compute gradients → average → update
- Multiple iterations: Repeat until convergence
- For-loops are slow: Need vectorization for efficiency
- Next step: Learn vectorization to eliminate loops
Summary Formula
For the entire training set:
\[\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)} (a^{(i)} - y^{(i)})\]
\[\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (a^{(i)} - y^{(i)})\]
These formulas compute the gradients needed for gradient descent across all examples.
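As a sketch of what these formulas might look like in code (my own illustration, assuming NumPy, with X of shape (n, m) holding one example per column, Y of shape (1, m), and w of shape (n, 1); this anticipates the vectorized implementation covered next):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(w, b, X, Y):
    # X: (n, m) features, Y: (1, m) labels, w: (n, 1) weights, b: scalar
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)    # predictions a^(i) for all m examples
    dZ = A - Y                         # a^(i) - y^(i)
    dw = np.dot(X, dZ.T) / m           # dJ/dw_j = (1/m) * sum_i x_j^(i) * (a^(i) - y^(i))
    db = np.sum(dZ) / m                # dJ/db   = (1/m) * sum_i (a^(i) - y^(i))
    return dw, db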