
Other Regularization Methods

Table of contents

  1. Introduction
  2. Data Augmentation
  3. Early Stopping
  4. Summary Comparison
  5. Key Takeaways

Introduction

Beyond L2 regularization and dropout, there are additional techniques to reduce overfitting in neural networks. This lesson covers two important methods: data augmentation and early stopping.

Data Augmentation

The Problem: Not Enough Training Data

Ideal Solution: Collect more training data

Reality: Getting more data is often:

  • Expensive (requires labeling, collection efforts)
  • Time-consuming
  • Sometimes impossible

Practical Solution: Augment your existing training data by creating modified versions of your examples.

Augmentation Techniques for Images

1. Horizontal Flipping

Technique: Mirror the image horizontally

Example - Cat Classifier:

[Figure: A cat photo next to its horizontally flipped mirror image, both valid training examples; below, the digit 4 under slight rotation and distortion still reads as a 4.]

from PIL import Image

# Original image
original_cat = Image.open('cat.jpg')

# Flip horizontally (mirror left-right)
flipped_cat = original_cat.transpose(Image.FLIP_LEFT_RIGHT)

# Now you have 2 training examples instead of 1

Result: Double your training set size with minimal effort

Trade-off: The augmented examples are not as valuable as completely independent new examples, but they’re essentially free (except for computational cost).

Important: Notice we flip horizontally but NOT vertically—upside-down cats aren’t realistic examples!

2. Random Crops and Zooms

Technique: Take random sections of the image at different scales

import random

def random_crop(image, crop_size):
    """Randomly crop a crop_size x crop_size square from the image
    (assumes width and height are both at least crop_size)."""
    width, height = image.size
    left = random.randint(0, width - crop_size)
    top = random.randint(0, height - crop_size)
    
    return image.crop((left, top, left + crop_size, top + crop_size))

# Create multiple crops from one image
crops = [random_crop(original_cat, 224) for _ in range(5)]

Why It Works: A zoomed-in portion of a cat is still a cat—you’re teaching the model that the object can appear at different positions and scales.

3. Rotations and Distortions

Applications: Particularly useful for optical character recognition (OCR)

Example - Digit Recognition:

import numpy as np
from PIL import Image
from scipy.ndimage import rotate

# Original digit "4", loaded as a NumPy array
digit = np.array(Image.open('four.png'))

# Apply a subtle rotation (keep the original array shape)
rotated_digit = rotate(digit, angle=15, reshape=False)

# Apply an elastic distortion (see the helper sketched below)
distorted_digit = apply_elastic_distortion(digit)
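
`apply_elastic_distortion` above is not a library call. Here is a minimal sketch of one standard implementation (the smoothed random displacement field of Simard et al., 2003), assuming the digit is a 2D grayscale array; the `alpha` and `sigma` knobs are illustrative defaults, not values from the lesson:

import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def apply_elastic_distortion(image, alpha=8.0, sigma=3.0, seed=None):
    """Warp a 2D grayscale image with a smoothed random displacement
    field. alpha scales the displacement; sigma controls its smoothness."""
    rng = np.random.default_rng(seed)
    h, w = image.shape

    # Per-pixel random displacements, smoothed so neighbors move together
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha

    # Sample each output pixel from its displaced source coordinate
    y, x = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    return map_coordinates(image, [y + dy, x + dx], order=1, mode='reflect')

Small `alpha` values give the subtle, realistic warps recommended below; large values produce the exaggerated distortions that are only useful for demonstration.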

Note on Distortion Strength:

  • For demonstration: Strong distortions make the concept clear
  • In practice: Use subtle distortions—you don’t want extremely warped digits that look unnatural

What Data Augmentation Teaches Your Model

By synthesizing augmented examples, you’re explicitly telling your algorithm:

| Transformation | Invariance Learned |
| --- | --- |
| Horizontal flip | “A cat facing left is still a cat” |
| Random crop/zoom | “A cat at different scales/positions is still a cat” |
| Rotation | “A slightly tilted digit is still the same digit” |
| Distortion | “Handwriting variations don’t change the digit identity” |

Data Augmentation as Regularization

How It Regularizes:

  1. Increases effective training set size → More data to learn from
  2. Introduces controlled variation → Prevents memorization of exact pixel patterns
  3. Low cost → Almost free regularization (just computation; see the sketch below)
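
As a concrete illustration (a minimal sketch, not from the lesson), an on-the-fly pipeline can reuse `random_crop` from earlier so that every sampled example is a fresh variant:

import random
from PIL import Image

def augment(image, crop_size=224):
    """Random flip followed by a random crop: each call produces a
    slightly different variant of the same training example."""
    if random.random() < 0.5:
        image = image.transpose(Image.FLIP_LEFT_RIGHT)
    return random_crop(image, crop_size)  # defined in the earlier snippet

Because the variation is regenerated every epoch, the network cannot latch onto exact pixel patterns, which is precisely the regularization effect described above.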

Comparison:

| Method | Cost | Data Increase | Independence |
| --- | --- | --- | --- |
| Collect new data | High | 100% | Fully independent |
| Data augmentation | Low | Variable | Somewhat redundant |

Early Stopping

[Figure: Training curves over iterations. The train error (cost function $J$, blue) decreases monotonically, while the dev set error (purple) first decreases and then rises. Early stopping halts at the mid-size $\lVert w\rVert$ point where dev error is minimal, between $w \approx 0$ at initialization and large $w$ at convergence.]

The Concept

Idea: Stop training your neural network before it fully converges on the training set.

How It Works

Step 1: Track Two Metrics During Training

Plot both metrics as training progresses:

  1. Training error (or cost function $J$) → Should decrease monotonically
  2. Dev set error → Initially decreases, then starts increasing

Step 2: Identify the Optimal Point

Typical Behavior:

\[\text{As iterations increase:} \quad \begin{cases} J_{\text{train}} & \searrow \ \text{(keeps decreasing)} \\ J_{\text{dev}} & \searrow \text{ then } \nearrow \ \text{(starts increasing)} \end{cases}\]

Decision: Stop training at the iteration where dev set error is lowest.

# Pseudocode for early stopping: train_one_epoch(), evaluate_on_dev_set(),
# and save_model() are placeholders for your own training loop
best_dev_error = float('inf')
patience = 10  # epochs to tolerate without improvement before stopping
wait = 0

for iteration in range(max_iterations):
    train_one_epoch()
    
    dev_error = evaluate_on_dev_set()
    
    if dev_error < best_dev_error:
        best_dev_error = dev_error
        save_model()  # checkpoint the best weights so far
        wait = 0
    else:
        wait += 1
        
    if wait >= patience:
        print(f"Early stopping at iteration {iteration}")
        break

Why Early Stopping Works

Connection to Weight Magnitudes:

  1. Early in training: Random initialization → Weights $w$ are small
  2. As training progresses: Weights grow larger and larger
  3. Early stopping effect: Stops training while weights are still mid-sized
\[\text{Early stopping} \approx \text{Choosing smaller } \|w\| \approx \text{L2 regularization effect}\]

Result: Smaller weight norms → Less overfitting
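
As a quick illustration (a toy sketch, not from the lesson): for logistic regression on synthetic, linearly separable data, the weight norm grows steadily as gradient descent proceeds, so stopping earlier really does select a smaller $\lVert w\rVert$:

import numpy as np

# Toy demonstration that ||w|| grows during training
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X @ rng.normal(size=10) > 0).astype(float)  # separable labels

w = rng.normal(size=10) * 0.01  # small weights at initialization
lr = 0.1
for step in range(1, 501):
    p = 1 / (1 + np.exp(-X @ w))       # sigmoid predictions
    w -= lr * X.T @ (p - y) / len(y)   # gradient step on the log loss
    if step % 100 == 0:
        print(step, np.linalg.norm(w))  # the norm keeps increasing

Each checkpoint along the run therefore corresponds to a different effective regularization strength, which is the “free hyperparameter search” benefit discussed below.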

The Downside: Orthogonalization Violation

The Principle of Orthogonalization

Ideal Machine Learning Workflow: Separate concerns into independent tasks

| Task | Goal | Tools |
| --- | --- | --- |
| 1. Optimize cost function | Minimize $J(w,b)$ | Gradient descent, Adam, RMSprop |
| 2. Prevent overfitting | Reduce variance | L2 regularization, dropout, more data |

Orthogonalization: Work on one task at a time with dedicated tools for each task.

Note: Adam (Adaptive Moment Estimation) and RMSprop (Root Mean Square Propagation) are advanced optimization algorithms that improve upon basic gradient descent. They adaptively adjust learning rates for each parameter, leading to faster and more stable convergence.

Why Early Stopping Violates This Principle

The Problem: Early stopping couples both tasks together

  • Task 1 Impact: You stop optimizing $J$ before it’s fully minimized → Not doing a great job at Task 1
  • Task 2 Impact: You’re simultaneously trying to prevent overfitting → Mixed objective

Consequence: The search space becomes more complicated—you can’t independently tune optimization and regularization.

Early Stopping vs L2 Regularization

Alternative Approach: Use L2 Regularization Instead

# Train to convergence for each candidate lambda
# (train_with_L2_regularization and evaluate are placeholders)
for lambda_val in [0.001, 0.01, 0.1, 1.0]:
    model = train_with_L2_regularization(lambda_val)  # one full run per lambda
    evaluate(model)  # compare dev-set error across runs

Comparison:

| Aspect | Early Stopping | L2 Regularization |
| --- | --- | --- |
| Orthogonalization | Violates (couples tasks) | Maintains (separates tasks) |
| Hyperparameter search | Single training run explores multiple $\lVert w\rVert$ | Must try multiple $\lambda$ values |
| Computational cost | Lower (one training run) | Higher (multiple training runs) |
| Search space | More complex | Easier to decompose |
| Preference | Andrew Ng: less preferred | Andrew Ng: preferred (if affordable) |

When to Use Each

Use Early Stopping When:

  • Computational budget is limited
  • You want quick experimentation
  • Training is very expensive

Use L2 Regularization When:

  • You can afford multiple training runs
  • You want cleaner separation of concerns
  • You have the resources to search over $\lambda$

Early Stopping Benefits

Despite its downside, early stopping offers a unique advantage:

Key Benefit: In a single gradient descent run, you automatically try out small, medium, and large weight values without explicitly searching over the regularization hyperparameter $\lambda$.

Summary Comparison

| Technique | How It Works | Pros | Cons | Use Case |
| --- | --- | --- | --- | --- |
| Data augmentation | Create synthetic training examples through transformations | Nearly free, increases data | Less valuable than real data | Almost always beneficial |
| Early stopping | Stop training when dev error increases | Low computational cost | Couples optimization and regularization | Limited compute budget |
| L2 regularization | Penalize large weights with $\lambda$ | Clean separation of concerns | Requires multiple training runs | When you can afford it |

Key Takeaways

  1. Data augmentation is an inexpensive way to increase your effective training set size
  2. Common augmentations:
    • Horizontal flipping (but usually not vertical)
    • Random crops and zooms
    • Rotations and subtle distortions
  3. Early stopping prevents overfitting by halting training when dev error starts increasing
  4. Early stopping works by keeping weight magnitudes small (similar to L2 regularization)
  5. Orthogonalization principle: Ideally, separate optimization and regularization into independent tasks
  6. Early stopping violates orthogonalization by coupling both tasks
  7. Preferred approach: L2 regularization (if computationally affordable) for cleaner hyperparameter search
  8. Practical choice: Early stopping is still widely used when compute is limited