Parameters vs Hyperparameters
Table of contents
- Introduction
- Parameters vs Hyperparameters
- The Empirical Process of Hyperparameter Tuning
- Why You Must Experiment
- Practical Hyperparameter Tuning Strategy
- Systematic Hyperparameter Search (Preview)
- Common Pitfalls to Avoid
- The State of Hyperparameter Tuning
- Practical Advice Summary
- Key Takeaways
Introduction
Building effective deep neural networks requires managing both parameters (learned by the model) and hyperparameters (chosen by you). Understanding the difference between the two is crucial for efficient model development.
This lesson covers:
- What parameters are (and what they do)
- What hyperparameters are (and why they matter)
- The empirical process of finding good hyperparameters
- Why hyperparameter values change over time
- Practical advice for hyperparameter tuning
Key insight: Hyperparameters control how your parameters are learned, making them “parameters of the learning process” rather than parameters of the model itself.
Parameters vs Hyperparameters
What Are Parameters?
Parameters are the variables that your model learns from the training data:
- Weights: $W^{[1]}, W^{[2]}, \ldots, W^{[L]}$
- Biases: $b^{[1]}, b^{[2]}, \ldots, b^{[L]}$
These values are:
- Learned automatically through gradient descent
- Updated during training iterations
- Optimized to minimize the cost function
Remember: You don’t set these values manually—the learning algorithm finds them!
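To make this concrete, here is a minimal NumPy sketch of what the parameters of a single layer look like in code. The layer sizes are arbitrary, and the gradient descent update is shown only as a comment:

```python
import numpy as np

# Parameters for one layer: a weight matrix and a bias vector.
# Shapes follow the n^[l] x n^[l-1] convention used in this course.
n_prev, n_curr = 4, 3
W1 = np.random.randn(n_curr, n_prev) * 0.01  # initialized randomly
b1 = np.zeros((n_curr, 1))                   # initialized to zeros

# During training, gradient descent updates them automatically:
#   W1 = W1 - alpha * dW1
#   b1 = b1 - alpha * db1
# You never set W1 or b1 by hand; the learning algorithm finds them.
```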
What Are Hyperparameters?
Hyperparameters are the settings that you choose before training that control how the parameters are learned:
Common Hyperparameters
| Hyperparameter | Description | Example Values |
|---|---|---|
| Learning rate ($\alpha$) | Step size for gradient descent | 0.001, 0.01, 0.1 |
| Number of iterations | Training epochs/steps | 1000, 5000, 10000 |
| Number of layers ($L$) | Network depth | 2, 3, 5, 10 |
| Hidden units per layer ($n^{[l]}$) | Layer width | 50, 100, 256, 512 |
| Activation functions | Non-linearity choice | ReLU, tanh, sigmoid |
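In code, hyperparameters usually show up as plain constants you choose before training starts. A minimal sketch (the variable names are illustrative, not from any particular framework):

```python
# Hyperparameters: chosen by you, fixed before training begins.
alpha = 0.01                    # learning rate
num_iterations = 5000           # how long to train
layer_dims = [64, 100, 50, 1]   # input size, two hidden layers, one output unit
hidden_activation = "relu"      # non-linearity for hidden layers

# The parameters (W, b) will be created and updated by the training loop;
# none of the values above are ever touched by gradient descent.
```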
Advanced Hyperparameters (Course 2)
These will be covered in more detail later:
- Momentum term
- Mini-batch size
- Regularization parameters (L2, dropout)
- Learning rate decay
Why “Hyper”parameters?
The prefix “hyper” means “above” or “beyond”—hyperparameters sit above the regular parameters in the hierarchy:
Hyperparameters (α, L, n^[l], ...)
↓
Control how we learn
↓
Parameters (W, b)
↓
Determine predictions
↓
Output (ŷ)
The relationship:
- Hyperparameters control the learning process
- The learning process determines the parameters
- The parameters determine the model’s predictions
Terminology note: In earlier machine learning eras, people often called $\alpha$ a “parameter.” With deep learning’s many hyperparameters, we now distinguish them clearly.
The Empirical Process of Hyperparameter Tuning
Deep Learning Is Empirical
Empirical means you learn from experiments and observations rather than pure theory.
Reality check: You cannot know the best hyperparameters in advance—you must try them!
The Iterative Cycle
1. Have an idea (e.g., "α = 0.01")
        ↓
2. Implement and train
        ↓
3. Observe results (cost function behavior)
        ↓
4. Refine idea (e.g., "Try α = 0.05")
        ↓
   (back to step 1)
Example: Tuning Learning Rate
Let’s say you’re experimenting with different learning rates:
Experiment 1: $\alpha = 0.01$
Cost J
↓
↓ Gradual decrease
↓_______________ (slow but steady)
Iterations
Observation: Learning is happening but slowly.
Experiment 2: $\alpha = 0.1$ (10× larger)
Cost J
↑
↑ Divergence!
↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ (cost explodes)
Iterations
Observation: Learning rate too large—gradient descent overshoots.
Experiment 3: $\alpha = 0.03$
Cost J
↓
↓↓ Fast decrease
↓________________ (quick convergence)
Iterations
Observation: Good balance—fast learning, stable convergence.
Decision: Use $\alpha = 0.03$ for this problem!
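You can automate this kind of comparison. The sketch below assumes a hypothetical `train_model` helper that wraps a training loop (like the one shown later in Step 2) and returns the learned parameters plus the recorded cost history:

```python
# train_model is a hypothetical helper: it runs gradient descent and
# returns (parameters, costs), where costs is the recorded cost history.
for alpha in [0.01, 0.1, 0.03]:
    parameters, costs = train_model(X_train, Y_train, alpha=alpha, num_iterations=1000)
    trend = "diverging" if costs[-1] > costs[0] else "decreasing"
    print(f"alpha={alpha}: final cost {costs[-1]:.4f} ({trend})")
```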
Visualizing Different Hyperparameter Choices
| Learning Rate | Cost Function Behavior | Verdict |
|---|---|---|
| Too small (0.001) | Slow steady decrease | ✓ Works but slow |
| Good (0.01-0.03) | Fast smooth decrease | ✓✓ Ideal |
| Too large (0.1+) | Oscillation or divergence | ✗ Doesn’t work |
Why You Must Experiment
Reason 1: Cross-Domain Transfer Is Unpredictable
Deep learning is applied across many domains:
- Computer vision (images)
- Speech recognition (audio)
- Natural language processing (text)
- Structured data (tables, databases)
- Recommender systems (e-commerce)
- Web search (ranking)
The problem: Hyperparameter intuitions from one domain may not transfer to another!
Example scenarios:
| Your Experience | New Problem | What Happens |
|---|---|---|
| Computer vision expert | Speech recognition | Some intuitions carry over, some don’t |
| NLP researcher | Medical imaging | Very different hyperparameters needed |
| E-commerce ML | Financial forecasting | Different data → different settings |
Best practice: Even with experience, always try a range of values when starting a new problem.
Reason 2: Hyperparameters Change Over Time
Even for the same problem, optimal hyperparameters can change:
Factors That Cause Changes
- Hardware improvements
  - CPUs get faster → can train with larger batches
  - GPUs improve → can use deeper networks
  - Memory increases → can process more data
- Dataset evolution
  - More data collected → may need more iterations
  - Data distribution shifts → may need different regularization
  - New features added → architecture may need adjustment
- Model improvements
  - Better initialization methods discovered
  - New activation functions developed
  - Advanced optimization algorithms available
Practical Recommendation
Rule of thumb: Every few months (or when significant changes occur), re-evaluate your hyperparameters to see if better values exist.
What to do:
- Set aside time for hyperparameter exploration
- Try variations around current values
- Compare performance on validation set
- Update if you find improvements
Building Intuition Over Time
As you work on a problem longer:
Month 1: Everything is trial and error
- Try many different values
- Learn what works and what doesn’t
Month 6: Patterns emerge
- You start to recognize good ranges
- Faster to narrow down choices
Year 1: Strong intuition
- Can make educated first guesses
- Still verify with experiments
Important: Intuition is problem-specific! Don’t assume it fully transfers to new domains.
Practical Hyperparameter Tuning Strategy
Step-by-Step Approach
Step 1: Start with Reasonable Defaults
Use commonly recommended starting points:
| Hyperparameter | Starting Value | Reasoning |
|---|---|---|
| Learning rate ($\alpha$) | 0.01 | Works well for many problems |
| Hidden layers ($L$) | 2-3 | Balance complexity and training time |
| Hidden units ($n^{[l]}$) | 50-100 | Enough capacity, not too slow |
| Activation (hidden) | ReLU | Fast, works well in practice |
| Activation (output) | Sigmoid/Softmax | Depends on task (binary/multi-class) |
Step 2: Implement and Train
# Example training setup (helper functions are from the earlier lessons)
parameters = initialize_parameters(layer_dims)

for i in range(num_iterations):
    AL, caches = forward_propagation(X, parameters)           # forward pass
    cost = compute_cost(AL, Y)                                # measure fit
    grads = backward_propagation(AL, Y, caches)               # gradients
    parameters = update_parameters(parameters, grads, alpha)  # gradient step
    if i % 100 == 0:
        print(f"Cost after iteration {i}: {cost}")
Step 3: Evaluate on Validation Set
Don’t use training error alone!
- Training set: Used to learn parameters
- Validation set: Used to choose hyperparameters
- Test set: Used for final evaluation (unbiased)
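One minimal way to produce such a split with NumPy, assuming this course's convention of one example per column (the 80/10/10 proportions are a common choice, not a rule):

```python
import numpy as np

m = X.shape[1]                    # number of examples (one per column)
perm = np.random.permutation(m)   # shuffle before splitting
n_train, n_val = int(0.8 * m), int(0.1 * m)

train_idx = perm[:n_train]
val_idx = perm[n_train:n_train + n_val]
test_idx = perm[n_train + n_val:]

X_train, Y_train = X[:, train_idx], Y[:, train_idx]  # learn parameters here
X_val, Y_val = X[:, val_idx], Y[:, val_idx]          # choose hyperparameters here
X_test, Y_test = X[:, test_idx], Y[:, test_idx]      # touch only once, at the end
```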
Step 4: Try Variations
Once you have a baseline, experiment systematically:
Example: Learning rate search
Try: α ∈ {0.001, 0.003, 0.01, 0.03, 0.1, 0.3}
Pick: Best validation performance
Example: Architecture search
Try: L ∈ {2, 3, 4, 5}
For each L, try: n^[l] ∈ {50, 100, 200}
Pick: Best validation performance vs training time tradeoff
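A sketch of the learning rate sweep above; `train_model` and `evaluate` are hypothetical helpers standing in for your training loop and your validation metric:

```python
best_alpha, best_score = None, -float("inf")

for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]:
    parameters, costs = train_model(X_train, Y_train, alpha=alpha, num_iterations=2000)
    score = evaluate(parameters, X_val, Y_val)  # validation set, never the test set
    print(f"alpha={alpha}: validation score {score:.4f}")
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"Best: alpha={best_alpha} (validation score {best_score:.4f})")
```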
What to Monitor
While training, watch these indicators:
| Metric | What It Tells You | Action |
|---|---|---|
| Training cost decreasing | Learning is happening | ✓ Good |
| Training cost flat | Learning stuck | Try larger $\alpha$ or different architecture |
| Training cost increasing | Divergence | ✗ Reduce $\alpha$ immediately |
| Train/validation gap small | Good generalization | ✓ Model is working |
| Train/validation gap large | Overfitting | Add regularization or get more data |
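These checks are easy to automate. A rough sketch that turns the table into code (the thresholds are illustrative, not standard values):

```python
def diagnose(train_costs, train_score, val_score, gap_threshold=0.05):
    """Rough diagnostics matching the table above; thresholds are illustrative."""
    if train_costs[-1] > train_costs[0]:
        return "Diverging: reduce alpha immediately"
    if abs(train_costs[-1] - train_costs[len(train_costs) // 2]) < 1e-6:
        return "Cost is flat: try a larger alpha or a different architecture"
    if train_score - val_score > gap_threshold:
        return "Large train/validation gap: add regularization or get more data"
    return "Looks healthy: cost decreasing, small generalization gap"
```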
Systematic Hyperparameter Search (Preview)
In Course 2, you’ll learn advanced techniques:
Grid Search
Try all combinations of hyperparameter values:
α = {0.01, 0.03, 0.1}
L = {2, 3, 4}
Total combinations: 3 × 3 = 9 experiments
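A minimal sketch of this grid with `itertools.product`; `train_and_validate` is a hypothetical helper that trains one model and returns its validation score:

```python
from itertools import product

# All 3 x 3 = 9 combinations from the example above.
alphas = [0.01, 0.03, 0.1]
depths = [2, 3, 4]

for alpha, L in product(alphas, depths):
    score = train_and_validate(alpha=alpha, num_layers=L)  # hypothetical helper
    print(f"alpha={alpha}, L={L}: validation score {score:.4f}")
```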
Random Search
Sample hyperparameter values randomly:
α ~ Uniform(0.001, 0.1)
L ~ {2, 3, 4, 5}
n^[l] ~ {50, 100, 200, 500}
Run 20-50 random combinations
Advantage: Often more efficient than grid search!
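A sketch of such a random search, reusing the hypothetical `train_and_validate` helper. One common refinement (covered in Course 2) is shown here: sampling the learning rate on a log scale instead of the plain uniform draw written above:

```python
import numpy as np

rng = np.random.default_rng(0)

for _ in range(20):
    # Log-scale sampling: 0.001 vs 0.01 is as big a change as 0.01 vs 0.1.
    alpha = 10 ** rng.uniform(-3, -1)
    L = rng.choice([2, 3, 4, 5])
    n_hidden = rng.choice([50, 100, 200, 500])
    score = train_and_validate(alpha=alpha, num_layers=L, hidden_units=n_hidden)
    print(f"alpha={alpha:.4f}, L={L}, n={n_hidden}: {score:.4f}")
```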
Bayesian Optimization
Use previous results to guide next experiments:
- Train model with hyperparameters A → score S_A
- Train model with hyperparameters B → score S_B
- Use S_A and S_B to intelligently pick next hyperparameters C
- Repeat
Advantage: Finds good hyperparameters with fewer experiments!
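Libraries implement this idea today. Here is a sketch using scikit-optimize's `gp_minimize`, assuming that package is available and again relying on the hypothetical `train_and_validate` helper:

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

def objective(hp):
    alpha, L = hp
    # gp_minimize minimizes, so return the negative validation score.
    return -train_and_validate(alpha=alpha, num_layers=L)

result = gp_minimize(
    objective,
    [Real(1e-3, 1e-1, prior="log-uniform"),  # alpha
     Integer(2, 5)],                         # L
    n_calls=20,      # far fewer experiments than an exhaustive grid
    random_state=0,
)
print("Best hyperparameters:", result.x)
```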
Common Pitfalls to Avoid
Pitfall 1: Using Test Set for Hyperparameter Tuning
Wrong approach:
Train on training set
Tune hyperparameters using test set ← NO!
Why it’s wrong: You’re “learning” from the test set, so the final test performance will be overoptimistic.
Right approach:
Train on training set
Tune hyperparameters using validation set ← YES!
Final evaluation on test set
Pitfall 2: Not Re-evaluating Hyperparameters
Wrong: Set hyperparameters once and never change them.
Right: Periodically re-evaluate, especially when:
- You get more data
- Hardware changes
- Problem requirements shift
Pitfall 3: Overfitting to Validation Set
Problem: If you try too many hyperparameter combinations, you might overfit to the validation set itself!
Solution:
- Limit the number of experiments
- Use separate validation and test sets
- Consider cross-validation for small datasets (see the sketch below)
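A minimal sketch of k-fold cross-validation for hyperparameter selection, again with the hypothetical `train_model` and `evaluate` helpers and one example per column:

```python
import numpy as np

def cross_validate(alpha, X, Y, k=5):
    """Average validation score of one hyperparameter setting over k folds."""
    m = X.shape[1]
    folds = np.array_split(np.random.permutation(m), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        parameters, _ = train_model(X[:, train_idx], Y[:, train_idx], alpha=alpha)
        scores.append(evaluate(parameters, X[:, val_idx], Y[:, val_idx]))
    return np.mean(scores)
```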
Pitfall 4: Ignoring Training Time
Problem: Best hyperparameters might be too slow to use in practice.
Solution: Consider performance vs training time tradeoff:
Option A: 95% accuracy, trains in 1 hour
Option B: 96% accuracy, trains in 10 hours
Choice depends on your constraints!
The State of Hyperparameter Tuning
Current Reality (2025)
Honest truth: Hyperparameter tuning remains something of an art, not a perfect science.
Why it’s challenging:
- Problems are diverse: No universal hyperparameters work everywhere
- Infrastructure changes: CPUs, GPUs, frameworks keep evolving
- Data changes: Distributions shift over time
- Research advances: New techniques constantly emerging
What this means for you:
- Experimentation is necessary and expected
- Building intuition takes time and practice
- Even experts experiment and iterate
Future Directions
Research is making progress on:
- AutoML: Automatic hyperparameter optimization
- Neural Architecture Search (NAS): Automatically find network structures
- Meta-learning: Learn how to set hyperparameters from past experience
- Transfer learning: Reduce need for problem-specific tuning
But for now: Embrace the empirical process!
Practical Advice Summary
Do’s ✓
- Start simple: Begin with 2-3 layers and standard hyperparameters
- Use validation set: Always tune on separate data from training
- Try ranges: Test multiple values (e.g., 0.001, 0.01, 0.1 for $\alpha$)
- Monitor metrics: Watch both training and validation performance
- Document everything: Keep track of what you tried and the results you got
- Re-evaluate periodically: Check whether your hyperparameters are still optimal
- Build intuition: Learn from each experiment
Don’ts ✗
- Don’t expect perfect values: Accept that tuning is empirical
- Don’t use test set for tuning: Save it for final evaluation
- Don’t blindly copy: Hyperparameters from other domains may not transfer
- Don’t set once forever: Optimal values change over time
- Don’t ignore training time: Consider practical constraints
- Don’t try everything: Be systematic, not exhaustive
- Don’t get discouraged: Tuning is hard for everyone!
Key Takeaways
- Parameters ($W, b$): Learned automatically by the model during training
- Hyperparameters ($\alpha, L, n^{[l]}$, activation): Chosen by you before training
- Hyperparameters control parameters: They determine how learning happens
- Deep learning is empirical: You must try different values and observe results
- Iterative process: Idea → Implement → Evaluate → Refine → Repeat
- Learning rate experiments: Try multiple values and watch cost function behavior
- Domain transfer is unpredictable: Intuitions may or may not carry across problems
- Hyperparameters change over time: Re-evaluate periodically (every few months)
- Use validation set: Tune on separate data from training set
- Monitor both metrics: Training and validation performance tell different stories
- Start with defaults: Use reasonable starting points (e.g., $\alpha = 0.01$)
- Be systematic: Try ranges of values, not random guessing
- Consider constraints: Balance performance vs training time
- Build intuition: Experience helps but doesn’t eliminate need for experiments
- Embrace uncertainty: Hyperparameter tuning is hard—even for experts!