Last modified: Jan 31 2026 at 10:09 PM

Train / Dev / Test Sets

Table of contents

  1. Introduction
  2. Why Data Splitting Matters
  3. The Three Data Sets
  4. Data Split Ratios: Then vs Now
  5. Mismatched Train and Test Distributions
  6. When You Don’t Need a Test Set
  7. Practical Guidelines
  8. Key Takeaways

Introduction

Properly setting up your training, development (dev), and test sets is one of the most important decisions you’ll make when building a neural network. Good data organization can dramatically accelerate your progress and help you find high-performance models faster.

Why Data Splitting Matters

The Iterative Nature of Deep Learning

Deep learning is fundamentally an empirical, iterative process. When starting a new project, you need to make many decisions:

  • Architecture: How many layers? How many hidden units per layer?
  • Optimization: What learning rate? Which optimizer?
  • Regularization: Dropout rate? L2 penalty?
  • Activation functions: ReLU? Tanh? Leaky ReLU?

Key insight: It’s almost impossible to choose the right hyperparameters on your first attempt, even for experienced practitioners.

The Development Cycle

The typical workflow looks like this:

[Figure: three circular arrows forming a cycle between Idea, Code, and Experiment, illustrating the iterative development process: generate ideas about architecture and hyperparameters, implement them in code, run experiments to evaluate performance, and use the results to inform new ideas.]

Your goal: Make this cycle as fast as possible. Proper data setup is crucial for rapid iteration.

Domain Expertise Doesn’t Always Transfer

Even experienced researchers find that intuitions don’t transfer well across domains:

  • NLP expert → Computer vision (different best practices)
  • Speech recognition → Advertising (different data characteristics)
  • Security → Logistics (different problem structures)

Why? Optimal choices depend on:

  • Amount of available data
  • Number of input features
  • Hardware configuration (GPU vs CPU, cluster setup)
  • Data distribution characteristics
  • Application-specific constraints

The Three Data Sets

Purpose of Each Set

| Set | Purpose | Usage |
| --- | --- | --- |
| Training set | Learn parameters ($w$, $b$) | Train your model repeatedly |
| Dev set | Compare model architectures | Choose between different models |
| Test set | Unbiased performance estimate | Final evaluation only |

Note: The dev set is also called the hold-out cross-validation set or validation set. These terms are interchangeable.

The Standard Workflow

  1. Train: Train multiple models on the training set
  2. Validate: Evaluate all models on the dev set
  3. Select: Pick the best-performing model from dev set results
  4. Test: Evaluate the final selected model on the test set (once!)

The test set gives you an unbiased estimate because you haven’t used it to make any decisions about your model.
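To make the workflow concrete, here is a minimal sketch in Python with scikit-learn. The synthetic dataset, the two candidate models, and the 60/20/20 split are illustrative assumptions, not part of the recipe itself:

```python
# A minimal sketch of the train / dev / test workflow with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10_000, random_state=0)

# Two cuts: first carve out the training set, then split the remainder
# evenly into dev and test (a 60/20/20 split overall).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# 1. Train multiple candidate models on the training set.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for model in candidates.values():
    model.fit(X_train, y_train)

# 2-3. Compare on the dev set and select the best performer.
best_name = max(candidates, key=lambda name: candidates[name].score(X_dev, y_dev))

# 4. Touch the test set exactly once, for the final unbiased estimate.
print(best_name, candidates[best_name].score(X_test, y_test))
```

The important property is step 4: the test set is consulted exactly once, after every selection decision has already been made on the dev set.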

Data Split Ratios: Then vs Now

Traditional Machine Learning Era (Small Data)

Dataset size: 100 to 10,000 examples

Common splits:

  • 70% train / 30% test
  • 60% train / 20% dev / 20% test

These ratios were reasonable when data was limited.

Modern Deep Learning Era (Big Data)

Dataset size: 1 million+ examples

Modern splits:

  • 1,000,000 examples: 98% train / 1% dev / 1% test (10K dev, 10K test)
  • 10,000,000 examples: 99.5% train / 0.25% dev / 0.25% test (25K dev, 25K test)
  • 100,000,000 examples: 99.9% train / 0.05% dev / 0.05% test (50K dev, 50K test)
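These percentages are just fixed absolute holdout sizes re-expressed as fractions. A quick sanity check of the three examples above in plain Python, with the dev/test counts taken from the list:

```python
def split_fractions(n_total: int, n_dev: int, n_test: int):
    """Return (train, dev, test) fractions given absolute dev/test sizes."""
    n_train = n_total - n_dev - n_test
    return n_train / n_total, n_dev / n_total, n_test / n_total

# Dev/test stay in the tens of thousands, so their *share* of the data
# shrinks as the dataset grows.
for n_total, n_holdout in [(1_000_000, 10_000), (10_000_000, 25_000), (100_000_000, 50_000)]:
    train, dev, test = split_fractions(n_total, n_holdout, n_holdout)
    print(f"{n_total:>11,}: {train:.2%} / {dev:.2%} / {test:.2%}")
#   1,000,000: 98.00% / 1.00% / 1.00%
#  10,000,000: 99.50% / 0.25% / 0.25%
# 100,000,000: 99.90% / 0.05% / 0.05%
```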

Why Such Small Dev/Test Sets?

Dev set purpose: Determine which of several models performs best

  • Do you need 200,000 examples to compare 10 algorithms? No!
  • 10,000 examples is usually more than sufficient
  • You just need enough data to distinguish between models with statistical confidence (quantified in the sketch below)

Test set purpose: Estimate final model performance

  • Again, 10,000-50,000 examples is typically enough
  • You need enough data for a confident performance estimate, not 20% of your data
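One way to make "enough data" precise: a measured error rate is a binomial proportion, so its uncertainty shrinks as $1/\sqrt{n}$. A back-of-the-envelope sketch in Python, where the 10% error rate is an illustrative assumption:

```python
import math

def error_stderr(p_error: float, n: int) -> float:
    """Standard error of an error-rate estimate from n i.i.d. examples.

    Each prediction is simply right or wrong, so the measured error rate
    is a binomial proportion with standard error sqrt(p * (1 - p) / n).
    """
    return math.sqrt(p_error * (1 - p_error) / n)

# With a true error around 10%, 10,000 examples pin the estimate down to
# roughly +/- 0.6% at 95% confidence: tight enough to tell a 10.0% model
# from an 11.0% model.
for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7,}: 95% CI ~ +/- {1.96 * error_stderr(0.10, n):.2%}")
# n=  1,000: 95% CI ~ +/- 1.86%
# n= 10,000: 95% CI ~ +/- 0.59%
# n=100,000: 95% CI ~ +/- 0.19%
```

At 10,000 examples the interval is already well under one percentage point, which is why comparing a handful of models rarely requires hundreds of thousands of dev examples.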

General Guidelines by Dataset Size

| Total Examples | Train | Dev | Test | Reasoning |
| --- | --- | --- | --- | --- |
| 100-1,000 | 60% | 20% | 20% | Traditional splits work fine |
| 10,000 | 60% | 20% | 20% | Still reasonable |
| 100,000 | 90% | 5% | 5% | Starting to shift |
| 1,000,000 | 98% | 1% | 1% | Modern big data approach |
| 10,000,000+ | 99%+ | <1% | <1% | Maximize training data |
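If you want the table as a drop-in heuristic, a tiny helper might look like the following; the exact thresholds and the 99.8/0.1/0.1 top tier are assumptions read off the rows above:

```python
def suggested_split(n_total: int) -> tuple[float, float, float]:
    """Rough (train, dev, test) fractions following the guideline table above.

    Thresholds are illustrative assumptions; in practice, size dev/test by
    the absolute number of examples you need, not by a fixed ratio.
    """
    if n_total <= 10_000:
        return 0.60, 0.20, 0.20    # traditional splits still fine
    if n_total <= 100_000:
        return 0.90, 0.05, 0.05    # starting to shift
    if n_total <= 1_000_000:
        return 0.98, 0.01, 0.01    # modern big-data approach
    return 0.998, 0.001, 0.001     # 10M+: maximize training data

print(suggested_split(1_000_000))  # (0.98, 0.01, 0.01)
```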

Mismatched Train and Test Distributions

The Modern Reality

In modern deep learning, it’s increasingly common to have different distributions for training vs dev/test sets.

Example: Cat Photo App

Scenario: Building an app to find cat pictures for users

Data sources:

  • Training set: High-quality cat photos scraped from the Internet
    • Professional photography
    • High resolution
    • Perfect framing and lighting
    • Large volume available (millions of images)
  • Dev/Test sets: Cat photos uploaded by users
    • Phone camera quality
    • Lower resolution
    • Casual conditions (blurry, poor lighting)
    • Limited volume available (thousands of images)

Critical Rule for Mismatched Distributions

Rule: Dev and test sets must come from the same distribution!

Why this matters:

\[\text{Train distribution} \neq \text{Dev/Test distribution} \quad \text{(OK)}\]

\[\text{Dev distribution} \neq \text{Test distribution} \quad \text{(NOT OK)}\]

Reasoning:

  1. You’ll spend significant time optimizing performance on the dev set
  2. If dev and test distributions differ, good dev performance doesn’t guarantee good test performance
  3. You want to “aim” at the same target you’ll be evaluated on
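In code, the rule comes down to one habit: pool all of your target-distribution data, shuffle it, and only then cut it into dev and test. A minimal sketch, where `web_photos` and `user_photos` are hypothetical stand-ins for the two data sources:

```python
import random

def make_splits(web_photos, user_photos, dev_frac=0.5, seed=0):
    """Build train/dev/test when training data comes from a different source.

    Hypothetical inputs: `web_photos` (plentiful, off-distribution) and
    `user_photos` (scarce, the true target distribution).
    """
    rng = random.Random(seed)
    pool = list(user_photos)
    rng.shuffle(pool)                 # shuffle BEFORE cutting, so dev and
    cut = int(len(pool) * dev_frac)   # test share a single distribution
    dev_set, test_set = pool[:cut], pool[cut:]
    train_set = list(web_photos)      # train may differ in distribution
    return train_set, dev_set, test_set
```

Because dev and test are cut from a single shuffled pool, they are statistically identical, so a model tuned on the dev set is aimed at the same target it will finally be evaluated on.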

When to Use Mismatched Distributions

Use this approach when:

  • ✅ You can easily acquire large training data from alternative sources
  • ✅ Your true target data is limited
  • ✅ Alternative data is relevant to your task

Example scenarios:

  • Medical imaging: Public datasets + small hospital-specific data
  • Speech recognition: Audiobooks + real user recordings
  • Product recommendations: Historical data + recent user behavior

When You Don’t Need a Test Set

Test Set is Optional

The test set serves one purpose: Give an unbiased estimate of your final model’s performance.

If you don’t need that unbiased estimate, you can skip the test set!

Train + Dev Only (No Test)

Workflow:

  1. Train on training set
  2. Evaluate multiple models on dev set
  3. Pick the best model based on dev set
  4. Deploy that model

Caveat: Because you’ve optimized for the dev set, your performance estimate is biased (overly optimistic).

When this is acceptable:

  • Rapid prototyping and experimentation
  • Internal tools where slight performance overestimation is acceptable
  • Situations where you can quickly observe real-world performance

Terminology Warning

⚠️ Common confusion: Many teams call their dev set a “test set” when they don’t have a separate test set.

What they actually have:

  • Training set
  • Dev set (mislabeled as “test set”)
  • No true test set

Problem: They’re overfitting to what they call the “test set”, which defeats the purpose of unbiased evaluation.

Better terminology: Just be honest and call it a dev set!

Practical Guidelines

Quick Reference: Setting Up Your Data

  1. Split your data into train/dev/test
  2. Choose split ratios based on total data size
  3. Ensure dev and test come from the same distribution
  4. Make dev/test represent your target application
  5. Consider skipping test set if you don’t need unbiased estimates

Checklist for Data Setup

  • Do I have enough data for each split to be meaningful?
  • Are my dev and test sets from the same distribution?
  • Do my dev/test sets represent real-world usage?
  • Is my dev set large enough to distinguish between models?
  • Is my test set large enough for confident performance estimation?
  • Have I considered whether I actually need a test set?

Red Flags to Avoid

Don’t:

  • Use different distributions for dev and test sets
  • Make dev/test sets too large when you have big data
  • Use the test set multiple times to make decisions
  • Mix real target data into training when it should be in dev/test

Do:

  • Put real target data in dev/test sets
  • Use alternative sources to bulk up training data
  • Keep dev/test distributions identical
  • Reserve test set for final evaluation only

Key Takeaways

  1. Data organization matters: Proper train/dev/test setup accelerates progress significantly
  2. Deep learning is iterative: You’ll cycle through many experiments before finding good hyperparameters
  3. Traditional splits are outdated: With big data, use much smaller dev/test sets (often 1% or less each)
  4. Dev set size: Just needs to distinguish between models (10K examples often sufficient)
  5. Test set size: Just needs confident performance estimate (10K-50K examples often sufficient)
  6. Distribution matching: Dev and test must have same distribution
  7. Distribution mismatch: Train can differ from dev/test if it gives you more data
  8. Test set optional: Skip it if you don’t need unbiased performance estimates
  9. Terminology confusion: Many call dev sets “test sets” - be aware of this
  10. Domain transfer is hard: Intuitions from one domain rarely transfer to others
  11. Hardware matters: GPU/CPU configuration affects optimal hyperparameters
  12. Empirical process: You must experiment - no one gets it right the first time
  13. Fast iteration wins: Efficient data setup enables rapid experimentation
  14. Aim at the right target: Put representative data in dev/test, not training
  15. More data usually helps: Use creative tactics to acquire more training data