
Vectorization

Table of contents

  1. Introduction
  2. What is Vectorization?
  3. Performance Comparison Demo
  4. Why Vectorization Works: SIMD
  5. Why This Matters in Deep Learning
  6. The Golden Rule
  7. Common Vectorization Patterns
  8. Key Takeaways

Introduction

Vectorization is the art of eliminating explicit for-loops in your code. In deep learning, this is a critical skill because:

  • Deep learning works best with large datasets
  • For-loops make training extremely slow
  • Vectorized code can run 300x faster

Why it matters: The difference between your code taking 1 minute vs 5 hours.

What is Vectorization?

The Problem: Computing $z = w^T x + b$

In logistic regression, we need to compute:

\[z = w^T x + b\]

Where:

  • $w \in \mathbb{R}^{n_x}$ (weight vector)
  • $x \in \mathbb{R}^{n_x}$ (feature vector)
  • Both can be very large vectors (many features)
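
For concreteness, the snippets below assume a setup like the following (the dimension n_x = 10000 and the bias value are arbitrary choices for illustration):

import numpy as np

n_x = 10000               # arbitrary feature count for illustration
w = np.random.rand(n_x)   # weight vector, shape (n_x,)
x = np.random.rand(n_x)   # feature vector, shape (n_x,)
b = 0.5                   # bias (scalar), arbitrary value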

Non-Vectorized Implementation (Slow)

z = 0
for i in range(n_x):
    z += w[i] * x[i]
z += b

Problem: This explicit for-loop is very slow for large $n_x$.

Vectorized Implementation (Fast)

z = np.dot(w, x) + b

Benefits:

  • Single line of code
  • No explicit loop
  • Much faster execution
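
As a quick sanity check (reusing the hypothetical w, x, b from the setup above), both implementations should produce the same value up to floating-point rounding:

z_loop = 0
for i in range(n_x):
    z_loop += w[i] * x[i]
z_loop += b

z_vec = np.dot(w, x) + b

assert np.isclose(z_loop, z_vec)   # same result, very different speed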

Performance Comparison Demo

Let’s demonstrate the speed difference with a concrete example.

Setup: Create Test Data

import numpy as np
import time

# Create two 1-million dimensional arrays
a = np.random.rand(1000000)
b = np.random.rand(1000000)

Vectorized Version

tic = time.time()
c = np.dot(a, b)
toc = time.time()

print(f"Vectorized version: {1000*(toc-tic):.2f} ms")

Result: ~1.5 milliseconds

Non-Vectorized Version

c = 0
tic = time.time()
for i in range(1000000):
    c += a[i] * b[i]
toc = time.time()

print(f"For loop version: {1000*(toc-tic):.2f} ms")
print(f"Result: {c}")

Result: ~481 milliseconds

Performance Summary

| Implementation | Time   | Speedup       |
| -------------- | ------ | ------------- |
| Vectorized     | 1.5 ms | 1x (baseline) |
| For-loop       | 481 ms | 321x slower   |

Both compute the same result, but vectorized is ~300x faster!
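
Note that a single time.time() measurement can be noisy. For a more stable number, the standard timeit module averages over many runs; a minimal sketch, reusing the arrays a and b from the demo above:

import timeit

# Average the vectorized dot product over 100 runs, reported in milliseconds.
vec_ms = timeit.timeit(lambda: np.dot(a, b), number=100) / 100 * 1000
print(f"Vectorized (avg over 100 runs): {vec_ms:.3f} ms")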

Why Vectorization Works: SIMD

Hardware Parallelization

Both CPUs and GPUs have SIMD instructions:

SIMD = Single Instruction, Multiple Data

What this means:

  • Process multiple data points simultaneously
  • One operation → many calculations in parallel
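
You cannot issue SIMD instructions directly from Python, but the idea is visible at the API level: the single NumPy call below describes a million element-wise additions, and the compiled library underneath is then free to execute them with wide parallel instructions (a rough illustration of the principle, not a guarantee of which instructions your hardware uses):

import numpy as np

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# One call, one million additions, all handled inside compiled code.
c = a + b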

Where Vectorization Runs

| Hardware | Performance | Notes                                     |
| -------- | ----------- | ----------------------------------------- |
| GPU      | Excellent   | Specialized for parallel computation      |
| CPU      | Very Good   | Also supports SIMD, just not as optimized |

Key insight: NumPy’s built-in functions (np.dot, np.sum, etc.) automatically leverage SIMD parallelism on the CPU; GPU frameworks for deep learning apply the same principle, letting a single call dispatch work to many parallel units.
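
If you are curious which optimized backend your NumPy build links against (the BLAS/LAPACK libraries that contain these SIMD-accelerated kernels), NumPy can report it; the exact output format varies by version and install:

import numpy as np

# Prints the BLAS/LAPACK and build configuration for this NumPy install.
np.show_config()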

Why This Matters in Deep Learning

Training Time Comparison

Non-vectorized code:

  • Small dataset: manageable
  • Large dataset: hours or days

Vectorized code:

  • Small dataset: instant
  • Large dataset: minutes

Real-World Impact

Illustrative orders of magnitude:

| Code Type  | 1M Examples | 10M Examples |
| ---------- | ----------- | ------------ |
| For-loops  | 5 hours     | 50 hours     |
| Vectorized | 1 minute    | 10 minutes   |
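
This scaling comes from pushing the same idea further: instead of looping over training examples, stack them as columns of a matrix X with shape (n_x, m) and compute all m values of z in one call. A sketch with arbitrary sizes:

import numpy as np

n_x, m = 10000, 1000          # arbitrary: 10,000 features, 1,000 examples
w = np.random.rand(n_x)       # weights, shape (n_x,)
X = np.random.rand(n_x, m)    # all training examples as columns
b = 0.5                       # bias (scalar), arbitrary value

# One matrix-vector product replaces a Python loop over all m examples;
# the scalar b is broadcast across the result.
Z = np.dot(w, X) + b          # shape (m,)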

The Golden Rule

Whenever possible, avoid explicit for-loops

Instead:

  • Use NumPy’s built-in functions
  • Think in terms of matrix/vector operations
  • Let the library handle parallelization

Common Vectorization Patterns

Instead of This (Loop)

result = 0
for i in range(n):
    result += w[i] * x[i]

Do This (Vectorized)

result = np.dot(w, x)

More Examples

| Operation             | Loop Version                       | Vectorized   |
| --------------------- | ---------------------------------- | ------------ |
| Dot product           | sum(w[i]*x[i] for i in range(n))   | np.dot(w, x) |
| Element-wise multiply | [w[i]*x[i] for i in range(n)]      | w * x        |
| Sum                   | sum(x[i] for i in range(n))        | np.sum(x)    |
| Exponential           | [math.exp(x[i]) for i in range(n)] | np.exp(x)    |
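
A quick sketch verifying these equivalences on a small random vector (the names and sizes here are arbitrary):

import math
import numpy as np

n = 5
w = np.random.rand(n)
x = np.random.rand(n)

assert np.isclose(sum(w[i] * x[i] for i in range(n)), np.dot(w, x))
assert np.allclose([w[i] * x[i] for i in range(n)], w * x)
assert np.isclose(sum(x[i] for i in range(n)), np.sum(x))
assert np.allclose([math.exp(x[i]) for i in range(n)], np.exp(x))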

Key Takeaways

  1. Vectorization eliminates for-loops using matrix operations
  2. 300x speedup is typical for vectorized vs loop-based code
  3. SIMD instructions enable parallel processing on CPUs and GPUs
  4. NumPy functions automatically leverage hardware parallelization
  5. Deep learning requires vectorization to handle large datasets efficiently
  6. Rule of thumb: Always prefer vectorized operations over loops

Remember: In deep learning, vectorization isn’t optional—it’s essential for practical implementation.