
Logistic Regression Cost Function

Table of contents

  1. Introduction
  2. Goal of Training
  3. Loss Function (Single Example)
  4. Intuition Behind Log Loss
  5. Cost Function (Entire Training Set)
  6. Key Terminology
  7. Training Objective
  8. Connection to Neural Networks

Introduction

To train a logistic regression model, we need to define a cost function that measures how well our model performs.

Recap from previous lesson:

\[\hat{y} = \sigma(w^T x + b)\]

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function, which squashes any real number into the interval $(0, 1)$.
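To make the recap concrete, here is a minimal NumPy sketch of this prediction (the variable names and toy values are illustrative, not from the lesson):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, w, b):
    """Logistic regression prediction: yhat = sigma(w^T x + b)."""
    return sigmoid(np.dot(w, x) + b)

# Toy example with 2 features and arbitrary parameters
x = np.array([1.5, -0.7])
w = np.array([0.3, 0.8])
b = -0.1
print(predict(x, w, b))  # a probability strictly between 0 and 1
```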

Goal of Training

Objective: Find parameters $w$ and $b$ such that predictions $\hat{y}^{(i)}$ are close to the true labels $y^{(i)}$ for all training examples.

Notation for Training Examples

For the $i$-th training example:

\[\hat{y}^{(i)} = \sigma(w^T x^{(i)} + b)\]

We can also define:

\[z^{(i)} = w^T x^{(i)} + b\]

Notational Convention: The superscript $(i)$ in parentheses refers to the $i$-th training example. This applies to $x^{(i)}$, $y^{(i)}$, $z^{(i)}$, etc.
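In code, the superscript $(i)$ simply becomes an index into the training set. A small sketch, assuming the common convention of storing the $m$ examples as the columns of a matrix `X`:

```python
import numpy as np

# Toy training set: n_x = 2 features, m = 3 examples (one per column)
X = np.array([[1.0, 2.0, -1.0],
              [0.5, -0.5, 1.5]])
w = np.array([0.3, 0.8])
b = -0.1

m = X.shape[1]
for i in range(m):
    z_i = np.dot(w, X[:, i]) + b          # z^(i) = w^T x^(i) + b
    yhat_i = 1.0 / (1.0 + np.exp(-z_i))   # yhat^(i) = sigma(z^(i))
    print(f"example {i}: z = {z_i:+.3f}, yhat = {yhat_i:.3f}")
```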

Loss Function (Single Example)

The loss function $\mathcal{L}(\hat{y}, y)$ measures how well the prediction $\hat{y}$ matches the true label $y$ for a single training example.

Why Not Use Squared Error?

You might consider using squared error:

\[\mathcal{L}(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2\]

Problem: Because $\hat{y} = \sigma(w^T x + b)$ is a nonlinear function of the parameters, squared error yields a non-convex optimization problem in $w$ and $b$, with multiple local optima; gradient descent can get stuck in a local optimum instead of finding the global minimum.
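This non-convexity is easy to check numerically: a convex function has non-negative curvature everywhere, but the squared-error loss of a sigmoid unit violates this. A small sketch, assuming a single training example with $x = 1$, $b = 0$, $y = 1$, estimating the second derivative with finite differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squared_error(w):
    """Squared-error loss of a sigmoid unit as a function of the weight w
    (fixing x = 1, b = 0, y = 1)."""
    return 0.5 * (sigmoid(w) - 1.0) ** 2

def curvature(f, w, h=1e-4):
    """Central finite-difference estimate of the second derivative."""
    return (f(w + h) - 2.0 * f(w) + f(w - h)) / h**2

# The curvature changes sign, so the loss cannot be convex in w.
for w in (-3.0, 2.0):
    print(f"w = {w:+.1f}: curvature ~ {curvature(squared_error, w):+.5f}")
```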

Logistic Regression Loss Function

Instead, we use the log loss (binary cross-entropy):

\[\mathcal{L}(\hat{y}, y) = -\left[y \log(\hat{y}) + (1-y) \log(1-\hat{y})\right]\]

Why this works: This function is convex in $w$ and $b$, so gradient descent can reliably find the global optimum.
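A direct translation of the formula into NumPy (the `eps` clipping is a standard numerical safeguard, not part of the formula, added to avoid `log(0)` when $\hat{y}$ saturates):

```python
import numpy as np

def log_loss(yhat, y, eps=1e-12):
    """Binary cross-entropy for a single example."""
    yhat = np.clip(yhat, eps, 1.0 - eps)  # avoid log(0); not part of the math
    return -(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

print(log_loss(0.9, 1))  # ~0.105: confident and correct -> small loss
print(log_loss(0.1, 1))  # ~2.303: confident and wrong   -> large loss
```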

Intuition Behind Log Loss

Goal: Minimize the loss function (make it as small as possible).

Case 1: When $y = 1$

If $y = 1$, the loss becomes:

\[\mathcal{L}(\hat{y}, 1) = -\log(\hat{y})\]

(The second term vanishes because $(1-y) = 0$)

To minimize loss:

  • Want $-\log(\hat{y})$ to be small
  • This means $\log(\hat{y})$ should be large
  • Therefore, $\hat{y}$ should be large
  • Since $\hat{y} \in (0, 1)$ (the sigmoid's output range), we want $\hat{y} \approx 1$

Interpretation: When the true label is 1, the loss pushes $\hat{y}$ close to 1.

Case 2: When $y = 0$

If $y = 0$, the loss becomes:

\[\mathcal{L}(\hat{y}, 0) = -\log(1 - \hat{y})\]

(The first term vanishes because $y = 0$)

To minimize loss:

  • Want $-\log(1-\hat{y})$ to be small
  • This means $\log(1-\hat{y})$ should be large
  • Therefore, $(1-\hat{y})$ should be large
  • This means $\hat{y}$ should be small
  • We want $\hat{y} \approx 0$

Interpretation: When the true label is 0, the loss pushes $\hat{y}$ close to 0.

Summary

The loss function ensures:

  • If $y = 1$: Push $\hat{y} \to 1$
  • If $y = 0$: Push $\hat{y} \to 0$
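A quick numerical check of both cases, sweeping $\hat{y}$ across its range for each label:

```python
import numpy as np

yhats = np.array([0.01, 0.1, 0.5, 0.9, 0.99])
for y in (1, 0):
    losses = -(y * np.log(yhats) + (1 - y) * np.log(1 - yhats))
    print(f"y = {y}:", "  ".join(f"{loss:6.3f}" for loss in losses))

# y = 1: loss falls from 4.605 to 0.010 as yhat -> 1
# y = 0: loss rises from 0.010 to 4.605 as yhat -> 1
```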

Cost Function (Entire Training Set)

The loss function measures performance on a single example. The cost function $J(w, b)$ measures performance on the entire training set.

Definition:

\[J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})\]

Expanded form:

\[J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[y^{(i)} \log(\hat{y}^{(i)}) + (1-y^{(i)}) \log(1-\hat{y}^{(i)})\right]\]

Where:

  • $m$ = number of training examples
  • $\hat{y}^{(i)}$ = prediction for example $i$ using parameters $w$ and $b$
  • $y^{(i)}$ = true label for example $i$
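Vectorizing the definition above gives a compact implementation. A sketch, again assuming examples are stored as columns of `X` (shape `(n_x, m)`) with labels in a vector `Y` of shape `(m,)`:

```python
import numpy as np

def cost(w, b, X, Y, eps=1e-12):
    """J(w, b): average binary cross-entropy over all m training examples."""
    Yhat = 1.0 / (1.0 + np.exp(-(np.dot(w, X) + b)))  # all m predictions at once
    Yhat = np.clip(Yhat, eps, 1.0 - eps)              # numerical safeguard
    return -np.mean(Y * np.log(Yhat) + (1 - Y) * np.log(1 - Yhat))

# Toy data: 2 features, 4 examples
X = np.array([[1.0, 2.0, 3.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
Y = np.array([1, 0, 1, 0])
print(cost(np.array([0.2, -0.4]), 0.1, X, Y))  # J for these parameters
```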

Key Terminology

Loss Function $\mathcal{L}(\hat{y}, y)$:

  • Measures error on a single training example
  • Compares one prediction to one true label

Cost Function $J(w, b)$:

  • Measures average error across the entire training set
  • This is what we minimize during training

Training Objective

Goal: Find parameters $w$ and $b$ that minimize the cost function $J(w, b)$.

This will be accomplished using gradient descent, which we’ll explore in the next lesson.
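Gradient descent itself is the next lesson's topic, but we can already sanity-check the objective with an off-the-shelf optimizer. A sketch using SciPy's general-purpose `minimize` on the cost function above (the toy data is illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def cost_flat(theta, X, Y, eps=1e-12):
    """J as a function of a flat parameter vector theta = [w_1, ..., w_n, b]."""
    w, b = theta[:-1], theta[-1]
    Yhat = 1.0 / (1.0 + np.exp(-(np.dot(w, X) + b)))
    Yhat = np.clip(Yhat, eps, 1.0 - eps)
    return -np.mean(Y * np.log(Yhat) + (1 - Y) * np.log(1 - Yhat))

X = np.array([[1.0, 2.0, 3.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
Y = np.array([1, 0, 1, 0])

result = minimize(cost_flat, x0=np.zeros(3), args=(X, Y))
print("w* =", result.x[:-1], " b* =", result.x[-1], " J* =", result.fun)
```

Because $J$ is convex, any reasonable starting point leads to the same minimum; gradient descent is simply a more direct way of getting there.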

Connection to Neural Networks

Logistic regression can be viewed as a very simple neural network with:

  • No hidden layers
  • Single output unit
  • Sigmoid activation

This makes it an excellent foundation for understanding more complex neural networks.