Last modified: Jan 31 2026 at 10:09 PM • 3 mins read
Logistic Regression Cost Function
Table of contents
- Introduction
- Goal of Training
- Loss Function (Single Example)
- Intuition Behind Log Loss
- Cost Function (Entire Training Set)
- Key Terminology
- Training Objective
- Connection to Neural Networks
Introduction
To train a logistic regression model, we need to define a cost function that measures how well our model performs.
Recap from previous lesson:
\[\hat{y} = \sigma(w^T x + b)\]
where $\sigma(z)$ is the sigmoid function.
Goal of Training
Objective: Find parameters $w$ and $b$ such that predictions $\hat{y}^{(i)}$ are close to the true labels $y^{(i)}$ for all training examples.
Notation for Training Examples
For the $i$-th training example:
\[\hat{y}^{(i)} = \sigma(w^T x^{(i)} + b)\]
We can also define:
\[z^{(i)} = w^T x^{(i)} + b\]
Notational Convention: The superscript $(i)$ in parentheses refers to the $i$-th training example. This applies to $x^{(i)}$, $y^{(i)}$, $z^{(i)}$, etc.
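The prediction above can be sketched in a few lines of NumPy. The weights, bias, and feature values below are made-up numbers purely for illustration:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, x):
    """Compute y-hat = sigma(w^T x + b) for a single example x."""
    z = np.dot(w, x) + b
    return sigmoid(z)

# Hypothetical parameters and one example with 3 features
w = np.array([0.5, -0.25, 0.1])
b = 0.2
x = np.array([1.0, 2.0, 3.0])
print(predict(w, b, x))  # ≈ 0.622
```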
Loss Function (Single Example)
The loss function $\mathcal{L}(\hat{y}, y)$ measures how well the prediction $\hat{y}$ matches the true label $y$ for a single training example.
Why Not Use Squared Error?
You might consider using squared error:
\[\mathcal{L}(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2\]
Problem: Composed with the sigmoid, squared error makes the optimization problem non-convex, with multiple local optima, so gradient descent is not guaranteed to find the global minimum.
Logistic Regression Loss Function
Instead, we use the log loss (binary cross-entropy):
\[\mathcal{L}(\hat{y}, y) = -\left[y \log(\hat{y}) + (1-y) \log(1-\hat{y})\right]\]
Why this works: This loss is convex in the parameters $w$ and $b$, which makes optimization with gradient descent much easier.
Intuition Behind Log Loss
Goal: Minimize the loss function (make it as small as possible).
Case 1: When $y = 1$
If $y = 1$, the loss becomes:
\[\mathcal{L}(\hat{y}, 1) = -\log(\hat{y})\]
(The second term vanishes because $(1-y) = 0$.)
To minimize loss:
- Want $-\log(\hat{y})$ to be small
- This means $\log(\hat{y})$ should be large
- Therefore, $\hat{y}$ should be large
- Since the sigmoid keeps $\hat{y} \in (0, 1)$, we want $\hat{y} \approx 1$
Interpretation: When the true label is 1, the loss pushes $\hat{y}$ close to 1.
Case 2: When $y = 0$
If $y = 0$, the loss becomes:
\[\mathcal{L}(\hat{y}, 0) = -\log(1 - \hat{y})\]
(The first term vanishes because $y = 0$.)
To minimize loss:
- Want $-\log(1-\hat{y})$ to be small
- This means $\log(1-\hat{y})$ should be large
- Therefore, $(1-\hat{y})$ should be large
- This means $\hat{y}$ should be small
- We want $\hat{y} \approx 0$
Interpretation: When the true label is 0, the loss pushes $\hat{y}$ close to 0.
Summary
The loss function ensures:
- If $y = 1$: Push $\hat{y} \to 1$
- If $y = 0$: Push $\hat{y} \to 0$
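A quick numeric check of both cases makes this concrete. The probe values below are arbitrary; the point is that the loss shrinks as $\hat{y}$ moves toward the true label:

```python
import math

def log_loss(y_hat, y):
    """Binary cross-entropy for one example (assumes 0 < y_hat < 1)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Case y = 1: loss decreases as y_hat approaches 1
for y_hat in (0.1, 0.5, 0.9, 0.99):
    print(f"y=1, y_hat={y_hat}: loss={log_loss(y_hat, 1):.3f}")

# Case y = 0: loss increases as y_hat drifts away from 0
for y_hat in (0.01, 0.1, 0.5, 0.9):
    print(f"y=0, y_hat={y_hat}: loss={log_loss(y_hat, 0):.3f}")
```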
Cost Function (Entire Training Set)
The loss function measures performance on a single example. The cost function $J(w, b)$ measures performance on the entire training set.
Definition:
\[J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})\]
Expanded form:
\[J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[y^{(i)} \log(\hat{y}^{(i)}) + (1-y^{(i)}) \log(1-\hat{y}^{(i)})\right]\]
Where:
- $m$ = number of training examples
- $\hat{y}^{(i)}$ = prediction for example $i$ using parameters $w$ and $b$
- $y^{(i)}$ = true label for example $i$
Key Terminology
Loss Function $\mathcal{L}(\hat{y}, y)$:
- Measures error on a single training example
- Compares one prediction to one true label
Cost Function $J(w, b)$:
- Measures average error across the entire training set
- This is what we minimize during training
Training Objective
Goal: Find parameters $w$ and $b$ that minimize the cost function $J(w, b)$.
This will be accomplished using gradient descent, which we’ll explore in the next lesson.
Connection to Neural Networks
Logistic regression can be viewed as a very simple neural network with:
- No hidden layers
- Single output unit
- Sigmoid activation
This makes it an excellent foundation for understanding more complex neural networks.