Friday, October 21, 2011

Why is the error function minimized in logistic regression convex?

We want to prove that the error/objective function of logistic regression,

$\inline \dpi{120} J\left(\theta\right) = \sum\limits_{i = 1}^m {{y^i}\left[ { - \log \left( {{h_\theta }\left( x^i \right)} \right)} \right] + \left( {1 - {y^i}} \right)\left[ { - \log \left( {1 - {h_\theta }\left( x^i \right)} \right)} \right]}$

where $\inline \dpi{120} h_\theta \left( x \right) = \frac{1}{1+e^{-\theta^Tx}}$ is the logistic (sigmoid) hypothesis and $y^i \in \{0, 1\}$, is convex.
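Before the proof, here is a minimal NumPy sketch of this objective, just to make the notation concrete (the names sigmoid, logistic_loss, X and y are my own, chosen for illustration, not from any particular library):

    import numpy as np

    def sigmoid(z):
        # hypothesis h_theta(x) = 1 / (1 + exp(-theta^T x))
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_loss(theta, X, y):
        # J(theta) = sum_i [ y^i * (-log h_theta(x^i)) + (1 - y^i) * (-log(1 - h_theta(x^i))) ]
        # X has one example x^i per row, y holds the labels in {0, 1}
        h = sigmoid(X @ theta)
        return np.sum(-y * np.log(h) - (1.0 - y) * np.log(1.0 - h))

(A production implementation would compute the logs in a numerically stable way, e.g. via log1p; this sketch only mirrors the formula above.)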

Proof:

Before beginning the proof, I would first like you to recall a few standard facts about convex functions:

First-order condition: a differentiable function $f$ is convex if and only if $\inline \dpi{120} f\left( y \right) \geq f\left( x \right) + \nabla f\left( x \right)^T \left( y - x \right)$ for all $x, y$ in its domain. Intuitively, this condition says that the tangent/first-order Taylor-series approximation of $f$ at any point is globally an under-estimator of $f$.

Second-order condition: a twice-differentiable function $f$ is convex if and only if its Hessian $\inline \dpi{120} \nabla^2 f\left( x \right)$ is positive semi-definite everywhere, i.e. $\inline \dpi{120} z^T \nabla^2 f\left( x \right) z \geq 0$ for all $z$.

Also, any nonnegative linear combination of convex functions is convex.
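As a quick numerical illustration of the first-order condition (this snippet is my own and not part of the original argument), take the convex function $f\left( z \right) = \log \left( 1+e^{-z} \right)$ that will appear below and check that its tangent line never overshoots the function:

    import numpy as np

    f = lambda z: np.log1p(np.exp(-z))        # f(z) = log(1 + e^{-z})
    df = lambda z: -1.0 / (1.0 + np.exp(z))   # f'(z) = -e^{-z} / (1 + e^{-z})

    rng = np.random.default_rng(0)
    for _ in range(1000):
        x0, y0 = rng.normal(size=2) * 3.0
        tangent = f(x0) + df(x0) * (y0 - x0)  # first-order Taylor approximation around x0
        assert f(y0) >= tangent - 1e-12       # the tangent is a global under-estimator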

Now notice that if we can prove that the two functions

$\dpi{120} { - \log \left( {{h_\theta }\left( x^i \right)} \right)} \;\; and \;\; { - \log \left( {1 - {h_\theta }\left( x^i \right)} \right)}$

are convex, then our objective function

$\inline \dpi{120} J\left(\theta\right) = \sum\limits_{i = 1}^m {{y^i}\left[ { - \log \left( {{h_\theta }\left( x^i \right)} \right)} \right] + \left( {1 - {y^i}} \right)\left[ { - \log \left( {1 - {h_\theta }\left( x^i \right)} \right)} \right]}$

must also be convex, since any nonnegative linear combination of convex functions is convex; here the coefficients $y^i$ and $1 - y^i$ take values in $\{0, 1\}$ and are therefore nonnegative.
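For completeness, here is the standard one-line argument for that last fact (not from the original post): if $f$ and $g$ are convex and $a, b \geq 0$, then for any $u, v$ and any $\lambda \in \left[ 0, 1 \right]$,

$\dpi{120} \begin{array}{lcl} \left( af+bg \right)\left( \lambda u + \left( 1-\lambda \right)v \right) & = & a\,f\left( \lambda u + \left( 1-\lambda \right)v \right) + b\,g\left( \lambda u + \left( 1-\lambda \right)v \right) \\ & \leq & a\left[ \lambda f\left( u \right) + \left( 1-\lambda \right)f\left( v \right) \right] + b\left[ \lambda g\left( u \right) + \left( 1-\lambda \right)g\left( v \right) \right] \\ & = & \lambda \left( af+bg \right)\left( u \right) + \left( 1-\lambda \right)\left( af+bg \right)\left( v \right) \end{array}$

so $af + bg$ is convex; the same argument extends to any finite nonnegative combination, which is exactly the form of $J\left(\theta\right)$.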

Let us now try to prove that

$\inline \dpi{120} { - \log \left( {{h_\theta }\left( x \right)} \right)} = { - \log \left( {\frac{1}{1+e^{-\theta^Tx}}} \right)} = { \log \left( {1+e^{-\theta^Tx}} \right)}$

is a convex function of $\theta$. In order to do this, we will use the second-order condition of convexity described above. Let us first compute the Hessian matrix:

Gradient:

$\dpi{120} \begin{array}{lcl} \nabla _\theta \; \left[ { - \log \left( {{h_\theta }\left( x \right)} \right)} \right] & = & \nabla _\theta \; \left[ { \log \left( {1+e^{-\theta^Tx}} \right)} \right] \\ & = & \left( {\frac{-e^{-\theta^Tx}}{1 + e^{-\theta^Tx}} }\right)\; x \\ & = & \left( {\frac{1}{1 + e^{-\theta^Tx}} - 1}\right)\; x \\ & = & \left( {h_\theta\left( x \right) - 1}\right)x \end{array}$

Hessian (using $\inline \dpi{120} \nabla _\theta \; h_\theta\left( x \right) = h_\theta\left( x \right)\left( 1 - h_\theta\left( x \right) \right)x$, which follows from the chain rule and the derivative of the sigmoid):

$\dpi{120} \begin{array}{lcl} \nabla _\theta^2 \; \left[ { - \log \left( {{h_\theta }\left( x \right)} \right)} \right] & = & \nabla _\theta \left( \nabla _\theta \; \left[ { - \log \left( {{h_\theta }\left( x \right)} \right)} \right] \right) \\ & = & \nabla _\theta \left( \left( {h_\theta\left( x \right) - 1}\right)\; x \right) \\ & = & x \left( \nabla _\theta \; h_\theta\left( x \right) \right)^T \\ & = & h_\theta\left( x \right) \left( 1 - h_\theta\left( x \right) \right) xx^T \end{array}$
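As a sanity check (my own illustrative code, not part of the proof), the analytic gradient $\left( h_\theta\left( x \right) - 1 \right)x$ can be compared against a finite-difference approximation:

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    def neg_log_h(theta, x):
        # -log(h_theta(x)) = log(1 + exp(-theta^T x))
        return np.log1p(np.exp(-theta @ x))

    rng = np.random.default_rng(1)
    theta, x = rng.normal(size=4), rng.normal(size=4)
    h = sigmoid(theta @ x)

    grad_analytic = (h - 1.0) * x                    # (h_theta(x) - 1) x
    hess_analytic = h * (1.0 - h) * np.outer(x, x)   # h_theta(x) (1 - h_theta(x)) x x^T

    eps = 1e-6
    grad_fd = np.array([(neg_log_h(theta + eps * e, x) - neg_log_h(theta - eps * e, x)) / (2 * eps)
                        for e in np.eye(4)])
    assert np.allclose(grad_fd, grad_analytic, atol=1e-6)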

Now, below is the proof that this Hessian matrix is positive semi-definite: for any vector $z$,

$\dpi{120} \begin{array}{lcl} z^T \; \nabla_\theta^2 \left[ { -\log \left( {{h_\theta }\left( x \right)} \right)} \right] \; z & = & z^T \left[ h_\theta\left( x \right) \left( 1 - h_\theta\left( x \right) \right) xx^T \right] z \\ & = & h_\theta\left( x \right) \left( 1 - h_\theta\left( x \right) \right) \left( x^Tz \right)^2 \; \geq \; 0 \;\;\;\; (2) \end{array}$

where the final inequality holds because $0 < h_\theta\left( x \right) < 1$, so $h_\theta\left( x \right)\left( 1 - h_\theta\left( x \right) \right) > 0$, and $\left( x^Tz \right)^2 \geq 0$.
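The same conclusion can be checked numerically (again, illustrative code with names of my own choosing): the quadratic form in Eq. (2) is non-negative for random $z$, and all eigenvalues of the Hessian are non-negative:

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(2)
    theta, x = rng.normal(size=5), rng.normal(size=5)
    h = sigmoid(theta @ x)
    H = h * (1.0 - h) * np.outer(x, x)      # Hessian of -log(h_theta(x))

    for _ in range(1000):
        z = rng.normal(size=5)
        assert z @ H @ z >= -1e-12          # quadratic form is non-negative, as in Eq. (2)

    # eigenvalues: h(1 - h) * ||x||^2 for the rank-one direction, zero elsewhere (up to round-off)
    print(np.linalg.eigvalsh(H))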

Let us now try to prove that

$\dpi{120} \begin{array}{lcl} {-\log \left( {1 - {h_\theta }\left( x \right)} \right)} & = & { - \log \left( {1 - \frac{1}{1+e^{-\theta^Tx}}} \right)} \\ & = & { - \log \left( {\frac{e^{-\theta^Tx}}{1+e^{-\theta^Tx}}} \right)} \\ & = & { \theta^Tx + \log \left( {1+e^{-\theta^Tx}} \right)} \end{array}$

is a convex function of $\theta$. In order to do this, we will again use the second-order condition of convexity described above. Let us first compute its Hessian matrix:

Gradient:

$\dpi{120} \begin{array}{lcl} \nabla _\theta \; \left[ { - \log \left( {1 - {h_\theta }\left( x \right)} \right)} \right] & = &\nabla _\theta \; \left[ { \theta^Tx + \log \left( {1+e^{-\theta^Tx}} \right)} \right] \\ & = & x + \nabla _\theta \; \left[ { \log \left( {1+e^{-\theta^Tx}} \right)} \right] \end{array}$

Hessian:

$\dpi{120} \begin{array}{lcl} \nabla _\theta^2 \; \left[ { - \log \left( {1 - {h_\theta }\left( x \right)} \right)} \right] & = & \nabla _\theta \left( \nabla _\theta \; \left[ { - \log \left( {1 - {h_\theta }\left( x \right)} \right)} \right] \right) \\ & = & \nabla_\theta \left( x + \nabla _\theta \; \left[ { \log \left( {1+e^{-\theta^Tx}} \right)} \right] \right) \\ & = & \nabla _\theta^2 \; \left[ { - \log \left( {{h_\theta }\left( x \right)} \right)} \right] \end{array}$

which we have already proved to be positive semi-definite in Eq. (2) above.
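As with the first term, the gradient above can be verified by finite differences; plugging in the earlier result, it simplifies to $x + \left( h_\theta\left( x \right) - 1 \right)x = h_\theta\left( x \right)x$ (illustrative code only):

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    def neg_log_one_minus_h(theta, x):
        # -log(1 - h_theta(x)) = theta^T x + log(1 + exp(-theta^T x))
        return theta @ x + np.log1p(np.exp(-theta @ x))

    rng = np.random.default_rng(3)
    theta, x = rng.normal(size=4), rng.normal(size=4)
    h = sigmoid(theta @ x)

    grad_analytic = h * x                    # x + (h_theta(x) - 1) x = h_theta(x) x
    eps = 1e-6
    grad_fd = np.array([(neg_log_one_minus_h(theta + eps * e, x)
                         - neg_log_one_minus_h(theta - eps * e, x)) / (2 * eps)
                        for e in np.eye(4)])
    assert np.allclose(grad_fd, grad_analytic, atol=1e-6)
    # differentiating h_theta(x) * x once more gives h(1 - h) x x^T, the same Hessian as before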

Above, we have proved that both

$\dpi{120} { - \log \left( {{h_\theta }\left( x^i \right)} \right)} \;\; and \;\; { - \log \left( {1 - {h_\theta }\left( x^i \right)} \right)}$

are convex functions. And, the error/objective function of logistic regression

$\inline \dpi{120} J\left(\theta\right) = \sum\limits_{i = 1}^m {{y^i}\left[ { - \log \left( {{h_\theta }\left( x^i \right)} \right)} \right] + \left( {1 - {y^i}} \right)\left[ { - \log \left( {1 - {h_\theta }\left( x^i \right)} \right)} \right]}$

is essentially a nonnegative linear combination of such convex functions (the coefficients $y^i$ and $1 - y^i$ are either 0 or 1). Since a nonnegative linear combination of convex functions is convex, we conclude that the objective function of logistic regression is convex.

Hence proved …
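As an end-to-end numerical sanity check (my own sketch, with illustrative names), the Hessian of the full objective, $\nabla_\theta^2 J\left( \theta \right) = \sum_i h_\theta\left( x^i \right)\left( 1 - h_\theta\left( x^i \right) \right) x^i \left( x^i \right)^T$ (the labels $y^i$ drop out because both per-example terms share the same Hessian), has only non-negative eigenvalues on random data:

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(4)
    m, n = 50, 3
    X = rng.normal(size=(m, n))               # one example x^i per row
    theta = rng.normal(size=n)

    h = sigmoid(X @ theta)                    # h_theta(x^i) for each example
    H = (X * (h * (1.0 - h))[:, None]).T @ X  # sum_i h_i (1 - h_i) x^i (x^i)^T

    print(np.linalg.eigvalsh(H))              # all eigenvalues >= 0 (up to round-off)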

Following the same line of argument, it can easily be shown that the objective function of logistic regression remains convex even when regularization is used.
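For instance, with an L2 penalty $\frac{\lambda}{2}\left\| \theta \right\|^2$ the Hessian simply gains a $\lambda I$ term, so it stays positive semi-definite (in fact positive definite for $\lambda > 0$). A quick check, again with illustrative code of my own:

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(5)
    X, theta, lam = rng.normal(size=(50, 3)), rng.normal(size=3), 0.1

    h = sigmoid(X @ theta)
    H = (X * (h * (1.0 - h))[:, None]).T @ X   # Hessian of the unregularized objective
    H_reg = H + lam * np.eye(3)                # the L2 penalty (lam / 2) * ||theta||^2 adds lam * I

    print(np.linalg.eigvalsh(H_reg))           # all eigenvalues >= lam > 0: strictly convex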

1. Just to correct a little mistake: only a positive linear combination of convex functions is guaranteed to be convex again. However, in the logistic regression case y_i are positive, so it works indeed.


2. I think for equation (2) you lost a z in the left part.

3. Awesome post, thank you.

Agree with the above that (2) is missing a z on the L.H.S.

4. I think in Eq. (2) you should write grad_{\theta} and not w.r.t. x.
This is of course just a typo.
Nice post btw.!

