The expectation-maximization (EM) algorithm is a general technique for finding maximum likelihood (or MAP) estimates for models with latent variables. Let $x$ be the observed data, $z$ the hidden variables, and $\theta$ the parameters; the goal is to maximize the log likelihood function:

$$\log p(x \mid \theta) = \log \int p(x, z \mid \theta) \, dz$$

A first observation is that directly optimizing the log likelihood can be quite difficult or even intractable (note that the log is applied to the integral). Working instead with the so-called complete-data log likelihood, i.e., $\log p(x, z \mid \theta)$, seems much easier, since the log acts directly on the joint. However, we can’t just optimize this function since we don’t observe the hidden variables; our knowledge of the hidden variables is summarized by their posterior, so an intuitive solution would be to optimize the expected value of the complete-data log likelihood under the posterior of the latent variables. This pretty much summarizes EM, but there is a broader framework that leads to this rationale, so let’s investigate further. It turns out that, for any choice of distribution $q(z)$ over the latent variables, the following log likelihood decomposition holds:

$$\log p(x \mid \theta) = \mathcal{L}(q, \theta) + \mathsf{KL}\left(q \parallel p(\cdot \mid x, \theta)\right)$$

where

$$\mathcal{L}(q, \theta) = \int q(z) \log \frac{p(x, z \mid \theta)}{q(z)} \, dz \qquad \text{and} \qquad \mathsf{KL}\left(q \parallel p(\cdot \mid x, \theta)\right) = -\int q(z) \log \frac{p(z \mid x, \theta)}{q(z)} \, dz$$

In the above decomposition, we expressed the log likelihood in terms of a functional $\mathcal{L}(q, \theta)$ (strictly speaking, it’s a functional w.r.t. $q$ and a function w.r.t. $\theta$) and a relative entropy term $\mathsf{KL}(q \parallel p(\cdot \mid x, \theta))$. The former lower bounds the log likelihood since the latter is non-negative; EM maximizes the log likelihood indirectly by optimizing this lower bound. Rearranging terms, we have:

$$\mathcal{L}(q, \theta) = \log p(x \mid \theta) - \mathsf{KL}\left(q \parallel p(\cdot \mid x, \theta)\right) \le \log p(x \mid \theta)$$

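Before moving on, a quick numerical sanity check of this decomposition can be reassuring. The following is a minimal NumPy sketch (the two-component toy mixture, the observation, and the choice of $q$ are made-up values for illustration) verifying that $\mathcal{L}(q, \theta) + \mathsf{KL}(q \parallel p(\cdot \mid x, \theta))$ indeed equals $\log p(x \mid \theta)$ for an arbitrary $q(z)$:

```python
import numpy as np

def log_gauss(x, mu, sigma):
    # log N(x | mu, sigma^2) for a scalar observation x
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2

# toy two-component 1D Gaussian mixture and a single observation (illustrative values)
pi = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
sigma = np.array([0.5, 1.5])
x = 0.8

log_joint = np.log(pi) + log_gauss(x, mu, sigma)     # log p(x, z=k | theta)
log_lik = np.logaddexp(log_joint[0], log_joint[1])   # log p(x | theta)
posterior = np.exp(log_joint - log_lik)              # p(z | x, theta)

q = np.array([0.6, 0.4])                             # an arbitrary q(z)

elbo = np.sum(q * (log_joint - np.log(q)))           # L(q, theta)
kl = np.sum(q * (np.log(q) - np.log(posterior)))     # KL(q || p(z | x, theta))

print(log_lik, elbo + kl)                            # the two values coincide
```

Setting $q$ equal to the posterior makes the KL term vanish, which is exactly what the E step below exploits.
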
The optimization of the lower bound $\mathcal{L}(q, \theta)$ is carried out in two alternating steps:

  • Say the current value of the parameter vector is $\theta^{old}$; maximize $\mathcal{L}(q, \theta^{old})$ w.r.t. $q(z)$ while holding $\theta^{old}$ fixed
  • Holding $q(z)$ fixed, maximize $\mathcal{L}(q, \theta)$ w.r.t. $\theta$ to get a new parameter vector $\theta^{new}$

In the first step, $\log p(x \mid \theta^{old})$ does not depend on $q(z)$; therefore, the maximum of the lower bound is achieved when the KL divergence is zero, i.e., when $q(z) = p(z \mid x, \theta^{old})$. Substituting this choice of $q$, the bound becomes:

$$\mathcal{L}(q, \theta) = \int p(z \mid x, \theta^{old}) \log p(x, z \mid \theta) \, dz - \int p(z \mid x, \theta^{old}) \log p(z \mid x, \theta^{old}) \, dz$$

In the second step the above bound is maximized w.r.t. $\theta$. Since the entropy term (the second integral) is constant w.r.t. $\theta$, we just need to maximize the expectation of the complete-data log likelihood under the “old” posterior of the latent variables:

$$\theta^{new} = \underset{\theta}{\mathsf{arg \ max}} \ Q(\theta, \theta^{old}), \qquad Q(\theta, \theta^{old}) = \int p(z \mid x, \theta^{old}) \log p(x, z \mid \theta) \, dz$$

The EM algorithm can be summarized as follows:

  • Initialize the parameters $\theta^{old}$
  • E step: evaluate $p(z \mid x, \theta^{old})$
  • M step: evaluate $\theta^{new} = \underset{\theta}{\mathsf{arg \ max}} \ Q(\theta, \theta^{old}) $
  • Check for convergence of either the log likelihood or the parameters; if the convergence criterion is not satisfied, set $\theta^{old} = \theta^{new}$ and repeat the last two steps (a generic sketch of this loop follows below)

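To make the recipe concrete, here is a generic sketch of this loop in Python; `e_step`, `m_step`, and `log_likelihood` are hypothetical placeholders that each specific model has to supply (the Gaussian mixture versions are sketched further below):

```python
def em(x, theta_init, e_step, m_step, log_likelihood, tol=1e-6, max_iter=100):
    """Generic EM loop; the model-specific callables are hypothetical placeholders."""
    theta_old = theta_init
    ll_old = log_likelihood(x, theta_old)
    for _ in range(max_iter):
        posterior = e_step(x, theta_old)      # E step: evaluate p(z | x, theta_old)
        theta_new = m_step(x, posterior)      # M step: maximize Q(theta, theta_old)
        ll_new = log_likelihood(x, theta_new)
        if abs(ll_new - ll_old) < tol:        # convergence check on the log likelihood
            return theta_new, ll_new
        theta_old, ll_old = theta_new, ll_new
    return theta_old, ll_old
```

Monitoring the change in the parameters instead of the log likelihood is an equally common convergence criterion.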

Note that the EM algorithm never decreases the log likelihood: every iteration either improves it or leaves it unchanged. It’s easy to see this by choosing $q(z) = p(z \mid x, \theta^{old})$:

$$\log p(x \mid \theta^{new}) \ \ge \ \mathcal{L}(q, \theta^{new}) \ \ge \ \mathcal{L}(q, \theta^{old}) \ = \ \log p(x \mid \theta^{old})$$

The first inequality holds because the KL term is non-negative, the second because the M step maximizes the bound w.r.t. $\theta$, and the final equality holds because the E step drives the KL term to zero.

Alright, enough with the theory! Let’s apply EM to a mixture of multivariate Gaussian distributions: $x_{1:N}$ are the $D$-dimensional data points, $z_{1:N}$ the per-point cluster assignments, $\mu_{1:K}$ the cluster means, $\Sigma_{1:K}$ the cluster covariances, and $\pi_{1:K}$ the cluster prevalences (i.e., the mixing proportions).

In the E step we have:

$$\gamma_{nk} \equiv p(z_n = k \mid x_n, \theta^{old}) = \frac{\pi_k^{old} \, \mathcal{N}(x_n \mid \mu_k^{old}, \Sigma_k^{old})}{\sum_{j=1}^{K} \pi_j^{old} \, \mathcal{N}(x_n \mid \mu_j^{old}, \Sigma_j^{old})}$$

where

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} \, |\Sigma|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right)$$

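A minimal NumPy/SciPy sketch of this E step might look as follows; the `e_step` function and its array shapes are illustrative choices rather than a fixed API:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(x, pi, mu, sigma):
    """Responsibilities gamma[n, k] = p(z_n = k | x_n, theta_old).
    x: (N, D) data, pi: (K,) prevalences, mu: (K, D) means, sigma: (K, D, D) covariances."""
    K = len(pi)
    # unnormalized log responsibilities: log pi_k + log N(x_n | mu_k, Sigma_k)
    log_r = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(x, mu[k], sigma[k]) for k in range(K)],
        axis=1,
    )
    # normalize over k, subtracting the per-row max first for numerical stability
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)
```
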
In the M step we need to maximize the expected value of the complete-data log likelihood under the “old” posterior of the per-point cluster assignments:

$$Q(\theta, \theta^{old}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \left[ \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]$$

Maximizing w.r.t. the mean of a cluster $\mu_h$ we get:

$$\mu_h^{new} = \frac{\sum_{n=1}^{N} \gamma_{nh} \, x_n}{\sum_{n=1}^{N} \gamma_{nh}}$$

Maximizing w.r.t. the covariance of a cluster $\Sigma_h$ we get:

$$\Sigma_h^{new} = \frac{\sum_{n=1}^{N} \gamma_{nh} \, (x_n - \mu_h^{new})(x_n - \mu_h^{new})^{\top}}{\sum_{n=1}^{N} \gamma_{nh}}$$

Finally, we need to maximize w.r.t. the prevalence of a cluster $\pi_h$. This optimization is subject to the “sum to one” constraint, which we’ll enforce with a Lagrange multiplier:

$$Q(\theta, \theta^{old}) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)$$

Taking the derivative w.r.t. $\pi_h$ and setting it to zero gives us:

$$\sum_{n=1}^{N} \frac{\gamma_{nh}}{\pi_h} + \lambda = 0$$

Taking the derivative w.r.t. $\lambda$ recovers the constraint:

$$\sum_{k=1}^{K} \pi_k = 1$$

Solving the above system of equations (multiply the first equation by $\pi_h$, sum over $h$, and use the constraint to find $\lambda = -N$) we get:

$$\pi_h^{new} = \frac{\sum_{n=1}^{N} \gamma_{nh}}{N}$$

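Putting the three updates together, a minimal NumPy sketch of the M step could look like this (the `m_step` function and its array shapes are again illustrative choices):

```python
import numpy as np

def m_step(x, gamma):
    """Closed-form M-step updates for a Gaussian mixture.
    x: (N, D) data, gamma: (N, K) responsibilities from the E step."""
    N, D = x.shape
    nk = gamma.sum(axis=0)                      # effective number of points per cluster
    pi = nk / N                                 # cluster prevalences
    mu = (gamma.T @ x) / nk[:, None]            # weighted means, shape (K, D)
    sigma = np.empty((len(nk), D, D))
    for k in range(len(nk)):
        diff = x - mu[k]                        # (N, D) deviations from the new mean
        sigma[k] = (gamma[:, k, None] * diff).T @ diff / nk[k]   # weighted covariance
    return pi, mu, sigma
```
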
There you have it! Initialize the parameters and iterate between the E and M steps until convergence.
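For completeness, here is a small end-to-end demo that wires the `e_step` and `m_step` sketches above into the EM loop; the synthetic data, the initialization, and the tolerance are arbitrary choices for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(x, pi, mu, sigma):
    # log p(x | theta) = sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k)
    dens = np.stack([pi[k] * multivariate_normal.pdf(x, mu[k], sigma[k])
                     for k in range(len(pi))], axis=1)
    return np.log(dens.sum(axis=1)).sum()

rng = np.random.default_rng(0)
# synthetic 2D data drawn from two well-separated clusters
x = np.vstack([rng.normal([-2.0, -2.0], 0.6, size=(150, 2)),
               rng.normal([2.0, 2.0], 0.8, size=(150, 2))])

K, D = 2, x.shape[1]
pi = np.full(K, 1.0 / K)                        # uniform initial prevalences
mu = x[rng.choice(len(x), K, replace=False)]    # random data points as initial means
sigma = np.stack([np.eye(D)] * K)               # identity initial covariances

ll_old = -np.inf
for it in range(200):
    gamma = e_step(x, pi, mu, sigma)            # E step (sketched earlier)
    pi, mu, sigma = m_step(x, gamma)            # M step (sketched earlier)
    ll = log_likelihood(x, pi, mu, sigma)
    if ll - ll_old < 1e-6:                      # stop once the improvement is negligible
        break
    ll_old = ll

print(f"converged after {it + 1} iterations, log likelihood = {ll:.2f}")
print("estimated means:\n", mu)
```

Printing the log likelihood at every iteration is also a handy way to observe the monotonic improvement discussed above.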