Mixtures can be found in a large number of models, e.g., clustering and classification models, models of annotation, models with hierarchical structures, topic models, and many others. In many of these cases the number of mixture components is unknown: choosing too few clusters means potentially distinctive groups end up merged together, while choosing too many introduces noise (overly fine-grained clusters get created that should otherwise be merged together). This motivates the need for a nonparametric mixture where the number of components can adapt to and grow with the data.

In this blog post I discuss the use of a stick-breaking process in the context of an unbounded mixture. The weights of the mixture can be viewed as the parameters of an infinite categorical distribution. This distribution is formed by successively breaking a stick of length one and measuring the broken parts. Using $\nu$ as the breaking proportions and $\pi$ as the lengths of the broken parts, we have:

$$
\pi_i(\nu) = \nu_i \prod_{j=1}^{i-1} (1 - \nu_j)
$$

More formally, the breaking proportions are assumed to be drawn from a $\mathsf{Beta}(1,\alpha)$ distribution. The $\alpha$ hyperparameter affects the growth of the number of components with the data: small values favor a smaller number of clusters, while bigger values favor a bigger number. Focusing, for simplicity, on a mixture of unigrams that we apply, let’s say, to a collection of short documents, we have the following generative process:

- For every cluster $i = 1$ to $\infty$:
  - Draw breaking proportion $\nu_i \sim \mathsf{Beta}(1,\alpha)$
  - Draw a unigram distribution (i.e., a distribution over the vocabulary space) $\beta_i \sim \mathsf{Dirichlet}(\eta)$

- For every document $d = 1$ to $M$:
  - Draw a cluster $z_d \sim \mathsf{Categorical} (\pi(\nu))$
  - For every word position $n = 1$ to $N_d$:
    - Draw a word $w_{d,n} \sim \mathsf{Categorical} ( \beta_{z_d} )$
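
To get a feel for the generative process, here is a minimal simulation sketch in plain Python. It rests on several assumptions that are mine rather than part of the model: the infinite stick is cut off at a finite `T` (the last stick takes the remaining mass) purely so that we can sample, $\eta$ is treated as a symmetric scalar, every document has the same length `N_d`, and the function names are hypothetical.

```python
import random


def stick_breaking_weights(alpha, T, rng):
    """Break a unit stick T times; the last piece takes whatever is left."""
    weights, remaining = [], 1.0
    for i in range(T):
        # nu_i ~ Beta(1, alpha), except for the final stick, which is forced to 1
        nu = 1.0 if i == T - 1 else rng.betavariate(1.0, alpha)
        weights.append(nu * remaining)   # pi_i = nu_i * prod_{j<i} (1 - nu_j)
        remaining *= 1.0 - nu
    return weights


def sample_dirichlet(eta, V, rng):
    """Symmetric Dirichlet(eta) draw via normalised Gamma variates."""
    draws = [rng.gammavariate(eta, 1.0) for _ in range(V)]
    total = sum(draws)
    return [g / total for g in draws]


def generate_corpus(M, N_d, V, alpha=1.0, eta=0.5, T=20, seed=0):
    """Sample M documents of N_d word ids each from the truncated process."""
    rng = random.Random(seed)
    pi = stick_breaking_weights(alpha, T, rng)
    beta = [sample_dirichlet(eta, V, rng) for _ in range(T)]
    z, docs = [], []
    for _ in range(M):
        z_d = rng.choices(range(T), weights=pi)[0]                   # cluster
        docs.append(rng.choices(range(V), weights=beta[z_d], k=N_d))  # words
        z.append(z_d)
    return pi, z, docs


pi, z, docs = generate_corpus(M=5, N_d=8, V=6)
print(len(set(z)), "distinct clusters used by", len(docs), "documents")
```

Rerunning this with a larger `alpha` should, on average, spread the documents over more clusters, which is the behavior described above.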

Estimating the parameters of the above mixture is straightforward with variational inference because the stick-breaking process nicely complies with conjugacy. The infinite mixture is approximated with truncated variational distributions, i.e., we fix a truncation level $T$ and let the $T$-th stick take, under the variational distribution, the remainder mass of the unit stick ($q(\nu_T = 1) = 1$); thus subsequent sticks will have no mass under the variational distribution.

I showed in a previous post that when the complete conditionals are in the exponential family, the variational distributions take the same form and have the natural parameters equal to the expected value (under the variational distribution) of the natural parameter of the corresponding complete conditional. We will use this elegant result in the derivations below.

Let’s compute first the complete conditional for a unigram distribution:
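
Written out for cluster $i$, the derivation would go along these lines. The notation is my assumption here: $v$ indexes the vocabulary and $\mathbb{1}[\cdot]$ is an indicator function.

$$
\begin{aligned}
p(\beta_i \mid w, z, \eta)
&\propto p(\beta_i \mid \eta) \prod_{d=1}^{M} \prod_{n=1}^{N_d} p(w_{d,n} \mid \beta_i)^{\mathbb{1}[z_d = i]} \\
&\propto \prod_{v} \beta_{i,v}^{\eta_v - 1} \prod_{d=1}^{M} \prod_{n=1}^{N_d} \prod_{v} \beta_{i,v}^{\mathbb{1}[z_d = i]\,\mathbb{1}[w_{d,n} = v]} \\
&= \prod_{v} \beta_{i,v}^{\eta_v + \sum_{d=1}^{M} \mathbb{1}[z_d = i] \sum_{n=1}^{N_d} \mathbb{1}[w_{d,n} = v] - 1}
\end{aligned}
$$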

The complete conditional we just derived is Dirichlet distributed. The corresponding variational distribution will thus have the same form, i.e., $q(\beta_{i} \mid \lambda_{i}) = \mathsf{Dirichlet}(\lambda_{i})$, with the natural parameter equal to the expected value (under the variational distribution) of the natural parameter of the conditional:
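
Concretely, for each vocabulary entry $v$ the update should read as follows (a sketch, with $\mathbb{1}[\cdot]$ denoting an indicator; the $-1$ that separates the Dirichlet parameter from its natural parameter appears on both sides and cancels):

$$
\lambda_{i,v} = \eta_v + \sum_{d=1}^{M} \mathbb{E}_q\big[\mathbb{1}[z_d = i]\big] \sum_{n=1}^{N_d} \mathbb{1}[w_{d,n} = v]
$$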

Applying the same set of steps for the per document cluster assignments, we have:
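
A sketch of this step, using only the generative process above: the assignment of document $d$ depends on the mixture weights and on the words of that document.

$$
\begin{aligned}
p(z_d = i \mid w_d, \nu, \beta)
&\propto p(z_d = i \mid \pi(\nu)) \prod_{n=1}^{N_d} p(w_{d,n} \mid \beta_i) \\
&= \pi_i(\nu) \prod_{n=1}^{N_d} \beta_{i, w_{d,n}} \\
&= \exp\Big\{ \log \pi_i(\nu) + \sum_{n=1}^{N_d} \log \beta_{i, w_{d,n}} \Big\}
\end{aligned}
$$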

The complete conditional for the per document cluster assignments is a Categorical distribution, leading to $q(z_{d} \mid \phi_{d}) = \mathsf{Categorical}(\phi_{d})$, where:
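
Taking the expectation of the log-probabilities under the variational distribution, the update should look like this (a sketch; the proportionality is resolved by normalizing over $i = 1, \dots, T$, with $T$ the truncation level):

$$
\phi_{d,i} \propto \exp\Big\{ \mathbb{E}_q[\log \pi_i(\nu)] + \sum_{n=1}^{N_d} \mathbb{E}_q[\log \beta_{i, w_{d,n}}] \Big\}
$$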

Finally, let’s compute the complete conditional for the breaking proportions:
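
As a starting point, and as a sketch based on the generative process above, $\nu_i$ interacts with its Beta prior and with every cluster assignment:

$$
p(\nu_i \mid z, \alpha) \propto p(\nu_i \mid \alpha) \prod_{d=1}^{M} p(z_d \mid \pi(\nu))
$$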

The second term needs a bit more work, so let’s focus on that for a moment:
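
One way to work it out is to expand the mixture weights into breaking proportions and collect the exponents of each $(1 - \nu_j)$ factor. This is a sketch, with $\mathbb{1}[\cdot]$ denoting an indicator:

$$
\begin{aligned}
p(z_d \mid \pi(\nu))
&= \prod_{j=1}^{\infty} \pi_j(\nu)^{\mathbb{1}[z_d = j]} \\
&= \prod_{j=1}^{\infty} \Big( \nu_j \prod_{k=1}^{j-1} (1 - \nu_k) \Big)^{\mathbb{1}[z_d = j]} \\
&= \prod_{j=1}^{\infty} \nu_j^{\mathbb{1}[z_d = j]} (1 - \nu_j)^{\mathbb{1}[z_d > j]}
\end{aligned}
$$

The last step holds because $(1 - \nu_j)$ shows up in the weight of every cluster after $j$, so its exponents sum to $\mathbb{1}[z_d > j]$.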

Going back to the derivation of the complete conditional for the breaking proportions, we have:
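
Keeping only the factors that involve $\nu_i$, and writing the $\mathsf{Beta}(1, \alpha)$ prior density as proportional to $(1 - \nu_i)^{\alpha - 1}$, the result should be (a sketch, with $\mathbb{1}[\cdot]$ an indicator):

$$
\begin{aligned}
p(\nu_i \mid z, \alpha)
&\propto (1 - \nu_i)^{\alpha - 1} \prod_{d=1}^{M} \nu_i^{\mathbb{1}[z_d = i]} (1 - \nu_i)^{\mathbb{1}[z_d > i]} \\
&= \nu_i^{\left(1 + \sum_{d=1}^{M} \mathbb{1}[z_d = i]\right) - 1} (1 - \nu_i)^{\left(\alpha + \sum_{d=1}^{M} \mathbb{1}[z_d > i]\right) - 1}
\end{aligned}
$$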

We can see the conditional is a Beta distribution. We’ll thus have $q(\nu_i \mid \gamma_i) = \mathsf{Beta}(\gamma_{i,1}, \gamma_{i,2})$, where:
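
In terms of expected indicators (a sketch; $\mathbb{1}[\cdot]$ is an indicator and the expectations are under the variational distribution), the two parameters should be:

$$
\begin{aligned}
\gamma_{i,1} &= 1 + \sum_{d=1}^{M} \mathbb{E}_q\big[\mathbb{1}[z_d = i]\big] \\
\gamma_{i,2} &= \alpha + \sum_{d=1}^{M} \mathbb{E}_q\big[\mathbb{1}[z_d > i]\big]
\end{aligned}
$$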

Knowing the functional form of the variational distributions, computing the expectations present in the update formulas derived above is straightforward:
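
For reference, these are the expectations as I would write them. This is a sketch: $\psi$ denotes the digamma function, $v$ indexes the vocabulary, and $\mathbb{1}[\cdot]$ is an indicator.

$$
\begin{aligned}
\mathbb{E}_q\big[\mathbb{1}[z_d = i]\big] &= \phi_{d,i} \\
\mathbb{E}_q\big[\mathbb{1}[z_d > i]\big] &= \sum_{j=i+1}^{T} \phi_{d,j} \\
\mathbb{E}_q[\log \nu_i] &= \psi(\gamma_{i,1}) - \psi(\gamma_{i,1} + \gamma_{i,2}) \\
\mathbb{E}_q[\log (1 - \nu_i)] &= \psi(\gamma_{i,2}) - \psi(\gamma_{i,1} + \gamma_{i,2}) \\
\mathbb{E}_q[\log \pi_i(\nu)] &= \mathbb{E}_q[\log \nu_i] + \sum_{j=1}^{i-1} \mathbb{E}_q[\log (1 - \nu_j)] \\
\mathbb{E}_q[\log \beta_{i,v}] &= \psi(\lambda_{i,v}) - \psi\Big(\sum_{v'} \lambda_{i,v'}\Big)
\end{aligned}
$$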

For the computation of the second expectation I used the fact that the variational distribution has no mass above the truncation level $T$. For the expectations that follow it I used a well-known result from exponential families, i.e., the first derivative of the log normalizer is equal to the expected value of the sufficient statistics. Usually you can find these expectations in relevant textbooks (or Wikipedia).

The inference algorithm involves iterating between the derived update formulas until the variational objective function (the ELBO) plateaus:
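
As a rough illustration of the loop, here is a plain-Python sketch of coordinate ascent for the truncated mixture of unigrams. Treat it as one possible arrangement under stated assumptions, not a reference implementation: $\eta$ is a symmetric scalar, documents are lists of integer word ids in `range(V)`, the loop runs a fixed number of sweeps (`n_iter`) instead of monitoring the ELBO, and `fit_mixture` and the hand-rolled `digamma` are hypothetical helpers (in practice you would likely reach for a numerical library).

```python
import math
import random
from collections import Counter


def digamma(x):
    """Digamma for x > 0 via upward recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:  # psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def fit_mixture(docs, V, T=5, alpha=1.0, eta=0.1, n_iter=50, seed=0):
    """Run n_iter >= 1 sweeps of the updates; returns (phi, lam, gamma)."""
    rng = random.Random(seed)
    counts = [Counter(doc) for doc in docs]
    # Random initialisation of q(z_d); every row of phi sums to one.
    phi = []
    for _ in counts:
        row = [rng.random() + 1e-3 for _ in range(T)]
        total = sum(row)
        phi.append([r / total for r in row])

    for _ in range(n_iter):
        # lambda_{i,v} = eta + sum_d phi_{d,i} * (count of word v in doc d)
        lam = [[eta] * V for _ in range(T)]
        for phi_d, c in zip(phi, counts):
            for v, n in c.items():
                for i in range(T):
                    lam[i][v] += phi_d[i] * n

        # gamma_{i,1} = 1 + sum_d phi_{d,i}
        # gamma_{i,2} = alpha + sum_d sum_{j>i} phi_{d,j}   (sticks 1..T-1 only)
        mass = [sum(phi_d[i] for phi_d in phi) for i in range(T)]
        gamma = [(1.0 + mass[i], alpha + sum(mass[i + 1:])) for i in range(T - 1)]

        # E[log beta_{i,v}] = psi(lambda_{i,v}) - psi(sum_v lambda_{i,v})
        e_log_beta = []
        for row in lam:
            psi_total = digamma(sum(row))
            e_log_beta.append([digamma(l) - psi_total for l in row])

        # E[log pi_i] = E[log nu_i] + sum_{j<i} E[log(1 - nu_j)]
        e_log_pi, acc = [], 0.0
        for g1, g2 in gamma:
            psi_total = digamma(g1 + g2)
            e_log_pi.append(digamma(g1) - psi_total + acc)
            acc += digamma(g2) - psi_total
        e_log_pi.append(acc)  # last stick: q(nu_T = 1) = 1, so E[log nu_T] = 0

        # phi_{d,i} proportional to exp(E[log pi_i] + sum_n E[log beta_{i,w_dn}])
        for d, c in enumerate(counts):
            logits = [e_log_pi[i] + sum(n * e_log_beta[i][v] for v, n in c.items())
                      for i in range(T)]
            top = max(logits)  # subtract the max for numerical stability
            weights = [math.exp(x - top) for x in logits]
            total = sum(weights)
            phi[d] = [w / total for w in weights]
    return phi, lam, gamma


docs = [[0, 1, 0, 1, 0, 0], [1, 0, 0, 1, 1, 0], [2, 3, 3, 2, 2, 3], [3, 2, 3, 3, 2, 2]]
phi, lam, gamma = fit_mixture(docs, V=4, T=3)
for d, row in enumerate(phi):
    print(d, [round(p, 3) for p in row])
```

Note that the returned `lam` and `gamma` are computed from the assignments before the final `phi` update of the last sweep. Replacing the fixed sweep count with a check on the ELBO is the natural next step.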

I’ll leave the expansion of the ELBO as an exercise. It should be straightforward given the work up to this point.