# The principle of maximum entropy

## Introduction: what is entropy?

\[ \newcommand{\EE}{\mathbb{E}} \newcommand{\RR}{\mathbb{R}} \newcommand{\mbf}{\mathbf} \DeclareMathOperator{\argmax}{argmax} \]

Intuitively, the notion of entropy defines a measure of “disorder” or “expected surprise” given a probability distribution. As described by Shannon in is 1948 article, the entropy can be defined as follows Shannon, Claude Elwood. “A mathematical theory of communication.” *Bell system technical journal* 27.3 (1948): 379-423. .

Let \(X\) be a discrete random variable on a space \(\mathcal{X}\) with probability mass function \(\mathbf{p}(x)\). We define the discrete (Shannon) entropy as follows, named after Boltzmann’s H-theorem: \[\begin{aligned} H(X) = \mathbb{E}\left[ \log \frac{1}{\mathbf{p}(x)} \right] = \sum_{x \in \mathcal{X}} \mathbf{p}(x) \log \frac{1}{\mathbf{p}(x)} \end{aligned}\]

The discrete entropy satisfies the following basic properties:

- \(H(X) \geq 0\).
- \(H(X) \leq \log |\mathcal{X}|\), with equality if and only if \(X\) is distributed uniformly over \(\mathcal{X}\).

Part (a) follows from the fact that \(\log \frac{1}{\mathbf{p}(x)} \geq 0\) for any \(x \in \mathcal{X}\). Hence the expectation is nonnegative and \(H(X) \geq 0\).

To show part (b), we will apply Jensen’s inequality. First, note that \(f(t) = \log (t)\) is concave. Applying Jensen’s inequality, we obtain: \[\begin{aligned} \mathbb{E}\left[ \log \frac{1}{\mathbf{p}(x)} \right] & \leq \log \mathbb{E}\left[ \frac{1}{\mathbf{p}(x)} \right] \\ &= \log \sum_{x \in \mathcal{X}} \mathbf{p}(x) \cdot \frac{1}{\mathbf{p}(x)} \\ &= \log |\mathcal{X}|. \end{aligned}\]

A similar object to study is the so-called *relative entropy*, \(D(\mathbf{p}|| \mathbf{q})\), which serves as a measure of distance between two probability distributions. Importantly, \(D(\mathbf{p}|| \mathbf{q}) \neq D(\mathbf{q}|| \mathbf{p})\) in general, so the relative entropy is not a metric.

Let \(\mathbf{p}\) and \(\mathbf{q}\) be probability distributions on a space \(\mathcal{X}\). Then the relative entropy \(D(\mathbf{p}|| \mathbf{q})\) is defined as \[\begin{aligned} D(\mathbf{p}|| \mathbf{q}) = \mathbb{E}_{x \sim \mathbf{p}(x)} \left[ \log \frac{\mathbf{p}(x)}{\mathbf{q}(x)} \right] = \sum_{x \in \mathcal{X}} \mathbf{p}(x) \log \frac{\mathbf{p}(x)}{\mathbf{q}(x)}. \end{aligned}\] This quantity is also known as the Kullback-Leibler divergence and has numerous applications in statistics and physics. .

We will use the following basic fact in subsequent proofs:

The relative entropy is nonnegative. That is, \(D(\mathbf{p}|| \mathbf{q}) \geq 0\) with equality if and only if \(\mathbf{p}(x) = \mathbf{q}(x)\) for all \(x \in \mathcal{X}\).

Applying Jensen’s inequality to the concave function \(f(t) = \log(t)\), we obtain: \[\begin{aligned} D(\mathbf{p}|| \mathbf{q}) &= \mathbb{E}_{x \sim \mathbf{p}(x)} \left[ \log \frac{\mathbf{p}(x)}{\mathbf{q}(x)} \right] \\ &= - \mathbb{E}_{x \sim \mathbf{p}(x)} \left[ \log \frac{\mathbf{q}(x)}{\mathbf{p}(x)} \right] \\ &\geq - \log \mathbb{E}_{x \sim \mathbf{p}(x)} \left[ \frac{\mathbf{q}(x)}{\mathbf{p}(x)} \right] \\ &= - \log \left( \sum_{x \in \mathcal{X}} \mathbf{q}(x) \right) \\ &= \log 1 \\ &= 0. \end{aligned}\]

Since \(f(t)\) is strictly concave, equality can be achieved in Jensen’s inequality only when \(\frac{\mathbf{q}(x)}{\mathbf{p}(x)}\) is constant. Since \(\mathbf{p}\) and \(\mathbf{q}\) are probability distributions, they must in fact be equal.

We can define a version of the entropy for continuous random variables \(X\).

\[\begin{aligned} h(x) = \mathbb{E}\left[ \log \frac{1}{\mathbf{p}(X)} \right] = \int_{-\infty}^{\infty} \mathbf{p}(x) \log \frac{1}{\mathbf{p}(x)}.\end{aligned}\]

While these objects look similar, in fact the discrete and continuous definitions of entropy are rather different. For instance, while \(H(X) \geq 0\), in fact the differential entropy does not satisfy this property.

The differential entropy of a Gaussian random variable \(X \sim \mathcal{N}(0, \sigma^2)\) is \(\log \sqrt{2 \pi e} \sigma\).

Recall that the density function \(p(x)\) is given by \(p(x) = \frac{1}{\sqrt{2 \pi } \sigma} \exp \left( - \frac{x^2}{2 \sigma^2} \right)\). Therefore \[\begin{aligned} h(X) &= - \int_{-\infty}^{\infty} p(x) \log p(x) \, dx \\ &= \int_{-\infty}^{\infty} p(x) \log \sqrt{2 \pi} \sigma \, dx + \int_{- \infty}^{\infty} p(x) \frac{x^2}{2 \sigma^2} \, dx \\ &= \log \sqrt{2 \pi} \sigma + \frac{\sigma^2}{2 \sigma^2} \\ &= \log \sqrt{2 \pi e} \sigma. \end{aligned}\]

## Prior probability distributions

Suppose we have some prior knowledge about a phenomenon we would like to observe. Given this knowledge, what is the best prior probability distribution?

The *principle of maximum entropy* states that we ought to choose the prior probability distribution that maximizes the entropy conditioned on our constraints. In particular, this prior will be “maximally disordered” given the constraints; in a technical sense this prior makes the fewest assumptions.

There is significant debate on whether the principle of maximum entropy is the best choice for prior selection in statistics. We do not concern ourselves here with these difficult issues, but we refer the interested reader to Cover, Thomas M., and Joy A. Thomas. *Elements of information theory.* John Wiley & Sons, 2012. Jaynes, Edwin T. “On the rationale of maximum-entropy methods.” Proceedings of the IEEE 70.9 (1982): 939-952. MacKay, David JC, and David JC Mac Kay. *Information theory, inference and learning algorithms.* Cambridge university press, 2003. .

## Examples of maximum entropy

We now present three simple examples of the maximum entropy principle.

Suppose \(\mathbf{p}\) is a discrete probability distribution on a finite set \(\mathcal{X}\). Then \[\begin{aligned} H(\mathbf{p}) \leq \log |\mathcal{X}|, \end{aligned}\] with equality if and only if \(\mathbf{p}\) is uniform on \(\mathcal{X}\).

This was proven in Section 1. Intuitively, this states that with no constraints on a discrete probability distribution, the maximum entropy prior is uniform on the event space \(\mathcal{X}\).

Let \(\mathbf{p}\) be a continuous probability distribution on \(\mathbb{R}\) with variance \(\sigma^2\). Then \[\begin{aligned}
h(\mathbf{p}) \leq \log \sqrt{2 \pi e} \sigma.
\end{aligned}\] Equality holds if and only if \(p\) is Gaussian with variance \(\sigma^2\).

We have already seen that the differential entropy of a Gaussian random variable \(G \sim \mathcal{N}(0, \sigma^2)\) is \(h(G) = \log \sqrt{2 \pi e} \sigma\).

Let \(\mathbf{p}\) be a probability density on \(\RR\) with variance \(\sigma^2\) and mean \(\mu\) (which exists by the definition of variance). Let \(\mathbf{q}\) be Gaussian with mean \(\mu\) and variance \(\sigma^2\).

Now, \[\begin{aligned} 0 & \leq D (\mathbf{q}|| \mathbf{p}) = \int_{- \infty}^{\infty} \mathbf{q}(x) \log \frac{\mathbf{q}(x)}{\mathbf{p}(x)} \, dx \\ & = - h(\mathbf{q}) - \int_{- \infty}^{\infty} \mathbf{q}(x) \log \mathbf{p}(x) \, dx \\ & = - h(\mathbf{q}) - \int_{-\infty}^{\infty} \mathbf{p}(x) \log \mathbf{p}(x) \, dx \\ &= -h(\mathbf{q}) + h(\mathbf{p}). \end{aligned}\]

The second to last inequality is the key step. This follows since \(\mathbf{q}\) and \(\mathbf{p}\) yield the same second moments for the terms encoded by \(\log \mathbf{p}(x)\).

Let \(\mathbf{p}\) be a continuous probability density function on \((0, \infty)\) with mean \(\mu\). Then \[\begin{aligned} h(p) \leq 1 + \log \mu, \end{aligned}\] with equality if and only if \(\mathbf{p}\) is exponential with mean \(\mu\). That is, \(p(x) = \frac{1}{\mu} \exp (- \frac{x}{\mu})\).

This theorem has a natural interpretation in physics. Let \(X\) be a random variable describing the height of molecules in the atmosphere. The average potential energy of the molecules is fixed (mean \(\lambda\)), and the atmosphere tends to the distribution that has the maximum entropy.

We will prove this result using Lagrange multipliers. Let \(\mathbf{p}\) be a probability distribution on \((0, \infty)\) with mean \(\mu\). Define \[\begin{aligned} F(\mathbf{p}, \lambda_1, \lambda_2) &= - \int_{0}^{\infty} \mathbf{p}(x) \log \mathbf{p}(x) \, dx + \lambda_1 \left( \int_{0}^{\infty} \mathbf{p}(x) \, dx - 1 \right) + \lambda_2 \left( \int_{0}^{\infty} x \mathbf{p}(x) \, dx - \mu \right) \\ &= \int_{0}^{infty} \mathcal{L}(x, \mathbf{p}(x), \lambda_1, \lambda_2) \, dx - \lambda_1 - \lambda_2 \mu, \end{aligned}\] where \(\mathcal{L}(x, \mathbf{p}, \lambda_1, \lambda_2) = - \mathbf{p}\log \mathbf{p}+ \lambda_1 \mathbf{p}+ \lambda_2 x \mathbf{p}\). Taking partials, \[\begin{aligned} \frac{\partial \mathcal{L}}{\partial \mathbf{p}} = - 1 - \log p + \lambda_1 + \lambda_2 x, \end{aligned}\] and at a maximum entropy distribution we have \(\frac{\partial \mathcal{L}}{\partial \mathbf{p}} = 0\), so that \[\begin{aligned} \mathbf{p}(x) = \exp \left( \lambda_1 - 1 + \lambda_2 x \right), \end{aligned}\] for \(x \geq 0\).

Now, since \(\int_{0}^{\infty} \mathbf{p}(x) \, dx\) is finite, we must have \(\lambda_2 < 0\). This implies \[\begin{aligned} \int_{0}^{\infty} \mathbf{p}(x) &= 1 \\ &= e^{\lambda_1 - 1} \int_{0}^{\infty} e^{\lambda_2 x} \, dx \\ &= \frac{e^{\lambda_1 - 1}}{|\lambda_2|} \\ \end{aligned}\] so that \(e^{\lambda_1 - 1} = |\lambda_2|\).

Now, since \(\int_{0}^{\infty} x e^{\lambda_2 x} \, dx = \frac{1}{\lambda_2^2}\), the condition \(\int_{0}^{\infty} x \, \mathbf{p}(x) \, dx = \mu\), so that \(\lambda_2 = - \frac{1}{\mu}\).

Putting this together, we conclude \(\mathbf{p}(x) = \frac{1}{\mu} e^{-x / \mu}\), which is indeed the exponential distribution.

## Generalizations

Is it possible to obtain a maximum entropy solution under more general constraints?

*Problem statement.* Maximize the entropy \(h(f)\) over all probability densities \(f\) satisfying the moment constraints below, where \(S\) is the support set. \[\begin{aligned}
\int_{S} f(x) r_i(x) dx = \alpha_i; \text{ for } 1 \leq i \leq m. \qquad *\end{aligned}\]

In particular, \(f\) is a density on support set \(S\) meeting moment constraints \(\alpha_1, \alpha_2, \dots, \alpha_m\).

Let \(f^{*}(x) = f_{\lambda}(x) = \exp (\lambda_0 + \sum_{i=1}^{m} \lambda_i r_i(x))\), \(x \in S\), where \(\lambda_0, \dots, \lambda_m\) are chosen so that \(f^{*}\) satisfies the constraints in (*). Then \(f^{*}\) uniquely maximizes \(h(f)\) over all probability densities \(f\) that satisfy (*).

(Sketch).

We first sketch the argument for why the \(\lambda_i\) can be chosen. The constant \(\lambda_0\) and the \(n\) Lagrange multipliers \(\lambda_1, \dots, \lambda_n\) solve the constrained optimization problem below: \[\begin{aligned} \max_{\lambda_i} \left\{ \sum_{i=0}^{n} \lambda_i \alpha_i - \int_{S} \exp \left( \sum_{i=0}^{n} \lambda_i f_i(x) \right) \right\}. \end{aligned}\]

Under the Karush-Kuhn-Tucker (KKT) conditions, we can show that the optimization problem above has a unique solution, since the objective function is concave in the \(\lambda_i\). The full argument is out of the scope of this article, but we refer the interested reader to [@boyd2004convex].

Now, let \(g\) satisfy the constraints in (*). Then \[\begin{aligned} h(g) &= - \int_{S} g \ln g \\ &= - \int_{S} g \ln \frac{g}{f^{*}} f^{*} \\ &= - \text{D} (g || f^{*}) - \int_{S} g \ln f^{*} \\ &\leq - \int_{S} g \ln f^{*} \\ &= - \int_{S} g \left( \lambda_0 + \sum_{i} \lambda_i r_i \right) \\ &= - \int_{S} f^{*} \left( \lambda_0 + \sum_{i} \lambda_i r_i \right) \\ &= - \int_{S} f^{*} \ln f^{*} \\ &= h(f^{*}) \end{aligned}\]

Note that equality holds in (a) if and only if \(g(x) = f^{*}(x)\) for all \(x\), which demonstrates uniqueness.

Suppose that \(n\) dice are rolled and the total number of spots is \(n \alpha\). What proportion of the dice are showing face \(i\) where \(i \in \left\{ 1, 2, \dots, 6 \right\}\)?

We will count the number of ways that \(n\) dice can fall so that \(n_i\) dice show face \(i\); this is just the multinomial coefficient \(\binom{n}{n_1, n_2, \dots, n_6}\).

To find the most probable state, we will maximize the multinomial coefficient \(\binom{n}{n_1, n_2, \dots, n_6}\) under the constraint \(\sum_{i=1}^{6} i n_i = n \alpha\).

A form of Stirling’s approximation states \(n! \approx \left ( \frac{n}{e} \right )^{n}\). In particular, this implies

\[\begin{aligned} \binom{n}{n_1, n_2, \dots, n_6} & \approx \frac{\left ( \frac{n}{e} \right )^n}{\prod_{i=1}^{6} \left( \frac{n_i}{e} \right)^n} \\ &= \prod_{i=1}^{6} \left( \frac{n}{n_i} \right)^{n_i} \\ & = \exp \left( n H (\frac{n_1}{n}, \frac{n_2}{n}, \cdots, \frac{n_6}{n}) \right).\end{aligned}\]

This shows that maximizing \(\binom{n}{n_1, n_2, \dots, n_6}\) under the given constraints is nearly equivalent to maximizing \(H(p_1, \dots, p_6)\) under the constraint \(\sum_{i} i p_i = \alpha\). Applying the theorem, the maximum entropy distribution is

\[\begin{aligned} p_i^{*} = \frac{e^{\lambda_i}}{\sum_{i=1}^{6} e^{\lambda_i}},\end{aligned}\] where \(\lambda\) is chosen so that \(\sum_{i} i p_i^{*} = \alpha\). Returning to the original question, the most probable state is \((np_1^{*}, np_2^{*}, \dots, n p_6^{*})\) and we expect \(n_i^{*} = n p_i^{*}\) dice showing face \(i\).

## The Wallis experiment

This experiment is due to Graham Wallis who mentioned it to E.T. Jaynes in 1962 Jaynes, Edwin T. *Probability theory: The logic of science.* Cambridge university press, 2003. .

Suppose Alice is a researcher trying to understand some data. She is studying a phenomenon and wants to estimate a prior probability distribution among \(m\) possible outcomes. She has some prior information about what the phenomenon looks like.

To estimate the distribution, she runs the following experiment:

- Randomly distribute \(N\) quanta of probability, each worth \(\frac{1}{N}\), among the \(m\) possibilities.
- Check if the probability assignment is consistent with her prior information. If inconsistent: reject and try again.
- If the assignment agrees with her prior information, her estimated prior distribution \(\mathbf{p}\) is given by \[\begin{aligned} \mathbf{p}_i = \frac{n_i}{N}; \qquad i \in \left\{ 1, 2, \dots, m \right\} \end{aligned}\] where \(\mathbf{p}_i\) is the probability of the \(i\)-th outcome, and \(n_i\) is the number of quanta that were assigned to the \(i\)-th proposition.

As we’ll see, this experiment has deep ties to the principle of maximum entropy. We now ask: what is the most probable prior distribution Alice will arrive at?

The probability of any particular probability distribution \(\mathbf{p}\) is given by a multinomial coefficient. \[\begin{aligned} \Pr(\mathbf{p}) = \frac{N!}{n_1! n_2 ! \cdots n_m!} m^{-N},\end{aligned}\]

To find the most likely distribution \(\mathbf{p}\), it suffices to maximize the term \(A = \frac{N!}{n_1! n_2! \cdots n_m!}\), or a monotonically increasing function of \(A\), e.g. \(\frac{1}{N} \log A\). \[\begin{aligned} \argmax_{\mathbf{p}} A &= \argmax_{\mathbf{p}} \frac{1}{N} \log A \\ &= \argmax_{\mathbf{p}} \frac{1}{N} \log \frac{N!}{n_1! n_2! \cdots n_m!} \\ &= \argmax_{\mathbf{p}} \frac{1}{N} \log \frac{N!}{(N \mathbf{p}_1)! (N \mathbf{p}_2)! \cdots (N \mathbf{p}_m)!} \\ &= \argmax_{\mathbf{p}} \frac{1}{N} \left( \log N! - \sum_{i=1}^{m} \log (N \mathbf{p}_i)! \right)\end{aligned}\]

What is the limit of this quantity as the number of trials \(N \to \infty\)? Applying Stirling’s approximation: \[\begin{aligned} \argmax_{\mathbf{p}} \lim_{N \to \infty} \left ( \frac{1}{N} \log A \right ) &= \argmax_{\mathbf{p}} \frac{1}{N} \left( N \log N - \sum_{i=1}^{m} N p_i \log (N p_i) \right) \\ &= \argmax_{\mathbf{p}} \left ( \log N - \sum_{i=1}^{m} {p_i} \log (N p_i) \right ) \\ & = \argmax_{\mathbf{p}} \left ( - \sum_{i=1}^{m} p_i \log p_i \right ) \\ & = \argmax_{\mathbf{p}} H(\mathbf{p}).\end{aligned}\]

In conclusion, Alice’s experiment will most likely converge to the maximum entropy distribution as \(N \to \infty\).