23/08/2021

Divergence Between Probability Distributions

It is often desirable to quantify the difference between probability distributions for a given random variable.

This occurs frequently in machine learning, where we may be interested in calculating the difference between an actual and an observed probability distribution.

This can be achieved using techniques from information theory, such as the Kullback-Leibler divergence (KL divergence), also known as relative entropy, and the Jensen-Shannon divergence, which provides a normalized and symmetrical version of the KL divergence.

This tutorial is divided into three parts; they are:
  • Statistical Distance
  • Kullback-Leibler Divergence
  • Jensen-Shannon Divergence

1. Statistical Distance

There are many situations where we may want to compare two probability distributions.

Specifically, we may have a single random variable and two different probability distributions for the variable, such as a true distribution and an approximation of that distribution. 

In situations like this, it can be useful to quantify the difference between the distributions. 

Generally, this is referred to as the problem of calculating the statistical distance between two statistical objects, e.g. probability distributions.

One approach is to calculate a distance measure between the two distributions. 

This can be challenging, as it can be difficult to interpret such a measure. Instead, it is more common to calculate a divergence between two probability distributions.

A divergence is like a measure but is not symmetrical. This means that a divergence is a scoring of how one distribution differs from another, where calculating the divergence for distributions P and Q would give a different score from Q and P.

Divergence scores are an important foundation for many different calculations in information theory and more generally in machine learning. For example, they provide shortcuts for calculating scores such as mutual information (information gain) and cross-entropy used as a loss function for classification models.
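For example, the cross-entropy between P and Q can be decomposed into the entropy of P plus the KL divergence of P from Q, i.e. H(P, Q) = H(P) + KL(P||Q). Below is a minimal sketch of this identity, using the same example distributions introduced later in this tutorial:

# sketch: cross-entropy H(P, Q) decomposes as entropy H(P) plus KL(P || Q)
from math import log2
# example distributions (the same ones used later in this tutorial)
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
# entropy of P
h_p = -sum(pi * log2(pi) for pi in p)
# KL divergence of P from Q
kl_pq = sum(pi * log2(pi / qi) for pi, qi in zip(p, q))
# cross-entropy of Q relative to P
h_pq = -sum(pi * log2(qi) for pi, qi in zip(p, q))
# the two quantities agree
print('H(P) + KL(P || Q): %.3f bits' % (h_p + kl_pq))
print('H(P, Q):           %.3f bits' % h_pq)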

Divergence scores are also used directly as tools for understanding complex modeling problems, such as approximating a target probability distribution when optimizing generative adversarial network (GAN) models.


2. Kullback-Leibler Divergence

The Kullback-Leibler Divergence score, or KL divergence score, quantifies how much one probability distribution differs from another probability distribution. 

The KL divergence between two distributions P and Q is often stated using the following notation: KL(P||Q). It is calculated as the sum, over all events x, of the probability of x under P multiplied by the log of the ratio of the two probabilities:

                            KL(P||Q) = Σx P(x) × log(P(x) / Q(x))

The intuition for the KL divergence score is that when the probability for an event from P is large, but the probability for the same event in Q is small, there is a large divergence. When the probability from P is small and the probability from Q is large, there is also a large divergence, but not as large as the first case.

The log can be base-2 to give units in bits, or the natural logarithm base-e to give units in nats. When the score is 0, the two distributions are identical; otherwise, the score is positive.
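To make the intuition above concrete, we can print the contribution of each event to KL(P||Q) for the example distributions used throughout this tutorial; a minimal sketch (the per-event terms sum to the 1.927 bits calculated later):

# sketch: per-event contributions to KL(P || Q)
from math import log2
# the example distributions used throughout this tutorial
events = ['red', 'green', 'blue']
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
# each term is p * log2(p/q): negative where Q exceeds P, large where P exceeds Q
for event, pi, qi in zip(events, p, q):
    print('%s: %.3f bits' % (event, pi * log2(pi / qi)))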

Importantly, the KL divergence score is not symmetrical, for example:
                            KL(P||Q) ≠ KL(Q||P)

It is named for the two authors of the method, Solomon Kullback and Richard Leibler, and is sometimes referred to as relative entropy.

The KL divergence is the average number of extra bits needed to encode the data when distribution Q is used to encode it instead of the true distribution P.


# plot of distributions
from matplotlib import pyplot
# define distributions
events = ['red', 'green', 'blue']
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
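# confirm each distribution sums to 1.0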
print('P=%.3f Q=%.3f' % (sum(p), sum(q)))
# plot first distribution
pyplot.subplot(2,1,1)
pyplot.bar(events, p)
# plot second distribution
pyplot.subplot(2,1,2)
pyplot.bar(events, q)
# show the plot
pyplot.show()


-----Result-----


Histogram of Two Different Probability Distributions for the Same Random Variable



# example of calculating the kl divergence between two probability mass functions
from math import log2
# calculate the kl divergence
def kl_divergence(p, q):
    return sum(p[i] * log2(p[i]/q[i]) for i in range(len(p)))
# define distributions
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
# calculate (P || Q)
kl_pq = kl_divergence(p, q)
print('KL(P || Q): %.3f bits' % kl_pq)
# calculate (Q || P)
kl_qp = kl_divergence(q, p)
print('KL(Q || P): %.3f bits' % kl_qp)


-----Result-----

KL(P || Q): 1.927 bits
KL(Q || P): 2.022 bits

If we change log2() to the natural logarithm log() function, the result is in nats, as follows:

# KL(P || Q): 1.336 nats
# KL(Q || P): 1.401 nats
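
The change is a one-line edit to the kl_divergence() function above; for completeness, here is a variant using the natural logarithm:

# example of calculating the kl divergence in nats
from math import log
# calculate the kl divergence with the natural logarithm
def kl_divergence_nats(p, q):
    return sum(p[i] * log(p[i]/q[i]) for i in range(len(p)))
# define distributions
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
# calculate (P || Q) and (Q || P)
print('KL(P || Q): %.3f nats' % kl_divergence_nats(p, q))
print('KL(Q || P): %.3f nats' % kl_divergence_nats(q, p))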


# example of calculating the kl divergence (relative entropy) with scipy
from scipy.special import rel_entr
# define distributions
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
# calculate (P || Q)
kl_pq = rel_entr(p, q)
print('KL(P || Q): %.3f nats' % sum(kl_pq))
# calculate (Q || P)
kl_qp = rel_entr(q, p)
print('KL(Q || P): %.3f nats' % sum(kl_qp))

-----Result-----

KL(P || Q): 1.336 nats
KL(Q || P): 1.401 nats
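
These values agree with the manual calculation in bits; dividing a score in nats by ln(2) converts it to bits, recovering the 1.927 and 2.022 bits computed earlier:

# sketch: converting the scipy results from nats to bits by dividing by ln(2)
from math import log
from scipy.special import rel_entr
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
print('KL(P || Q): %.3f bits' % (sum(rel_entr(p, q)) / log(2)))
print('KL(Q || P): %.3f bits' % (sum(rel_entr(q, p)) / log(2)))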


3. Jensen-Shannon Divergence

The Jensen-Shannon divergence, or JS divergence for short, is another way to quantify the difference (or similarity) between two probability distributions. 

It uses the KL divergence to calculate a normalized score that is symmetrical. 

This means that the divergence of P from Q is the same as Q from P, or stated formally: 
                            JS(P||Q) ≡ JS(Q||P)

The JS divergence can be calculated as follows:

            JS(P||Q) = 1/2  × KL(P||M) + 1/2  × KL(Q||M)

Where M is calculated as:

                            M = 1/2 × (P + Q)

It is more useful as a measure because it provides a smoothed and normalized version of the KL divergence, with scores between 0 (identical) and 1 (maximally different) when using the base-2 logarithm.
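
For example, two distributions with no overlapping events attain the maximum score of 1 bit. A minimal sketch, using the standard convention that terms with zero probability contribute nothing:

# sketch: JS divergence attains its maximum of 1 bit for non-overlapping distributions
from math import log2
# kl divergence with the convention that 0 * log(0) = 0
def kl(p, q):
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)
# two distributions that share no events
p = [1.0, 0.0]
q = [0.0, 1.0]
m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
print('JS(P || Q): %.3f bits' % (0.5 * kl(p, m) + 0.5 * kl(q, m)))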

The square root of the score gives a quantity referred to as the Jensen-Shannon distance, or JS distance for short.


# example of calculating the js divergence between two mass functions
from math import log2
from math import sqrt
from numpy import asarray
# calculate the kl divergence
def kl_divergence(p, q):
    return sum(p[i] * log2(p[i]/q[i]) for i in range(len(p)))
# calculate the js divergence
def js_divergence(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
# define distributions
p = asarray([0.10, 0.40, 0.50])
q = asarray([0.80, 0.15, 0.05])
# calculate JS(P || Q)
js_pq = js_divergence(p, q)
print('JS(P || Q) divergence: %.3f bits' % js_pq)
print('JS(P || Q) distance: %.3f' % sqrt(js_pq))
# calculate JS(Q || P)
js_qp = js_divergence(q, p)
print('JS(Q || P) divergence: %.3f bits' % js_qp)
print('JS(Q || P) distance: %.3f' % sqrt(js_qp))


-----Result-----

JS(P || Q) divergence: 0.420 bits
JS(P || Q) distance: 0.648
JS(Q || P) divergence: 0.420 bits
JS(Q || P) distance: 0.648



# calculate the jensen-shannon distance metric
from scipy.spatial.distance import jensenshannon
from numpy import asarray
# define distributions
p = asarray([0.10, 0.40, 0.50])
q = asarray([0.80, 0.15, 0.05])
# calculate JS(P || Q)
js_pq = jensenshannon(p, q, base=2)
print('JS(P || Q) Distance: %.3f' % js_pq)
# calculate JS(Q || P)
js_qp = jensenshannon(q, p, base=2)
print('JS(Q || P) Distance: %.3f' % js_qp)


-----Result-----

JS(P || Q) Distance: 0.648
JS(Q || P) Distance: 0.648
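
Note that scipy returns the Jensen-Shannon distance rather than the divergence; squaring it recovers the divergence of 0.420 bits calculated manually above:

# sketch: squaring the jensen-shannon distance recovers the divergence
from scipy.spatial.distance import jensenshannon
from numpy import asarray
p = asarray([0.10, 0.40, 0.50])
q = asarray([0.80, 0.15, 0.05])
d = jensenshannon(p, q, base=2)
print('JS(P || Q) divergence: %.3f bits' % d**2)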


