Information theory is a subfield of mathematics concerned with transmitting data across a noisy channel.
A cornerstone of information theory is the idea of quantifying how much information there is in a message.
More generally, this idea can be used to quantify the information in an event and in a random variable, where the latter is called entropy, and both are calculated using probability.
Calculating information and entropy is a useful tool in machine learning and is used as the basis for techniques such as feature selection, building decision trees, and, more generally, fitting classification models.
As such, a machine learning practitioner requires a strong understanding and intuition for information and entropy.
- What is Information Theory?
- Calculate the Information for an Event
- Calculate the Information for a Random Variable
1. What is Information Theory?
Information theory is a field of study concerned with quantifying information for communication.
It is a subfield of mathematics and is concerned with topics like data compression and the limits of signal processing.
A foundational concept from information theory is the quantification of the amount of information in things like events, random variables, and distributions.
Quantifying the amount of information requires the use of probabilities, hence the relationship of information theory to probability.
Measurements of information are widely used in artificial intelligence and machine learning, such as in the construction of decision trees and the optimization of classifier models.
As such, there is an important relationship between information theory and machine learning and a practitioner must be familiar with some of the basic concepts from the field.
2. Calculate the Information for an Event
Quantifying information is the foundation of the field of information theory.
The intuition behind quantifying information is the idea of measuring how much surprise there is in an event, that is, how unlikely it is.
Those events that are rare (low probability) are more surprising and therefore have more information than those events that are common (high probability).
- Low Probability Event: High Information (surprising).
- High Probability Event: Low Information (unsurprising).
We can calculate the amount of information there is in an event using the probability of the event.
This is called Shannon information, self-information, or simply the information, and can be calculated for a discrete event x as follows:
I(x) = −log(p(x))
Where I(x) is the information content (self-information) of x, log() is the base-2 logarithm and p(x) is the probability of the event x.
The choice of the base-2 logarithm means that the unit of the information measure is the bit (binary digit).
The calculation of information is often written as h() to contrast it with entropy H(); for example: h(x) = − log(p(x))
The negative sign ensures that the result is always positive or zero. Information will be zero when the probability of an event is 1.0 or a certainty, e.g. there is no surprise.
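As a quick sketch of this behavior (using a few arbitrary probabilities chosen for illustration), the information is zero for a certain event and grows as the event becomes rarer; the equivalent form log2(1/p) is used instead of -log2(p) so that a certainty prints as exactly zero.
# information for events of decreasing probability (illustrative values)
from math import log2
for p in [1.0, 0.5, 0.25, 0.1]:
	# log2(1/p) is equivalent to -log2(p); a certain event carries zero information
	h = log2(1.0 / p)
	print('p(x)=%.3f, information: %.3f bits' % (p, h))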
Consider a flip of a single fair coin. The probability of heads (and tails) is 0.5.
# calculate the information for a coin flip
from math import log2
# probability of the event
p = 0.5
# calculate information for event
h = -log2(p)
# print the result
print('p(x)=%.3f, information: %.3f bits' % (p, h))
-----Result-----
p(x)=0.500, information: 1.000 bits
We can also explore the information in a single roll of a fair six-sided die, e.g. the information in rolling a 6. We know the probability of rolling any number is 1/6, which is a smaller number than 1/2 for a coin flip, therefore we would expect more surprise or a larger amount of information.
# calculate the information for a dice roll
from math import log2
# probability of the event
p = 1.0 / 6.0
# calculate information for event
h = -log2(p)
# print the result
print('p(x)=%.3f, information: %.3f bits' % (p, h))
-----Result-----
p(x)=0.167, information: 2.585 bits
We can further develop the intuition for the relationship between probability and information by plotting the information calculated for a range of probabilities between 0 and 1. The complete example is listed below.
# compare probability vs information
from math import log2
from matplotlib import pyplot
# list of probabilities
probs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
# calculate information
info = [-log2(p) for p in probs]
# plot probability vs information
pyplot.plot(probs, info, marker='.')
pyplot.title('Probability vs Information')
pyplot.xlabel('Probability')
pyplot.ylabel('Information')
pyplot.show()
-----Result-----
Plot of Probability vs Information
Running the example creates the plot, showing the expected relationship: low probability events are more surprising and carry more information, and the information decreases as the probability of the event increases.
3. Calculate the Information for a Random Variable
We can also quantify how much information there is in a random variable. In effect, calculating the information for a random variable is the same as calculating the information for the probability distribution of the events for the random variable.
Calculating the information for a random variable is called information entropy, Shannon entropy, or simply entropy.
The intuition for entropy is that it is the average number of bits required to represent or transmit an event drawn from the probability distribution for the random variable.
Entropy can be calculated for a random variable X with K discrete states as follows:
H(X) = −sum(k=1 to K) p(k) * log(p(k))
That is, the negative of the sum of the probability of each event multiplied by the log of the probability of each event.
Like information, the log() function uses base-2 and the units are bits.
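To connect the formula to the intuition of an average, the short sketch below (using a hypothetical two-event distribution chosen only for illustration) estimates the entropy by sampling many events and averaging their individual information; the estimate should closely match the value given by the formula.
# estimate entropy as the average information of sampled events
from math import log2
from random import choices, seed
seed(1)
# a hypothetical two-event distribution (chosen for illustration)
events = ['a', 'b']
probs = [0.7, 0.3]
# information of each individual event
info = {e: -log2(p) for e, p in zip(events, probs)}
# draw many events and average their information
samples = choices(events, weights=probs, k=100000)
estimate = sum(info[s] for s in samples) / len(samples)
# analytic entropy of the distribution for comparison
analytic = -sum(p * log2(p) for p in probs)
print('estimated: %.3f bits, analytic: %.3f bits' % (estimate, analytic))
With this many samples, the two printed values should agree to within a few thousandths of a bit.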
The lowest entropy is calculated for a random variable that has a single event with a probability of 1.0, a certainty.
The largest entropy for a random variable occurs when all events are equally likely.
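As a quick check of these two bounds (a sketch, with K=4 chosen arbitrarily), a certain event gives 0 bits of entropy and K equally likely events give log2(K) bits:
# check the lowest and largest entropy cases
from math import log2
# helper: p * log2(1/p) equals -p * log2(p); zero-probability events are skipped
def entropy(dist):
	return sum(p * log2(1.0 / p) for p in dist if p > 0.0)
# a certain event has the lowest possible entropy of 0 bits
print('certain event: %.3f bits' % entropy([1.0]))
# K equally likely events have the largest possible entropy of log2(K) bits
K = 4
print('%d equal events: %.3f bits (log2(%d)=%.3f)' % (K, entropy([1.0 / K] * K), K, log2(K)))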
We can consider a roll of a fair die and calculate the entropy for the variable. Each outcome has the same probability of 1/6, therefore it is a uniform probability distribution. We would therefore expect the average information to be the same as the information for a single event calculated in the previous section.
# calculate the entropy for a dice roll
from math import log2
# the number of events
n = 6
# probability of one event
p = 1.0 / n
# calculate entropy
entropy = -sum([p * log2(p) for _ in range(n)])
# print the result
print('entropy: %.3f bits' % entropy)
-----Result-----
entropy: 2.585 bits
Running the example calculates the entropy as more than 2.5 bits, which is the same as the information for a single outcome.
This makes sense: because all outcomes are equally likely, the average information per event is the same as the information of a single outcome, and this is the lower bound on the average number of bits needed to encode events drawn from the distribution.
We can also calculate the entropy directly with the entropy() SciPy function by passing in the probability of each event and setting the base to 2.
# calculate the entropy for a dice roll
from scipy.stats import entropy
# discrete probabilities
p = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]
# calculate entropy
e = entropy(p, base=2)
# print the result
print('entropy: %.3f bits' % e)
-----Result-----
entropy: 2.585 bits
Entropy also tells us how skewed or balanced a probability distribution is. A skewed distribution has less surprise and in turn a lower entropy because likely events dominate, whereas a balanced distribution is more surprising and in turn has a higher entropy because events are more equally likely.
- Skewed Probability Distribution (unsurprising): Low entropy.
- Balanced Probability Distribution (surprising): High entropy.
Running the complete example below creates six two-event probability distributions, from [0,1] through to [0.5,0.5], and calculates the entropy of each. As expected, we can see that as the distribution of events changes from skewed to balanced, the entropy increases from its minimum to its maximum value. That is, if the average event drawn from a probability distribution is not surprising, we get a lower entropy, whereas if it is surprising, we get a larger entropy.
# compare probability distributions vs entropy
from math import log2
from matplotlib import pyplot
# calculate entropy
def entropy(events, ets=1e-15):
	return -sum([p * log2(p + ets) for p in events])
# define probabilities
probs = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
# create probability distribution
dists = [[p, 1.0 - p] for p in probs]
# calculate entropy for each distribution
ents = [entropy(d) for d in dists]
# plot probability distribution vs entropy
pyplot.plot(probs, ents, marker='.')
pyplot.title('Probability Distribution vs Entropy')
pyplot.xticks(probs, [str(d) for d in dists])
pyplot.xlabel('Probability Distribution')
pyplot.ylabel('Entropy (bits)')
pyplot.show()
-----Result-----
Plot of Probability Distribution vs Entropy
Calculating the entropy for a random variable provides the basis for other measures such as mutual information (information gain).
It also provides the basis for calculating the difference between two probability distributions with cross-entropy and the KL-divergence.
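As a brief pointer in that direction, the sketch below (using two small hypothetical distributions P and Q chosen only for illustration) computes the cross-entropy H(P, Q) and the KL divergence KL(P || Q) in bits directly from their definitions.
# sketch: compare two distributions with cross-entropy and KL divergence
from math import log2
# two hypothetical discrete distributions over the same three events
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
# entropy of P
h_p = -sum(pi * log2(pi) for pi in p)
# cross-entropy: H(P, Q) = -sum(p * log2(q))
ce = -sum(pi * log2(qi) for pi, qi in zip(p, q))
# KL divergence: KL(P || Q) = sum(p * log2(p / q)), which equals H(P, Q) - H(P)
kl = sum(pi * log2(pi / qi) for pi, qi in zip(p, q))
print('H(P)=%.3f bits, H(P, Q)=%.3f bits, KL(P || Q)=%.3f bits' % (h_p, ce, kl))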