
14/08/2021

Introduction to Multivariate Statistics

Fundamental statistics are useful tools in applied machine learning for better understanding your data.

They also provide the foundation for more advanced linear algebra operations and machine learning methods, such as the Covariance Matrix and Principal Component Analysis (PCA) respectively.

In this tutorial, you will discover how fundamental statistical operations work and how to implement them using NumPy.

This tutorial is divided into 4 parts; they are:
  • Expected Value and Mean
  • Variance and Standard Deviation
  • Covariance and Correlation
  • Covariance Matrix

A. Expected Value and Mean

In probability, the average value of some random variable X is called the expected value or the expectation.

The expected value uses the notation E with square brackets around the name of the variable; for example, E[X].
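For a discrete random variable with values x1, x2, ..., xn occurring with probabilities p1, p2, ..., pn, it is calculated as the probability-weighted sum of the values:

E[X] = sum(x1 * p1, x2 * p2, x3 * p3, ..., xn * pn)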



In simple cases, such as flipping a coin or rolling a die, the probability of each event is equally likely.
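In that case, the expectation reduces to the simple average of the values:

E[X] = 1/n * sum(x1, x2, x3, ..., xn)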


In statistics, the mean, denoted by the lowercase Greek letter mu (µ), is calculated from a sample of observations rather than from all possible values:
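mu = 1/n * sum(x1, x2, x3, ..., xn)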


# vector mean
from numpy import array
from numpy import mean
# define vector
v = array([1,2,3,4,5,6])
print(v)
# calculate mean
result = mean(v)
print(result)

-----Result-----

[1 2 3 4 5 6]
3.5


# matrix means
from numpy import array
from numpy import mean
# define matrix
M = array([
[1,2,3,4,5,6],
[1,2,3,4,5,6]])
print(M)
# column means
col_mean = mean(M, axis=0)
print(col_mean)
# row means
row_mean = mean(M, axis=1)
print(row_mean)

-----Result-----

[[1 2 3 4 5 6]
[1 2 3 4 5 6]]
[ 1. 2. 3. 4. 5. 6.]
[ 3.5 3.5]


B. Variance and Standard Deviation

In probability, the variance of some random variable X is a measure of 
how much values in the distribution vary on average with respect to the 
mean. The variance is denoted as the function Var() on the variable:
Var[X]
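It is calculated as the expected squared difference of the variable from its mean:

Var[X] = E[(X - E[X])^2]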



If the probability of each example in the distribution is equal, we have:
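Var[X] = 1/n * sum((x1 - E[X])^2, (x2 - E[X])^2, ..., (xn - E[X])^2)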


In statistics, the variance can be estimated from a sample of examples 
drawn from the domain. 

In the abstract, the sample variance is denoted by the lowercase sigma with a 2 superscript indicating the units are squared (e.g. σ²); this does not mean that you must square the final value.

The sum of the squared differences is multiplied by the reciprocal of the 
number of examples minus 1 to correct for a bias. 
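That is:

σ² = 1/(n-1) * sum((x1 - mu)^2, (x2 - mu)^2, ..., (xn - mu)^2)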


# vector variance
from numpy import array
from numpy import var
# define vector
v = array([1,2,3,4,5,6])
print(v)
# calculate variance
result = var(v, ddof=1)
print(result)

-----Result-----

[1 2 3 4 5 6]
3.5


# matrix variances
from numpy import array
from numpy import var
# define matrix
M = array([
[1,2,3,4,5,6],
[1,2,3,4,5,6]])
print(M)
# column variances
col_var = var(M, ddof=1, axis=0)
print(col_var)
# row variances
row_var = var(M, ddof=1, axis=1)
print(row_var)

-----Result-----

[[1 2 3 4 5 6]
[1 2 3 4 5 6]]
[ 0. 0. 0. 0. 0. 0.]
[ 3.5 3.5]



The standard deviation is calculated as the square root of the variance and is denoted by the lowercase letter s:
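s = sqrt(σ²)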

# matrix standard deviation
from numpy import array
from numpy import std
# define matrix
M = array([
[1,2,3,4,5,6],
[1,2,3,4,5,6]])
print(M)
# column standard deviations
col_std = std(M, ddof=1, axis=0)
print(col_std)
# row standard deviations
row_std = std(M, ddof=1, axis=1)
print(row_std)

-----Result-----

[[1 2 3 4 5 6]
[1 2 3 4 5 6]]

[ 0. 0. 0. 0. 0. 0.]
[ 1.87082869 1.87082869]



C. Covariance and Correlation

In probability, covariance is a measure of the joint variability of two random variables. It describes how the two variables change together.

It is denoted as the function cov(X,Y). 
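It is calculated as the expected value of the product of the deviations of the two variables from their means:

cov(X, Y) = E[(X - E[X]) * (Y - E[Y])]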



In statistics, the sample covariance can be calculated in the same way, 
although with a bias correction. 
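The sample covariance is calculated as:

cov(X, Y) = 1/(n-1) * sum((x1 - mean(X)) * (y1 - mean(Y)), ..., (xn - mean(X)) * (yn - mean(Y)))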


The sign of the covariance indicates whether the two variables increase together (positive) or whether one increases as the other decreases (negative). The magnitude of the covariance is not easily interpreted. A covariance value of zero indicates that the two variables are uncorrelated, although uncorrelated variables are not necessarily independent.

# vector covariance
from numpy import array
from numpy import cov
# define first vector
x = array([1,2,3,4,5,6,7,8,9])
print(x)
# define second vector
y = array([9,8,7,6,5,4,3,2,1])
print(y)
# calculate covariance
Sigma = cov(x,y)[0,1]
print(Sigma)

-----Result-----

[1 2 3 4 5 6 7 8 9]
[9 8 7 6 5 4 3 2 1]
-7.5

  
The covariance can be normalized to a score between -1 and 1 by dividing it by the product of the standard deviations of X and Y, which makes the magnitude interpretable. The result is called the correlation of the variables, also known as the Pearson correlation coefficient, named after the developer of the method:
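r = cov(X, Y) / (sX * sY)

where sX and sY are the standard deviations of X and Y respectively.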


# vector correlation
from numpy import array
from numpy import corrcoef
# define first vector
x = array([1,2,3,4,5,6,7,8,9])
print(x)
# define second vector
y = array([9,8,7,6,5,4,3,2,1])
print(y)
# calculate correlation
corr = corrcoef(x,y)[0,1]
print(corr)

-----Result-----

[1 2 3 4 5 6 7 8 9]
[9 8 7 6 5 4 3 2 1]
-1.0
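As a quick check, the same value can be computed directly from the sample covariance and the standard deviations; a minimal sketch, reusing the vectors defined above:

# manual correlation check
from numpy import array
from numpy import cov
from numpy import std
# define vectors
x = array([1,2,3,4,5,6,7,8,9])
y = array([9,8,7,6,5,4,3,2,1])
# normalize the sample covariance by the product of the standard deviations
r = cov(x, y)[0,1] / (std(x, ddof=1) * std(y, ddof=1))
print(r)

This again prints -1.0, up to floating-point rounding.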


D. Covariance Matrix

The covariance matrix is a square, symmetric matrix that describes the covariance between two or more random variables.

The diagonal elements of the covariance matrix are the variances of the individual random variables; as such, it is often called the variance-covariance matrix.
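For random variables X1, X2, ..., Xn, each element of the matrix is defined as:

Sigma[i, j] = cov(Xi, Xj)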


The covariance matrix provides a useful tool for separating the structured
relationships in a matrix of random variables.

It is a key element used in the Principal Component Analysis (PCA) data 
reduction method.

# covariance matrix
from numpy import array
from numpy import cov
# define matrix of observations
X = array([
[1, 5, 8],
[3, 5, 11],
[2, 4, 9],
[3, 6, 10],
[1, 5, 10]])
print(X)
# calculate covariance matrix (cov() treats rows as variables, hence the transpose)
Sigma = cov(X.T)
print(Sigma)

-----Result-----

[[ 1 5 8]
[ 3 5 11]
[ 2 4 9]
[ 3 6 10]
[ 1 5 10]]

[[ 1.  0.25 0.75]
[ 0.25 0.5  0.25]
[ 0.75 0.25 1.3 ]]
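As a brief illustration of the PCA connection mentioned above, the principal components can be recovered from the eigendecomposition of the covariance matrix; a minimal sketch using numpy.linalg.eig, reusing the matrix X from the example:

# principal components via the covariance matrix
from numpy import array
from numpy import cov
from numpy import mean
from numpy.linalg import eig
# define matrix of observations (rows are observations, columns are variables)
X = array([
[1, 5, 8],
[3, 5, 11],
[2, 4, 9],
[3, 6, 10],
[1, 5, 10]])
# calculate covariance matrix (variables as rows, hence the transpose)
Sigma = cov(X.T)
# eigenvectors give the principal directions,
# eigenvalues give the variance explained by each direction
values, vectors = eig(Sigma)
print(values)
# project the centered data onto the principal directions
P = (X - mean(X, axis=0)).dot(vectors)
print(P)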

