
14/08/2021

Introduction to Multivariate Statistics

Fundamental statistics are useful tools in applied machine learning for better understanding your data.

They also provide the foundation for more advanced linear algebra operations and machine learning methods, such as the Covariance Matrix and Principal Component Analysis (PCA) respectively.

In this tutorial, you will discover how fundamental statistical operations work and how to implement them using NumPy.

This tutorial is divided into 4 parts; they are:
  • Expected Value and Mean
  • Variance and Standard Deviation
  • Covariance and Correlation
  • Covariance Matrix

A. Expected Value and Mean

In probability, the average value of some random variable X is called the expected value or the expectation.

The expected value uses the notation E with square brackets around the name of the variable; for example, E[X].
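For a discrete random variable with values x1, x2, ..., xn occurring with probabilities p1, p2, ..., pn, it is calculated as the probability-weighted sum of the values:

E[X] = sum(x1 * p1, x2 * p2, x3 * p3, ..., xn * pn)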



In simple cases, such as flipping a coin or rolling a die, the probability of each event is equally likely.
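In that case, the expectation reduces to the simple average of the values:

E[X] = 1/n * sum(x1, x2, x3, ..., xn)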


In statistics, the mean, denoted by the lowercase Greek letter mu (µ), is calculated from a sample of observations rather than from all possible values:
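mu = 1/n * sum(x1, x2, x3, ..., xn)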


# vector mean
from numpy import array
from numpy import mean
# define vector
v = array([1,2,3,4,5,6])
print(v)
# calculate mean
result = mean(v)
print(result)

-----Result-----

[1 2 3 4 5 6]
3.5


# matrix means
from numpy import array
from numpy import mean
# define matrix
M = array([
[1,2,3,4,5,6],
[1,2,3,4,5,6]])
print(M)
# column means
col_mean = mean(M, axis=0)
print(col_mean)
# row means
row_mean = mean(M, axis=1)
print(row_mean)

-----Result-----

[[1 2 3 4 5 6]
[1 2 3 4 5 6]]
[ 1. 2. 3. 4. 5. 6.]
[ 3.5 3.5]


B. Variance and Standard Deviation

In probability, the variance of some random variable X is a measure of 
how much values in the distribution vary on average with respect to the 
mean. The variance is denoted as the function Var() on the variable:
Var[X]
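It is calculated as the expected squared difference of the variable from its mean:

Var[X] = E[(X - E[X])^2]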



If the probability of each example in the distribution is equal, we have:
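Var[X] = 1/n * sum((x1 - E[X])^2, (x2 - E[X])^2, ..., (xn - E[X])^2)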


In statistics, the variance can be estimated from a sample of examples 
drawn from the domain. 

In the abstract, the sample variance is denoted by the lowercase sigma with a 2 superscript indicating the units are squared (e.g. σ²); this does not mean that you must square the final value.

The sum of the squared differences is multiplied by the reciprocal of the 
number of examples minus 1 to correct for a bias. 
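That is:

σ² = 1/(n-1) * sum((x1 - mu)^2, (x2 - mu)^2, ..., (xn - mu)^2)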


# vector variance
from numpy import array
from numpy import var
# define vector
v = array([1,2,3,4,5,6])
print(v)
# calculate variance
result = var(v, ddof=1)
print(result)

-----Result-----

[1 2 3 4 5 6]
3.5


# matrix variances
from numpy import array
from numpy import var
# define matrix
M = array([
[1,2,3,4,5,6],
[1,2,3,4,5,6]])
print(M)
# column variances
col_var = var(M, ddof=1, axis=0)
print(col_var)
# row variances
row_var = var(M, ddof=1, axis=1)
print(row_var)

-----Result-----

[[1 2 3 4 5 6]
[1 2 3 4 5 6]]
[ 0. 0. 0. 0. 0. 0.]
[ 3.5 3.5]



The standard deviation is calculated as the square root of the variance and is denoted by the lowercase letter s:
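s = sqrt(σ²)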

# matrix standard deviation
from numpy import array
from numpy import std
# define matrix
M = array([
[1,2,3,4,5,6],
[1,2,3,4,5,6]])
print(M)
# column standard deviations
col_std = std(M, ddof=1, axis=0)
print(col_std)
# row standard deviations
row_std = std(M, ddof=1, axis=1)
print(row_std)

-----Result-----

[[1 2 3 4 5 6]
[1 2 3 4 5 6]]

[ 0. 0. 0. 0. 0. 0.]
[ 1.87082869 1.87082869]



C. Covariance and Correlation

In probability, covariance is a measure of the joint variability of two random variables. It describes how the two variables change together.

It is denoted as the function cov(X,Y). 
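It is calculated as the expected value of the product of the deviations of the two variables from their means:

cov(X, Y) = E[(X - E[X]) * (Y - E[Y])]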



In statistics, the sample covariance can be calculated in the same way, 
although with a bias correction. 
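The sample covariance is calculated as:

cov(X, Y) = 1/(n-1) * sum((x1 - mean(X)) * (y1 - mean(Y)), ..., (xn - mean(X)) * (yn - mean(Y)))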


The sign of the covariance indicates whether the two variables increase together (positive) or whether one increases as the other decreases (negative). The magnitude of the covariance is not easily interpreted. A covariance value of zero indicates that the two variables are uncorrelated, although uncorrelated variables are not necessarily independent.

# vector covariance
from numpy import array
from numpy import cov
# define first vector
x = array([1,2,3,4,5,6,7,8,9])
print(x)
# define second vector
y = array([9,8,7,6,5,4,3,2,1])
print(y)
# calculate covariance
Sigma = cov(x,y)[0,1]
print(Sigma)

-----Result-----

[1 2 3 4 5 6 7 8 9]
[9 8 7 6 5 4 3 2 1]
-7.5

  
The covariance can be normalized to a score between -1 and 1 by dividing it by the product of the standard deviations of X and Y, which makes the magnitude interpretable. The result is called the correlation of the variables, also known as the Pearson correlation coefficient, named after the developer of the method:
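r = cov(X, Y) / (sX * sY)

where sX and sY are the standard deviations of X and Y respectively.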


# vector correlation
from numpy import array
from numpy import corrcoef
# define first vector
x = array([1,2,3,4,5,6,7,8,9])
print(x)
# define second vector
y = array([9,8,7,6,5,4,3,2,1])
print(y)
# calculate correlation
corr = corrcoef(x,y)[0,1]
print(corr)

-----Result-----

[1 2 3 4 5 6 7 8 9]
[9 8 7 6 5 4 3 2 1]
-1.0
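As a quick check, the same value can be computed directly from the sample covariance and the standard deviations; a minimal sketch, reusing the vectors defined above:

# manual correlation check
from numpy import array
from numpy import cov
from numpy import std
# define vectors
x = array([1,2,3,4,5,6,7,8,9])
y = array([9,8,7,6,5,4,3,2,1])
# normalize the sample covariance by the product of the standard deviations
r = cov(x, y)[0,1] / (std(x, ddof=1) * std(y, ddof=1))
print(r)

This again prints -1.0, up to floating-point rounding.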


D. Covariance Matrix

The covariance matrix is a square, symmetric matrix that describes the covariance between two or more random variables.

The diagonal elements of the covariance matrix are the variances of the individual random variables; as such, it is often called the variance-covariance matrix.
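For random variables X1, X2, ..., Xn, each element of the matrix is defined as:

Sigma[i, j] = cov(Xi, Xj)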


The covariance matrix provides a useful tool for separating the structured
relationships in a matrix of random variables.

It is a key element used in the Principal Component Analysis (PCA) data 
reduction method.

# covariance matrix
from numpy import array
from numpy import cov
# define matrix of observations
X = array([
[1, 5, 8],
[3, 5, 11],
[2, 4, 9],
[3, 6, 10],
[1, 5, 10]])
print(X)
# calculate covariance matrix (cov() treats rows as variables, hence the transpose)
Sigma = cov(X.T)
print(Sigma)

-----Result-----

[[ 1 5 8]
[ 3 5 11]
[ 2 4 9]
[ 3 6 10]
[ 1 5 10]]

[[ 1.  0.25 0.75]
[ 0.25 0.5  0.25]
[ 0.75 0.25 1.3 ]]
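As a brief illustration of the PCA connection mentioned above, the principal components can be recovered from the eigendecomposition of the covariance matrix; a minimal sketch using numpy.linalg.eig, reusing the matrix X from the example:

# principal components via the covariance matrix
from numpy import array
from numpy import cov
from numpy import mean
from numpy.linalg import eig
# define matrix of observations (rows are observations, columns are variables)
X = array([
[1, 5, 8],
[3, 5, 11],
[2, 4, 9],
[3, 6, 10],
[1, 5, 10]])
# calculate covariance matrix (variables as rows, hence the transpose)
Sigma = cov(X.T)
# eigenvectors give the principal directions,
# eigenvalues give the variance explained by each direction
values, vectors = eig(Sigma)
print(values)
# project the centered data onto the principal directions
P = (X - mean(X, axis=0)).dot(vectors)
print(P)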

