Fundamental statistics are useful tools in applied machine learning for better understanding your data.
They also provide the foundation for more advanced linear algebra operations and machine learning methods, such as the covariance matrix and Principal Component Analysis respectively.
In this tutorial, you will discover how fundamental statistical operations work and how to implement them using NumPy.
This tutorial is divided into 4 parts; they are:
- Expected Value and Mean
- Variance and Standard Deviation
- Covariance and Correlation
- Covariance Matrix
A. Expected Value and Mean
In probability, the average value of some random variable X is called the expected value, or expectation.
The expected value is written using the notation E with square brackets around the name of the variable, for example E[X].
In simple cases, such as flipping a coin or rolling a die, the probability of each event is equally likely.
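For a fair six-sided die, for instance, the expected value is the probability-weighted sum of the outcomes. A minimal sketch of this calculation (the variable names are illustrative):

```python
# expected value of a fair six-sided die
from numpy import array

outcomes = array([1, 2, 3, 4, 5, 6])
probs = array([1 / 6] * 6)  # each outcome is equally likely
# E[X] = sum of outcome * probability
expected = (outcomes * probs).sum()
print(expected)
```

Because every outcome has equal probability, this reduces to the simple average of the outcomes, 3.5.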
The mean, denoted by the lowercase Greek letter mu (µ), is calculated from a sample of observations rather than from all possible values.
# vector mean
from numpy import array
from numpy import mean
# define vector
v = array([1,2,3,4,5,6])
print(v)
# calculate mean
result = mean(v)
print(result)
-----Result-----
[1 2 3 4 5 6]
3.5
# matrix means
from numpy import array
from numpy import mean
# define matrix
M = array([
[1,2,3,4,5,6],
[1,2,3,4,5,6]])
print(M)
# column means
col_mean = mean(M, axis=0)
print(col_mean)
# row means
row_mean = mean(M, axis=1)
print(row_mean)
-----Result-----
[[1 2 3 4 5 6]
[1 2 3 4 5 6]]
[ 1. 2. 3. 4. 5. 6.]
[ 3.5 3.5]
B. Variance and Standard Deviation
In probability, the variance of some random variable X is a measure of
how much values in the distribution vary on average with respect to the
mean. The variance is denoted as the function Var() on the variable:
Var[X]
If the probability of each example in the distribution is equal, we have:
Var[X] = 1/n * sum((x_i - E[X])^2)
In statistics, the variance can be estimated from a sample of examples
drawn from the domain.
In the abstract, the sample variance is denoted by the lowercase sigma with a 2 superscript indicating the units are squared (e.g. σ²); the superscript does not mean that you must square the final value.
The sum of the squared differences is multiplied by the reciprocal of the number of examples minus 1 to correct for a bias:
σ² = 1/(n - 1) * sum((x_i - µ)^2)
# vector variance
from numpy import array
from numpy import var
# define vector
v = array([1,2,3,4,5,6])
print(v)
# calculate variance
result = var(v, ddof=1)
print(result)
-----Result-----
[1 2 3 4 5 6]
3.5
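The same result can be computed by hand from the sample variance formula; this sketch (using the same vector) checks the manual calculation against NumPy:

```python
# manual sample variance check against NumPy
from numpy import array, mean, var

v = array([1, 2, 3, 4, 5, 6])
# sum of squared deviations from the mean, divided by n - 1
manual = ((v - mean(v)) ** 2).sum() / (len(v) - 1)
print(manual)          # 3.5
print(var(v, ddof=1))  # 3.5
```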
# matrix variances
from numpy import array
from numpy import var
# define matrix
M = array([
[1,2,3,4,5,6],
[1,2,3,4,5,6]])
print(M)
# column variances
col_var = var(M, ddof=1, axis=0)
print(col_var)
# row variances
row_var = var(M, ddof=1, axis=1)
print(row_var)
-----Result-----
[[1 2 3 4 5 6]
[1 2 3 4 5 6]]
[ 0. 0. 0. 0. 0. 0.]
[ 3.5 3.5]
The standard deviation is calculated as the square root of the variance and is denoted by the lowercase letter s.
# matrix standard deviation
from numpy import array
from numpy import std
# define matrix
M = array([
[1,2,3,4,5,6],
[1,2,3,4,5,6]])
print(M)
# column standard deviations
col_std = std(M, ddof=1, axis=0)
print(col_std)
# row standard deviations
row_std = std(M, ddof=1, axis=1)
print(row_std)
-----Result-----
[[1 2 3 4 5 6]
[1 2 3 4 5 6]]
[ 0. 0. 0. 0. 0. 0.]
[ 1.87082869 1.87082869]
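Since the standard deviation is just the square root of the variance, the relationship can be checked directly; a small sketch on a single vector:

```python
# standard deviation as the square root of the sample variance
from numpy import array, var, std, sqrt

v = array([1, 2, 3, 4, 5, 6])
root_var = sqrt(var(v, ddof=1))  # square root of the sample variance
sample_std = std(v, ddof=1)      # NumPy's sample standard deviation
print(root_var)
print(sample_std)
```

Both lines print the same value, the square root of 3.5.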
C. Covariance and Correlation
In probability, covariance is the measure of the joint probability for two
random variables. It describes how the two variables change together.
It is denoted as the function cov(X,Y).
In statistics, the sample covariance can be calculated in the same way,
although with a bias correction.
The sign of the covariance can be interpreted as whether the two variables increase together (positive) or one increases while the other decreases (negative). The magnitude of the covariance is not easily interpreted. A covariance value of zero indicates that the two variables have no linear relationship; on its own, it does not imply that they are independent.
# vector covariance
from numpy import array
from numpy import cov
# define first vector
x = array([1,2,3,4,5,6,7,8,9])
print(x)
# define second vector
y = array([9,8,7,6,5,4,3,2,1])
print(y)
# calculate covariance
Sigma = cov(x,y)[0,1]
print(Sigma)
-----Result-----
[1 2 3 4 5 6 7 8 9]
[9 8 7 6 5 4 3 2 1]
-7.5
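The covariance formula can also be applied directly; this sketch verifies the NumPy result by hand on the same vectors:

```python
# manual sample covariance check against NumPy
from numpy import array, mean, cov

x = array([1, 2, 3, 4, 5, 6, 7, 8, 9])
y = array([9, 8, 7, 6, 5, 4, 3, 2, 1])
# sum of products of deviations from each mean, divided by n - 1
manual = ((x - mean(x)) * (y - mean(y))).sum() / (len(x) - 1)
print(manual)           # -7.5
print(cov(x, y)[0, 1])  # -7.5
```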
The covariance can be normalized to a score between -1 and 1 to make
magnitude interpretable by dividing it by the standard deviation of X and
Y. The result is called the correlation of the variables, also called the
Pearson correlation coefficient, named for the developer of the method.
# vector correlation
from numpy import array
from numpy import corrcoef
# define first vector
x = array([1,2,3,4,5,6,7,8,9])
print(x)
# define second vector
y = array([9,8,7,6,5,4,3,2,1])
print(y)
# calculate correlation
corr = corrcoef(x,y)[0,1]
print(corr)
-----Result-----
[1 2 3 4 5 6 7 8 9]
[9 8 7 6 5 4 3 2 1]
-1.0
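Dividing the covariance by the product of the two standard deviations reproduces the Pearson correlation coefficient; a sketch using the same vectors:

```python
# manual Pearson correlation check against NumPy
from numpy import array, cov, std, corrcoef

x = array([1, 2, 3, 4, 5, 6, 7, 8, 9])
y = array([9, 8, 7, 6, 5, 4, 3, 2, 1])
# correlation = cov(X, Y) / (std(X) * std(Y))
manual = cov(x, y)[0, 1] / (std(x, ddof=1) * std(y, ddof=1))
print(manual)                # -1.0
print(corrcoef(x, y)[0, 1])  # -1.0
```

The normalization maps the unbounded covariance onto the interpretable [-1, 1] scale.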
D. Covariance Matrix
The covariance matrix is a square and symmetric matrix that describes the covariance between two or more random variables.
The diagonal of the covariance matrix contains the variances of each of the random variables; as such, it is often called the variance-covariance matrix.
The covariance matrix provides a useful tool for separating the structured
relationships in a matrix of random variables.
It is a key element used in the Principal Component Analysis (PCA) data
reduction method.
# covariance matrix
from numpy import array
from numpy import cov
# define matrix of observations
X = array([
[1, 5, 8],
[3, 5, 11],
[2, 4, 9],
[3, 6, 10],
[1, 5, 10]])
print(X)
# calculate covariance matrix
Sigma = cov(X.T)
print(Sigma)
-----Result-----
[[ 1 5 8]
[ 3 5 11]
[ 2 4 9]
[ 3 6 10]
[ 1 5 10]]
[[ 1. 0.25 0.75]
[ 0.25 0.5 0.25]
[ 0.75 0.25 1.3 ]]
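The covariance matrix itself can be reproduced from the centered data: center each column, then compute (Xc^T Xc) / (n - 1). A sketch checking NumPy's result on the same observations:

```python
# manual covariance matrix check against NumPy
from numpy import array, cov, allclose

X = array([
    [1, 5, 8],
    [3, 5, 11],
    [2, 4, 9],
    [3, 6, 10],
    [1, 5, 10]])
# center each column by subtracting its mean
Xc = X - X.mean(axis=0)
# covariance matrix = (Xc^T Xc) / (n - 1)
manual = Xc.T.dot(Xc) / (X.shape[0] - 1)
print(manual)
print(allclose(manual, cov(X.T)))  # True
```

This centered-product form is exactly the quantity that PCA diagonalizes to find the principal components.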