
29/09/2021

Introduction to Univariate, Bivariate and Multivariate Analysis

In the field of data, there is nothing more important than understanding the data that you are trying to analyze. To understand the data, it is important to first understand the purpose of the analysis, because that purpose dictates how to go about analyzing the data and will save you time.

There are a lot of different tools, techniques and methods that can be used to conduct your analysis. You could use software libraries, visualization tools and statistical testing methods.

Whether you are a Data Analyst or a Data Scientist, it is crucial to know Univariate, Bivariate and Multivariate statistical analysis.

First we must understand the types of variables:

Categorical variables: variables that have a finite number of categories or distinct groups. Examples: gender, method of payment, horoscope sign, etc.

Numerical variables: variables that consist of numbers. There are two main types of numerical variables.
  • Discrete variables: variables whose values can be counted, typically whole numbers. Examples: the change in your pocket, number of students in a class, numerical grades, etc.
  • Continuous variables: variables that can take any value within a range (infinitely many possible values) and are usually measured on a scale of some sort. Examples: weight, height, temperature, date and time of a payment, etc.

However, a variable can also be converted to another type for ease of use. For example, a date and time could be broken down into year and month, and the time could be categorized into AM and PM.
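
As a minimal sketch of that kind of conversion, the snippet below splits one timestamp into a year, a month and an AM/PM category using only Python's standard library. The timestamp and the function name split_timestamp are made up for illustration.

```python
from datetime import datetime

# Hypothetical payment timestamp, used only for illustration.
payment_time = datetime(2021, 9, 29, 14, 35)

def split_timestamp(ts):
    """Break a timestamp into simpler discrete and categorical features."""
    return {
        "year": ts.year,                           # discrete
        "month": ts.month,                         # discrete
        "period": "AM" if ts.hour < 12 else "PM",  # categorical
    }

features = split_timestamp(payment_time)
print(features)  # {'year': 2021, 'month': 9, 'period': 'PM'}
```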

A common technique for continuous variables is "binning" the variables into categories. For example, the weight of a person can be categorized into "below average"/"slim", "average" and "above average"/"obese" by setting ranges.
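
Here is one way binning could look in plain Python. The cut points (60 kg and 80 kg), the labels and the function name bin_weight are assumptions chosen for the example; real thresholds depend on your data and the purpose of the analysis.

```python
from bisect import bisect_right

# Hypothetical cut points in kilograms; not taken from any standard.
BIN_EDGES = [60.0, 80.0]
BIN_LABELS = ["below average", "average", "above average"]

def bin_weight(weight_kg):
    """Map a continuous weight to the label of the range it falls in."""
    # bisect_right counts how many edges are <= weight_kg,
    # which is exactly the index of the matching label.
    return BIN_LABELS[bisect_right(BIN_EDGES, weight_kg)]

weights = [52.3, 60.0, 71.8, 95.1]
binned = [bin_weight(w) for w in weights]
print(binned)  # ['below average', 'average', 'average', 'above average']
```

Note that a value sitting exactly on an edge (60.0 here) goes into the upper range. Whichever convention you choose, apply it consistently.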


Univariate Analysis

Univariate analysis is the simplest of the three analyses: the data you are analyzing consists of only one variable. There are many different ways people use univariate analysis. The most common is checking the central tendency (mean, median and mode), the range, the maximum and minimum values, and the standard deviation of a variable.
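
The summary statistics listed above can be computed with Python's built-in statistics module. This is only a sketch: the eight values below are an illustrative sample in the style of the iris "sepal_length" column, not the full dataset, and the helper name summarize is my own.

```python
import statistics

# Small illustrative sample (in the style of iris "sepal_length", in cm).
sepal_length = [5.1, 4.9, 4.7, 5.0, 5.4, 5.0, 6.3, 5.8]

def summarize(values):
    """Return common univariate summary statistics for one variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

summary = summarize(sepal_length)
for name, value in summary.items():
    print(f"{name:>6}: {value:.3f}")
```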

A common visual technique for univariate analysis is a histogram, which is a frequency distribution graph. You could also use a box plot or violin plot, which shows the spread of the variable and provides insight into outliers. Using any of the above to compare "sepal_length" across species in the iris dataset examines only one measured variable, and is therefore a Univariate analysis.
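
To show what a histogram actually computes, here is a rough text-only version that counts how many values fall into each bin. The sample values, the 0.5 cm bin width and the function name histogram are assumptions for illustration; a plotting library would draw the same counts as bars.

```python
from collections import Counter

# Small illustrative sample (in the style of iris "sepal_length", in cm).
sepal_length = [5.1, 4.9, 4.7, 5.0, 5.4, 5.0, 6.3, 5.8]
BIN_WIDTH = 0.5  # hypothetical bin width

def histogram(values, bin_width):
    """Count the values in each bin [left, left + bin_width)."""
    counts = Counter(int(v // bin_width) for v in values)
    # Key each bin by its left edge, in ascending order.
    return {k * bin_width: counts[k] for k in sorted(counts)}

hist = histogram(sepal_length, BIN_WIDTH)
for left, count in hist.items():
    print(f"{left:.1f}-{left + BIN_WIDTH:.1f} | {'#' * count}")
```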


Bivariate Analysis

Bivariate analysis is where you compare two variables to study their relationship. These variables could be dependent on or independent of each other. In Bivariate analysis the observations are paired: there is always a Y-value for each X-value.

The most common visual technique for bivariate analysis is a scatter plot, where one variable is on the x-axis and the other on the y-axis. In addition to the scatter plot, a regression plot and the correlation coefficient are also frequently used to study the relationship between the variables. For example, continuing with the iris dataset, you can compare "sepal_length" vs "sepal_width" or "sepal_length" vs "petal_length" to see if there is a relationship.
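
As a sketch of the numbers behind a regression plot, the code below computes the Pearson correlation coefficient and a least-squares line for two paired variables. The function names pearson_r and fit_line and the sample values are my own choices for illustration, not part of any library.

```python
from math import sqrt

def _centered_sums(xs, ys):
    """Return the co-deviation sum and the two squared-deviation sums."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two paired sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy, mean_x, mean_y

def pearson_r(xs, ys):
    """Pearson correlation coefficient, between -1 and 1."""
    sxy, sxx, syy, _, _ = _centered_sums(xs, ys)
    return sxy / sqrt(sxx * syy)

def fit_line(xs, ys):
    """Least-squares slope and intercept of y = slope * x + intercept."""
    sxy, sxx, _, mean_x, mean_y = _centered_sums(xs, ys)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Small illustrative sample in the style of the iris columns (cm).
sepal_length = [5.1, 4.9, 6.3, 5.8, 7.0, 6.4]
petal_length = [1.4, 1.4, 6.0, 5.1, 4.7, 4.5]

r = pearson_r(sepal_length, petal_length)
slope, intercept = fit_line(sepal_length, petal_length)
print(f"r = {r:.2f}, petal_length = {slope:.2f} * sepal_length + {intercept:.2f}")
```

A value of r close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little or no linear relationship. It says nothing about non-linear patterns, which is why the scatter plot is still worth looking at.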


Multivariate Analysis

Multivariate analysis is similar to Bivariate analysis, but you are comparing more than two variables. For three variables, you can create a 3-D model to study the relationship (also known as Trivariate Analysis). However, since we cannot visualize more than three dimensions, we often rely on other software and techniques to grasp the relationships in the data.

In terms of visualization, the Seaborn library in Python provides pair plots (the pairplot function), which generate one large chart of the selected variables plotted against one another: scatter plots for each pair of variables and, on the diagonal, the distribution of each variable on its own. This is also known as a scatter plot matrix. In the series to come, I will provide the code and examples of this.

Depending on the dataset and the depth of analysis required, there are other techniques that you could deploy, such as Principal Component Analysis (PCA), logistic regression, linear regression, cluster analysis, etc. Again, in the series to come, I will provide code and examples of these and dive deeper into PCA and its importance in data.


