30/08/2021

Data Cleaning - Part 3 - How to Mark and Remove Missing Data

Real-world data often has missing values. Data can have missing values for a number of reasons, such as observations that were not recorded and data corruption.

Handling missing data is important as many machine learning algorithms do not support data with missing values. 

In this tutorial, you will discover how to handle missing data for machine learning with Python.

Specifically, after completing this tutorial you will know:
  • How to mark invalid or corrupt values as missing in your dataset.
  • How to confirm that the presence of marked missing values causes problems for learning algorithms.
  • How to remove rows with missing data from your dataset and evaluate a learning algorithm on the transformed dataset.
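
A minimal sketch of both steps with pandas, assuming a hypothetical CSV file named data.csv whose columns 1 through 5 use 0 to encode invalid values:

    # mark invalid zeros as missing, then drop the affected rows
    import numpy as np
    import pandas as pd

    df = pd.read_csv("data.csv", header=None)
    df[[1, 2, 3, 4, 5]] = df[[1, 2, 3, 4, 5]].replace(0, np.nan)
    print(df.isnull().sum())   # count of marked missing values per column
    df = df.dropna()           # remove all rows that contain a missing value
    print(df.shape)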

Data Cleaning - Part 2 - Outlier Identification and Removal

Sometimes a dataset can contain extreme values that are outside the range of what is expected and unlike the other data. These are called outliers, and machine learning modeling and model skill in general can often be improved by understanding and even removing these outlier values.

After completing this tutorial, you will know:
  • That an outlier is an unlikely observation in a dataset and may have one of many causes.
  • How to use simple univariate statistics like standard deviation and interquartile range to identify and remove outliers from a data sample.
  • How to use an outlier detection model to identify and remove rows from a training dataset in order to lift predictive modeling performance.
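
A minimal sketch of the univariate interquartile-range method on a synthetic Gaussian sample (the 1.5 x IQR cut-off is the conventional choice):

    # identify and remove outliers using the interquartile range
    import numpy as np

    np.random.seed(1)
    data = 5 * np.random.randn(10000) + 50   # Gaussian sample: mean 50, sd 5
    q25, q75 = np.percentile(data, 25), np.percentile(data, 75)
    iqr = q75 - q25
    lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr
    outliers = data[(data < lower) | (data > upper)]
    kept = data[(data >= lower) & (data <= upper)]
    print(len(outliers), "outliers removed,", len(kept), "observations kept")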

28/08/2021

Data Cleaning - Part 1 - Basic Data Cleaning

Data cleaning is a critically important step in any machine learning project. Before jumping to sophisticated methods, there are some very basic data cleaning operations that you probably should perform on every single machine learning project. 

In this tutorial, you will discover basic data cleaning methods. After completing this tutorial, you will know:
  • How to identify and remove column variables that only have a single value.
  • How to identify and consider column variables with very few unique values. 
  • How to identify and remove rows that contain duplicate observations.
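
A minimal sketch of these operations with pandas (data.csv is again a hypothetical input file):

    # drop single-value columns and duplicate rows
    import pandas as pd

    df = pd.read_csv("data.csv")
    single_valued = [col for col in df.columns if df[col].nunique() == 1]
    df = df.drop(columns=single_valued)   # a column with one value carries no information
    print(df.duplicated().sum(), "duplicate rows found")
    df = df.drop_duplicates()             # remove rows that duplicate earlier observations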

27/08/2021

Data Preparation Without Data Leakage

In this tutorial, you will discover how to avoid data leakage during data preparation when evaluating machine learning models. 
After completing this tutorial, you will know:
  • Naive application of data preparation methods to the whole dataset results in data leakage that causes incorrect estimates of model performance.
  • Data preparation must be fit on the training set only in order to avoid data leakage.
  • How to implement data preparation without data leakage for train-test splits and k-fold cross-validation in Python.
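
A minimal sketch using a scikit-learn Pipeline, which re-fits the scaler on the training folds only (the dataset here is synthetic):

    # data preparation without leakage: scaling happens inside the cross-validation loop
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler

    X, y = make_classification(n_samples=1000, n_features=20, random_state=7)
    pipeline = Pipeline([("scaler", MinMaxScaler()), ("model", LogisticRegression())])
    # each fold fits the whole pipeline on its training split, so no test data leaks
    scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=10)
    print("mean accuracy: %.3f" % scores.mean())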

26/08/2021

Why Data Preparation is So Important

Given that we have standard implementations of highly parameterized machine learning algorithms in open source libraries, fitting models has become routine. 

As such, the most challenging part of each predictive modeling project is how to prepare the one thing that is unique to the project: the data used for modeling. 

In this tutorial, you will discover the importance of data preparation for each machine learning project. 

25/08/2021

Data Preparation in a Machine Learning Project

Data preparation may be one of the most difficult steps in any machine learning project. The reason is that each dataset is different and highly specific to the project.

After completing this tutorial, you will know:
  • Each predictive modeling project with machine learning is different, but there are common steps performed on each project. 
  • Data preparation involves best exposing the unknown underlying structure of the problem to the learning algorithms.
  • The steps before and after data preparation in a project can inform what data preparation methods to apply, or at least explore.

23/08/2021

Cross-Entropy for Machine Learning

Cross-entropy is commonly used in machine learning as a loss function. 

Cross-entropy is a measure from the field of information theory, building upon entropy and generally calculating the difference between two probability distributions.

It is closely related to, but different from, KL divergence, which calculates the relative entropy between two probability distributions, whereas cross-entropy can be thought of as calculating the total entropy between the distributions.

Cross-entropy is also related to, and often confused with, logistic loss, called log loss. Although the two measures are derived from different sources, when used as loss functions for classification models, both measures calculate the same quantity and can be used interchangeably.

In this tutorial, you will discover cross-entropy for machine learning.
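
A minimal sketch of the calculation, H(P, Q) = -sum p(x) * log(q(x)), for two made-up discrete distributions:

    # cross-entropy between a true and a predicted distribution, in nats
    from math import log

    p = [0.10, 0.40, 0.50]   # true distribution
    q = [0.80, 0.15, 0.05]   # predicted distribution
    cross_entropy = -sum(p_i * log(q_i) for p_i, q_i in zip(p, q))
    print("H(P, Q) = %.3f nats" % cross_entropy)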

Divergence Between Probability Distributions

It is often desirable to quantify the difference between probability distributions for a given variable.

This occurs frequently in machine learning, when we may be interested in calculating the difference between an actual and observed probability distribution.

This can be achieved using techniques from information theory, such as the Kullback-Leibler Divergence (KL divergence), or relative entropy, and the Jensen-Shannon Divergence that provides a normalized and symmetrical version of the KL divergence.
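
A minimal sketch of both divergences for two made-up discrete distributions:

    # KL divergence and the normalized, symmetric Jensen-Shannon divergence
    from math import log

    def kl_divergence(p, q):
        # KL(P || Q) = sum p(x) * log(p(x) / q(x))
        return sum(p_i * log(p_i / q_i) for p_i, q_i in zip(p, q))

    def js_divergence(p, q):
        # JS(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), where M averages P and Q
        m = [(p_i + q_i) / 2 for p_i, q_i in zip(p, q)]
        return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

    p = [0.10, 0.40, 0.50]
    q = [0.80, 0.15, 0.05]
    print("KL: %.3f, JS: %.3f" % (kl_divergence(p, q), js_divergence(p, q)))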

22/08/2021

Probability Density Estimation

Probability density is the relationship between observations and their probability. 

Some outcomes of a random variable will have low probability density and other outcomes will have a high probability density. 

The overall shape of the probability density is referred to as a probability distribution, and the calculation of probabilities for specific outcomes of a random variable is performed by a probability density function, or PDF for short. 

It is useful to know the probability density function for a sample of data in order to know whether a given observation is unlikely, or so unlikely as to be considered an outlier or anomaly and whether it should be removed. 
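
A minimal sketch of parametric density estimation, fitting a normal PDF to a synthetic sample:

    # estimate the distribution parameters, then query the fitted density function
    import numpy as np
    from scipy.stats import norm

    np.random.seed(1)
    sample = 5 * np.random.randn(1000) + 50   # sample with mean 50 and sd 5
    mu, sigma = sample.mean(), sample.std()   # parameter estimates from the data
    pdf = norm(mu, sigma)                     # fitted probability density function
    print("p(x=50) = %.3f, p(x=70) = %.6f" % (pdf.pdf(50), pdf.pdf(70)))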

Probability Distributions

Probability can be used for more than calculating the likelihood of one event; it can summarize the likelihood of all possible outcomes. A thing of interest in probability is called a random variable, and the relationship between each possible outcome for a random variable and its probability is called a probability distribution.
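
A minimal sketch, using a fair die as the random variable:

    # a discrete probability distribution maps each outcome to its probability
    outcomes = [1, 2, 3, 4, 5, 6]
    distribution = {x: 1 / 6 for x in outcomes}
    print(sum(distribution.values()))   # sums to 1 (up to float rounding)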

21/08/2021

Information Entropy

Information theory is a subfield of mathematics concerned with transmitting data across a noisy channel. 

A cornerstone of information theory is the idea of quantifying how much information there is in a message. 

More generally, this can be used to quantify the information in an event and a random variable, called entropy, and is calculated using probability.
 
Calculating information and entropy is a useful tool in machine learning and is used as the basis for techniques such as feature selection, building decision trees, and, more generally, fitting classification models.

As such, a machine learning practitioner requires a strong understanding and intuition for information and entropy.
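
A minimal sketch of both calculations in bits:

    # information of a single event and entropy of a random variable
    from math import log2

    p_event = 0.5
    info = -log2(p_event)                        # h(x) = -log2(p(x)) -> 1 bit
    probs = [0.2, 0.3, 0.5]
    entropy = -sum(p * log2(p) for p in probs)   # H(X) = -sum p(x) * log2(p(x))
    print("information: %.1f bits, entropy: %.3f bits" % (info, entropy))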

20/08/2021

Data Visualization

Data visualization is an important skill in applied statistics and machine learning. It can be helpful when exploring and getting to know a dataset, helping to identify patterns, corrupt data, outliers, and much more.
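
As a minimal sketch, a histogram of a synthetic sample is often the first such plot (matplotlib assumed):

    # a histogram gives a quick view of a variable's distribution
    import numpy as np
    import matplotlib.pyplot as plt

    np.random.seed(1)
    data = np.random.randn(1000)
    plt.hist(data, bins=30)
    plt.title("Histogram of a synthetic sample")
    plt.show()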

18/08/2021

Examples Of Statistics In Machine Learning

Statistics and machine learning are two very closely related fields. In fact, the line between the two can be very fuzzy at times.

It would be fair to say that statistical methods are required to effectively work through a machine learning predictive modeling project.

17/08/2021

Linear Regression

Linear regression is a method for modeling the relationship between one or more independent variables and a dependent variable. 

It is a staple of statistics and is often considered a good introductory machine learning method. 

In this tutorial, you will discover the matrix formulation of linear regression and how to solve it using direct and matrix factorization methods.
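
A minimal sketch of both approaches on a small made-up dataset: the normal equation solved directly, and NumPy's factorization-based least-squares solver:

    # linear regression as a matrix problem: solve X b = y for the coefficients b
    import numpy as np

    X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])    # a column of ones adds the intercept
    y = np.array([1.2, 1.9, 3.1, 4.2])
    b_direct = np.linalg.inv(X.T @ X) @ X.T @ y       # normal equation, direct solution
    b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # SVD-based, more stable
    print(b_direct, b_lstsq)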

Singular Value Decomposition

Matrix decomposition, also known as matrix factorization, involves describing a given matrix using its constituent elements. 

Perhaps the best known and most widely used matrix decomposition method is the Singular-Value Decomposition, or SVD.

All matrices have an SVD, which makes it more stable than other methods, such as the eigendecomposition. 

As such, it is often used in a wide array of applications including compression, denoising, and data reduction.
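
A minimal sketch with NumPy, factorizing a small matrix and reconstructing it from its parts:

    # singular value decomposition: A = U @ Sigma @ V^T
    import numpy as np

    A = np.array([[1, 2], [3, 4], [5, 6]])
    U, s, VT = np.linalg.svd(A)
    Sigma = np.zeros(A.shape)
    Sigma[:2, :2] = np.diag(s)              # singular values go on the diagonal
    print(np.allclose(A, U @ Sigma @ VT))   # True: the factors rebuild A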

16/08/2021

Eigendecomposition

Matrix decompositions are a useful tool for reducing a matrix to its constituent parts in order to simplify a range of more complex operations.

Perhaps the most used type of matrix decomposition is the eigendecomposition, which decomposes a matrix into eigenvectors and eigenvalues.
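
A minimal sketch with NumPy, checking the defining property A v = lambda v:

    # eigendecomposition of a square matrix
    import numpy as np

    A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    values, vectors = np.linalg.eig(A)
    v, lam = vectors[:, 0], values[0]    # first eigenvector and eigenvalue
    print(np.allclose(A @ v, lam * v))   # True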

Matrix Decompositions

Many complex matrix operations cannot be solved efficiently or with stability using the limited precision of computers. 

Matrix decompositions are methods that reduce a matrix into constituent parts that make it easier to calculate more complex matrix operations. 

Matrix decomposition methods, also called matrix factorization methods, are a foundation of linear algebra in computers, even for basic operations such as solving systems of linear equations, calculating the inverse, and calculating the determinant of a matrix.
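
As a minimal sketch, the LU decomposition, the factorization behind many linear-equation solvers, via SciPy:

    # LU decomposition: A = P @ L @ U (permutation, lower and upper triangular)
    import numpy as np
    from scipy.linalg import lu

    A = np.array([[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]])
    P, L, U = lu(A)
    print(np.allclose(A, P @ L @ U))   # True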

15/08/2021

Principal Component Analysis

An important machine learning method for dimensionality reduction is called Principal Component Analysis (PCA).

It is a method that uses simple matrix operations from linear algebra and statistics to calculate a projection of the original data into the same number or fewer dimensions.

In this tutorial, you will discover the PCA machine learning method for dimensionality reduction and how to implement it from scratch in Python.
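
A minimal sketch of the from-scratch recipe: center the data, compute the covariance matrix, take its eigendecomposition, and project:

    # PCA from scratch with basic linear algebra
    import numpy as np

    A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    M = A - A.mean(axis=0)   # center each column
    C = np.cov(M.T)          # covariance matrix of the centered data
    values, vectors = np.linalg.eig(C)
    P = M @ vectors          # project the data onto the principal components
    print(values)            # variance explained by each component
    print(P)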

14/08/2021

Introduction to Multivariate Statistics

Fundamental statistics are useful tools in applied machine learning for better understanding your data.

They are also the tools that provide the foundation for more advanced linear algebra operations and machine learning methods, such as the Covariance Matrix and Principal Component Analysis respectively.

In this tutorial, you will discover how fundamental statistical operations work and how to implement them using NumPy.
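
A minimal sketch of these operations with NumPy on two made-up variables:

    # mean, variance, covariance, and correlation
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    print(x.mean(), x.var(ddof=1))   # sample mean and sample variance
    print(np.cov(x, y)[0, 1])        # covariance between x and y
    print(np.corrcoef(x, y)[0, 1])   # Pearson correlation, close to 1 here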

Tensors and Tensor Arithmetic

In deep learning it is common to see a lot of discussion around tensors as the cornerstone data structure. 

Tensor even appears in the name of Google's flagship machine learning library: TensorFlow.

Tensors are a type of data structure used in linear algebra, and like vectors and matrices, you can calculate arithmetic operations with tensors.

This tutorial is divided into 3 parts; they are:
  • What are Tensors
  • Tensors in Python
  • Tensor Arithmetic
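
A minimal sketch of a tensor and element-wise tensor arithmetic with NumPy:

    # a 3x3x3 tensor is a three-dimensional array
    import numpy as np

    T1 = np.arange(27).reshape(3, 3, 3)
    T2 = np.ones((3, 3, 3))
    print((T1 + T2).shape)   # element-wise addition keeps the shape: (3, 3, 3)
    print((T1 * T2).shape)   # element-wise (Hadamard) product, also (3, 3, 3)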

13/08/2021

Sparse Matrix

A sparse matrix is a matrix that is comprised of mostly zero values. Sparse matrices are distinct from matrices with mostly non-zero values, which are referred to as dense matrices. 

Below is an example of a small 3 × 6 sparse matrix (see the sketch below). The example has 13 zero values of the 18 elements in the matrix, giving this matrix a sparsity score of 0.722, or about 72%.
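
A minimal sketch of one such matrix (the specific values are illustrative, chosen to match the stated 13-of-18 sparsity) and a compressed sparse representation:

    # a 3 x 6 matrix with 13 zeros, its sparsity score, and its CSR form
    import numpy as np
    from scipy.sparse import csr_matrix

    A = np.array([[1, 0, 0, 1, 0, 0],
                  [0, 0, 2, 0, 0, 1],
                  [0, 0, 0, 2, 0, 0]])
    sparsity = 1.0 - np.count_nonzero(A) / A.size   # 13 / 18 = 0.722
    print("sparsity: %.3f" % sparsity)
    S = csr_matrix(A)   # stores only the non-zero values and their positions
    print(S)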

11/08/2021

Loss and Loss Functions for Training Deep Learning Neural Networks

Neural networks are trained using stochastic gradient descent and require that you choose a loss function when designing and configuring your model.

There are many loss functions to choose from and it can be challenging to know what to choose, or even what a loss function is and the role it plays when training a neural network.
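
As a minimal sketch, two of the most common choices, mean squared error for regression and cross-entropy for classification, computed by hand:

    # two standard loss functions on made-up targets and predictions
    import numpy as np

    y_true = np.array([1.0, 0.0, 1.0, 1.0])
    y_pred = np.array([0.9, 0.2, 0.8, 0.6])
    mse = np.mean((y_true - y_pred) ** 2)                 # regression loss
    bce = -np.mean(y_true * np.log(y_pred)
                   + (1 - y_true) * np.log(1 - y_pred))   # binary cross-entropy
    print("MSE: %.3f, cross-entropy: %.3f" % (mse, bce))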

09/08/2021

4 Types of Classification Tasks in Machine Learning

Examples of classification problems include:
  • Given an example, classify whether it is spam or not.
  • Given a handwritten character, classify it as one of the known characters.
  • Given recent user behavior, classify it as churn or not.

Classification requires a training dataset with many examples of inputs and outputs from which to learn.
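
A minimal sketch of that idea with scikit-learn on a synthetic binary dataset:

    # learn a classifier from many example input-output pairs
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=4, random_state=1)
    model = LogisticRegression().fit(X, y)   # learn the input-to-label mapping
    print(model.predict(X[:3]))              # predicted class labels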

14 Different Types of Learning in Machine Learning

There are 14 types of learning that you must be familiar with as a practitioner; they are:

Learning Problems

1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning

Hybrid Learning Problems

4. Semi-Supervised Learning
5. Self-Supervised Learning
6. Multi-Instance Learning

Statistical Inference

7. Inductive Learning
8. Deductive Inference
9. Transductive Learning

Learning Techniques

10. Multi-Task Learning
11. Active Learning
12. Online Learning
13. Transfer Learning
14. Ensemble Learning

07/08/2021

Generative Adversarial Networks (GAN)

Reinforcement Learning

Natural Language Processing

NumPy Array Broadcasting

Arrays with different sizes cannot be added, subtracted, or generally used in arithmetic. A way to overcome this is to duplicate the smaller array so that it has the same dimensionality and size as the larger array. This is called array broadcasting.
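
A minimal sketch, adding a one-dimensional array to each row of a two-dimensional array:

    # NumPy broadcasting stretches the smaller array automatically
    import numpy as np

    A = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
    b = np.array([10, 20, 30])             # shape (3,), broadcast across both rows
    print(A + b)                           # [[11 22 33], [14 25 36]]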

06/08/2021

Evaluation Metrics for Classification

A. Binary Classification

Confusion Matrix

The confusion matrix is a table with the number of correct and incorrect predictions broken down by class.
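
A minimal sketch with scikit-learn on made-up labels:

    # rows of the confusion matrix are actual classes, columns are predictions
    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
    print(confusion_matrix(y_true, y_pred))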



How to Accelerate Learning of Deep Neural Networks With Batch Normalization

Batch Normalization is a technique designed to automatically standardize the inputs to a layer in a deep learning neural network.

Once implemented, batch normalization has the effect of dramatically accelerating the training process of a neural network, and in some cases improves the performance of the model via a modest regularization effect.
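
A minimal sketch of where the layer can sit in a Keras model (one common placement, between a layer and its activation):

    # batch normalization standardizes layer inputs per mini-batch
    from tensorflow.keras.layers import Activation, BatchNormalization, Dense
    from tensorflow.keras.models import Sequential

    model = Sequential([
        Dense(32, input_shape=(10,)),
        BatchNormalization(),   # standardize before the nonlinearity
        Activation("relu"),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.summary()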

05/08/2021

Lenet-5 Architecture

Lenet-5 is one of the earliest pre-trained models, proposed by Yann LeCun in the year 1998.

The network has 5 layers with learnable parameters and hence is named Lenet-5. It has three sets of convolution layers with a combination of average pooling. After the convolution and average pooling layers, we have two fully connected layers. At last, a Softmax classifier classifies the images into their respective classes.
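
A minimal sketch of this layout as it is commonly rendered in Keras (the tanh activations follow the 1998 design; the original paper used an RBF output layer rather than softmax):

    # Lenet-5: three convolutions, two average pools, then fully connected layers
    from tensorflow.keras.layers import AveragePooling2D, Conv2D, Dense, Flatten
    from tensorflow.keras.models import Sequential

    model = Sequential([
        Conv2D(6, kernel_size=5, activation="tanh", input_shape=(32, 32, 1)),  # C1
        AveragePooling2D(pool_size=2),                                         # S2
        Conv2D(16, kernel_size=5, activation="tanh"),                          # C3
        AveragePooling2D(pool_size=2),                                         # S4
        Conv2D(120, kernel_size=5, activation="tanh"),                         # C5
        Flatten(),
        Dense(84, activation="tanh"),      # F6, fully connected
        Dense(10, activation="softmax"),   # softmax classifier over the classes
    ])
    model.summary()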

Convolutional Neural Networks for Machine Learning

These networks preserve the spatial structure of the problem.

CNNs are popular because people are achieving state-of-the-art results on difficult computer vision and natural language processing tasks.

Given a dataset of grayscale images with a standardized size of 32 × 32 pixels each, a traditional feedforward neural network would require 1,024 input weights (plus one bias) for each neuron in the first hidden layer, since 32 × 32 = 1,024 pixels.

03/08/2021

When to use MLP, CNN, RNN Neural Networks?

What neural network is appropriate for your predictive modeling problem?

When to Use Multilayer Perceptrons?
  • Tabular datasets
  • Classification prediction problems
  • Regression prediction problems

01/08/2021

How to choose an Activation Function for Deep Learning

Activation functions are a critical part of the design of a neural network.

The choice of activation function in the hidden layer will control how well the network model learns the training dataset. The choice of activation function in the output layer will define the type of predictions the model can make.
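
A minimal sketch of three common choices and the roles described above:

    # ReLU for hidden layers; sigmoid and softmax for output layers
    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)         # typical hidden-layer activation

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))   # binary-classification output

    def softmax(x):
        e = np.exp(x - np.max(x))         # subtract the max for numerical stability
        return e / e.sum()                # multi-class output; sums to 1

    x = np.array([-2.0, 0.0, 3.0])
    print(relu(x), sigmoid(x), softmax(x))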