
20/02/2023

Processing Data

LESSON 1 - Define Data and Establish Baseline

Why is data definition hard?

- Labels can be ambiguous (different labelers may reasonably label the same example differently)

Data definition question

- What is the input x ?

- What features need to be included?

- What is the target label y?

- How can we ensure labelers give consistent labels?

Major types of data problems

- Unstructured data: e.g., images, video, audio, text

- Structured data: e.g., spreadsheets, database tables

- Small data: <= 10,000 examples

- Big data: > 10,000 examples


Unstructured vs Structured Data 

Unstructured Data

  • May or may not have a huge collection of unlabeled examples x
  • Humans can label more data
  • Data augmentation is more likely to be helpful

Structured Data

  • May be difficult to obtain more data
  • Human labeling may not be possible (with some exceptions)

Small Data vs Big Data

Small Data  (<= 10,000 examples)

  • Clean labels are critical
  • Can manually look through dataset and fix labels
  • Can get all labelers to talk to each other.

Big Data (> 10,000 examples)

  • Emphasis on data processes


If you are looking for advice on a problem in a machine learning project, try to find someone who has worked in the same quadrant (unstructured vs. structured, small vs. big data) as the problem you are trying to solve.


Small data and label consistency



Suppose you have a dataset of just five examples and the output y is quite noisy. It is difficult to know what function you should use to map voltage to rotor speed in RPM.

If you had a ton of data that is equally noisy, the learning algorithm could average over the noise and fit the function reasonably well.

But if the labels are clean and consistent, you can pretty confidently fit a function through your data with only five examples, as the sketch below illustrates.
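
A minimal sketch of this point (the voltage/RPM numbers below are made up for illustration): with consistent labels, even five points give a confident least-squares fit.

import numpy as np

# Hypothetical, consistently labeled (voltage, rpm) examples
voltage = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
rpm = np.array([1010.0, 1980.0, 3020.0, 3990.0, 5010.0])

# With clean labels, even five points pin down the line
slope, intercept = np.polyfit(voltage, rpm, deg=1)
print(f"rpm = {slope:.1f} * voltage + {intercept:.1f}")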


Big data problems can have small data challenges too.

These are problems with a large dataset overall, but where there are rare events in the input.

Web search: large web search engine companies all have very large datasets of web search queries, but many queries are actually very rare.

Self-driving cars: the very rare occurrence of a young child running across the highway, or the very rare occurrence of a truck parked across the highway.

Product recommendation systems: if you have an online catalog of millions of items, then you have a lot of products where the number sold is quite small, and so the amount of data you have about users interacting with those items is actually small.


=> When you have a small dataset, label consistency is critical. Even when you have a big dataset, label consistency can be very important because of these rare events.


Improving label consistency

  • Have multiple labelers label the same example
  • When there is disagreement, have the machine learning engineer (MLE), subject matter expert (SME), and/or labelers discuss the definition of y to reach agreement.
  • If labelers believe that x doesn't contain enough information, consider changing x.
  • Iterate until it is hard to significantly increase agreement (a simple way to track agreement is sketched after this list).
  • Have a class/label to capture uncertainty
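
A simple way to track whether another labeling round is worth it is to measure pairwise percent agreement between labelers. Below is a minimal sketch (the labels and labelers are made up; a more rigorous statistic would be Cohen's kappa):

from itertools import combinations

def pairwise_agreement(labels_by_labeler):
    """Mean fraction of examples on which each pair of labelers agrees."""
    scores = []
    for a, b in combinations(labels_by_labeler, 2):
        agree = sum(x == y for x, y in zip(a, b))
        scores.append(agree / len(a))
    return sum(scores) / len(scores)

# Hypothetical labels from three labelers on five examples
print(pairwise_agreement([
    ["scratch", "dent", "ok", "scratch", "ok"],
    ["scratch", "dent", "ok", "dent",    "ok"],
    ["scratch", "ok",   "ok", "scratch", "ok"],
]))  # iterate on label definitions until this stops improving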

Small Data vs Big Data (Unstructured Data)

Small Data

  • Usually a small number of labelers
  • Can ask labelers to discuss specific labels

Big Data

  • Get to a consistent definition with a small group
  • Then send labeling instructions to the larger group of labelers
  • Can consider having multiple labelers label every example and using voting or consensus labels to increase accuracy (a minimal majority-vote sketch follows).
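
A minimal majority-vote sketch (the tie-breaking rule here is an assumption; in practice you might route ties back to an SME for discussion):

from collections import Counter

def majority_vote(labels):
    """Return the most common label; None signals a tie to escalate."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: send back for discussion / SME review
    return counts[0][0]

# Hypothetical: three labelers per example
print(majority_vote(["dent", "dent", "scratch"]))  # -> dent
print(majority_vote(["dent", "scratch", "ok"]))    # -> None (tie)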

Human-Level Performance

In the process of measuring HLP, you may find that HLP is much less than perfect performance, much lower than 100 percent. Often this reflects inconsistent labeling instructions rather than genuinely hard examples; improving label consistency will raise HLP and also give the learning algorithm cleaner data to learn from.
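
As a rough sketch, HLP on an evaluation set can be estimated as the fraction of examples where a human labeler agrees with the ground-truth label (all names and data below are illustrative; note that if "ground truth" is itself just another human label, this measures labeler agreement rather than true accuracy):

def human_level_performance(human_labels, ground_truth):
    """Fraction of examples where the human matches the ground truth."""
    matches = sum(h == g for h, g in zip(human_labels, ground_truth))
    return matches / len(ground_truth)

# Hypothetical: HLP well below 100% often signals ambiguous label definitions
print(human_level_performance(["ok", "dent", "ok", "dent"],
                              ["ok", "dent", "ok", "ok"]))  # -> 0.75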

Obtaining Data

How long should you spend obtaining data?

Machine learning is a highly iterative process: you pick a model and hyperparameters, collect a dataset, train, carry out error analysis, and go around this loop multiple times to get to a good model. So rather than spending a long time collecting data up front, get an initial dataset quickly and start iterating.








Data pipeline

POC (proof-of-concept) phase
- Goal is to decide if the application is workable and worth deploying. 
- Focus on getting prototype to work!
- It's ok if data pre-processing is manual. But take extensive notes/comments.

Production phase
- After project utility is established, use more sophisticated tools to make sure the data pipeline is replicable.
- E.g., TensorFlow Transform, Apache Beam, Airflow, ... (see the sketch below)
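
For instance, with TensorFlow Transform the preprocessing is expressed as a preprocessing_fn that can be replayed identically at training and serving time. A minimal sketch (the feature names are hypothetical, and running it also requires an Apache Beam pipeline):

import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Declarative preprocessing that tf.Transform records and replays."""
    return {
        # Scale a hypothetical numeric feature to zero mean / unit variance
        'voltage_scaled': tft.scale_to_z_score(inputs['voltage']),
        # Map a hypothetical categorical feature to vocabulary indices
        'defect_id': tft.compute_and_apply_vocabulary(inputs['defect_type']),
    }

The point is the contrast with the POC phase: the transformation is recorded and replicable rather than manual.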



Data provenance: refers to where the data came from.

Data lineage: refers to the sequence of steps needed to get to the end of the pipeline.

Metadata: data about data (e.g., time, factory, camera settings, labeler ID).
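
One lightweight way to keep provenance, lineage, and metadata together is to record them alongside each pipeline step. A minimal sketch with made-up fields (real pipelines would typically use a tool such as TFX's ML Metadata instead):

import json
import time

def record_step(lineage_log, step_name, inputs, outputs, **metadata):
    """Append one pipeline step to a lineage log (data about the data)."""
    lineage_log.append({
        "step": step_name,
        "inputs": inputs,        # provenance: where the data came from
        "outputs": outputs,
        "timestamp": time.time(),
        "metadata": metadata,    # e.g., labeler ID, tool version
    })

lineage = []
record_step(lineage, "spam_filter", ["raw_users.csv"], ["clean_users.csv"],
            tool="spam-model-v2")
print(json.dumps(lineage, indent=2))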







LESSON 2 - Collecting Data

Importance of Data

Data is a first-class citizen.
Good data is key for success.
Code in Software = Data in ML.







LESSON 3 - Labeling Data



LESSON 4 - Validating Data
