LESSON 1 - Define data and establish baseline.
Why is data definition hard?
- Labels can be ambiguous
Data definition questions
- What is the input x ?
- What features need to be included?
- What is the target label y?
- How can we ensure labelers give consistent labels?
Major types of data problems
- Unstructured data: video, audio
- Structured data: spreadsheets, tables
- Small data: <= 10,000 examples
- Big data: > 10,000 examples
Unstructured vs Structured Data
Unstructured Data
- May or may not have a huge collection of unlabeled examples x
- Humans can label more data
- Data augmentation more likely to be helpful
Structured Data
- May be difficult to obtain more data
- Human labeling may not be possible (with some exceptions)
Small Data vs Big Data
Small Data (<= 10,000 examples)
- Clean labels are critical
- Can manually look through dataset and fix labels
- Can get all labelers to talk to each other.
Big Data (> 10,000 examples)
- Emphasis on data process
If you are looking for advice on a problem in a machine learning project, try to find someone who has worked in the same quadrant (unstructured or structured data, small or big data) as the problem you are trying to solve.
Small data and label consistency
Suppose you have only five examples in your dataset and the output y is quite noisy. It is then difficult to know which function you should use to map the voltage to the rotor speed in rpm.
If you had a ton of data that is just as noisy as the five-example dataset, the learning algorithm could average over the noise, and you could fit the function.
If instead you have clean and consistent labels, you can fit a function through your data quite confidently, even with only five examples.
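The three situations above can be simulated to see the effect. This is a minimal sketch, not part of the lesson: the linear relationship in `true_speed`, the noise levels, and the dataset sizes are all assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_speed(voltage):
    # Hypothetical voltage-to-rpm relationship, assumed only for this demo.
    return 100.0 * voltage + 50.0

def sample(n, noise_std):
    # n evenly spaced voltages with Gaussian label noise added to the speed.
    voltage = np.linspace(0.0, 10.0, n)
    speed = true_speed(voltage) + rng.normal(0.0, noise_std, size=n)
    return voltage, speed

def fit_line(voltage, speed):
    # Least-squares fit of speed = slope * voltage + intercept.
    slope, intercept = np.polyfit(voltage, speed, deg=1)
    return slope, intercept

small_noisy = fit_line(*sample(5, noise_std=200.0))    # five noisy examples
big_noisy = fit_line(*sample(5000, noise_std=200.0))   # same noise, much more data
small_clean = fit_line(*sample(5, noise_std=1.0))      # five clean examples

for name, (slope, intercept) in [
    ("small noisy", small_noisy),
    ("big noisy", big_noisy),
    ("small clean", small_clean),
]:
    print(f"{name}: slope={slope:.1f}, intercept={intercept:.1f}")
```

Under these assumptions, the slope recovered from the five clean examples and from the large noisy dataset should land close to the assumed value of 100, while the five noisy examples can be far off.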
Big data problems can have small data challenges too.
Problems with a large dataset but where there are rare events in the input:
Web search: large web search engine companies all have very large datasets of web search queries, but many queries are actually very rare.
Self-driving cars: the very rare occurrence of a young child running across the highway, or of a truck parked across the highway.
Product recommendation systems: if you have an online catalog of a million items, then for many products the number sold is quite small, so the amount of data you have on users interacting with those items is actually small.
=> When you have a small dataset, label consistency is critical. Even when you have a big dataset, label consistency can be very important.
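One way to check whether a big dataset hides a small data problem is to count how many items appear only a few times. The sketch below is an illustration under assumptions: `rare_item_fraction`, the `min_count` threshold, and the toy interaction log are hypothetical and not taken from the lesson.

```python
from collections import Counter

def rare_item_fraction(interactions, min_count):
    """Fraction of distinct items that appear fewer than min_count times."""
    counts = Counter(interactions)
    if not counts:
        return 0.0
    rare = [item for item, count in counts.items() if count < min_count]
    return len(rare) / len(counts)

# Hypothetical log: one popular item plus 100 items with a single interaction each.
log = ["item_0"] * 1000 + [f"item_{i}" for i in range(1, 101)]

fraction = rare_item_fraction(log, min_count=5)
print(f"{len(log)} interactions, {fraction:.0%} of items are rare")
```

Even though this toy log has over a thousand interactions, almost every distinct item has too few examples to learn from, which is where consistent labels matter most.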
Improving label consistency
- Have multiple labelers label same example
- When there is disagreement, have the ML engineer (MLE), subject matter expert (SME) and/or labelers discuss the definition of y to reach agreement.
- If labelers believe that x doesn't contain enough information, consider changing x.
- Iterate until it is hard to significantly increase agreement
- Have a class/label to capture uncertainty
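The first steps of the process above can be supported with a simple agreement check that flags the examples worth discussing. This is a rough sketch: the pairwise agreement measure, the function names, and the example labels (including the `unsure` class) are assumptions for illustration, not a prescribed method.

```python
from itertools import combinations

def agreement_rate(labels):
    """Fraction of labeler pairs that gave the same label to one example."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 1.0  # a single labeler trivially agrees with themselves
    return sum(a == b for a, b in pairs) / len(pairs)

def examples_to_discuss(labels_by_example, threshold=1.0):
    """Examples whose agreement rate falls below the threshold."""
    return [
        example
        for example, labels in labels_by_example.items()
        if agreement_rate(labels) < threshold
    ]

# Hypothetical labels from three labelers; "unsure" captures uncertainty.
labels_by_example = {
    "img_1": ["defect", "defect", "defect"],
    "img_2": ["defect", "ok", "defect"],
    "img_3": ["ok", "ok", "unsure"],
}

to_discuss = examples_to_discuss(labels_by_example)
print(to_discuss)
```

Re-running the check after each discussion round gives a rough signal for when agreement has stopped improving.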
Small Data vs Big Data (Unstructured Data)
Small Data
- Usually small number of labelers
- Can ask labelers to discuss specific labels
Big Data
- Get to consistent definition with a small group
- Then send labeling instructions to labelers
- Can consider having multiple labelers label every example and using voting or consensus labels to increase accuracy.
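Voting over multiple labelers can be sketched as a majority vote per example. The function name `consensus_label`, the `min_share` cutoff, and the choice to return `None` when no label wins a clear majority are assumptions made for this illustration.

```python
from collections import Counter

def consensus_label(votes, min_share=0.5):
    """Most common label if strictly more than min_share of votes chose it, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_share else None

# Hypothetical votes from several labelers on two examples.
clear = consensus_label(["cat", "cat", "dog"])
tied = consensus_label(["cat", "dog"])
print(clear, tied)
```

Examples that come back as `None` have no clear majority and could be sent to additional labelers or to the small group that owns the label definition.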