
20/02/2023

Processing Data

LESSON 1 - Define data and establish baseline.

Why is data definition hard?

- Labels can be ambiguous: different labelers may label the same example differently

Data definition questions

- What is the input x ?

- What features need to be included?

- What is the target label y?

- How can we ensure labelers give consistent labels?

Major types of data problems (a 2x2 grid: unstructured vs. structured, small vs. big)

- Unstructured data: video, audio

- Structured data: spreadsheet/database tables

- Small data: <= 10,000 examples

- Big data: > 10,000 examples


Unstructured vs Structured Data 

Unstructured Data

  • May or may not have a huge collection of unlabeled examples x
  • Humans can label more data
  • Data augmentation is more likely to be helpful (see the sketch after this comparison)

Structured Data

  • May be difficult to obtain more data
  • Human labeling may not be possible (with some exceptions)
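
As an illustration of augmentation for unstructured data, here is a minimal sketch that mixes background noise into an audio waveform at a chosen signal-to-noise ratio. The helper name, sample rate, and SNR value are illustrative assumptions, not from the notes:

```python
import numpy as np

def augment_with_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix background noise into a clean waveform at a target signal-to-noise ratio."""
    # Tile/trim the noise so it matches the clean signal's length.
    noise = np.resize(noise, clean.shape)
    # Scale the noise so that 10*log10(P_signal / P_noise) == snr_db.
    signal_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scaled_noise = noise * np.sqrt(target_noise_power / (noise_power + 1e-12))
    return clean + scaled_noise

# Example: a synthetic 1-second tone at 16 kHz plus random background noise.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000))
background = rng.normal(size=16000)
augmented = augment_with_noise(speech, background, snr_db=10.0)
```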

Small Data vs Big Data

Small Data  (<= 10,000 examples)

  • Clean labels are critical
  • Can manually look through dataset and fix labels
  • Can get all labelers to talk to each other.

Big Data (> 10,000 examples)

  • Emphasis on data processes


If you are looking for advice on a problem in a machine learning project, try to find someone who has worked in the same quadrant (of the 2x2 grid above) as the problem you are trying to solve.


Small data and label consistency



Suppose you have just five examples in your dataset, and the output y is quite noisy. It is then difficult to know what function you should use to map voltage to rotor speed in rpm.

If you had a ton of equally noisy data, the learning algorithm could average over the noise, and you could fit the function with reasonable confidence.

But if your labels are clean and consistent, you can confidently fit a function through your data even with only five examples.
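
A minimal sketch of this effect, assuming a linear true relationship between voltage and rpm; the slope of 300 and the noise levels are made-up values for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
true_fn = lambda v: 300.0 * v  # assumed "true" voltage -> rpm relationship

def fit_slope(n_examples: int, label_noise_std: float) -> float:
    """Fit rpm = slope * voltage by least squares and return the slope."""
    voltage = rng.uniform(1.0, 5.0, size=n_examples)
    rpm = true_fn(voltage) + rng.normal(0.0, label_noise_std, size=n_examples)
    # Closed-form least squares for a single coefficient (no intercept).
    return float(np.dot(voltage, rpm) / np.dot(voltage, voltage))

print(fit_slope(5, 200.0))     # 5 noisy labels: slope can land far from 300
print(fit_slope(5000, 200.0))  # many noisy labels: noise averages out
print(fit_slope(5, 5.0))       # 5 clean, consistent labels: close to 300
```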


Big data problems can have small data challenges too.

These are problems with a large dataset overall, but where rare events appear in the input.

Web search: large web search engine companies all have very large datasets of web search queries, but many queries are actually very rare.

Self-driving cars: the very rare occurrence of a young child running across the highway, or the very rare occurrence of a truck parked across the highway.

Product recommendation systems: if you have an online catalog of a million items, then you have a lot of products where the number sold is quite small, and so the amount of data you have on users interacting with those items is actually small.
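
A quick way to see this long tail is to count interactions per item; the toy log and the "at most 1 interaction" threshold below are illustrative assumptions:

```python
import pandas as pd

# Hypothetical interaction log: one row per (user, item) purchase event.
log = pd.DataFrame({"item_id": [1, 1, 1, 1, 2, 2, 3, 4, 5, 6]})

# Count interactions per item and look at the long tail of rare items.
counts = log["item_id"].value_counts()
rare_items = counts[counts <= 1]
print(f"{len(rare_items)} of {counts.size} items have at most 1 interaction")
```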


=> When you have a small dataset, label consistency is critical. Even when you have a big dataset, label consistency can still be very important.


Improving label consistency

  • Have multiple labelers label same example
  • When there is disagreement, have the MLE (machine learning engineer), subject matter expert (SME), and/or labelers discuss the definition of y to reach agreement.
  • If labelers believe that x doesn't contain enough information, consider changing x.
  • Iterate until it is hard to significantly increase agreement (a simple agreement metric is sketched below)
  • Have a class/label to capture uncertainty (e.g., an explicit "borderline" class)
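
One simple way to track progress while iterating is the average pairwise agreement rate between labelers (Cohen's kappa or similar would also work); the labels below are hypothetical:

```python
from itertools import combinations

def pairwise_agreement(labels_per_example: list[list[str]]) -> float:
    """Fraction of labeler pairs that agree, pooled over all examples."""
    agreements, pairs = 0, 0
    for labels in labels_per_example:
        for a, b in combinations(labels, 2):
            pairs += 1
            agreements += (a == b)
    return agreements / pairs

# Three labelers labeling four examples; "borderline" is the extra
# class suggested above for capturing genuinely ambiguous cases.
labels = [
    ["scratch", "scratch", "scratch"],
    ["scratch", "dent", "scratch"],
    ["borderline", "borderline", "scratch"],
    ["dent", "dent", "dent"],
]
print(f"pairwise agreement: {pairwise_agreement(labels):.2f}")  # 0.67
```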

Small Data vs Big Data (Unstructured Data)

Small Data

  • Usually small number of labelers
  • Can ask labelers to discuss specific labels

Big Data

  • Get to consistent definition with a small group
  • Then send the labeling instructions to labelers
  • Can consider having multiple labelers label every example and using voting or consensus labels to increase accuracy (a minimal voting sketch follows)
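
A minimal sketch of consensus labeling by majority vote; the 70% agreement threshold and the label names are assumptions, and low-agreement examples are sent back for discussion rather than force-labeled:

```python
from collections import Counter

def consensus_label(votes: list[str], min_agreement: float = 0.7):
    """Majority vote; return None (escalate for discussion) if agreement is too low."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None

print(consensus_label(["scratch", "scratch", "scratch"]))  # 'scratch'
print(consensus_label(["scratch", "dent", "borderline"]))  # None -> escalate
```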

Human-Level Performance (HLP)

In the process of measuring HLP, you may find that HLP is much less than perfect performance, i.e., much lower than 100 percent. Improving label consistency will raise HLP, and it also gives the learning algorithm cleaner, more consistent data to learn from.
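
One common way to measure HLP on a labeling task is the agreement rate between a human labeler and reference (ground-truth) labels; a minimal sketch with hypothetical labels:

```python
def human_level_performance(human_labels: list[str], ground_truth: list[str]) -> float:
    """Agreement rate between one labeler and the reference labels."""
    matches = sum(h == g for h, g in zip(human_labels, ground_truth))
    return matches / len(ground_truth)

# HLP well below 100% often signals ambiguous labeling instructions
# rather than an intrinsically hard prediction problem.
print(human_level_performance(
    ["dent", "scratch", "dent", "scratch"],
    ["dent", "scratch", "scratch", "scratch"],
))  # 0.75
```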

Obtaining Data

How long should you spend obtaining data?

Machine learning is a highly iterative process: you pick a model and hyperparameters, train on your dataset, carry out error analysis, and go around this loop multiple times to get to a good model. So rather than spending months collecting data up front, get into this iteration loop as quickly as possible with a dataset you can obtain in a short, fixed amount of time.








Data pipeline

POC (proof-of-concept) phase
- Goal is to decide if the application is workable and worth deploying. 
- Focus on getting prototype to work!
- It's ok if data pre-processing is manual. But take extensive notes/comments.

Production phase
- After project utility is established, use more sophisticated tools to make sure the data pipeline is replicable.
- E.g., TensorFlow Transform, Apache Beam, Airflow, ...
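
As a sketch of what a replicable production pipeline might look like with Apache Beam (the file paths, record format, and filtering rule are placeholder assumptions):

```python
import apache_beam as beam

def parse_record(line: str) -> dict:
    # Assumed raw CSV format: "voltage,rpm"
    voltage, rpm = line.split(",")
    return {"voltage": float(voltage), "rpm": float(rpm)}

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read raw data" >> beam.io.ReadFromText("raw_measurements.csv")
        | "Parse" >> beam.Map(parse_record)
        | "Drop bad rows" >> beam.Filter(lambda r: r["voltage"] > 0)
        | "Format" >> beam.Map(lambda r: f"{r['voltage']},{r['rpm']}")
        | "Write cleaned" >> beam.io.WriteToText("cleaned_measurements")
    )
```

Unlike ad-hoc manual pre-processing in the POC phase, every step here is named and versioned with the code, so the pipeline can be rerun identically.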



Data provenance: where the data comes from.

Data lineage: the sequence of steps needed to get to the end of the pipeline.

Metadata: data about data (e.g., when and where an example was collected, device/camera settings, labeler ID).
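
A lightweight way to record all three alongside each pipeline output is a sidecar metadata file. This sketch (file names and fields are assumptions) is far simpler than what tools like TFX's ML Metadata provide, but it shows the idea:

```python
import hashlib
import json
import time

def save_with_metadata(records: list[dict], path: str, source: str, steps: list[str]) -> None:
    """Write data plus a sidecar metadata file recording provenance and lineage."""
    payload = json.dumps(records).encode("utf-8")
    with open(path, "wb") as f:
        f.write(payload)
    metadata = {
        "provenance": source,    # where the data came from
        "lineage": steps,        # pipeline steps applied so far
        "created_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "sha256": hashlib.sha256(payload).hexdigest(),  # detect later changes
        "num_records": len(records),
    }
    with open(path + ".meta.json", "w") as f:
        json.dump(metadata, f, indent=2)

save_with_metadata(
    [{"voltage": 3.2, "rpm": 960}],
    "cleaned_measurements.json",
    source="factory sensor export 2023-02-20",
    steps=["parse_csv", "drop_bad_rows"],
)
```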







LESSON 2 - Collecting Data

Importance of Data

Data is a first-class citizen.
Good data is key to success.
Code in software = data in ML.







LESSON 3 - Labeling Data



LESSON 4 - Validating Data