Lecture 3: Data engineering [Draft]
Note: This note is outdated. For the 2022 version, please see Data Engineering Fundamentals and Creating Training Data. For the fully developed text, see the book Designing Machine Learning Systems (Chip Huyen, O’Reilly 2022).
Errata, questions, and feedback -- please send to [email protected]. Thank you!
CS 329S: Machine Learning Systems Design (cs329s.stanford.edu)
Prepared by Chip Huyen & the CS 329S course staff
Reviewed by Luke Metz
Note:
1. See the course overview and prerequisites on the lecture slides.
2. The course, including lecture slides and notes, is a work in progress. This is the first time the course is offered and the subject of ML systems design is fairly new, so we (Chip + the course staff) are all learning too. We appreciate your:
   1. enthusiasm for trying out new things,
   2. patience in bearing with things that don’t quite work, and
   3. feedback to improve the course.
________________
Table of contents
Mind vs. data
Data engineering 101
   Data sources
   Data formats
      JSON
      Row-based vs. column-based
      Slightly related: NumPy vs. Pandas
      Text vs. binary format
   OLTP (OnLine Transaction Processing) vs. OLAP (OnLine Analytical Processing)
   ETL: Extract, Transform, Load
      Structured vs. unstructured data
      ETL to ELT
   Batch processing vs. stream processing
Creating training datasets
   Labeling
      The challenges of hand labels
      Label multiplicity
      ⚠ More data isn’t always better ⚠
      Solution to label multiplicity
   How to deal with the lack of labels
      Weak supervision
      Semi-supervised
      Transfer learning
      Active learning
________________
Mind vs. data
Progress in the last decade shows that the success of an ML system depends largely on the data it was trained on. Instead of focusing on improving ML algorithms, most companies focus on managing and improving their data[1].
Despite the success of models using massive amounts of data, many are skeptical of the emphasis on data as the way forward. In the last three years, at every academic conference I attended, there were always some debates among famous academics on the power of mind (inductive biases such as intelligent architectural designs) vs. data.
In theory, you can both pursue intelligent design and leverage computation, but spending time on one often takes time away from the other[2].
In the mind-over-data camp, there’s Dr. Judea Pearl, a Turing Award winner best known for his work on causal inference and Bayesian networks. The introduction to his book “The Book of Why” is entitled “Mind over Data,” in which he emphasizes: “Data is profoundly dumb.” He also went on Twitter to warn all data-centric ML people that they might be out of a job in 3-5 years[3].
There’s also a milder opinion from Dr. Chris Manning, who’s a professor at Stanford and who’s also a great person. He argued that huge computation and a massive amount of data with a simple learning device create incredibly bad learners. Structure allows us to design systems that can learn more from less data[4].
Many people in ML today are in the data-over-mind camp. Richard Sutton wrote a great blog post in which he claimed that:
“The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. … Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.”
When asked how Google search was doing so well, Peter Norvig, Google’s Director of Search, responded: “We don’t have better algorithms. We just have more data.”[5]
The debate isn’t about whether finite data is necessary, but whether it’s sufficient. The term finite here is important, because if we had infinite data, we could just look up the answer. Having a lot of data is different from having infinite data.
Regardless of which camp is right, data is important. Dr. Monica Rogati argued in “The data science hierarchy of needs[6]” that data lies at the foundation of data science. Without data, there’s no data science.
Models are getting bigger and using more data. Back in 2013, people were getting excited when the One Billion Words Benchmark for Language Modeling was released, which contains 0.8 billion tokens[7]. Six years later, OpenAI’s GPT-2 used a dataset of 10 billion tokens. And another year later, GPT-3 used 500 billion tokens.
Dataset            Year    Tokens (M)
Penn Treebank      1993    1
Text8              2011    17
One Billion        2013    800
BookCorpus         2015    985
GPT-2 (OpenAI)     2019    10,000
GPT-3 (OpenAI)     2020    500,000
Data engineering 101
Data systems, in and of themselves, are beasts. If you haven’t spent years and years digging through the literature, it’s very easy to get lost in acronyms. There are many challenges and possible solutions, and if you look into the data stacks of different tech companies, it seems like each is doing its own thing.
In this lecture, we’ll cover the basics of data engineering. What we cover is very, very basic. If you haven’t already, we highly recommend that you take a database class.
Data sources
An ML system works with data from many different sources. One source is user-generated data, which includes inputs (e.g. phrases entered into Google Translate to be translated) and clicks (e.g. booking a trip, clicking on or ignoring a suggestion, scrolling). User-generated data can also be passive, e.g. a user ignoring a popup or spending x seconds on a page. Users tend to have little patience, so in general, user-generated data requires fast processing.
Another source is system-generated data (sometimes called machine-generated data), such as logs, metadata, and predictions made by models. Logs are generated to record the state of the system and significant events in the system for bookkeeping and debugging. They can be generated periodically and/or whenever something interesting happens.
Logs provide visibility into how the application is doing, and the main purpose of this visibility is for debugging and possibly improving the application. If you want to be alerted as soon as something abnormal happens on your system, logs should be processed as soon as they’re generated.
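To make this concrete, here’s a minimal sketch of system-generated logging (using Python’s standard logging module; the logger name, metrics, and events are made up for illustration):

import logging

# Timestamp every entry so the log can later be used for bookkeeping and debugging.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
    level=logging.INFO,
)
logger = logging.getLogger("recommender")  # hypothetical service name

# Periodic log: record the current state of the system.
logger.info("model_version=v42 qps=130 p99_latency_ms=87")

# Event-driven log: record something interesting as soon as it happens.
logger.warning("prediction_timeout user_id=123 fallback=most_popular")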
There’s also enterprise application data. A company might use various enterprise applications to manage its assets such as inventory, customer relationships, and users. These applications generate data that can be used for ML models. This data can be very large and needs to be updated frequently.
Then there’s the wonderfully weird and creepy world of third-party data. First-party data is the data that your company already collects about your users or customers. Second-party data is the data collected by another company on their own customers. Third-party data companies collect data on the general public who aren’t their customers.
The rise of the Internet and smartphones has made it much easier for all types of data to be collected. It’s especially easy with smartphones since each phone has a Mobile Advertiser ID, which acts as a unique ID to aggregate all activities on a phone. Data from apps, websites, check-in services, etc. is collected and (hopefully) anonymized to generate activity history for each person.
You can buy all types of data (e.g. social media activities, purchase history, web browsing habits, car rentals, political leaning) for different demographic groups (e.g. men, age 25-34, working in tech, living in the Bay Area). From this data, you can infer information such as people who like brand A also like brand B.
Third-party data is usually sold as structured data after being cleaned and processed by vendors.
Data formats
Once you have data, you might want to store it. How do you store multimodal data, e.g. when each sample might contain both images and text? If you’ve trained an ML model, how do you store it so it can be loaded correctly?
The process of converting a data structure or object state into a format that can be stored or transmitted and reconstructed later is data serialization. There are many, many data serialization formats[8]. The table below consists of just a few of the common formats that you might work with.
Format      Binary/Text       Human-readable?   Example use cases
JSON        Text              Yes               Everywhere
CSV         Text              Yes               Everywhere
Parquet     Binary            No                Hadoop, Amazon Redshift
Avro        Binary primary    No                Hadoop
Protobuf    Binary primary    No                Google, TensorFlow (TFRecord)
Pickle      Text, binary      No                Python, PyTorch serialization
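To make the Binary/Text column concrete, here’s a minimal sketch (with a made-up record) that serializes the same Python dict with the standard-library json module, a text format you can read in any editor, and pickle, a binary format you can’t:

import json
import pickle

sample = {"user_id": 123, "item": "sailboat", "rating": 5.0}

json_bytes = json.dumps(sample).encode("utf-8")  # text-based serialization
pickle_bytes = pickle.dumps(sample)              # Python-specific binary serialization

print(json_bytes)    # b'{"user_id": 123, "item": "sailboat", "rating": 5.0}'
print(pickle_bytes)  # b'\x80...' -- not human-readable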
JSON
JSON is probably the most ubiquitous format. It’s human-readable, language-independent (many programming languages support it), and versatile. Its key-value pair paradigm allows you to make your data as structured as you want. For example, you can have your data like this:
{
  "firstName": "Boatie",
  "lastName": "McBoatFace",
  "isVibing": true,
  "age": 12,
  "address": {
    "streetAddress": "12 Ocean Drive",
    "city": "Port Royal",
    "postalCode": "10021-3100"
  }
}
Or you can have it as a blob of text like this:
{
  "text": "Boatie McBoatFace, aged 12, is vibing, at 12 Ocean Drive, Port Royal, 10021-3100"
}
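If it helps to see the round trip in code, here’s a quick sketch with Python’s built-in json module (the record and variable names are arbitrary):

import json

text = '{"firstName": "Boatie", "address": {"city": "Port Royal", "postalCode": "10021-3100"}}'
record = json.loads(text)            # parse JSON text into nested Python dicts
print(record["address"]["city"])     # -> Port Royal
print(json.dumps(record, indent=2))  # serialize back to (pretty-printed) JSON text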
Row-based vs. column-based
There are two formats that I want to go over in detail: CSV and Parquet. CSV is row-based -- data is stored and retrieved row-by-row. Parquet is column-based -- data is stored and retrieved column by column.
If we consider each sample as a row and eac