Lecture note on the basics of data engineering, covering

* data formats (row- vs. column-based, text vs. binary)

* ETL

* batch processing vs. stream processing

* training datasets


WIP. Feedback much appreciated


Lecture 3: Data engineering [Draft]

Note: This note is outdated. For the 2022 version, please see Data Engineering Fundamentals and Creating Training Data. For the fully developed text, see the book Designing Machine Learning Systems (Chip Huyen, O’Reilly 2022).

Errata, questions, and feedback -- please send to [email protected]. Thank you!

CS 329S: Machine Learning Systems Design (cs329s.stanford.edu)

Prepared by Chip Huyen & the CS 329S course staff

Reviewed by Luke Metz


Note:

1. See the course overview and prerequisites on the lecture slides.

2. The course, including lecture slides and notes, is a work in progress. This is the first time the course is offered and the subject of ML systems design is fairly new, so we (Chip + the course staff) are all learning too. We appreciate your:

   1. enthusiasm for trying out new things

   2. patience bearing with things that don’t quite work

   3. feedback to improve the course.

________________

Table of contents

Mind vs. data
Data engineering 101
   Data sources
   Data formats
      JSON
      Row-based vs. column-based
         Slightly related: NumPy vs. Pandas
      Text vs. binary format
   OLTP (OnLine Transaction Processing) vs. OLAP (OnLine Analytical Processing)
   ETL: Extract, Transform, Load
      Structured vs. unstructured data
      ETL to ELT
   Batch processing vs. stream processing
Creating training datasets
   Labeling
      The challenges of hand labels
         Label multiplicity
         ⚠ More data isn’t always better ⚠
         Solution to label multiplicity
   How to deal with the lack of labels
      Weak supervision
      Semi-supervised
      Transfer learning
      Active learning

________________

Mind vs. data

Progress in the last decade shows that the success of an ML system depends largely on the data it was trained on. Instead of focusing on improving ML algorithms, most companies focus on managing and improving their data[1].

Despite the success of models using massive amounts of data, many are skeptical of the emphasis on data as the way forward. In the last three years, at every academic conference I attended, there were always some debates among famous academics on the power of mind (inductive biases such as intelligent architectural designs) vs. data.

In theory, you can both pursue intelligent design and leverage computation, but spending time on one often takes time away from the other[2].

In the mind-over-data camp, there’s Dr. Judea Pearl, a Turing Award winner best known for his work on causal inference and Bayesian networks. The introduction to his book “The Book of Why” is entitled “Mind over data,” in which he emphasizes: “Data is profoundly dumb.” He also went on Twitter to warn data-centric ML people that they might be out of a job in 3-5 years[3].

There’s also a milder opinion from Dr. Chris Manning, who’s a professor at Stanford and who’s also a great person. He argued that huge computation and a massive amount of data with a simple learning device create incredibly bad learners. Structure allows us to design systems that can learn more from less data[4].

Many people in ML today are on the data over mind camp. Richard Sutton wrote a great blog post in which he claimed that:

“The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. … Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.”

When asked how Google search was doing so well, Peter Norvig, Google’s Director of Search, responded: “We don’t have better algorithms. We just have more data.”[5]

The debate isn’t about whether finite data is necessary, but whether it’s sufficient. The term finite here is important, because if we had infinite data, we could just look up the answer. Having a lot of data is different from having infinite data.

Regardless of which camp is right, data is important. Dr. Monica Rogati argued in “The data science hierarchy of needs[6]” that data lies at the foundation of data science. Without data, there’s no data science.

Models are getting bigger and using more data. Back in 2013, people were getting excited when the One Billion Words Benchmark for Language Modeling was released, which contains 0.8 billion tokens[7]. Six years later, OpenAI’s GPT-2 used a dataset of 10 billion tokens. And another year later, GPT-3 used 500 billion tokens.

| Dataset | Year | Tokens (M) |
|---|---|---|
| Penn Treebank | 1993 | 1 |
| Text8 | 2011 | 17 |
| One Billion | 2013 | 800 |
| BookCorpus | 2015 | 985 |
| GPT-2 (OpenAI) | 2019 | 10,000 |
| GPT-3 (OpenAI) | 2020 | 500,000 |

Data engineering 101

Data systems, in and of themselves, are beasts. If you haven’t spent years and years digging through the literature, it’s very easy to get lost in acronyms. There are many challenges and possible solutions; if you look into the data stacks of different tech companies, it seems like each is doing its own thing.

In this lecture, we’ll cover the basics of data engineering. What we cover is very, very basic. If you haven’t already, we highly recommend that you take a database class.

Data sources

An ML system works with data from many different sources. One source is user-generated data, which includes explicit inputs (e.g. phrases entered into Google Translate to be translated) and clicks (e.g. booking a trip, clicking on or ignoring a suggestion, scrolling). User-generated data can also be passive, e.g. a user ignoring a popup or spending x seconds on a page. Users tend to have little patience, so in general, user-generated data requires fast processing.

Another source is system-generated data (sometimes called machine-generated data), such as logs, metadata, and predictions made by models. Logs are generated to record the state of the system and significant events in the system for bookkeeping and debugging. They can be generated periodically and/or whenever something interesting happens.

Logs provide visibility into how the application is doing, and the main purpose of this visibility is for debugging and possibly improving the application. If you want to be alerted as soon as something abnormal happens on your system, logs should be processed as soon as they’re generated.
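
To make this concrete, here’s a minimal sketch (not from the lecture; the service name, fields, and values are all made up) of a prediction service emitting system-generated logs with Python’s standard logging module, so each significant event carries enough context to debug later:

```python
import logging
import time

# Hypothetical example: a prediction service that logs one record per request.
logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s", level=logging.INFO)
logger = logging.getLogger("prediction_service")

def predict(features):
    start = time.time()
    prediction = 0.87  # placeholder for a real model call
    latency_ms = (time.time() - start) * 1000
    # A "significant event": one prediction, logged with context for debugging.
    logger.info("model=v2 prediction=%.2f latency_ms=%.1f n_features=%d",
                prediction, latency_ms, len(features))
    return prediction

predict({"age": 12, "is_vibing": True})
```

If you need to be alerted quickly when something goes wrong, records like these would be shipped to a log processing pipeline as they are produced rather than waiting for a periodic batch job.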

There’s also enterprise application data. A company might use various enterprise applications to manage its assets such as inventory, customer relationships, and users. These applications generate data that can be used for ML models. This data can be very large and may need to be updated frequently.

Then there’s the wonderfully weird and creepy world of third-party data. First-party data is the data that your company already collects about your users or customers. Second-party data is the data collected by another company on their own customers. Third-party data companies collect data on the general public who aren’t their customers.

The rise of the Internet and smartphones has made it much easier for all types of data to be collected. It’s especially easy with smartphones since each phone has a Mobile Advertiser ID, which acts as a unique ID to aggregate all activities on a phone. Data from apps, websites, check-in services, etc. is collected and (hopefully) anonymized to generate activity history for each person.

You can buy all types of data (e.g. social media activities, purchase history, web browsing habits, car rentals, political leaning) for different demographic groups (e.g. men, age 25-34, working in tech, living in the Bay Area). From this data, you can infer information such as people who like brand A also like brand B.

Third-party data is usually sold as structured data after being cleaned and processed by vendors.

Data formats

Once you have data, you’ll want to store it. How do you store multi-modal data, e.g. when each sample might contain both images and text? If you’ve trained an ML model, how do you store it so that it can be loaded correctly?

The process of converting a data structure or object state into a format that can be stored or transmitted and reconstructed later is data serialization. There are many, many data serialization formats[8]. The table below consists of just a few of the common formats that you might work with.

| Format | Binary/Text | Human-readable? | Example use cases |
|---|---|---|---|
| JSON | Text | Yes | Everywhere |
| CSV | Text | Yes | Everywhere |
| Parquet | Binary | No | Hadoop, Amazon Redshift |
| Avro | Binary primary | No | Hadoop |
| Protobuf | Binary primary | No | Google, TensorFlow (TFRecord) |
| Pickle | Text, binary | No | Python, PyTorch serialization |
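
As a rough illustration of the text vs. binary distinction in the table above, here’s a small sketch (the record and the printed byte values are illustrative only; it uses nothing beyond Python’s standard library) that serializes the same record as JSON (text) and pickle (binary):

```python
import json
import pickle

record = {"firstName": "Boatie", "lastName": "McBoatFace", "isVibing": True, "age": 12}

json_bytes = json.dumps(record).encode("utf-8")  # text: you can read it directly
pickle_bytes = pickle.dumps(record)              # binary: opaque bytes, needs a program to read

print(json_bytes)         # b'{"firstName": "Boatie", "lastName": "McBoatFace", ...}'
print(pickle_bytes[:16])  # opaque bytes, e.g. b'\x80...' -- not human-readable
print(len(json_bytes), len(pickle_bytes))

# Both round-trip back to the same Python object.
assert json.loads(json_bytes) == record
assert pickle.loads(pickle_bytes) == record
```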

JSON

Probably the most ubiquitous format is JSON. It’s human-readable, language-independent (many programming languages support it), and versatile. Its key-value pair paradigm allows you to make your data as structured as you want. For example, you can have your data like this:

{
  "firstName": "Boatie",
  "lastName": "McBoatFace",
  "isVibing": true,
  "age": 12,
  "address": {
    "streetAddress": "12 Ocean Drive",
    "city": "Port Royal",
    "postalCode": "10021-3100"
  }
}

Or you can have it as a blob of text like this:

{
  "text": "Boatie McBoatFace, aged 12, is vibing, at 12 Ocean Drive, Port Royal, 10021-3100"
}
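
How much structure you bake into the JSON affects what downstream code can do with it. Here’s a small sketch (using Python’s built-in json module; the ad-hoc parsing of the blob version only works because we happen to know its layout):

```python
import json

structured = json.loads("""
{
  "firstName": "Boatie",
  "lastName": "McBoatFace",
  "isVibing": true,
  "age": 12,
  "address": {"streetAddress": "12 Ocean Drive", "city": "Port Royal", "postalCode": "10021-3100"}
}
""")

blob = json.loads(
    '{"text": "Boatie McBoatFace, aged 12, is vibing, at 12 Ocean Drive, Port Royal, 10021-3100"}'
)

# With the structured version, fields are directly addressable...
print(structured["address"]["city"])  # Port Royal

# ...whereas the blob version gives back one string that still has to be parsed,
# and the parsing is brittle because it depends on the exact layout of the text.
print(blob["text"].split(", ")[-2])   # Port Royal
```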

Row-based vs. column-based

There are two formats that I want to go over in detail: CSV and Parquet. CSV is row-based -- data is stored and retrieved row-by-row. Parquet is column-based -- data is stored and retrieved column by column.

If we consider each sample as a row and each feature as a column, then row-based formats like CSV are better when we need to access data by samples, e.g. reading every feature of one example, while column-based formats like Parquet are better when we need to access data by features, e.g. reading one feature across many samples.
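
To make the access-pattern difference concrete, here’s a minimal sketch (assumes pandas plus a Parquet engine such as pyarrow is installed; the table and file names are made up) that writes the same data in both formats and reads back a single column:

```python
import pandas as pd

# A small made-up table: one row per user, one column per feature.
df = pd.DataFrame({
    "user_id": range(1_000),
    "age": [20 + i % 50 for i in range(1_000)],
    "country": ["US", "VN", "DE", "IN"] * 250,
})
df.to_csv("users.csv", index=False)       # row-based
df.to_parquet("users.parquet")            # column-based (needs pyarrow or fastparquet)

# Column-based: read just the column you need without touching the rest.
ages = pd.read_parquet("users.parquet", columns=["age"])

# Row-based: the CSV is stored row by row, so even reading one column means
# scanning every row (usecols only drops the other fields after parsing each line).
ages_csv = pd.read_csv("users.csv", usecols=["age"])
```

Row-based formats also make writes cheap: adding a new sample is just appending one more row.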
