Machine Learning Production Pipeline
Project Flow and Landscape
Chip Huyen | @chipro
Snorkel AI | snorkel.ai
07/17/2020
My background
Writing
Product
AI/ML
Cốc Cốc browser (20M+ monthly active users)
Baomoi.com (acquired by VNG)
Youth Asia (acquired by Groupon)
Table of Contents
Research vs production
Data pipeline
Modeling & training
Serving
Landscape
Slides posted on Twitter @chipro!
Research vs Production
Aspect | Research | Production
Performance | SOTA | Better than simpler models
Priority | Fast training | Fast inference
Data | Static | Constantly shifting
Fairness | Good to have (sadly) | Important
Interpretability* | Good to have | Important
Complexity | Acceptable | Impractical
Hard part | Modeling | Everything else
Research: train many times, serve few.
Production: train few times, serve many.
It’s necessary for datasets in research to be static so that we can benchmark/compare models.
ML Production Pipeline: Iterative
Project setup
Data pipeline
Modeling & training
Serving
Research: different kind of iterative
After examining the available data, you realize it’s impossible to get the data needed to solve the problem you previously defined, so you have to frame the problem differently.
After training, you realize that you need more data or need to re-label your data.
After serving, the data distribution changes and you need to add more classes.
Data Pipeline
Deep learning is driven by data
Companies with best data win
Proprietary
“Eye-off”
Machine Learning System Design (Chip Huyen, 2019)
Talent joins companies for access to unique datasets
Andrej Karpathy (2018)
Data challenges
Research: clean, static, known quirks
Production: noisy, missing values, missing labels, unprocessed, constantly changing, unknown quirks
Examples of known quirks: NaN values, known typos, known odd spellings (Gutenberg), one tokenizer known to work better than another
Data pipeline
Data availability and collection
User data*
Storage
Data preprocessing & representation
Versioning
Verification
Concerns
Privacy: What privacy concerns do users have about their data? What anonymizing methods do you want to use on their data? Can you store users’ data on your servers, or can you only access it on their devices?
Biases: What biases might be present in the data? How would you correct them? Are your data and your annotations inclusive? Will your data reinforce current societal biases?
Data availability and collection
What kind of data is available? How much?
How often does the new data come in?
Is it annotated?
If not, how hard/expensive is it to get it annotated? Do you need domain experts?
User data
What data do you need from users?
How do you collect it? Are you allowed to?
How do you get users’ feedback on the system?
How do you use that feedback?
Storage
Cloud? On-prem? Users’ devices?
Does a sample fit into memory?
Data preprocessing & representation
Feature engineering? Feature extraction?
What to do with missing data?
What to do with class imbalance?
What if train and test data come from different distributions?
How to combine multimodal data?
You can’t just feed raw data to models. Pretrained embeddings?
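As a minimal sketch of two of the preprocessing questions above (missing data and class imbalance), the snippet below fills missing numeric values with the column mean and derives inverse-frequency class weights. The function names `impute_mean` and `class_weights` are hypothetical, and both techniques are just one common choice, not something the slides prescribe.

```python
from collections import Counter

def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [x for x in column if x is not None]
    if not observed:
        raise ValueError("column has no observed values")
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in column]

def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}

ages = [20.0, None, 40.0]
labels = ["spam", "ham", "ham", "ham"]

print(impute_mean(ages))      # [20.0, 30.0, 40.0]
print(class_weights(labels))  # the rare class "spam" gets the larger weight
```

Weights like these are typically passed to a loss function so that mistakes on the rare class cost more. Whether that beats resampling depends on the data.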
Versioning
How to go back to a previous version of data?
If label schema changes, your model will be outdated.
Git doesn’t work well with large binary files (no meaningful diffs, and the repository grows quickly)
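One way to make "go back to a previous version of data" concrete is to tag every dataset snapshot with a hash of its content, so that a model can record exactly which data it was trained on. The sketch below is an assumption for illustration only: `dataset_version` is a hypothetical helper, and records are assumed to be JSON-serializable dicts.

```python
import hashlib
import json

def dataset_version(records):
    """Return a short content hash identifying this exact dataset snapshot."""
    digest = hashlib.sha256()
    for record in records:
        # sort_keys makes the serialization independent of dict key order
        digest.update(json.dumps(record, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()[:12]

v1 = dataset_version([{"text": "hi", "label": "greeting"}])
v2 = dataset_version([{"text": "hi", "label": "smalltalk"}])  # label schema changed

print(v1 != v2)  # True: a relabeled dataset gets a different version id
```

Storing the version id next to each trained model makes it possible to tell when a label schema change has left a model outdated.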
Verification
How to know that your data is correct, fair, and sufficient?
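A small part of this question can be automated. The sketch below checks only correctness (non-empty text, labels from an expected set) and sufficiency (a minimum number of rows per label). Fairness needs domain-specific analysis and is not covered. The row format and the name `verify_dataset` are assumptions made for illustration.

```python
def verify_dataset(rows, allowed_labels, min_rows_per_label):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    counts = {label: 0 for label in allowed_labels}
    for i, row in enumerate(rows):
        if not row.get("text"):
            problems.append(f"row {i}: empty or missing text")
        label = row.get("label")
        if label not in counts:
            problems.append(f"row {i}: unexpected label {label!r}")
        else:
            counts[label] += 1
    for label, count in counts.items():
        if count < min_rows_per_label:
            problems.append(
                f"label {label!r}: only {count} rows, need {min_rows_per_label}"
            )
    return problems

rows = [
    {"text": "good", "label": "pos"},
    {"text": "", "label": "neg"},
    {"text": "meh", "label": "neutral"},
]
for problem in verify_dataset(rows, ["pos", "neg"], min_rows_per_label=1):
    print(problem)
```

Running checks like these on every new batch of data, before training, catches many problems that would otherwise surface as unexplained model regressions.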
Concerns
Bias
Privacy
Regulation compliance
Data: ethical concerns
Who owns the data?
How was it collected?
Did people consent to their data being used?
Does it contain identifiable information?
Can you share the data with annotators off-prem?
Are you allowed to commercialize a model trained on it?
Modeling & Training
What is taught in most ML courses
Often the easier part*
xkcd
Model Selection
Don’t: follow buzzwords
Do: choose the simplest, not the fanciest, model that can do the job
Be solution-oriented, not technique-oriented
Everyone wants to use BERT
Baselines
Random baseline
Human baseline
Oracle
Simple heuristics
Not covered here: how to choose metrics
Don’t underestimate good heuristics
If your model’s performance is low, just choose an easier baseline (jk)
“If you think that machine learning will give you a 100% boost, then a heuristic will get you 50% of the way there.”
Martin Zinkevich, Google
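To show why baselines matter, the sketch below computes a random baseline and a very simple heuristic (always predict the most common class) on an imbalanced toy dataset. The data and function names are made up for illustration. A model reporting 90% accuracy here would be no better than the heuristic.

```python
import random
from collections import Counter

def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def random_baseline(labels, classes, seed=0):
    """Accuracy of guessing a class uniformly at random for every example."""
    rng = random.Random(seed)
    return accuracy([rng.choice(classes) for _ in labels], labels)

def majority_baseline(labels):
    """Accuracy of always predicting the most common class."""
    most_common_label, _ = Counter(labels).most_common(1)[0]
    return accuracy([most_common_label] * len(labels), labels)

labels = ["ham"] * 9 + ["spam"]  # 90% of examples are "ham"

print(majority_baseline(labels))                 # 0.9
print(random_baseline(labels, ["ham", "spam"]))  # varies with the seed
```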
Deep Learning Catch-22
Need data to develop a model
Can’t collect data without a model
Deep Learning in Production Catch-22
Want to test DL potential without much investment
Can’t get good performance without $$/time in data labeling
Solution
Weakly-supervised (Snorkel AI)
Unsupervised (moonshot)
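The idea behind weak supervision can be sketched with a few hand-written labeling functions whose noisy votes are combined into a training label. The code below is not Snorkel's API. A plain majority vote stands in for the step that combines the votes, and the three heuristics are invented for illustration.

```python
from collections import Counter

ABSTAIN = None

def lf_contains_free(text):
    return "spam" if "free" in text.lower() else ABSTAIN

def lf_contains_meeting(text):
    return "ham" if "meeting" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    return "spam" if text.count("!") >= 3 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_free, lf_contains_meeting, lf_many_exclamations]

def weak_label(text):
    """Majority vote over non-abstaining functions; ABSTAIN on no votes or a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    counts = Counter(v for v in votes if v is not ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting heuristics, no clear winner
    return ranked[0][0]

print(weak_label("FREE prize!!!"))   # spam
print(weak_label("Meeting at 3pm"))  # ham
print(weak_label("hello"))           # None
```

Labels produced this way are noisy, but they let a team test whether deep learning is promising before paying for large-scale manual annotation.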
Debugging
Peak of my career
Why debugging for ML is hard
Blackbox (can’t debug a program if you don’t understand it)
Invisible bugs
Many factors can cause a model to perform poorly
Reasons a model performs poorly
Theoretical constraints
wrong assumptions
poor model/data fit
Poor implementation
Sloppy training techniques
calling model.train() instead of model.eval() during evaluation
Poor choice of hyperparameters
one set of hyperparameters can give SOTA results while another doesn’t converge
random seed
Data problems
mismatched inputs/labels
over-preprocessed data
noisy labels
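The train/eval mix-up listed under sloppy training techniques is an example of an invisible bug: nothing crashes, but predictions become noisy. The toy class below is a pure-Python stand-in for a dropout layer, not a real framework class. It only illustrates why the mode switch matters.

```python
import random

class ToyDropout:
    """Stand-in for a dropout layer with a train/eval switch."""

    def __init__(self, p=0.5, seed=0):
        self.p = p
        self.training = True
        self.rng = random.Random(seed)

    def train(self):
        self.training = True

    def eval(self):
        self.training = False

    def __call__(self, xs):
        if not self.training:
            return list(xs)  # eval mode: identity, fully deterministic
        # train mode: zero each value with probability p, rescale the survivors
        scale = 1.0 / (1.0 - self.p)
        return [0.0 if self.rng.random() < self.p else x * scale for x in xs]

layer = ToyDropout(p=0.5)
x = [1.0, 2.0, 3.0, 4.0]

layer.eval()
print(layer(x))  # [1.0, 2.0, 3.0, 4.0]

layer.train()
print(layer(x))  # each value is randomly zeroed or doubled
```

Evaluating with the layer still in train mode would measure the model plus random noise, which typically shows up only as slightly worse and unstable metrics.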
Scaling is crucial as models are ...
Becoming bigger: model can’t fit in memory -> model parallelism
Using more data: data can’t fit in memory -> data parallelism
Using more GPUs: large batch size, stale gradients
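Data parallelism can be sketched in a few lines: each worker computes a gradient on its own shard of the batch, and the gradients are averaged before the weight update. The example below uses a one-parameter linear model and runs the "workers" sequentially. Real systems run them on separate devices, and the function names here are hypothetical.

```python
def gradient(w, batch):
    """d/dw of mean squared error for the 1-D model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_gradient(w, batch, num_workers):
    """Split the batch into equal shards, compute one gradient per shard, average."""
    if len(batch) % num_workers != 0:
        raise ValueError("this sketch assumes equal-sized shards")
    shard_size = len(batch) // num_workers
    shards = [batch[i:i + shard_size] for i in range(0, len(batch), shard_size)]
    worker_grads = [gradient(w, shard) for shard in shards]  # one per device
    return sum(worker_grads) / num_workers  # the averaging (all-reduce) step

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.5

print(gradient(w, batch))                   # -22.5
print(data_parallel_gradient(w, batch, 2))  # -22.5
```

With equal-sized shards and synchronous averaging, the result matches the single-device gradient. Stale gradients appear when workers update asynchronously, which this sketch does not model.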
LARS - Layer-wise Adaptive Rate Scaling
Training with large batch size
Training Deep Networks with Stochastic Gradient Normalized by Layerwise Adaptive Second Moments (Boris Ginsburg et al., 2019)
a single DGX-1 with 8 NVIDIA V100 GPUs
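As a rough sketch of the layer-wise idea behind LARS, each layer gets its own learning-rate multiplier proportional to the ratio of its weight norm to its gradient norm, so that no layer takes a step that is large relative to its weights. The code below follows the commonly cited form of that rule and omits momentum. The default `trust_coef` value and the zero-norm fallback are assumptions of this sketch.

```python
import math

def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))

def lars_local_lr(weights, grads, trust_coef=0.001, weight_decay=0.0):
    """Per-layer multiplier: trust_coef * ||w|| / (||g|| + weight_decay * ||w||)."""
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    denom = g_norm + weight_decay * w_norm
    if w_norm == 0.0 or denom == 0.0:
        return 1.0  # fallback: no layer-wise scaling (assumption of this sketch)
    return trust_coef * w_norm / denom

def lars_step(weights, grads, global_lr, trust_coef=0.001, weight_decay=0.0):
    """One plain SGD step for a single layer, scaled by its local multiplier."""
    local_lr = lars_local_lr(weights, grads, trust_coef, weight_decay)
    return [
        w - global_lr * local_lr * (g + weight_decay * w)
        for w, g in zip(weights, grads)
    ]

layer_weights = [3.0, 4.0]      # norm 5
small_grads = [0.6, 0.8]        # norm 1
large_grads = [60.0, 80.0]      # norm 100

print(lars_local_lr(layer_weights, small_grads))
print(lars_local_lr(layer_weights, large_grads))  # 100x smaller multiplier
```

A layer with unusually large gradients is automatically slowed down, which is what makes very large batch sizes (and the large global learning rates they require) more stable.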
Serving
Model compression
Large models are slow/costly for real-time inference
Mobile/edge devices
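One widely used compression technique is quantization: storing weights as 8-bit integers plus a scale factor instead of 32-bit floats. The sketch below shows symmetric linear quantization on a plain list of floats. It is only one of several compression approaches, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric linear quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from integers and the scale factor."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)  # [50, -127, 1, 100]
print(max(abs(a - b) for a, b in zip(weights, restored)))  # small rounding error
```

The integers take a quarter of the storage of 32-bit floats, at the cost of a rounding error of at most half the scale per weight.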
Model compatibility
Framework used in development might not be compatible with consumer devices
CI/CD
ML tests take a long time
Monitoring & analysis
When to update your model?
How?
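One possible signal for "when to update" is drift between what the model saw at training time and what it sees in production. The sketch below compares the distribution of predicted classes using total variation distance and flags retraining when it exceeds a threshold. The metric, the threshold of 0.1, and the function names are all assumptions chosen for illustration.

```python
from collections import Counter

def distribution(items):
    """Relative frequency of each distinct item."""
    counts = Counter(items)
    total = len(items)
    return {key: count / total for key, count in counts.items()}

def total_variation_distance(p, q):
    """Half the sum of absolute differences between two distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def needs_retraining(train_items, live_items, threshold=0.1):
    drift = total_variation_distance(distribution(train_items), distribution(live_items))
    return drift > threshold

train = ["cat"] * 50 + ["dog"] * 50
live = ["cat"] * 30 + ["dog"] * 50 + ["bird"] * 20  # a new class has appeared

print(needs_retraining(train, live))   # True
print(needs_retraining(train, train))  # False
```

A check like this runs on a schedule against recent traffic. A new class appearing, as in the example, is the kind of shift that requires new labels rather than just retraining on more of the same data.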
Landscape
What I learned from looking at 200 machine learning tools (huyenchip.com, 2020)
https://huyenchip.com/2020/06/22/mlops.html
Thank you!