2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber

Michelangelo Palette
Feature Engineering @ Uber
Amit Nene, Staff Engineer, Michelangelo ML Platform
Eric Chen, Engineering Manager, Michelangelo ML Platform

Enable engineers and data scientists across the
company to easily build and deploy machine learning
solutions at scale.
ML-as-a-service
○ Managing Data/Features
○ Tools for managing end-to-end, heterogeneous training workflows
○ Batch, online & mobile serving
○ Feature and Model drift monitoring
Michelangelo @ Uber
MANAGE DATA
TRAIN MODELS
EVALUATE MODELS
DEPLOY MODELS
MAKE PREDICTIONS
MONITOR PREDICTIONS

Feature Engineering @ Uber
○ Example: ETA for EATS order
○ Key ML features
○ How large is the order?
○ How busy is the restaurant?
○ How quick is the restaurant?
○ How busy is the traffic?

Managing Features
One of the hardest problems in ML
○ Finding good Features & labels
○ Data in production: reliability, scale, low latency
○ Data parity: training/serving skew
○ Real-time features: traditional tools don’t work

Palette Feature Store
Uber-specific curated and crowd-sourced feature database that is easy to use with machine
learning projects.
One stop shop
○ Search for features in single catalog/spec: rider, driver, restaurant, trip, eaters, etc.
○ Define new features + create production pipelines from spec
○ Share features across Uber: cut redundancy, use consistent data
○ Enable tooling: Data Drift Detection, Auto Feature Selection, etc.

Feature Store Organization
Organized as <entity>:<feature-group>:<feature-name>:<join-key>
E.g. @palette:restaurant:realtime_group:orders_last_30min:restaurant_uuid
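The naming scheme above can be taken apart mechanically. A minimal sketch, assuming every reference carries the "@palette:" prefix shown in the example followed by exactly four colon-separated fields; FeatureRef and parse_feature_ref are hypothetical names for illustration, not part of Palette:

```python
from typing import NamedTuple


class FeatureRef(NamedTuple):
    """Hypothetical container for the four fields of a Palette reference."""
    entity: str
    feature_group: str
    feature_name: str
    join_key: str


def parse_feature_ref(ref: str) -> FeatureRef:
    """Split '@palette:<entity>:<feature-group>:<feature-name>:<join-key>'."""
    prefix = "@palette:"
    if not ref.startswith(prefix):
        raise ValueError(f"not a Palette reference: {ref!r}")
    parts = ref[len(prefix):].split(":")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected 4 non-empty fields, got: {ref!r}")
    return FeatureRef(*parts)


ref = parse_feature_ref(
    "@palette:restaurant:realtime_group:orders_last_30min:restaurant_uuid"
)
print(ref.entity, ref.feature_group, ref.feature_name, ref.join_key)
```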
Backed by a dual datastore system, similar to a lambda architecture
○ Offline
  ○ Hive-based store for bulk access to features
  ○ Bulk retrieval of features across time
○ Online
  ○ KV store (Cassandra) for serving the latest known value
  ○ Supports lookup/join of latest feature values in real time
○ Data synced between online & offline
  ○ Key to avoiding training/serving skew
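The offline/online split can be illustrated with a toy in-memory model. This is only a sketch of the idea (full history for bulk reads, latest value for lookups, one write path feeding both); DualFeatureStore is a hypothetical class and says nothing about how Palette actually syncs Hive and Cassandra:

```python
class DualFeatureStore:
    """Toy stand-in: a history list per feature ('offline') plus a latest-value map ('online')."""

    def __init__(self):
        self.offline = {}  # (key, feature) -> [(timestamp, value), ...]
        self.online = {}   # (key, feature) -> (timestamp, value)

    def write(self, key, feature, ts, value):
        # A single write path feeds both stores, so they cannot drift apart.
        self.offline.setdefault((key, feature), []).append((ts, value))
        latest = self.online.get((key, feature))
        if latest is None or ts >= latest[0]:
            self.online[(key, feature)] = (ts, value)

    def latest(self, key, feature):
        """Online path: latest known value, or None if never written."""
        entry = self.online.get((key, feature))
        return None if entry is None else entry[1]

    def history(self, key, feature):
        """Offline path: every value across time, ordered by timestamp."""
        return sorted(self.offline.get((key, feature), []))


store = DualFeatureStore()
store.write("uuid1", "prepTime", ts=2, value=25)
store.write("uuid1", "prepTime", ts=1, value=20)  # late-arriving older value
print(store.latest("uuid1", "prepTime"))
print(store.history("uuid1", "prepTime"))
```

Training would read history(), serving would read latest(); because both come from the same writes, the value served at time t is also the value a training join sees for time t.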

EATS Features revisited
○ How large is the order? ← Input
○ How busy is the restaurant?
○ How quick is the restaurant?
○ How busy is the traffic?

Creating Batch Features
[Diagram: Palette feature spec (Hive QL) → offline batch jobs → Feature Store, with an Offline Store (training) and an Online Store (serving) linked by data dispersal; a features join feeds the model training job, and a features join feeds the model scoring service.]
General trends, not sensitive to exact time of event
Ingested from Hive queries or Spark jobs
How quick is the restaurant?
○ Aggregate trends
○ Use Hive QL from warehouse
○ @palette:restaurant:batch_aggr:prepTime:rId
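A batch feature such as prepTime is essentially a grouped aggregate over warehouse rows. The slides express this in Hive QL; the sketch below shows the same GROUP BY / AVG idea in plain Python. The function name and the sample rows are made up for illustration:

```python
from collections import defaultdict


def avg_prep_time(orders):
    """Average prep time per restaurant, like GROUP BY rId with AVG(prep_minutes)."""
    totals = defaultdict(lambda: [0.0, 0])  # rId -> [sum of minutes, row count]
    for rid, minutes in orders:
        acc = totals[rid]
        acc[0] += minutes
        acc[1] += 1
    return {rid: total / count for rid, (total, count) in totals.items()}


# Hypothetical warehouse rows: (restaurant id, observed prep time in minutes)
orders = [("uuid1", 18), ("uuid1", 22), ("uuid2", 15)]
prep_time = avg_prep_time(orders)
print(prep_time)
```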
Apache Hive and Apache Spark are either registered trademarks or
trademarks of the Apache Software Foundation in the United
States and/or other countries. No endorsement by The Apache
Software Foundation is implied by the use of this mark.

Creating Real-time Features
[Diagram: Palette feature spec (Flink SQL) → Flink-as-service streaming jobs → Feature Store, with an Online Store (serving) and an Offline Store (training) linked by log + backfill; a features join feeds the model scoring service, and a features join feeds the model training job.]
Features reflecting the latest state of the world
Ingest from streaming jobs
How busy is the restaurant?
○ Kafka topic with events
○ Perform real-time aggregations
○ @palette:restaurant:rt_aggr:nMeal:rId
Apache Flink is either a registered trademark or trademark of the
Apache Software Foundation in the United States and/or other
countries. No endorsement by The Apache Software Foundation is
implied by the use of this mark.
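The real-time aggregation behind a feature like nMeal can be pictured as a sliding-window count per restaurant. The slides do this with Flink SQL over a Kafka topic; the sketch below is a plain-Python analogue. SlidingWindowCounter is a hypothetical class, and it assumes events and queries arrive in non-decreasing timestamp order:

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts events per key over the trailing `window_seconds`."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = defaultdict(deque)  # rId -> timestamps, oldest first

    def add(self, rid, ts):
        self.events[rid].append(ts)

    def count(self, rid, now):
        q = self.events[rid]
        # Evict events that fell out of the window (now - window, now].
        while q and q[0] <= now - self.window:
            q.popleft()
        return len(q)


counter = SlidingWindowCounter(window_seconds=30 * 60)
for ts in (0, 600, 1500):          # three orders for the same restaurant
    counter.add("uuid1", ts)
print(counter.count("uuid1", now=1700))
print(counter.count("uuid1", now=2000))
```

At now=2000 the order at ts=0 is more than 30 minutes old and is dropped from the count.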

Bring Your Own Features
Features maintained by customers
Mechanisms for hooking serving/training endpoints
Users maintain data parity
How busy is the region?
○ RPC: external traffic feed
○ Log RPCs for training
○ @palette:region:traffic:nBusy:regionId
[Diagram: Palette feature spec; an Online Proxy issues RPCs to a service endpoint, and its features join feeds the model scoring service; an Offline Proxy reads a custom store through a batch API, and its features join feeds the model training job.]
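The "log RPCs for training" idea can be sketched as a thin proxy: each online lookup calls the external service and records what it returned, and the offline side replays that log. LoggingFeatureProxy and fake_traffic_feed are hypothetical stand-ins, not Palette APIs:

```python
class LoggingFeatureProxy:
    """Wraps an external lookup and logs every answer for later training use."""

    def __init__(self, rpc_call):
        self.rpc_call = rpc_call
        self.log = []  # (region_id, timestamp, value) rows

    def online_lookup(self, region_id, ts):
        value = self.rpc_call(region_id)
        # Logging exactly what was served keeps training data in parity with serving.
        self.log.append((region_id, ts, value))
        return value

    def offline_rows(self):
        """Training path: the logged values, oldest first."""
        return list(self.log)


def fake_traffic_feed(region_id):
    # Stand-in for the external traffic RPC; the values are made up.
    return {"r1": 7, "r2": 3}.get(region_id, 0)


proxy = LoggingFeatureProxy(fake_traffic_feed)
print(proxy.online_lookup("r1", ts=100))
print(proxy.online_lookup("r2", ts=101))
print(proxy.offline_rows())
```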

Palette Feature Joins
Join @basis features with supplied @palette features into single feature vector
Join billion+ rows at points-in-time: dominates overhead
Join/Lookup 10s of tables at serving time at low latency
@basis features
  order_id  nOrder  restaurant_uuid  latlong  Label ETA (training)  timestamp
  1         4       uuid1            (10,20)  40m                   t1
  2         3       uuid2            (30,40)  35m                   t2
@palette:restaurant:agg_stats:prepTime:restaurant_uuid (join_key = rId)
  rId    prepTime  timestamp
  uuid1  20m       t1
  uuid2  15m       t2
Time + Key join: training/scoring feature vector
  order_id  rId    latlong  Label ETA (training)  prepTime  ..
  1         uuid1  (10,20)  40m                   20m       ..
  2         uuid2  (30,40)  35m                   15m       ..
Other @palette features:
  @palette:restaurant:realtime_stats:nBusy:rId
  @palette:region:stats:nBusy:regionId
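A "Time + Key" join can be sketched as follows. The sketch assumes the join picks, for each basis row, the latest feature value recorded at or before that row's timestamp for the same key; the slides do not spell out the exact rule, and the function name and numeric timestamps are illustrative:

```python
from bisect import bisect_right
from collections import defaultdict


def point_in_time_join(basis_rows, feature_rows, key, feature):
    """Attach to each basis row the latest feature value at or before its timestamp."""
    by_key = defaultdict(list)
    for row in feature_rows:
        by_key[row[key]].append((row["timestamp"], row[feature]))
    index = {}
    for k, entries in by_key.items():
        entries.sort(key=lambda e: e[0])
        index[k] = ([ts for ts, _ in entries], [v for _, v in entries])

    joined = []
    for row in basis_rows:
        timestamps, values = index.get(row[key], ([], []))
        pos = bisect_right(timestamps, row["timestamp"])
        joined.append({**row, feature: values[pos - 1] if pos else None})
    return joined


basis = [
    {"order_id": 1, "nOrder": 4, "rId": "uuid1", "timestamp": 100},
    {"order_id": 2, "nOrder": 3, "rId": "uuid2", "timestamp": 200},
]
prep_time = [
    {"rId": "uuid1", "prepTime": 20, "timestamp": 90},
    {"rId": "uuid1", "prepTime": 30, "timestamp": 150},  # later than order 1, so ignored for it
    {"rId": "uuid2", "prepTime": 15, "timestamp": 200},
]
vectors = point_in_time_join(basis, prep_time, key="rId", feature="prepTime")
for v in vectors:
    print(v)
```

Ignoring values newer than the basis row is what keeps future information from leaking into training examples.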

Done with Feature Engineering?
○ Feature Store Features
  ○ nOrder: How large is the order? (basis)
  ○ nMeal: How busy is the restaurant? (near real-time)
  ○ prepTime: How quick is the restaurant? (batch feature)
  ○ nBusy: How busy is the traffic? (external feature)
○ Ready to use?
  ○ Model specific feature transformations
  ○ Chaining of features
  ○ Feature Transformers

Feature Consumption
Feature Store Features
○ nOrder: input feature
○ nMeal: consume directly
○ prepTime: needs transformation before use
○ nBusy: input latlong but need regionId
Setting up consumption pipelines
○ nMeal: r_id -> nMeal
○ prepTime: r_id -> prepTime -> featureImpute
○ nBusy: r_id -> lat, log -> regionId(lat, log) -> nBusy
In arbitrary order

Michelangelo Transformers
Transformer: Given a record defined by a set of fields, add/modify/remove fields in the record
PipelineModel: A sequence of transformers
Spark ML: Transformer/PipelineModel on DataFrames
Michelangelo Transformers: extended transformer framework for both Apache Spark and
Apache Spark-less environments
Estimator: Analyze the data and produce a transformer
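The three roles can be shown on plain dict records. This is a minimal sketch of the concepts only, not the Spark ML or Michelangelo classes; FillMissing, DropField, MeanImputer and this PipelineModel are hypothetical names:

```python
class FillMissing:
    """Transformer: modifies a field when it is missing."""

    def __init__(self, field, fill_value):
        self.field, self.fill_value = field, fill_value

    def transform(self, record):
        out = dict(record)
        if out.get(self.field) is None:
            out[self.field] = self.fill_value
        return out


class DropField:
    """Transformer: removes a field."""

    def __init__(self, field):
        self.field = field

    def transform(self, record):
        return {k: v for k, v in record.items() if k != self.field}


class MeanImputer:
    """Estimator: analyzes the data and produces a FillMissing transformer."""

    def __init__(self, field):
        self.field = field

    def fit(self, records):
        values = [r[self.field] for r in records if r.get(self.field) is not None]
        if not values:
            raise ValueError(f"no observed values for {self.field!r}")
        return FillMissing(self.field, sum(values) / len(values))


class PipelineModel:
    """A sequence of transformers applied in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, record):
        for stage in self.stages:
            record = stage.transform(record)
        return record


training = [{"prepTime": 20}, {"prepTime": 10}, {"prepTime": None}]
model = PipelineModel([MeanImputer("prepTime").fit(training), DropField("debug")])
print(model.transform({"prepTime": None, "debug": 1}))
```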

Defining a Pipeline Model
Join Palette Features
Apply Feature Eng Rules
String Indexing
One-Hot Encoding
DL Inferencing
Result Retrieval
Feature consumption
○ Feature extraction: Palette feature retrieval expressed as a transform
○ Feature mutation: Scala-like DSL for simple transforms
○ Model-centric Feature Engineering: string indexer, one-hot encoder,
threshold decision
○ Result retrieval
Modeling
○ Model inferencing (also Michelangelo Transformer)
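String indexing and one-hot encoding, two of the model-centric steps listed above, can be sketched in a few lines. This is a plain-Python illustration with hypothetical helper names, not the Spark ML StringIndexer or OneHotEncoder; here labels are ordered by frequency, with ties broken alphabetically:

```python
from collections import Counter


def fit_string_indexer(values):
    """Map each distinct string to an index, most frequent first."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index, size):
    """Vector of length `size` with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


cuisines = ["pizza", "sushi", "pizza", "thai"]  # made-up categorical column
index = fit_string_indexer(cuisines)
encoded = [one_hot(index[c], len(index)) for c in cuisines]
print(index)
print(encoded)
```

The indexer plays the estimator role (it learns the mapping from data), while applying the mapping and the encoding are pure transforms.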

Michelangelo Transformers Example
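import java.util
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.Params
import org.apache.spark.ml.util.{DefaultParamsWritable, MLWritable}
import org.apache.spark.sql.{DataFrame, Dataset}
// MyModelParam and MATransformer are Michelangelo-side traits, not part of Spark.
class MyModel(override val uid: String) extends Model[MyModel]
    with MyModelParam with MLWritable with MATransformer {
  ...
  // Dataset-level transform (Spark).
  override def transform(dataset: Dataset[_]): DataFrame = ...
  // Single-record scoring, presumably for Spark-less serving.
  override def scoreInstance(instance: util.Map[String, Object]): util.Map[String, Object] = ...
}
// Estimator is parameterized by the model type that fit() returns.
class MyEstimator(override val uid: String) extends Estimator[MyModel]
    with Params with DefaultParamsWritable {
  ...
  override def fit(dataset: Dataset[_]): MyModel = ...
}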

Palette retrieval as a Transformer
Palette Feature Transformer
Feature Meta Store
RPC Feature Proxy
Cassandra Access
Hive Access
tx_p1 = PaletteTransformer([
"@palette:restaurant:realtime_feature:nMeal:r_id",
"@palette:restaurant:batch_feature:prepTime:r_id",
"@palette:restaurant:property:lat:r_id",
"@palette:restaurant:property:log:r_id"
])
tx_p2 = PaletteTransformer([
"@palette:region:service_feature:nBusy:region_id"
])

DSL Estimator / Transformer
DSL Estimator
Code Gen / Compiler
DSL Transformer
Online classloader Offline classloader
es_dsl1 = DSLEstimator(lambdas = [
    ["region_id", "regionId(@palette:restaurant:property:lat:r_id, @palette:restaurant:property:log:r_id)"]
])
es_dsl2 = DSLEstimator(lambdas = [
    ["prepTime", 'nFill(nVal("@palette:restaurant:batch_feature:prepTime:r_id"), avg("@palette:restaurant:batch_feature:prepTime:r_id"))'],
    ["nMeal", 'nVal("@palette:restaurant:realtime_feature:nMeal:r_id")'],
    ["nOrder", 'nVal("@basis:nOrder")'],
    ["nBusy", 'nVal("@palette:region:service_feature:nBusy:region_id")']
])

Uber Eats Example Cont.
Computation order
○ nMeal: rId -> nMeal
○ prepTime: rId -> prepTime -> featureImpute
○ nBusy: rId -> lat, log -> regionId(lat, log) -> nBusy
Palette Transformer
  id -> nMeal
  id -> prepTime
  id -> lat, log
DSL Transformer
  lat, log -> regionID
Palette Transformer
  regionID -> nBusy
DSL Transformer
  impute(nMeal)
  impute(prepTime)
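The four-stage order above matters because each stage consumes fields produced by an earlier one. A small sketch with in-memory stand-ins: the lookup tables, the region bucketing rule, and the imputation constant are all made up for illustration and do not reflect Palette's actual data or DSL functions:

```python
# Hypothetical stand-ins for Palette lookups.
RESTAURANT = {"uuid1": {"nMeal": 5, "prepTime": None, "lat": 10.0, "log": 20.0}}
REGION = {"region_1_2": {"nBusy": 7}}
AVG_PREP_TIME = 18.0  # assumed average learned at training time


def palette_restaurant(rec):
    # Stage 1: rId -> nMeal, prepTime, lat, log
    return {**rec, **RESTAURANT[rec["rId"]]}


def dsl_region_id(rec):
    # Stage 2: lat, log -> regionId (made-up bucketing into 10-degree cells)
    cell = f"region_{int(rec['lat'] // 10)}_{int(rec['log'] // 10)}"
    return {**rec, "regionId": cell}


def palette_region(rec):
    # Stage 3: regionId -> nBusy
    return {**rec, **REGION[rec["regionId"]]}


def dsl_impute(rec):
    # Stage 4: fill a missing prepTime with the training-time average
    out = dict(rec)
    if out["prepTime"] is None:
        out["prepTime"] = AVG_PREP_TIME
    return out


STAGES = [palette_restaurant, dsl_region_id, palette_region, dsl_impute]


def run(rec):
    for stage in STAGES:
        rec = stage(rec)
    return rec


vector = run({"rId": "uuid1", "nOrder": 4})
print(vector)
```

Swapping stages 2 and 3 would fail with a KeyError, since nBusy cannot be looked up before regionId exists.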

Dev Tools: Authoring and Debugging a Pipeline
Palette feature generation
● Apache Hive QL, Apache Flink SQL
Interactive authoring
● PySpark + iPython Jupyter notebook
Centralized model store
● Serialization / Deserialization (Spark ML, MLReadable/MLWritable)
● Online and offline accessibility
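from pyspark.ml import Pipeline
basis_feature_sql = "..."
df = spark.sql(basis_feature_sql)
# vec_asm and l_r are not defined on the slides; presumably a vector assembler and a regression stage
pipeline = Pipeline(stages=[tx_p1, es_dsl1, tx_p2, es_dsl2, vec_asm, l_r])
pipeline_model = pipeline.fit(df)
scored_df = pipeline_model.transform(df)
# MA_store and MA_API: Michelangelo model store and training API
model_id = MA_store.save_model(pipeline_model)
draft_id = MA_store.save_pipeline(basis_feature_sql, pipeline)
retrain_job = MA_API.train(draft_id, new_basis_feature_sql)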

Takeaways
Feature Store: Batch, Realtime and External Features with online and offline parity
Offline scalability: Joins across billions of rows
Online serving latency: Parallel IO, fast storage with caching
Feature Transformers: Set up chains of transformations at training/serving time
Pipeline reliability and monitoring out-of-the-box

Thank you
https://eng.uber.com/michelangelo/
https://www.uber.com/careers/
