Emma Tang, Neustar
Optimal Strategies for Large-Scale Batch ETL Jobs
#EUDev3 October, 2017
https://www.neustar.biz/marketing
Neustar
• Help the world’s most valuable brands understand and target their consumers, both online and offline
• Maximize ROI on ad spend
• Billions of user events per day, petabytes of data
Architecture (simplified view)
Batch ETL
• Runs on a schedule or is triggered programmatically
• Aim for complete utilization of cluster resources, especially memory and CPU
Why Batch?
• We care about historical state
• We have no SLA other than 1-3x daily delivery
• Efficient, tuned, optimal use of resources; cost efficiency
What we will talk about today
• Issues at scale
• Skew
• Optimizations
• Ganglia
The attribution problem
• At Neustar, we process large quantities of ad events
• Familiar events like impressions, clicks, and conversions
• Which impression/click contributed to a conversion?
Example attribution
• Alice goes to her favorite news site and sees 3 ads – impressions
• She clicks on one of them, which leads to Macy’s – click
• She buys something on Macy’s – conversion
• Her purchase can be attributed to the click and impression events
The approach
• Join conversions with impressions and clicks on userId
• Go through each user and attribute conversions to the correct target event (impressions/clicks)
• The latest target events are valued more, so timestamps matter
The scale
• Impressions: 250 billion
• Clicks: 20 billion
• Conversions: 50 billion
• Join 50 billion x 250 billion
[Diagram: relative volumes of impressions, clicks, and conversions]
What we will talk about today
• Issues at scale
• Skew
• Optimizations
• Ganglia
Driver OOM
Exception in thread "map-output-dispatcher-12" java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
at java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:615)
at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:614)
at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:614)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1287)
at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:617)
Driver OOM
• An array of MapStatuses of size m; each status contains info about how that map output block is used by each of the n reducers
• m (mappers) x n (reducers) entries in total
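To see why this blows up, a back-of-envelope calculation already crosses Java's ~2 GB byte-array limit at our partition counts. This is a deliberate simplification (one bit per mapper/reducer pair), not Spark's actual MapStatus encoding, which compresses much more aggressively:

```java
// Rough estimate of map-status tracking cost: one bit per (mapper, reducer) pair.
// Worst-case sketch only; Spark's HighlyCompressedMapStatus stores averages plus
// a compressed bitmap of empty blocks, so the real payload is smaller.
public class MapStatusEstimate {
    // bytes needed if we track one bit per (mapper, reducer) pair
    static long bitmapBytes(long mappers, long reducers) {
        return mappers * reducers / 8;
    }

    public static void main(String[] args) {
        long before = bitmapBytes(300_000L, 75_000L); // 300k map x 75k reduce partitions
        long after  = bitmapBytes(100_000L, 75_000L); // after reducing map partitions
        System.out.println("before: " + before / 1_000_000_000.0 + " GB"); // ~2.8 GB, past the ~2 GB array limit
        System.out.println("after:  " + after  / 1_000_000_000.0 + " GB"); // ~0.94 GB, fits
    }
}
```

Even this crude model shows the fix on the next slide: shrinking m (or n) pulls the serialized payload back under the limit.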
[Diagram: Status1 → (reducer1, reducer2); Status2 → (reducer1, reducer2)]
Driver OOM
• Two types of MapStatus: HighlyCompressedMapStatus vs. CompressedMapStatus
• HighlyCompressedMapStatus tracks the average reduce-partition size, with a bitmap tracking which blocks are empty for each reducer
Driver OOM
• Reduce the number of partitions on either side
• 300k x 75k → 100k x 75k
Disable unnecessary GC
• spark.cleaner.periodicGC.interval
• GC cycles “stop the world”
• Large heaps mean longer GC pauses
• Set it to a long period (e.g. twice the length of your job)
Disable unnecessary GC
• ContextCleaner uses weak references to keep track of every RDD, ShuffleDependency, and Broadcast, and registers when the objects go out of scope
• periodicGCService is a single-thread executor service that calls the JVM garbage collector periodically
Allow extra time
• spark.rpc.askTimeout
• spark.network.timeout
• During GC pauses on a heap this large, responses can exceed the default timeouts
Spurious failures
• Reading from S3 can be flaky, especially when reading millions of files
• Set spark.task.maxFailures higher than the default of 3
• We set it to < 10 so that true errors still propagate out quickly
What we will talk about today
• Issues at scale
• Skew
• Optimizations
• Ganglia
The skew
• Extreme skew in the data
• A few users have 100k+ events over 90 days; the average user has < 50
• Executors dying due to a handful of extremely large partitions
The skew
• Out of 20.5B users, 20.2B have < 50 events
[Chart: count of users with number of events, bucketed by 1000s; y-axis: # of users (0 to 2.5E+10), x-axis: # of events (0 to ~504,000). Nearly all users fall in the lowest buckets.]
The skew: zoom
[Chart: # of users with # of events, bucketed by 100s (> 75k); y-axis: # of users (0 to 35), x-axis: # of events (75,000 to ~504,000)]
Strategy: increase # of partitions
• First line of defense: increase the number of partitions so skewed data is more spread out
Strategy: Nest
• Group conversions by userId, group target events by userId, then join the lists
• Avoids a Cartesian join
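The nesting strategy can be sketched in plain Java (a stand-in for Spark's cogroup; the event shapes here are hypothetical): group each side by userId once, then attribute within each user's lists instead of emitting a row per (conversion, target event) pair.

```java
import java.util.*;

// Plain-Java sketch of the "nest" strategy: group each side by userId,
// then attribute within each user's two lists, instead of materializing
// a row per (conversion, target-event) pair as a flat join would.
public class NestJoin {
    static Map<String, List<String>> groupByUser(List<String[]> events) {
        Map<String, List<String>> grouped = new HashMap<>();
        for (String[] e : events) { // e = {userId, payload}
            grouped.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
        }
        return grouped;
    }

    // Attribute each conversion to ONE target event per user (here: any one),
    // rather than joining every conversion against every target event.
    static int attributedPairs(List<String[]> conversions, List<String[]> targets) {
        Map<String, List<String>> convByUser = groupByUser(conversions);
        Map<String, List<String>> targByUser = groupByUser(targets);
        int pairs = 0;
        for (String user : convByUser.keySet()) {
            List<String> targ = targByUser.get(user);
            if (targ == null) continue;          // no target events: nothing to attribute
            pairs += convByUser.get(user).size(); // one attribution per conversion
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String[]> conv = Arrays.asList(new String[]{"alice", "buy"});
        List<String[]> imps = Arrays.asList(
            new String[]{"alice", "ad1"}, new String[]{"alice", "ad2"}, new String[]{"bob", "ad3"});
        System.out.println(attributedPairs(conv, imps)); // 1, not 2: no Cartesian blowup
    }
}
```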
Long tail: Ganglia
Long tail: Spark UI
• 50-minute long tail; median task time 24 s
Long tail: what else to do?
• If you have domain-specific knowledge of your data, use it to filter “bad” data out
• Salt your data and shuffle twice (but shuffling is expensive)
• Use a Bloom filter if one side of your join is much smaller than the other
Bloom Filter
• A space-efficient probabilistic data structure for testing whether an element is a member of a set
• Size is mainly determined by the number of items in the filter and the probability of false positives
• No false negatives!
• Broadcast the filter out to the executors
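As a concrete illustration of these properties, here is a minimal, hand-rolled Bloom filter (a sketch for intuition only, not the implementation used in this pipeline):

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: k hash probes per key over an m-bit array.
// Membership tests can false-positive, but an added key is NEVER reported absent.
public class TinyBloom {
    private final BitSet bits;
    private final int m, k;

    TinyBloom(int m, int k) { this.bits = new BitSet(m); this.m = m; this.k = k; }

    // derive the i-th probe position from the key's hash (double hashing)
    private int probe(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, m);
    }

    void add(String key) {
        for (int i = 0; i < k; i++) bits.set(probe(key, i));
    }

    boolean mightContain(String key) {
        for (int i = 0; i < k; i++) if (!bits.get(probe(key, i))) return false;
        return true; // all k bits set: present, or a false positive
    }

    public static void main(String[] args) {
        TinyBloom bloom = new TinyBloom(1 << 16, 3);
        bloom.add("user-42");
        bloom.add("user-99");
        System.out.println(bloom.mightContain("user-42")); // true: no false negatives
        // Absent keys are usually rejected; the false-positive rate grows with n/m.
    }
}
```

In the join, only rows whose key passes `mightContain` need to be shuffled; everything the filter rejects is guaranteed not to match.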
Bloom Filter
• Even using a high false-positive rate, it is still a very good filter
• P = 5% → 80% filtered out
• The subsequent join is much faster
Bloom Filter
• Tradeoff between accuracy & size
• We’ve had great success with Bloom filters of size < 5 GB
• Experiment with Bloom filters
Bloom Filter Applied
• For 50 billion conversions and a false positive rate of 0.1%, the filter size is 80 GB
• At a false positive rate of 5%, the filter size is 35 GB
• Still too big
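These figures are roughly what the standard sizing formula m = -n * ln(p) / (ln 2)^2 bits predicts, assuming the slide reports sizes in GiB (a sanity check of the order of magnitude, not the exact tool the talk used):

```java
// Sanity check of the slide's filter sizes with the standard Bloom sizing
// formula m = -n * ln(p) / (ln 2)^2 bits for n items at false-positive rate p.
// Assumes sizes are reported in GiB; small differences are rounding.
public class BloomSize {
    static double gib(long n, double p) {
        double bits = -n * Math.log(p) / (Math.log(2) * Math.log(2));
        return bits / 8 / (1L << 30); // bits -> bytes -> GiB
    }

    public static void main(String[] args) {
        long conversions = 50_000_000_000L;
        System.out.printf("p=0.1%% -> %.0f GiB%n", gib(conversions, 0.001)); // ~84 GiB (slide: ~80 GB)
        System.out.printf("p=5%%   -> %.0f GiB%n", gib(conversions, 0.05));  // ~36 GiB (slide: ~35 GB)
    }
}
```

The formula also shows why relaxing p from 0.1% to 5% cuts the size by more than half, yet 50 billion keys still cannot fit in a broadcastable filter.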
Long tail: what else to do?
• If you have domain-specific knowledge of your data, use it to filter “bad” data out
• Salt your data and shuffle twice (but shuffling is expensive)
• Use a Bloom filter if one side of your join is much smaller than the other
Long tail: Ganglia
Long tail: what is it doing?
• Look at executor threads during the long tail
com.esotericsoftware.kryo.util.IdentityObjectIntMap.clear(IdentityObjectIntMap.java:382)
com.esotericsoftware.kryo.util.MapReferenceResolver.reset(MapReferenceResolver.java:65)
com.esotericsoftware.kryo.Kryo.reset(Kryo.java:865)
com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:630)
org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:209)
org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:134)
org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:239)
org.apache.spark.util.collection.WritablePartitionedPairCollection$$anon$1.writeNext(WritablePartitionedPairCollection.scala:56)
org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:699)
Long tail: what is it doing?
• Mappers are taking a long time writing to shuffle space
• We need to reduce the data size before it goes into the shuffle
Long tail: what is it doing?
• Events in the long tail had almost identical information, spread over time
• For each user, if we retain just 1 event per hour, 90 days is around 2k events
• However, this means we need to group by user first, which requires a shuffle, which defeats the whole purpose of this exercise, right?
Strategy: Filter during map-side combine
• Use combineByKey and maximize the map-side combine
• Thin the collection out during the map-side combine → less is written to shuffle space
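A plain-Java sketch of the thinning combiner (a stand-in for a combineByKey mergeValue function; the one-event-per-hour policy is the one described above, the types are illustrative). Same-hour duplicates per user collapse on the map side, before anything is written to shuffle space:

```java
import java.util.*;

// Stand-in for Spark's combineByKey map-side combine: per user, keep at most
// one event per hour bucket, so far less data is written to shuffle space.
public class MapSideThin {
    // mergeValue: fold one event (epoch seconds) into a user's combiner,
    // keyed by hour bucket so same-hour duplicates collapse to one entry.
    static Map<Long, Long> mergeValue(Map<Long, Long> acc, long eventEpochSec) {
        acc.putIfAbsent(eventEpochSec / 3600, eventEpochSec); // keep first event in the hour
        return acc;
    }

    static int thinnedCount(List<Long> userEvents) {
        Map<Long, Long> acc = new HashMap<>();
        for (long t : userEvents) mergeValue(acc, t);
        return acc.size();
    }

    public static void main(String[] args) {
        // 10,000 events from one user, all inside the same two hours
        // (base epoch chosen to align with an hour boundary for a clean example)
        List<Long> events = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) events.add(1_699_999_200L + i % 7200);
        System.out.println(thinnedCount(events)); // 2: only two hour-buckets survive
    }
}
```

In the real job the combiner output, not the raw events, is what gets serialized by Kryo and spilled by ExternalSorter, which is exactly where the long-tail threads were stuck.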
Still slow…
• What else can I do?
What we will talk about today
• Issues at scale
• Skew
• Optimizations
• Ganglia
Avoid shuffles
• Reuse the same partitioner instance
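The intuition, as a toy model (this mimics the behavior of Spark's HashPartitioner but is not Spark API): if both sides were last partitioned by the same partitioner, equal keys already sit in the same partition, so a subsequent join can proceed without a shuffle.

```java
// Toy model of co-partitioning: two datasets hashed with the SAME partitioner
// put equal keys into the same partition id, so a join needs no shuffle.
// Mimics Spark's HashPartitioner (nonNegativeMod over hashCode); not Spark API.
public class SamePartitioner {
    static int partition(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod; // keep the partition id non-negative
    }

    public static void main(String[] args) {
        int n = 128;
        String user = "user-12345";
        int pConversions = partition(user, n); // where this user's conversions land
        int pImpressions = partition(user, n); // where this user's impressions land
        System.out.println(pConversions == pImpressions); // true: co-located, no shuffle needed
    }
}
```

A new partitioner instance with a different partition count breaks this co-location, which is why reusing the same instance across RDDs matters.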
The DAG
Avoid shuffles
• Denormalize data or union data to minimize shuffles
• Rely on the fact that we will reduce into a highly compressed key space
• For example, we want a count of events by campaign, and also a count of events by site
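One way to sketch the campaign/site example (plain Java; the field layout is hypothetical): tag each event once per dimension and reduce over the unioned key space in a single pass, instead of running two separate count-by-key jobs.

```java
import java.util.*;

// Sketch of the "union to minimize shuffles" trick: instead of two separate
// count-by-key passes (two shuffles), emit one tagged key per dimension and
// reduce once over the small, "highly compressed" key space.
public class UnionCounts {
    static Map<String, Integer> countBoth(List<String[]> events) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] e : events) { // e = {campaignId, siteId}
            counts.merge("campaign:" + e[0], 1, Integer::sum); // count by campaign
            counts.merge("site:" + e[1], 1, Integer::sum);     // count by site
        }
        return counts; // both aggregates from one pass
    }

    public static void main(String[] args) {
        List<String[]> events = Arrays.asList(
            new String[]{"c1", "s1"}, new String[]{"c1", "s2"}, new String[]{"c2", "s1"});
        Map<String, Integer> counts = countBoth(events);
        System.out.println(counts.get("campaign:c1")); // 2
        System.out.println(counts.get("site:s1"));     // 2
    }
}
```

In Spark terms, the doubled-but-tagged dataset goes through one reduceByKey rather than two, trading a modest increase in mapped records for a whole shuffle stage.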
Avoid shuffle
Coalesce partitions when loading
• Loading many small files: coalesce down the # of partitions
• No shuffle
• Reduces task overhead, greatly improves speed
• Going from 300k partitions to 60k cut the time in half
Coalesce partitions when loading

final JavaRDD<Event> eventRDD = loadDataFromS3();          // load data
final int loadingPartitions = eventRDD.getNumPartitions(); // inspect how many partitions
final int coalescePartitions = loadingPartitions / 5;      // use algorithm to calculate new #
eventRDD
    .coalesce(coalescePartitions) // coalesce to smaller #
    .map(e -> transform(e));      // faster subsequent operations
Materialize data
• A large chunk of data is persisted in memory
• The large RDD is used to calculate a small RDD
• Use an Action to materialize the smaller calculated result so the larger data can be unpersisted
Materialize data

parent.cache();                          // persist large parent PairRDD to memory
child1 = parent.reduceByKey(a).cache();  // calculate child1 from parent
child2 = parent.reduceByKey(b).cache();  // calculate child2 from parent
child1.count();                          // perform an Action
child2.count();                          // perform an Action
parent.unpersist();                      // safe to mark parent as unpersisted
// the rest of the code can use the freed memory
What we will talk about today
• Issues at scale
• Skew
• Optimizations
• Ganglia
Ganglia
• Ganglia is an extremely useful tool for understanding performance bottlenecks and tuning for the highest cluster utilization
Ganglia: CPU wave
Ganglia: CPU wave
• Executors are going into GC multiple times in the same stage
• They are running out of execution memory
• Persist to StorageLevel.DISK_ONLY()
Ganglia: inefficient use
Ganglia: inefficient use
• Decrease the # of partitions of the RDDs used in this stage
Ganglia: much better
Final Configuration
• Master: 1 r3.4xl
• Executors: 110 r3.4xl
• Configurations:
spark maximizeResourceAllocation TRUE
spark-defaults spark.executor.cores 16
spark-defaults spark.dynamicAllocation.enabled FALSE
spark-defaults spark.driver.maxResultSize 8g
spark-defaults spark.rpc.message.maxSize 2047
spark-defaults spark.rpc.askTimeout 300
spark-defaults spark.network.timeout 300s
spark-defaults spark.executor.heartbeatInterval 20s
spark-defaults spark.executor.memory 92540m
spark-defaults spark.yarn.executor.memoryOverhead 23300
spark-defaults spark.task.maxFailures 10
spark-defaults spark.executor.extraJavaOptions -XX:+UseG1GC
spark-defaults spark.cleaner.periodicGC.interval 600min
Summary
• Large jobs are special; use special settings
• Outsmart the skew
• Use Ganglia!
Thank you
Emma Tang
@emmayolotang
@Neustar
Editor's Notes
  • #2: Today I’m going to share with you some tips and tricks to help you get started with processing large data in batch processes.
  • #3: First, let me tell you a little bit about Neustar. Neustar provides a range of cutting edge marketing solutions, which you can check out online at the above link.
  • #4: On our team, we focus on helping the world’s most valuable brands
  • #5: A quick word here on our stack, Our jobs run on AWS EMR, and read and output data to S3. we’ve built infrastructure to support our data pipeline, so that we have a fast, highly fault tolerant, cloud based system. All of our Spark jobs shown in the middle blue box are batch ETL jobs. Which is our focus today.
  • #6: Our focus today is on Batch ETL jobs, which differ from other Spark use cases such as ad hoc data science or streaming in the following ways. They run on a schedule or are programmatically triggered, so they need to be reliable and robust: humans will not be manually monitoring the job or able to tweak it during runs. Secondly, we aim for complete utilization of cluster resources, especially memory and CPU. In streaming, depending on the workload, you might not care as much about using all of your machine resources at every moment. But in batch, the goal is to squeeze every bit of juice out of our machines, so we can arrive at our result with minimal cost.
  • #7: We’re getting late arriving data all the time, hence historical state.
  • #8: We’re going to try to touch on everything I promised in the talk description, and if we run out of time, the slides will be available online soon. By skew we mean data distributed in a way where outliers really hurt performance.
  • #9: Let’s use a specific problem to get started. This example is relevant to our business, and we will learn about Spark in the process. At Neustar, we process large quantities of ad events — some examples include impressions, clicks, and conversions. The problem we’re going to solve is: which impression or click events contributed to the conversion happening? A user usually has many events, so multiple conversions and multiple target events, and we need to correctly attribute each conversion for each user. Let’s use a concrete example:
  • #10: Let’s use a concrete example. How would we solve this in code?
  • #11: Attribute correctly based on timestamp — the target event must have occurred before the conversion, and we value the latest target events most. Also use metadata to ensure the events are for the same advertiser, etc.
  • #12: For 90 days of events we are loading around 100 TB of data. For the largest join, we are joining 50 billion conversions with 250 billion impressions.
  • #13: Let’s talk about issues you will encounter only at scale
  • #14: When working at this scale, you will see errors that other people won’t see — for example, a driver OOM. That stack trace does not actually mean the JVM is out of memory; your metrics will show plenty of heap left. The limit is Java’s ability to allocate an array: the maximum array size is about 2 billion elements, and we have more than 2 GB of map output statuses, so the byte array backing the ByteArrayOutputStream cannot grow any further — you can see it overflowed the buffer. So what is contained in this output stream?
  • #15: We are serializing statuses, which is an array of MapStatuses whose length is the number of map tasks. Each status contains information about how that map task’s output blocks are consumed by each reducer. We can already see this is an m × n problem.
  • #16: Spark has 2 types of MapStatus: if the number of partitions exceeds 2000, HighlyCompressedMapStatus is used. If the data also happens to be distributed in a way that prevents the bitmap from compressing well — for example, when each map task’s output goes to a random subset of reducers, with some reducers getting nothing and some getting output — you can create a situation where your buffer will not be big enough for the compressed statuses! The easiest solution is to reduce the number of partitions.
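  A back-of-the-envelope check makes the m × n growth concrete. This is only an order-of-magnitude sketch, not Spark’s actual serialization format: even at one bit per (map task, reducer) pair, the payload blows past Java’s ~2 GB array cap at large task counts.

```python
# Rough lower bound on serialized map-status size: one bit per
# (map task, reducer) pair. Spark's real encoding differs; this is
# only an order-of-magnitude sketch.
JAVA_MAX_ARRAY_BYTES = 2**31 - 1  # a single byte[] tops out near 2 GB

def min_status_bytes(num_map_tasks, num_reducers):
    bits = num_map_tasks * num_reducers
    return bits // 8

# 500k x 500k tasks overflow the array limit even as a raw bitmap,
# while 10k x 10k fits easily -- hence "reduce the partition count".
print(min_status_bytes(500_000, 500_000) > JAVA_MAX_ARRAY_BYTES)  # True
print(min_status_bytes(10_000, 10_000) < JAVA_MAX_ARRAY_BYTES)    # True
```

  The same arithmetic explains why cutting the partition count on either side of the shuffle shrinks the driver’s map-status payload quadratically rather than linearly.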
  • #17: Luckily the solution is also transparent.
  • #18: For batch jobs we don’t have a long-running application. Long GC cycles waste resources on our clusters, so we tweak our jobs and cluster settings to avoid large amounts of GC; we’re working with large data, which means longer pauses when GC does happen. The setting to tweak here is spark.cleaner.periodicGC.interval. The cleaning thread blocks on cleanup tasks, for example when the driver performs a GC and cleans up all broadcasts. For larger jobs, we increase the periodic GC interval on the driver so that it is effectively disabled; for this job we set it to 600 minutes. Java’s newer G1 GC completely changes the traditional approach: the heap is partitioned into a set of equal-sized regions, each a contiguous range of virtual memory. Region sets are assigned the same roles (Eden, survivor, old) as in the older collectors, but there is no fixed size for them. When a minor GC occurs, G1 copies live objects from one or more regions of the heap to a single region; a full GC occurs only when all regions hold live objects and no fully empty region can be found. This greatly improves heap occupancy by the time a full GC is triggered, and also makes minor GC pause times more controllable.
  • #19: G1GC – low latency high throughput.
  • #20: We need Spark to be a little more patient when dealing with larger jobs, so we ask that it wait longer for communication between machines — because our heap size is so large, a single GC pause can otherwise exceed the timeout.
  • #21: Protection against spurious failures, which do occur. Spark’s default maxFailures per task is 3, which is not enough in large jobs. On the other hand, we don’t want to set it too high, or true errors will be retried over and over, delaying a timely response and wasting money.
  • #22: We’re going to try to touch on everything I promised in the talk description, and if we run out of time, the slides will be available online soon.
  • #23: At this scale, you’re more likely than not to encounter skew. We had extreme skew in our data.
  • #24: Overwhelming majority of our users have fewer than 1000 events. In fact, most have fewer than 50 events.
  • #25: Upon closer inspection, our data had very few extreme outliers: out of 20 billion, only 879 were greater than 75k. These very few bad partitions were causing our executors to die.
  • #27: By grouping events into lists, executors weren’t as likely to die anymore.
  • #28: We increased partitions and nested our data, and our executors weren’t dying anymore! But there were still inefficiencies caused by skew. Notice the lag in CPU use in ganglia. All of those resources were wasted because there were a few bad partitions.
  • #29: The max was taking 50 minutes vs the median of 24 seconds! We need to do something about this.
  • #30: If you have domain-specific knowledge of your data, use it to filter “bad” data out as early as you can; we couldn’t do this in the naïve way because we had no idea in advance which users were going to be bad. You can salt your data and join twice — in our empirical studies this was never worth it, since shuffle is so expensive. You can use a bloom filter if one side of your join is much smaller than the other, but bloom filters quickly become large past a certain size. For example, if we had xxx conversions, the bloom filter for a 0.1% false positive rate would have been xxx. Broadcasting this out and filtering would have used more resources than joining directly.
  • #31: The size is mainly determined by the number of items in the filter and the desired false-positive probability: the more items you have, and the more accurate you want the filter to be, the bigger it gets. The filter can be broadcast out to executors and used in tasks; Spark broadcasts use a BitTorrent-like algorithm, splitting the data into a couple hundred slices, so you can push out a couple of gigabytes. A bloom filter is a bit set plus several hash functions — membership is tested as hash1(x) && hash2(x) && hash3(x) — and more hash functions buy you more accuracy.
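  The sizing rule can be written down directly using the standard bloom filter formulas (textbook math, not figures from the talk — the actual counts were elided in the notes):

```python
import math

def bloom_filter_bits(n_items, fp_rate):
    """Standard sizing: m = -n * ln(p) / (ln 2)^2 bits."""
    return -n_items * math.log(fp_rate) / (math.log(2) ** 2)

def optimal_hash_count(n_items, m_bits):
    """k = (m / n) * ln 2 hash functions minimizes the FP rate."""
    return (m_bits / n_items) * math.log(2)

# 1 billion keys at a 0.1% false-positive rate need roughly 1.8 GB
# and about 10 hash functions -- already awkward to broadcast.
m = bloom_filter_bits(1_000_000_000, 0.001)
print(round(m / 8 / 1e9, 2), "GB,", round(optimal_hash_count(1_000_000_000, m)), "hashes")
```

  Plugging your own row counts into these two formulas is the quickest way to decide whether a broadcast bloom filter is cheaper than the join it would replace.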
  • #32: One thing to note: don’t be afraid of using a high false-positive rate. If the goal is to shrink the joinable set, even a 5% false-positive rate filters out 80% of the data, which makes subsequent joins much faster.
  • #33: However, we’ve had great success in many other jobs where the join was even more lopsided. Please experiment and see if it could work for you.
  • #34: Now going back to our original problem. You can find bloom filter size calculators online to find out if a bloom filter is a good solution for you. Unfortunately, both sides of our join were sizable, and the bloom filter was not the right solution for this job.
  • #35: None of these strategies worked directly for us, so we were still left with skew.
  • #36: Let’s look at the ugly long tail graph again.
  • #37: Let’s go back to our problem. We have a very long tail. It is always a good idea to understand what your threads are spending time on, and for a long tail we are especially interested. On our r3.4xls with 16 cores, we saw that most threads were stuck on writing: the job was spending a lot of time writing to shuffle space, so let’s try to reduce the size to be written.
  • #38: On our r3.4xls with 16 cores, we saw that most threads were stuck on writing: the job was spending a lot of time writing to shuffle space, so let’s try to reduce the size to be written.
  • #39: Let’s go back to the data. Inspecting the event data revealed to us that in the long tail, the events had almost identical information, spread over time. If we retain just 1 event per hour, at 90 days, it is at most 2160 events per user. However, this seems to mean that we need to group by userId first, which requires a shuffle, which defeats the whole purpose of this exercise, right? This is where map side combine comes in.
  • #40: Let’s go back to the data. Inspecting the event data revealed to us that in the long tail, the events had almost identical information, spread over time. If we retain just 1 event per hour, at 90 days, it is at most 2160 events per user. However, this seems to mean that we need to group by userId first, which requires a shuffle, which defeats the whole purpose of this exercise, right? This is where map side combine comes in. If we use CombineByKey, we can specify the operation to perform on the map side. Here we can thin out the collection.
  • #41: The basic structure is shown here. Notice in the second lambda, we rate limit/ thin out collection when list size reaches a certain max. Feel free to check out the slides, they will be available online.
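  That structure can be sketched outside Spark in plain Python (the real job uses combineByKey in Scala; the hour-bucket thinning follows the notes, while the event shape and field names are made up for illustration):

```python
SECONDS_PER_HOUR = 3600

def create_combiner(event):
    # First event for a key starts an hour-bucket -> event map.
    return {event["ts"] // SECONDS_PER_HOUR: event}

def merge_value(acc, event):
    # Map-side combine: keep only the first event seen in each hour
    # bucket, so 90 days yields at most 2160 entries per user.
    acc.setdefault(event["ts"] // SECONDS_PER_HOUR, event)
    return acc

def merge_combiners(a, b):
    # Reduce-side merge of two already-thinned collections.
    for hour, event in b.items():
        a.setdefault(hour, event)
    return a

# Simulate one map partition: 3,000 near-duplicate events inside a
# single hour collapse to one representative before any shuffle.
events = [{"ts": 100 + i} for i in range(3_000)]
acc = create_combiner(events[0])
for e in events[1:]:
    acc = merge_value(acc, e)
print(len(acc))  # 1
```

  The point of running the thinning inside merge_value is that it happens on the map side, before the shuffle, which is exactly what shrinks the write volume the previous slides complained about.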
  • #42: We got rid of the long tail, but my job is still generally slow, can I squeeze more performance out of it by tweaking my job? The answer is yes.
  • #43: We’re going to try to touch on everything I promised in the talk description, and if we run out of time, the slides will be available online soon.
  • #44: Reusing the partitioner allows all the data to be partitioned the same way, and ready for joins/reduceByKeys. This way we can transform multiple wide transformations into a single narrow transformation
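  The co-partitioning idea can be illustrated with a toy hash partitioner (a plain-Python stand-in for Spark’s HashPartitioner, not Spark’s actual implementation):

```python
NUM_PARTITIONS = 8

def partition_of(key, num_partitions=NUM_PARTITIONS):
    # Stand-in for HashPartitioner.getPartition: same key, same
    # partitioner -> same partition index, every time.
    return hash(key) % num_partitions

# Two datasets partitioned with the SAME partitioner place any given
# userId in the same partition index, so joining them is a narrow
# dependency -- no additional shuffle is needed.
users = ["u1", "u2", "u3", "u42"]
left = {u: partition_of(u) for u in users}    # e.g. impressions
right = {u: partition_of(u) for u in users}   # e.g. conversions
print(all(left[u] == right[u] for u in users))  # True
```

  In Spark this is why passing the same partitioner object to successive combineByKey and join calls chains them into one narrow stage instead of shuffling at each step.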
  • #45: Please don’t try to read this, what I want to point out here, is this long narrow stage. It contains 3 combineByKeys, and 2 left outer joins in the same narrow stage. If we had used different partitioners, each would induce a shuffle.
  • #46: But we can go further than that to reduce the amount of shuffles. We are willing to pay for it in memory.
  • #47: Let’s say we want to get a count of users by campaign, and we also want a count of users by site. The straightforward way is to take the RDD, cache it, and reduceByKey on it twice. But if we duplicate the rows like this, we can reduceByKey once and then filter for the two desired results.
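  The row-duplication trick can be sketched outside Spark with tagged keys (the row layout and names here are hypothetical):

```python
from collections import defaultdict

# One input row = (campaign, site, userId); names are made up.
rows = [
    ("campA", "site1", "alice"),
    ("campA", "site2", "alice"),
    ("campB", "site1", "bob"),
]

# Emit two tagged keys per row so a SINGLE aggregation pass (one
# shuffle in Spark) answers both questions, instead of caching the
# RDD and running reduceByKey twice.
seen = defaultdict(set)
for campaign, site, user in rows:
    seen[("campaign", campaign)].add(user)
    seen[("site", site)].add(user)

by_campaign = {key: len(users) for (tag, key), users in seen.items() if tag == "campaign"}
by_site = {key: len(users) for (tag, key), users in seen.items() if tag == "site"}
print(by_campaign, by_site)
```

  The trade-off is exactly the one the slide names: the input doubles in size (memory and shuffle bytes) in exchange for eliminating a whole shuffle stage.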
  • #48: Another strategy that greatly improves performance is coalescing # of partitions down.
  • #49: A simplistic example would be this. We load data from S3, calculate the original partition number, use an algorithm that makes sense for your data to find the new partition #, then coalesce your data down to new # of partitions. Subsequent operations will be more efficient
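  One simple partition-count heuristic (an assumed example, not necessarily the exact algorithm from the talk) is to size each post-coalesce partition near a target block size, say 128 MB:

```python
def target_partitions(total_bytes, bytes_per_partition=128 * 1024 * 1024):
    """Ceiling division so each coalesced partition holds ~one target block."""
    return max(1, -(-total_bytes // bytes_per_partition))

# After heavy filtering, 1 GiB of surviving data coalesces down to
# just 8 partitions; subsequent stages run far fewer, fatter tasks.
print(target_partitions(1 << 30))  # 8
```

  Because coalesce (without shuffle) only merges partitions on the same executor, applying it right after the big filter cheaply removes per-task overhead for everything downstream.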
  • #50: Do a count, or a reduce.
  • #51: Spark is lazy, so you have to materialize first; otherwise Spark will load from the source again.
  • #52: We’re going to try to touch on everything I promised in the talk description, and if we run out of time, the slides will be available online soon.
  • #53: Let’s go through two ganglia use cases. It’s available in AWS EMR.
  • #54: What happens when you see waves occurring within a single stage? Sometimes the waves last longer, taking more than 10 minutes from crest to trough.
  • #55: DISK_ONLY caching — its effectiveness depends on the machine class; on r4 it’s not that great. On-board SSD is good, EBS is not that great. Tune and test.
  • #56: Notice the last stage is only using 70% of CPU for the entire process.
  • #57: Decrease the number of partitions of the RDD used in this stage. Less total overhead, and fewer tasks means each task computes more data, which helps with CPU utilization.
  • #58: After tweaking the above 2 properties, we get much better cluster usage!
  • #59: We’ve gone through most of these configurations today; for the ones I didn’t mention, Spark does a good job of notifying users when they need to be set. Feel free to reach out to me with any questions after this session. spark.dynamicAllocation.enabled FALSE — we are not multi-tenanted. spark.driver.maxResultSize 8g — set really high, since our driver is large and can handle lots of results coming back to it. spark.rpc.message.maxSize 2047 — if you have many map and reduce tasks, use the max; this message carries the map statuses, which we know grow as m × n. The memory overhead exists because of JVM overhead and protects against executor OOM. What goes into it? The intern pool and everything off the JVM heap: thread stacks, NIO buffers, shared native libraries. The YARN coordination process takes 896 MB (you can find it on the YARN application page).
  • #60: Why not DataFrames? Complicated logic is difficult to express in SQL — for example, finding the latest events that fit specific criteria after a join, or rate-limiting data by hour boundaries. Really large data requires specific partition settings, partitioners, combineByKey with map-side combine optimizations, and bloom filters in each type of stage, and this kind of code customization is more difficult with DataFrames. DataFrames are great, but RDDs give us the flexibility to exploit patterns that exist in the data. In R&O we were able to go from multiple reduces with data skew down to a single pass with RDDs. DataFrames are a great general data-analysis tool; RDDs are fantastic for power users. We saw a 6x difference in size between the DF and RDD versions.