This document discusses strategies for optimizing large-scale batch ETL jobs at Neustar. It describes issues the team faces at scale, such as driver out-of-memory errors when joining large datasets, and how they address data skew by increasing partition counts and nesting joins. Other optimization techniques covered include reducing unnecessary garbage collection, raising timeouts for slow stages, and using Ganglia and the Spark UI to diagnose long-running tasks. Bloom filters and filtering during map-side combines can shrink data before it is shuffled, and performance also improves by avoiding shuffles where possible through data denormalization and by coalescing partitions when loading.
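The timeout and garbage-collection tuning mentioned above is usually expressed as Spark configuration. The property names below are standard Spark settings, but the values are illustrative assumptions, not the ones used at Neustar:

```properties
# Give straggling executors longer before the driver declares them dead.
spark.network.timeout            600s
spark.executor.heartbeatInterval 60s
# Use G1 and log GC activity so long pauses show up in executor logs.
spark.executor.extraJavaOptions  -XX:+UseG1GC -verbose:gc
```

Note that `spark.executor.heartbeatInterval` should stay well below `spark.network.timeout`, and GC logging makes long pauses visible alongside the long tasks identified in Ganglia and the Spark UI.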
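The skew-mitigation idea behind "nesting joins" is commonly implemented as key salting: a hot join key is split into several sub-keys so its rows no longer pile up in one partition. Below is a minimal plain-Python sketch of that idea (the source does not show code, so the function names and the `SALT_BUCKETS` count are illustrative assumptions, not Neustar's implementation):

```python
import random

SALT_BUCKETS = 8  # assumed number of sub-keys a hot key is split into

def salt_key(key):
    """Append a random salt so one hot key spreads over several buckets/partitions."""
    return (key, random.randrange(SALT_BUCKETS))

def explode_key(key):
    """Replicate a small-side key once per salt bucket so the join still matches."""
    return [(key, s) for s in range(SALT_BUCKETS)]

def salted_join(large, small):
    """Join large [(key, value)] pairs against a small {key: value} dict via salted keys."""
    # Build the replicated lookup table for the small side.
    lookup = {}
    for key, value in small.items():
        for salted in explode_key(key):
            lookup[salted] = value
    # Probe with salted keys; work for a hot key now lands in SALT_BUCKETS buckets.
    out = []
    for key, value in large:
        match = lookup.get(salt_key(key))
        if match is not None:
            out.append((key, value, match))
    return out
```

In Spark terms, the salted key becomes the partitioning key, so the hot key's rows are processed by up to `SALT_BUCKETS` tasks instead of one.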
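The Bloom-filter technique works by building a compact membership structure over one side's join keys and using it to drop unmatchable rows before they are shuffled. A self-contained sketch of the general technique (this is a generic toy Bloom filter, not the one described in the source; sizes and hash counts are assumptions):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, size_bits=1 << 16, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a big int used as a bit set

    def _positions(self, item):
        # Derive num_hashes independent bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))

def prefilter(rows, bloom):
    """Drop (key, value) rows whose join key cannot possibly match, pre-shuffle."""
    return [(k, v) for k, v in rows if bloom.might_contain(k)]
```

Because a Bloom filter has no false negatives, every row that would have joined survives the pre-filter; the occasional false positive just means a little unnecessary data is shuffled.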
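A map-side combine reduces shuffle volume by pre-aggregating values for each key inside every partition, so at most one record per key per partition crosses the network. A minimal sketch of that mechanic in plain Python (the function names here are illustrative; in Spark this is what `reduceByKey`-style operators do under the hood):

```python
from collections import defaultdict

def map_side_combine(partition):
    """Pre-aggregate (key, value) pairs locally before they are shuffled."""
    local = defaultdict(int)
    for key, value in partition:
        local[key] += value
    # At most one record per distinct key leaves this partition.
    return list(local.items())

def shuffle_and_reduce(partitions):
    """Merge the pre-combined partial sums, as the reduce side would."""
    totals = defaultdict(int)
    for partition in partitions:
        for key, partial in map_side_combine(partition):
            totals[key] += partial
    return dict(totals)
```

The win is proportional to how many duplicate keys each partition holds: heavily repeated keys collapse to single partial aggregates before the shuffle.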