Comparing Scalable Predictive Analysis using Spark XGBoost Platforms

Jongwook Woo
BigDAI
CalStateLA
ICONI 2022
Jeju Island, Korea
December 12, 2022
Jongwook Woo (jwoo5@calstatela.edu)
Savita Yadav, Samyuktha Muralidharan
Big Data AI Center (BigDAI)
California State University, Los Angeles

Contents
Why XGBoost
Dataset
Big Data Platforms
Comparison of Experimental Results
Conclusion

Why XGBoost
XGBoost stands for Extreme Gradient Boosting.
Well known for winning many Kaggle competitions.
Ensemble of trees with distributed gradient boosting, based on [4]:
– hardware optimization
– parallelized tree building
– tree pruning using a ‘depth-first’ approach
– regularization through both LASSO (L1) and Ridge (L2) to avoid overfitting
– efficient handling of missing data
– built-in cross-validation capability (at each iteration)
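
The regularization bullet can be illustrated with a minimal sketch of how L1 and L2 penalties shrink a leaf weight and a split gain in gradient-boosted trees. It uses the textbook second-order form of the objective; the function names, default values, and sample numbers are assumptions for illustration and are not taken from the study.

```python
def soft_threshold(grad_sum, reg_alpha):
    """Shrink the gradient sum toward zero by reg_alpha (the L1 / LASSO penalty)."""
    if grad_sum > reg_alpha:
        return grad_sum - reg_alpha
    if grad_sum < -reg_alpha:
        return grad_sum + reg_alpha
    return 0.0


def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, reg_alpha=0.0):
    """Leaf weight; reg_lambda (the L2 / Ridge penalty) enlarges the denominator."""
    return -soft_threshold(grad_sum, reg_alpha) / (hess_sum + reg_lambda)


def leaf_score(grad_sum, hess_sum, reg_lambda=1.0, reg_alpha=0.0):
    """Quality of a leaf: larger means a bigger reduction of the objective."""
    shrunk = soft_threshold(grad_sum, reg_alpha)
    return shrunk * shrunk / (hess_sum + reg_lambda)


def split_gain(left, right, reg_lambda=1.0, reg_alpha=0.0, gamma=0.0):
    """Gain of splitting a node into (grad_sum, hess_sum) pairs left and right.

    gamma is the per-leaf complexity penalty; a split with gain <= 0
    is a candidate for pruning.
    """
    left_grad, left_hess = left
    right_grad, right_hess = right
    children = (leaf_score(left_grad, left_hess, reg_lambda, reg_alpha)
                + leaf_score(right_grad, right_hess, reg_lambda, reg_alpha))
    parent = leaf_score(left_grad + right_grad, left_hess + right_hess,
                        reg_lambda, reg_alpha)
    return 0.5 * (children - parent) - gamma


# Made-up gradient statistics for one leaf
print(leaf_weight(-4.0, 3.0))                 # no L1 penalty
print(leaf_weight(-4.0, 3.0, reg_alpha=2.0))  # the L1 penalty shrinks the weight
print(split_gain((-4.0, 3.0), (4.0, 3.0)))
```

Raising reg_alpha or reg_lambda pulls leaf weights toward zero, which is how the penalties limit overfitting.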

Big Data AI Center: Regression Models [1-3]
Big Data: distributed parallel computing
Regression models to predict the price of Airbnb listings
RAPIDS XGBoost models
– 74% higher accuracy and up to 90% faster than the traditional models
In Spark EMR, XGBoost using GPU
– is 17% faster than using only CPU

Big Data AI Center: Performance of Models [1-3]
[Figure: bar chart of Train Time (sec) for XGBoost CPU, XGBoost GPU, DT, RF, and GBT models, grouped by storage (Linux file system, HDFS, S3) and platform (RayDP, EMR)]

Big Data AI Center: Performance of Models [1-3]
Cluster      Model        Training Time (sec)  RMSE    R2
HDFS         XGBoost GPU  15.4                 33.42   0.731
S3           DT           31.1                 130.4   0.238
S3           GBT          173.2                128.7   0.248
S3           RF           48.2                 130.1   0.247
S3           XGBoost CPU  21.5                 33.4    0.731
RayDP EXT4   XGBoost CPU  4.3                  31.1    NA
Muralidharan, S., Yadav, S., Huh, J., Lee, S., & Woo, J. (June 2022). “Scalable Prediction Models for Airbnb Listing in Spark Big Data Cluster using GPU-accelerated RAPIDS”, Journal of Information and Communication Convergence Engineering, 20(2), 96–102. https://doi.org/10.6109/JICCE.2022.20.2.96
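
For readers who want to reproduce the columns of the table above, here is a minimal sketch of RMSE, R2, and a relative training-time reduction. The two training times are copied from the table; the helper names are assumptions, and the slides do not say which baseline the "up to 90% faster" figure uses, so comparing XGBoost GPU against GBT is only one possible reading.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def percent_faster(baseline_sec, candidate_sec):
    """Reduction in training time relative to the baseline, in percent."""
    return 100.0 * (baseline_sec - candidate_sec) / baseline_sec


# Training times (sec) taken from the table: GBT on S3 vs. XGBoost GPU on HDFS
gbt_s3_sec = 173.2
xgboost_gpu_hdfs_sec = 15.4
reduction = percent_faster(gbt_s3_sec, xgboost_gpu_hdfs_sec)
print(f"XGBoost GPU trains {reduction:.1f}% faster than GBT")
```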


Dataset Details
Dataset:
Airbnb Listings
Dataset URLs:
– https://public.opendatasoft.com/explore/dataset/airbnb-listings/table/?disjunctive.host_verifications&disjunctive.amenities&disjunctive.features
Total Dataset size: 4 GB
Dataset Format: CSV
Not easy to handle with traditional systems
– in data engineering, analysis, and science
Predicting whether a listing has a good rating or not
Using two-class classification algorithms to build a model
– to classify listings as high rated or low rated, based on the features of the listing

Dataset Details (Cont’d)
Data Transformation
Label: Review Scores Rating
– Converting the Review Scores Rating column to binary class
• Review Scores Rating >= 80 -> High Rating
• Review Scores Rating < 80 -> Low Rating
Features: 25 columns
– "Host Listings Count", "Host Total Listings Count", "Calculated host listings count",
"Security Deposit", "Cleaning Fee", "Host Response Time", "Host Response Rate", "Host
Acceptance Rate", "Property Type", "Room Type", "Weekly Price", "Monthly Price",
"Maximum Nights", "Review Scores Accuracy", "Review Scores Cleanliness", "Review
Scores Checkin", "Review Scores Communication", "Review Scores Location", "Review
Scores Value", "Cancellation Policy", "Bedrooms", "Bathrooms", "Beds", "Extra People",
"Minimum Nights"

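The label rule on the Data Transformation slide (Review Scores Rating >= 80 is High Rating, otherwise Low Rating) can be sketched in plain Python as follows. The sample rows are made up, only two of the 25 feature columns are shown, and skipping rows with a missing rating is an assumption, since the slides do not say how missing labels were handled.

```python
import csv
import io

HIGH_RATING_THRESHOLD = 80  # threshold stated on the slide


def to_binary_label(review_score):
    """Return 1 for High Rating (>= 80) and 0 for Low Rating (< 80)."""
    return 1 if float(review_score) >= HIGH_RATING_THRESHOLD else 0


# Made-up sample rows; the real dataset is a 4 GB CSV file.
SAMPLE_CSV = """Review Scores Rating,Bedrooms,Cleaning Fee
95,2,40
79,1,15
80,3,60
,1,20
"""


def load_labeled_rows(csv_text, feature_columns):
    """Keep the chosen feature columns and attach the binary label."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        score = record["Review Scores Rating"]
        if not score:
            # Assumption: rows without a rating cannot be labeled, so skip them.
            continue
        features = {name: record[name] for name in feature_columns}
        rows.append((features, to_binary_label(score)))
    return rows


rows = load_labeled_rows(SAMPLE_CSV, ["Bedrooms", "Cleaning Fee"])
labels = [label for _, label in rows]
print(labels)
```
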
Need for Big Data Predictive Analysis
Performance issues with traditional predictive models
The dataset is 4 GB.
Traditional systems mostly
– raise a memory error
– or take several hours or days to build the models
Need Big Data
Inexpensive platform: a distributed parallel computing system
– that can store large-scale data and process it in parallel
Apache Hadoop (since 2006) and Spark
– inexpensive supercomputer
– any small company or university lab can own one
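
The idea of storing data in pieces and processing the pieces in parallel can be sketched with a small split, apply, and combine example. This is not Spark code: local threads stand in for cluster nodes only to show the pattern, and the function names and prices are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(values, num_partitions):
    """Round-robin split so each worker gets a roughly equal share."""
    return [values[i::num_partitions] for i in range(num_partitions)]


def partial_sum_count(partition):
    """Work done independently on one partition."""
    return sum(partition), len(partition)


def parallel_mean(values, num_partitions=4):
    """Compute a mean by combining per-partition partial results."""
    partitions = split_into_partitions(values, num_partitions)
    # Threads only illustrate the pattern; a real cluster runs each
    # partition on a separate executor process or machine.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_sum_count, partitions))
    total = sum(part_sum for part_sum, _ in partials)
    count = sum(part_count for _, part_count in partials)
    return total / count


prices = [100.0, 80.0, 120.0, 60.0, 140.0, 100.0]
mean_price = parallel_mean(prices, num_partitions=3)
print(mean_price)
```

Because each partition only returns a small (sum, count) pair, adding more partitions and workers scales the computation without moving the raw data to one place.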

Big Data: Distributed Computing Systems
Hadoop Spark cluster
In-memory computing engine
– started by AMPLab at UC Berkeley in 2009 and open-sourced in 2010
– Spark can be up to 100 times faster than Hadoop MapReduce in theory
– supports machine learning algorithms
More platforms can leverage a Spark cluster for XGBoost
NVIDIA RAPIDS
Intel BigDL
H2O Sparkling Water
How do these platforms perform?

XGBoost
AWS EMR
EMR Hadoop Spark Big Data cluster
– EMR 6.7.0:
• Hadoop 3.2.1, Spark 3.2.1
• Does not include XGBoost
– 3-node cluster
XGBoost support
– NVIDIA RAPIDS
• set up RAPIDS in EMR Spark
– to build an XGBoost model with GPU and CPU
– BigDL
– Sparkling Water
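
As a rough illustration of what setting up RAPIDS in Spark involves, the sketch below assembles a spark-submit command with GPU-related properties. The property names are standard Spark and RAPIDS Accelerator settings, but the values and the script name train_xgboost.py are hypothetical and are not the configuration used in this study.

```python
# Spark properties for GPU scheduling and the RAPIDS Accelerator plugin.
# Values are illustrative assumptions, not the study's settings.
RAPIDS_CONF = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.rapids.sql.enabled": "true",
    "spark.executor.resource.gpu.amount": "1",
    "spark.task.resource.gpu.amount": "0.5",
}


def to_spark_submit_args(conf):
    """Turn a property dict into repeated '--conf key=value' arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


# train_xgboost.py is a hypothetical training script name.
command = ["spark-submit"] + to_spark_submit_args(RAPIDS_CONF) + ["train_xgboost.py"]
print(" ".join(command))
```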

AWS EMR Big Data Cluster
PySpark, RAPIDS, BigDL, and Sparkling Water installed
AWS EMR Big Data Cluster with GPU: 2 x g4dn.2xlarge, m3.xlarge
1. Airbnb files: HDFS
2. Data Read & Engineering
3. Train and Test Models

Intel BigDL
Intel provides BigDL for distributed deep learning applications in the Spark cluster.
And, provides XG
[Diagram: Deep Learning in Spark, showing DDL and DDL lib components]

XGBoost (Cont’d)
H2O
The open-source platform
H2O.ai focuses on bringing AI to businesses through software.
– H2O is its flagship product
H2O Sparkling Water
integrates H2O's scalable machine learning engine with Spark
– with the XGBoost library

XGBoost: H2O Sparkling Water [8]

Evaluation Measurement
Accuracy
Precision
– intended to reduce the number of false positives
AUC
Computing Time
Parallel computing to train models
– with GPU
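
The three quality measures listed above can be sketched for a two-class problem as follows. The labels and scores are made up, the 0.5 decision threshold is an assumption, and AUC is computed with the pairwise ranking definition rather than with any particular library.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision(y_true, y_pred):
    """TP / (TP + FP): drops as the number of false positives grows."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Requires at least one example of each class.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs both positive and negative examples")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = High Rating) and model scores
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(accuracy(y_true, y_pred), precision(y_true, y_pred), auc(y_true, scores))
```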

Comparing Computing Time of Spark Platforms

Comparing Accuracy of Spark Platforms

Comparing the performance
XGBoost models
BigDL with GPU
– is 25–50% faster in model training time than the other platforms
H2O Sparkling Water has 5–7% better AUC
– and 0.7% better Precision than the others


Conclusion
Compared the performance of the platforms in Big Data Predictive Analysis for large-scale data
– XGBoost in RAPIDS, BigDL, and Sparkling Water
BigDL with GPU
– fastest model training time
H2O Sparkling Water has better AUC and Precision than the others

Questions?

References
1. Muralidharan, S., Yadav, S., Huh, J., Lee, S., & Woo, J. (June 2022). Scalable Prediction Models for Airbnb Listing in Spark Big Data Cluster using GPU-accelerated RAPIDS. Journal of Information and Communication Convergence Engineering, 20(2), 96–102. https://doi.org/10.6109/JICCE.2022.20.2.96
2. Samyuktha Muralidharan, Savita Yadav, Sanghoon Lee, and Jongwook Woo, "Scalable Price Prediction Models of Hosting Business Leveraging Big Data with GPU", KSII The 17th Asia Pacific International Conference on Information Science and Technology (APIC-IST), June 19-21, 2022, pp. 95-97, ISSN 2093-0542
3. Savita Yadav, Samyuktha Muralidharan, Sanghoon Lee, and Jongwook Woo, "Scalable Predictive Analysis for Airbnb Listing Rating", KSII The 13th International Conference on Internet (ICONI) 2021, Dec 12-14, 2021, Jeju Island, Korea, pp. 370-372, ISSN 2093-0542
4. Why does XGBoost work so well?, https://www.kaggle.com/general/196541
5. Sparkling Water: Run on Hadoop, https://docs.h2o.ai/sparkling-water/3.2/latest-stable/doc/install/install_and_start.html#run-on-hadoop
6. Sparkling Water Booklet, https://docs.h2o.ai/sparkling-water/3.0/latest-stable/booklet/SparklingWaterBooklet.pdf
7. NVIDIA RAPIDS, https://github.com/NVIDIA/spark-rapids
8. Lesson 1, Sparkling Water Training, https://aquarium.h2o.ai/lab/8

Big Data Cluster
Linearly scalable for more storage and computing power
