Jongwook Woo
BigDAI
CalStateLA
ICONI 2022
Jeju Island, Korea
December 12, 2022
Jongwook Woo (jwoo5@calstatela.edu)
Savita Yadav, Samyuktha Muralidharan,
Big Data AI Center (BigDAI)
California State University, Los Angeles
Comparing Scalable Predictive Analysis
using Spark XGBoost Platforms
Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Contents
Why XGBoost
Dataset
Big Data Platforms
Comparison of Experimental Results
Conclusion
Why XGBoost
XGBoost stands for Extreme Gradient Boosting.
Famous for having won many Kaggle competitions.
Tree ensemble with distributed gradient boosting; key features, based on [4]:
– hardware optimization
– parallelized tree building
– tree pruning using a ‘depth-first’ approach
– regularization through both LASSO (L1) and Ridge (L2) to avoid overfitting
– efficient handling of missing data
– built-in cross-validation capability (at each iteration)
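The L1 and L2 regularization listed above can be illustrated with a small sketch of how the weight of a single tree leaf is shrunk. This is a simplified textbook form, not code taken from the slides or from the XGBoost source; the names `leaf_weight`, `grad_sum`, `hess_sum`, `alpha`, and `lam` are chosen here for illustration, with `alpha` standing for the L1 strength and `lam` for the L2 strength.

```python
def leaf_weight(grad_sum: float, hess_sum: float,
                alpha: float = 0.0, lam: float = 1.0) -> float:
    """Weight of one tree leaf under L1 (alpha) and L2 (lam) penalties.

    grad_sum and hess_sum are the summed first and second derivatives of
    the loss over the training rows that fall into the leaf.
    """
    # L1 (LASSO): soft-threshold the summed gradient toward zero.
    if abs(grad_sum) <= alpha:
        return 0.0  # the penalty outweighs the evidence, so the leaf is zeroed
    shrunk = grad_sum - alpha if grad_sum > 0 else grad_sum + alpha
    # L2 (Ridge): a larger denominator pulls the weight toward zero.
    return -shrunk / (hess_sum + lam)


# Same leaf statistics, increasingly strong regularization.
print(leaf_weight(-10.0, 4.0, alpha=0.0, lam=0.0))   # no penalty
print(leaf_weight(-10.0, 4.0, alpha=0.0, lam=1.0))   # L2 only
print(leaf_weight(-10.0, 4.0, alpha=2.0, lam=1.0))   # L1 and L2
print(leaf_weight(-10.0, 4.0, alpha=20.0, lam=1.0))  # L1 large enough to zero the leaf
```

Each added penalty moves the leaf weight closer to zero, which is how the two terms limit overfitting.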
Big Data AI Center: Regression Models [1-3]
Big Data: distributed parallel computing
Regression models to predict the price of Airbnb listings
RAPIDS XGBoost models
– 74% higher accuracy and up to 90% faster than the traditional models.
In Spark EMR, XGBoost using GPU
– is 17% faster than using only CPU.
Big Data AI Center: Performance of Models [1-3]
[Figure: bar chart of Train Time (sec), scale 0 to 600, for XGBoost CPU, XGBoost GPU, DT, RF, and GBT models on Linux file system, HDFS, and S3 storage, run on RayDP and EMR]
Big Data AI Center: Performance of Models [1 – 3]
Cluster      Models        Training Time (sec)   RMSE    R2
HDFS         XGBoost GPU   15.4                  33.42   0.731
S3           DT            31.1                  130.4   0.238
S3           GBT           173.2                 128.7   0.248
S3           RF            48.2                  130.1   0.247
S3           XGBoost CPU   21.5                  33.4    0.731
RayDP EXT4   XGBoost CPU   4.3                   31.1    NA
Muralidharan, S., Yadav, S., Huh, J., Lee, S., & Woo, J. (June 2022). “Scalable Prediction Models for Airbnb Listing in Spark Big Data Cluster using GPU-accelerated RAPIDS”, Journal of Information and Communication Convergence Engineering, 20(2), 96–102. https://doi.org/10.6109/JICCE.2022.20.2.96
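The relative gains quoted for the earlier regression work can be checked against the table with a short calculation. The sketch below assumes that "higher accuracy" refers to the reduction in RMSE and "faster" to the reduction in training time; that reading is an assumption, since the slides do not define either term. The helper name `percent_reduction` is hypothetical.

```python
# Training time (sec) and RMSE copied from the table above.
results = {
    "XGBoost GPU (HDFS)": {"time": 15.4, "rmse": 33.42},
    "DT (S3)": {"time": 31.1, "rmse": 130.4},
    "GBT (S3)": {"time": 173.2, "rmse": 128.7},
    "RF (S3)": {"time": 48.2, "rmse": 130.1},
}


def percent_reduction(baseline: float, new: float) -> float:
    """How much smaller `new` is than `baseline`, as a percentage of `baseline`."""
    return (baseline - new) / baseline * 100.0


xgb = results["XGBoost GPU (HDFS)"]
for name, row in results.items():
    if name == "XGBoost GPU (HDFS)":
        continue
    time_gain = percent_reduction(row["time"], xgb["time"])
    rmse_gain = percent_reduction(row["rmse"], xgb["rmse"])
    print(f"vs {name}: {time_gain:.1f}% less training time, {rmse_gain:.1f}% lower RMSE")
```

Under this reading, the comparison against GBT gives roughly a 91% reduction in training time and a 74% reduction in RMSE, which lines up with the "up to 90% faster" and "74% higher accuracy" figures.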
Dataset Details
Dataset:
Airbnb Listings
Dataset URL:
– https://public.opendatasoft.com/explore/dataset/airbnb-listings/table/?disjunctive.host_verifications&disjunctive.amenities&disjunctive.features
Total Dataset size: 4 GB
Dataset Format: CSV
Not easy to handle with traditional systems
– in data engineering, analysis, and science
Goal: predict whether a listing has a good rating or not.
Using two-class classification algorithms to build a model
– to classify the listings as high rated or low rated, based on the features of the listing.
Dataset Details (Cont’d)
Data Transformation
Label: Review Scores Rating
– Converting the Review Scores Rating column to binary class
• Review Scores Rating >= 80 -> High Rating
• Review Scores Rating < 80 -> Low Rating
Features: 25 columns
– "Host Listings Count", "Host Total Listings Count", "Calculated host listings count",
"Security Deposit", "Cleaning Fee", "Host Response Time", "Host Response Rate",
"Host Acceptance Rate", "Property Type", "Room Type", "Weekly Price", "Monthly Price",
"Maximum Nights", "Review Scores Accuracy", "Review Scores Cleanliness",
"Review Scores Checkin", "Review Scores Communication", "Review Scores Location",
"Review Scores Value", "Cancellation Policy", "Bedrooms", "Bathrooms", "Beds",
"Extra People", "Minimum Nights"
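The label transformation described above (Review Scores Rating of 80 or more is High Rating, otherwise Low Rating) can be sketched in plain Python. This is only an illustration of the rule, not the authors' Spark code: the function name `to_label`, the 1/0 encoding, the sample rows, and the choice to return `None` for a missing score are all assumptions made here.

```python
HIGH_RATING_THRESHOLD = 80  # from the slide: >= 80 is High Rating, < 80 is Low Rating


def to_label(review_scores_rating):
    """Return 1 for High Rating, 0 for Low Rating, None when the score is missing.

    Passing missing scores through as None is an assumption; the slides do not
    say how listings without a rating were handled.
    """
    if review_scores_rating is None:
        return None
    return 1 if review_scores_rating >= HIGH_RATING_THRESHOLD else 0


# Hypothetical rows using two of the column names listed on the slide.
rows = [
    {"Review Scores Rating": 95.0, "Bedrooms": 2},
    {"Review Scores Rating": 80.0, "Bedrooms": 1},
    {"Review Scores Rating": 79.0, "Bedrooms": 3},
    {"Review Scores Rating": None, "Bedrooms": 1},
]

labeled = [dict(row, label=to_label(row["Review Scores Rating"])) for row in rows]
for row in labeled:
    print(row["Review Scores Rating"], "->", row["label"])
```

The boundary case matters: a score of exactly 80 falls in the High Rating class.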
Need for Big Data Predictive Analysis
Performance issues with traditional predictive models
The data set is 4 GB.
Traditional systems mostly
– generate a memory error
– or take several hours or days to build the models.
Need Big Data
Inexpensive platform: a distributed parallel computing system
– that can store large-scale data and process it in parallel
Apache Hadoop (since 2006) and Spark
– Inexpensive supercomputer
– Any small company or university lab can own one
Big Data: Distributed Computing Systems
Hadoop Spark cluster
in-memory computing engine
– started at UC Berkeley's AMPLab in 2009 and open-sourced in 2010.
– Spark can be up to 100 times faster than Hadoop MapReduce in theory
– Supports Machine Learning Algorithms
More platforms can leverage a Spark cluster for XGBoost
NVIDIA RAPIDS
Intel BigDL
H2O Sparkling Water
How do these platforms perform?
XGBoost
AWS EMR
EMR Hadoop Spark Big Data cluster
– EMR 6.7.0:
• Hadoop 3.2.1, Spark 3.2.1
• XGBoost is not included
– 3-node cluster
XGBoost support
– NVIDIA RAPIDS
• set up RAPIDS on EMR Spark
– to build an XGBoost model with GPU and CPU
– BigDL
– Sparkling Water
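Setting up RAPIDS on a Spark cluster mostly comes down to Spark configuration. The sketch below lists the kind of properties typically involved; the slides do not show the settings actually used, so the values are placeholders and the helper `to_submit_args` is hypothetical. The property names come from the RAPIDS Accelerator for Apache Spark and from Spark's GPU resource scheduling.

```python
# Illustrative Spark properties for GPU execution; values are placeholders,
# not the configuration used in the experiments.
rapids_conf = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",  # loads the RAPIDS Accelerator
    "spark.rapids.sql.enabled": "true",             # run supported SQL/DataFrame ops on GPU
    "spark.executor.resource.gpu.amount": "1",      # GPUs requested per executor
    "spark.task.resource.gpu.amount": "1",          # GPUs requested per task
}


def to_submit_args(conf: dict) -> list:
    """Turn a property dict into a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(["spark-submit"] + to_submit_args(rapids_conf) + ["train_xgboost.py"]))
```

Here `train_xgboost.py` is a made-up script name. Keeping the properties in one dictionary makes it easy to switch between the GPU and CPU-only runs being compared by adding or dropping the GPU entries.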
AWS EMR Big Data Cluster
PySpark, RAPIDS, BigDL, and Sparkling Water installed
AWS EMR Big Data Cluster with GPU: 2 x g4dn.2xlarge, m3.xlarge
1. Airbnb files: HDFS
2. Data Read & Engineering
3. Train and Test Models
Intel BigDL
Intel provides BigDL for distributed Deep Learning applications in the Spark cluster.
And, provides XG
[Figure: Deep Learning in Spark, with DDL libraries]
XGBoost (Cont’d)
H2O
The open-source platform
H2O.ai focuses on bringing AI to businesses through software.
– H2O is its flagship product
H2O Sparkling Water
integrates H2O's scalable machine learning engine with Spark
– with the XGBoost library
XGBoost: H2O Sparkling Water [8]
Evaluation Measurement
Accuracy
Precision
– Intended to reduce the number of false positives
AUC
Computing Time
Parallel computing to train models
– With GPU
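The three quality measures above can be written out in a few lines of plain Python. This is a minimal sketch for small lists, not the evaluators used on the Spark platforms; the function names and the toy labels and scores are made up for illustration, with label 1 standing for High Rating.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision(y_true, y_pred):
    """TP / (TP + FP): drops as false positives increase."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


def auc(y_true, scores):
    """Chance that a random positive is scored above a random negative (ties count 0.5).

    Pairwise form, quadratic in the number of rows; fine for a toy example only.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: true labels and predicted probabilities of High Rating.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cut-off is an assumption

print("accuracy ", accuracy(y_true, y_pred))
print("precision", precision(y_true, y_pred))
print("auc      ", auc(y_true, scores))
```

Accuracy and precision depend on the chosen cut-off, while AUC is computed from the raw scores and does not.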
Comparing Computing Time of Spark Platforms
Comparing Accuracy of Spark Platforms
Comparing the performance
XGBoost models
BigDL with GPU
– is 25–50% faster in model training time than the other platforms.
H2O Sparkling Water has 5–7% better AUC
– and 0.7% better Precision than the others.
Conclusion
Compared the performance of platforms for Big Data Predictive Analysis on large-scale data
– XGBoost in RAPIDS, BigDL, and Sparkling Water
BigDL with GPU
– fastest in model training time.
H2O Sparkling Water has better AUC and Precision than the others
Questions?
References
1. Muralidharan, S., Yadav, S., Huh, J., Lee, S., & Woo, J. (June 2022). Scalable Prediction Models for Airbnb Listing in Spark Big Data Cluster using GPU-accelerated RAPIDS. Journal of Information and Communication Convergence Engineering, 20(2), 96–102. https://doi.org/10.6109/JICCE.2022.20.2.96
2. Samyuktha Muralidharan, Savita Yadav, Sanghoon Lee, and Jongwook Woo, "Scalable Price Prediction Models of Hosting Business Leveraging Big Data with GPU", KSII The 17th Asia Pacific International Conference on Information Science and Technology (APIC-IST), June 19-21 2022, pp95-97, ISSN 2093-0542
3. Savita Yadav, Samyuktha Muralidharan, Sanghoon Lee, and Jongwook Woo, "Scalable Predictive Analysis for Airbnb Listing Rating", KSII The 13th International Conference on Internet (ICONI) 2021, Dec 12-14 2021, Jeju Island, Korea, pp370-372, ISSN 2093-0542
4. Why does XGBoost work so well?, https://www.kaggle.com/general/196541
5. Sparkling Water: Run on Hadoop, https://docs.h2o.ai/sparkling-water/3.2/latest-stable/doc/install/install_and_start.html#run-on-hadoop
6. Sparkling Water Booklet, https://docs.h2o.ai/sparkling-water/3.0/latest-stable/booklet/SparklingWaterBooklet.pdf
7. NVIDIA RAPIDS, https://github.com/NVIDIA/spark-rapids
8. Lesson 1, Sparkling Water Training, https://aquarium.h2o.ai/lab/8
Big Data Cluster
Linearly scalable for more storage and computing power