Big Telco 

Real-Time Network Analytics
Yousun Jeong
Who am I
• Senior Software Engineer at SK Telecom, South Korea’s largest wireless communications provider
• Work on commercial products (~ ’15)
- Worked with Hadoop DW
- Worked with IaaS (OpenStack)
- Worked with PaaS (CloudFoundry)
• Mail to: jerryjung@sk.com
Table of Contents
1. Big Data in SK Telecom
2. Benefit of Spark
3. Spark Real Workload: Real-Time Network Analytics
4. Ongoing R&D
Big Data in SKT in a Nutshell
✓ Data Size
- Currently collecting 250 TB/day
✓ Big Data Management Infrastructure
- Hadoop cluster (1400+ nodes); migrated from MPP RDBMS
✓ Use cases
- Real-Time Analytics of Base Stations
- Network Enterprise DW
✓ Ongoing R&D
- SKT Hadoop DW Appliance with H/W acceleration
SKT Hadoop Infrastructure
Operating a Hadoop cluster of over 1400 nodes (30 PB+)
• Optimized configuration
• Fault-tolerant and effective resource management system
[Figure: SKT Hadoop Infra. Labels: Data Collector (data collection & pre-processing); ~250 TB/day; Main Cluster (Analysis); R&D Cluster; Service Cluster; Service Logic; Repository; node counts (500+ node), (400+ node), (100+ node), (400+ node); workloads Marketing, NW Analytics, VoC; flows Data Feeding, Commercialize, Develop.]
SKT Big Data Reference Architecture
Designed to handle both real-time and batch data processing, as well as high-level analysis, using Spark as a core technology
【 Components 】
- Batch Processing Layer: Hadoop EDW
- Real-Time Processing Layer: Real Time Analysis
- Analytics Layer
[Figure: layered architecture. Labels: Interface Layer (Flume, Kafka); Batch Layer (HDFS (Data Mart), Oozie (workflow), Hive (ETL), Spark (ETL)); Real-Time Layer (Spark Streaming, NoSQL, Elasticsearch, HDFS); Analytics Layer (Spark SQL, Spark MLlib, Spark GraphX, SparkR); Data Service Layer (BI, Legacy App); YARN (Unified Resource Manager); H/W Accelerator (SSD, FPGA); Cluster Manager (Ambari); layers numbered 1-3.]
Benefit of Spark
Why SKT uses Spark
Spark helps us gain processing speed and implement various big data applications quickly and easily
▪ Support for Event Stream Processing
▪ Fast Data Queries in Real Time
▪ Improved Programmer Productivity
▪ Fast Batch Processing of Large Data Sets
Use cases: Summary
Network Enterprise DW
• End-to-end network quality assurance and fault analysis in a timely manner
APOLLO (Network analytics)
• Real-time analysis of the radio access network to improve operational efficiency
Use case 1: Requirements & Challenges
Requirements
• Timely Processing
• Quick Response
Parallelism
• Executors
• Partitions
• Using Akka
[Figure: real-time pipeline. Labels: DC, Parser, Data Collector (Flume), Kafka Producer, Kafka Broker, Kafka Topic, Spark Streaming (Kafka Direct Stream; 1 minute window, 10 s), HDFS, ES, Real-Time Dashboard (10 s), Spark SQL, BI Analysis (JDBC/ODBC), Spark MLlib; steps numbered 1-6.]
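The use case 1 pipeline is labeled with a "1 minute window" and "10 s". A minimal sketch of that idea follows. It assumes the labels mean a 60-second sliding window re-evaluated every 10 seconds. It is plain Python rather than Spark Streaming, and the event shape `(timestamp_seconds, cell_id)` and the function name `sliding_counts` are hypothetical, chosen only for illustration.

```python
from collections import deque

WINDOW_SECONDS = 60  # "1 minute window" label from the slide
SLIDE_SECONDS = 10   # "10 s" label, assumed here to be the slide interval


def sliding_counts(events, window=WINDOW_SECONDS, slide=SLIDE_SECONDS):
    """Yield (window_end, {cell_id: count}) once per slide interval.

    `events` is an iterable of (timestamp_seconds, cell_id) pairs sorted by
    timestamp, with time measured from the start of the stream.
    """
    events = list(events)
    if not events:
        return
    buffered = deque()  # events currently inside the window
    next_index = 0
    window_end = slide
    last_timestamp = events[-1][0]
    while window_end - slide <= last_timestamp:
        # Admit events that arrived before this window closes.
        while next_index < len(events) and events[next_index][0] < window_end:
            buffered.append(events[next_index])
            next_index += 1
        # Evict events that fell out of the trailing window.
        while buffered and buffered[0][0] < window_end - window:
            buffered.popleft()
        counts = {}
        for _, cell_id in buffered:
            counts[cell_id] = counts.get(cell_id, 0) + 1
        yield window_end, counts
        window_end += slide
```

Each emitted result covers the most recent 60 seconds, so consecutive results overlap by 50 seconds. That overlap is what lets a dashboard refresh every 10 seconds while still showing a full minute of data.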
Use case 2: Hadoop based Enterprise DW
Hadoop DW can handle telco big data with scalability & cost efficiency
【 Legacy IT Infrastructure 】 “High-price, High-performance Proprietary IT Infrastructure System”
- Data: Structured Data; Scale-up Structure (Terabyte)
- H/W: High Performance H/W (MPP, Fabric Switch, etc.)
- S/W: Proprietary S/W (RDBMS, etc.)
【 SKT DW Infrastructure 】 “Hadoop S/W and Commodity H/W Based Cost-effective IT Infrastructure System”
- Data: Structured/Un-structured Data; Scale-out Structure (Petabyte, Exabyte)
- H/W: Commodity H/W (x86 Server)
- S/W: Hadoop Architecture; SQL on Hadoop
Other labels: Transaction/Batch Processing (SQL); Hadoop File System
※ MPP: Massively Parallel Processing, SAN: Storage Area Network, NAS: Network Attached Storage, RDBMS: Relational DB Management System
Use case 2: Network Enterprise DW
Data scientists need a unified platform that collects data from all network equipment for management and analysis
Expected advantages
• Unification of 130+ legacy DBMSs, each of which managed a separate network monitoring system, enabling thorough analysis over the entire network
• Quick and accurate identification of root causes of network failure
[Figure: two panels. [ Current ] Siloed Data & IT Management: NMS #1 … NMS #N-1, each with its own DBMS, across Access NW, Core NW, and Transport. [ Goal (4Q, 2015) ] Network Enterprise DW: Legacy NMS #1 … #N, NEW NMS #1 … #N, Hadoop DW, Legacy DW, BI & Analytics.]
Use case 2: Network Enterprise DW
Network EDW is a Hadoop-based data warehouse built on Spark for various network statistics or raw data
User Benefits
• End-to-end quality assurance, fault analysis
• Reduces analysis lead time (days → minutes)
• Saves TCO (1/5 less than legacy DW)
Hadoop DW
• Spark SQL functions and query optimizer
• Bulk loading and timely processing of large data
• SSD caching applied for performance enhancement
[Figure: Network EDW data flow. Labels: Access, Core, Transport, IP (each with an EMS); T-Pani; ODS; ETL; Hadoop DW (DW Data, Data Mart); SQL on Hadoop (Spark SQL); Analytics SQL; MQE* (Meta Query Engine); H/W Accelerator (SSD Caching); BI.]
* MQE (Meta Query Engine): integrated querying across heterogeneous databases, including Hadoop.
Use case 2: Meta Query Engine
https://github.com/bitnine-oss/octopus
Features
1. Subset of ANSI SQL
2. Queries on multiple databases, including Spark SQL and Oracle
3. SQL-based authorization
4. User authentication
5. Unified schema view
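To make "queries on multiple databases" and "unified schema view" concrete, here is a rough analogy using only Python's standard `sqlite3` module. It is not how Octopus is implemented. Two attached in-memory databases stand in for Spark SQL and a legacy RDBMS, and the schema names (`spark_dw`, `legacy_db`), tables, and rows are all made up for illustration.

```python
import sqlite3


def build_unified_view():
    """Create two separate databases and one view that spans both."""
    conn = sqlite3.connect(":memory:")
    # Each ATTACH creates an independent in-memory database (hypothetical names).
    conn.execute("ATTACH DATABASE ':memory:' AS spark_dw")
    conn.execute("ATTACH DATABASE ':memory:' AS legacy_db")
    conn.execute("CREATE TABLE spark_dw.cell_kpi (cell_id TEXT, drop_rate REAL)")
    conn.execute("CREATE TABLE legacy_db.cell_site (cell_id TEXT, region TEXT)")
    # Toy rows, invented purely for this sketch.
    conn.executemany(
        "INSERT INTO spark_dw.cell_kpi VALUES (?, ?)",
        [("c1", 0.02), ("c2", 0.10), ("c3", 0.04)],
    )
    conn.executemany(
        "INSERT INTO legacy_db.cell_site VALUES (?, ?)",
        [("c1", "Seoul"), ("c2", "Seoul"), ("c3", "Busan")],
    )
    # A TEMP view may reference tables in any attached database, so callers
    # see one schema and never name the underlying sources.
    conn.execute(
        """
        CREATE TEMP VIEW unified_cells AS
        SELECT k.cell_id, s.region, k.drop_rate
        FROM spark_dw.cell_kpi AS k
        JOIN legacy_db.cell_site AS s ON k.cell_id = s.cell_id
        """
    )
    return conn


def worst_drop_rate_by_region(conn):
    """Query the unified view as if it were a single table."""
    return conn.execute(
        "SELECT region, MAX(drop_rate) FROM unified_cells "
        "GROUP BY region ORDER BY region"
    ).fetchall()
```

The point of the sketch is the calling side: `worst_drop_rate_by_region` issues one SQL statement against one view, while the join across sources is hidden behind it.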
Use case 2: Requirements & Challenges
Requirements
• Timely Processing - ETL
• Integrated BI Tools
• Quick Response
[Figure: NW EDW deployment. Labels: WEB, BI, MDS #1, MQE #1, Octopus, HA Proxy, Thrift Server #1, Thrift Server #2, Spark SQL, Meta Store, ETL, Spark, YARN, HDFS, NW EDW # 96; steps numbered 1-4.]
Use case 2: YARN (Dynamic Resource Allocation)
Configuration
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.dynamicAllocation.minExecutors 50
spark.dynamicAllocation.maxExecutors 150
spark.dynamicAllocation.initialExecutors 50
spark.dynamicAllocation.cachedExecutorIdleTimeout 600
spark.dynamicAllocation.executorIdleTimeout 5
spark.dynamicAllocation.schedulerBacklogTimeout 5
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout 5
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
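Two constraints are implicit in the dynamic allocation settings above: the external shuffle service has to be enabled alongside dynamic allocation, and the initial executor count should fall between the minimum and maximum. A small sketch of a pre-deployment check for both follows. The helper names `check_dynamic_allocation` and `render_spark_defaults` are hypothetical and not part of Spark.

```python
# Values copied from the slide; the dict itself is only an illustration.
DYNAMIC_ALLOCATION = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "50",
    "spark.dynamicAllocation.maxExecutors": "150",
    "spark.dynamicAllocation.initialExecutors": "50",
}


def check_dynamic_allocation(conf):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    dynamic = conf.get("spark.dynamicAllocation.enabled") == "true"
    shuffle = conf.get("spark.shuffle.service.enabled") == "true"
    if dynamic and not shuffle:
        problems.append("dynamic allocation needs the external shuffle service")
    lowest = int(conf["spark.dynamicAllocation.minExecutors"])
    highest = int(conf["spark.dynamicAllocation.maxExecutors"])
    initial = int(conf["spark.dynamicAllocation.initialExecutors"])
    if not lowest <= initial <= highest:
        problems.append("initialExecutors must lie between min and max")
    return problems


def render_spark_defaults(conf):
    """Render the settings as 'key value' lines, as in spark-defaults.conf."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())
```

Running `check_dynamic_allocation(DYNAMIC_ALLOCATION)` before writing the file catches mismatched bounds early, rather than at application submit time.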
Use case 2: BI Integration
Configuration
spark.sql.thriftServer.incrementalCollect true
spark.driver.maxResultSize 10g
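The `incrementalCollect` setting is generally described as making the Thrift server return results one partition at a time instead of gathering the whole result set on the driver first. The sketch below illustrates that memory trade-off with plain Python lists. It is a simplified model, not Spark's implementation, and every function name in it is hypothetical.

```python
def collect_all(partitions):
    """Materialize every row up front, like a full collect on the driver."""
    return [row for partition in partitions for row in partition]


def collect_incrementally(partitions):
    """Yield rows while holding only one partition in memory at a time."""
    for partition in partitions:
        rows = list(partition)  # only this partition is buffered
        yield from rows


def peak_rows_buffered(partitions, incremental):
    """Return the largest number of rows buffered at once under each mode."""
    sizes = [len(partition) for partition in partitions]
    if not sizes:
        return 0
    return max(sizes) if incremental else sum(sizes)
```

Both modes return the same rows in the same order. The difference is the peak buffer size, which is why a large `spark.driver.maxResultSize` matters most when results are collected in full.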
Use case 2: Patches
SPARK-7792 - HiveContext registerTempTable not thread safe
SPARK-7936 - Add configuration for initial size and limit of hash for aggregation
SPARK-8153 - Add configuration for disabling partial aggregation in runtime
SPARK-8285 - CombineSum should be calculated as unlimited decimal first
SPARK-8312 - Populate statistics info of hive tables if it's needed to be
SPARK-8333 - Spark failed to delete temp directory created by HiveContext
SPARK-8334 - Binary logical plan should provide more realistic statistics
SPARK-8357 - Memory leakage on unsafe aggregation path with empty input
SPARK-8420 - Inconsistent behavior with Dataframe Timestamp between 1.3.1 and 1.4.0
SPARK-8552 - Using incorrect database in multiple sessions
SPARK-8707 - RDD#toDebugString fails if any cached RDD has invalid partitions
SPARK-8826 - Fix ClassCastException in GeneratedAggregate
SPARK-9685 - Unsupported dataType: char(X) in Hive
SPARK-10151 - Support invocation of hive macro
SPARK-10152 - Support Init script for hive-thriftserver
SPARK-10679 - javax.jdo.JDOFatalUserException in executor
SPARK-10684 - StructType.interpretedOrdering need not to be serialised
SPARK-10216 - Avoid creating empty files during overwrite into Hive table with group by query
Open Issues
Use case 2: Performance
[Figure: TPC-H]
Use case 2: Performance
[Figure: Job Server]
Hadoop DW Appliance (ongoing)
Develop a Hadoop DW appliance combining an optimized S/W layer and H/W acceleration
【 SKT Hadoop DW Appliance 】
1. Core Software Solution
▪ Develop Interactive Spark SQL
▪ Develop Meta Query Engine
2. Hardware Acceleration
▪ Develop Flash Storage-based I/O Acceleration
▪ Develop FPGA-based CPU Acceleration
3. Management & Automation
▪ Develop Data & System Security
▪ Workload Optimization & Automation
4. Industry Oriented Solution
▪ Fault Detection & Classification in Manufacturing
▪ Mobile Network Data Analytic Solution
▪ Unstructured Data Collection/Processing Solution
[Figure: appliance stack. Labels: DW Management Layer (Monitoring, DB Migration, Security, Optimization, Packaging); Data Processing Layer (* Meta Query Engine; SQL Engine/Storage: * SPARK, HIVE, Legacy RDBMS; Hadoop Storage, DB Storage); H/W Acceleration Layer (* Flash based I/O Accelerator, * FPGA Accelerator); Industry Oriented Solution (FDC, Telco, others); numbered 1-4.]
Thank You!
