Kevin, Between (VCNC)
kevin@between.us
Powering a Startup with
Apache Spark
#EUent8
Seoul, South Korea
Gangnam Hongdae
Itaewon Myungdong
Powering a Startup with Apache Spark with Kevin Kim
2011: 100 beta users
2012: 1.0 release, 2M downloads
2013: 5M downloads, global launches
2014: Between 2.0, 10M downloads
2015: Between 3.0
2016: Starts monetization, 20M downloads
2017: Global expansion, new business, team of 60
Kevin Kim
• From Seoul, South Korea

• Co-founder, formerly a product developer

• Now a data analyst, engineer, and team leader

• Founder of Korea Spark User Group

• Committer and PMC member of Apache Zeppelin
Between Data Team
Intro to Between Data Team
• 4 data engineers

– Manager: engineer with a broad range of knowledge and experience

– Junior engineer, formerly a server engineer

– Senior engineer with extensive experience and skills

– Data engineer, formerly a top-level Android developer

• Hiring a data analyst and a machine learning expert
What the Between Data Team Does
• Analysis

– Service monitoring

– Analyze usage of new features and build product strategies

• Data Infrastructure

– Build and manage infrastructure

– Spark, Zeppelin, AWS, BI tools, etc.

• Third Party Management

– Mobile attribution tools for marketing (Kochava, Tune, Appsflyer, etc.)

– Google Analytics, Firebase, etc.

– Ad Networks
What the Between Data Team Does (continued)
• Machine Learning Study & Research

– For the next business model

• Support team

– To build business, product, monetization strategies

• Performance Marketing Analysis

– Monitoring effectiveness of marketing budgets

• Product Development

– Improve client performance, server architecture, etc.
7 PM ~
Sunset @ Between Office
Technologies
Requirements
• Big Data
– 2 TB/day of log data from millions of daily active users (DAU)

– 20M users

• Small Team
– Team of 4, supporting 50

• Tiny Budget
– Company is just past its break-even point (BEP)

• Need a very efficient tech stack!
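
To put the stated log volume in perspective, here is a rough back-of-envelope calculation. It is only a sketch: it assumes decimal terabytes (the slide does not say TB or TiB) and an evenly spread load, which real traffic will not follow.

```python
# Back-of-envelope throughput for the stated 2 TB/day of logs.
# Assumption: decimal units (1 TB = 10**12 bytes) and a perfectly even load.
TB = 10**12

bytes_per_day = 2 * TB
seconds_per_day = 24 * 60 * 60

# Average sustained ingest rate in megabytes per second.
avg_mb_per_second = bytes_per_day / seconds_per_day / 10**6

# Raw volume accumulated over a 30-day month, in terabytes.
tb_per_30_days = bytes_per_day * 30 / TB

print(f"average ingest: {avg_mb_per_second:.1f} MB/s")
print(f"30-day volume:  {tb_per_30_days:.0f} TB")
```

Peak-hour rates would be higher than this average, which is one reason burst capacity matters for a small team.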
The Way We Work
• Use Apache Spark as a general processing engine

• Scriptify everything with Apache Zeppelin

• Heavy utilization of AWS and Spot instances to cut costs

• Proper selection of BI Dashboard Tools
Apache Spark, General Engine
• Definitely the best way to deal with big data (as you all know!)

• Its performance and agility exactly meet startup requirements

– Using Spark since 2014

• Great match with cloud services, especially Spot instances

– Takes advantage of the bursty nature of cloud services
Scriptify Everything with Zeppelin
• Doing everything on Zeppelin!

• Daily batch tasks in the form of Spark scripts (using the Zeppelin scheduler)

• Ad hoc analysis

• Cluster control scripts

• The world’s first user of Zeppelin!

• More than 200 Zeppelin notebooks
AWS Cloud
• Spot instances are my friend!

– Mostly use Spot instances for analysis

– Only 10 to 20% of the cost of on-demand instances

• Dynamic cluster launch with Auto Scale

– Launch clusters automatically for batch analysis

– Manually launch more clusters from Zeppelin with an Auto Scale script

– Automatically shrink clusters when idle
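
As a rough illustration of the "10 to 20% of on-demand cost" figure above, the sketch below compares monthly costs. The price, instance count, and hours are hypothetical placeholders, not figures from the talk, and actual Spot prices vary over time.

```python
# Rough cost comparison between on-demand and Spot capacity.
# All inputs below are hypothetical; only the 10-20% ratio comes from the slide.
ON_DEMAND_PRICE = 1.00  # hypothetical USD per instance-hour
INSTANCES = 20          # hypothetical cluster size
HOURS = 100             # hypothetical hours of analysis per month


def monthly_cost(instances, hours, hourly_price):
    """Total cost of running `instances` machines for `hours` hours."""
    return instances * hours * hourly_price


on_demand = monthly_cost(INSTANCES, HOURS, ON_DEMAND_PRICE)

# Spot cost range, using the 10% and 20% bounds stated on the slide.
spot_low = on_demand * 10 / 100
spot_high = on_demand * 20 / 100

print(f"on-demand: ${on_demand:.0f}")
print(f"spot:      ${spot_low:.0f} to ${spot_high:.0f}")
```

The saving only applies to work that tolerates interruption, which fits batch and ad hoc analysis better than always-on services.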
BI Dashboard Tools
• Use Zeppelin as a dashboard using Spark SQL with ZEPL

• Holistics (holistics.io) or Dash (plot.ly/products/dash/)
Questions & Challenges
RDD API or DataFrame API?
• Spark now has two very different styles of API

– Programmatic RDD API

– SQL-like DataFrame and Dataset APIs

• For many simple ad hoc queries

– DataFrame works

• For more complex, deep-dive analytic questions

– RDD works

• For the time being, mostly use RDD; DataFrame for ML or simple ad hoc tasks
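
The contrast between the two styles can be sketched without a cluster. The example below is not Spark code: it uses plain Python functions as a stand-in for the programmatic (RDD-like) style and the standard-library `sqlite3` module as a stand-in for the SQL-like (DataFrame-like) style. The event data and column names are hypothetical.

```python
import sqlite3

# Hypothetical (user_id, action) events; not data from the talk.
events = [
    ("alice", "photo"),
    ("bob", "chat"),
    ("alice", "chat"),
    ("carol", "photo"),
    ("alice", "chat"),
]

# Programmatic style: filter, map to (key, 1) pairs, then reduce by key.
# This mirrors the shape of an RDD pipeline, written here in plain Python.
pairs = map(lambda e: (e[0], 1), filter(lambda e: e[1] == "chat", events))
counts = {}
for user_id, n in pairs:
    counts[user_id] = counts.get(user_id, 0) + n
programmatic = sorted(counts.items())

# SQL-like style: declare the result and let the engine plan the work.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, action TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", events)
declarative = conn.execute(
    "SELECT user_id, COUNT(*) FROM events "
    "WHERE action = 'chat' GROUP BY user_id ORDER BY user_id"
).fetchall()
conn.close()

print(programmatic)
print(declarative)
```

For a simple count like this the SQL form is shorter; the programmatic form becomes attractive when each step needs arbitrary custom logic.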
Sushi or Cooked Data?
• Keep data in as raw a form as possible!

– ETL usually causes trouble and increases management cost

– The Sushi Principle (Joseph & Robert at Strata)

– Drastically reduces operation and management cost

– Apache Spark is a great tool for extracting insight from raw data
fresh data!
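
A minimal sketch of the raw-data idea: instead of maintaining a cleaned table, parse the raw log lines at query time and skip anything malformed. The JSON log format, field names, and helper functions are hypothetical, and plain Python stands in for Spark here.

```python
import json

# Hypothetical raw log lines, kept exactly as they were written.
RAW_LOG_LINES = [
    '{"ts": "2017-10-24T19:00:01", "user": "u1", "event": "open_app"}',
    '{"ts": "2017-10-24T19:00:05", "user": "u2", "event": "send_message"}',
    "corrupted line",
    '{"ts": "2017-10-24T19:01:12", "user": "u1", "event": "send_message"}',
]


def parse(line):
    """Parse one raw line; return None for anything that is not a JSON object."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    return record if isinstance(record, dict) else None


def count_events(lines, event_name):
    """Count records of one event type directly from raw lines."""
    records = (parse(line) for line in lines)
    return sum(
        1 for r in records if r is not None and r.get("event") == event_name
    )


messages = count_events(RAW_LOG_LINES, "send_message")
print(messages)
```

The parsing cost is paid on every query, which is the trade-off for not having an ETL pipeline to operate.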
To Hire Data Analyst or Not?
• Skills expected of a data analyst:

– Excel, SQL, R, ..

• Skills not expected:

– Programmatic APIs like Spark RDD

– Cooking raw data

• Prefer a data engineer with analytic skills

• May need to add some ETL tasks to work with a data analyst
Better, Faster Team Support?
• Better: Zeppelin is great for analyzing data, but not enough for sharing data with the team

– We have very few alternatives

– Increase use of BI dashboard tools?

– Still looking for a good way

• Faster: launching a Spark cluster takes a few minutes

– Not bad, but we want it faster

– Google BigQuery or AWS Athena

– SQL Database with ETL
Future Plan?
• Prepare for exploding # of data operations!

– Team is growing, business is growing

– # of tasks

– # of 3rd party data products

– Communication cost

• Operations with machine learning & deep learning

– Better way to manage task & data flow
Let’s wrap up..
What Matters for Us
• Support Team

– Each team should see the right data and make good decisions from it

– Regular meetings, fast responses to ad hoc data requests

– Ultimately, everything we do should relate to the company’s business

• Technical Lead

– Technical investments that build the competence of both the company and individuals

– Working at Between should be a great experience for each individual

• Social Impact

– Does our work have a valuable impact on society?

– Open source, community activity
How Is Apache Spark Powering a Startup?
• One great general-purpose tool

– Daily batch tasks

– Agile, ad hoc analysis

– Drawing dashboards

– Many more..

• Helps save time and reduce the cost of data operations

• Great experience for engineers and analysts

• Sharing know-how with the community
Working as a Data Engineer at a Startup
• Fascinating, fast evolution of tech

• Requires hard work and labor

• Data work will shine only when it is understood and used by teammates
Two Peasants Digging, Vincent van Gogh
Two Men Digging, Jean-Francois Millet
Thank you!
