Powering a Startup with Apache Spark with Kevin Kim

Kevin, Between (VCNC)
kevin@between.us
#EUent8

Gangnam Hongdae
Itaewon Myungdong

2011: 100 beta users
2012: 1.0 release, 2M downloads
2013: 5M downloads, global launches
2014: Between 2.0, 10M downloads
2015: Between 3.0
2016: Starts monetization, 20M downloads
2017: Global expansion, new business, team of 60

Kevin Kim
• Came from Seoul, South Korea

• Co-founder, used to be a product developer

• Now a data analyst, engineer, team leader

• Founder of Korea Spark User Group

• Committer and PMC member of Apache Zeppelin

Intro to Between Data Team
• Data engineer * 4

– Manager, engineer with a varied stack of knowledge and experience

– Junior engineer, used to be a server engineer

– Senior engineer, has lots of experience and skills

– Data engineer, used to be a top level Android developer

• Hiring data analyst and machine learning expert

Between Data Team is doing..
• Analysis

– Service monitoring

– Analyze usage of new features and build product strategies

• Data Infrastructure

– Build and manage infrastructure

– Spark, Zeppelin, AWS, BI Tools, etc

• Third Party Management

– Mobile attribution tools for marketing (Kochava, Tune, AppsFlyer, etc)

– Google Analytics, Firebase, etc

– Ad Networks

Between Data Team is doing..
• Machine Learning Study & Research

– For the next business model

• Support team

– To build business, product, monetization strategies

• Performance Marketing Analysis

– Monitoring effectiveness of marketing budgets

• Product Development

– Improve client performance, server architecture, etc

7 PM ~
Sunset @ Between Office

Requirements
• Big Data
– 2TB/day of log data from millions of DAU

– 20M users

• Small Team
– Team of 4, need to support 50

• Tiny Budget
– Company is just over BEP (Break Even Point)

• Need very efficient tech stack!

Way We Work
• Use Apache Spark as a general processing engine

• Scriptify everything with Apache Zeppelin

• Heavy utilization of AWS and Spot instances to cut cost

• Proper selection of BI Dashboard Tools

Apache Spark, General Engine
• Definitely the best way to deal with big data (as you all know!)

• Its performance and agility exactly meet startup requirements

– We have used Spark since 2014

• Great match with cloud services, especially with Spot instances

– Utilizing the burst nature of cloud services

Scriptify Everything with Zeppelin
• Doing everything on Zeppelin!

• Daily batch tasks in the form of Spark scripts (using Zeppelin scheduler)

• Ad hoc analysis

• Cluster control scripts

• The world's first user of Zeppelin!

• More than 200 Zeppelin notebooks

AWS Cloud
• Spot Instance is my friend!

– Mostly use Spot instances for analysis

– Only 10 ~ 20% of the cost compared to on-demand instances

• Dynamic cluster launch with Auto Scale

– Launch clusters automatically for batch analysis

– Manually launch more clusters on Zeppelin, with Auto Scale script

– Automatically shrink clusters when they are not in use
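
The cost claim above can be checked with simple arithmetic. This is a minimal sketch, not the team's actual script: the function name and the 1,000 (any currency unit) monthly on-demand bill are hypothetical, and only the 10 ~ 20% range comes from the slide.

```python
def spot_cost_range(on_demand_cost, low_pct=10, high_pct=20):
    """Return the (low, high) Spot bill for a given on-demand bill.

    low_pct / high_pct default to the 10 ~ 20% range quoted on the slide.
    """
    return (on_demand_cost * low_pct / 100, on_demand_cost * high_pct / 100)


# Hypothetical example: a workload that would cost 1,000 per month on-demand.
ON_DEMAND_MONTHLY = 1000
spot_low, spot_high = spot_cost_range(ON_DEMAND_MONTHLY)

# Savings are whatever is left after paying the Spot bill.
savings_low = ON_DEMAND_MONTHLY - spot_high
savings_high = ON_DEMAND_MONTHLY - spot_low

print(f"Spot bill: {spot_low} to {spot_high}")
print(f"Savings:   {savings_low} to {savings_high}")
```

In other words, under these assumptions the same analysis workload costs roughly one fifth to one tenth of the on-demand price.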

BI Dashboard Tools
• Use Zeppelin as a dashboard using Spark SQL with ZEPL

• Holistics (holistics.io) or Dash (plot.ly/products/dash/)

RDD API or DataFrame API?
• Spark now has two very different styles of API

– Programmatic RDD API

– SQL-like DataFrame / Dataset API

• When you have many simple ad hoc queries

– DataFrame works

• When you have more complex, deep-dive analytic questions

– RDD works

• For now, we mostly use RDD, and DataFrame for ML or simple ad hoc tasks

Sushi or Cooked Data?
• Keep data in as raw a form as possible!

– ETL usually causes trouble and increases management cost

– The Sushi Principle (Joseph & Robert in Strata)

– Drastically reduce operation & management cost

– Apache Spark is a great tool for extracting insight from raw data
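
One way to read the points above: parse raw log lines at analysis time instead of maintaining a separate ETL pipeline. The sketch below is plain Python under stated assumptions: the JSON-lines format and the field names (`user_id`, `action`, `ts`) are hypothetical, not Between's actual log schema. In Spark, a function like `parse_line` could be passed to `flatMap` on an RDD created with `sc.textFile(...)`.

```python
import json
from collections import Counter

# Hypothetical raw log lines, kept exactly as they were written.
RAW_LOG_LINES = [
    '{"user_id": "u1", "action": "send_message", "ts": 1500000000}',
    '{"user_id": "u2", "action": "upload_photo", "ts": 1500000005}',
    'corrupted line without json',
    '{"user_id": "u1", "action": "send_message", "ts": 1500000010}',
]


def parse_line(line):
    """Return [event] for a valid JSON object line, or [] for anything else.

    Returning a list lets the same function be used with a flatMap-style
    operation, so bad lines simply disappear from the result.
    """
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return []
    return [event] if isinstance(event, dict) else []


def count_actions(lines):
    """Count events per action directly from raw lines, with no ETL step."""
    counts = Counter()
    for line in lines:
        for event in parse_line(line):
            counts[event.get("action", "unknown")] += 1
    return counts


action_counts = count_actions(RAW_LOG_LINES)
print(dict(action_counts))
```

Because parsing happens at read time, a change in the question only means changing the script, not rebuilding stored tables.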
fresh data!

To Hire Data Analyst or Not?
• For a data analyst, the expected skill set is..

– Excel, SQL, R, ..

• These skills are not expected..

– Programmatic APIs like Spark RDD

– Cooking raw data

• Prefer a data engineer with analytic skills

• May need to add some ETL tasks to work with a data analyst

Better, Faster Team Support?
• Better - Zeppelin is great for analyzing data, but not enough for sharing data with the team

– We have very few alternatives

– Increase our use of BI dashboard tools?

– Still looking for a good way

• Faster - Launching a Spark cluster takes a few minutes

– Not bad, but we want it faster

– Google BigQuery or AWS Athena

– SQL Database with ETL

Future Plan?
• Prepare for an exploding # of data operations!

– Team is growing, business is growing

– # of tasks

– # of 3rd party data products

– Communication cost

• Operations with machine learning & deep learning

– Need a better way to manage task & data flow

What Matters for Us
• Support Team

– Each team should see the proper data and make good decisions from it

– Regular meetings, fast response to ad hoc data requests

– Ultimately, every activity of ours should be related to the company's business

• Technical Lead

– Technical investments in the competence of both the company and the individual

– Working at Between should be a great experience for each individual

• Social Impact

– Does our work have a valuable impact on society?

– Open source, activity in the community

How Is Apache Spark Powering a Startup?
• One great general-purpose tool

– Daily batch tasks

– Agile, ad hoc analysis

– Drawing dashboards

– Many more..

• Helps save time and reduce the cost of data operations

• Great experience for engineers and analysts

• Sharing know-how with / from the community

Working as a Data Engineer at a Startup
• Fascinating, fast evolution of tech

• Need hard work and labor

• Data work will shine only when it is understood and used by teammates
Two Peasants Digging, Vincent van Gogh
Two Men Digging, Jean-François Millet
