Coupang Confidential and Proprietary
This document is confidential and the intellectual property of Coupang
Journey to the Continuous and Scalable Big Data Platform
Matthew (정재화), Coupang
About me
• Software Development Manager of BigData & DW Platform team
• 8+ years Hadoop experience
• Apache Tajo committer and PMC member
• blrunner78@gmail.com
• Blog: https://blrunner.tistory.com
• Author of a Hadoop technical handbook
Agenda
1. On-Premise
2. Cloud 1.0
3. Cloud 2.0
4. Airflow as a Service
5. Zeppelin as a Service
Motivation
The purpose of a business is to
create and keep a customer
- Peter Drucker -
1. On-Premise
Architecture
Data sources
• Logs
  • Client Logs
  • Server Logs
• RDBMS
• External Data
ETL Cluster
• Aggregations and Joins
• MapReduce
• Hive/Pig/Spark
• Oozie
Read-Only Cluster
• Ad-hoc Query
• Hive
Team's Responsibility
• Architect, build and operate our data infrastructure and tools
• Create and maintain company-wide data pipelines
• Troubleshoot and resolve all user issues as they arise
Areas for Improvement
• Pros
  • A wide variety of workloads
  • Continuous increase in users
• Cons
  • Multiple copies of data
  • Lack of elasticity
  • Operational overhead
2. Cloud 1.0
Architecture: Decouple compute and storage
Centralized Resources
• Hive Metastore
• Cloud Storage
Batch Cluster (HiveServer2)
- Batch jobs
- High throughput
- Fault tolerant, ETL
Ad-hoc Cluster (HiveServer2)
- Ad-hoc queries
- Low latency
- Interactive analysis
- In-memory
Domain Clusters #1 through #N (HiveServer2)
Team's Responsibility
• Architect, build and operate our data infrastructure and tools
• Troubleshoot and resolve all user issues as they arise
• Implement company-wide data pipelines
Areas for Improvement
• Pros
  • Allows parsing and enriching data for custom needs
  • Independent scaling of CPU and storage capacity
• Cons
  • Learning curve for cloud infrastructure
  • Operational overhead
  • Users want the latest tools and more features
3. Cloud 2.0
High-Level Architecture
• Data Platform Portal
• Scheduler Tools: Airflow
• Data Processing Tools: Zeppelin
• Computing Clusters
• Storage: Cloud Storage
• Security: LDAP (authentication), Apache Ranger (ACL & audit)
• Monitoring
Various Types of Computing Clusters
Centralized Resource
• Hive Metastore
• Cloud Storage
Transient Cluster
- Batch Jobs
Persistent Cluster
- Interactive Queries
Workload Specific Cluster
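As a rough illustration of how jobs might be mapped onto these cluster types, the sketch below routes a job description to a transient, persistent, or workload-specific cluster. Only the three cluster types and their example workloads (batch jobs, interactive queries) come from the slide. The `Job` fields, the `choose_cluster` function, and the routing rules are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ClusterType(Enum):
    TRANSIENT = "transient"                  # assumed short-lived: exists for a batch job
    PERSISTENT = "persistent"                # assumed long-running: serves interactive queries
    WORKLOAD_SPECIFIC = "workload-specific"  # dedicated to one kind of workload


@dataclass(frozen=True)
class Job:
    name: str
    interactive: bool                        # hypothetical flag: a user waits on the result
    special_workload: Optional[str] = None   # hypothetical tag, e.g. "ml-training"


def choose_cluster(job: Job) -> ClusterType:
    """Pick a cluster type for a job (illustrative rules only)."""
    if job.special_workload is not None:
        return ClusterType.WORKLOAD_SPECIFIC
    if job.interactive:
        return ClusterType.PERSISTENT
    return ClusterType.TRANSIENT
```

For example, `choose_cluster(Job("nightly-etl", interactive=False))` selects the transient cluster type, so the batch job does not compete with interactive users on the persistent cluster.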
Team's Responsibility
• Architect, build and operate our data infrastructure and tools
• Create data APIs and data services
• Support users using SLA policies
• Maintain security and data privacy
• Application knowledge, support artifacts, etc.
Areas for Improvement
• Pros
  • Onboarded many users and a wide variety of jobs
  • Easier management and added features
• Cons
  • Unintended increase in infrastructure costs
  • A wide variety of client tools and development environments
  • Various types of users
Lessons Learned
• Distribute traffic instead of concentrating it in one place
• Optimize all types of system resources in clusters
• Enforce a lifecycle for Hadoop clusters
• Monitor clusters and send alarms from an efficiency perspective
• Train users continuously and build a community culture
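Two of these lessons, enforcing a cluster lifecycle and alarming on efficiency, can be sketched as a small policy check. This is a minimal sketch under stated assumptions: the `ClusterStats` fields, the seven-day maximum age, and the 20% CPU threshold are hypothetical values chosen for illustration, not figures from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class ClusterStats:
    name: str
    created_at: datetime
    avg_cpu_utilization: float  # 0.0 to 1.0, averaged over some recent window


# Hypothetical policy values; real thresholds depend on workload and cost targets.
MAX_CLUSTER_AGE = timedelta(days=7)
MIN_CPU_UTILIZATION = 0.20


def review_cluster(stats: ClusterStats, now: datetime) -> List[str]:
    """Return alarm messages for one cluster; an empty list means no findings."""
    alarms = []
    # Lifecycle rule: clusters past the maximum age should be recreated or terminated.
    if now - stats.created_at > MAX_CLUSTER_AGE:
        alarms.append(f"{stats.name}: older than {MAX_CLUSTER_AGE.days} days")
    # Efficiency rule: flag clusters whose average CPU utilization is too low.
    if stats.avg_cpu_utilization < MIN_CPU_UTILIZATION:
        alarms.append(
            f"{stats.name}: CPU utilization {stats.avg_cpu_utilization:.0%} "
            f"is below {MIN_CPU_UTILIZATION:.0%}"
        )
    return alarms
```

The check returns messages rather than terminating anything, so the decision to act stays with whoever receives the alarm.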
4. Airflow as a Service
Why we love Airflow
• Define workflows as code
  • Makes workflows more maintainable, versionable, and testable
  • More flexible execution and workflow generation
• Lots of features
  • Sensors
  • Workflow profiling
  • SLA alerts
• Rich web interface
• Scalable worker processes
• In-house Airflow
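To make "workflows as code" concrete without depending on Airflow itself, the sketch below declares a small workflow as plain Python data and derives a valid execution order with the standard library's `graphlib.TopologicalSorter` (Python 3.9+). This is not Airflow's API. The task names and the `run_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each task maps to the set of tasks it depends on (hypothetical task names).
WORKFLOW: Dict[str, Set[str]] = {
    "extract_logs": set(),
    "extract_rdbms": set(),
    "join_and_aggregate": {"extract_logs", "extract_rdbms"},
    "publish_report": {"join_and_aggregate"},
}


def run_order(workflow: Dict[str, Set[str]]) -> List[str]:
    """Return one execution order in which every task follows its dependencies."""
    # static_order() raises graphlib.CycleError if the workflow contains a cycle.
    return list(TopologicalSorter(workflow).static_order())
```

Because the workflow is ordinary code, it can live in version control and be covered by unit tests, for example by asserting that `publish_report` always comes last in `run_order(WORKFLOW)`.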
Airflow: deployment process
[Diagram: Airflow deployment process; labeled component: Cloud Storage]
5. Zeppelin as a Service
Why we love Zeppelin
• Easy Spark development on a personal computer
• Customized Presto interpreter
  • Run Presto queries easily without complex JDBC configuration
  • Export large data files to the local machine without errors
• Persistent storage for notebooks
Zeppelin Architecture
Areas for Improvement
• Users
  • All notebooks load on the main page -> too slow
  • A big notebook can consume most resources -> Zeppelin pending
• Platform team
  • The Spark interpreter doesn’t support YARN cluster mode
  • No lifecycle management for notebooks
  • Difficult to upgrade and improve existing Zeppelin instances gracefully
Resolution
• Upgrade Zeppelin to 0.8.1
  • Main page improvements
  • YARN cluster mode for the Spark interpreter
  • Interpreter lifecycle manager
  • Interpreter recovery
• Containerized Zeppelin on Kubernetes
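One common form of interpreter lifecycle management is an idle timeout: interpreter sessions that nobody has used for a while are shut down to free resources. The sketch below shows that idea in isolation. It is not Zeppelin's implementation. `IdleTimeoutManager`, the session ids, and the threshold are all hypothetical.

```python
from typing import Dict, List


class IdleTimeoutManager:
    """Tracks last-use times and evicts sessions idle longer than a threshold."""

    def __init__(self, idle_threshold_seconds: float) -> None:
        self.idle_threshold_seconds = idle_threshold_seconds
        self._last_used: Dict[str, float] = {}

    def touch(self, session_id: str, now: float) -> None:
        """Record that a session was used at time `now` (in seconds)."""
        self._last_used[session_id] = now

    def evict_idle(self, now: float) -> List[str]:
        """Forget and return the sessions idle for longer than the threshold."""
        idle = [
            session_id
            for session_id, last in self._last_used.items()
            if now - last > self.idle_threshold_seconds
        ]
        for session_id in idle:
            del self._last_used[session_id]
        return sorted(idle)
```

A caller would invoke `touch` whenever a paragraph runs and call `evict_idle` periodically, stopping the interpreter process behind each returned session id.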
Summary
• Understand who your immediate customer is
• Focus on the truly important things
• Detect and solve problems immediately
• Leverage the identity of infrastructure
• A best practice is not necessarily best for you
SELECT question FROM you
https://boards.greenhouse.io/coupang/
Thank you
