SEOUL
© 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved
Real-time Big Data and Streaming Analytics
김일호 – AWS Solutions Architect
Agenda
• Batch Processing: Amazon Elastic MapReduce (EMR)
• Real-time Processing: Amazon Kinesis
• Cost-saving Tips
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Batch processing
Amazon Elastic MapReduce (EMR)
Why Amazon EMR?
• Easy to use: launch a cluster in minutes
• Low cost: pay an hourly rate
• Elastic: easily add or remove capacity
• Reliable: spend less time monitoring
• Secure: manage firewalls
• Flexible: control the cluster
Easy to deploy
AWS Management Console or command line
Or use the Amazon EMR API with your favorite SDK.
Easy to monitor and debug
Integrated with Amazon CloudWatch
Monitor Cluster, Node, and IO
Hue
• Amazon S3 and Hadoop Distributed File System (HDFS)
• Query Editor
• Job Browser
Choose your instance types
Try different configurations to find your optimal architecture.
• CPU: c3 family, cc1.4xlarge, cc2.8xlarge
• Memory: m2 family, r3 family
• Disk/IO: d2 family, i2 family
• General: m1 family, m3 family
Workloads shown: batch processing, machine learning, Spark and interactive, large HDFS
Resizable clusters
• Easy to add and remove compute capacity on your cluster.
• Match compute demands with cluster sizing.
Easy to use Spot Instances
• Spot Instances for task nodes: up to 90% off Amazon EC2 on-demand pricing. Exceed SLA at lower cost.
• On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity. Meet SLA at predictable cost.
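The split between on-demand core nodes and Spot task nodes can be sketched as an instance-group layout. This is a minimal sketch, not the deck's own code: the group names, the m3.xlarge choice, the bid value, and the role names are illustrative assumptions. The dictionary only mirrors the request shape that the boto3 EMR client's run_job_flow call accepts, and nothing here contacts AWS.

```python
from typing import Any, Dict, List


def instance_groups(core_count: int, task_count: int, task_bid_usd: str) -> List[Dict[str, Any]]:
    """Master and core nodes on demand, task nodes on Spot (illustrative sizes)."""
    return [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": "m3.xlarge", "InstanceCount": 1},
        # Core nodes hold HDFS data, so they stay on predictable on-demand capacity.
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": "m3.xlarge", "InstanceCount": core_count},
        # Task nodes only add compute, so losing a Spot instance costs no data.
        {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
         "BidPrice": task_bid_usd,  # hypothetical bid, passed as a string
         "InstanceType": "m3.xlarge", "InstanceCount": task_count},
    ]


def cluster_request(name: str, core_count: int, task_count: int, task_bid_usd: str) -> Dict[str, Any]:
    """Build keyword arguments for a cluster launch; a real request also needs a release or AMI version."""
    return {
        "Name": name,
        "Instances": {
            "InstanceGroups": instance_groups(core_count, task_count, task_bid_usd),
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    request = cluster_request("nightly-batch", core_count=2, task_count=4, task_bid_usd="0.10")
    for group in request["Instances"]["InstanceGroups"]:
        print(group["Name"], group["Market"], group["InstanceCount"])
    # With credentials configured, the request could be passed on as:
    #   boto3.client("emr").run_job_flow(**request)
```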
Use bootstrap actions to install applications…
https://github.com/awslabs/emr-bootstrap-actions
…or to configure Hadoop
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop
--keyword-config-file (merge values in the new config into the existing one)
--keyword-key-value (override the values provided)
Configuration File Name   Configuration File Keyword   File Name Shortcut   Key-Value Pair Shortcut
core-site.xml             core                         C                    c
hdfs-site.xml             hdfs                         H                    h
mapred-site.xml           mapred                       M                    m
yarn-site.xml             yarn                         Y                    y
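The shortcut table can be turned into bootstrap-action arguments with a small helper. This is a sketch under assumptions: the single-dash flag form (for example -m) followed by key=value is assumed from the shortcut column, the helper name is made up for illustration, and the property shown is only an example.

```python
from typing import Dict, List

# Key-value pair shortcuts, copied from the table above.
KEY_VALUE_SHORTCUT = {
    "core-site.xml": "c",
    "hdfs-site.xml": "h",
    "mapred-site.xml": "m",
    "yarn-site.xml": "y",
}


def key_value_args(overrides: Dict[str, Dict[str, str]]) -> List[str]:
    """Flatten {config file: {property: value}} into configure-hadoop style arguments."""
    args: List[str] = []
    for config_file, pairs in overrides.items():
        flag = "-" + KEY_VALUE_SHORTCUT[config_file]  # assumed flag spelling
        for key, value in pairs.items():
            args += [flag, f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Example property only; substitute whatever setting your job needs.
    print(key_value_args({"mapred-site.xml": {"mapred.tasktracker.map.tasks.maximum": "2"}}))
```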
Amazon EMR integration with Amazon Kinesis
• Read data directly into Hive, Apache Pig, Hadoop Streaming, and Cascading from Amazon Kinesis streams
• No intermediate data persistence required
• Simple way to introduce real-time sources into batch-oriented systems
• Multi-application support and automatic checkpointing
Amazon EMR: Leveraging Amazon S3
Amazon S3 as your persistent data store
• Amazon S3
– Designed for 99.999999999% durability
– Separate compute and storage
• Resize and shut down Amazon EMR clusters with no data loss
• Point multiple Amazon EMR clusters at the same data in Amazon S3
EMRFS makes it easier to leverage Amazon S3
• Better performance and error handling options
• Transparent to applications – just read/write to “s3://”
• Consistent view
– For consistent list and read-after-write for new puts
• Support for Amazon S3 server-side and client-side encryption
• Faster listing using EMRFS metadata
EMRFS support for Amazon S3 client-side encryption
[Diagram labels: Amazon S3 (client-side encrypted objects); Amazon S3 encryption clients; EMRFS enabled for Amazon S3 client-side encryption; key vendor (AWS KMS or your custom key vendor)]
Fast listing of Amazon S3 objects using EMRFS metadata
EMRFS metadata stored in Amazon DynamoDB
• List and read-after-write consistency
• Faster list operations

Number of objects   Without Consistent Views   With Consistent Views
1,000,000           147.72                     29.70
100,000             12.70                      3.69

*Tested using a single-node cluster with an m3.xlarge instance.
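The listing figures above work out to roughly a 5x improvement at 1,000,000 objects and about 3.4x at 100,000. The short calculation below reproduces that from the table values; the slide does not state the unit of the measurements, so the numbers are treated as plain listing times.

```python
# Listing times from the table: (without Consistent Views, with Consistent Views).
LISTING_TIMES = {
    1_000_000: (147.72, 29.70),
    100_000: (12.70, 3.69),
}


def speedup(without_cv: float, with_cv: float) -> float:
    """How many times faster the listing is with Consistent Views enabled."""
    return without_cv / with_cv


if __name__ == "__main__":
    for objects, (without_cv, with_cv) in LISTING_TIMES.items():
        print(f"{objects:>9,} objects: {speedup(without_cv, with_cv):.2f}x faster")
```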
Optimize to leverage HDFS
• Iterative workloads
– If you’re processing the same dataset more than once
• Disk I/O intensive workloads
Persist data on Amazon S3 and use S3DistCp to copy it to HDFS for processing.
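A copy from Amazon S3 into HDFS can be described as an S3DistCp argument list. This is a sketch only: the bucket and paths are hypothetical, and the s3-dist-cp command name assumes an EMR release that ships it as a command (older AMIs invoke it as a JAR step instead). The helper builds the arguments without running anything.

```python
from typing import List, Optional


def s3distcp_args(src: str, dest: str, src_pattern: Optional[str] = None) -> List[str]:
    """Build an S3DistCp command line that copies src (S3) to dest (HDFS)."""
    args = ["s3-dist-cp", "--src", src, "--dest", dest]
    if src_pattern:
        # Optional regular expression to copy only matching objects.
        args += ["--srcPattern", src_pattern]
    return args


if __name__ == "__main__":
    # Hypothetical locations, for illustration only.
    print(" ".join(s3distcp_args("s3://example-bucket/logs/", "hdfs:///input/logs/")))
```

Iterative jobs would then read from the HDFS path, while the copy in Amazon S3 remains the durable source of truth.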
Amazon EMR: Design patterns
Amazon EMR example #1: Batch processing
• GBs of logs pushed to Amazon S3 hourly
• Daily Amazon EMR cluster using Hive to process data
• Input and output stored in Amazon S3
• 250 Amazon EMR jobs per day, processing 30 TB of data
http://aws.amazon.com/solutions/case-studies/yelp/
Amazon EMR example #2: Long-running cluster
• Data pushed to Amazon S3
• Daily Amazon EMR cluster extracts, transforms, and loads (ETL) data into database
• 24/7 Amazon EMR cluster running HBase holds last 2 years’ worth of data
• Front-end service uses HBase cluster to power dashboard with high concurrency
Amazon EMR example #3: Interactive query
• TBs of logs sent daily
• Logs stored in Amazon S3
• Amazon EMR cluster using Presto for ad hoc analysis of entire log set
• Interactive query using Presto on multipetabyte warehouse
http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html
Real-time Processing
Amazon Kinesis
Real-time analytics
Real-time ingestion
• Highly scalable
• Durable
• Elastic
• Re-playable reads
Continuous processing
• Load-balancing incoming streams
• Fault-tolerance, check-pointing and replay
• Elastic
• Enables multiple apps to process in parallel
Real-time ingestion + continuous processing: continuous data flow, low end-to-end latency, continuous real-time workloads
Starting simple...
[Diagram labels: data ingestion, example.com, global top 10]
Distributing the workload…
[Diagram labels: example.com, global top 10]
Or using an elastic data broker…
[Diagram labels: example.com, three local top 10s, global top 10]
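The local and global top-10 idea from these slides can be sketched in a few lines. This is an illustrative sketch, not the deck's implementation: the function names are made up, and each worker is assumed to report its full counts rather than a truncated list, because merging truncated local top-10 lists can miss an item that ranks 11th everywhere but is popular overall.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def local_counts(events: Iterable[str]) -> Counter:
    """One worker counts the page hits it saw."""
    return Counter(events)


def global_top(partials: Iterable[Counter], n: int = 10) -> List[Tuple[str, int]]:
    """Merge per-worker counts and return the n most frequent items."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    # Sort by count (descending), then by key so ties are deterministic.
    return sorted(total.items(), key=lambda item: (-item[1], item[0]))[:n]


if __name__ == "__main__":
    worker_a = local_counts(["/home", "/cart", "/home"])
    worker_b = local_counts(["/cart", "/cart", "/help"])
    print(global_top([worker_a, worker_b], n=2))
```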
Amazon Kinesis – managed stream
[Diagram labels: example.com, Amazon Kinesis, stream, shard, partition key, data record, sequence number (14, 17, 18, 21, 23), worker, my top 10, global top 10]
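How a partition key selects a shard can be sketched as follows. Amazon Kinesis hashes the partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains it. The even split of the hash space and the helper names below are assumptions for illustration; a real stream reports each shard's actual range.

```python
import hashlib
from typing import List, Tuple

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def even_hash_ranges(shard_count: int) -> List[Tuple[int, int]]:
    """Split the hash space into contiguous, inclusive ranges (assumed even split)."""
    step = HASH_SPACE // shard_count
    ranges = []
    for index in range(shard_count):
        start = index * step
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if index == shard_count - 1 else (index + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key: str, ranges: List[Tuple[int, int]]) -> int:
    """Return the index of the shard whose range contains the key's MD5 hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    for index, (start, end) in enumerate(ranges):
        if start <= hash_key <= end:
            return index
    raise ValueError("hash key is outside every shard range")


if __name__ == "__main__":
    ranges = even_hash_ranges(4)
    for key in ["user-1", "user-2", "user-3"]:
        print(key, "-> shard", shard_for_key(key, ranges))
```

Records that share a partition key always land on the same shard, which is what keeps their sequence numbers ordered relative to each other.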
Amazon Kinesis – common data broker
[Diagram labels: data sources; AWS endpoint; three Availability Zones; Shard 1, Shard 2, … Shard N; App. 1 to App. 4 (data archive, metric extraction, sliding-window analysis, machine learning); Amazon S3, Amazon DynamoDB, Amazon Redshift, Amazon EMR]
Amazon Kinesis – stream and shards
• Stream: A named entity to capture and store data
• Shards: Unit of capacity
– Put: 1 MB/sec or 1,000 TPS per shard
– Get: 2 MB/sec or 5 TPS per shard
• Scale by adding or removing shards
• Replay in 24-hr. window
How to size your Amazon Kinesis stream
Consider 2 producers, each producing 2 KB records at 500 TPS:
• Each producer writes 2 KB * 500 TPS = 1000 KB/s
• Minimum of 2 shards for ingress of 2 MB/s
• 2 applications can read with egress of 4 MB/s
How to size your Amazon Kinesis stream
Consider 3 consuming applications, each processing the data:
• Producers still write 2 * (2 KB * 500 TPS) = 2000 KB/s
• 3 applications need 6 MB/s of egress, but 2 shards provide only 4 MB/s
• Simple! Add another shard to the stream to spread the load
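The sizing arithmetic from both sizing slides can be captured in one function. This is a sketch that uses the per-shard limits quoted in this deck (1 MB/sec or 1,000 records/sec in, 2 MB/sec out) and, like the slides, treats 1 MB as 1000 KB. It ignores the 5 read transactions per second limit and assumes every consumer reads the full stream.

```python
PUT_KB_PER_SEC = 1000       # ingress per shard, from the slide
PUT_RECORDS_PER_SEC = 1000  # ingress records per shard, from the slide
GET_KB_PER_SEC = 2000       # egress per shard, from the slide


def ceil_div(amount: int, capacity: int) -> int:
    """Integer ceiling division."""
    return -(-amount // capacity)


def shards_needed(producers: int, record_kb: int, records_per_sec: int, consumers: int) -> int:
    """Smallest shard count that satisfies the ingress and egress limits."""
    ingress_kb = producers * record_kb * records_per_sec
    ingress_records = producers * records_per_sec
    egress_kb = consumers * ingress_kb  # each consumer reads everything
    return max(
        1,
        ceil_div(ingress_kb, PUT_KB_PER_SEC),
        ceil_div(ingress_records, PUT_RECORDS_PER_SEC),
        ceil_div(egress_kb, GET_KB_PER_SEC),
    )


if __name__ == "__main__":
    print("2 consumers:", shards_needed(2, 2, 500, consumers=2), "shards")
    print("3 consumers:", shards_needed(2, 2, 500, consumers=3), "shards")
```

With two consuming applications the ingress limit decides the count; with three, egress becomes the bottleneck, which is why the third shard is needed even though the producers have not changed.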
Amazon Kinesis – distributed streams
• From batch to continuous processing
• Scale UP or DOWN without losing sequencing
• Workers can replay records for up to 24 hours
• Scale up to GB/sec without losing durability
– Records stored across multiple Availability Zones
• Run multiple parallel Amazon Kinesis applications
Data processing
Batch, micro-batch, real time
Pattern for real-time analytics…
[Diagram labels: data streams; streaming analytics (Spark Streaming, Apache Storm, Amazon KCL); data archive; batch analysis (data warehouse, Hadoop); deep learning; notifications & alerts; dashboards/visualizations; APIs]
Real-time analytics
• Streaming
– Event-based response within seconds; for example, detecting whether a transaction is fraudulent
• Micro-batch
– Operational insights within minutes; for example, monitoring transactions from different regions
Amazon Kinesis Client Library (Amazon KCL)
• Distributed to handle multiple shards
• Fault tolerant
• Elastically adjusts to shard count
• Helps with distributed processing
[Diagram labels: Amazon Kinesis stream, three Amazon EC2 instances]
Amazon KCL design components
• Worker: The processing unit that maps to each application instance
• Record processor: The processing unit that processes data from a shard of an Amazon Kinesis stream
• Check-pointer: Keeps track of the records that have already been processed in a given shard
Amazon KCL restarts the processing of the shard at the last-known processed record if a worker fails
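The checkpoint-and-restart behavior can be illustrated with a toy in-memory model. This is not the Amazon KCL API: the class names, the per-record checkpointing, and the simulated failure are assumptions made for the sketch, and the real library keeps its checkpoints in Amazon DynamoDB rather than in a dictionary. The sequence numbers reuse the ones shown in the managed-stream diagram.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[int, str]  # (sequence number, payload)


class Checkpointer:
    """Remembers the last processed sequence number per shard."""

    def __init__(self) -> None:
        self._last: Dict[str, int] = {}

    def checkpoint(self, shard_id: str, sequence_number: int) -> None:
        self._last[shard_id] = sequence_number

    def last_processed(self, shard_id: str) -> Optional[int]:
        return self._last.get(shard_id)


class RecordProcessor:
    """Processes one shard, skipping anything at or before the checkpoint."""

    def __init__(self, shard_id: str, checkpointer: Checkpointer) -> None:
        self.shard_id = shard_id
        self.checkpointer = checkpointer
        self.processed: List[str] = []

    def process(self, records: List[Record], fail_after: Optional[int] = None) -> None:
        last = self.checkpointer.last_processed(self.shard_id)
        handled = 0
        for sequence_number, payload in records:
            if last is not None and sequence_number <= last:
                continue  # already handled before the restart
            if fail_after is not None and handled == fail_after:
                raise RuntimeError("simulated worker failure")
            self.processed.append(payload)
            self.checkpointer.checkpoint(self.shard_id, sequence_number)
            handled += 1


if __name__ == "__main__":
    shard = [(14, "a"), (17, "b"), (18, "c"), (21, "d"), (23, "e")]
    checkpoints = Checkpointer()
    first = RecordProcessor("shard-1", checkpoints)
    try:
        first.process(shard, fail_after=2)  # worker dies after two records
    except RuntimeError:
        pass
    replacement = RecordProcessor("shard-1", checkpoints)
    replacement.process(shard)  # resumes after the last checkpoint
    print(first.processed, replacement.processed)
```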
Amazon Kinesis Connector Library
• Amazon S3
– Archival of data
• Amazon Redshift
– Micro-batching loads
• Amazon DynamoDB
– Real-time Counters
• Elasticsearch
– Search and Index
[Diagram labels: Amazon Kinesis, Amazon S3, Amazon DynamoDB, Amazon Redshift]
EMR integration with Amazon Kinesis
• Read data directly into Hive, Pig, Streaming, and Cascading from Amazon Kinesis
• Real-time sources into batch-oriented systems
• Multi-application support & check-pointing
Spark streaming – Basic concepts
• Higher-level abstraction called Discretized Streams (DStreams)
• Represented as sequences of Resilient Distributed Datasets (RDDs)
[Diagram labels: messages, receiver, DStream, RDD@T1, RDD@T2]
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html
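The idea of a DStream as a sequence of per-interval batches can be shown in plain Python. This is a conceptual sketch, not the Spark API: ordinary lists stand in for RDDs, and the function name and the one-second batch interval are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def discretize(messages: Iterable[Tuple[float, str]], batch_interval: float) -> List[List[str]]:
    """Group (timestamp, payload) messages into consecutive fixed-width batches."""
    batches: Dict[int, List[str]] = defaultdict(list)
    for timestamp, payload in messages:
        batches[int(timestamp // batch_interval)].append(payload)
    if not batches:
        return []
    first, last = min(batches), max(batches)
    # Intervals with no messages still produce an (empty) batch.
    return [batches.get(index, []) for index in range(first, last + 1)]


if __name__ == "__main__":
    received = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
    for index, batch in enumerate(discretize(received, batch_interval=1.0)):
        # Each batch plays the role of one RDD; any batch job can run on it.
        print(f"batch {index}: {batch} ({len(batch)} messages)")
```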
Apache Storm: Basic concepts
• Streams: Unbounded sequence of tuples
• Spout: Source of stream
• Bolts: Processes that input streams and output new streams
• Topologies: Network of spouts and bolts
https://github.com/awslabs/kinesis-storm-spout
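The spout, bolt, and topology vocabulary can be illustrated with Python generators. This is a toy model and not the Apache Storm API: the function names are made up, the stream is finite so the example terminates, and everything runs in a single process rather than across a cluster.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def sentence_spout(sentences: Iterable[str]) -> Iterator[Tuple[str]]:
    """Spout: the source of the stream, emitting one tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str]]:
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word.lower(),)


def count_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str, int]]:
    """Bolt: keeps running counts and emits (word, count so far)."""
    counts: Counter = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology(sentences: Iterable[str]) -> Dict[str, int]:
    """Topology: wire spout -> split bolt -> count bolt and collect final counts."""
    final: Dict[str, int] = {}
    for word, count in count_bolt(split_bolt(sentence_spout(sentences))):
        final[word] = count
    return final


if __name__ == "__main__":
    print(run_topology(["to be or", "not to be"]))
```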
Putting it together…
[Diagram labels: producer, Amazon Kinesis, three Amazon KCL applications, app client, Amazon EMR, Amazon S3, Amazon DynamoDB, Amazon Redshift, BI tools; spanning batch, micro-batch, and real time]
Ref. re:Invent 2014 BDT310
Cost-saving tips
• Use Amazon S3 as your persistent data store (only pay for compute when you need it!).
• Use Amazon EC2 Spot Instances (especially with task nodes) to save 80 percent or more on the Amazon EC2 cost.
• Use Amazon EC2 Reserved Instances if you have steady workloads.
• Create Amazon CloudWatch alarms to notify you if a cluster is underutilized so that you can shut it down (e.g., Mappers running == 0 for more than N hours).
• Contact your sales rep about custom pricing options if you are spending more than $10K per month on Amazon EMR.
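The underutilization alarm from the tips above can be sketched as a CloudWatch alarm definition. This sketch swaps in Amazon EMR's IsIdle metric in place of the "Mappers running == 0" condition named on the slide. The alarm name, cluster ID, and SNS topic ARN are hypothetical, and the dictionary only mirrors the shape accepted by the boto3 CloudWatch client's put_metric_alarm call; nothing here contacts AWS.

```python
from typing import Any, Dict

PERIOD_SECONDS = 300  # assumes the metric is evaluated in five-minute periods


def idle_cluster_alarm(cluster_id: str, idle_hours: int, sns_topic_arn: str) -> Dict[str, Any]:
    """Alarm definition that fires once the cluster has been idle for idle_hours."""
    return {
        "AlarmName": f"emr-idle-{cluster_id}",  # hypothetical naming scheme
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "IsIdle",  # 1 when no work is running, else 0
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": PERIOD_SECONDS,
        # Every period in the window must report idle before the alarm fires.
        "EvaluationPeriods": idle_hours * 3600 // PERIOD_SECONDS,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }


if __name__ == "__main__":
    alarm = idle_cluster_alarm(
        "j-EXAMPLE", idle_hours=2,
        sns_topic_arn="arn:aws:sns:us-east-1:123456789012:example-topic",
    )
    print(alarm["AlarmName"], alarm["EvaluationPeriods"], "periods")
    # With credentials configured, the definition could be passed on as:
    #   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```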

AWS Summit Seoul 2015 - Big Data and Real-time Streaming Analytics Using the AWS Cloud
