AWS Big Data Demystified #1.2
Big Data Architecture
Lessons Learned
Omid Vahdaty, Big Data Ninja
Disclaimer
● I am not the best, I simply love
what I do VERY much.
● You are more than welcome to
challenge me or anything I have
to say as I could be wrong.
● This lecture has evolved over
time; this is the 3rd iteration.
● Feel Free to send me comments
In the Past (web, API, ops DB, data warehouse)
Then came Big Data...
Then came the cloud...
Then came the invoice ...
It keeps
growing….
TODAY’S BIG DATA
APPLICATION STACK
PaaS and DC...
MY BIG DATA
APPLICATION STACK
“demystified”...
MY AWS BIG DATA
APPLICATION STACK
With some “Myst”...
Jargon
Big data?
Architecture?
Considerations?
Challenges?
How to get started?
Big Jargon & Basic Concepts You Should Know
https://amazon-aws-big-data-demystified.ninja/2019/02/18/big-data-jargon-faqs-and-everything-you-wanted-to-know-and-didnt-ask-about-big-data/
● What is Big Data?
● scale out / up ?
● structured/semi structured/unstructured data ?
● ACID?
● OLAP VS OLTP? == Analytics VS operational
● DIY = Do it Yourself
● PaaS = Platform as a service
Big Data =
When your data outgrows
your infrastructure's ability to process it
● Volume (x TB processing per day)
● Velocity ( x GB/s)
● Variety (JSON, CSV, events etc)
● Veracity (how much of the data is accurate?)
Challenges creating big data architecture?
● What is the business use case? How fast do you need the insights?
○ 15 min - 24 hours delay and above → use batch
○ Less than 15 min?
■ Might still be batch, depending on whether the data source is files or events.
■ Streaming?
● Sub seconds delay?
● Sub minute delay?
■ Streaming with in flight analytics ?
○ How complex are the compute jobs? Aggregations? Joins?
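The latency questions above can be read as a rough decision rule. The sketch below is one possible reading, not a rule from the lecture: the function name `choose_ingestion_mode` and the exact handling of the 15-minute and 1-minute boundaries are assumptions made for illustration.

```python
def choose_ingestion_mode(max_delay_seconds: float, source_is_files: bool) -> str:
    """Map the acceptable delay for insights to an ingestion style.

    Thresholds follow the slide loosely; they are illustrative, not prescriptive.
    """
    fifteen_minutes = 15 * 60
    if max_delay_seconds >= fifteen_minutes:
        # 15 minutes up to 24 hours and above: batch is enough.
        return "batch"
    if max_delay_seconds >= 60:
        # Under 15 minutes: batch can still work when the source delivers files.
        return "batch" if source_is_files else "streaming"
    # Sub-minute or sub-second delay: streaming, possibly with in-flight analytics.
    return "streaming (consider in-flight analytics)"


print(choose_ingestion_mode(3600, source_is_files=False))
print(choose_ingestion_mode(300, source_is_files=True))
print(choose_ingestion_mode(0.5, source_is_files=False))
```

The complexity of the compute jobs (aggregations, joins) is deliberately left out of this sketch; in practice it can push a borderline case toward batch.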
Challenges creating big data architecture?
● What is the Velocity?
○ Under 100K events per second? Not a problem
○ Over 1M events per second? Costly. But doable.
○ Over 1B events per second? Not trivial at all.
● Volume ?
○ ~1TB a day ? Not a problem
○ Over ? it depends.
○ Over a petabyte? Well…. It depends.
● Variety (how are you going to handle different data sources?)
○ Structured (CSV)
○ Semi structured (JSON,XML)
○ Unstructured (pictures, movies etc)
Challenges creating big data architecture?
● Performance targets?
● Costs targets?
● Security restrictions?
● Regulation restriction? privacy?
● Which technology to choose?
● Datacenter or cloud?
● Latency?
● Throughput?
● Concurrency?
● Security Access patterns?
● PaaS? Max 7 technologies
● IaaS? Max 4 technologies
Cloud Architecture rules of thumb...
● Decouple :
○ Store
○ Process
○ Store
○ Process
○ insight...
● Rule of thumb: max 3 technologies in dc, 7 tech max in cloud
○ Don't use more because of: maintenance
○ Training time
○ complexity/simplicity
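The store → process → store → process → insight chain can be sketched in a few lines. This is a toy illustration under stated assumptions: the dictionary `store` stands in for durable storage (for example S3 prefixes), and `run_stage` and the layer names are hypothetical.

```python
from typing import Callable, Dict, List

Store = Dict[str, List[dict]]  # stand-in for durable storage between stages


def run_stage(store: Store, source: str, target: str,
              process: Callable[[List[dict]], List[dict]]) -> None:
    """Read one stored layer, process it, and write the result to the next layer."""
    store[target] = process(store[source])


store: Store = {"raw": [{"user": "a", "clicks": "3"}, {"user": "b", "clicks": "x"}]}

# Process step 1: cleanse (drop rows whose click count is not a number).
run_stage(store, "raw", "clean",
          lambda rows: [{**r, "clicks": int(r["clicks"])}
                        for r in rows if r["clicks"].isdigit()])

# Process step 2: derive an insight from the cleansed layer.
run_stage(store, "clean", "insight",
          lambda rows: [{"total_clicks": sum(r["clicks"] for r in rows)}])

print(store["insight"])
```

Because every stage only reads from and writes to storage, any stage can be rerun or replaced without touching the others, and the raw layer stays intact.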
How to get started on big data architecture?
Get answers to the below:
1. What is the business use case?
a. Volume? Velocity? Variety? Veracity?
b. Did you map all of data sources?
2. Where should we build a data platform?
a. What is the product? → requirements.
b. Cloud? datacenter?
3. Architecture?
a. DIY? Paas? , Pay as you go? Or fixed? Decoupled?
b. Fast? Cheap? simple?
4. Did you communicate your plans?
5. Did you map all known challenges?
Business Use Case?
Use Case 1: Analyzing browsing history
● Data Collection: browsing history from an ISP
● Product - derives user intent and interest for marketing purposes.
● Challenges
○ Velocity: 1 TB per day
○ History of: 3M
○ Remote DC
○ Enterprise grade security
○ Privacy
Use Case 2: Insights from location based data
● Data collection: from a Mobile operator
● Products:
○ derives user intent and interest for marketing purposes.
○ derive location based intent for marketing purposes.
● Challenges
○ Velocity: 4GB/s …
○ Scalability: Rate was expected to double every year...
○ Remote DC
○ Enterprise grade security
○ Privacy
○ Edge analytics
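To get a feel for what these numbers imply, here is a small back-of-the-envelope calculation. It assumes "4GB/s" means gigabytes per second sustained around the clock and uses decimal units (1 TB = 1000 GB); both are assumptions, as the slide does not say.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_tb(rate_gb_per_s: float) -> float:
    """Convert a sustained rate in GB/s to TB per day (1 TB = 1000 GB)."""
    return rate_gb_per_s * SECONDS_PER_DAY / 1000


def projected_rate(rate_gb_per_s: float, years: int) -> float:
    """Rate after `years` if it doubles every year."""
    return rate_gb_per_s * 2 ** years


print(daily_volume_tb(4))     # 345.6
print(projected_rate(4, 3))   # 32
```

Under these assumptions the platform would have to absorb roughly 345 TB per day today and plan for an 8x higher rate within three years.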
Use Case 3: Analyzing location based events.
● Data collection: streaming
● Product: building location based audiences
● Challenges: minimizing DevOps work on maintenance
Getting started
(notice the markings on upper right
corner)
I didn't choose
this technology
I chose this
technology
So what is the product?
● Big data platform that
○ ingests data from multiple sources (cloud and DC)
○ Analyzes the data
○ Generates insights :
■ Smart Segments (online marketing)
■ Smart reports (for marketer)
■ Audience analysis (for agencies)
● Customers?
○ Marketers
○ Publishers
○ Agencies
Where to build the data platform?
We chose AWS because
● After a long competitive analysis we chose AWS because it seemed to
have all the relevant features for all our big data products and Walla's
publisher products
● The project was challenging enough without adding the complexity of a
learning curve (learning a new cloud). We already knew how to work with
AWS
● Of course, there were business aspects as well.
My Big Data product does:
● Data Ingestion
○ Online
■ messaging
■ Streaming
○ Offline
■ Batch
■ Performance aspects
● Data Transformation (Hive)
○ JSON, CSV, TXT, PARQUET, Binary
● Data Modeling - (R, ML, AI, DEEP, SPARK)
● Data Visualization (choose your poison)
● PII regulation + GDPR regulation
● And: Performance... Cost… Security… Simple... Cloud best practices...
Big Data Generic Architecture
Data Ingestion
(file based ETL from remote DC)
Data Transformation (row to columnar + cleansing)
Data Modeling ( joins/agg/ML/R)
Data Presentation
Text,
RAW
Data Ingestion
A layer in your big data architecture
designed to do one thing: ingest
data via batch or streaming, i.e.
move (only) data from point A to
point B, from the source data to the
next layer in the architecture
(decoupled).
Big Data Generic Architecture | Data Ingestion
Data Ingestion
Data Transformation
Data Modeling
Data Presentation
Batch Data collection considerations
● Every hour , about 30GB compressed CSV file
● Why s3
○ Multi part upload
○ S3 CLI
○ S3 SDK
○ (tip : gzip! )
● Why ETL Client - needs to run at remote DC
● Why NOT your own ETL client
○ Involves code →
■ Bugs?
■ maintenance
○ Don't analyze data at the edge; you can't go back in time.
● Why Not Streaming?
○ less accurate
○ Expensive
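The "gzip!" tip is easy to check locally before any upload. The sketch below builds a small, repetitive CSV in memory and compresses it with the standard library; the sample rows and the helper name `csv_bytes` are made up for illustration, and real compression ratios depend entirely on the data.

```python
import csv
import gzip
import io


def csv_bytes(rows):
    """Serialize rows to CSV bytes in memory."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue().encode("utf-8")


# Hypothetical sample: browsing-history style rows with lots of repetition.
rows = [["user_id", "url"]] + [[str(i % 10), "https://example.com/page"] for i in range(1000)]

raw = csv_bytes(rows)
compressed = gzip.compress(raw)

print(len(raw), len(compressed))
```

The compressed bytes are what the ETL client would ship to S3; the engines downstream in this stack (Hive, Spark, Athena) can generally read gzip-compressed text files directly.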
S3 Considerations
● Security
○ at rest: server side S3-Managed Keys (SSE-S3)
○ in transit: SSL / VPN
○ Hardening: user, IP ACL, write permission only.
● Upload
○ AWS s3 cli
○ Multi part upload
○ Aborting Incomplete Multipart Uploads Using a
Bucket Lifecycle Policy
○ Consider S3 CLI Sync command instead of CP
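For the lifecycle bullet, the rule that aborts incomplete multipart uploads is a small JSON document. The sketch below only builds that document. The 7-day value, the rule ID, and the helper name are assumptions; the resulting dictionary has the shape accepted as `LifecycleConfiguration` by boto3's `put_bucket_lifecycle_configuration`, which is not called here.

```python
import json


def abort_incomplete_uploads_rule(days: int, prefix: str = "") -> dict:
    """Build an S3 lifecycle configuration that aborts stale multipart uploads."""
    return {
        "Rules": [
            {
                "ID": f"abort-incomplete-mpu-after-{days}d",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix = whole bucket
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
            }
        ]
    }


config = abort_incomplete_uploads_rule(7)
print(json.dumps(config, indent=2))
```

Without such a rule, parts of failed uploads remain in the bucket and keep accruing storage charges until they are aborted.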
Sqoop - ETL
● Open source , part of EMR
● HDFS to RDBMS and back, via JDBC.
● E.g., bidirectional ETL from RDS to HDFS
● Unlikely use case: ETL from customer source operational DB.
Flume & Kafka
● Open source projects for streaming & messaging
● Popular
● Generic
● Good practice for many use cases. (a meetup by itself)
● Highly durable, scalable, extensible etc.
● Downside : DIY, Non trivial to get started
Data Transfer Options
● Direct Connect (4GB/s?)
● For all other use cases
○ S3 multipart upload
○ Compression
○ Security
■ Data in motion
■ Data at rest
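Multipart upload splits one large object into parts that are uploaded independently. The sketch below only plans the parts; it does not talk to S3. The 5 MiB minimum part size and the 10,000-part limit are S3's documented limits, while the 64 MiB default, the reading of the hourly file as 30 GiB, and the name `plan_parts` are assumptions.

```python
import math

MIN_PART_SIZE = 5 * 1024 ** 2   # S3 minimum part size (the last part may be smaller)
MAX_PARTS = 10_000              # S3 maximum number of parts per upload


def plan_parts(object_size: int, part_size: int = 64 * 1024 ** 2):
    """Return (part_number, offset, length) tuples covering the whole object."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum of 5 MiB")
    if math.ceil(object_size / part_size) > MAX_PARTS:
        raise ValueError("too many parts; increase part_size")
    return [
        (number, offset, min(part_size, object_size - offset))
        for number, offset in enumerate(range(0, object_size, part_size), start=1)
    ]


parts = plan_parts(30 * 1024 ** 3)  # the hourly file, taken here as 30 GiB
print(len(parts))                   # 480
```

Each part can be retried on its own, which matters on a long link from a remote DC. The AWS CLI and SDKs do this planning automatically for large files.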
Quick intro to Stream ingestion
● Kinesis Client Library (code)
● AWS lambda (code)
● EMR (managed hadoop)
● Third party (DIY)
○ Spark Streaming (latency min = 1 sec), near real time, with lots of libraries.
○ Storm - Most real time (sub millisec), java code based.
○ Flink (similar to spark)
Kinesis family of products
● Kinesis Stream - collect@source and near real time processing
○ Near real time
○ High throughput
○ Low cost
○ Easy administration - set desired level of capacity
○ Delivery to : s3,redshift, Dynamo, ...
○ Per shard: ingress 1 MB/s, egress 2 MB/s, up to 1,000 records written per second.
○ Not managed!
● Kinesis Analytics - in flight analytics.
● Kinesis Firehose - park your data @ destination.
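The throughput figures for Kinesis Stream translate into a simple sizing estimate. The sketch below assumes the limits apply per shard (1 MB/s in, 1,000 records/s in, 2 MB/s out); the function name and the `read_fanout` parameter (number of consumers reading the full stream) are hypothetical.

```python
import math


def shards_needed(records_per_s: float, avg_record_kb: float, read_fanout: int = 1) -> int:
    """Estimate the shard count from write rate, record size, and number of readers."""
    ingress_mb_s = records_per_s * avg_record_kb / 1024
    egress_mb_s = ingress_mb_s * read_fanout
    return max(
        1,
        math.ceil(ingress_mb_s / 1.0),     # 1 MB/s write per shard
        math.ceil(records_per_s / 1000),   # 1,000 records/s write per shard
        math.ceil(egress_mb_s / 2.0),      # 2 MB/s read per shard
    )


print(shards_needed(100_000, 1))               # 100
print(shards_needed(5_000, 2, read_fanout=3))  # 15
```

In the first case the record-count limit is the bottleneck; in the second, three consumers make read throughput the limiting factor.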
Kinesis Firehose - for Data parking
● Not for fast lane - no in flight analytics
● Ingest , transform and load to:
○ Kinesis
○ S3
○ Redshift
○ Elasticsearch
● Managed Service
Comparison of Kinesis products
● Streams
○ Sub 1 sec processing latency
○ Choice of stream processor (generic)
○ For smaller events
● Firehose
○ Zero admin
○ 4 targets built in (redshift, s3, search, etc)
○ Buffering 60 sec minimum.
○ For larger “events”
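The buffering behaviour that separates Firehose from Streams can be illustrated with a toy model: records accumulate and are delivered as one batch when either a size limit or a time limit is reached. This is a simplified simulation, not the service's implementation. The limits in the example are arbitrary, and the age check here only runs when a new event arrives, whereas a real buffer would also flush on a timer.

```python
def buffer_batches(events, max_bytes, max_seconds):
    """Group (timestamp, payload) events into batches, flushing on size or age."""
    batches, current, current_bytes, opened_at = [], [], 0, None
    for timestamp, payload in events:
        # Flush the open batch if it has been open for too long.
        if current and timestamp - opened_at >= max_seconds:
            batches.append(current)
            current, current_bytes = [], 0
        if not current:
            opened_at = timestamp
        current.append(payload)
        current_bytes += len(payload)
        # Flush the open batch if it has grown too large.
        if current_bytes >= max_bytes:
            batches.append(current)
            current, current_bytes = [], 0
    if current:
        batches.append(current)
    return batches


events = [(0, b"aa"), (10, b"bb"), (70, b"cc"), (71, b"dddddd")]
print(buffer_batches(events, max_bytes=6, max_seconds=60))
# [[b'aa', b'bb'], [b'cc', b'dddddd']]
```

The first batch is flushed by age and the second by size, which is why a buffered delivery path suits "parking" data rather than sub-second processing.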
Data
Transformation
A layer in your big data architecture
designed to : Transform and
Cleanse data (row data to columnar
data and convert data types, Fix
bugs in data)
Big Data Generic Architecture | Transformation
Data Ingestion
S3
Data Transformation
Data Modeling
Data Presentation
EMR ecosystem
● Hive
● Pig
● Hue
● Spark
● Oozie
● Presto
● Ganglia
● Zookeeper (hbase)
● zeppelin
EMR Architecture
● Master node
● Core nodes - like data nodes (with storage: HDFS)
● Task nodes - (extends compute)
● Does Not have Standby Master node
● Best for transient cluster (goes up and down every night)
EMR lesson learned...
● Bigger instance type is good architecture
● Use spot instances - for the tasks only.
● Don't always use TEZ (MR? Spark?)
● Make sure you choose network-optimized instances
● Resizing a cluster is not recommended
● Bootstrap to automate cluster upon provisioning
● Use Steps to automate steps on running cluster
● Use Glue to share Hive MetaStore
● Good Cost reduction article on EMR
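The "spot instances for tasks only" lesson can be expressed as a cluster layout plus a sanity check. The sketch below builds an instance group list in the shape used by the EMR RunJobFlow API (`InstanceRole`, `Market`, `InstanceType`, `InstanceCount`) without calling AWS. The instance type, the counts, and both function names are placeholders chosen for illustration.

```python
def instance_groups(core_count: int, task_count: int, instance_type: str = "m5.xlarge") -> list:
    """Build EMR instance groups: on-demand master and core nodes, spot task nodes."""
    groups = [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        groups.append({"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
                       "InstanceType": instance_type, "InstanceCount": task_count})
    return groups


def spot_only_on_tasks(groups: list) -> bool:
    """True when no master or core group is configured to run on spot."""
    return all(g["Market"] != "SPOT" or g["InstanceRole"] == "TASK" for g in groups)


groups = instance_groups(core_count=2, task_count=4)
print(spot_only_on_tasks(groups))  # True
```

Core nodes hold HDFS blocks, so losing them to a spot interruption risks data and running jobs; task nodes only add compute and can disappear more safely.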
So use EMR for ...
● Most dominant
○ Hive
○ Spark
○ Presto
● And many more….
● Good for:
○ Data transformation
○ Data modeling
○ Batch
○ Machine learning
Hive
● SQL over hadoop.
● Engine: spark, tez, MR
● JDBC / ODBC
● Not good when need to shuffle.
● Not peta scale.
● SerDe json, parquet,regex,text etc.
● Dynamic partitions
● Insert overwrite
● Data Transformation
● Convert to Columnar
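Dynamic partitions, insert overwrite, and conversion to columnar usually appear together in one Hive statement. The sketch below generates such a statement as text. The table and column names are hypothetical, and it assumes the target table already exists and was created with `STORED AS PARQUET` and a matching partition column.

```python
from typing import List


def to_columnar_hql(source: str, target: str, columns: List[str], partition: str) -> str:
    """Generate HiveQL that rewrites a raw table into a partitioned columnar table."""
    # With dynamic partitioning, the partition column must come last in the SELECT.
    select_list = ", ".join(columns + [partition])
    return "\n".join([
        "SET hive.exec.dynamic.partition=true;",
        "SET hive.exec.dynamic.partition.mode=nonstrict;",
        f"INSERT OVERWRITE TABLE {target} PARTITION ({partition})",
        f"SELECT {select_list} FROM {source};",
    ])


hql = to_columnar_hql("events_raw_csv", "events_parquet", ["user_id", "url"], "dt")
print(hql)
```

Because the statement overwrites only the partitions present in the source, rerunning it after a failure is safe, which suits the nightly batch transformations described here.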
Presto
● SQL over hadoop
● Not always good for joins on 2 large tables.
● Limited by memory
● Not fault tolerant like hive.
● Optimized for ad hoc queries
● No insert overwrite
● No dynamic partitions.
● Has some connectors : redshift and more
● https://amazon-aws-big-data-demystified.ninja/2018/07/02/aws-emr-presto-demystified-everything-you-wanted-to-know-about-presto/
Pig
● Distributed Shell scripting
● Generating SQL like operations.
● Engine: MR, Tez
● S3, DynamoDB access
● Use Case: for data scientists who don't know
SQL, for system people, for those who want
to avoid Java/Scala
● Fair fight compared to Hive in terms of
performance only
● Good for unstructured files ETL : file to file ,
and use sqoop.
Hue
● Hadoop user experience
● Logs in real time and failures.
● Multiple users
● Native access to S3.
● File browser to HDFS.
● Manipulate metastore
● Job Browser
● Query editor
● Hbase browser
● Sqoop editor, Oozie editor, Pig editor
Orchestration
● EMR Oozie
○ Open source workflow
■ Workflow: graph of actions
■ Coordinator: schedules jobs
○ Support: hive, sqoop , spark etc.
● Other options: AirFlow, Knime, Luigi, Azkaban,AWS Data Pipeline
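"Workflow: graph of actions" means every orchestrator in this list ultimately has to run actions in dependency order. The sketch below shows that idea with Python's standard-library `graphlib` (Python 3.9+). The action names are hypothetical, and this is not how Oozie itself is configured; Oozie workflows are defined in XML.

```python
from graphlib import TopologicalSorter

# Each action maps to the set of actions that must finish before it starts.
workflow = {
    "transform_hive": {"ingest_sqoop"},
    "model_spark": {"transform_hive"},
    "export_report": {"model_spark"},
}

order = list(TopologicalSorter(workflow).static_order())
print(order)
# ['ingest_sqoop', 'transform_hive', 'model_spark', 'export_report']
```

A coordinator then only has to decide when to trigger the graph, for example nightly, while the workflow decides in which order the actions run.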
Big Data Generic Architecture | Transformation
Data Ingestion
S3
Data Transformation
Data Modeling
Data Visualization
Data Modeling
A layer in your big data architecture
designed to Model data: Joins,
Aggregations, nightly jobs, Machine
learning
Big Data Generic Architecture | Modeling
Data Ingestion
S3
Data Transformation
Data Modeling
Data Presentation
Spark
● In memory
● 10x to 100x faster than Hive
● Good optimizer for distribution
● Rich API
● Spark SQL
● Spark Streaming
● Spark ML (ML lib)
● Spark GraphX (DB graphs)
● SparkR
Spark Streaming
● Near real time (1 sec latency)
● like batch of 1sec windows
● Streaming jobs with API
● DIY = Not relevant to us...
Spark ML
● Classification
● Regression
● Collaborative filtering
● Clustering
● Decomposition
● Code: java, scala, python, sparkR
Spark flavours
● Standalone
● With yarn
● With mesos
Spark Downside
● Compute intensive
● Performance gain over mapreduce is not guaranteed.
● Streaming processing is actually batch with very small window.
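The point that streaming here is "batch with a very small window" can be shown without Spark at all. The sketch below groups timestamped events into fixed windows and returns one small batch per window; the function name and the sample events are made up for illustration.

```python
from collections import defaultdict


def micro_batches(events, window_seconds=1.0):
    """Group (timestamp, value) events into fixed windows, one batch per window."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // window_seconds)].append(value)
    return [batches[key] for key in sorted(batches)]


events = [(0.1, "a"), (0.9, "b"), (1.2, "c"), (3.5, "d")]
print(micro_batches(events))  # [['a', 'b'], ['c'], ['d']]
```

An event arriving at the start of a window waits for the window to close before it is processed, which is why latency cannot drop much below the window length.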
Spark SQL
● Same syntax as hive
● Optional JDBC via thrift
● Non trivial learning curve
● Up to 10x faster than Hive.
● Works well with Zeppelin (out of the box)
● Does not replace Hive
● Spark not always faster than hive
● insert overwrite -
Apache Zeppelin
● Notebook - visualizer
● Built in spark integration
● Interactive data analytics
● Easy collaboration.
● Uses SQL
● works on top of Hive/ Spark SQL
● Inside EMR.
● Uses in the background:
○ Shiro
○ Livy
R + spark R
● Open source package for statistical computing.
● Works with EMR
● “Matlab” equivalent
● Works with spark
● Not for developer :) for statistician
● R is single threaded - use spark R to distribute.
● Not everything works perfectly.
Redshift
● OLAP, not OLTP→ analytics , not transaction
● Fully SQL
● Fully ACID
● No indexing
● Fully managed
● Petabyte Scale
● MPP
● Can create slow queue for queries
○ which are long lasting.
● DO NOT USE FOR transformation.
● Good for : DW, Complex Joins.
Redshift spectrum
● Extension of Redshift, use external table on S3.
● Require redshift cluster.
● Not possible for CTAS to s3, complex data structure, joins.
● Good for
○ Read only Queries
○ Aggregations on Exabyte.
EMR vs Redshift
● How much data loaded and unloaded?
● Which operations need to be performed?
● Recycling data? → EMR
● History to be analyzed again and again ? → emr
● Where does the data need to end up? BI?
● Use spectrum in some use cases. (aggregations)?
● Raw data? S3.
● When to use emr and when redshift?
Hive VS. Redshift
● Amount of concurrency ? low → hive, high → redshift
● Access to customers? Redshift? athena?
● Transformation, Unstructured , batch, ETL → hive.
● Peta scale ? redshift
● Complex joins → Redshift
SageMaker
● Web notebook (Jupyter
based) for data science
● Connects to all your data
sources (S3, Athena etc)
● Helps you manage the
entire machine learning
lifecycle
● Managed Service
● Used to create an ML model to
predict cookie gender
AWS Glue
Shared meta store
Helps with some data transformation (managed service)
Automatic Schema discovery
AWS RDS (Aurora, Postgres, MySQL)
● We used RDS Aurora as Operational DB
● We did not use it for big data analytics although it supports up to 64 TB
● It is row based.
● The syntax is missing analytical functions
Big Data Generic Architecture | Modeling
Data Ingestion
S3
Data Transformation
Data Modeling
Data Presentation
Data Presentation
Used ONLY for presenting data to
operational applications or BI. Use a
managed service to ensure HA.
Big Data Generic Architecture | Presentation
Data Collection
S3
Data Transformation
Data Modeling
Data Visualization
Athena
● Presto SQL
● In memory
● Hive metastore for DDL functionality
○ Complex data types
○ Multiple formats
○ Partitions
● Now supports CTAS (No inserts are supported)
● Good for:
○ Read only SQL,
○ Ad hoc query,
○ low cost,
○ Managed
● Good cost reduction article on athena
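Partitions matter for Athena's cost because queries are billed by the amount of data scanned. The sketch below uses a Hive-style `dt=` key layout and compares bytes scanned with and without a partition filter; the object keys and sizes are invented for illustration, and the filtering is a local imitation of partition pruning, not an Athena call.

```python
# Hypothetical S3 listing: object key -> size in bytes.
objects = {
    "events/dt=2019-02-17/part-0.parquet": 400,
    "events/dt=2019-02-18/part-0.parquet": 300,
    "events/dt=2019-02-18/part-1.parquet": 200,
}


def bytes_scanned(objects, dt=None):
    """Sum object sizes, restricted to one dt partition when a filter is given."""
    return sum(size for key, size in objects.items()
               if dt is None or f"/dt={dt}/" in key)


print(bytes_scanned(objects))                # 900
print(bytes_scanned(objects, "2019-02-18"))  # 500
```

A query with a `WHERE dt = '2019-02-18'` predicate on a table partitioned this way reads only the matching prefix, so both runtime and cost shrink with the amount of data skipped.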
Visualize
● QuickSight
● Managed Visualizer, simple, cheap
Summary: Take Away Message &
Action Items
Big Data Generic Architecture | Summary
Data Ingestion
S3
Data Transformation
Data Modeling
Data Presentation
Summary: Lesson learned
● Decouple, Decouple, Decouple
● Productivity of Data Science and Data engineering
○ Common language of both teams IS SQL!
○ Minimize the life cycle from dev to production of ETL and ML jobs
● Minimize the number of DBs used
○ Different syntax (Presto/Hive/Redshift)
○ Different data types
○ Minimize ETLs via External Tables + Glue!
● Streaming is not always justified (what is the business use case? PaaS?)
● Spark SQL
○ Sometimes faster than redshift
○ Sometimes slower than hive
○ Learning curve is non trivial
Summary: Common Q&A
1. Can this architecture be done on another cloud?
2. Redshift VS EMR ?
3. Athena VS Redshift?
4. Cost reduction on EMR?
5. Cost Reduction on Athena?
6. Exporting data from Google Analytics into AWS?
Lesson learned: Big Data Architecture ?
Faster! Cheaper! Simpler!
How to get started | Call for Action
Lectures: AWS Big Data Demystified lectures #1 through #4
AWS Big Data Demystified Meetup Big Data Demystified meetup
Stay in touch...
● Omid Vahdaty
● +972-54-2384178
● https://big-data-demystified.ninja/
● Join our meetups and subscribe to our YouTube channels
○ https://www.meetup.com/AWS-Big-Data-Demystified/
○ https://www.meetup.com/Big-Data-Demystified/
○ Big Data Demystified YouTube
○ AWS Big Data Demystified YouTube
○ WhatsApp group

AWS Big Data Demystified #1.2 | Big Data architecture lessons learned

  • 1. AWS Big Data Demystified #1.2 Big Data Architecture Lessons Learned Omid Vahdaty, Big Data Ninja
  • 2. Disclaimer ● I am not the best, I simply love what I do VERY much. ● You are more than welcome to challenge me or anything I have to say as I could be wrong. ● This Lecture has evolves over time, this is the 3nd iteration. ● Feel Free to send me comments
  • 4. In the Past(web,api, ops db, data warehouse) 4
  • 5. Then came Big Data... 5
  • 6. Then came the cloud... 6
  • 7. Then came the invoice ...
  • 9. TODAY’S BIG DATA APPLICATION STACK PaaS and DC...
  • 10. MY BIG DATA APPLICATION STACK “demystified”...
  • 11. MY AWS BIG DATA APPLICATION STACK With some “Myst”...
  • 13. Big Jargon & Basics concepts u should know https://amazon-aws-big-data-demystified.ninja/2019/02/18/big-data-jargon-faqs- and-everything-you-wanted-to-know-and-didnt-ask-about-big-data/ ● What is Big Data? ● scale out / up ? ● structured/semi structured/unstructured data ? ● ACID? ● OLAP VS OLTP? == Analytics VS operational ● DIY = Do it Yourself ● PaaS = Platform as a service
  • 14. Big Data = When your data outgrows your infrastructure ability to process ● Volume (x TB processing per day) ● Velocity ( x GB/s) ● Variety (JSON, CSV, events etc) ● Veracity (how much of the data is accurate?)
  • 15. Challenges creating big data architecture? ● What is the business use case ? How fast do u need the insights? ○ 15 min - 24 hours delay and above → use batch ○ Less than 15 min? ■ Might be batch - depends data source is files or events?. ■ Streaming? ● Sub seconds delay? ● Sub minute delay? ■ Streaming with in flight analytics ? ○ How complex is the compute jobs? Aggregations? joins?
  • 16. Challenges creating big data architecture? ● What is the Velocity? ○ Under 100K events per second? Not a problem ○ Over 1M events per second? Costly. But doable. ○ Over 1B events per seconds? Not trivial at all. ● Volume ? ○ ~1TB a day ? Not a problem ○ Over ? it depends. ○ Over a petabyte? Well…. It depends. ● Veracity (how are you going to handle different data sources?) ○ Structured (CSV) ○ Semi structured (JSON,XML) ○ Unstructured (pictures, movies etc)
  • 17. Challenges creating big data architecture? ● Performance targets? ● Costs targets? ● Security restrictions? ● Regulation restriction? privacy? ● Which technology to choose? ● Datacenter or cloud? ● Latency? ● Throughput? ● Concurrency? ● Security Access patterns? ● Pass? Max 7 technologies ● Iaas? Max 4 technologies
  • 18. Cloud Architecture rules of thumb... ● Decouple : ○ Store ○ Process ○ Store ○ Process ○ insight... ● Rule of thumb: max 3 technologies in dc, 7 tech max in cloud ○ Do use more b/c: maintenance ○ Training time ○ complexity/simplicity
  • 19. How to get started on big data architecture? Get answers to the below: 1. What is the business use case? a. Volume? velocity? Variety? vercity/ b. Did you map all of data sources? 2. Where should we build a data platform? a. What is the product? → requirements. b. Cloud? datacenter? 3. Architecture? a. DIY? Paas? , Pay as you go? Or fixed? Decoupled? b. Fast? Cheap? simple? 4. Did you Communicate you plans? 5. Did you map all known challenges?
  • 21. Use Case 1: Analyzing browsing history ● Data Collection: browsing history from an ISP ● Product - derives user intent and interest for marketing purposes. ● Challenges ○ Velocity: 1 TB per day ○ History of: 3M ○ Remote DC ○ Enterprise grade security ○ Privacy
  • 22. Use Case 2: Insights from location based data ● Data collection: from a Mobile operator ● Products: ○ derives user intent and interest for marketing purposes. ○ derive location based intent for marketing purposes. ● Challenges ○ Velocity: 4GB/s … ○ Scalability: Rate was expected double every year... ○ Remote DC ○ Enterprise grade security ○ Privacy ○ Edge analytics
  • 23. Use Case 3: Analyzing location based events. ● Data collection: streaming ● Product: building location based audiences ● Challenges: minimizing DevOps work on maintenance
  • 24. Getting started (notice the markings on upper right corner) I didn't choose this technology I choose this technology
  • 25. So what is the product? ● Big data platform that ○ ingests data from multiple sources (cloud and DC) ○ Analyzes the data ○ Generates insights : ■ Smart Segments (online marketing) ■ Smart reports (for marketer) ■ Audience analysis (for agencies) ● Customers? ○ Marketers ○ Publishers ○ Agencies
  • 26. Where to build the data platform?
  • 27. We choose AWS because ● After a long competitive analysis we choose AWS because, it seems to have all the relevant features For all our big data products and wallas publisher products ● The project was challenging enough, without adding the complexity of a learning curve (learning new cloud). We already knew how to work with AWS ● Of course, there business aspects as well.
  • 28. My Big Data product does: ● Data Ingestion ○ Online ■ messaging ■ Streaming ○ Offline ■ Batch ■ Performance aspects ● Data Transformation (Hive) ○ JSON, CSV, TXT, PARQUET, Binary ● Data Modeling - (R, ML, AI, DEEP, SPARK) ● Data Visualization (choose your poison) ● PII regulation + GPDR regulation ● And: Performance... Cost… Security… Simple... Cloud best practices...
  • 29. Big Data Generic Architecture Data Ingestion (file based ETL from remote DC) Data Transformation ( row to colunar + cleansing) Data Modeling ( joins/agg/ML/R) Data Presentation Text, RAW
  • 30. Data Ingestion A layer in your big data architecture designed to do one thing: ingest data via batch or streaming, i.e. move data (and only move it) from point A to point B, from the source to the next layer in the architecture (decoupled).
  • 31. Big Data Generic Architecture | Data Ingestion Data Ingestion Data Transformation Data Modeling Data Presentation
  • 32. Batch Data collection considerations ● Every hour, about 30GB of compressed CSV files ● Why S3 ○ Multi part upload ○ S3 CLI ○ S3 SDK ○ (tip: gzip!) ● Why ETL Client - needs to run at remote DC ● Why NOT your own ETL client ○ Involves code → ■ Bugs? ■ Maintenance ○ Don't analyze data at the Edge, you can't go back in time. ● Why Not Streaming? ○ Less accurate ○ Expensive
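The gzip tip above can be sketched in a few lines. This is a minimal illustration, not the deck's actual ETL client: the function names, the sample rows, and the `dt=/hour=` key layout are assumptions chosen to match an hourly batch drop.

```python
import csv
import gzip
import io
from datetime import datetime, timezone


def gzip_csv(rows):
    """Serialize rows as CSV text and gzip the result in memory."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return gzip.compress(buffer.getvalue().encode("utf-8"))


def hourly_key(prefix, ts):
    """Build an object key partitioned by day and hour (hypothetical layout)."""
    return f"{prefix}/dt={ts:%Y-%m-%d}/hour={ts:%H}/browsing.csv.gz"


# Hypothetical sample data standing in for one hourly export.
rows = [["user_id", "url"], ["42", "example.com/a"], ["43", "example.com/b"]]
payload = gzip_csv(rows)
key = hourly_key("raw", datetime(2019, 2, 18, 7, tzinfo=timezone.utc))
```

The compressed `payload` is what would be handed to the S3 CLI or SDK for upload under `key`; compressing before transfer is what keeps the hourly file small on the link from the remote DC.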
  • 33. S3 Considerations ● Security ○ at rest: server side S3-Managed Keys (SSE-S3) ○ at transit: SSL / VPN ○ Hardening: user, IP ACL, write permission only. ● Upload ○ AWS s3 cli ○ Multi part upload ○ Aborting Incomplete Multipart Uploads Using a Bucket Lifecycle Policy ○ Consider S3 CLI Sync command instead of CP
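The lifecycle policy for aborting incomplete multipart uploads can be expressed as a small configuration. The sketch below assumes boto3 is the SDK in use; the rule ID, the 7-day window, and the helper names are illustrative choices, not values from the slides.

```python
def abort_incomplete_uploads_rule(days=7, prefix=""):
    """Build an S3 lifecycle configuration that cleans up stale multipart uploads."""
    return {
        "Rules": [
            {
                "ID": "abort-incomplete-multipart-uploads",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix applies to the whole bucket
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
            }
        ]
    }


def apply_rule(bucket, days=7):
    """Attach the rule to a bucket (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the sketch loads without AWS libraries

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=abort_incomplete_uploads_rule(days),
    )


config = abort_incomplete_uploads_rule(days=7)
```

Without such a rule, parts of failed uploads stay in the bucket and keep being billed, which matters when large hourly files are pushed from a remote DC over an unreliable link.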
  • 34. Sqoop - ETL ● Open source, part of EMR ● HDFS to RDBMS and back, via JDBC. ● E.g. bidirectional ETL between RDS and HDFS ● Unlikely use case: ETL from a customer's source operational DB.
  • 35. Flume & Kafka ● Open source projects for streaming & messaging ● Popular ● Generic ● Good practice for many use cases. (a meetup by itself) ● Highly durable, scalable, extensible etc. ● Downside: DIY, non trivial to get started
  • 36. Data Transfer Options ● Direct Connect (4GB/s?) ● For all other use cases ○ S3 multipart upload ○ Compression ○ Security ■ Data in motion ■ Data at rest
  • 37. Quick intro to Stream ingestion ● Kinesis Client Library (code) ● AWS Lambda (code) ● EMR (managed Hadoop) ● Third party (DIY) ○ Spark Streaming (latency min = 1 sec), near real time, with a lot of libraries. ○ Storm - Most real time (sub millisec), Java code based. ○ Flink (similar to Spark)
  • 38. Kinesis family of products ● Kinesis Stream - collect@source and near real time processing ○ Near real time ○ High throughput ○ Low cost ○ Easy administration - set desired level of capacity ○ Delivery to: S3, Redshift, Dynamo, ... ○ Per shard: ingress 1 MB/s, egress 2 MB/s, up to 1,000 records per second. ○ Not managed! ● Kinesis Analytics - in flight analytics. ● Kinesis Firehose - Park your data @ destination.
  • 39. Kinesis Firehose - for Data parking ● Not for fast lane - no in flight analytics ● Ingest, transform and load to: ○ Kinesis ○ S3 ○ Redshift ○ Elasticsearch ● Managed Service
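To "set desired level of capacity" you need a shard count. A rough sizing sketch follows, reading the slide's figures as per-shard limits (1 MB/s in, 2 MB/s out, 1,000 records/s in); the function name and the example workloads are assumptions for illustration.

```python
import math

# Per-shard limits, taken from the figures on the slide.
INGRESS_MB_PER_SHARD = 1.0
EGRESS_MB_PER_SHARD = 2.0
RECORDS_PER_SHARD = 1000


def shards_needed(ingress_mb_s, records_s, egress_mb_s):
    """Return the smallest shard count that satisfies all three limits."""
    return max(
        1,
        math.ceil(ingress_mb_s / INGRESS_MB_PER_SHARD),
        math.ceil(records_s / RECORDS_PER_SHARD),
        math.ceil(egress_mb_s / EGRESS_MB_PER_SHARD),
    )


# A stream in the range of Use Case 2 (about 4 GB/s, taken here as 4000 MB/s).
large = shards_needed(ingress_mb_s=4000, records_s=500_000, egress_mb_s=4000)
# A small stream bounded by record count rather than bytes.
small = shards_needed(ingress_mb_s=0.5, records_s=3000, egress_mb_s=1)
```

At the velocity of Use Case 2 the ingress limit alone calls for thousands of shards, which is one way to see why streaming at that rate is costly.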
  • 40. Comparison of Kinesis products ● Streams ○ Sub 1 sec processing latency ○ Choice of stream processor (generic) ○ For smaller events ● Firehose ○ Zero admin ○ 4 targets built in (redshift, s3, search, etc) ○ Buffering 60 sec minimum. ○ For larger “events”
  • 41. Data Transformation A layer in your big data architecture designed to : Transform and Cleanse data (row data to columnar data and convert data types, Fix bugs in data)
  • 42. Big Data Generic Architecture | Transformation Data Ingestion S3 Data Transformation Data Modeling Data Presentation
  • 43. EMR ecosystem ● Hive ● Pig ● Hue ● Spark ● Oozie ● Presto ● Ganglia ● Zookeeper (hbase) ● zeppelin
  • 44. EMR Architecture ● Master node ● Core nodes - like data nodes (with storage: HDFS) ● Task nodes - (extends compute) ● Does Not have Standby Master node ● Best for transient cluster (goes up and down every night)
  • 45. EMR lessons learned... ● A bigger instance type is good architecture ● Use spot instances - for the task nodes only. ● Don't always use TEZ (MR? Spark?) ● Make sure you choose a network optimized instance type ● Resizing a cluster is not recommended ● Use Bootstrap actions to automate the cluster upon provisioning ● Use Steps to automate jobs on a running cluster ● Use Glue to share the Hive MetaStore ● Good Cost reduction article on EMR
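Several of these lessons (spot for task nodes only, bootstrap actions, steps, Glue as the shared metastore, a transient cluster) fit into one cluster request. The sketch below builds the arguments for boto3's `run_job_flow`; the release label, instance types, node counts, and S3 paths are placeholder assumptions.

```python
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def transient_cluster_request(name, script_s3_path, bootstrap_s3_path, task_nodes=4):
    """Build keyword arguments for an EMR run_job_flow call."""

    def group(role, market, count):
        # Instance type is a placeholder; choose one that fits your workload.
        return {"Name": role.lower(), "InstanceRole": role, "Market": market,
                "InstanceType": "m5.xlarge", "InstanceCount": count}

    return {
        "Name": name,
        "ReleaseLabel": "emr-5.20.0",  # placeholder release
        "Applications": [{"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                group("MASTER", "ON_DEMAND", 1),
                group("CORE", "ON_DEMAND", 2),   # core nodes hold HDFS, keep on demand
                group("TASK", "SPOT", task_nodes),  # spot for task nodes only
            ],
            "KeepJobFlowAliveWhenNoSteps": False,  # transient: terminate after steps
        },
        "BootstrapActions": [
            {"Name": "setup",
             "ScriptBootstrapAction": {"Path": bootstrap_s3_path, "Args": []}}
        ],
        "Steps": [
            {"Name": "nightly-hive",
             "ActionOnFailure": "TERMINATE_CLUSTER",
             "HadoopJarStep": {
                 "Jar": "command-runner.jar",
                 "Args": ["hive-script", "--run-hive-script",
                          "--args", "-f", script_s3_path]}}
        ],
        "Configurations": [
            {"Classification": "hive-site",
             "Properties": {"hive.metastore.client.factory.class": GLUE_FACTORY}}
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = transient_cluster_request(
    "nightly-etl", "s3://my-bucket/jobs/etl.hql", "s3://my-bucket/bootstrap.sh"
)
# To launch (needs boto3 and credentials): boto3.client("emr").run_job_flow(**request)
```

Keeping master and core nodes on demand means a spot interruption only removes compute capacity, not HDFS data or the cluster itself.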
  • 46. So use EMR for ... ● Most dominant ○ Hive ○ Spark ○ Presto ● And many more…. ● Good for: ○ Data transformation ○ Data modeling ○ Batch ○ Machine learning
  • 47. Hive ● SQL over hadoop. ● Engine: spark, tez, MR ● JDBC / ODBC ● Not good when need to shuffle. ● Not peta scale. ● SerDe json, parquet,regex,text etc. ● Dynamic partitions ● Insert overwrite ● Data Transformation ● Convert to Columnar
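The "insert overwrite", "dynamic partitions", and "convert to columnar" bullets usually appear together in one job. As a hedged sketch, the helper below generates the HiveQL for such a job; the table and column names are hypothetical, and the target table is assumed to already exist with `STORED AS PARQUET` and a matching partition column.

```python
def row_to_columnar_hql(source, target, columns, partition_col):
    """Generate HiveQL that rewrites a row-format table into a partitioned columnar one."""
    # Hive expects the dynamic partition column last in the SELECT list.
    select_list = ", ".join(list(columns) + [partition_col])
    return "\n".join([
        "SET hive.exec.dynamic.partition=true;",
        "SET hive.exec.dynamic.partition.mode=nonstrict;",
        f"INSERT OVERWRITE TABLE {target} PARTITION ({partition_col})",
        f"SELECT {select_list}",
        f"FROM {source};",
    ])


# Hypothetical tables: events_raw is CSV/JSON, events_parquet is the columnar copy.
hql = row_to_columnar_hql("events_raw", "events_parquet", ["user_id", "url"], "dt")
```

The generated script can be run as an EMR step or from Hue; `INSERT OVERWRITE` makes the job safe to re-run for the same partitions.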
  • 48. Presto ● SQL over Hadoop ● Not always good for joins on 2 large tables. ● Limited by memory ● Not fault tolerant like Hive. ● Optimized for ad hoc queries ● No insert overwrite ● No dynamic partitions. ● Has some connectors: Redshift and more ● https://amazon-aws-big-data-demystified.ninja/2018/07/02/aws-emr-presto-demystified-everything-you-wanted-to-know-about-presto/
  • 49. Pig ● Distributed shell scripting ● Generates SQL like operations. ● Engine: MR, Tez ● S3, DynamoDB access ● Use case: for data scientists who don't know SQL, for system people, for those who want to avoid Java/Scala ● A fair fight compared to Hive in terms of performance only ● Good for unstructured files ETL: file to file, and use Sqoop.
  • 50. Hue ● Hadoop user experience ● Logs in real time and failures. ● Multiple users ● Native access to S3. ● File browser to HDFS. ● Manipulate metastore ● Job Browser ● Query editor ● HBase browser ● Sqoop editor, Oozie editor, Pig Editor
  • 51. Orchestration ● EMR Oozie ○ Open source workflow ■ Workflow: graph of actions ■ Coordinator: schedules jobs ○ Support: Hive, Sqoop, Spark etc. ● Other options: AirFlow, Knime, Luigi, Azkaban, AWS Data Pipeline
  • 52. Big Data Generic Architecture | Transformation Data Ingestion S3 Data Transformation Data Modeling Data Visualization
  • 53. Data Modeling A layer in your big data architecture designed to Model data: Joins, Aggregations, nightly jobs, Machine learning
  • 54. Big Data Generic Architecture | Modeling Data Ingestion S3 Data Transformation Data Modeling Data Presentation
  • 55. Spark ● In memory ● 10x to 100x faster than Hive ● Good optimizer for distribution ● Rich API ● Spark SQL ● Spark Streaming ● Spark ML (ML lib) ● Spark GraphX (DB graphs) ● SparkR
  • 56. Spark Streaming ● Near real time (1 sec latency) ● like batch of 1sec windows ● Streaming jobs with API ● DIY = Not relevant to us...
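The "batch of 1 sec windows" idea can be shown without Spark at all. The following is a plain Python sketch of micro-batching, not the Spark Streaming API: events are grouped by the window their timestamp falls into, and each group is then processed as a small batch. The event tuples are made up for illustration.

```python
from collections import defaultdict


def micro_batches(events, window_sec=1.0):
    """Group (timestamp_seconds, payload) events into fixed windows, ordered by time."""
    batches = defaultdict(list)
    for ts, payload in events:
        # Every event lands in the window that starts at a multiple of window_sec.
        window_start = int(ts // window_sec) * window_sec
        batches[window_start].append(payload)
    return sorted(batches.items())


events = [(0.1, "a"), (0.9, "b"), (1.2, "c"), (3.5, "d")]
batches = micro_batches(events)
```

Here `a` and `b` share the first window, so neither is processed until that window closes; this is where the roughly one second latency floor comes from.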
  • 57. Spark ML ● Classification ● Regression ● Collaborative filtering ● Clustering ● Decomposition ● Code: java, scala, python, sparkR
  • 58. Spark flavours ● Standalone ● With yarn ● With mesos
  • 59. Spark Downside ● Compute intensive ● Performance gain over mapreduce is not guaranteed. ● Streaming processing is actually batch with very small window.
  • 60. Spark SQL ● Same syntax as Hive ● Optional JDBC via Thrift ● Non trivial learning curve ● Up to 10x faster than Hive. ● Works well with Zeppelin (out of the box) ● Does not replace Hive ● Spark is not always faster than Hive ● insert overwrite -
  • 61. Apache Zeppelin ● Notebook - visualizer ● Built in spark integration ● Interactive data analytics ● Easy collaboration. ● Uses SQL ● works on top of Hive/ Spark SQL ● Inside EMR. ● Uses in the background: ○ Shiro ○ Livy
  • 62. R + SparkR ● Open source package for statistical computing. ● Works with EMR ● “Matlab” equivalent ● Works with Spark ● Not for developers :) for statisticians ● R is single threaded - use SparkR to distribute. ● Not everything works perfectly.
  • 63. Redshift ● OLAP, not OLTP→ analytics , not transaction ● Fully SQL ● Fully ACID ● No indexing ● Fully managed ● Petabyte Scale ● MPP ● Can create slow queue for queries ○ which are long lasting. ● DO NOT USE FOR transformation. ● Good for : DW, Complex Joins.
  • 64. Redshift spectrum ● Extension of Redshift, use external table on S3. ● Require redshift cluster. ● Not possible for CTAS to s3, complex data structure, joins. ● Good for ○ Read only Queries ○ Aggregations on Exabyte.
  • 65. EMR vs Redshift ● How much data is loaded and unloaded? ● Which operations need to be performed? ● Recycling data? → EMR ● History to be analyzed again and again? → EMR ● Where does the data need to end up? BI? ● Use Spectrum in some use cases (aggregations)? ● Raw data? S3. ● When to use EMR and when Redshift?
  • 66. Hive VS. Redshift ● Amount of concurrency ? low → hive, high → redshift ● Access to customers? Redshift? athena? ● Transformation, Unstructured , batch, ETL → hive. ● Peta scale ? redshift ● Complex joins → Redshift
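The rules of thumb on this slide can be written down as a small decision helper. This is only a sketch of the slide's heuristics; the parameter names and the order in which the rules are checked are assumptions, and real choices involve cost and team skills as well.

```python
HIVE_WORKLOADS = {"transformation", "unstructured", "batch", "etl"}


def suggest_engine(workload="analytics", concurrency="low",
                   peta_scale=False, complex_joins=False):
    """Suggest 'hive' or 'redshift' from the slide's rules of thumb."""
    # Transformation, unstructured data, batch, and ETL go to Hive.
    if workload in HIVE_WORKLOADS:
        return "hive"
    # Peta scale, complex joins, or many concurrent users go to Redshift.
    if peta_scale or complex_joins or concurrency == "high":
        return "redshift"
    # Low concurrency with none of the above stays on Hive.
    return "hive"


nightly_etl = suggest_engine(workload="etl", concurrency="high")
bi_dashboard = suggest_engine(concurrency="high", complex_joins=True)
```

Checking the workload type first reflects the deck's advice not to use Redshift for transformation, even when the other signals point at it.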
  • 67. SageMaker ● Web notebook (Jupyter based) for data science ● Connects to all your data sources (S3, Athena etc.) ● Helps you manage the entire machine learning lifecycle ● Managed Service ● Used to create an ML model to predict cookie gender
  • 68. AWS Glue ● Shared metastore ● Helps with some data transformation (managed service) ● Automatic schema discovery
  • 69. AWS RDS (Aurora, Postgres, MySQL) ● We used RDS Aurora as the operational DB ● We did not use it for big data analytics although it supports up to 64 TB ● It is row based. ● The syntax is missing analytical functions
  • 70. Big Data Generic Architecture | Modeling Data Ingestion S3 Data Transformation Data Modeling Data Presentation
  • 71. Data Presentation Used ONLY for presenting data for operational applications or BI, Use managed service to ensure HA.
  • 72. Big Data Generic Architecture | Presentation Data Collection S3 Data Transformation Data Modeling Data Visualization
  • 73. Athena ● Presto SQL ● In memory ● Hive metastore for DDL functionality ○ Complex data types ○ Multiple formats ○ Partitions ● Now supports CTAS (No inserts are supported) ● Good for: ○ Read only SQL, ○ Ad hoc query, ○ low cost, ○ Managed ● Good cost reduction article on athena
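Since Athena supports CTAS, a query result can be written back to S3 in a columnar format. The helper below generates such a statement as a sketch; the table names and S3 location are hypothetical, and `SELECT *` assumes the partition column is already the last column of the source table, as Athena requires for partitioned CTAS.

```python
def athena_ctas(new_table, source_table, s3_location, partition_col=None):
    """Generate an Athena CTAS statement that writes Parquet to the given S3 path."""
    props = ["format = 'PARQUET'", f"external_location = '{s3_location}'"]
    if partition_col:
        props.append(f"partitioned_by = ARRAY['{partition_col}']")
    return (
        f"CREATE TABLE {new_table}\n"
        f"WITH ({', '.join(props)})\n"
        f"AS SELECT * FROM {source_table}"
    )


sql = athena_ctas("events_parquet", "events_raw", "s3://my-bucket/parquet/events/", "dt")
```

The statement can be pasted into the Athena console or submitted through an SDK; the resulting Parquet files reduce the bytes scanned by later ad hoc queries, which is what drives Athena cost.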
  • 74. Visualize ● QuickSight ● Managed Visualizer, simple, cheap
  • 75. Summary Take Away Message & Action Items
  • 76. Big Data Generic Architecture | Summary Data Ingestion S3 Data Transformation Data Modeling Data Presentation
  • 77. Summary: Lessons learned ● Decouple, Decouple, Decouple ● Productivity of Data Science and Data Engineering ○ The common language of both teams IS SQL! ○ Minimize the life cycle from dev to production of ETL and ML jobs ● Minimize the number of DBs used ○ Different syntax (Presto/Hive/Redshift) ○ Different data types ○ Minimize ETLs via External Tables + Glue! ● Streaming is not always justified (what is the business use case? PaaS?) ● Spark SQL ○ Sometimes faster than Redshift ○ Sometimes slower than Hive ○ Learning curve is non trivial
  • 78. Summary: Common Q&A 1. Can this architecture be done on another cloud? 2. Redshift VS EMR? 3. Athena VS Redshift? 4. Cost reduction on EMR? 5. Cost reduction on Athena? 6. Exporting data from Google Analytics into AWS?
  • 79. Lesson learned: Big Data Architecture ? Faster! Cheaper! Simpler!
  • 80. How to get started | Call for Action Lectures: AWS Big Data Demystified lectures #1 until #4 AWS Big Data Demystified Meetup Big Data Demystified meetup
  • 81. Stay in touch... ● Omid Vahdaty ● +972-54-2384178 ● https://big-data-demystified.ninja/ ● Join our meetups subscribe to youtube channels ○ https://www.meetup.com/AWS-Big-Data-Demystified/ ○ https://www.meetup.com/Big-Data-Demystified/ ○ Big Data Demystified YouTube ○ AWS Big Data Demystified YouTube ○ WhatsApp group