© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Claudio Pontili, XPeppers
31/05/2017
Big problems Big Data, simple
AWS solutions
How can you squeeze money from your
big data? A real case scenario with
Mondadori
Claudio Pontili
https://www.linkedin.com/in/claudiopontili/
AWS Authorized Instructor and Solution Architect at XPeppers
XPeppers
Office in Lugano
Arnoldo Mondadori Editore is the biggest publishing
company in Italy.
• A group that includes websites such as www.eprice.com,
www.giallozafferano.com, and www.mediasetpremium.it
• More than 200M daily unique visitors (or cookies)
• Sells advertising on its public websites
Squeezing money from Big Data
How can you increase the revenue from 200M visitors?
• Identify the gender and age of each visitor to improve
the advertising conversion rate
• How? By running a machine learning algorithm on each
unique visitor (or cookie) → Big Data
• What is the Mondadori group's current solution?
Requirements
• Create a linearly scalable Big Data architecture on AWS
to analyze millions of cookies
• We can analyze them every night and send the results to
the advertising platform
• Use AWS Managed services for lower Time-to-Market
and costs
• Last but not least, it must be cheaper than
Aggregate All Data in an S3 Data Lake
Surrounded by a collection of the right tools
[Diagram: Amazon S3 at the center, surrounded by Amazon EMR, Amazon Redshift, Amazon DynamoDB, Amazon RDS, AWS Data Pipeline, Amazon Kinesis, Cassandra, Storm, and Spark Streaming]
Hadoop Cluster using Amazon EMR
• Managed cluster platform
• Run Hadoop, Spark, Presto, Pig, etc.
• Launch a cluster in minutes
• Deploy multiple clusters
• Resize a running cluster
• Use Spot Instances to reduce costs
• Cost-effectively process vast amounts of data
• Use S3 as Data Lake
First architecture using EMR, Pro and Cons
Pro
• EMR lets you create a huge Hadoop cluster and run it
for a few hours
• Query S3 data using SQL-like commands with Pig
• Run huge, scalable computations
Cons
• What’s the right tool on the Hadoop platform?
• Java, Python, or what?
• Steep learning curve
Towards a serverless architecture
Serverless computing allows you to build and run
applications and services without having to manage
infrastructure.
• Athena (released in preview in November 2016)
• Batch (released in November 2016)
• Lambda
• Kinesis Firehose
Amazon Athena
• Interactive query service
• Runs interactive SQL queries on Amazon S3
data
• No need to load or aggregate data:
“schema-on-read”
• Cross-region queries are supported
• Supports ANSI SQL operators and functions
• No infrastructure or administration to create,
manage, or scale (serverless)
• $5 per TB of data scanned (use compression)
From 200M cookies to 10M using Athena
• We don’t need to calculate the gender every night for all
200M cookies. We can run machine learning only on new
cookies.
• How can you compute the “delta cookies”? → SQL LEFT JOIN
between “today's cookies” and “yesterday's cookies”
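The LEFT JOIN idea above can be sketched as an anti-join: keep the rows from today's table that have no match in yesterday's. The table names `today_cookies` and `yesterday_cookies` and the column `cookie_id` are assumptions for illustration, and an in-memory SQLite database stands in for Athena here so the query can be tried locally.

```python
import sqlite3

# Hypothetical table and column names; the slides only name the two data sets.
DELTA_QUERY = """
SELECT DISTINCT t.cookie_id
FROM today_cookies AS t
LEFT JOIN yesterday_cookies AS y
  ON t.cookie_id = y.cookie_id
WHERE y.cookie_id IS NULL
ORDER BY t.cookie_id
"""


def delta_cookies(today, yesterday):
    """Return the cookies seen today that were not seen yesterday."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE today_cookies (cookie_id TEXT)")
        conn.execute("CREATE TABLE yesterday_cookies (cookie_id TEXT)")
        conn.executemany(
            "INSERT INTO today_cookies VALUES (?)", [(c,) for c in today]
        )
        conn.executemany(
            "INSERT INTO yesterday_cookies VALUES (?)", [(c,) for c in yesterday]
        )
        # Rows with no match on the right side of the LEFT JOIN have NULL there.
        return [row[0] for row in conn.execute(DELTA_QUERY)]
    finally:
        conn.close()


if __name__ == "__main__":
    print(delta_cookies(["a", "b", "c", "d"], ["a", "c"]))
```

Only the cookies returned by this query would need to go through the machine learning step, which is how the nightly workload shrinks from 200M to about 10M cookies.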
Athena console interface Hue
AWS Athena JDBC connection
http://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html
AWS Athena API, CLI, and SDKs
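As a rough illustration of the SDK route, the sketch below starts an Athena query and polls until it finishes. It assumes a boto3-style Athena client is passed in (for example `boto3.client("athena")`); the helper name `run_athena_query`, the database name, and the S3 output location in the usage note are hypothetical.

```python
import time


def run_athena_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Start a query, wait for it to finish, and return its execution id.

    `athena` is expected to behave like boto3.client("athena").
    """
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return query_id
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {query_id} ended in state {state}")
        # Still QUEUED or RUNNING: wait before asking again.
        time.sleep(poll_seconds)
```

A call might look like `run_athena_query(boto3.client("athena"), "SELECT 1", "cookies_db", "s3://my-bucket/athena-results/")`, where the database and bucket are placeholders. Athena writes the result set as a file under the output location, so downstream steps can read it from S3.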
Use AWS Lambda to run 10M computations
Fully managed compute service that runs stateless code
(Node.js, Java, Python, and C#) in response to an event or
on a time-based interval.
Allows you to run code without managing infrastructure like
Amazon EC2 instances and Auto Scaling groups.
Free tier of 3.2M seconds, free forever; after that, 0.000000208 USD per 100 ms
Architecture using Lambda, Pro and Cons
Pro
• Scalable infrastructure
• We can easily move our Python code
• Cheap
Cons
• Maximum uncompressed code size: 250 MB
• Stateless and independent code → we need a tool to
collect results
Collecting Lambda results with Kinesis
• Ingest streaming data
• Process data in real time
• Store terabytes of data per hour
Kinesis Analytics
For all developers, data scientists
Easily analyze data streams using standard SQL queries
Kinesis Firehose
For all developers, data scientists
Easily load massive volumes of data
Kinesis Streams
For technical developers
Collect and stream data for ordered, real-time processing
Collecting Lambda results with Kinesis Firehose
Capture and submit streaming data to Firehose
Firehose transforms and loads streaming data continuously into S3, Redshift, and Amazon Elasticsearch domains
Analyze streaming data using your favorite BI tools
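A minimal sketch of how one stateless Lambda invocation could hand its result to Firehose. The delivery stream name, the event shape (`cookie_id`, `features`), and the `predict` placeholder are all assumptions; the slides do not show the real model or payload. The Firehose client is passed into `process_cookie` so the logic can be exercised without AWS access.

```python
import json

# Hypothetical delivery stream name, not taken from the slides.
DELIVERY_STREAM = "cookie-predictions"

_firehose = None  # reused across warm invocations of the same Lambda container


def predict(features):
    """Placeholder for the real machine learning model (not shown in the slides)."""
    return {"gender": "unknown", "age_range": "unknown"}


def process_cookie(event, firehose):
    """Classify one cookie and push the result to Firehose as one JSON line."""
    result = {"cookie_id": event["cookie_id"], **predict(event.get("features", {}))}
    firehose.put_record(
        DeliveryStreamName=DELIVERY_STREAM,
        # Newline-delimited JSON keeps records separable once Firehose
        # concatenates them into S3 objects.
        Record={"Data": (json.dumps(result) + "\n").encode("utf-8")},
    )
    return result


def lambda_handler(event, context):
    """Lambda entry point: create the client lazily, then process the event."""
    global _firehose
    if _firehose is None:
        import boto3  # provided by the Lambda Python runtime

        _firehose = boto3.client("firehose")
    return process_cookie(event, _firehose)
```

Because each invocation keeps no state, the only shared piece is the delivery stream, which is what lets millions of independent invocations end up in one place in S3.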
Mondadori Big Data Serverless architecture
Daily cost of the architecture
• S3 compressed storage: 5.2 GB, 0.12 USD
• Athena: 10.4 GB of compressed data scanned, 0.05 USD
• 10M Lambda invocations, 500 ms per cookie: 41.7 USD
• Kinesis Firehose: 50 GB uncompressed, 1.55 USD
• Total price: 55 USD per day, for 1 hour of computation
• What happens if we double the number of visitors from
200M to 400M?
• It scales linearly: double the cost, same computation time
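The arithmetic behind these figures can be sketched as a small cost model. Two values are assumptions rather than slide content: the Lambda rate of 0.000000834 USD per 100 ms, chosen because it reproduces the 41.7 USD figure (the 0.000000208 rate quoted earlier would give 10.4 USD, so a larger memory setting was presumably used), and 1 TB taken as 1000 GB. The S3 and Firehose amounts are copied from the slide as-is.

```python
# Figures from the slide; the Lambda rate below is inferred, not stated there.
COOKIES_PER_DAY = 10_000_000
MS_PER_COOKIE = 500
LAMBDA_USD_PER_100MS = 0.000000834  # assumption: reproduces the 41.7 USD figure
ATHENA_USD_PER_TB = 5.0             # "$5 per TB of data scanned"


def lambda_cost(invocations, ms_each, usd_per_100ms=LAMBDA_USD_PER_100MS):
    """Cost of the compute time only, billed in 100 ms units."""
    billed_units = invocations * (ms_each / 100)
    return billed_units * usd_per_100ms


def athena_cost(gb_scanned, usd_per_tb=ATHENA_USD_PER_TB):
    """Cost of the data scanned, assuming 1 TB = 1000 GB."""
    return gb_scanned / 1000 * usd_per_tb


def daily_items(scale=1.0):
    """Itemized daily cost; `scale` multiplies the data volume (2.0 = 400M visitors)."""
    return {
        "s3_storage": 0.12 * scale,   # copied from the slide
        "athena": athena_cost(10.4 * scale),
        "lambda": lambda_cost(COOKIES_PER_DAY * scale, MS_PER_COOKIE),
        "firehose": 1.55 * scale,     # copied from the slide
    }


if __name__ == "__main__":
    for name, usd in daily_items().items():
        print(f"{name}: {usd:.2f} USD")
    print(f"itemized sum: {sum(daily_items().values()):.2f} USD")
```

Every item is proportional to the data volume, which is why doubling the visitors doubles the cost. The itemized entries sum to roughly 43.4 USD, below the 55 USD total on the slide; the slide does not say what accounts for the difference.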
AWS Batch: future improvement
• Fully managed batch primitives
• Focus on your applications (shell scripts,
Linux executables, Docker images) and their
resource requirements
• No code size limits if we move to Docker
• Support for Spot Instances
• No additional charge: you pay only for the underlying
resources that you consume
• Integrated monitoring and logging
• Job retries
• Support for Lambda functions coming soon
AWS Machine Learning: future improvement
• AWS cloud-based service for predictive
analytics
• Use tools and wizards to create machine
learning models
• Use simple APIs to obtain predictions for your
application
• No need to write custom code or have
supporting infrastructure
• Use models to process new data and
generate predictions
Questions!?!?!?!?
https://www.linkedin.com/in/claudiopontili/