Big problems Big data, simple AWS solution

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Claudio Pontili, XPeppers
31/05/2017
Big problems, Big Data, simple AWS solutions
How to squeeze money from your big data: a real case scenario with Mondadori

Claudio Pontili
https://www.linkedin.com/in/claudiopontili/
AWS Authorized Instructor and Solution Architect at XPeppers

Arnoldo Mondadori Editore is the biggest publishing
company in Italy.
• A group that includes websites such as www.eprice.com,
www.giallozafferano.com and www.mediasetpremium.it
• More than 200M daily unique visitors (or cookies)
• Sells advertising on its public websites

Squeezing money from Big Data
How can you improve the revenue from 200M visitors?
• Identify the gender and age of each visitor to improve the
conversion rate of advertising
• How? By running a machine learning algorithm on each
unique visitor (or cookie) → Big Data
• What is the Mondadori group's current solution?

Requirements
• Create a linearly scalable Big Data architecture on AWS
to analyze millions of cookies
• Analyze them every night and send the results to
the advertising platform
• Use AWS managed services for lower time-to-market
and costs
• Last but not least, it must be cheaper than

Aggregate All Data in an S3 Data Lake
Surrounded by a collection of the right tools
• Amazon S3 (the data lake)
• Amazon EMR
• Amazon Redshift
• Amazon DynamoDB
• Amazon RDS
• AWS Data Pipeline
• Amazon Kinesis
• Cassandra
• Storm
• Spark Streaming

Hadoop Cluster using Amazon EMR
• Managed cluster platform
• Run Hadoop, Spark, Presto, Pig, etc.
• Launch a cluster in minutes
• Deploy multiple clusters
• Resize a running cluster
• Use Spot Instances to reduce costs
• Cost-effectively process vast amounts of data
• Use S3 as Data Lake

First architecture using EMR: Pros and Cons
Pros
• EMR lets you create a huge Hadoop cluster and run it
for just a few hours
• Query S3 data using SQL-like commands with Pig
• Run huge, scalable computations
Cons
• What is the right tool on the Hadoop platform?
• Java, Python, or something else?
• Steep learning curve

Towards a serverless architecture
Serverless computing allows you to build and run
applications and services without having to manage
infrastructure.
• Athena (released in preview in November 2016)
• Batch (released in November 2016)
• Lambda
• Kinesis Firehose

Amazon Athena
• Interactive query service
• Runs interactive SQL queries on Amazon S3
data
• No need to load or aggregate data:
“schema-on-read”
• Cross-region queries are supported
• Supports ANSI SQL operators and functions
• No infrastructure or administration to create,
manage, or scale (serverless)
• $5 per TB of data scanned (use compression)

From 200M cookies to 10M using Athena
• We don’t need to calculate the gender every night for all
200M cookies. We can run machine learning only on new
cookies.
• How can you compute the “delta cookies”? → SQL LEFT JOIN
between “today's cookies” and “yesterday's cookies”, keeping
only the rows with no match in yesterday's set
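
The delta computation described above can be sketched as a LEFT JOIN that keeps only today's cookies with no match in yesterday's set. This is a minimal sketch, not the query used in the project: the table names `today_cookies` and `yesterday_cookies` and the column `cookie_id` are hypothetical, and Python's built-in `sqlite3` stands in for Athena only so that the example runs locally.

```python
import sqlite3

# Hypothetical table and column names; the slides do not show the real schema.
DELTA_QUERY = """
SELECT t.cookie_id
FROM today_cookies AS t
LEFT JOIN yesterday_cookies AS y
  ON t.cookie_id = y.cookie_id
WHERE y.cookie_id IS NULL
ORDER BY t.cookie_id
"""


def delta_cookies(today, yesterday):
    """Return cookies seen today but not yesterday, using a LEFT JOIN."""
    conn = sqlite3.connect(":memory:")  # local stand-in for the S3-backed tables
    try:
        conn.execute("CREATE TABLE today_cookies (cookie_id TEXT)")
        conn.execute("CREATE TABLE yesterday_cookies (cookie_id TEXT)")
        conn.executemany(
            "INSERT INTO today_cookies VALUES (?)", [(c,) for c in today]
        )
        conn.executemany(
            "INSERT INTO yesterday_cookies VALUES (?)", [(c,) for c in yesterday]
        )
        # Rows where the right side is NULL had no match yesterday: the delta.
        return [row[0] for row in conn.execute(DELTA_QUERY)]
    finally:
        conn.close()
```

Calling `delta_cookies(["a", "b", "c"], ["a", "c", "z"])` keeps only the cookie that is new today; cookies that disappeared since yesterday are ignored, which is what the nightly job needs.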

Amazon Athena JDBC connection
http://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html

Use AWS Lambda to run 10M computations
Fully managed compute service that runs stateless code
(Node.js, Java, Python, and C#) in response to an event or
on a time-based interval.
Allows you to run code without managing infrastructure like
Amazon EC2 instances and Auto Scaling groups.
3.2M seconds free forever, then 0.000000208 USD per 100 ms

Architecture using Lambda: Pros and Cons
Pros
• Scalable infrastructure
• We can easily move our Python code
• Cheap
Cons
• Maximum uncompressed code size of 250 MB
• Stateless and independent code → we need a tool to
collect results

Collecting Lambda results with Kinesis
• Ingest streaming data
• Process data in real time
• Store terabytes of data per hour
Kinesis Streams
For technical developers
Collect and stream data for ordered, real-time processing
Kinesis Firehose
For all developers, data scientists
Easily load massive volumes of data
Kinesis Analytics
For all developers, data scientists
Easily analyze data streams using standard SQL queries

Collecting Lambda results with Kinesis Firehose
• Capture and submit streaming data to Firehose
• Firehose transforms and loads streaming data
continuously into S3, Redshift, and Amazon
Elasticsearch domains
• Analyze streaming data using your favorite BI tools
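
The flow above, where each stateless Lambda hands its result to Firehose, could look roughly like the sketch below. It is an illustration under assumptions, not the project's code: the delivery stream name `cookie-predictions`, the event shape, and the `predict` placeholder are all hypothetical. The only AWS call used is the boto3 Firehose client's `put_record(DeliveryStreamName=..., Record={"Data": ...})`; the client is injectable so the function can be exercised without AWS credentials.

```python
import json

# Hypothetical delivery stream name; not given in the slides.
STREAM_NAME = "cookie-predictions"


def predict(cookie):
    """Placeholder for the real gender/age model, which the slides do not show."""
    return {
        "cookie_id": cookie["cookie_id"],
        "gender": "unknown",
        "age_range": "unknown",
    }


def send_result(firehose, result, stream_name=STREAM_NAME):
    """Send one newline-delimited JSON record to a Firehose delivery stream."""
    data = (json.dumps(result, sort_keys=True) + "\n").encode("utf-8")
    return firehose.put_record(
        DeliveryStreamName=stream_name, Record={"Data": data}
    )


def handler(event, context, firehose=None):
    """Lambda entry point: score one cookie and hand the result to Firehose."""
    if firehose is None:
        import boto3  # imported lazily so the module loads without boto3

        firehose = boto3.client("firehose")
    result = predict(event)
    send_result(firehose, result)
    return result
```

Ending each record with a newline keeps the objects that Firehose writes to S3 line-delimited, which makes them easy to query afterwards.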

Mondadori Big Data Serverless architecture

Daily cost of the architecture
• S3 compressed storage (5.2 GB): 0.12 USD
• Athena (10.4 GB of compressed data scanned): 0.05 USD
• 10M Lambda invocations (500 ms per cookie): 41.7 USD
• Kinesis Firehose (50 GB uncompressed): 1.55 USD
• Total price: 55 USD per day for 1 hour of computation
• What happens if we double the number of visitors from
200M to 400M?
• It scales linearly: double the cost, same computation time
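
The itemized figures above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the Athena price (5 USD per TB scanned) and the volumes come from the slides, while the Lambda rate of 0.000000834 USD per 100 ms is an assumption (the 512 MB tier at the time), chosen because it reproduces the 41.7 USD line; the slides do not state the memory size. A decimal terabyte (1000 GB) is also assumed. The function covers only the four itemized lines, not the 55 USD total quoted on the slide.

```python
ATHENA_USD_PER_TB = 5.0  # from the Athena slide
# Assumption: 512 MB Lambda tier; the slides do not state the memory size.
LAMBDA_USD_PER_100MS = 0.000000834


def athena_cost(gb_scanned):
    """Cost of scanning gb_scanned GB, assuming 1 TB = 1000 GB."""
    return gb_scanned / 1000 * ATHENA_USD_PER_TB


def lambda_cost(invocations, ms_each):
    """Compute-time cost only; per-request charges are not included."""
    return invocations * (ms_each / 100) * LAMBDA_USD_PER_100MS


def itemized_daily_cost(scale=1.0):
    """Itemized daily costs in USD, scaled linearly with the visitor count."""
    items = {
        "s3_storage": 0.12 * scale,              # figure taken from the slide
        "athena": athena_cost(10.4 * scale),     # 10.4 GB scanned per day
        "lambda": lambda_cost(10_000_000 * scale, 500),
        "firehose": 1.55 * scale,                # figure taken from the slide
    }
    return items, sum(items.values())
```

Calling `itemized_daily_cost(scale=2.0)` models the 400M-visitor case: every line doubles, and because the Lambdas run in parallel the computation time is expected to stay the same.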

AWS Batch: future improvement
• Fully managed batch primitives
• Focus on your applications (shell scripts,
Linux executables, Docker images) and their
resource requirements
• No code size limits if we move to Docker
• Support for Spot Instances
• No additional charge: you pay only for the underlying
resources that you consume
• Integrated monitoring and logging
• Job retries
• Support for Lambda functions coming soon

Amazon Machine Learning: future improvement
• AWS cloud-based service for predictive
analytics
• Use tools and wizards to create machine
learning models
• Use simple APIs to obtain predictions for your
application
• No need to write custom code or have
supporting infrastructure
• Use models to process new data and
generate predictions

Questions?
https://www.linkedin.com/in/claudiopontili/
