PUBLIC SECTOR SUMMIT BOGOTA
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building Data Lakes & Analytics on AWS
Mauro Assis
Researcher
Earth System Science Center
INPE
Angelo Carvalho
Specialist Solutions Architect - Analytics
Public Sector for Latin America, Canada and Caribbean
AWS
Most Important: Driving Value from Data
Organizations that successfully generate business value from their data will outperform their peers. An Aberdeen survey showed that organizations that implemented a data lake outperform similar companies by 9% in organic revenue growth.*
Organic revenue growth: Leaders 24%, Followers 15%
*Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence
Traditionally, Analytics Used to Look Like This
OLTP ERP CRM LOB
Data warehouse
Business intelligence
• Relational data
• TBs–PBs scale
• Schema defined prior to data load
• Operational reporting and ad hoc
• Large initial CAPEX + $10K–$50K/TB/year
Data Lakes Extend the Traditional Approach
Data warehouse
Business intelligence
OLTP ERP CRM LOB
• Relational and non-relational data
• TBs–EBs scale
• Diverse analytical engines
• Low-cost storage & analytics
Devices Web Sensors Social
Data lake
Big data processing,
real-time, machine learning
Data Lakes from AWS
• Unmatched durability and availability at EB scale
• Best security, compliance, and audit capabilities
• Object-level controls for fine-grained access
• Fastest performance by retrieving subsets of data
• The greatest variety of ways to bring data in
• 2x as many integrations with partners
• Analyze with broadest set of analytics & machine learning (ML) services
Data Lake on AWS: Analytics | Machine learning | Real-time data movement | On-premises data movement
Data Lakes, Analytics, and IoT Portfolio from AWS
Broadest, deepest set of analytic services
Machine learning: Managed ML Service, Deep Learning AMIs, Video and Image Recognition, Conversational Interfaces, Deep-Learning Video Camera, Natural Language Processing, Language Translation, Speech Recognition, Text-to-Speech
Analytics: Interactive Analysis, Hadoop & Spark, Data Warehousing, Full-text search, Real-time analytics, Dashboards & Visualizations
On-premises data movement: Dedicated Network connection, Secure appliances, Ruggedized Shipping Container, Database migration
Real-time data movement: Connect Devices to AWS, Real-time Data Streams, Real-time Video Streams
Data Lake on AWS: Storage | Archival Storage | Data Catalog
Data Lakes, Analytics, and IoT Portfolio from AWS
Broadest, deepest set of analytic services
Machine learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly
Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight
On-premises data movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service
Real-time data movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams
Data Lake on AWS: Storage | Archival Storage | Data Catalog
AWS Glue
Data Catalog (Discover)
• Apache Hive Metastore compatible
• Integrated with AWS services
• Automatic crawling
Job Authoring (Develop)
• Auto-generates ETL code
• Python and Apache Spark
• Edit, debug, and share
Job Execution (Deploy)
• Serverless execution
• Flexible scheduling
• Monitoring and alerting
Data Lake on Amazon S3 with AWS Glue
On-premises data
Web app data
Amazon RDS
Other databases
Streaming data
Your data
Amazon QuickSight
Other Ways of Populating the Catalog
• Call the AWS Glue CreateTable API
• Create table manually
• Run Hive DDL statement
• Import from Apache Hive Metastore (Apache Hive Metastore, via AWS Glue ETL, into the AWS Glue Data Catalog)
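As a rough illustration of the "Run Hive DDL statement" route, the sketch below assembles a CREATE EXTERNAL TABLE statement as a string. The table name, columns, storage format, and S3 location are hypothetical and not taken from the talk; the statement would still have to be run by a Hive-compatible engine to register the table.

```python
def build_create_table_ddl(table, columns, location):
    """Assemble a Hive-style CREATE EXTERNAL TABLE statement as a string."""
    cols = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table definition, for illustration only.
ddl = build_create_table_ddl(
    table="sensor_readings",
    columns=[("device_id", "string"), ("reading", "double")],
    location="s3://example-bucket/sensor_readings/",
)
print(ddl)
```

Keeping the DDL in a small builder like this makes it easy to review the schema before it reaches the catalog.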
How Do I Drive Value?
Amazon Athena
Amazon Athena is an interactive query service
that makes it easy to analyze data in Amazon
S3 using standard SQL.
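A minimal sketch of what submitting such a query could look like from Python. It only builds the request parameters for the Athena StartQueryExecution call; the SQL text, database name, and results bucket are hypothetical placeholders.

```python
def athena_query_request(sql, database, output_location):
    """Build keyword arguments for the Athena StartQueryExecution API call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical query, database, and results location.
params = athena_query_request(
    sql=(
        "SELECT device_id, avg(reading) AS avg_reading "
        "FROM sensor_readings GROUP BY device_id"
    ),
    database="example_db",
    output_location="s3://example-bucket/athena-results/",
)

# With boto3 installed and AWS credentials configured, the request could be
# submitted as: boto3.client("athena").start_query_execution(**params)
print(params["QueryString"])
```

The call is asynchronous: it returns a query execution ID, and results are written to the configured S3 output location.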
Familiar Technologies Under the Covers
Presto: used for SQL queries
• In-memory distributed query engine
• ANSI-SQL compatible with extensions
(e.g. SELECT * FROM tableName)
Apache Hive: used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning
(e.g. CREATE TABLE, ALTER TABLE, MSCK REPAIR TABLE)
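Data partitioning in this setting usually means laying objects out under Hive-style `key=value` prefixes in S3, which MSCK REPAIR TABLE can then discover. The sketch below builds such a prefix; the bucket, table name, and the year/month/day partition scheme are assumptions chosen for illustration.

```python
from datetime import date


def partition_prefix(base, day):
    """Return a Hive-style partition prefix (key=value path segments)."""
    return (
        f"{base.rstrip('/')}/"
        f"year={day.year}/month={day.month:02d}/day={day.day:02d}/"
    )


# Hypothetical bucket and table; objects written under this prefix belong
# to the partition (year=2019, month=06, day=05).
prefix = partition_prefix("s3://example-bucket/logs", date(2019, 6, 5))
print(prefix)

# Statement that asks the engine to scan the table location for new partitions.
repair_statement = "MSCK REPAIR TABLE logs"
```

Queries that filter on the partition columns can then skip every prefix that does not match, which reduces the amount of data scanned.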
Exploring data with Amazon Athena
On-premises Data
Web app data
Amazon RDS
Other Databases
Streaming data
Amazon QuickSight
Amazon SageMaker
Hadoop/Spark Analytics
• Distributed processing
• Diverse analytics
  • Batch/Script (Hive/Pig)
  • Interactive (Spark, Presto)
  • Real-time (Spark)
  • Machine Learning (Spark)
  • NoSQL (HBase)
• For many use cases
  • Log and clickstream analysis
  • Machine learning
  • Real-time analytics
  • Large-scale analytics
  • Genomics
  • ETL
YARN (Hadoop Resource Manager): Batch | Script | Interactive | Real-time | Machine learning | NoSQL
Data Lake on AWS
Hadoop/Spark Analytics on AWS
Amazon EMR (Managed Hadoop/Spark)
YARN (Hadoop Resource Manager): Batch | Script | Interactive | Real-time | Machine learning | NoSQL
Amazon S3 (Object Storage)
Data Lake on AWS
EMR – Enterprise-grade Hadoop & Spark
Deploy latest releases in Hadoop and Spark ecosystems
• Nineteen open-source projects: Apache Hadoop, Spark, HBase, Presto, and more
• Updated with the latest open source frameworks within 30 days of release
EMR releases: emr-4.0.0 (July 2015), emr-4.7.0 (June 2016), emr-5.3.0 (January 2017), emr-5.11.0 (December 2017)
Applications: Hadoop, Ganglia, HBase, Hive & Catalog, Hue, Mahout, Oozie, Phoenix, Pig, Presto, Spark, Tez, Zeppelin, ZooKeeper, Flink, Livy, MXNet, Sqoop
[Table: application versions by EMR release]
Amazon S3 – Source of Truth, Multiple Clusters
Amazon S3 (source of truth)
Amazon EMR, interactive Spark cluster: HDFS, EC2 instance memory; intermediates stored on local disk or HDFS
Amazon EMR, transient ETL job: HDFS, EC2 instance memory; intermediates stored on local disk or HDFS
External Metadata Management
Amazon S3 (source of truth)
Amazon EMR, interactive Spark cluster: HDFS
Amazon EMR, transient ETL job: HDFS
Metadata describes data in S3
Customers have options: MySQL DB instance or Glue Data Catalog
Reprocess data with Amazon EMR (Spark)
On-premises data
Web app data
Amazon RDS
Other databases
Streaming data
Amazon QuickSight
Amazon SageMaker
Machine Learning on Your Data Lake
ML in the Hands of Every Developer
Application Services
• Vision: Amazon Rekognition Image, Amazon Rekognition Video
• Speech: Amazon Polly, Amazon Transcribe
• Language: Amazon Lex, Amazon Translate, Amazon Comprehend
Platform Services
• Amazon SageMaker, AWS DeepLens, Amazon Machine Learning, Spark & EMR, Mechanical Turk
Frameworks & Infrastructure
• AWS Deep Learning AMI
• GPU (P3 Instances), CPU, Mobile, IoT (Greengrass)
• Apache MXNet, PyTorch, Cognitive Toolkit, Keras, Caffe2 & Caffe, TensorFlow, Gluon
Amazon SageMaker
1. Notebook Instances
2. Algorithms
3. ML Training Service
4. ML Hosting Service
Machine Learning with Amazon SageMaker
On-premises data
Web app data
Amazon RDS
Other databases
Streaming data
Amazon QuickSight
Amazon SageMaker
Mapping Amazon Biomass using Amazon Analytics
www.ccst.inpe.br
Mauro Assis
assismauro@hotmail.com
EARTH SYSTEM SCIENCE CENTER
STRATEGIC GOALS
Development and improvement of Earth system models, monitoring networks, and socio-political analyses, aimed at constructing and analyzing scenarios of environmental change and climate projections.
The question:
How much does the Amazon forest weigh?
The previous question:
Why map Amazon forest biomass?
Biomass map process
• 68 million pixels (250 m x 250 m)
• 4 million km² area
• Data from ~1000 LiDAR flights
• Each flight: 6.5 billion data records
• 10 bands of satellite data for each pixel
• 4 to 6 hours per map generation
• 16 CPUs / 32 GB RAM / 21 TB disk
• Random Forest algorithm
• Python H2O
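A quick consistency check on these figures, as a sketch: 68 million pixels of 250 m x 250 m cover about 4.25 million km², in line with the rounded 4 million km² on the slide. The variable names are illustrative; only the pixel count, pixel size, and band count come from the slide.

```python
PIXEL_SIDE_M = 250        # pixel edge length in metres (from the slide)
N_PIXELS = 68_000_000     # number of pixels in the map (from the slide)
N_BANDS = 10              # satellite bands per pixel (from the slide)

pixel_area_km2 = (PIXEL_SIDE_M / 1000) ** 2     # 0.0625 km² per pixel
total_area_km2 = N_PIXELS * pixel_area_km2      # area covered by the map
satellite_values = N_PIXELS * N_BANDS           # satellite values per map

print(f"{total_area_km2:,.0f} km²")
print(f"{satellite_values:,} satellite values per map")
```

The second figure gives a sense of the model input size: 680 million satellite values feed each map.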
LiDAR
Uncertainty map
• Propagate error from field to random forest extrapolation
• 1000 biomass values normally distributed for each pixel
• A thousand maps to generate…
• … How???
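The idea of drawing 1000 normally distributed biomass values per pixel can be sketched with the standard library alone. This is an illustration under stated assumptions, not the authors' H2O pipeline: the pixel names, means, and standard deviations are hypothetical, and summarising uncertainty as the per-pixel standard deviation across maps is one possible choice that the slides do not confirm.

```python
import random
import statistics

N_MAPS = 1000


def draw_biomass(mean, sd, rng, n=N_MAPS):
    """Draw n normally distributed biomass values for one pixel."""
    return [rng.gauss(mean, sd) for _ in range(n)]


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical (mean, standard deviation) per pixel; illustrative numbers only.
pixels = {"pixel_a": (250.0, 30.0), "pixel_b": (120.0, 15.0)}

draws = {name: draw_biomass(mean, sd, rng) for name, (mean, sd) in pixels.items()}

# Map k is made of the k-th draw of every pixel, giving N_MAPS maps in total.
maps = [{name: values[k] for name, values in draws.items()} for k in range(N_MAPS)]

# One possible per-pixel uncertainty summary: the spread across the maps.
uncertainty = {name: statistics.stdev(values) for name, values in draws.items()}
print(len(maps), uncertainty)
```

With 68 million pixels instead of two, each of the 1000 maps becomes a full model run, which is why the workload outgrew a single machine.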
The answer: Analytics processing on AWS
• AWS engaged a partner (DataRain)
• Two PoCs
• Four Linux EC2 instances, 64 cores / 256 GB each
• Anaconda/H2O Python environment
• Script with lots of parallel processing
• Divided the Amazon area into 16 segments
• Two operators ran everything in 40 hours
• We downloaded the 1000 maps and summarized them at INPE
• It took about 2 days to generate the final map
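A sketch of how 16 segments might be spread over four instances and processed concurrently. The round-robin assignment and the `process_segment` placeholder are assumptions for illustration; the slides do not describe the actual script. The last lines compare the 40-hour run with a sequential estimate based on the 4 to 6 hours per map quoted earlier in the deck.

```python
from concurrent.futures import ThreadPoolExecutor

N_SEGMENTS = 16
N_INSTANCES = 4
N_MAPS = 1000


def process_segment(segment_id):
    """Placeholder for the real per-segment work, which the slides do not show."""
    return segment_id, f"segment-{segment_id:02d} done"


# Round-robin assignment of segments to instances (an assumed scheme).
assignment = {
    instance: [s for s in range(N_SEGMENTS) if s % N_INSTANCES == instance]
    for instance in range(N_INSTANCES)
}

# Threads keep the sketch simple; CPU-bound work would normally use processes.
with ThreadPoolExecutor(max_workers=N_INSTANCES) as pool:
    results = dict(pool.map(process_segment, range(N_SEGMENTS)))

# Sequential estimate at 4 to 6 hours per map versus the 40 hours reported.
sequential_hours = (N_MAPS * 4, N_MAPS * 6)
speedup = tuple(hours / 40 for hours in sequential_hours)
print(assignment, speedup)
```

Under these assumptions the cloud run is roughly 100 to 150 times faster than generating the maps one after another on the original machine.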
Architecture
INPE: researcher desktops
Internet
AWS Cloud: Amazon S3; four Amazon EC2 instances, each with Amazon EBS
Main benefits
• Uncertainty map itself
• DataRain (AWS partner) support
• First use of cloud services at INPE
• Map obtained before the end of the project
• ROI
Return on investment
• LiDAR flights cost $2,000 each
• 1000 flights => $2M
• To update the model: 100 to 150 flights
• 150 flights => $300k
• Cost of map generation: $10,000
• Money saved in the next map update:
$2M – $300k – $10k = $1.69M
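The arithmetic above can be reproduced in a few lines. The figures are the ones on the slide; the variable names are illustrative, and the 100-flight case is added only to show the other end of the quoted range.

```python
FLIGHT_COST = 2_000            # USD per LiDAR flight
MAP_GENERATION_COST = 10_000   # USD per map generation

full_campaign = 1_000 * FLIGHT_COST     # cost of flying everything again
update_high = 150 * FLIGHT_COST         # update using 150 flights
update_low = 100 * FLIGHT_COST          # update using 100 flights

savings_min = full_campaign - update_high - MAP_GENERATION_COST
savings_max = full_campaign - update_low - MAP_GENERATION_COST

print(f"${savings_min:,} to ${savings_max:,}")
```

The slide reports the conservative end of that range, based on 150 update flights.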
Agility and Innovation Are Key
Thank you!
Mauro Assis
assismauro@hotmail.com
Angelo Carvalho
carvaa@amazon.com