DAL
AUG 9, 2017
#datapopup
Building Serverless Data
Pipelines in the Cloud
Manisha Sule
Director of Big Data Analytics, Linux Academy.
Board Member on SMU’s Big Data Advisory Board.
linkedin.com/in/manisha-sule
@tweetDataS
Agenda
1. What is serverless?
2. Big Data architectures and best practices
3. AWS serverless services:
• Lambda
• Kinesis (Streams, Firehose, Analytics)
• DynamoDB
• S3
• Athena
4. Analytics for CloudAssessments.com
What is serverless?
Source: https://www.slideshare.net/CodeOps/serverless-architecture-a-gentle-overview?qid=aecf8d27-8b16-4da5-987f-600fe1cb0655&v=&b=&from_search=5
Serverless architectures
• Depend on third-party services, known as Backend as a Service (BaaS).
• Are distributed systems that react to events and triggers.
• Scale dynamically, based on demand.
• Use ephemeral (short-lived) containers or compute resources in the cloud.
Advantages of serverless
• Fully managed: the cloud provider manages the servers.
• Highly available and scalable, with no provisioning needed and zero administration.
• Not just compute containers: also includes NoSQL databases, interactive query services, storage services, and messaging services.
• Cost efficient: you never pay for idle time.
• Support for continuous integration / continuous delivery pipelines.
• Developers can focus on architecture and code only.
• Gartner calls this function Platform as a Service (fPaaS) and lists several use cases: utility logic, scheduled processing, event-driven architecture, microservices, and full-blown applications.
AWS Serverless Application Model
Template-based mechanism for defining and deploying serverless applications.
Source : AWS Tech Talk Webinar
Big Data Lambda architecture
Requirements of Big Data architectures:
1. Processing real time streams.
2. Processing batch data.
3. Real time ETL.
4. Enrich real time data with batch data.
5. Queries must be answerable using
batch data and real time data.
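Requirement 5 can be sketched in a few lines. This is a minimal illustration, assuming a precomputed batch view and a real-time view that both hold per-key counts; the names batch_view, realtime_view, and query, and the sample numbers, are hypothetical and not from the talk.

```python
from collections import Counter


def query(batch_view: Counter, realtime_view: Counter, key: str) -> int:
    """Answer a count query by combining the batch and real-time views."""
    # The batch view covers everything up to the last batch run;
    # the real-time view covers only events that arrived since then.
    return batch_view[key] + realtime_view[key]


# Hypothetical data: assessments completed per course.
batch_view = Counter({"aws-csa": 1200, "linux-essentials": 800})
realtime_view = Counter({"aws-csa": 7})

total = query(batch_view, realtime_view, "aws-csa")
print(total)
```

The point of the design is that neither view alone answers the query: the batch view is complete but stale, and the real-time view is fresh but partial.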
Big Data best practices
1. Build a decoupled architecture: decouple the data -> store -> process -> store steps.
2. Use the right tools for the job, chosen by latency, throughput, access patterns, and data structures.
3. Be cost effective: big data, not big cost.
AWS Managed vs Serverless services
Managed services: you need to manage servers, their scale, their location, software updates, etc.
• Elastic MapReduce (EMR): managed Hadoop framework; includes Apache Spark, Zeppelin, HBase, Flink, etc.
• Elasticsearch: for log analytics, full-text search, application monitoring, and more. Fully integrated with Kibana and Logstash.
• Redshift: fully managed data warehouse, to analyze data and integrate with BI tools.
• RDS: database service to set up, operate, and scale a database in the cloud.
Serverless services: automatically available in all Availability Zones in the region, set at a regional level in the AWS infrastructure. Highly available (HA) and fault tolerant.
• Lambda
• Kinesis
• S3
• DynamoDB
• Athena
• API Gateway
• CloudWatch
• QuickSight
• IoT
• Cognito
• SQS
AWS Lambda
• Heart of serverless architecture patterns.
• Stateless, event-driven code. Supports Node.js, Python, Java, and C#.
• No infrastructure to manage.
• No risk of over-provisioning or under-provisioning; you don’t pay for idle time.
• Logging and operational monitoring are built in.
• Efficient performance at scale: if a thousand requests come in, it scales automatically.
• Lets you skip the boring and hard parts: easy to author and deploy, so you can focus on business logic.
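As an illustration of "stateless, event-driven code", here is a minimal sketch of a Python handler for an S3 object-created notification. The handler(event, context) signature and the Records/s3/bucket/object layout follow the S3 event format; the bucket name, object key, and return shape are made up for the example.

```python
from urllib.parse import unquote_plus


def handler(event, context):
    """Stateless, event-driven function: list each object named in an S3 event."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append(f"s3://{bucket}/{key}")
    # Nothing is kept between invocations: all state lives in the event or in storage.
    return {"processed": len(objects), "objects": objects}


# Local invocation with a hand-written sample event (hypothetical bucket and key).
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "logs/2017/08/09/click+stream.json"}}}
    ]
}
result = handler(sample_event, None)
print(result)
```

Because the function holds no state of its own, the platform can run as many copies in parallel as the event rate requires.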
AWS Kinesis Streams
What is it? A high-throughput, low-latency service for real-time data processing over large, distributed data streams. It stores streaming data for 24 hours by default, during which the data can be read, processed, and stored in real time.
How to use it? Configure producer data sources to emit data into the stream. Build consuming applications that read and process data from that stream in real time.
Applications: Real-time metrics and reporting: extracting metrics and generating KPIs to power reports and dashboards at real-time speeds. Used for streaming data that needs custom processing.
Why use it? Amazon Kinesis Streams has simple pay-as-you-go pricing, with no up-front costs or minimum fees; you pay only for the resources you consume. It guarantees durability and availability of data, and maintains the order of records within each shard.
Source: https://www.slideshare.net/frodriguezolivera/aws-kinesis-streams
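Ordering in Kinesis Streams holds per shard, and records that share a partition key land on the same shard. The sketch below shows how a partition key can be mapped to a shard: Kinesis hashes the key with MD5 into a 128-bit integer, and each shard owns a range of that hash space. The function name shard_for and the assumption that shards split the space into equal, contiguous ranges are illustrative only; real streams can have uneven ranges after resharding.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 digests are 128-bit integers


def shard_for(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash range contains the key."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    # Assumption: shards split the hash space into equal, contiguous ranges.
    return hash_key * shard_count // HASH_SPACE


# Hypothetical partition keys: one per student.
keys = ["student-1", "student-2", "student-1"]
shards = [shard_for(k, shard_count=4) for k in keys]
print(shards)
```

Choosing a partition key with many distinct values (such as a user ID) spreads load across shards while keeping each user's events in order.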
AWS Kinesis vs Kafka
Both are data ingestion frameworks for streaming data that offer durability, reliability, and scalability.
Differences:
1. Kafka is open source. The user is responsible for installing and managing clusters.
2. Kinesis is a managed AWS service, which saves the cost and effort of managing servers.
3. Kafka’s costs include DevOps engineers plus storage and compute servers.
4. Because Kinesis is serverless, resource and staffing costs are much lower.
AWS Kinesis Firehose
What is it? A fully managed service that offers an easy-to-use way to collect and deliver streaming data to Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service.
How to use it? Configure and use. No code needed.
Applications: Load streaming data into S3, Redshift, or Elasticsearch, which can connect to BI tools for real-time analysis. Unlike Kinesis Streams, Firehose is used when data does not need custom processing.
Why use it? It scales seamlessly to match data throughput without intervention.
AWS Kinesis Analytics
What is it? A fully managed service to process streaming data with SQL.
How to use it? Configure an input stream, write queries, and configure an output stream.
Applications: Continuous processing of streaming data.
Why use it? Pre-processing; basic analytics such as aggregates and filtering; advanced analytics such as anomaly detection, alerting, and triggering.
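Kinesis Analytics expresses aggregates in SQL. To show the idea behind a windowed aggregate without depending on the service, here is a plain-Python sketch of a tumbling-window count. It is not the Kinesis Analytics API; the function name tumbling_counts, the 60-second window, and the sample events are assumptions for illustration.

```python
from collections import defaultdict


def tumbling_counts(events, window_seconds=60):
    """Count events per (window start, key) over fixed, non-overlapping windows."""
    counts = defaultdict(int)
    for timestamp, key in events:
        # Every timestamp belongs to exactly one window.
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical (epoch seconds, event type) pairs.
events = [(0, "login"), (15, "login"), (59, "quiz"), (60, "login"), (125, "quiz")]
result = tumbling_counts(events)
print(result)
```

A continuous query does the same grouping, but emits each window's counts as soon as the window closes instead of waiting for the whole input.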
AWS Kinesis: serverless stream processing
Kinesis Streams: With Lambda, allows stateless processing of data. Ingests from multiple producers and delivers to multiple destinations. Scale must be managed using shards.
Kinesis Firehose: Transforms streaming data with Lambda and guarantees delivery to S3, Redshift, or Elasticsearch.
Kinesis Analytics: Stateful processing of streaming data, such as aggregations over a time period.
When to use which approach?
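For the Firehose case, a Lambda transformation receives a batch of base64-encoded records and returns each one with the same recordId, a result status, and the re-encoded data. The sketch below follows that contract; the event_type field and the uppercase transformation are hypothetical stand-ins for real business logic.

```python
import base64
import json


def handler(event, context):
    """Firehose data-transformation function: normalize one field per record."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Hypothetical transformation: uppercase the event type.
        payload["event_type"] = payload.get("event_type", "unknown").upper()
        data = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8"))
        output.append({
            "recordId": record["recordId"],  # must echo the incoming id
            "result": "Ok",                  # or "Dropped" / "ProcessingFailed"
            "data": data.decode("utf-8"),
        })
    return {"records": output}


# Local invocation with a hand-built sample record.
raw = json.dumps({"event_type": "login"}).encode("utf-8")
sample_event = {"records": [{"recordId": "1", "data": base64.b64encode(raw).decode("utf-8")}]}
result = handler(sample_event, None)
decoded = json.loads(base64.b64decode(result["records"][0]["data"]))
print(decoded)
```

The function stays stateless: each record is transformed on its own, which is why aggregations over time belong in Kinesis Analytics instead.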
AWS DynamoDB
• Fully managed NoSQL database that supports both key-value and document store models.
• Other than the primary key, the table is schemaless.
• Supports up to 32 levels of nested attributes.
• An in-memory cache (DynamoDB Accelerator, DAX) can reduce response times to microseconds.
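The document model can be made concrete with DynamoDB's typed attribute-value format, in which every value is tagged with its type (S for string, N for number, M for map, L for list, and so on). The converter below is a hand-written sketch of that mapping, not an AWS library function; the name to_dynamodb and the sample item are hypothetical.

```python
def to_dynamodb(value):
    """Convert a Python value into DynamoDB's typed attribute-value form."""
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return {"BOOL": value}
    if value is None:
        return {"NULL": True}
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}  # numbers travel as strings
    if isinstance(value, list):
        return {"L": [to_dynamodb(v) for v in value]}
    if isinstance(value, dict):
        return {"M": {k: to_dynamodb(v) for k, v in value.items()}}
    raise TypeError(f"unsupported type: {type(value).__name__}")


# Hypothetical item: only student_id (the primary key) is fixed by the table.
item = {"student_id": "s-42", "score": 87,
        "profile": {"name": "Ada", "badges": ["aws-csa"]}}
typed_item = {name: to_dynamodb(value) for name, value in item.items()}
print(typed_item)
```

Two items in the same table can carry entirely different non-key attributes, which is what "schemaless" means in practice.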
AWS DynamoDB Stream processing
• Durability and high availability
• Managed streams
• Performant
• Native integration with Lambda.
Source: AWS Webinars
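With the Lambda integration, each invocation receives a batch of item-level change records from the table's stream. A minimal sketch of such a handler follows, assuming the standard Records/eventName layout of DynamoDB stream events; the tallying logic and the sample keys are illustrative only.

```python
from collections import Counter


def handler(event, context):
    """Tally item-level changes delivered by a DynamoDB stream."""
    changes = Counter()
    for record in event["Records"]:
        # eventName is INSERT, MODIFY, or REMOVE.
        changes[record["eventName"]] += 1
    return dict(changes)


# Local invocation with a hand-written sample batch (hypothetical keys).
sample_event = {
    "Records": [
        {"eventName": "INSERT", "dynamodb": {"Keys": {"student_id": {"S": "s-42"}}}},
        {"eventName": "MODIFY", "dynamodb": {"Keys": {"student_id": {"S": "s-42"}}}},
        {"eventName": "INSERT", "dynamodb": {"Keys": {"student_id": {"S": "s-43"}}}},
    ]
}
result = handler(sample_event, None)
print(result)
```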
AWS S3
Object storage that provides highly reliable, secure, and scalable storage for all your data, big or small. It is designed to deliver 99.999999999% durability and to scale past trillions of objects.
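To put eleven nines in perspective, the arithmetic below treats the durability figure as an annual per-object survival rate and computes the expected loss for a bucket of ten million objects. The bucket size is a hypothetical number chosen for the example, and Decimal is used so the subtraction is exact.

```python
from decimal import Decimal

durability = Decimal("0.99999999999")        # eleven nines, over a given year
annual_loss_rate = Decimal(1) - durability   # chance of losing one object in a year
objects_stored = 10_000_000                  # hypothetical bucket size

expected_losses_per_year = annual_loss_rate * objects_stored
years_per_lost_object = 1 / expected_losses_per_year
print(expected_losses_per_year, years_per_lost_object)
```

Under these assumptions the expected loss works out to one object roughly every ten thousand years.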
AWS Athena
• Launched at AWS re:Invent in November 2016.
• Interactive query service for analyzing data stored in S3 buckets.
• Serverless: no infrastructure setup needed.
• Pay only for the queries you run: $5 per terabyte scanned.
• Works with a variety of standard data formats, including CSV, JSON, ORC, and Parquet.
• Uses Presto with full SQL support.
• Ideal for quick ad hoc querying as well as complex analysis.
• Powers real-time dashboards.
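Because pricing is per byte scanned, the cost of a query can be estimated directly from the amount of data it reads. The sketch below uses the $5 per terabyte figure from the slide; treating a terabyte as 10^12 bytes and the two scan sizes are assumptions for illustration.

```python
PRICE_PER_TB_USD = 5.0
BYTES_PER_TB = 10 ** 12  # assumption: decimal terabytes


def query_cost_usd(bytes_scanned: int) -> float:
    """Estimate the cost of one query from the bytes it scans."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Hypothetical sizes: a full scan of 2 TB of CSV versus a query that
# reads only 200 GB because the data is stored in a columnar format.
full_scan = query_cost_usd(2 * 10 ** 12)
pruned_scan = query_cost_usd(200 * 10 ** 9)
print(full_scan, pruned_scan)
```

Columnar formats such as ORC and Parquet let a query read only the columns it needs, so storing data in them tends to lower the bytes scanned and therefore the bill.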
Linux Academy launches Cloud Assessments
(https://www.cloudassessments.com/)
1. Assess: Enroll in Quests (example: AWS CSA) and take assessments that test real-world AWS skills on live cloud environments.
2. Learn: Lean learning. Based on your performance, you are presented with a tailor-made learning path.
3. Earn: Earn proven skills and the ability to pass certification exams; earn badges and micro-certifications.
Linux Academy and AWS Partnership
Give nonprofit teams and individuals unlimited access to our entire library of cloud certification training content, to help them build cloud skills at all levels:
• More than 2,500 self-paced video courses
• 209 total hours of AWS course training
• 438 Linux training hours
• 105 OpenStack training hours
• More than 60 hands-on, scenario-based labs for AWS skill building
• Live AWS lab servers for practicing newly-acquired skills
• Quizzes, study guides, flash cards, study groups, and practice exams
Analytics for CloudAssessments.com
(https://www.cloudassessments.com/)
1. Descriptive Analytics: Dashboards with charts and graphs
• Historical views
• Real time views
2. Anomaly Detection: detect abuse of the system and operational inefficiencies
3. Recommendation Engine: to provide custom tailor-made learning paths
4. Predictive analytics: Predict student performance
5. Chat bots: Virtual assistants for learning guidance.
Real time processing using Kinesis streams and
Kinesis Analytics
Big Data architecture using AWS serverless
Thank you!

Building Data Analytics pipelines in the cloud using serverless technology
