Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP
Who am I?
Mariano Gonzalez
Data/Platform Architect at Otus
Mariano is an engineer with more than 15 years of experience with the JVM. He enjoys working with
and exploring a variety of big data technologies. He is an avid open-source contributor.
Most importantly, I am just a person trying to learn about and share big
data technologies and approaches.
Agenda
● Goal for this session
● Overview of GCP services
● Apache Beam and GCP Dataflow
● Natural Language Processing for sentiment analysis
● Demo ETL/Analytics
● Q&A
Goal for this Session
Find an elegant way to build and deploy data/analytic
pipelines that:
● Support multiple workloads
● Scale compute and storage independently
● Are backed by managed services
● Are cost-effective
Common Architecture Analytics Pipeline
[Diagram labels: Different Types and Formats of Data, Data Storage, Analytic/Data Pipelines, User]
Overview of GCP services - App Engine
● Good alternative if K8s infrastructure is not in place
● Easy deployment
○ Similar to AWS SAM from a CLI perspective
○ Similar to AWS Elastic Beanstalk from a deployment perspective
● Well integrated with other cloud services
○ GCP Docker registry
● Multiple Runtimes
○ Custom (Docker)
○ JVM/Node/Python
Overview of GCP services - Storage
● Hot - durable, highly available, high-performance object storage for frequently accessed data
○ Amazon S3 Standard
○ Microsoft Azure Hot Blob Storage
○ Google Cloud Storage Standard
● Cool - storage class for data that is accessed less frequently, but requires rapid access when needed
○ Amazon S3 Standard-IA and S3 One Zone-IA
○ Microsoft Azure Cool Blob Storage
○ Google Cloud Storage Nearline
● Cold - secure, durable, and low-cost storage service for data archiving
○ Amazon S3 Glacier
○ Microsoft Azure Blob Archive Storage
○ Google Cloud Storage Coldline
Overview of GCP services - Pubsub
Why not just use Kafka?
● Fully managed services
○ Both systems are available as fully managed services in the cloud
● Cloud vs On-prem
○ Pubsub is only offered as part of the GCP ecosystem, whereas Apache Kafka
can be used both as a cloud service and as an on-prem service
● Message duplication
○ Kafka tracks consumed messages with offsets (managed via ZooKeeper in older versions, stored in an internal Kafka topic in newer ones)
○ Pubsub tracks consumed messages through per-message acknowledgments
Overview of GCP services - Pubsub
Why not just use Kafka?
● Retention policy
○ Both Kafka and Pubsub have options to configure the maximum retention
time
● Consumer Groups vs Subscriptions
○ Pubsub uses subscriptions: you create a subscription and then start
reading messages from that subscription
○ Kafka uses the concepts of "consumer group" and "partition"
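The difference between offset tracking and per-message acknowledgment can be sketched in plain Python. This is a toy in-memory model, not the Kafka or Pub/Sub client API; the class and method names (`OffsetLog`, `AckSubscription`, `poll`, `commit`, `pull`, `ack`) are hypothetical and chosen only for illustration.

```python
from collections import OrderedDict


class OffsetLog:
    """Kafka-style: the log is kept as-is and each consumer group tracks one offset."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.committed = {}  # consumer group -> next offset to read

    def poll(self, group, max_messages=2):
        start = self.committed.get(group, 0)
        return list(enumerate(self.messages[start:start + max_messages], start))

    def commit(self, group, offset):
        # Committing offset N marks every message up to and including N as done.
        self.committed[group] = offset + 1


class AckSubscription:
    """Pub/Sub-style: every message is acknowledged individually."""

    def __init__(self, messages):
        self.pending = OrderedDict(enumerate(messages))  # ack id -> message

    def pull(self, max_messages=2):
        return list(self.pending.items())[:max_messages]

    def ack(self, ack_id):
        # Only acknowledged messages are removed; the rest are delivered again.
        self.pending.pop(ack_id, None)


log = OffsetLog(["a", "b", "c"])
batch = log.poll("analytics")
log.commit("analytics", batch[-1][0])  # commit the last offset of the batch
kafka_next = log.poll("analytics")

sub = AckSubscription(["a", "b", "c"])
first = sub.pull()
sub.ack(first[1][0])                   # acknowledge only the second message
pubsub_next = sub.pull()

print(kafka_next)
print(pubsub_next)
```

In the offset model a single commit covers the whole batch, while in the acknowledgment model the unacknowledged message comes back on the next pull. Either way, a consumer that fails before committing or acknowledging will see the message again, which is where duplicates come from.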
Overview of GCP services - BigQuery
● Query engines are probably one of the most competitive service categories today:
○ Snowflake
○ Presto
○ Redshift
● How are these warehouses different?
Overview of GCP services - BigQuery
● Presto
○ Self-hosted, open-source solution
● Pre-RA3 Redshift
○ More managed than Presto, but still requires the user to configure individual
compute clusters with a fixed amount of memory, compute and storage
● Redshift RA3
○ Closer to the user experience of Snowflake by separating compute from storage
● Snowflake
○ The user only configures the size and number of compute clusters
○ Every compute cluster sees the same data
○ Compute clusters can be created and removed in seconds
Overview of GCP services - BigQuery
BigQuery
● Flat-rate is similar to Snowflake, except there is no concept of a compute cluster, just a configurable number
of "compute slots"
● On-demand is a pure serverless model, where the user submits queries one at a time and pays per query
● On-demand mode can be much more expensive, or much cheaper, than flat-rate, depending on the nature of your
workload
A "steady" workload that utilizes your compute capacity 24/7 will be much cheaper in flat-rate mode. A
"spiky" workload that contains periodic large queries separated by long periods of idleness or lower utilization
will be much cheaper in on-demand mode.
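The flat-rate versus on-demand trade-off comes down to simple arithmetic. The sketch below compares the two modes for a given monthly scan volume. The prices (`price_per_tb`, `flat_rate_monthly`) are placeholder assumptions for illustration only, not current BigQuery list prices, and the function names are hypothetical.

```python
def on_demand_cost(tb_scanned_per_month, price_per_tb=5.0):
    """Pay-per-query cost, billed by data scanned (price is an assumed placeholder)."""
    return tb_scanned_per_month * price_per_tb


def cheaper_mode(tb_scanned_per_month, flat_rate_monthly=10_000.0, price_per_tb=5.0):
    """Return the cheaper pricing mode for a given monthly scan volume."""
    cost = on_demand_cost(tb_scanned_per_month, price_per_tb)
    return "on-demand" if cost < flat_rate_monthly else "flat-rate"


# Break-even scan volume under the assumed prices.
break_even_tb = 10_000.0 / 5.0

spiky = cheaper_mode(50)      # a few large queries, mostly idle
steady = cheaper_mode(5000)   # capacity used around the clock

print(break_even_tb, spiky, steady)
```

With these assumed numbers, anything below the break-even volume favors on-demand and anything above it favors flat-rate; plugging in real prices and your own scan volume gives the actual crossover point.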
Overview of GCP services - Dataflow
What is Google Cloud Dataflow?
● Data processing service for both:
○ batch
○ real-time data streaming applications
● Benefits
○ Enables developers to set up analytic pipelines quickly, since the service is fully managed
● Nextgen MapReduce
○ Designed to bring to entire analytics pipelines the style of fast parallel execution that MapReduce
brought to a single type of computation in batch processing jobs
○ It is based partly on MillWheel and Flume (two Google-developed technologies for data ingestion and
low-latency processing).
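The idea of one pipeline definition serving both batch and streaming input can be sketched without any SDK. The following is plain Python, not the Apache Beam API; `clean`, `pipeline`, `stream`, and the sample data are hypothetical names used only to illustrate the concept.

```python
from typing import Iterable, Iterator


def clean(record: str) -> str:
    """A per-element transform, comparable to a map step."""
    return record.strip().lower()


def pipeline(source: Iterable[str]) -> Iterator[str]:
    """Apply the same transforms lazily to any source, bounded or unbounded."""
    for record in source:
        cleaned = clean(record)
        if cleaned:  # drop empty records, comparable to a filter step
            yield cleaned


def stream() -> Iterator[str]:
    """Stand-in for an unbounded source such as a message subscription."""
    n = 0
    while True:
        yield f" Event-{n} "
        n += 1


# Batch: a bounded collection is processed to completion.
batch_result = list(pipeline([" Hello ", "", "WORLD"]))

# Streaming: the same pipeline consumes an unbounded source element by element.
events = pipeline(stream())
stream_result = [next(events) for _ in range(3)]

print(batch_result)
print(stream_result)
```

The transform logic is written once and the source decides whether the job is batch or streaming. A real runner adds what this sketch leaves out: parallel workers, windowing, and fault tolerance.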
Apache Beam SDK and Dataflow Runner
Google Cloud Dataflow overlaps with services such as:
● Amazon Kinesis
● Apache Storm
● Apache Spark
● Facebook Flux
$ java -jar build/libs/transformation-1.0-all.jar \
    --project=ccc-2020-289323 \
    --runner=DataflowRunner \
    --streaming=true \
    --region=us-east1 \
    --tempLocation=gs://chicago-cloud-conference-2020/temp/ \
    --stagingLocation=gs://chicago-cloud-conference-2020/jars/ \
    --filesToStage=build/libs/transformation-1.0-all.jar \
    --maxNumWorkers=2 \
    --numWorkers=1
Overview of GCP services - Dataproc
On-demand Hadoop cluster
● Of the three managed Hadoop cluster services (Amazon EMR, Azure HDInsight, Dataproc),
Dataproc is the fastest to provision
● Easy runtime customization via pip packages
● Not as well integrated with third-party services as the alternatives (Azure HDInsight with Databricks,
Amazon EMR with Apache Zeppelin)
$ gcloud beta dataproc clusters create cluster-name \
    --optional-components=ANACONDA,JUPYTER \
    --image-version=1.4 \
    --enable-component-gateway \
    --bucket=chicago-cloud-conference-2020 \
    --region=us-east1 \
    --project=ccc-2020-289323 \
    --metadata 'PIP_PACKAGES=google-cloud-bigquery google-cloud-storage numpy pandas matplotlib'
Overview of GCP services - Cloud Natural Language API
● What can we do with the Cloud Natural Language API?
○ Reveal the structure and meaning of text via machine learning models
○ Extract information about people, places, and events, mentioned in text
documents, news articles or blog posts
○ Understand sentiment about a product on social media, or parse intent from
customer conversations happening in a call center or a messaging app
● How can we use it?
○ Analyze text uploaded as part of an HTTP request
○ Integrate with Google Cloud Storage
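The two ways of supplying text can be sketched as the JSON body sent to the REST `documents:analyzeSentiment` method: inline via `content`, or by reference via `gcsContentUri`. The helper name `sentiment_request`, the sample text, and the bucket path are hypothetical; authentication and the HTTP call itself are left out, so this only builds the payload.

```python
import json

# REST endpoint of the sentiment method (v1 of the API).
ANALYZE_SENTIMENT_URL = "https://language.googleapis.com/v1/documents:analyzeSentiment"


def sentiment_request(text=None, gcs_uri=None):
    """Build the request body from inline text or from a Cloud Storage object."""
    if (text is None) == (gcs_uri is None):
        raise ValueError("provide exactly one of text or gcs_uri")
    document = {"type": "PLAIN_TEXT"}
    if text is not None:
        document["content"] = text           # text sent with the HTTP request
    else:
        document["gcsContentUri"] = gcs_uri  # text read from Cloud Storage
    return {"document": document, "encodingType": "UTF8"}


inline_body = sentiment_request(text="The demo was great!")
# Hypothetical bucket and object name, for illustration only.
gcs_body = sentiment_request(gcs_uri="gs://example-bucket/tweets/sample.txt")

print(json.dumps(inline_body, indent=2))
print(json.dumps(gcs_body, indent=2))
```

Sending either body as an authenticated POST to `ANALYZE_SENTIMENT_URL` should return a document-level sentiment with a score and a magnitude.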
NLP - Sentiment Analysis
Two types of metrics to consider:
1. Score
a. Ranges between -1.0 (negative) and 1.0 (positive) and corresponds to the
general emotional tendency of the text
2. Magnitude
a. Indicates the general intensity of emotion (both positive and negative) in
a given text, between 0.0 and +inf
b. Magnitude is not normalized: each expression of emotion in the text (both
positive and negative) contributes to the value, so longer texts tend to have
larger magnitudes
Sentiment   Sample Values
Positive    score: 0.8, magnitude: 3.0
Negative    score: -0.6, magnitude: 4.0
Neutral     score: 0.1, magnitude: 0.0
Mixed       score: 0.0, magnitude: 4.0
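One way to turn a (score, magnitude) pair into one of these four labels is a pair of thresholds. The sketch below is an assumption-laden example: the cutoffs `score_cutoff=0.25` and `magnitude_cutoff=1.0` are arbitrary values picked for illustration and should be tuned for the data at hand; `classify` is a hypothetical helper, not part of the API.

```python
def classify(score, magnitude, score_cutoff=0.25, magnitude_cutoff=1.0):
    """Label a sentiment result; both cutoffs are illustrative assumptions."""
    if score >= score_cutoff:
        return "positive"
    if score <= -score_cutoff:
        return "negative"
    # Score near zero: strong emotion on both sides cancels out (mixed),
    # while little emotion overall means the text is neutral.
    return "mixed" if magnitude >= magnitude_cutoff else "neutral"


# The sample values from the slide.
samples = {
    "positive": (0.8, 3.0),
    "negative": (-0.6, 4.0),
    "neutral": (0.1, 0.0),
    "mixed": (0.0, 4.0),
}

labels = {name: classify(score, magnitude) for name, (score, magnitude) in samples.items()}
print(labels)
```

The magnitude is what separates neutral from mixed: both have a score close to zero, but only mixed text carries a lot of emotion.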
Demo - ETL
• Extract – Different sources (Twitter in this case)
• Transform – Cleanup and data presentation
• Load – Columnar format
https://github.com/eschizoid/ccc-2020
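The three ETL steps can be sketched in a few lines of plain Python. This is not the code from the demo repository: the sample tweets, the cleanup rules, and the function names (`extract`, `transform`, `load`) are assumptions, and the "columnar format" is modeled as one list per field rather than a real columnar file format.

```python
import re


def extract():
    """Stand-in for the Twitter source; the tweets are made-up sample data."""
    return [
        {"id": 1, "text": "Loving #GCP Dataflow!  https://example.com/a"},
        {"id": 2, "text": "  BigQuery is fast\n"},
    ]


def transform(tweet):
    """Clean up the text: drop URLs and collapse whitespace."""
    text = re.sub(r"https?://\S+", "", tweet["text"])
    text = re.sub(r"\s+", " ", text).strip()
    return {"id": tweet["id"], "text": text}


def load(rows):
    """Pivot row-oriented records into a column-oriented layout (one list per field)."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


table = load(transform(tweet) for tweet in extract())
print(table)
```

Storing values column by column is what lets an analytic engine read only the fields a query touches.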
Demo - Analytics
Conclusion
• Cost-effective solution if you
know your data access
patterns
• Fully serverless architecture
• Extensible workloads
Q&A
