Liferay & Big Data
Getting value from your data
Miguel Ángel Pastor Olivar
miguel.pastor@liferay.com
Who am I?
• Some random guy
• Member of the Liferay core infrastructure team
• Disclaimer: not a computer scientist
• @miguelinlas3
What are we going to talk about?
• Big Data: what is this about?
• Simple architecture proposal
• Use cases
• Questions (and hopefully answers)
Big Data?
• Data is so big that regular solutions are:
– Extremely slow
– Too small
– Really expensive
• How do we use all the data we already own?
• Volume
– Transactions, data streaming from social media, …
• Velocity
– Torrents of data in real time
• Variety
– Numerical data, text, email, video, audio, …
Popular usages
• Recommender systems
• Predicting the future:
– Netflix autoscales based on past network traffic data
• Churn models
– Big telco companies build social networks to reduce churn
• Sentiment analysis
– Are people talking about you on the Internet?
• Real-time bidding
– Optimise advertising
• Health care
– Improve patients' health while reducing costs
– Improve the quality of life of multiple sclerosis patients
Terminology
• Storage models
– How to store relevant information
• Computation models
– How to process and transform all the information
• Analytics
– How we can take action based on the previous steps
Big Data Architectures
Data storage
Hadoop Distributed File System (HDFS)
• Java-based file system
• Scalable, fault-tolerant, distributed storage
• Designed to run on commodity hardware
• Closely related to MapReduce
Source: http://hortonworks.com/
NoSQL storage
• Semi-structured data
• Focused on
– Horizontal scalability
– Availability
• Different trade-offs: CAP, BASE, …
NewSQL storage
• Modern relational databases
• Same scalable performance as NoSQL for OLTP
• Maintain ACID guarantees
• A few alternatives: VoltDB, Google Spanner, FoundationDB, …
Computation and analytics
Apache Hadoop
Apache Hadoop MapReduce
• Distributed processing
• Large datasets
• Clusters of computers
• Simple programming model
• Verbose and hard-to-use API
#LRNAS2014
[Diagram: MapReduce word count]
Input words: Liferay, projects, is, the, best, Open, Source, project
Map input: (index, "…") records
Sort and shuffle: (best, [1]), (is, [1]), (Liferay, [1]), (Open, [1]), (project, [1,1]), (Source, [1]), (the, [1])
Reduce output: best: 1, is: 1, Liferay: 1, Open: 1, project: 2, Source: 1, the: 1
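The three phases in the word count diagram above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the programming model, not the Hadoop API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the input records are made up for the example, and everything runs in a single process.

```python
from collections import defaultdict


def map_phase(index, line):
    """Map: emit a (word, 1) pair for every word in one (index, line) record."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Sort and shuffle: group all emitted values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse the list of values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for index, line in enumerate(lines) for pair in map_phase(index, line)]
    return dict(reduce_phase(key, values) for key, values in shuffle_phase(mapped))


# Illustrative input records (not the sentence used on the slide).
counts = word_count(["open source project", "best open source project"])
print(counts)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the sketch only shows what each phase receives and emits.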
• Batch-model data crunching
• Not so good at event stream processing
• But …
• Many algorithms are hard to implement using MapReduce
• Cascading, Scalding, Cascalog, Impala, …
Apache Storm
• Distributed realtime computation system
• Makes it easy to reliably process unbounded streams of data
• Multi-language support
• Realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, …
[Diagram: Storm topology with two spouts and three bolts]
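To make the spout and bolt vocabulary concrete, here is a minimal sketch that models a topology with Python generators. It is not the Storm API: the function names and the wiring are assumptions made for illustration, the streams are finite, and the two spouts are read one after the other rather than concurrently as a real cluster would do.

```python
import itertools
from collections import Counter


def sentence_spout(sentences):
    """Spout: a source that emits a stream of tuples (here, sentences)."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Bolt: consume sentences and emit one tuple per lowercased word."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream):
    """Bolt: keep a running count and emit (word, count) after every tuple."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def run_topology(*spouts):
    """Wire the topology: spouts (read in sequence) -> split bolt -> count bolt."""
    merged = itertools.chain(*spouts)
    latest = {}
    for word, count in count_bolt(split_bolt(merged)):
        latest[word] = count  # remember the most recent count seen for each word
    return latest


result = run_topology(
    sentence_spout(["Liferay loves streams"]),
    sentence_spout(["streams never end", "Liferay streams"]),
)
print(result)
```

Because each bolt only sees one tuple at a time, the same wiring would keep working if the spouts never stopped emitting.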
Apache Spark
• Fast and general-purpose cluster computing
• Developed by the Berkeley AMPLab
• High-level APIs (not MapReduce)
• Optimised engine:
– Supports general execution graphs
• Higher-level tools:
– Spark SQL, MLlib, Spark Streaming, GraphX
Apache Mahout
• Scalable machine learning library
• Built on top of Hadoop
• Some algorithms don't require Hadoop at all
R language
• Focused on:
– Data visualisation
– Statistical computations
– Analysis of data
• Tons of built-in packages
• Connects to Hadoop through Hadoop Streaming
• Not a fast language
Reference Architecture
[Diagram: reference architecture. Labels shown: RDBMS, Event Broker, Hadoop, User Tracking, NoSQL, Storage, System Events, Search, Data, Logs, Monitoring, Data Warehouse, Streaming, Social Graph]
Datasources
• System events
• User tracking (client side)
– Clicks, navigation, activities, …
• Monitoring (transactions, page load times, …)
• Models (message boards, blogs, wiki, …)
• Custom developments, …
Event broker
[Diagram: a data source writes records to a log with numbered positions 0 to 9; System A and System B each read from the log]
Apache Kafka
• Publish-subscribe messaging rethought as a distributed commit log
• Fast
• Scalable
• Durable
• Distributed by design
[Diagram: Kafka cluster with Broker A, Broker B and Broker C, a producer, a consumer, and ZooKeeper]
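The commit log idea behind the event broker can be sketched as an append-only list where each reader keeps its own position. This is a single-process illustration, not the Kafka client API: `CommitLog`, `Consumer` and the event names are hypothetical, and partitions, replication and brokers are left out entirely.

```python
class CommitLog:
    """Append-only log: every record gets the next sequential offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the record just written

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]


class Consumer:
    """Each consumer tracks its own offset, so readers never affect each other."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self._log.read(self.offset, max_records)
        self.offset += len(batch)  # advance only past what was actually read
        return batch


log = CommitLog()
for i in range(10):
    log.append(f"event-{i}")

system_a = Consumer(log)
system_b = Consumer(log)
first_a = system_a.poll(max_records=7)
first_b = system_b.poll(max_records=3)
print(system_a.offset, system_b.offset)
```

Since reading never removes anything from the log, a slow consumer such as `system_b` can catch up later without the writer or the other consumer noticing.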
Computation and analytics
Batch processing?
Real-time processing?
Machine learning algorithms?
Graph analysis?
Unified programming model?
Apache Spark
• Fast and general engine for large-scale data processing
• Write your apps in Java, Scala or Python
• Runs on the YARN cluster manager
• Can read any existing Hadoop data (HDFS)
• In memory or on disk
Apache Spark Main Components
[Diagram: Apache Spark with its components Spark SQL, Spark Streaming, MLlib and GraphX]
Spark Core
• The driver program runs the main function and executes various parallel operations on a cluster
• Resilient Distributed Datasets (RDDs), created from:
– HDFS (or any Hadoop file system)
– A Scala collection
• Second abstraction: shared variables
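A rough feel for how an RDD behaves can be had from a toy class that records transformations lazily and only computes when an action is called. `MiniRDD` is a hypothetical stand-in written for this example, not the Spark API: the method names echo Spark's, but there is no partitioning or cluster, and the recorded chain of transformations plays the role of the lineage that lets results be recomputed.

```python
from functools import reduce as fold


class MiniRDD:
    """Toy stand-in for an RDD: records a chain of transformations lazily."""

    def __init__(self, source, transforms=()):
        self._source = source          # callable returning a fresh iterator, so it can be replayed
        self._transforms = transforms  # the lineage: transformations to apply, in order

    @classmethod
    def parallelize(cls, collection):
        data = list(collection)
        return cls(lambda: iter(data))

    def map(self, fn):
        # Transformation: returns a new MiniRDD, computes nothing yet.
        return MiniRDD(self._source, self._transforms + (lambda it: map(fn, it),))

    def filter(self, predicate):
        # Transformation: returns a new MiniRDD, computes nothing yet.
        return MiniRDD(self._source, self._transforms + (lambda it: filter(predicate, it),))

    def _compute(self):
        stream = self._source()
        for transform in self._transforms:
            stream = transform(stream)
        return stream

    def collect(self):
        # Action: triggers the computation and returns all elements.
        return list(self._compute())

    def reduce(self, fn):
        # Action: triggers the computation and folds the elements into one value.
        return fold(fn, self._compute())


calls = []


def square(x):
    calls.append(x)  # record every invocation to make laziness visible
    return x * x


numbers = MiniRDD.parallelize(range(1, 6))
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(square)
calls_before_action = list(calls)  # still empty: nothing has run yet
total = squares_of_evens.reduce(lambda a, b: a + b)
print(calls_before_action, total)
```

Each action replays the whole chain from the source, which is the property that lets a lost result be rebuilt rather than stored.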
Spark SQL
• Mix SQL queries with Spark programs
• Unified data access
• Hive compatibility
• Standard JDBC or ODBC connectivity
• Same engine for both interactive and long-running queries
Spark Streaming
• Build your apps using high-level operators
• Fault tolerance: exactly-once semantics out of the box
• Combine streaming with batch and interactive queries
• Can read from HDFS, Flume, Kafka, Twitter and ZeroMQ
• Define your own custom data sources
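One way to see why streaming and batch code can be combined is to treat the stream as a sequence of small batches and reuse an ordinary batch function on each of them. The sketch below does this in plain Python. It is not the Spark Streaming API: the function names and events are invented, and batches are cut by record count here, whereas Spark Streaming cuts them by time interval.

```python
from collections import Counter
from itertools import islice


def count_words(records):
    """Ordinary batch logic: word counts for a finite collection of lines."""
    counts = Counter()
    for line in records:
        counts.update(line.lower().split())
    return counts


def micro_batches(stream, batch_size):
    """Cut a (possibly unbounded) iterator into consecutive small batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def streaming_word_count(stream, batch_size):
    """Apply the batch function to each micro-batch and keep a running total."""
    running = Counter()
    for batch in micro_batches(stream, batch_size):
        running.update(count_words(batch))  # Counter.update adds the counts
        yield dict(running)


events = ["login alice", "login bob", "logout alice", "login carol", "logout bob"]
snapshots = list(streaming_word_count(events, batch_size=2))
print(snapshots[-1])
```

The same `count_words` function serves both a one-off batch job over `events` and the running stream, which is the point of sharing one programming model.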
Spark MLlib
• Basic statistics
– Summary statistics
– Correlations
– …
• Classification and regression
– Linear models
– Decision trees
– Naive Bayes
• Clustering
– K-means
• Collaborative filtering
– Alternating least squares
• Dimensionality reduction
– Singular value decomposition
– Principal component analysis
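As an example of one entry in the list above, here is K-means written as plain Lloyd's algorithm on 2-D points. It is a single-machine sketch under simple assumptions (fixed starting centroids, squared Euclidean distance) and says nothing about how MLlib initialises or distributes the computation; the `kmeans` function and the sample points are made up for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [(point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for centroid, cluster in zip(centroids, clusters):
            if cluster:
                xs, ys = zip(*cluster)
                new_centroids.append((sum(xs) / len(cluster), sum(ys) / len(cluster)))
            else:
                new_centroids.append(centroid)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: the centroids stopped moving
        centroids = new_centroids
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

The two steps repeat until the centroids stop moving or the iteration budget runs out.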
Spark GraphX
• Graph API and graph-parallel computation
• Growing scale and importance of graph data
– From social networks to language modelling
• Directed multigraph with properties attached to each vertex and edge
• Growing collection of graph algorithms and builders
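The "directed multigraph with properties" data model can be sketched in plain Python as follows. `PropertyGraph` and `Edge` are hypothetical names for this illustration, not the GraphX API, and the people and relationships are invented; the sketch only shows that vertices and edges both carry properties and that parallel edges between the same pair of vertices are allowed.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: int
    dst: int
    attr: str  # property attached to the edge


class PropertyGraph:
    """Directed multigraph: vertices and edges both carry properties."""

    def __init__(self, vertices, edges):
        self.vertices = dict(vertices)  # vertex id -> property
        self.edges = list(edges)        # a list, so parallel edges are allowed

    def in_degrees(self):
        return Counter(edge.dst for edge in self.edges)

    def out_degrees(self):
        return Counter(edge.src for edge in self.edges)

    def triplets(self):
        """Join each edge with the properties of its two endpoints."""
        return [(self.vertices[e.src], e.attr, self.vertices[e.dst]) for e in self.edges]


graph = PropertyGraph(
    vertices={1: "alice", 2: "bob", 3: "carol"},
    edges=[
        Edge(1, 2, "follows"),
        Edge(1, 2, "messaged"),  # second edge between the same pair of vertices
        Edge(3, 2, "follows"),
        Edge(2, 1, "follows"),
    ],
)
print(graph.in_degrees())
print(graph.triplets())
```

Degree counts and edge-with-endpoints views like these are typical building blocks for larger graph algorithms.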
Live demo!
Building a message classifier
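The demo code itself is not part of these slides. As a rough idea of what a message classifier can look like, here is a small multinomial Naive Bayes sketch in plain Python. Everything in it is an assumption made for illustration: the choice of Naive Bayes, the `NaiveBayesClassifier` class, the labels and the training messages.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Multinomial Naive Bayes over word counts, with add-one (Laplace) smoothing."""

    def __init__(self):
        self.label_counts = Counter()            # messages seen per label
        self.word_counts = defaultdict(Counter)  # label -> word -> occurrences
        self.vocabulary = set()

    def train(self, messages):
        for text, label in messages:
            words = text.lower().split()
            self.label_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocabulary.update(words)

    def predict(self, text):
        total = sum(self.label_counts.values())
        scores = {}
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior of the label
            denominator = sum(self.word_counts[label].values()) + len(self.vocabulary)
            for word in text.lower().split():
                # smoothed log likelihood of the word given the label
                score += math.log((self.word_counts[label][word] + 1) / denominator)
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical training data: message board posts tagged by hand.
classifier = NaiveBayesClassifier()
classifier.train([
    ("how do I deploy a portlet", "question"),
    ("where can I find the docs", "question"),
    ("thanks this fixed my problem", "feedback"),
    ("great release thanks team", "feedback"),
])
print(classifier.predict("how can I deploy this"))
print(classifier.predict("thanks great team"))
```

Working with log probabilities avoids numeric underflow on longer messages, and the smoothing keeps unseen words from zeroing out a label.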
Takeaways
• It's not about data size, but how you use it
• You already own tons of data; you just need to get value from it
• There is no silver bullet: you have plenty of alternatives
• JVM-based Big Data technologies are usually a great choice
• Try it yourself!
References
• Apache Kafka
• Apache Spark
• Apache Storm
• Apache Hadoop
• Big Data definition at Wikipedia
• Liferay Kafka Bridge
• What every software engineer should know about a log
Thank you!
Questions (and hopefully answers)

Liferay & Big Data Dev Con 2014
