Fast Queries
on Data Lakes
Exposing big data and streaming analytics using Hadoop, Cassandra, Akka and Spray
Natalino Busa
@natalinobusa
Big and Fast.
Tools Architecture Hands on Application!
Parallelism Hadoop Cassandra Akka
Machine Learning Statistics Big Data
Algorithms Cloud Computing Scala Spray
Natalino Busa
@natalinobusa
www.natalinobusa.com
Challenges
Not much time to react
Events must be delivered fast to the new machine APIs
It’s Web, and Mobile Apps: latency budget is limited
Loads of information to process
Understand the user history well
Access a larger context
OK, let’s build some apps
A home-brewed Wikipedia search engine … Yeee ^-^/
Tools of the day:
Hadoop: Distributed Data OS
Reliable
Distributed, Replicated File System
Low cost
↓ Cost vs ↑ Performance/Storage
Computing Powerhouse
All of the cluster's CPUs work in parallel to run queries
Cassandra: A low-latency 2D store
Reliable
Distributed, replicated data store
Low latency
Sub msec. read/write operations
Tunable CAP
Define your level of consistency
Data model:
hashed rows, sorted wide columns
Architecture model:
No SPOF, ring of nodes,
homogeneous system
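The "Tunable CAP" point can be made concrete with a little arithmetic. The sketch below is not from the slides: it only illustrates the usual quorum rule (a read is guaranteed to overlap the latest write when read replicas + write replicas > replication factor). The function names are hypothetical, and the value 3 matches the replication_factor of the wikipedia keyspace defined later in this deck.

```python
def quorum(replication_factor: int) -> int:
    """Size of a majority of the replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor: int,
                           write_replicas: int,
                           read_replicas: int) -> bool:
    """True when every read set must intersect every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


N = 3  # replication_factor of the wikipedia keyspace in this deck

print(quorum(N))                                         # 2
print(reads_see_latest_write(N, quorum(N), quorum(N)))   # True
print(reads_see_latest_write(N, 1, 1))                   # False
```

Reading and writing at a single replica is the low-latency end of the trade-off; quorum reads and writes are the consistent end.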
Lambda architecture: Batch + Streaming
[Diagram labels: Batch Computing; Streaming Computing; All Data; Fast Data; In-Memory Distributed Database; HTTP RESTful API; low-latency Web API services]
How to: build an inverted index
wikipedia abstracts (url, title, abstract, sections)
  -> hadoop mapper.py: publish pages on Cassandra, produce inverted index entries
  -> hadoop reducer.py: top 10 URLs per word go to Cassandra
Inverted index example: Apple -> Apple Inc, Apple Tree, The Big Apple
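As a minimal in-memory sketch of the idea (not the Hadoop job itself), an inverted index maps each word to the documents that contain it. The function name and the choice to index page titles only are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(titles):
    """Map each lowercase word to the set of titles that contain it."""
    index = defaultdict(set)
    for title in titles:
        for word in title.lower().split():
            index[word].add(title)
    return index


titles = ["Apple Inc", "Apple Tree", "The Big Apple"]
index = build_inverted_index(titles)

print(sorted(index["apple"]))  # ['Apple Inc', 'Apple Tree', 'The Big Apple']
print(sorted(index["tree"]))   # ['Apple Tree']
```

The MapReduce version distributes the same two steps: the mapper emits (word, page) pairs and the reducer collects the pages per word.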
CREATE KEYSPACE wikipedia WITH replication =
{'class': 'SimpleStrategy', 'replication_factor': 3};
CREATE TABLE wikipedia.pages (
url text,
title text,
abstract text,
length int,
refs int,
PRIMARY KEY (url)
);
CREATE TABLE wikipedia.inverted (
keyword text,
relevance int,
url text,
PRIMARY KEY ((keyword), relevance)
);
Data model ...
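To see what the primary key of wikipedia.inverted implies, here is a small in-memory stand-in, written as a sketch under assumptions: the class and method names are hypothetical, and a plain dict replaces Cassandra. The partition key (keyword) selects a partition, and the clustering column (relevance) both orders and identifies rows inside it.

```python
class InvertedTable:
    """In-memory model of PRIMARY KEY ((keyword), relevance)."""

    def __init__(self):
        self.partitions = {}  # keyword -> {relevance: url}

    def insert(self, keyword, relevance, url):
        # An insert on an existing (keyword, relevance) pair overwrites the row,
        # because the primary key identifies a single row.
        self.partitions.setdefault(keyword, {})[relevance] = url

    def top(self, keyword, limit=10):
        """URLs for a keyword, most relevant first."""
        rows = self.partitions.get(keyword, {})
        return [rows[rel] for rel in sorted(rows, reverse=True)[:limit]]


table = InvertedTable()
table.insert("ductile", 8930, "http://en.wikipedia.org/wiki/Gold")
table.insert("ductile", 8452, "http://en.wikipedia.org/wiki/Hydroforming")
table.insert("ductile", 7930, "http://en.wikipedia.org/wiki/Liquid_metal_embrittlement")

print(table.top("ductile", limit=2))
```

One consequence worth checking against a real schema: with this key, two pages that share the same keyword and relevance collapse into one row. Adding url as a second clustering column would be one way to avoid that.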
[Diagram: map (k,v) -> shuffle & sort -> reduce (k,list(v)); boxes labelled disk, compute, and memory at each stage]
cat enwiki-latest-abstracts.xml | ./mapper.py | ./reducer.py
Map-Reduce
demystified
./mapper.py
produces tab separated triplets:
element 008930 http://en.wikipedia.org/wiki/Gold
with 008930 http://en.wikipedia.org/wiki/Gold
symbol 008930 http://en.wikipedia.org/wiki/Gold
atomic 008930 http://en.wikipedia.org/wiki/Gold
number 008930 http://en.wikipedia.org/wiki/Gold
dense 008930 http://en.wikipedia.org/wiki/Gold
soft 008930 http://en.wikipedia.org/wiki/Gold
malleable 008930 http://en.wikipedia.org/wiki/Gold
ductile 008930 http://en.wikipedia.org/wiki/Gold
Map-Reduce
demystified
./reducer.py
produces tab separated triplets for the same key:
ductile 008930 http://en.wikipedia.org/wiki/Gold
ductile 008452 http://en.wikipedia.org/wiki/Hydroforming
ductile 007930 http://en.wikipedia.org/wiki/Liquid_metal_embrittlement
...
Map-Reduce
demystified
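Between the mapper and the reducer sits the shuffle & sort step. A rough local stand-in is simply sorting the mapper's output lines, as sketched below. This is an illustration, not Hadoop's implementation: Hadoop itself only guarantees ordering by key, whereas a plain sort orders whole lines. The sketch also shows why zero-padding the relevance (six digits in the triplets above) matters when values are compared as text.

```python
mapper_output = [
    "ductile\t007930\thttp://en.wikipedia.org/wiki/Liquid_metal_embrittlement",
    "atomic\t008930\thttp://en.wikipedia.org/wiki/Gold",
    "ductile\t008930\thttp://en.wikipedia.org/wiki/Gold",
    "ductile\t008452\thttp://en.wikipedia.org/wiki/Hydroforming",
]

# Stand-in for shuffle & sort: after sorting, equal keys are contiguous.
shuffled = sorted(mapper_output)
keys = [line.split("\t", 1)[0] for line in shuffled]
print(keys)  # ['atomic', 'ductile', 'ductile', 'ductile']

# Text comparison only matches numeric order when the numbers are zero-padded.
print(sorted(["930", "8452"]))        # ['8452', '930']
print(sorted(["000930", "008452"]))   # ['000930', '008452']
```

Because equal keys end up next to each other, a reducer can process one word at a time while reading its input as a stream.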
def main():
    global cassandra_client
    logging.basicConfig()
    cassandra_client = CassandraClient()
    cassandra_client.connect(['127.0.0.1'])
    readLoop()
    cassandra_client.close()
Mapper ...
doc = ET.fromstring(doc)
...
# extract words from title and abstract
words = [w for w in txt.split() if w not in STOPWORDS and len(w) > 2]
# relevance algorithm
relevance = len(abstract) * len(links)
# mapper output to cassandra wikipedia.pages table
cassandra_client.insertPage(url, title, abstract, length, refs)
# emit unique key-value pairs, tab separated
emitted = list()
for word in words:
    if word not in emitted:
        print('%s\t%06d\t%s' % (word, relevance, url))
        emitted.append(word)
Mapper ...
Tab (\t) split !!!
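The mapper fragment above leaves out the XML parsing. A self-contained sketch of the map step for one document might look as follows. Several things here are assumptions rather than the tutorial's code: the element names (title, url, abstract, links/sublink), the tiny stopword set, the lowercasing, and the function name map_doc. The Cassandra insert is left out so the block runs on its own.

```python
import xml.etree.ElementTree as ET

STOPWORDS = {"the", "and", "with", "for"}  # assumption: tiny illustrative list


def map_doc(xml_doc):
    """Return tab-separated (word, relevance, url) lines for one <doc> element."""
    doc = ET.fromstring(xml_doc)
    url = doc.findtext("url", default="")
    title = doc.findtext("title", default="")
    abstract = doc.findtext("abstract", default="")
    links = doc.findall("links/sublink")

    # Same heuristic as the slide: abstract length times number of links.
    relevance = len(abstract) * len(links)

    words = [w for w in (title + " " + abstract).lower().split()
             if w not in STOPWORDS and len(w) > 2]

    # dict.fromkeys removes duplicates while keeping first-seen order.
    return ["%s\t%06d\t%s" % (w, relevance, url) for w in dict.fromkeys(words)]


sample = (
    "<doc><title>Gold</title>"
    "<url>http://en.wikipedia.org/wiki/Gold</url>"
    "<abstract>soft dense metal</abstract>"
    "<links><sublink/><sublink/></links></doc>"
)

for line in map_doc(sample):
    print(line)
```

Using dict.fromkeys instead of a list lookup keeps the de-duplication linear in the number of words.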
Export during the "map" phase
wikipedia abstracts (url, title, abstract, sections)
  -> hadoop mapper.sh: publish pages on Cassandra, extract inverted index
  -> hadoop reducer.sh: top 10 URLs per word go to Cassandra
Inverted index: Apple -> Apple Inc, Apple Tree, The Big Apple
[Diagram: map (k,v) -> shuffle & sort -> reduce (k,list(v)), with three Cassandra outputs attached to the compute stages]
import logging

from cassandra.cluster import Cluster

log = logging.getLogger(__name__)

class CassandraClient:
    session = None
    insert_page_statement = None

    def connect(self, nodes):
        cluster = Cluster(nodes)
        metadata = cluster.metadata
        self.session = cluster.connect()
        log.info('Connected to cluster: ' + metadata.cluster_name)
        # prepare the statements once, right after connecting
        self.prepareStatement()

    def close(self):
        self.session.cluster.shutdown()
        self.session.shutdown()
        log.info('Connection closed.')
Cassandra client
    def prepareStatement(self):
        self.insert_page_statement = self.session.prepare("""
            INSERT INTO wikipedia.pages
            (url, title, abstract, length, refs)
            VALUES (?, ?, ?, ?, ?);
        """)

    def insertPage(self, url, title, abstract, length, refs):
        self.session.execute(
            self.insert_page_statement.bind(
                (url, title, abstract, length, refs)))
Cassandra client
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar \
  -files mapper.py,reducer.py \
  -mapper ./mapper.py \
  -reducer ./reducer.py \
  -jobconf stream.num.map.output.key.fields=1 \
  -jobconf stream.num.reduce.output.key.fields=1 \
  -jobconf mapred.reduce.tasks=16 \
  -input wikipedia-latest-abstract \
  -output $HADOOP_OUTPUT_DIR
YARN: mapreduce v2
Using map-reduce and yarn
Export the inverted index during the "reduce" phase
wikipedia abstracts (url, title, abstract, sections)
  -> hadoop mapper.sh: publish pages on Cassandra, extract inverted index
  -> hadoop reducer.sh: top 10 URLs per word go to Cassandra
Inverted index: Apple -> Apple Inc, Apple Tree, The Big Apple
SELECT TRANSFORM (url, abstract, links)
USING 'mapper.py' AS
(relevance, url)
FROM hive_wiki_table
ORDER BY relevance LIMIT 50;
Hive UDF
functions and
hooks
Second method: using hive sql queries
def emit_ranking(n=100):
    global sorted_dict
    for i in range(n):
        cassandra_client.insertWord(current_word, relevance, url)
    …

def readLoop():
    # input comes from STDIN
    for line in sys.stdin:
        # parse the tab-separated input we got from mapper.py
        word, relevance, url = line.split('\t', 2)
        if current_word == word:
            sorted_dict[relevance] = url
        else:
            if current_word:
                emit_ranking()
            …
Reducer ...
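The reducer fragment above is elided in places, so here is a separate, self-contained sketch of the same idea: read tab-separated triplets grouped by word and keep the most relevant URLs per word. It swaps in itertools.groupby and heapq.nlargest in place of the slide's sorted_dict bookkeeping, and it returns the rankings instead of writing them to Cassandra. The function names are hypothetical.

```python
import heapq
from itertools import groupby


def parse(line):
    """Split one 'word<TAB>relevance<TAB>url' line into typed fields."""
    word, relevance, url = line.rstrip("\n").split("\t", 2)
    return word, int(relevance), url


def reduce_top_n(lines, n=10):
    """Yield (word, [(relevance, url), ...]) with the n best URLs per word.

    Assumes the lines arrive grouped by word, as they do after shuffle & sort.
    """
    records = (parse(line) for line in lines if line.strip())
    for word, group in groupby(records, key=lambda rec: rec[0]):
        best = heapq.nlargest(n, ((rel, url) for _, rel, url in group))
        yield word, best


sample = [
    "atomic\t008930\thttp://en.wikipedia.org/wiki/Gold\n",
    "ductile\t007930\thttp://en.wikipedia.org/wiki/Liquid_metal_embrittlement\n",
    "ductile\t008452\thttp://en.wikipedia.org/wiki/Hydroforming\n",
    "ductile\t008930\thttp://en.wikipedia.org/wiki/Gold\n",
]

for word, ranking in reduce_top_n(sample, n=2):
    print(word, ranking)
```

In a streaming job the same function could be fed sys.stdin directly, with each yielded ranking written to the inverted index table.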
[Diagram: map (k,v) -> shuffle & sort -> reduce (k,list(v)), with two Cassandra outputs attached to the compute stages]
Front-end:
import json

from flask import Flask, Response

app = Flask(__name__)

@app.route('/word/<keyword>')
def fetch_word(keyword):
    # get_cassandra() is not shown on the slide
    db = get_cassandra()
    pages = []
    results = db.fetchWordResults(keyword)
    for hit in results:
        pages.append(db.fetchPageDetails(hit["url"]))
    return Response(json.dumps(pages), status=200, mimetype="application/json")

if __name__ == '__main__':
    app.run()
Front-End:
prototyping in Flask
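The lookup behind the endpoint is a two-step read: fetch the ranked URLs for the keyword, then fetch the page details for each URL. The sketch below shows that logic without Flask or Cassandra. The two dictionaries are hypothetical stand-ins for the wikipedia.inverted and wikipedia.pages tables, and the sample row is illustrative only.

```python
import json

# Hypothetical in-memory stand-ins for the two Cassandra tables.
INVERTED = {
    "ductile": [
        (8452, "http://en.wikipedia.org/wiki/Hydroforming"),
        (8930, "http://en.wikipedia.org/wiki/Gold"),
    ],
}
PAGES = {
    "http://en.wikipedia.org/wiki/Gold": {"title": "Gold"},
    "http://en.wikipedia.org/wiki/Hydroforming": {"title": "Hydroforming"},
}


def fetch_word(keyword, limit=10):
    """Return a JSON string with page details, most relevant hit first."""
    hits = sorted(INVERTED.get(keyword, []), reverse=True)[:limit]
    pages = [dict(PAGES[url], url=url) for _, url in hits if url in PAGES]
    return json.dumps(pages)


print(fetch_word("ductile"))
print(fetch_word("unknown"))  # []
```

Each request touches one partition of the inverted index plus one row per hit, which is what keeps the read path short.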
Expose during Map or Reduce?
Expose in Map
- only access to local information
- simple, distributed "awk" filter
Expose in Reduce
- need to collect data scattered across your cluster
- analysis on all the available data
Latency tradeoffs
Two runtime frameworks:
cassandra : in-memory, low-latency
hadoop : extensive, exhaustive, churns all the data
Statistics and machine learning:
Python and R : they can be used for batch and/or realtime
Fastest analysis: still the domain of C, Java, Scala
Some lessons learned
● Use mapreduce to (pre)process data
● Connect to Cassandra during MR
● Use MR for batch heavy lifting
● Lambda architecture: Fast Data + All Data
Some lessons learned
Expose results to Cassandra for fast access
- responsive apps
- high throughput / low latency
Hadoop as a background tool
- data validation, new extractions, new algorithms
- data harmonization, correction, immutable system of records
The tutorial is on github
https://github.com/natalinobusa/wikipedia
Parallelism Mathematics Programming
Languages Machine Learning Statistics
Big Data Algorithms Cloud Computing
Natalino Busa
@natalinobusa
www.natalinobusa.com
Thanks !
Any questions?

