Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.

Fast Queries
on Data Lakes
Exposing bigdata and streaming analytics using hadoop, cassandra, akka and spray
Natalino Busa
@natalinobusa

Big and Fast.
Tools Architecture Hands on Application!

Parallelism Hadoop Cassandra Akka
Machine Learning Statistics Big Data
Algorithms Cloud Computing Scala Spray
Natalino Busa
@natalinobusa
www.natalinobusa.com

Challenges
Not much time to react
Events must be delivered fast to the new machine APIs
It’s Web, and Mobile Apps: latency budget is limited
Loads of information to process
Understand the user history well
Access a larger context

home brewed
wikipedia search
engine … Yeee ^-^/

Hadoop: Distributed Data OS
Reliable
Distributed, Replicated File System
Low cost
↓ Cost vs ↑ Performance/Storage
Computing Powerhouse
All the cluster's CPUs working in parallel to
run queries

Cassandra: A low-latency 2D store
Reliable
Distributed, Replicated Data Store
Low latency
Sub msec. read/write operations
Tunable CAP
Define your level of consistency
Data model:
hashed rows, sorted wide columns
Architecture model:
No SPOF, ring of nodes,
homogeneous system
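
The "hashed rows, sorted wide columns" model can be pictured with a small in-memory stand-in. This is only a sketch: `WideRowStore` and its methods are hypothetical names, not part of Cassandra or its driver. A dict lookup plays the role of hashing the row key, and each row keeps its columns sorted.

```python
import bisect
from collections import defaultdict


class WideRowStore(object):
    """Toy model: row key -> columns kept sorted by column key."""

    def __init__(self):
        # the dict lookup stands in for hashing the row key to a node
        self.rows = defaultdict(list)

    def insert(self, row_key, column_key, value):
        # keep (column_key, value) pairs sorted inside the row
        bisect.insort(self.rows[row_key], (column_key, value))

    def top(self, row_key, n):
        # read a slice of the sorted row, highest column keys first
        return self.rows[row_key][::-1][:n]


store = WideRowStore()
store.insert('ductile', 8930, 'http://en.wikipedia.org/wiki/Gold')
store.insert('ductile', 8452, 'http://en.wikipedia.org/wiki/Hydroforming')
store.insert('soft', 8930, 'http://en.wikipedia.org/wiki/Gold')
best = store.top('ductile', 1)
```

Because the columns are already sorted, reading the best entries for a key is a slice rather than a scan, which is the property the search example relies on.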

Lambda Architecture: Batch + Streaming
[Diagram: Batch Computing (All Data) and Streaming Computing (Fast Data) feed in-memory distributed DBs, which are exposed through low-latency HTTP RESTful Web API services]
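
One way to read the Lambda architecture: a query combines a precomputed batch view ("All Data") with a small real-time view ("Fast Data"). The sketch below uses plain dicts; `batch_view`, `realtime_view` and `query` are illustrative names, and the additive merge is an assumption that suits counters, not something the slides prescribe.

```python
def query(keyword, batch_view, realtime_view):
    """Merge the batch result with increments seen since the last batch run."""
    return batch_view.get(keyword, 0) + realtime_view.get(keyword, 0)


# batch layer: counts recomputed periodically over all the data
batch_view = {'gold': 120, 'apple': 300}
# speed layer: counts from events that arrived after the last batch run
realtime_view = {'gold': 3, 'silver': 1}

totals = dict((k, query(k, batch_view, realtime_view))
              for k in ('gold', 'apple', 'silver'))
```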

[Diagram: wikipedia abstracts (url, title, abstract, sections) -> hadoop mapper.py (publish pages on Cassandra, produce inverted index entries) -> hadoop reducer.py (top 10 URLs per word go to Cassandra)]
How to: Build an inverted index
Apple -> Apple Inc, Apple Tree, The Big Apple
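
The "Apple" example can be reproduced in a few lines. This is a minimal in-memory sketch, assuming titles are split on whitespace and lowercased; `build_inverted_index` is a name introduced here for illustration.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each lowercase title word to the sorted list of titles containing it."""
    index = defaultdict(set)
    for title in pages:
        for word in title.lower().split():
            index[word].add(title)
    return dict((word, sorted(titles)) for word, titles in index.items())


pages = ['Apple Inc', 'Apple Tree', 'The Big Apple', 'Gold']
index = build_inverted_index(pages)
apple_hits = index['apple']
```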

CREATE KEYSPACE wikipedia WITH replication =
    {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE wikipedia.pages (
    url text,
    title text,
    abstract text,
    length int,
    refs int,
    PRIMARY KEY (url)
);

CREATE TABLE wikipedia.inverted (
    keyword text,
    relevance int,
    url text,
    PRIMARY KEY ((keyword), relevance)
);
Data model ...

[Diagram: cluster nodes, each with memory and a disk -> compute -> disk pipeline; stages: map (k,v), shuffle & sort, reduce (k, list(v))]

cat enwiki-latest-abstracts.xml | ./mapper.py | sort | ./reducer.py
Map-Reduce
demystified
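
The three stages can also be simulated in a single process. The sketch below is an illustration under simple assumptions (everything fits in memory, keys are strings); the function names are hypothetical and it uses a word count rather than the Wikipedia job.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(records, mapper):
    # apply the mapper to every record and flatten the emitted (k, v) pairs
    return [pair for record in records for pair in mapper(record)]


def shuffle_and_sort(pairs):
    # bring equal keys together (Hadoop does this between map and reduce)
    return sorted(pairs, key=itemgetter(0))


def reduce_phase(sorted_pairs, reducer):
    # hand each key and the list of its values to the reducer
    return [reducer(key, [v for _, v in group])
            for key, group in groupby(sorted_pairs, key=itemgetter(0))]


def word_mapper(line):
    return [(word, 1) for word in line.split()]


def count_reducer(key, values):
    return (key, sum(values))


lines = ['gold is soft', 'gold is dense']
counts = reduce_phase(shuffle_and_sort(map_phase(lines, word_mapper)),
                      count_reducer)
```

The reducer only sees correct groups because the pairs are sorted by key first; without the sort step, equal keys would not be adjacent.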

./mapper.py
produces tab-separated triplets:
element 008930 http://en.wikipedia.org/wiki/Gold
with 008930 http://en.wikipedia.org/wiki/Gold
symbol 008930 http://en.wikipedia.org/wiki/Gold
atomic 008930 http://en.wikipedia.org/wiki/Gold
number 008930 http://en.wikipedia.org/wiki/Gold
dense 008930 http://en.wikipedia.org/wiki/Gold
soft 008930 http://en.wikipedia.org/wiki/Gold
malleable 008930 http://en.wikipedia.org/wiki/Gold
ductile 008930 http://en.wikipedia.org/wiki/Gold
Map-Reduce
demystified

./reducer.py
produces tab-separated triplets for the same key:
ductile 008930 http://en.wikipedia.org/wiki/Gold
ductile 008452 http://en.wikipedia.org/wiki/Hydroforming
ductile 007930 http://en.wikipedia.org/wiki/Liquid_metal_embrittlement
...
Map-Reduce
demystified

def main():
    global cassandra_client
    logging.basicConfig()
    cassandra_client = CassandraClient()
    cassandra_client.connect(['127.0.0.1'])
    readLoop()
    cassandra_client.close()
Mapper ...

doc = ET.fromstring(doc)
...
# extract words from title and abstract
words = [w for w in txt.split() if w not in STOPWORDS and len(w) > 2]
# relevance algorithm
relevance = len(abstract) * len(links)
# mapper output to the cassandra wikipedia.pages table
cassandra_client.insertPage(url, title, abstract, length, refs)
# emit each unique key-value pair once
emitted = list()
for word in words:
    if word not in emitted:
        # the three fields are separated by tabs (\t)
        print('%s\t%06d\t%s' % (word, relevance, url))
        emitted.append(word)
Mapper ...
\t split !!!
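
The emit logic of the mapper can be tried without Hadoop or Cassandra. The sketch below is a simplified stand-in, not the tutorial's mapper: `map_page` is a hypothetical helper, the `STOPWORDS` set is made up (the real list is not shown on the slides), and lowercasing the text is an added assumption.

```python
STOPWORDS = frozenset(['the', 'and', 'for'])  # assumed; the real list is not shown


def map_page(url, title, abstract, links):
    """Return one 'word<TAB>relevance<TAB>url' line per distinct word."""
    txt = (title + ' ' + abstract).lower()
    words = [w for w in txt.split() if w not in STOPWORDS and len(w) > 2]
    # same relevance formula as on the slide
    relevance = len(abstract) * len(links)
    lines = []
    seen = set()  # set membership is O(1), unlike scanning a list
    for word in words:
        if word not in seen:
            seen.add(word)
            lines.append('%s\t%06d\t%s' % (word, relevance, url))
    return lines


out = map_page('http://en.wikipedia.org/wiki/Gold', 'Gold',
               'gold is a soft dense metal', ['a', 'b'])
```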

[Diagram: wikipedia abstracts -> hadoop mapper.sh (publish pages on Cassandra) -> hadoop reducer.sh (extract inverted index, go to Cassandra)]
Inverted index: export during the "map" phase

[Diagram: the map-reduce data flow (memory, disk -> compute -> disk) with compute stages writing to cassandra]

import logging
from cassandra.cluster import Cluster

log = logging.getLogger(__name__)

class CassandraClient:
    session = None
    insert_page_statement = None

    def connect(self, nodes):
        cluster = Cluster(nodes)
        metadata = cluster.metadata
        self.session = cluster.connect()
        log.info('Connected to cluster: ' + metadata.cluster_name)
        # prepare the statements once, right after connecting
        self.prepareStatement()

    def close(self):
        # close the session first, then the cluster that owns it
        cluster = self.session.cluster
        self.session.shutdown()
        cluster.shutdown()
        log.info('Connection closed.')
Cassandra client

    def prepareStatement(self):
        self.insert_page_statement = self.session.prepare("""
            INSERT INTO wikipedia.pages
            (url, title, abstract, length, refs)
            VALUES (?, ?, ?, ?, ?);
        """)

    def insertPage(self, url, title, abstract, length, refs):
        self.session.execute(
            self.insert_page_statement.bind(
                (url, title, abstract, length, refs)))
Cassandra client

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar \
    -files mapper.py,reducer.py \
    -mapper ./mapper.py \
    -reducer ./reducer.py \
    -jobconf stream.num.map.output.key.fields=1 \
    -jobconf stream.num.reduce.output.key.fields=1 \
    -jobconf mapred.reduce.tasks=16 \
    -input wikipedia-latest-abstract \
    -output $HADOOP_OUTPUT_DIR
YARN: mapreduce v2
Using map-reduce and yarn

[Diagram: wikipedia abstracts -> hadoop mapper.sh (publish pages on Cassandra) -> hadoop reducer.sh (extract inverted index, go to Cassandra)]
Inverted index: export the inverted index during the "reduce" phase

SELECT TRANSFORM (url, abstract, links)
    USING 'mapper.py' AS
    (relevance, url)
FROM hive_wiki_table
ORDER BY relevance LIMIT 50;
Hive UDF functions and hooks
Second method: using Hive SQL queries

def emit_ranking(n=100):
    global sorted_dict
    for i in range(n):
        cassandra_client.insertWord(current_word, relevance, url)
    …

def readLoop():
    # input comes from STDIN
    for line in sys.stdin:
        # parse the tab-separated input we got from mapper.py
        word, relevance, url = line.split('\t', 2)
        if current_word == word:
            sorted_dict[relevance] = url
        else:
            if current_word:
                emit_ranking()
            …
Reducer ...
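
Since parts of the reducer are elided on the slide, here is a separate, self-contained sketch of the same idea: keep the most relevant urls for each word from input sorted by word. It is not the tutorial's reducer. `top_urls_per_word` is a hypothetical name, and it uses `heapq.nlargest` instead of a dict keyed by relevance so that urls with equal relevance are not overwritten.

```python
import heapq
from itertools import groupby


def top_urls_per_word(lines, n=10):
    """Yield (word, [(relevance, url), ...]) with the n most relevant urls.

    `lines` must already be sorted by word, as Hadoop guarantees for
    reducer input.
    """
    parsed = (line.rstrip('\n').split('\t', 2) for line in lines)
    for word, group in groupby(parsed, key=lambda fields: fields[0]):
        scored = ((int(relevance), url) for _, relevance, url in group)
        yield word, heapq.nlargest(n, scored)


lines = [
    'ductile\t008930\thttp://en.wikipedia.org/wiki/Gold\n',
    'ductile\t008452\thttp://en.wikipedia.org/wiki/Hydroforming\n',
    'ductile\t007930\thttp://en.wikipedia.org/wiki/Liquid_metal_embrittlement\n',
    'soft\t008930\thttp://en.wikipedia.org/wiki/Gold\n',
]
ranking = dict(top_urls_per_word(lines, n=2))
```

In the real job each `(word, relevance, url)` entry would be written to Cassandra instead of being collected in a dict.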

[Diagram: the map-reduce data flow (memory, disk -> compute -> disk) with compute stages writing to cassandra]

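import json
from flask import Flask, Response

app = Flask(__name__)

@app.route('/word/<keyword>')
def fetch_word(keyword):
    # get_cassandra() is not shown on this slide
    db = get_cassandra()
    pages = []
    # first query: urls for the keyword; then one query per url for the details
    results = db.fetchWordResults(keyword)
    for hit in results:
        pages.append(db.fetchPageDetails(hit["url"]))
    return Response(json.dumps(pages), status=200,
                    mimetype="application/json")

if __name__ == '__main__':
    app.run()
Front-End:
prototyping in Flask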
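
The two-step lookup behind the endpoint (keyword to urls, then url to page details) can be exercised without Flask or a running cluster. In this sketch `InMemoryStore` is a hypothetical stand-in for the Cassandra client; only the method names `fetchWordResults` and `fetchPageDetails` come from the slide, and their return shapes are assumed.

```python
import json


class InMemoryStore(object):
    """Hypothetical stand-in for the Cassandra client used by the front-end."""

    def __init__(self, inverted, pages):
        self.inverted = inverted  # keyword -> list of {'relevance', 'url'}
        self.pages = pages        # url -> page details

    def fetchWordResults(self, keyword):
        return self.inverted.get(keyword, [])

    def fetchPageDetails(self, url):
        return self.pages[url]


def search(db, keyword):
    # first lookup: urls for the keyword; second lookup: details per url
    hits = db.fetchWordResults(keyword)
    return json.dumps([db.fetchPageDetails(hit['url']) for hit in hits])


gold_url = 'http://en.wikipedia.org/wiki/Gold'
db = InMemoryStore(
    inverted={'ductile': [{'relevance': 8930, 'url': gold_url}]},
    pages={gold_url: {'url': gold_url, 'title': 'Gold'}},
)
body = search(db, 'ductile')
```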

Expose during Map or Reduce?
Expose in Map
- only access to local information
- simple, distributed "awk" filter
Expose in Reduce
- need to collect data scattered across your cluster
- analysis on all the available data

Latency tradeoffs
Two runtime frameworks:
cassandra : in-memory, low-latency
hadoop : extensive, exhaustive, churns all the data
Statistics and machine learning:
Python and R : they can be used for batch and/or realtime
Fastest analysis: still the domain of C, Java, Scala

Some lessons learned
● Use mapreduce to (pre)process data
● Connect to Cassandra during MR
● Use MR for batch heavy lifting
● Lambda architecture: Fast Data + All Data

Some lessons learned
Expose results to Cassandra for fast access
- responsive apps
- high throughput / low latency
Hadoop as a background tool
- data validation, new extractions, new algorithms
- data harmonization, correction, immutable system of records

The tutorial is on github
https://github.com/natalinobusa/wikipedia

Parallelism Mathematics Programming
Languages Machine Learning Statistics
Big Data Algorithms Cloud Computing
Natalino Busa
@natalinobusa
www.natalinobusa.com
Thanks !
Any questions?
