1©MapR Technologies - Confidential
Mahout, New and Improved
Now with Super Fast Clustering

Agenda
• What happened in Mahout 0.7
– less bloat
– simpler structure
– general cleanup

To Cut Out Bloat

Bloat is Leaving in 0.7
• Lots of abandoned code in Mahout
– average code quality is poor
– no users
– no maintainers
– why do we care?
• Examples
– old LDA
– old Naïve Bayes
– genetic algorithms
• If you care, get on the mailing list
– oops, too late since 0.7 is already released

Integration of Collections

Nobody Cares about Collections
• We need it; math is built on it
• Pull it into math
• Broke the build (battle of the code expanders)
• Fixed now (thanks to Grant)

Pig Vector

What is it?
• Supports Pig access to Mahout functions
• So far: text vectorization
• And classification
• And model saving
• Kind of works (see pigML from Twitter for better functionality)

Compile and Install
• Start by compiling and installing Mahout in your local repository:
    cd ~/Apache
    git clone https://github.com/apache/mahout.git
    cd mahout
    mvn install -DskipTests
• Then do the same with pig-vector:
    cd ~/Apache
    git clone git@github.com:tdunning/pig-vector.git
    cd pig-vector
    mvn package

Tokenize and Vectorize Text
• Tokenization is done using a text encoder, which is defined with three arguments:
– the dimension of the resulting vectors (typically 100,000-1,000,000)
– a description of the variables to be included in the encoding
– the schema of the tuples that Pig will pass, together with their data types
• Example:
    define EncodeVector
      org.apache.mahout.pig.encoders.EncodeVector
      ('10', 'x+y+1', 'x:numeric, y:word, z:text');
• You can also add a Lucene 3.1 analyzer in parentheses if you want something fancier
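
As a rough illustration of what an encoder with these three arguments might do, the sketch below hashes each field named in the formula into a vector of fixed dimension. It assumes a hashed ("hashing trick") encoding; the function names, the use of SHA-256, and the formula parsing are hypothetical and are not taken from pig-vector.

```python
import hashlib

def parse_formula(formula):
    """Split a formula such as 'x+y+1' into variable names and an offset flag."""
    terms = [t.strip() for t in formula.split("+") if t.strip()]
    variables = [t for t in terms if t != "1"]
    return variables, "1" in terms

def bucket(name, value, dim):
    """Hash a (field, value) pair to a vector index in [0, dim)."""
    digest = hashlib.sha256(f"{name}={value}".encode("utf-8")).hexdigest()
    return int(digest, 16) % dim

def encode(record, formula, schema, dim):
    """Encode one record (a dict) as a dense list of length dim."""
    variables, has_offset = parse_formula(formula)
    vec = [0.0] * dim
    if has_offset:
        vec[bucket("offset", "", dim)] += 1.0
    for name in variables:
        kind, value = schema[name], record[name]
        if kind == "numeric":
            vec[bucket(name, "", dim)] += float(value)    # one slot, weighted by the value
        elif kind == "word":
            vec[bucket(name, value, dim)] += 1.0          # one slot per distinct word
        elif kind == "text":
            for token in str(value).lower().split():      # one slot per token
                vec[bucket(name, token, dim)] += 1.0
    return vec

schema = {"x": "numeric", "y": "word", "z": "text"}
v = encode({"x": 3, "y": "red", "z": "not in the formula"}, "x+y+1", schema, dim=10)
```

Because the formula names only x, y, and the offset, the z field is ignored here even though the schema declares it.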

The Formula
• Not normal arithmetic
• Describes which variables to use and whether an offset is included
• Also describes which interactions to use
– but that doesn’t do anything yet!

Load and Encode Data
• Load the data
    a = load '/Users/tdunning/Downloads/NNBench.csv' using PigStorage(',')
        as (x1:int, x2:int, x3:int);
• And encode it
    b = foreach a generate 1 as key, EncodeVector(*) as v;
• Note that the true meaning of * is very subtle
• Now store it
    store b into 'vectors.dat' using com.twitter.elephantbird.pig.store.SequenceFileStorage(
        '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
        '-c com.twitter.elephantbird.pig.util.GenericWritableConverter -t org.apache.mahout.math.VectorWritable');

Train a Model
• Pass previously encoded data to a sequential model trainer
    define train org.apache.mahout.pig.LogisticRegression(
    'iterations=5, inMemory=true, features=100000, categories=alt.atheism
    comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns
    comp.graphics comp.windows.x rec.sport.baseball sci.med talk.politics.mideast
    comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space
    talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt
    soc.religion.christian talk.religion.misc');
• Note that the argument is a string with its own syntax

Reservations and Qualms
• Pig-vector isn’t done
• And it is ugly
• And it doesn’t quite work
• And it is hard to build
• But there seems to be promise

Potential
• Add Naïve Bayes model?
• Somehow simplify the syntax?
• Try a recent version of elephant-bird?
• Switch to pigML?

Large-scale k-Means Clustering

Goals
• Cluster very large data sets
• Facilitate large nearest neighbor search
• Allow very large number of clusters
• Achieve good quality
– low average distance to nearest centroid on held-out data
• Based on Mahout Math
• Runs on Hadoop (really MapR) cluster
• FAST – cluster tens of millions in minutes

Non-goals
• Use map-reduce (but it is there)
• Minimize the number of clusters
• Support metrics other than L2

Anti-goals
• Multiple passes over original data
• Scale as O(k n)

Why?

K-nearest Neighbor with Super Fast k-means

What’s that?
• Find the k nearest training examples
• Use the average value of the target variable from them
• This is easy … but hard
– easy because it is conceptually simple and you have few knobs to turn or models to build
– hard because of the stunning amount of math
– also hard because we need the top 50,000 results, not just the single nearest
• Initial prototype was massively too slow
– 3K queries x 200K examples takes hours
– needed 20M x 25M in the same time
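
The two steps on this slide can be sketched as a brute-force search. This is a minimal illustration only, assuming Euclidean (L2) distance and a plain average; the function name and the toy data are made up for the example.

```python
import heapq
import math

def knn_predict(query, examples, k):
    """Average the target of the k training examples nearest to query.

    examples is a list of (vector, target) pairs; distance is Euclidean (L2).
    """
    nearest = heapq.nsmallest(k, examples, key=lambda ex: math.dist(query, ex[0]))
    return sum(target for _, target in nearest) / len(nearest)

train = [((0.0, 0.0), 1.0), ((1.0, 0.0), 3.0), ((0.0, 1.0), 5.0), ((10.0, 10.0), 100.0)]
prediction = knn_predict((0.1, 0.1), train, k=3)
```

Every query touches every training example, which is why the cost grows with queries times examples and why the prototype figures above were out of reach.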

Modeling with k-nearest Neighbors
[Figure: panels labeled a, b, and c]

Subject to Some Limits

Log Transform Improves Things

Neighbors Depend on Good Presentation

How We Did It
• 2-week hackathon with 6 developers from customer bank
• Agile-ish development
• To avoid IP issues
– all code is Apache Licensed (no ownership question)
– all data is synthetic (no question of private data)
– all development done on individual machines, hosting on GitHub
– open is easier than closed (in this case)
• Goal is new open technology to facilitate new closed solutions
• Ambitious goal of ~1,000,000x speedup
– well, really only 100-1000x after basic hygiene
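
The size of the target can be checked against the prototype figures quoted earlier (3K queries x 200K examples versus 20M x 25M in the same time), assuming brute-force cost is proportional to the number of query-example pairs. That cost model is an assumption; the slides do not state it.

```latex
\frac{(2\times 10^{7})\,(2.5\times 10^{7})}{(3\times 10^{3})\,(2\times 10^{5})}
  = \frac{5\times 10^{14}}{6\times 10^{8}}
  \approx 8.3\times 10^{5}
```

That is within rounding of the ~1,000,000x figure.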

What We Did
• Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid
• Shared memory matrix
– FileBasedMatrix uses mmap to share very large dense matrices
• Searcher interface
– Brute, ProjectionSearch, KmeansSearch, LshSearch
• Super-fast clustering
– Kmeans, StreamingKmeans
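
To show the idea behind an mmap-backed matrix, the sketch below writes a small dense matrix to a file and reads one cell through a memory map. The file layout (row-major, little-endian doubles) and the helper names are assumptions made for the example; they are not the FileBasedMatrix format.

```python
import mmap
import os
import struct
import tempfile

ROWS, COLS = 3, 4
DOUBLE = struct.calcsize("<d")  # 8 bytes per element

def write_matrix(path, rows):
    """Write a dense row-major matrix of little-endian doubles to a file."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(struct.pack(f"<{len(row)}d", *row))

def read_cell(mapped, cols, i, j):
    """Read element (i, j) straight from the mapped bytes, without loading the file."""
    return struct.unpack_from("<d", mapped, (i * cols + j) * DOUBLE)[0]

path = os.path.join(tempfile.mkdtemp(), "matrix.bin")
write_matrix(path, [[float(i * COLS + j) for j in range(COLS)] for i in range(ROWS)])

with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    cell = read_cell(mapped, COLS, 2, 1)
    mapped.close()
```

Several processes mapping the same file read-only share the same physical pages, so a very large matrix needs to be resident only once per machine.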

Projection Search
java.util.TreeSet!
• Projection onto a line provides a total order on data
• Nearby points stay nearby
• Some other points also wind up close
• Search points just before or just after the query point
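
A minimal sketch of the idea, assuming several random directions and a fixed-size window around the query's position in each ordering. A sorted list with bisect stands in for the sorted set; the class name, defaults, and window rule are illustrative and not Mahout's ProjectionSearch.

```python
import bisect
import math
import random

class ProjectionSearch:
    """Approximate nearest-neighbor search using several random 1-D projections."""

    def __init__(self, points, num_projections=4, window=8, seed=0):
        rng = random.Random(seed)
        dim = len(points[0])
        self.points = points
        self.window = window
        self.directions = [[rng.gauss(0, 1) for _ in range(dim)]
                           for _ in range(num_projections)]
        # One sorted list of (projection value, point index) per direction.
        self.orders = [
            sorted((self._project(p, d), i) for i, p in enumerate(points))
            for d in self.directions
        ]

    @staticmethod
    def _project(point, direction):
        return sum(a * b for a, b in zip(point, direction))

    def search(self, query, k=1):
        candidates = set()
        for direction, order in zip(self.directions, self.orders):
            pos = bisect.bisect_left(order, (self._project(query, direction), -1))
            lo, hi = max(0, pos - self.window), min(len(order), pos + self.window)
            candidates.update(i for _, i in order[lo:hi])
        # Rank the small candidate set by true distance.
        ranked = sorted(candidates, key=lambda i: math.dist(query, self.points[i]))
        return ranked[:k]

rng = random.Random(1)
data = [(rng.random(), rng.random()) for _ in range(200)]
index = ProjectionSearch(data, window=200)   # window covers all points: exact search
exact = min(range(len(data)), key=lambda i: math.dist((0.5, 0.5), data[i]))
```

With a small window the result is approximate; more projections or a wider window trade speed for accuracy.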

How Many Projections?

K-means Search
• Simple idea
– pre-cluster the data
– to find the nearest points, search the nearest clusters
• Recursive application
– to search a cluster, use a Searcher!
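
The simple idea can be sketched as below. The pre-clustering here is deliberately crude (random points as centroids, one assignment pass), and each cluster is searched by brute force; both are simplifications for the example, and the function names are hypothetical.

```python
import math
import random

def build_index(points, num_clusters, seed=0):
    """Pre-cluster: pick random points as centroids, assign each point to its nearest one."""
    rng = random.Random(seed)
    centroids = rng.sample(points, num_clusters)
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(num_clusters), key=lambda c: math.dist(p, centroids[c]))
        clusters[nearest].append(p)
    return centroids, clusters

def search(query, centroids, clusters, k=1, probes=2):
    """Search only the `probes` clusters whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [p for c in order[:probes] for p in clusters[c]]
    return sorted(candidates, key=lambda p: math.dist(query, p))[:k]

rng = random.Random(2)
points = [(rng.random(), rng.random()) for _ in range(500)]
centroids, clusters = build_index(points, num_clusters=10)
approx = search((0.3, 0.7), centroids, clusters, k=3, probes=10)  # all clusters probed
exact = sorted(points, key=lambda p: math.dist((0.3, 0.7), p))[:3]
```

In the recursive form, the brute-force scan inside each cluster would be replaced by another searcher.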
But This Requires k-means!
 Need a new k-means algorithm to get speed
– Hadoop is very slow at iterative map-reduce
– Maybe Pregel clones like Giraph would be better
– Or maybe not
 Streaming k-means is
– One pass (through the original data)
– Very fast (20 us per data point with threads on one node)
– Very parallelizable

Basic Method
• Step 1: use a single pass of k-means with very many clusters
– output is a bad-ish clustering but a good surrogate
• Step 2: use the weighted centroids from step 1 to do in-memory clustering
– output is a good clustering with fewer clusters
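
The second step can be illustrated with ordinary Lloyd iterations run on weighted points. This is a sketch under stated assumptions: the seeding rule (the k heaviest inputs) and the fixed iteration count are choices made for the example, not the method used in the talk.

```python
import math

def weighted_kmeans(centroids, weights, k, iterations=10):
    """Lloyd's iterations on weighted points; returns k (center, weight) pairs.

    Seeds are the k heaviest inputs, a simplification chosen for this sketch.
    """
    order = sorted(range(len(centroids)), key=lambda i: -weights[i])
    centers = [centroids[i] for i in order[:k]]
    totals = [0.0] * k
    for _ in range(iterations):
        sums = [[0.0] * len(centers[0]) for _ in range(k)]
        totals = [0.0] * k
        for point, w in zip(centroids, weights):
            j = min(range(k), key=lambda c: math.dist(point, centers[c]))
            totals[j] += w
            for d, x in enumerate(point):
                sums[j][d] += w * x
        # Weighted mean of each cluster; an empty cluster keeps its old center.
        centers = [
            tuple(s / totals[j] for s in sums[j]) if totals[j] > 0 else centers[j]
            for j in range(k)
        ]
    return list(zip(centers, totals))

sketch = [(0.0, 0.0), (0.2, 0.0), (10.0, 10.0), (10.0, 10.4)]
counts = [30.0, 10.0, 5.0, 15.0]
result = weighted_kmeans(sketch, counts, k=2)
```

Because the input is a small set of weighted centroids rather than the original data, this step fits in memory even when the original data does not.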

Algorithmic Details
    for each data point x_n:
        compute distance to nearest centroid, ∂
        sample u uniformly from [0, 1]; if u > ∂/ß, add x_n to nearest centroid
        else create new centroid
        if number of centroids > k log n:
            recursively cluster centroids
            set ß = 1.5 ß if number of centroids did not decrease
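
The pseudocode above can be turned into a small runnable sketch. The starting value of ß (called beta in the code), the weighted-mean merge, the shuffle before re-clustering, and the max(1, log n) guard for small n are assumptions added to make it executable; none of them comes from the slides.

```python
import math
import random

def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    best = min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i][0]))
    return best, math.dist(point, centroids[best][0])

def add_point(point, weight, centroids, beta, rng):
    """Merge a weighted point into its nearest centroid, or start a new centroid."""
    if not centroids:
        centroids.append((tuple(point), weight))
        return
    i, dist = nearest(point, centroids)
    if rng.random() > dist / beta:           # close points are usually merged
        c, w = centroids[i]
        total = w + weight
        merged = tuple((w * a + weight * b) / total for a, b in zip(c, point))
        centroids[i] = (merged, total)
    else:                                     # distant points usually start a new centroid
        centroids.append((tuple(point), weight))

def streaming_kmeans(points, k, beta=0.1, seed=0):
    """One pass over points; returns a list of (centroid, weight) pairs."""
    rng = random.Random(seed)
    centroids = []
    for n, x in enumerate(points, start=1):
        add_point(x, 1.0, centroids, beta, rng)
        if len(centroids) > k * max(1.0, math.log(n)):
            # Recursively cluster the centroids themselves by re-adding them.
            before = len(centroids)
            old, centroids = centroids, []
            rng.shuffle(old)
            for c, w in old:
                add_point(c, w, centroids, beta, rng)
            if len(centroids) >= before:
                beta *= 1.5                    # threshold too tight: loosen it
    return centroids

rng = random.Random(42)
data = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
        for cx, cy in [(0, 0), (5, 5)] for _ in range(500)]
rng.shuffle(data)
sketch = streaming_kmeans(data, k=2)
```

The weights of the returned centroids always sum to the number of input points, which is what lets a later weighted clustering step treat them as a stand-in for the original data.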

How It Works
• Result is a large set of centroids
– these provide an approximation of the original distribution
– we can cluster centroids to get a close approximation of clustering the original data
– or we can just use the result directly

Parallel Speedup?
[Figure: time per point (μs) versus number of threads, comparing the threaded version, the non-threaded version, and perfect scaling]

Warning, Recursive Descent
• Inner loop requires finding nearest centroid
• With lots of centroids, this is slow
• But wait, we have classes to accelerate that!
  (Let’s not use k-means searcher, though)
• Empirically, projection search beats 64-bit LSH by a bit
– More optimization may change this story
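
For readers unfamiliar with 64-bit LSH, the sketch below builds a 64-bit signature from the signs of 64 random projections and compares signatures by Hamming distance. Random-hyperplane hashing is an assumption here; the slides do not say which LSH family was used, and this family tracks angle rather than L2 distance.

```python
import random

def make_hasher(dim, bits=64, seed=0):
    """Build a function mapping a vector to a `bits`-bit integer signature."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]

    def signature(vector):
        sig = 0
        for plane in planes:
            dot = sum(a * b for a, b in zip(plane, vector))
            sig = (sig << 1) | (1 if dot >= 0 else 0)   # one bit per hyperplane
        return sig

    return signature

def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")

sign = make_hasher(dim=3)
base = sign((1.0, 2.0, 3.0))
scaled = sign((2.0, 4.0, 6.0))        # same direction: identical signature
opposite = sign((-1.0, -2.0, -3.0))   # opposite direction: every bit flips
```

Candidates with a small Hamming distance to the query are then checked with the true distance, much as projection search checks the points near the query in each ordering.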

Moving to Ultra Mega Super Scale
• Map-reduce implementation nearly trivial
• Map: rough-cluster input data, output ß, weighted centroids
• Reduce:
– single reducer gets all centroids
– if too many centroids, merge using recursive clustering
– optionally do final clustering in-memory
• Combiner possible, but not important
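
The map and reduce roles can be sketched with plain functions. To keep the example short and deterministic, a simple radius-based merge stands in for the streaming k-means pass, and the radius plays the part of ß; that substitution and all names are assumptions for illustration.

```python
import math

def rough_cluster(weighted_points, radius):
    """Stand-in for the streaming pass: merge each point into the first centroid within radius."""
    centroids = []
    for point, weight in weighted_points:
        for i, (c, w) in enumerate(centroids):
            if math.dist(point, c) <= radius:
                total = w + weight
                merged = tuple((w * a + weight * b) / total for a, b in zip(c, point))
                centroids[i] = (merged, total)
                break
        else:
            centroids.append((tuple(point), weight))
    return centroids

def mapper(split, radius):
    """Map: rough-cluster one input split and emit its weighted centroids."""
    return rough_cluster([(p, 1.0) for p in split], radius)

def reducer(mapper_outputs, radius, max_centroids):
    """Reduce: a single reducer gathers all centroids and merges them if there are too many."""
    centroids = [c for output in mapper_outputs for c in output]
    while len(centroids) > max_centroids:
        radius *= 1.5                      # loosen the threshold and re-cluster
        centroids = rough_cluster(centroids, radius)
    return centroids

splits = [
    [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)],
    [(0.0, 0.1), (5.1, 5.0), (5.0, 5.1)],
]
merged = reducer([mapper(s, radius=0.5) for s in splits], radius=0.5, max_centroids=2)
```

Each mapper sees only its own split, and only the small centroid sets travel to the reducer, so the original data is read once.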

• Contact:
– tdunning@maprtech.com
– @ted_dunning
• Slides and such:
– http://info.mapr.com/ted-boston-2012-07
• Hash tags: #boston-hug #mahout #mapr

Boston Hug by Ted Dunning 2012
