Hunter Kelly
@retnuh
All the Topics on the Interwebs
Perhaps this?
Or maybe this?
embassy wikileaks assange
german merkel cables
snowden spiegel spying
Wut?
❖ What are we actually doing?
➢ Mining web pages for insights
❖ How?
➢ Using Machine Learning to do heavy
lifting
■ Use Classifiers to filter/bucket the
data
■ Build Topic Models to try to discover
concepts related to words
❖ Getting Data
➢ DMOZ
➢ Common Crawl
❖ Manipulating Data
➢ Spark
➢ Sparkling
■ RDDs
■ DataFrames
❖ Data Science
➢ MLLib
➢ Classification - Random Forests™
➢ LDA (Latent Dirichlet Allocation)
DMOZ
Common Crawl
❖ DMOZ
➢ “The largest human-edited directory of the web”
➢ Useful when you think of it in terms of
“free crowdsourced labeled data”
➢ Fairly ancient, borderline decrepit
➢ Crowdsourced is a double-edged sword
❖ Common Crawl (CC)
➢ “an open repository of web crawl data
that can be accessed and analyzed by
anyone.”
➢ Monthly crawls
➢ Readily accessible index
➢ Tons of free data - raw, links, plain text
formats
❖ How to use them together!
➢ Use DMOZ to draw samples of positive and negative “seed links”
➢ Look up and expand your “seed links” using the CC index
➢ Fetch your data with little/no fuss using the CC index information
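To make that concrete, here is a minimal sketch (not from the talk) of querying the Common Crawl index server; the crawl id, the clj-http and cheshire dependencies, and the cc-index-lookup name are all assumptions:

(require '[clj-http.client :as http]
         '[cheshire.core :as json]
         '[clojure.string :as str])

;; CDX-style lookup: the index server answers with one JSON record per
;; line, each locating a capture inside a WARC file via :filename,
;; :offset and :length - enough for a ranged fetch, little/no fuss.
(defn cc-index-lookup [domain]
  (let [resp (http/get "https://index.commoncrawl.org/CC-MAIN-2015-48-index"
                       {:query-params {"url" (str domain "/*")
                                       "output" "json"}})]
    (map #(json/parse-string % true)
         (str/split-lines (:body resp)))))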
Spark &
Sparkling
❖ Apache Spark
➢ The “next big thing”
➢ Or arguably the “current” big thing
❖ Sparkling
➢ Clojure bindings to Spark
➢ Great Presentation (highly recommended)
➢ RDDs
➢ DataFrames
RDDs
❖ RDDs
➢ Resilient Distributed Datasets
➢ Easy to think of them as partitioned (or
sharded) seqs
➢ Transformations (map, filter, etc) are lazy
➢ Operations (count, collect, reduce, etc)
cause evaluation
➢ Very familiar paradigms for Clojure
programmers
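A minimal Sparkling sketch of that laziness (assumes a local context; not from the talk):

(require '[sparkling.conf :as conf]
         '[sparkling.core :as spark])

(def ctx (spark/spark-context (-> (conf/spark-conf)
                                  (conf/master "local[*]")
                                  (conf/app-name "rdd-demo"))))

(def squares (->> (spark/parallelize ctx (range 100))
                  (spark/filter even?)    ;; transformation - lazy
                  (spark/map #(* % %))))  ;; transformation - still lazy

(spark/count squares)             ;; action - forces evaluation, => 50
(take 3 (spark/collect squares))  ;; => (0 4 16)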
(defn sieve-prime-multiples [n primes numbers]
  (let [max-prime       (last primes)
        upto            (* max-prime max-prime)
        prime-multiples (->> primes
                             (r/mapcat #(generate-multiples % n (odd? %)))
                             (into #{}))
        candidates      (->> numbers
                             (r/remove prime-multiples))
        new-primes      (->> candidates
                             (r/filter #(< % upto))
                             r/foldcat
                             sort
                             (into []))
        remaining       (->> candidates
                             (r/remove (set new-primes))
                             r/foldcat)]
    [new-primes remaining]))
Clojure using Reducers
(defn sieve-prime-multiples [ctx n primes numbers-rdd]
  (let [max-prime           (last primes)
        upto                (* max-prime max-prime)
        prime-multiples-rdd (->> (spark/parallelize ctx primes)
                                 (spark/flat-map
                                   #(generate-multiples % n (odd? %))))
        candidates-rdd      (spark/cache (.subtract numbers-rdd
                                                    prime-multiples-rdd))
        new-primes-rdd      (->> candidates-rdd
                                 (spark/filter #(< % upto))
                                 spark/cache)
        new-primes          (vec (sort (spark/collect new-primes-rdd)))
        remaining-rdd       (.subtract candidates-rdd new-primes-rdd)]
    (.unpersist candidates-rdd false)
    (.unpersist new-primes-rdd false)
    [new-primes remaining-rdd]))
Clojure using Spark
❖ A Historical Tangent
➢ “Those who cannot remember the past
are condemned to repeat it.”
➢ ~15 years ago, everything was running MySQL, Oracle, etc.
➢ ~7 years ago, everyone was abandoning SQL+RDBMS for NoSQL
➢ Now looping back to SQL - Spark SQL,
Google F1, etc.
DataFrames
❖ DataFrames
➢ DataFrames are the new hotness
➢ It’s how Python and R can now achieve
similar speeds
➢ The Catalyst execution engine can plan intelligently - behind the scenes it generates source code, makes heavy use of Scala macros, optimizes away boxing/unboxing calls, etc.
➢ Focus is clearly on DataFrames and
upcoming DataSets
❖ DataFrames (cont)
➢ Great in Scala, not so much via JVM
interop
➢ Heavy use of Scala magic like implicits,
etc.
➢ Working with DataFrames from Clojure
can be… less than pleasant
➢ Scala folks really like their static, declared
types
➢ Going to get worse with DataSets
(def FEATURE-TYPE [[:feature DataTypes/IntegerType]])
(def FEATURE-SCHEMA (types->schema FEATURE-TYPE))

(defn create-feature-table [sql-ctx table-name features]
  (let [ctx          (.sparkContext sql-ctx)
        features-rdd (->> (spark/parallelize (JavaSparkContext. ctx)
                                             (seq features))
                          (spark/map (fn [i] (RowFactory/create
                                               (to-array [i])))))
        features-df  (.createDataFrame sql-ctx features-rdd FEATURE-SCHEMA)]
    (.registerTempTable features-df table-name)
    features-df))
Creating a single column DataFrame
(let [query-df (-> bow-df
                   (.select "word" (into-array ["index"])))]
  (reduce (fn [[bow rbow] row]
            [(assoc bow (.getString row 0) (.getInt row 1))
             (assoc rbow (.getInt row 1) (.getString row 0))])
          [{} {}]
          (.collectAsList query-df)))
(-> bow-df
    (.join features-df (.equalTo ind-col (.col features-df "feature")))
    (.select (into-array [(.col bow-df "*") feature-index-col]))
    (.orderBy (into-array [feature-index-col])))
Machine Learning
Elevator Pitch
❖ Machine Learning Key Points
➢ Uses statistical methods on large
amounts of data to hopefully gain insights
➢ Uses vectors of numbers extracted (by
you) from your data - “feature vectors”
➢ Classification puts things into buckets, e.g. “fashion-related website” vs. “everything else”
➢ Topic modeling - way of finding patterns in
a bunch of documents - a “corpus”
MLLib
❖ MLLib
➢ Spark’s Machine Learning (ML) library
➢ “Its goal is to make practical machine
learning scalable and easy”
➢ Divides into two packages:
■ spark.mllib - built on top of RDDs
■ spark.ml - built on top of DataFrames
❖ MLLib (cont)
➢ All the basics - Vectors, Sparse Vectors,
LabeledPoints, etc.
➢ A good variety of algorithms, all designed
for running in parallel
➢ Well documented
➢ Large community
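For reference, the basics via interop - a dense and a sparse vector holding the same data (a sketch, nothing project-specific):

(import '[org.apache.spark.mllib.linalg Vectors])

;; dense: every value stored; sparse: size + indices + non-zero values
(def dense  (Vectors/dense (double-array [1.0 0.0 3.0])))
(def sparse (Vectors/sparse 3 (int-array [0 2]) (double-array [1.0 3.0])))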
MLLib gives us this...
But we want this!
❖ Example - Metrics
➢ BinaryClassificationMetrics has some
useful things, but not basic things
➢ Have to use MulticlassMetrics for some of
the most wanted metrics, even on a
binary classifier
➢ Neither actually gives you the count of items by label - but BinaryClassificationMetrics logs it to INFO
➢ End up iterating your data 3 (!) times to
get all desired metrics
Computing metrics

(defn metrics [rdd model]
  (let [pl            (->> rdd
                           (spark/map (fn [point]
                                        (let [y (.label point)
                                              x (.features point)]
                                          (spark/tuple (.predict model x) y))))
                           spark/cache)
        multi-metrics (MulticlassMetrics. (.rdd pl))
        metrics       (BinaryClassificationMetrics. (.rdd pl))
        r             {:area-under-pr (.areaUnderPR metrics)
                       :f-measure     (.fMeasure multi-metrics 1.0) ;; Others elided
                       :label-counts  (->> rdd
                                           (spark/map-to-pair
                                             (fn [point] (spark/tuple (.label point) 1)))
                                           spark/count-by-key)}]
    (.unpersist pl false)
    r))
❖ Examples - Eye on the prize?
➢ HashingTF - oh boy
■ Lose all access to the original words
■ Uses gigantic Array instead of a
HashMap
➢ ChiSqSelector - used to select top N
features
■ but how do we determine N? Can’t ask
■ End up grubbing around in the source to
find that it uses Statistics/chiSqTest
Computing Chi-Square Test
(let [sql-ctx            (spark-util/make-sql-context ctx)
      labels-features-df (spark-util/maybe-sample-df options
                           (spark-util/load-table sql-ctx "features" input))
      labeled-points-rdd (->> (lf/load-labels-and-features-from-parquet
                                labels-features-df true)
                              (spark/map
                                (fn [m] (get-in m [:labeled-points :term-count]))))
      [bow rbow]         (bow/load-bow-maps-from-table sql-ctx
                           (spark-util/load-table sql-ctx "bow" bow-input))
      chi-sq-arr         (Statistics/chiSqTest labeled-points-rdd)]
  (doseq [[ind tst] (map-indexed vector (seq chi-sq-arr))]
    (log/info "Feature:" ind (rbow ind) "tst:" tst)))
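A possible follow-on, reusing the chi-sq-arr binding above: choose N by thresholding p-values rather than guessing it for ChiSqSelector (a sketch; the function name is made up):

(defn significant-feature-indices [chi-sq-arr alpha]
  (->> (map-indexed vector (seq chi-sq-arr))
       (filter (fn [[_ tst]] (< (.pValue tst) alpha)))
       (map first)))

;; (count (significant-feature-indices chi-sq-arr 0.01)) - one way to pick N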
Classification w/
Random Forests
❖ Classification
➢ Using lots of data to tell things apart
➢ Can put stuff into two buckets (or
“classes”) - Binary Classifier
➢ Or into many buckets - Multi-class
Classifier
➢ Lots of different techniques
➢ Supervised learning - each sample needs:
■ “features” - a vector of numeric data
■ “label” - a label specifying its class
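One labeled sample, as MLLib expects it (a sketch; the feature values are made up):

(import '[org.apache.spark.mllib.linalg Vectors]
        '[org.apache.spark.mllib.regression LabeledPoint])

;; label 1.0 = "fashion", 0.0 = "everything else"
(LabeledPoint. 1.0 (Vectors/dense (double-array [12.0 0.0 7.0])))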
❖ The Bag of Words
➢ We started with very basic word cleansing - lowercase, remove non-letters/digits, 3 char min length, drop tokens that are just numbers (see the sketch after this slide)
➢ Managed to make it this far in the talk without having to use word count!
➢ But ultimately most Data Science/ML tasks involving text end up heavily dependent on word count
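The clean-word-seq helper used later isn't shown in the deck; a sketch of the cleansing just described might look like:

(require '[clojure.string :as str])

(defn clean-word-seq [text]
  (->> (str/split (str/lower-case (or text "")) #"[^a-z0-9]+")
       (filter #(>= (count %) 3))            ;; 3 char min length
       (remove #(re-matches #"[0-9]+" %))))  ;; drop tokens that are just numbers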
❖ The Bag of Words (cont)
➢ Ended up with too many words (1.3M) even on a sample
➢ We were working on a bare baseline, so no stopword removal or stemming, following the KISS principle
➢ We did require that words occur on >= 5 distinct sites (not documents), which reduced the size to 460k words
(defn create-bow-site-occurance [json-lines-rdd]
  (->> json-lines-rdd
       (spark/map-to-pair
         (fn [m] (spark/tuple (site (:url m))
                              (set (clean-word-seq (:raw_text m))))))
       (spark/reduce-by-key union)
       (spark/flat-map-to-pair
         (s-de/key-value-fn
           (fn [site words] (map spark/tuple words (repeat 1)))))
       (spark/reduce-by-key +)
       (spark/filter
         (s-de/key-value-fn
           (fn [w c] (>= c MIN-SITE-OCCURANCE-COUNT))))
       spark/sort-by-key))
Bag of Words
❖ Random Forests™
➢ Ensemble of Decision Trees
➢ Uses “bootstrapping” for selection of
feature set and training set
➢ Not “Deep Learning” but extremely easy
to use and very effective
➢ “Any sufficiently advanced technology is
indistinguishable from magic.”
➢ Able to get pretty decent results! F-measure of 0.86
Train the Random Forest from LabeledPoints
(defn train-random-forest [num-trees max-depth max-bins seed
                           labeled-points-rdd]
  (let [p {:num-classes 2, :categorical-feature-info {},
           :feature-subset-strategy "auto", :impurity "gini",
           :max-depth max-depth, :max-bins max-bins}]
    (RandomForest/trainClassifier labeled-points-rdd
                                  (:num-classes p)
                                  (:categorical-feature-info p)
                                  num-trees
                                  (:feature-subset-strategy p)
                                  (:impurity p)
                                  (:max-depth p)
                                  (:max-bins p)
                                  seed)))
Prepare to train/test RandomForest
(defn load-and-train-random-forest [rdd num-trees max-depth max-bins
                                    seed & [sample-fraction]]
  (let [sampled-rdd  (if sample-fraction
                       (spark/sample false sample-fraction seed rdd)
                       rdd)
        labeled-rdd  (->> sampled-rdd
                          (spark/map #(labeled-point lf/fashion? %)))
        [train test] (.randomSplit labeled-rdd (double-array [0.9 0.1]) seed)
        cached-train (spark/cache train)
        cached-test  (spark/cache test)
        model        (train-random-forest num-trees max-depth max-bins seed
                                          cached-train)]
    [cached-train cached-test model]))
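Tying the two together (a sketch; the RDD name and parameter values are illustrative):

(let [[train test model] (load-and-train-random-forest feature-maps-rdd
                                                        100  ;; num-trees
                                                        10   ;; max-depth
                                                        32   ;; max-bins
                                                        42)] ;; seed
  (metrics test model))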
Topic Modelling
with LDA
❖ LDA - Latent Dirichlet Allocation
➢ Topic Model which infers topics from text
corpus
➢ Topics -> cluster centers, docs -> rows
➢ Features are vectors of word counts (Bag
of Words)
➢ Unsupervised Learning technique (but
you do supply the topic count)
❖ LDA (cont)
➢ Quite tetchy to run at large scale
➢ OutOfMemory error on executors
➢ Job aborted due to stage failure: Serialized task 4341:0 was
365752339 bytes, which exceeds max allowed: spark.akka.frameSize
(134217728 bytes) - reserved (204800 bytes). Consider increasing
spark.akka.frameSize or using broadcast variables for large values.
➢ WTF?
➢ BTW, do not ever change “spark.akka.frameSize”...
❖ LDA (moar cont)
➢ Finally able to get a trained model after reducing the BoW to a more manageable size (~11k words, down from ~160k)
➢ Trained on ~100k documents, roughly
even split between fashion/non-fashion
➢ These models are for demonstration purposes; moar fanciness planned
Train an LDA Model
(defn train-lda-model [num-topics seed features-fn maps-rdd]
  (let [rdd         (->> maps-rdd
                         (spark/map (fn [{:keys [doc-number] :as m}]
                                      (spark/tuple doc-number (features-fn m))))
                         spark/cache)
        corpus-size (spark/count rdd)
        mbf         (mini-batch-fraction-batch-size corpus-size 5000)
        max-iters   (int (Math/ceil (/ mbf))) ;; (/ mbf) is 1/mbf
        optimizer   (doto (OnlineLDAOptimizer.)
                      (.setMiniBatchFraction (min 1.0 mbf)))
        model       (-> (doto (LDA.)
                          (.setOptimizer optimizer)
                          (.setK num-topics)
                          (.setSeed seed)
                          (.setMaxIterations max-iters))
                        (.run (.rdd rdd)))]
    (.unpersist rdd false)
    model))
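Once trained, topics can be inspected; a sketch using the rbow index->word map from earlier:

;; describeTopics returns, per topic, parallel arrays of term indices
;; and term weights (a scala.Tuple2 per topic)
(doseq [[k topic] (map-indexed vector (.describeTopics model 10))]
  (println "Topic" k ":"
           (map vector (map rbow (._1 topic)) (._2 topic))))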
Demo!
So what’s
the point?
❖ So what did we do?
➢ We took pre-scraped, “pre-labeled” data
➢ Used Clojure and Spark/Sparkling to
munge the data
➢ Used state-of-the-art ML tools to analyze the data
➢ Explored for insights
❖ So what can YOU do?
➢ This will work for almost ANY domain
➢ There’s a lot of interesting information
even at this stage
➢ There’s a ton of interesting directions this
can go
■ Run classifier over all of CC data
■ Build domain-specific LDA models
➢ Do cool things and have fun doing it!
Hunter Kelly
@retnuh
https://github.com/retnuh