Hunter Kelly
@retnuh
All the Topics on the Interwebs
Perhaps this?
Or maybe this?
embassy wikileaks assange
german merkel cables
snowden spiegel spying
Wut?
❖ What are we actually doing?
➢ Mining web pages for insights
❖ How?
➢ Using Machine Learning to do heavy
lifting
■ Use Classifiers to filter/bucket the
data
■ Build Topic Models to try to discover
concepts related to words
❖ Getting Data
➢ DMOZ
➢ Common Crawl
❖ Manipulating Data
➢ Spark
➢ Sparkling
■ RDDs
■ DataFrames
❖ Data Science
➢ MLLib
➢ Classification - Random Forests™
➢ LDA (Latent Dirichlet Allocation)
DMOZ
Common Crawl
❖ DMOZ
➢ “The largest human-edited directory of the web”
➢ Useful when you think of it in terms of
“free crowdsourced labeled data”
➢ Fairly ancient, borderline decrepit
➢ Crowdsourced is a double-edged sword
❖ Common Crawl (CC)
➢ “an open repository of web crawl data
that can be accessed and analyzed by
anyone.”
➢ Monthly crawls
➢ Readily accessible index
➢ Tons of free data - raw, links, plain text
formats
❖ How to use them together!
➢ Use DMOZ to draw samples of positive and negative “seed links”
➢ Look up and expand your “seed links” using the CC index
➢ Fetch your data with little/no fuss using the CC index information
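To make that concrete, here is a minimal sketch (not from the talk) of querying the Common Crawl index server; the crawl id, the clj-http and cheshire dependencies, and the cc-index-lookup name are all assumptions:

(require '[clj-http.client :as http]
         '[cheshire.core :as json]
         '[clojure.string :as str])

;; CDX-style lookup: the index server answers with one JSON record per
;; line, each locating a capture inside a WARC file via :filename,
;; :offset and :length - enough for a ranged fetch, little/no fuss.
(defn cc-index-lookup [domain]
  (let [resp (http/get "https://index.commoncrawl.org/CC-MAIN-2015-48-index"
                       {:query-params {"url" (str domain "/*")
                                       "output" "json"}})]
    (map #(json/parse-string % true)
         (str/split-lines (:body resp)))))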
Spark &
Sparkling
❖ Apache Spark
➢ The “next big thing”
➢ Or arguably the “current” big thing
❖ Sparkling
➢ Clojure bindings to Spark
➢ Great Presentation (highly recommended)
➢ RDDs
➢ DataFrames
RDDs
❖ RDDs
➢ Resilient Distributed Datasets
➢ Easy to think of them as partitioned (or
sharded) seqs
➢ Transformations (map, filter, etc) are lazy
➢ Operations (count, collect, reduce, etc)
cause evaluation
➢ Very familiar paradigms for Clojure
programmers
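A minimal Sparkling sketch of that laziness (assumes a local context; not from the talk):

(require '[sparkling.conf :as conf]
         '[sparkling.core :as spark])

(def ctx (spark/spark-context (-> (conf/spark-conf)
                                  (conf/master "local[*]")
                                  (conf/app-name "rdd-demo"))))

(def squares (->> (spark/parallelize ctx (range 100))
                  (spark/filter even?)    ;; transformation - lazy
                  (spark/map #(* % %))))  ;; transformation - still lazy

(spark/count squares)             ;; action - forces evaluation, => 50
(take 3 (spark/collect squares))  ;; => (0 4 16)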
(defn sieve-prime-multiples [n primes numbers]
  (let [max-prime       (last primes)
        upto            (* max-prime max-prime)
        prime-multiples (->> primes
                             (r/mapcat #(generate-multiples % n (odd? %)))
                             (into #{}))
        candidates      (->> numbers
                             (r/remove prime-multiples))
        new-primes      (->> candidates
                             (r/filter #(< % upto))
                             r/foldcat
                             sort
                             (into []))
        remaining       (->> candidates
                             (r/remove (set new-primes))
                             r/foldcat)]
    [new-primes remaining]))
Clojure using Reducers
(defn sieve-prime-multiples [ctx n primes numbers-rdd]
  (let [max-prime           (last primes)
        upto                (* max-prime max-prime)
        prime-multiples-rdd (->> (spark/parallelize ctx primes)
                                 (spark/flat-map
                                   #(generate-multiples % n (odd? %))))
        candidates-rdd      (spark/cache (.subtract numbers-rdd
                                                    prime-multiples-rdd))
        new-primes-rdd      (->> candidates-rdd
                                 (spark/filter #(< % upto))
                                 spark/cache)
        new-primes          (vec (sort (spark/collect new-primes-rdd)))
        remaining-rdd       (.subtract candidates-rdd new-primes-rdd)]
    (.unpersist candidates-rdd false)
    (.unpersist new-primes-rdd false)
    [new-primes remaining-rdd]))
Clojure using Spark
❖ A Historical Tangent
➢ “Those who cannot remember the past
are condemned to repeat it.”
➢ ~15 years ago, everything was running MySQL, Oracle, etc.
➢ ~7 years ago, everyone was abandoning SQL+RDBMS for NoSQL
➢ Now looping back to SQL - Spark SQL,
Google F1, etc.
DataFrames
❖ DataFrames
➢ DataFrames are the new hotness
➢ It’s how Python and R can now achieve
similar speeds
➢ The Catalyst execution engine can plan intelligently - behind the scenes it generates source code, makes heavy use of Scala macros, optimizes away boxing/unboxing calls, etc.
➢ Focus is clearly on DataFrames and
upcoming DataSets
❖ DataFrames (cont)
➢ Great in Scala, not so much via JVM
interop
➢ Heavy use of Scala magic like implicits,
etc.
➢ Working with DataFrames from Clojure
can be… less than pleasant
➢ Scala folks really like their static, declared
types
➢ Going to get worse with DataSets
(def FEATURE-TYPE [[:feature DataTypes/IntegerType]])
(def FEATURE-SCHEMA (types->schema FEATURE-TYPE))

(defn create-feature-table [sql-ctx table-name features]
  (let [ctx          (.sparkContext sql-ctx)
        features-rdd (->> (spark/parallelize (JavaSparkContext. ctx)
                                             (seq features))
                          (spark/map (fn [i] (RowFactory/create
                                               (to-array [i])))))
        features-df  (.createDataFrame sql-ctx features-rdd FEATURE-SCHEMA)]
    (.registerTempTable features-df table-name)
    features-df))
Creating a single column DataFrame
(let [query-df (-> bow-df
                   (.select "word" (into-array ["index"])))]
  (reduce (fn [[bow rbow] row]
            [(assoc bow (.getString row 0) (.getInt row 1))
             (assoc rbow (.getInt row 1) (.getString row 0))])
          [{} {}]
          (.collectAsList query-df)))
(-> bow-df
    (.join features-df (.equalTo ind-col (.col features-df "feature")))
    (.select (into-array [(.col bow-df "*") feature-index-col]))
    (.orderBy (into-array [feature-index-col])))
Machine Learning
Elevator Pitch
❖ Machine Learning Key Points
➢ Uses statistical methods on large
amounts of data to hopefully gain insights
➢ Uses vectors of numbers extracted (by
you) from your data - “feature vectors”
➢ Classification puts things into buckets, e.g. “fashion-related website” vs. “everything else”
➢ Topic modeling - way of finding patterns in
a bunch of documents - a “corpus”
MLLib
❖ MLLib
➢ Spark’s Machine Learning (ML) library
➢ “Its goal is to make practical machine
learning scalable and easy”
➢ Divides into two packages:
■ spark.mllib - built on top of RDDs
■ spark.ml - built on top of DataFrames
❖ MLLib (cont)
➢ All the basics - Vectors, Sparse Vectors,
LabeledPoints, etc.
➢ A good variety of algorithms, all designed
for running in parallel
➢ Well documented
➢ Large community
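For reference, the basics via interop - a dense and a sparse vector holding the same data (a sketch, nothing project-specific):

(import '[org.apache.spark.mllib.linalg Vectors])

;; dense: every value stored; sparse: size + indices + non-zero values
(def dense  (Vectors/dense (double-array [1.0 0.0 3.0])))
(def sparse (Vectors/sparse 3 (int-array [0 2]) (double-array [1.0 3.0])))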
MLLib gives us this...
But we want this!
❖ Example - Metrics
➢ BinaryClassificationMetrics has some
useful things, but not basic things
➢ Have to use MulticlassMetrics for some of
the most wanted metrics, even on a
binary classifier
➢ Neither actually gives you the count of items by label - but BinaryClassificationMetrics logs it to INFO
➢ End up iterating your data 3 (!) times to
get all desired metrics
Computing metrics

(defn metrics [rdd model]
  (let [pl            (->> rdd
                           (spark/map (fn [point]
                                        (let [y (.label point)
                                              x (.features point)]
                                          (spark/tuple (.predict model x) y))))
                           spark/cache)
        multi-metrics (MulticlassMetrics. (.rdd pl))
        metrics       (BinaryClassificationMetrics. (.rdd pl))
        r             {:area-under-pr (.areaUnderPR metrics)
                       :f-measure     (.fMeasure multi-metrics 1.0) ;; Others elided
                       :label-counts  (->> rdd
                                           (spark/map-to-pair
                                             (fn [point] (spark/tuple (.label point) 1)))
                                           spark/count-by-key)}]
    (.unpersist pl false)
    r))
❖ Examples - Eye on the prize?
➢ HashingTF - oh boy
■ Lose all access to the original words
■ Uses gigantic Array instead of a
HashMap
➢ ChiSqSelector - used to select top N
features
■ but how do we determine N? Can’t ask
■ End up grubbing around in the source to
find that it uses Statistics/chiSqTest
Computing Chi-Square Test
(let [sql-ctx            (spark-util/make-sql-context ctx)
      labels-features-df (spark-util/maybe-sample-df options
                           (spark-util/load-table sql-ctx "features" input))
      labeled-points-rdd (->> (lf/load-labels-and-features-from-parquet
                                labels-features-df true)
                              (spark/map
                                (fn [m] (get-in m [:labeled-points :term-count]))))
      [bow rbow]         (bow/load-bow-maps-from-table sql-ctx
                           (spark-util/load-table sql-ctx "bow" bow-input))
      chi-sq-arr         (Statistics/chiSqTest labeled-points-rdd)]
  (doseq [[ind tst] (map-indexed vector (seq chi-sq-arr))]
    (log/info "Feature:" ind (rbow ind) "tst:" tst)))
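A possible follow-on, reusing the chi-sq-arr binding above: choose N by thresholding p-values rather than guessing it for ChiSqSelector (a sketch; the function name is made up):

(defn significant-feature-indices [chi-sq-arr alpha]
  (->> (map-indexed vector (seq chi-sq-arr))
       (filter (fn [[_ tst]] (< (.pValue tst) alpha)))
       (map first)))

;; (count (significant-feature-indices chi-sq-arr 0.01)) - one way to pick N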
Classification w/
Random Forests
❖ Classification
➢ Using lots of data to tell things apart
➢ Can put stuff into two buckets (or
“classes”) - Binary Classifier
➢ Or into many buckets - Multi-class
Classifier
➢ Lots of different techniques
➢ Supervised learning - each sample needs:
■ “features” - a vector of numeric data
■ “label” - a label specifying its class
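One labeled sample, as MLLib expects it (a sketch; the feature values are made up):

(import '[org.apache.spark.mllib.linalg Vectors]
        '[org.apache.spark.mllib.regression LabeledPoint])

;; label 1.0 = "fashion", 0.0 = "everything else"
(LabeledPoint. 1.0 (Vectors/dense (double-array [12.0 0.0 7.0])))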
❖ The Bag of Words
➢ We started with very basic word cleansing - lowercase, remove non-letters/digits, 3 char min length, drop tokens that are just numbers (see the sketch after this slide)
➢ Managed to make it this far in the talk without having to use word count!
➢ But ultimately most Data Science/ML tasks involving text end up heavily dependent on word count
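The clean-word-seq helper used later isn't shown in the deck; a sketch of the cleansing just described might look like:

(require '[clojure.string :as str])

(defn clean-word-seq [text]
  (->> (str/split (str/lower-case (or text "")) #"[^a-z0-9]+")
       (filter #(>= (count %) 3))            ;; 3 char min length
       (remove #(re-matches #"[0-9]+" %))))  ;; drop tokens that are just numbers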
❖ The Bag of Words (cont)
➢ Ended up with too many words (1.3M) even on a sample
➢ We were working on a bare baseline, so no stopword removal or stemming, following the KISS principle
➢ We did require that words occur on >= 5 distinct sites (not documents), which reduced the size to 460k words
(defn create-bow-site-occurance [json-lines-rdd]
  (->> json-lines-rdd
       (spark/map-to-pair
         (fn [m] (spark/tuple (site (:url m))
                              (set (clean-word-seq (:raw_text m))))))
       (spark/reduce-by-key union)
       (spark/flat-map-to-pair
         (s-de/key-value-fn
           (fn [site words] (map spark/tuple words (repeat 1)))))
       (spark/reduce-by-key +)
       (spark/filter
         (s-de/key-value-fn
           (fn [w c] (>= c MIN-SITE-OCCURANCE-COUNT))))
       spark/sort-by-key))
Bag of Words
❖ Random Forests™
➢ Ensemble of Decision Trees
➢ Uses “bootstrapping” for selection of
feature set and training set
➢ Not “Deep Learning” but extremely easy
to use and very effective
➢ “Any sufficiently advanced technology is
indistinguishable from magic.”
➢ Able to get pretty decent results! F-measure of 0.86
Train the Random Forest from LabeledPoints
(defn train-random-forest [num-trees max-depth max-bins seed
                           labeled-points-rdd]
  (let [p {:num-classes 2, :categorical-feature-info {},
           :feature-subset-strategy "auto", :impurity "gini",
           :max-depth max-depth, :max-bins max-bins}]
    (RandomForest/trainClassifier labeled-points-rdd
                                  (:num-classes p)
                                  (:categorical-feature-info p)
                                  num-trees
                                  (:feature-subset-strategy p)
                                  (:impurity p)
                                  (:max-depth p)
                                  (:max-bins p)
                                  seed)))
Prepare to train/test RandomForest
(defn load-and-train-random-forest [rdd num-trees max-depth max-bins
                                    seed & [sample-fraction]]
  (let [sampled-rdd  (if sample-fraction
                       (spark/sample false sample-fraction seed rdd)
                       rdd)
        labeled-rdd  (->> sampled-rdd
                          (spark/map #(labeled-point lf/fashion? %)))
        [train test] (.randomSplit labeled-rdd (double-array [0.9 0.1]) seed)
        cached-train (spark/cache train)
        cached-test  (spark/cache test)
        model        (train-random-forest num-trees max-depth max-bins seed
                                          cached-train)]
    [cached-train cached-test model]))
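Tying the two together (a sketch; the RDD name and parameter values are illustrative):

(let [[train test model] (load-and-train-random-forest feature-maps-rdd
                                                        100  ;; num-trees
                                                        10   ;; max-depth
                                                        32   ;; max-bins
                                                        42)] ;; seed
  (metrics test model))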
Topic Modelling
with LDA
❖ LDA - Latent Dirichlet Allocation
➢ Topic Model which infers topics from text
corpus
➢ Topics -> cluster centers, docs -> rows
➢ Features are vectors of word counts (Bag
of Words)
➢ Unsupervised Learning technique (but
you do supply the topic count)
❖ LDA (cont)
➢ Quite tetchy to run at large scale
➢ OutOfMemory error on executors
➢ Job aborted due to stage failure: Serialized task 4341:0 was
365752339 bytes, which exceeds max allowed: spark.akka.frameSize
(134217728 bytes) - reserved (204800 bytes). Consider increasing
spark.akka.frameSize or using broadcast variables for large values.
➢ WTF?
➢ BTW, do not ever change “spark.akka.frameSize”...
❖ LDA (moar cont)
➢ Finally able to get a trained model after reducing the BoW to a more manageable size (~11k words, down from ~160k)
➢ Trained on ~100k documents, roughly
even split between fashion/non-fashion
➢ These models are for demonstration purposes; moar fanciness planned
Train an LDA Model
(defn train-lda-model [num-topics seed features-fn maps-rdd]
  (let [rdd         (->> maps-rdd
                         (spark/map (fn [{:keys [doc-number] :as m}]
                                      (spark/tuple doc-number (features-fn m))))
                         spark/cache)
        corpus-size (spark/count rdd)
        mbf         (mini-batch-fraction-batch-size corpus-size 5000)
        max-iters   (int (Math/ceil (/ mbf))) ;; (/ mbf) is 1/mbf
        optimizer   (doto (OnlineLDAOptimizer.)
                      (.setMiniBatchFraction (min 1.0 mbf)))
        model       (-> (doto (LDA.)
                          (.setOptimizer optimizer)
                          (.setK num-topics)
                          (.setSeed seed)
                          (.setMaxIterations max-iters))
                        (.run (.rdd rdd)))]
    (.unpersist rdd false)
    model))
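Once trained, topics can be inspected; a sketch using the rbow index->word map from earlier:

;; describeTopics returns, per topic, parallel arrays of term indices
;; and term weights (a scala.Tuple2 per topic)
(doseq [[k topic] (map-indexed vector (.describeTopics model 10))]
  (println "Topic" k ":"
           (map vector (map rbow (._1 topic)) (._2 topic))))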
Demo!
So what’s
the point?
❖ So what did we do?
➢ We took pre-scraped, “pre-labeled” data
➢ Used Clojure and Spark/Sparkling to
munge the data
➢ Used state-of-the-art ML tools to analyze the data
➢ Explored for insights
❖ So what can YOU do?
➢ This will work for almost ANY domain
➢ There’s a lot of interesting information
even at this stage
➢ There’s a ton of interesting directions this
can go
■ Run classifier over all of CC data
■ Build domain-specific LDA models
➢ Do cool things and have fun doing it!
Hunter Kelly
@retnuh
https://github.com/retnuh