Large scale text processing pipeline
with Spark ML and GraphFrames
Alexey Svyatkovskiy, Kosuke Imai, Jim Pivarski
Princeton University
Outline
• Policy diffusion detection in U.S. legislature: the problem
• Solving the problem at scale
• Apache Spark
• Text processing pipeline: core modules
• Text processing workflow
• Data ingestion
• Pre-processing and feature extraction
• All-pairs similarity calculation
• Reformulating the problem as a network graph problem
• Interactive analysis
• All-pairs similarity join
• Candidate selection: clustering, hashing
• Policy diffusion detection modes
• Interactive analysis with the Histogrammar tool
• Conclusion and next steps
Policy diffusion detection: the problem
• Policy diffusion detection is a problem from a wider class of fundamental text mining problems
of finding similar items
• It occurs when government decisions in a given jurisdiction are systematically influenced by
prior policy choices made in other jurisdictions, in a different state in a different year
• Example: “Stand your ground” (SYG) bills, first introduced in
Florida, Michigan and South Carolina in 2005
– A number of states passed a form of
SYG bill in 2012 after T. Martin’s death
• We focus on a type of policy diffusion that can be
detected by examining the similarity of bill texts
[Map legend: states that have passed SYG laws;
states that have passed SYG laws since T. Martin’s death;
states that have proposed SYG laws after T. Martin’s death]
Source: LCAV.org
Anatomy of Spark applications
• Spark uses a master/worker architecture with a central coordinator (the driver) and many
distributed workers (executors)
• We chose Scala for our implementation (both driver and executors) because, unlike Python
and R, it is statically typed, and the cost of communicating with the JVM is minimal
[Architecture diagram: the driver program's SparkContext talks to the cluster manager, which
allocates worker nodes/containers; each executor holds a cache and runs tasks]
Hardware specifications
• A 10-node SGI Linux Hadoop cluster
• Intel Xeon E5-2680 v2 @ 2.80 GHz CPUs, 256 GB RAM
• All the servers mounted on one rack and
interconnected using a 10 Gigabit Ethernet switch
• Cloudera distribution of Hadoop configured in high-availability
mode using two namenodes
• Spark applications scheduled using YARN
• Distributed file system (HDFS)
• An alternative configuration uses the SLURM resource manager,
deploying Spark in standalone mode
Text processing pipeline: core modules
• Data ingestion: raw unstructured text data from a static source (HDFS) or a streaming
(Kafka or Akka) input source, converted to structured, compressed data in Avro format
• Distributed processing with Spark ML: pre-processing, feature extraction, dimension
reduction; candidate selection via clustering or hashing; all-pairs document similarity
calculation
• Graph processing: represent document pairs and similarities as a weighted undirected
graph; perform PageRank and graph queries using the GraphFrames package
• Interactive analysis: statistical analysis and aggregation using the Histogrammar
package, plotting using the Bokeh and Matplotlib front-ends
Data ingestion
• Our dataset is based on the LexisNexis StateNet dataset,
which contains a total of more than 7 million legislative
bills from 50 US states from 1991 to 2016
• The initial dataset contains unstructured text documents
subdivided by year and state
• We use the Apache Avro serialization framework to store the
data and access it efficiently in Spark applications (see the
reading sketch below)
– Binary JSON meta-format with the schema stored alongside the data
– Row-oriented, good for nested schemas. Supports schema evolution
Example raw bill, with the key fields annotated:
– Unique identifier of the bill: used to construct predicates and filter
the data before calculating joins
– Entire contents of the bill as a string: not read into memory during the
candidate selection and filtering steps, thanks to the Avro schema
evolution property
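A minimal sketch of reading the Avro-serialized bills into a Spark dataframe, assuming the Spark-2.x-era spark-avro package is on the classpath; the HDFS path and the field names (docid, state, year, content) are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DataIngestion").getOrCreate()

// Read the Avro-serialized bills; requires the spark-avro package.
// Hypothetical schema: docid (unique bill identifier), state, year,
// content (entire bill text as a string).
val bills = spark.read.format("com.databricks.spark.avro")
  .load("hdfs:///data/legislation/bills.avro")

// Filter on the lightweight fields (state, year) to build predicates
// before any join, without materializing the full bill text.
val subset = bills.filter("state IN ('FL', 'MI', 'SC') AND year >= 2005")
```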
Analysis workflow 1
• Cleaning, normalization, stopword removal (Dataframe API): raw text of
legislative proposals contains spurious white spaces and non-alphanumeric
characters, which bear no meaning for the analysis and often represent an
obstacle for tokenization
• Tokenization, n-grams (Dataframe API): prepare multisets of tokens,
N consecutive words, optionally adding bag-of-words features
• TF-IDF (Dataframe API): feature weighting to reflect the importance of a
term t to a document d in corpus D; words appearing very frequently
across the corpus bear little meaning
• Truncated SVD (RDD-based: linear algebra, BLAS): feature vectors are so
high-dimensional, or so numerous, that they cannot all fit in memory;
SVD performs a semantic decomposition of the TF-IDF matrix and extracts
the concepts which are most relevant for clustering
• K-means clustering: extract candidate documents, i.e. those feature
vectors that we need to test for similarity
• All-pairs similarity within each cluster (RDD-based)
Analysis workflow 2
• (optional) Cleaning, normalization, stopword removal
• N-gram / N-shingle (Dataframe API): prepare sets of tokens or shingles,
N consecutive words and/or characters
• Min hashing: extract minhash signatures, integer vectors that
represent the sets and reflect their similarity
• Locality sensitive hashing: extract candidate documents, i.e. a pair of
documents that hashes into the same bucket for a fraction of bands
makes a candidate pair
• All-pairs similarity calculation on the candidate pairs
Spark ML: a quick review
• Spark ML provides a standardized API for machine learning algorithms to make it easier
to combine multiple algorithms into a single pipeline or workflow
– Similar to the scikit-learn Python library
Basic components of a Pipeline are:
• Transformer: an abstract class to apply a transformation to datasets/dataframes
– UnaryTransformer abstract class: takes an input column, applies a transformation,
and outputs the result as a new column
– Has a transform() method
• Estimator: implements an algorithm which can be fit on a dataframe to produce a
Transformer. For instance: a learning algorithm is an Estimator which is trained on a
dataframe to produce a model.
– Has a fit() method
• Parameter: an API to pass parameters to Transformers and Estimators
Putting it all into a Pipeline
• Preprocessing and feature
extraction steps
– Use the StopWordsRemover, RegexTokenizer,
MinHashLSH, HashingTF transformers
– IDF estimator
• Prepare custom transformers
to plug into the Pipeline
– Ex: extend UnaryTransformer and
override createTransformFunc
with a custom UDF
• Clustering: KMeans and LSH
• Ability to perform hyper-parameter
search and cross-validation
Example ML pipeline and example custom transformer: see the sketch below
(and the backup for a full snippet)
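A minimal sketch of such a pipeline, assuming a dataframe `bills` with a `content` column; the column names, the n-gram size, and the hashing/cluster parameters are illustrative (the deck quotes k = 150 for a 3-state subset and 2^18 hash features). The custom cleaner shows the UnaryTransformer pattern named above:

```scala
import org.apache.spark.ml.{Pipeline, UnaryTransformer}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{HashingTF, IDF, NGram, RegexTokenizer, StopWordsRemover}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

// Custom transformer: extend UnaryTransformer and override createTransformFunc
// to strip spurious whitespace and non-alphanumeric characters.
class TextCleaner(override val uid: String)
    extends UnaryTransformer[String, String, TextCleaner] {
  def this() = this(Identifiable.randomUID("textCleaner"))
  override protected def createTransformFunc: String => String =
    _.toLowerCase.replaceAll("[^a-z0-9\\s]", " ").replaceAll("\\s+", " ").trim
  override protected def outputDataType: DataType = StringType
}

val cleaner   = new TextCleaner().setInputCol("content").setOutputCol("cleaned")
val tokenizer = new RegexTokenizer().setInputCol("cleaned").setOutputCol("tokens")
val remover   = new StopWordsRemover().setInputCol("tokens").setOutputCol("filtered")
val ngram     = new NGram().setN(2).setInputCol("filtered").setOutputCol("ngrams")
val tf        = new HashingTF().setInputCol("ngrams").setOutputCol("rawFeatures")
  .setNumFeatures(1 << 18)                 // 2^18 hash buckets
val idf       = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val kmeans    = new KMeans().setK(150).setFeaturesCol("features").setPredictionCol("cluster")

val pipeline  = new Pipeline()
  .setStages(Array(cleaner, tokenizer, remover, ngram, tf, idf, kmeans))
val model     = pipeline.fit(bills)        // bills: DataFrame with a "content" column
val clustered = model.transform(bills)     // adds a "cluster" label column
```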
All-pairs similarity: overview
• Our goal is to go beyond identifying known diffusion topics (“stand your ground”
bills, cyberstalking, marijuana laws) and to perform an all-pairs
comparison
• Previous work in policy diffusion has been unable to make an all-pairs
comparison between bills over a lengthy time period because of the computational
intensity
– A brute-force all-pairs calculation between the texts of the state bills requires calculating a
cross-join, yielding O(10^13) distinct pairs on the dataset considered
– As a substitute, scholars have studied single topic areas
• Focusing on the document vectors which are likely to be highly similar is
essential for all-pairs comparison at scale
• Modern studies employ variations of nearest-neighbor search and locality sensitive
hashing (LSH), as well as sampling techniques that select a subset of rows of the
TF-IDF matrix based on its sparsity (DIMSUM)
• Our approach utilizes clustering and hashing methods (details on the next
slides)
All-pairs similarity, workflow 1: clustering
• The first approach utilizes K-means clustering to identify groups of candidate documents which are likely
to belong to the same diffusion topic, reducing the number of comparisons in the all-pairs similarity join
calculation
– Distance-based clustering using the fast square distance calculation in Spark ML
– Introduce a dataframe column with a cluster label and perform the
all-pairs calculation within each cluster (see the sketch below)
– Determine the optimum number of clusters empirically, by repeating the
calculation for a range of values of k and scoring on a processing-time
versus WCSSE plane
– 150 clusters for a 3-state subset, 400 clusters for the entire dataset
– 2-3 orders of magnitude fewer combinatorial pairs to calculate compared
to the brute-force approach
The brute-force solution is too slow…
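A minimal sketch of the within-cluster self-join, assuming the `clustered` dataframe from the pipeline above carries hypothetical `docid`, `cluster`, and `features` columns:

```scala
import org.apache.spark.sql.functions.col

// Self-join restricted to documents sharing a cluster label; the docid
// ordering keeps each unordered pair exactly once and drops self-pairs.
val candidatePairs = clustered.as("a").join(
    clustered.as("b"),
    col("a.cluster") === col("b.cluster") && col("a.docid") < col("b.docid"))
  .select(col("a.docid").as("bill1"), col("b.docid").as("bill2"),
          col("a.features").as("features1"), col("b.features").as("features2"))
```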
All-pairs similarity, workflow 2: LSH
• N-shingle features with relatively large N > 5, hashed, converted to sets
• Characteristic matrix instead of a TF-IDF matrix (values 0 or 1):
N-shingles as rows, M documents as columns
• Extract MinHash signatures for each column (document) using a family
of hash functions h1(x), h2(x), h3(x), … hn(x)
– Hash several times
– The similarity of two signatures is the fraction of the hash functions on which they agree
– The resulting signature matrix has K rows (the number of hash functions)
and M columns (documents)
• LSH: focus on pairs of signatures which are likely to be from similar
documents
– Hash the columns of the signature matrix M to many buckets
– Partition the signature matrix into bands of rows. For each band, hash each column into
k buckets
– A pair of documents that hashes into the same bucket for a fraction
of bands makes a candidate pair
• Use the MinHashLSH class in Spark 2.1 (see the sketch below)
[Figure: example with a family of 4 hash functions]
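A minimal sketch using the MinHashLSH class from Spark 2.1; the dataframe name, column names, hash-table count, and distance threshold are assumptions:

```scala
import org.apache.spark.ml.feature.MinHashLSH
import org.apache.spark.sql.functions.col

// shingled: DataFrame with a sparse binary "features" column, one row per
// document (i.e. the columns of the characteristic matrix).
val mh = new MinHashLSH()
  .setNumHashTables(20)          // size of the hash-function family (an assumption)
  .setInputCol("features")
  .setOutputCol("hashes")
val mhModel = mh.fit(shingled)

// Approximate similarity self-join: keep candidate pairs whose Jaccard
// distance is below 0.5; the docid ordering keeps each pair once.
val candidates = mhModel
  .approxSimilarityJoin(shingled, shingled, 0.5, distCol = "jaccardDist")
  .filter(col("datasetA.docid") < col("datasetB.docid"))
```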
Similarity measures
• The Jaccard similarity between two sets is the size of their intersection divided by the size
of their union (a worked example follows below):

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$

– Key distance: consider feature vectors as sets of indices
– Hash distance: consider dense vector representations
• Cosine distance between feature vectors:

$$C(A, B) = \frac{A \cdot B}{|A| \, |B|}$$

– Convert distances to similarities assuming inverse proportionality,
rescaling to the [0, 100] range and adding a regularization term

Workflow                         Feature type                     Similarity measure
Workflow 1: K-means clustering   Unigram, TF-IDF, truncated SVD   Cosine, Euclidean
Workflow 2: LSH                  N-gram with N = 5, MinHash       Jaccard
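As a worked example of the Jaccard formula, a two-line Scala sketch over index sets:

```scala
// Jaccard similarity of two sets of feature indices, per the formula above.
def jaccard(a: Set[Int], b: Set[Int]): Double =
  (a intersect b).size.toDouble / (a union b).size

jaccard(Set(1, 2, 3, 4), Set(3, 4, 5))  // intersection 2, union 5 => 0.4
```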
Interactive analysis with the Histogrammar tool
• Histogrammar is a suite of composable aggregators with
– A language-independent specification with implementations
in Scala and Python
– A grammar of composable aggregation routines
– Plotting in Matplotlib and Bokeh
– Unlike RDD.histogram, Histogrammar lets one build 2D histograms,
profiles, sum-of-squares statistics, etc. in an open-ended way
http://histogrammar.org
Contact Jim Pivarski
or me to contribute!
Example interactive session with Histogrammar: see the sketch below
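A minimal sketch of such a session, loosely following the Histogrammar Scala tutorial; the `Pair` record type and the `pairsRdd` RDD of scored bill pairs are assumptions:

```scala
import org.dianahep.histogrammar._

case class Pair(similarity: Double)    // hypothetical record of a scored bill pair

// 100 bins over the rescaled [0, 100] similarity range
val empty = Bin(100, 0.0, 100.0, {p: Pair => p.similarity})

// Fold the aggregator over a Spark RDD; Increment/Combine are the
// Histogrammar helpers for RDD.aggregate
val hist = pairsRdd.aggregate(empty)(new Increment, new Combine)

// The filled container serializes to language-independent JSON, which
// the Python front-end can plot with Bokeh or Matplotlib
println(hist.toJson.stringify)
```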
GraphFrames
• GraphFrames is an extension of Spark that makes it possible to run graph queries and
graph algorithms on Spark dataframes
– A GraphFrame is constructed from two dataframes (a dataframe of nodes
and an edge dataframe), making it easy to integrate the graph processing
step into the pipeline along with Spark ML (see the sketch below)
– Graph queries: like a Cypher query on a graph database (e.g. Neo4j)
– Graph algorithms: PageRank, Dijkstra
– Possibility to eliminate joins
Edges are bill pairs with similarity above a threshold, for example:

bill1           bill2           similarity
FL/2005/SB436   MI/2005/SB1046  91.38
FL/2005/SB436   MI/2005/HB5143  91.29
FL/2005/SB436   SC/2005/SB1131  82.89

[Figure: the corresponding graph over nodes SB436, SB1046, HB5143, SB1131]
The graph is currently undirected; we plan to switch to a graph directed by time
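A minimal sketch of the graph construction and PageRank step, assuming dataframes like those above; the column names and the similarity threshold are illustrative:

```scala
import org.apache.spark.sql.functions.col
import org.graphframes.GraphFrame

// Vertices: one row per bill, with the mandatory "id" column
val vertices = bills.select(col("docid").as("id"))

// Edges: similar bill pairs, with the mandatory "src"/"dst" columns;
// keep only pairs above a similarity threshold (70 is an assumption)
val edges = similarities
  .filter(col("similarity") > 70.0)
  .select(col("bill1").as("src"), col("bill2").as("dst"), col("similarity"))

val g = GraphFrame(vertices, edges)

// PageRank to rank the most influential bills/states for diffusion
val ranks = g.pageRank.resetProbability(0.15).maxIter(10).run()
ranks.vertices.orderBy(col("pagerank").desc).show()
```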
Applications of the policy diffusion detection tool
• The policy diffusion detection tool can be used in a number of modes:
– Identification of groups of diffused bills in the dataset,
given a diffusion topic (for instance, the “Stand your ground” policy,
cyberstalking, marijuana laws, …)
– Discovery of diffusion topics: consider the top-similarity bills within each cluster;
careful filtering of uniform bills and interstate compact bills is necessary, as
they would show high similarity as well
– Identification of minimum-cost paths connecting two specific legislative
proposals on a graph
– Identification of the most influential US states for policy diffusion
• A political science research paper on applications of the tool is currently in
progress
Performance summary, details on the Spark configuration
• The policy diffusion analysis pipeline uses map, filter (narrow), join,
and aggregateByKey (wide) transformations
– The deterministic all-pairs calculation from approach 1 involves a two-sided
join: a heavy shuffle with O(100 TB) of intermediate data
– It benefits from increasing spark.shuffle.memoryFraction to 0.6
• Spark applications have been deployed on the Hadoop cluster with YARN
(see the configuration sketch below)
– 40 executor containers, each using 3 executor cores and 15 GB of
RAM per JVM
– Use the external shuffle service inside the YARN node manager to improve
the stability of memory-intensive jobs with larger numbers of executor containers
– Custom partitioning to avoid straggler tasks
• Calculate the efficiency of parallel execution as

$$E = \frac{T_1}{N_{\mathrm{exec}} \cdot T_N}$$

where T_1 is the runtime of the single-executor case and T_N is the runtime with N_exec executors
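A minimal sketch of the corresponding session configuration; the application name is an assumption, and the executor resources are set at submit time:

```scala
import org.apache.spark.sql.SparkSession

// Executor resources are passed at submit time, e.g.:
//   spark-submit --master yarn --num-executors 40 \
//     --executor-cores 3 --executor-memory 15g ...
val spark = SparkSession.builder()
  .appName("PolicyDiffusion")                       // hypothetical name
  .config("spark.shuffle.memoryFraction", "0.6")    // the setting quoted on the slide
  .config("spark.shuffle.service.enabled", "true")  // external shuffle service under YARN
  .getOrCreate()
```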
Conclusions
• Evaluated the Apache Spark framework for the data-intensive
machine learning problem of policy diffusion detection in US legislature
– Provided a scalable method to calculate all-pairs similarity based on K-means
clustering and MinHashLSH
– Implemented a text processing pipeline utilizing the Apache Avro serialization
framework, Spark ML, GraphFrames, and the Histogrammar suite of data
aggregation primitives
– Efficiently calculated the all-pairs comparison between legislative bills and estimated
relationships between bills on a graph; the approach is potentially applicable to a wider
class of fundamental text mining problems of finding similar items
• Tuned Spark internals (partitioning, shuffle) to obtain good scaling up
to O(100) processing cores, yielding 80% parallel efficiency
• Utilized the Histogrammar tool as a part of the framework to enable
interactive analysis, allowing a researcher to perform the analysis in the Scala
language, integrating well with the Hadoop ecosystem
Thank You.
Alexey Svyatkovskiy
email: alexeys@princeton.edu
twitter: @asvyatko
Kosuke Imai: kimai@princeton.edu
Jim Pivarski: pivarski@fnal.gov
Backup
Cluster layout:
• Service node 1: Zookeeper, journal node, primary namenode, httpfs
• Service node 2: Zookeeper, journal node, resource manager, Hive master
• Service node 3: Zookeeper, journal node, standby namenode, history server
• Datanodes: Spark, HDFS datanode service
• Login node: Spark, Hue, YP server, LDAP, SSH access
• A 10 node SGI Linux Hadoop cluster
• Intel Xeon CPU E5-2680 v2 @ 2.80 GHz CPU processors, 256 GB RAM
• All the servers mounted on one rack and interconnected using a 10
Gigabit Ethernet switch
• Cloudera distribution of Hadoop configured in high-availability mode using two
namenodes
• Schedule Spark applications using YARN
• Distributed file system (HDFS)
TF-IDF
• TF-IDF weighting: reflect the importance of a term t to a document d in corpus D
(a worked numeric example follows below)
– HashingTF transformer, which maps a feature to an index by applying MurmurHash 3
– IDF estimator down-weights components which appear frequently in the corpus

$$IDF(t, D) = \log\frac{|D| + 1}{DF(t, D) + 1}, \qquad TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D)$$
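As a worked example, assuming the natural-log convention: with $|D| = 10^6$ bills, a term appearing in $10^5$ of them gets $IDF \approx \log(10) \approx 2.3$, while a term appearing in only 10 bills gets $IDF \approx \log(10^6/11) \approx 11.4$, so rare terms dominate the weighting.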
Dimension reduction: truncated SVD
• SVD is applied to the TF-IDF document-feature matrix to perform a semantic decomposition
and extract the concepts which are most relevant for classification
• RDD-based API; a RowMatrix transposition and a matrix truncation method have to be
implemented alongside the SVD itself (see the sketch below)
• SVD factorizes the document-feature matrix M (m x n) into three matrices U, Σ, and V, such
that:

$$M = U \cdot \Sigma \cdot V^{T}$$

with U of size m x k, Σ of size k x k, and V of size n x k. Here m is the number of legislative
bills (order of 10^6), k is the number of concepts, and n is the number of features (2^18),
so that m ≫ n ≫ k
• The left singular matrix U is represented as a distributed row matrix, while Σ and V are
sufficiently small to fit into Spark driver memory
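A minimal sketch of the RDD-based SVD step; the input RDD name and the number of retained concepts are assumptions:

```scala
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// tfidfRows: RDD[org.apache.spark.mllib.linalg.Vector] of TF-IDF rows (hypothetical)
val mat = new RowMatrix(tfidfRows)

val k = 100  // number of concepts to keep (an assumption)
val svd = mat.computeSVD(k, computeU = true)

val U = svd.U  // distributed m x k row matrix
val s = svd.s  // k singular values, held on the driver
val V = svd.V  // n x k matrix, held on the driver
```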
Histogrammar Aggregator class: example
An example of a custom dataset Aggregator class using fill and
the container from Histogrammar
Example: hashing bands
[Figure: a signature matrix partitioned into b bands of r rows each, hashed into buckets;
columns 2 and 6 hash into the same bucket and would make a candidate pair, while
columns 6 and 7 are rather different]
Example ML pipeline (simplified)
