Parallel Non-blocking Deterministic
Algorithm for Online Topic Modeling
Murat Apishev
great-mel@yandex.ru
Oleksandr Frei
oleksandr.frei@gmail.com
HSE, MSU, MIPT
April 8, 2016
Contents
1 Introduction
Topic modeling
ARTM
BigARTM
2 Parallel implementation
Synchronous algorithms
Asynchronous algorithms
Comparison
3 Applications
The RSF project
Conclusions
Introduction
Parallel implementation
Applications
Topic modeling
ARTM
BigARTM
Topic modeling
Topic modeling: an application of machine learning to statistical text analysis.
Topic: the specific terminology of a subject area, i.e. a set of terms (unigrams or n-grams) that frequently appear together in documents.
A topic model uncovers the latent semantic structure of a text collection:
topic t is a probability distribution p(w|t) over terms w;
document d is a probability distribution p(t|d) over topics t.
Applications: information retrieval for long-text queries, classification, categorization, and summarization of texts.
Murat Apishev great-mel@yandex.ru AIST 2016 3 / 33
Introduction
Parallel implementation
Applications
Topic modeling
ARTM
BigARTM
Topic modeling task
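Given:
$W$: set (vocabulary) of terms (unigrams or n-grams),
$D$: set (collection) of text documents $d \subset W$,
$n_{dw}$: how many times term $w$ appears in document $d$.
Find: the model $p(w|d) = \sum_{t \in T} \varphi_{wt} \theta_{td}$ with parameters $\Phi$ (of size $W \times T$) and $\Theta$ (of size $T \times D$):
$\varphi_{wt} = p(w|t)$: probabilities of terms $w$ in each topic $t$,
$\theta_{td} = p(t|d)$: probabilities of topics $t$ in each document $d$.
Criterion: log-likelihood maximization
$$\sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \varphi_{wt} \theta_{td} \to \max_{\Phi, \Theta};$$
$$\varphi_{wt} \geq 0; \quad \sum_{w} \varphi_{wt} = 1; \quad \theta_{td} \geq 0; \quad \sum_{t} \theta_{td} = 1.$$
Issue: the stochastic matrix factorization problem is ill-posed, because its solution is not unique:
$$\Phi \Theta = (\Phi S)(S^{-1} \Theta) = \Phi' \Theta'.$$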
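The non-uniqueness stated above can be checked numerically. The following is a minimal sketch in plain Python; the toy matrices, the mixing matrix `s` and the helper names `matmul` and `log_likelihood` are illustrative assumptions, not part of the slides.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def log_likelihood(n, phi, theta):
    """Sum over d, w of n_dw * ln(sum_t phi_wt * theta_td); n is indexed [w][d]."""
    p = matmul(phi, theta)  # p[w][d] = p(w|d)
    return sum(n_wd * math.log(p[w][d])
               for w, row in enumerate(n)
               for d, n_wd in enumerate(row) if n_wd > 0)

# Toy data (made up): 3 terms, 2 topics, 2 documents.
phi = [[0.5, 0.25], [0.25, 0.25], [0.25, 0.5]]   # phi[w][t], columns sum to 1
theta = [[0.5, 0.25], [0.5, 0.75]]               # theta[t][d], columns sum to 1
n = [[3, 1], [0, 2], [1, 4]]                     # n[w][d], term counts

# S mixes the two topics. It is chosen so that phi2 and theta2 stay
# non-negative with unit column sums, i.e. they form a valid model again.
s = [[0.9, 0.1], [0.1, 0.9]]
s_inv = [[1.125, -0.125], [-0.125, 1.125]]
phi2 = matmul(phi, s)
theta2 = matmul(s_inv, theta)

# Two different parameter pairs, the same likelihood (up to rounding).
print(log_likelihood(n, phi, theta), log_likelihood(n, phi2, theta2))
```

Both pairs satisfy the constraints and give the same product ΦΘ, so the likelihood alone cannot choose between them. This is the motivation for the additional criteria introduced on the following slides.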
PLSA and EM-algorithm
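Log-likelihood maximization:
$$\sum_{d \in D} \sum_{w \in W} n_{dw} \ln \sum_{t} \varphi_{wt} \theta_{td} \to \max_{\Phi, \Theta}$$
EM-algorithm: the simple iteration method for the system of equations
E-step:
$$p_{tdw} = \operatorname*{norm}_{t \in T} \bigl( \varphi_{wt} \theta_{td} \bigr)$$
M-step:
$$\varphi_{wt} = \operatorname*{norm}_{w \in W} \bigl( n_{wt} \bigr), \quad n_{wt} = \sum_{d \in D} n_{dw} p_{tdw}$$
$$\theta_{td} = \operatorname*{norm}_{t \in T} \bigl( n_{td} \bigr), \quad n_{td} = \sum_{w \in d} n_{dw} p_{tdw}$$
where
$$\operatorname*{norm}_{i \in I} x_i = \frac{\max\{x_i, 0\}}{\sum_{j \in I} \max\{x_j, 0\}}$$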
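The E-step and M-step above can be written down almost literally. The sketch below is plain, unoptimized Python on a made-up toy collection; the function names (`norm`, `em_iteration`, `log_likelihood`) and the data layout are assumptions for illustration and do not reflect how BigARTM is implemented.

```python
import math
import random

def norm(xs):
    """The norm operator: clip negatives to zero, then normalize to sum 1."""
    clipped = [max(x, 0.0) for x in xs]
    total = sum(clipped)
    return [x / total for x in clipped] if total > 0 else clipped

def em_iteration(docs, phi, theta):
    """One EM iteration. docs[d] maps term index w -> n_dw,
    phi[w][t] = p(w|t), theta[t][d] = p(t|d)."""
    num_w, num_t, num_d = len(phi), len(theta), len(docs)
    n_wt = [[0.0] * num_t for _ in range(num_w)]
    n_td = [[0.0] * num_d for _ in range(num_t)]
    for d, doc in enumerate(docs):
        for w, n_dw in doc.items():
            # E-step: p_tdw = norm over t of phi_wt * theta_td
            p_tdw = norm([phi[w][t] * theta[t][d] for t in range(num_t)])
            for t in range(num_t):
                n_wt[w][t] += n_dw * p_tdw[t]
                n_td[t][d] += n_dw * p_tdw[t]
    # M-step: phi_wt = norm over w of n_wt, theta_td = norm over t of n_td
    phi_cols = [norm([n_wt[w][t] for w in range(num_w)]) for t in range(num_t)]
    theta_cols = [norm([n_td[t][d] for t in range(num_t)]) for d in range(num_d)]
    new_phi = [[phi_cols[t][w] for t in range(num_t)] for w in range(num_w)]
    new_theta = [[theta_cols[d][t] for d in range(num_d)] for t in range(num_t)]
    return new_phi, new_theta

def log_likelihood(docs, phi, theta):
    """The maximized criterion: sum of n_dw * ln p(w|d) over observed pairs."""
    return sum(n_dw * math.log(sum(phi[w][t] * theta[t][d] for t in range(len(theta))))
               for d, doc in enumerate(docs) for w, n_dw in doc.items())

# Toy collection (made up): 4 terms, 3 documents, 2 topics, random start.
random.seed(0)
docs = [{0: 5, 1: 3}, {2: 4, 3: 6}, {0: 1, 1: 1, 2: 2, 3: 2}]
phi = [[random.random() for _ in range(2)] for _ in range(4)]
phi = [list(row) for row in zip(*[norm(col) for col in zip(*phi)])]
theta = [[0.5] * 3 for _ in range(2)]

history = [log_likelihood(docs, phi, theta)]
for _ in range(20):
    phi, theta = em_iteration(docs, phi, theta)
    history.append(log_likelihood(docs, phi, theta))
```

In this sketch `history` records the log-likelihood after every iteration, which makes it easy to check that the simple iteration does not decrease the criterion.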
ARTM and regularized EM-algorithm
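Log-likelihood maximization with an additive regularization criterion $R$:
$$\sum_{d \in D} \sum_{w \in W} n_{dw} \ln \sum_{t} \varphi_{wt} \theta_{td} + R(\Phi, \Theta) \to \max_{\Phi, \Theta}$$
EM-algorithm: the simple iteration method for the system of equations
E-step:
$$p_{tdw} = \operatorname*{norm}_{t \in T} \bigl( \varphi_{wt} \theta_{td} \bigr)$$
M-step:
$$\varphi_{wt} = \operatorname*{norm}_{w \in W} \Bigl( n_{wt} + \varphi_{wt} \frac{\partial R}{\partial \varphi_{wt}} \Bigr), \quad n_{wt} = \sum_{d \in D} n_{dw} p_{tdw}$$
$$\theta_{td} = \operatorname*{norm}_{t \in T} \Bigl( n_{td} + \theta_{td} \frac{\partial R}{\partial \theta_{td}} \Bigr), \quad n_{td} = \sum_{w \in d} n_{dw} p_{tdw}$$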
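To see what the extra term does in the M-step, here is a small sketch for one topic column of Φ. It assumes a regularizer of the form R = τ Σ_w β_w ln φ_wt, a form commonly used for smoothing (τ > 0) and sparsing (τ < 0); this particular R, the numbers, and the function names are assumptions made for the example, since the slide leaves R generic.

```python
def norm(xs):
    """The norm operator: clip negatives to zero, then normalize to sum 1."""
    clipped = [max(x, 0.0) for x in xs]
    total = sum(clipped)
    return [x / total for x in clipped] if total > 0 else clipped

def regularized_update(n_col, phi_col, grad_col):
    """phi_wt = norm over w of (n_wt + phi_wt * dR/dphi_wt), for one topic t."""
    return norm([n + p * g for n, p, g in zip(n_col, phi_col, grad_col)])

def log_regularizer_gradient(phi_col, tau, beta):
    """dR/dphi_wt for the assumed R = tau * sum_w beta_w * ln(phi_wt)."""
    return [tau * b / p if p > 0 else 0.0 for b, p in zip(beta, phi_col)]

n_col = [8.0, 1.5, 0.5]     # counters n_wt of one topic (made up)
phi_col = [0.7, 0.2, 0.1]   # current phi_wt of that topic
beta = [1.0, 1.0, 1.0]

plain = norm(n_col)  # PLSA update, no regularizer
smooth = regularized_update(n_col, phi_col, log_regularizer_gradient(phi_col, 1.0, beta))
sparse = regularized_update(n_col, phi_col, log_regularizer_gradient(phi_col, -1.0, beta))
```

With τ = 1 every counter is increased by about β_w, which pulls the column towards the uniform distribution. With τ = -1 the smallest counter becomes negative and is clipped by the max{x, 0} inside norm, so the corresponding probability becomes exactly zero.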
Examples of regularizers
Many Bayesian models can be reinterpreted as regularizers in ARTM.
Some examples of regularizers:
1 Smoothing Φ / Θ (leads to popular LDA model)
2 Sparsing Φ / Θ
3 Decorrelation of topics in Φ
4 Semi-supervised learning
5 Topic coherence maximization
6 Topic selection
7 . . .
Multimodal Topic Model
Multimodal Topic Model finds topical distributions for terms
p(w|t), authors p(a|t), time p(y|t), objects of images p(o|t),
linked documents p(d′|t), advertising banners p(b|t), users p(u|t),
and binds all these modalities into a single topic model.
[Figure: text documents (doc1, doc2, doc3, doc4, ...) together with their metadata (authors, date and time, conference, organization, URL, etc.), ads, images, links and users are the input of topic modeling; its output is the topics of documents and the words and keyphrases of topics.]
M-ARTM and multimodal regularized EM-algorithm
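$W^m$ is the vocabulary of terms of the $m$-th modality, $m \in M$;
$W = \bigsqcup_{m \in M} W^m$ is the joint vocabulary of all modalities.
Multimodal log-likelihood maximization with an additive regularization criterion $R$:
$$\sum_{m \in M} \lambda_m \sum_{d \in D} \sum_{w \in W^m} n_{dw} \ln \sum_{t} \varphi_{wt} \theta_{td} + R(\Phi, \Theta) \to \max_{\Phi, \Theta}$$
EM-algorithm: the simple iteration method for the system of equations
E-step:
$$p_{tdw} = \operatorname*{norm}_{t \in T} \bigl( \varphi_{wt} \theta_{td} \bigr)$$
M-step:
$$\varphi_{wt} = \operatorname*{norm}_{w \in W^m} \Bigl( n_{wt} + \varphi_{wt} \frac{\partial R}{\partial \varphi_{wt}} \Bigr), \quad n_{wt} = \sum_{d \in D} \lambda_{m(w)} n_{dw} p_{tdw}$$
$$\theta_{td} = \operatorname*{norm}_{t \in T} \Bigl( n_{td} + \theta_{td} \frac{\partial R}{\partial \theta_{td}} \Bigr), \quad n_{td} = \sum_{w \in d} \lambda_{m(w)} n_{dw} p_{tdw}$$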
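Two details distinguish the multimodal M-step from the single-modality one: φ_wt is normalized separately inside each modality, and every term's contribution is weighted by λ of its modality. The sketch below shows both for one topic and one document; the term names, modality names, weights and function names are made-up assumptions, and the regularizer term is omitted for brevity.

```python
from collections import defaultdict

def norm_within_modalities(x, modality_of):
    """Apply norm over w in W^m separately inside every modality m.
    x maps term -> value for one fixed topic t."""
    totals = defaultdict(float)
    for w, value in x.items():
        totals[modality_of[w]] += max(value, 0.0)
    return {w: (max(value, 0.0) / totals[modality_of[w]]
                if totals[modality_of[w]] > 0 else 0.0)
            for w, value in x.items()}

def weighted_n_td(doc_counts, p_tdw, modality_of, lam):
    """n_td = sum over w in d of lambda_{m(w)} * n_dw * p_tdw (one d, one t)."""
    return sum(lam[modality_of[w]] * n_dw * p_tdw[w]
               for w, n_dw in doc_counts.items())

# Hypothetical vocabulary with two modalities.
modality_of = {"topic": "text", "model": "text",
               "frei": "author", "apishev": "author"}
n_wt = {"topic": 6.0, "model": 2.0, "frei": 1.0, "apishev": 3.0}
phi_t = norm_within_modalities(n_wt, modality_of)
# Each modality sums to 1 on its own: text 0.75 + 0.25, author 0.25 + 0.75.

lam = {"text": 1.0, "author": 10.0}
n_td = weighted_n_td({"topic": 2, "frei": 1},
                     {"topic": 0.5, "frei": 0.25}, modality_of, lam)
# 1.0 * 2 * 0.5 + 10.0 * 1 * 0.25 = 3.5
```

A larger λ for a modality makes its tokens count for more when θ_td is estimated, which is how the modality weights control the influence of, for example, authors relative to plain text.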
BigARTM project
BigARTM features:
Fast¹ parallel and online processing of Big Data;
Multimodal and regularized topic modeling;
Built-in library of regularizers and quality measures.
BigARTM community:
Open-source: https://github.com/bigartm
Documentation: http://bigartm.org
BigARTM license and programming environment:
Freely available for commercial usage (BSD 3-Clause license)
Cross-platform: Windows, Linux, Mac OS X (32 bit, 64 bit)
Programming APIs: command line, C++, Python
¹ Vorontsov K., Frei O., Apishev M., Romov P., Dudarenko M. BigARTM: Open Source Library for Regularized Multimodal Topic Modeling of Large Collections. Analysis of Images, Social Networks and Texts, 2015.
BigARTM vs. Gensim vs. Vowpal Wabbit LDA
3.7M articles from Wikipedia, 100K unique words
Framework     procs  train     inference  perplexity
BigARTM       1      35 min    72 sec     4000
LdaModel      1      369 min   395 sec    4161
VW.LDA        1      73 min    120 sec    4108
BigARTM       4      9 min     20 sec     4061
LdaMulticore  4      60 min    222 sec    4111
BigARTM       8      4.5 min   14 sec     4304
LdaMulticore  8      57 min    224 sec    4455
procs = number of parallel threads
inference = time to infer $\theta_d$ for 100K held-out documents
perplexity $P$ is calculated on held-out documents:
$$P(D) = \exp \Bigl( -\frac{1}{n} \sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \varphi_{wt} \theta_{td} \Bigr), \quad n = \sum_{d} n_d.$$
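The perplexity formula translates directly into a few lines of code. The sketch below is plain Python on made-up toy inputs; the function name `perplexity` and the data layout are assumptions for illustration.

```python
import math

def perplexity(docs, phi, theta):
    """exp(-(1/n) * sum_d sum_w n_dw * ln(sum_t phi_wt * theta_td)),
    where docs[d] maps term index w -> n_dw and n is the total token count."""
    log_sum, n = 0.0, 0
    for d, doc in enumerate(docs):
        for w, n_dw in doc.items():
            p_wd = sum(phi[w][t] * theta[t][d] for t in range(len(theta)))
            log_sum += n_dw * math.log(p_wd)
            n += n_dw
    return math.exp(-log_sum / n)

docs = [{0: 2, 3: 1}, {1: 4, 2: 3}]              # two held-out documents

# A model that knows nothing: every topic is uniform over the 4 terms.
phi_uniform = [[0.25, 0.25] for _ in range(4)]
theta_mixed = [[0.5, 0.25], [0.5, 0.75]]
print(perplexity(docs, phi_uniform, theta_mixed))   # close to 4.0

# A model whose topics match the documents: each term has p(w|d) = 0.5.
phi_good = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.5], [0.5, 0.0]]
theta_good = [[1.0, 0.0], [0.0, 1.0]]
print(perplexity(docs, phi_good, theta_good))       # close to 2.0
```

The uniform model has a perplexity equal to the vocabulary size, and a better-fitting model has a lower value, which is why lower is better in the comparison table.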
Offline algorithm
The collection is split into batches.
The Offline algorithm performs scans over the collection.
Each thread processes one batch at a time, inferring $n_{wt}$ and $\theta_{td}$ (using Θ regularization).
After each scan the algorithm recalculates the Φ matrix and applies the Φ regularizers according to the equation
$$\varphi_{wt} = \operatorname*{norm}_{w \in W} \Bigl( n_{wt} + \varphi_{wt} \frac{\partial R}{\partial \varphi_{wt}} \Bigr).$$
The implementation never stores the entire Θ matrix at any given time.
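The control flow of the Offline algorithm can be sketched with a thread pool. This is a simplified illustration under several assumptions: no regularizers are applied, θ_d is inferred with a fixed number of inner iterations from a uniform start, and the names `process_batch` and `fit_offline` are hypothetical. Python threads are used only to show the structure; they do not give the speed-up of a native implementation.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def norm(xs):
    clipped = [max(x, 0.0) for x in xs]
    total = sum(clipped)
    return [x / total for x in clipped] if total > 0 else clipped

def normalize_columns(m):
    """Apply norm to every column of a matrix stored as a list of rows."""
    cols = [norm(col) for col in zip(*m)]
    return [list(row) for row in zip(*cols)]

def process_batch(batch, phi, inner_iters=5):
    """Return this batch's contribution to n_wt. theta_d is inferred per
    document from a uniform start and dropped once the document is done."""
    num_w, num_t = len(phi), len(phi[0])
    n_wt = [[0.0] * num_t for _ in range(num_w)]
    for doc in batch:                                   # doc: {w: n_dw}
        theta_d = [1.0 / num_t] * num_t
        for _ in range(inner_iters):
            n_td = [0.0] * num_t
            for w, n_dw in doc.items():
                p = norm([phi[w][t] * theta_d[t] for t in range(num_t)])
                n_td = [a + n_dw * b for a, b in zip(n_td, p)]
            theta_d = norm(n_td)
        for w, n_dw in doc.items():
            p = norm([phi[w][t] * theta_d[t] for t in range(num_t)])
            n_wt[w] = [a + n_dw * b for a, b in zip(n_wt[w], p)]
    return n_wt

def fit_offline(batches, phi, num_scans=3, num_workers=2):
    num_w, num_t = len(phi), len(phi[0])
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for _ in range(num_scans):
            # One task per batch; map() returns results in batch order.
            parts = list(pool.map(process_batch, batches, [phi] * len(batches)))
            n_wt = [[sum(part[w][t] for part in parts) for t in range(num_t)]
                    for w in range(num_w)]
            # Recalculate Phi once per scan (no Phi regularizer in this sketch).
            phi = normalize_columns(n_wt)
    return phi

random.seed(1)
batches = [[{0: 5, 1: 3}, {0: 2, 1: 4}], [{2: 4, 3: 6}], [{0: 1, 2: 2, 3: 2}]]
start = normalize_columns([[random.random() + 0.1 for _ in range(2)]
                           for _ in range(4)])
phi_one_worker = fit_offline(batches, start, num_workers=1)
phi_three_workers = fit_offline(batches, start, num_workers=3)
```

Because Φ only changes between scans and the per-batch results are summed in batch order, the outcome of this sketch does not depend on the number of workers. The cost is that all workers wait while the main thread recalculates Φ.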
Offline algorithm: Gantt chart
[Gantt chart: timeline from 0 s to 36 s for the Main thread and Proc-1 to Proc-6; legend: Batch processing, Norm.]
This and the following Gantt charts were created using the NYTimes dataset:
https://archive.ics.uci.edu/ml/datasets/Bag+of+Words.
The dataset contains ≈ 300k documents, but each algorithm was run on a subset (from 70% to 100%) to achieve a running time of ≈ 36 sec.
Online algorithm
The algorithm is a generalization of the Online variational Bayes algorithm for the LDA model.
Online ARTM improves the convergence rate of Offline ARTM by recalculating the matrix Φ after every 𝜂 batches.
It is better suited for large and heterogeneous text collections.
A weighted sum of the n_wt from the previous batches and from the current 𝜂 batches controls the importance of new information.
Issue: no thread has useful work to do while the Φ matrix is being updated.
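The weighted sum can be sketched as follows. The schedule ρ = (τ0 + i)^(-κ) is the learning-rate schedule known from online variational Bayes for LDA and is used here as an assumption; the slide does not state which weights are used, and the names `update_weights`, `merge`, `tau0` and `kappa` are illustrative.

```python
def update_weights(update_index, tau0=64.0, kappa=0.7):
    """Assumed schedule rho = (tau0 + i) ** (-kappa); returns the weight of
    the old counters and the weight of the counters from the current batches."""
    rho = (tau0 + update_index) ** (-kappa)
    return 1.0 - rho, rho

def merge(n_wt_old, n_wt_new, decay_weight, apply_weight):
    """n_wt := decay_weight * n_wt_old + apply_weight * n_wt_new, elementwise."""
    return [[decay_weight * old + apply_weight * new
             for old, new in zip(row_old, row_new)]
            for row_old, row_new in zip(n_wt_old, n_wt_new)]

merged = merge([[4.0, 0.0]], [[8.0, 2.0]], decay_weight=0.75, apply_weight=0.25)
# merged == [[5.0, 0.5]]

first = update_weights(0)    # weights for the first update
tenth = update_weights(9)    # later updates trust new batches less
```

After the merge, Φ is obtained from the merged counters with the same norm-based update as in the Offline algorithm. A decreasing ρ lets early batches move the model quickly while later batches only refine it.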
Online algorithm: Gantt chart
[Gantt chart: timeline from 0 s to 36 s for the Main thread and Proc-1 to Proc-6; legend: Odd batch, Even batch, Norm, Merge.]
Async: Asynchronous online algorithm
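[Diagram: a queue of batches {D_b} feeds the Processor threads, which run ProcessBatch(D_b, φ_wt) and turn each batch D_b into an increment ñ_wt; a queue of increments {ñ_wt} feeds the Merger thread, which accumulates ñ_wt and recalculates φ_wt. Other labels: Sync().]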
Faster asynchronous implementation (it was compared with Gensim and VW LDA).
Issue: Merger and DataLoader can become a bottleneck.
Issue: the result of such an algorithm is non-deterministic.
Async: Gantt chart in normal case
[Gantt chart: timeline from 0 s to 36 s for the Merger thread and Proc-1 to Proc-6; legend: Odd batch, Even batch, Norm, Merge matrix, Merge increments.]
Async: Gantt chart in bad case
[Gantt chart: timeline from 0 s to 36 s for the Merger thread and Proc-1 to Proc-6; legend: Odd batch, Even batch, Norm, Merge matrix, Merge increments.]
DetAsync: Deterministic asynchronous online algorithm
To avoid non-deterministic behavior, replace the update after the first 𝜂 batches (whichever happen to finish first) with an update after a given, fixed set of 𝜂 batches.
Remove the Merger and DataLoader threads. Each Processor thread reads batches and writes results into the n_wt matrix by itself.
Processor threads get a set of batches to process, start processing, and immediately return a future object to the main thread.
The main thread can process the updates of the Φ matrix while the Processor threads work, and then get the result by passing the received future object to the Await function.
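The future-based scheme can be sketched with Python's `concurrent.futures`. This is one possible reading of the description above, not the BigARTM implementation: workers return their increments instead of writing into a shared n_wt matrix, the weight `rho` is fixed, regularizers are omitted, and all function names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def norm(xs):
    clipped = [max(x, 0.0) for x in xs]
    total = sum(clipped)
    return [x / total for x in clipped] if total > 0 else clipped

def normalize_columns(m):
    cols = [norm(col) for col in zip(*m)]
    return [list(row) for row in zip(*cols)]

def process_batch(batch, phi, inner_iters=5):
    """Infer theta_d per document and return the batch's n_wt increment."""
    num_w, num_t = len(phi), len(phi[0])
    n_wt = [[0.0] * num_t for _ in range(num_w)]
    for doc in batch:                                   # doc: {w: n_dw}
        theta_d = [1.0 / num_t] * num_t
        for _ in range(inner_iters):
            n_td = [0.0] * num_t
            for w, n_dw in doc.items():
                p = norm([phi[w][t] * theta_d[t] for t in range(num_t)])
                n_td = [a + n_dw * b for a, b in zip(n_td, p)]
            theta_d = norm(n_td)
        for w, n_dw in doc.items():
            p = norm([phi[w][t] * theta_d[t] for t in range(num_t)])
            n_wt[w] = [a + n_dw * b for a, b in zip(n_wt[w], p)]
    return n_wt

def fit_det_async(batches, phi, eta=2, num_workers=2, rho=0.5):
    # Fixed groups of eta batches: the update points do not depend on timing.
    groups = [batches[i:i + eta] for i in range(0, len(batches), eta)]
    if not groups:
        return phi
    n_wt = [[0.0] * len(phi[0]) for _ in phi]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        def start(group, model):
            # Returns at once with one future per batch of the group.
            return [pool.submit(process_batch, batch, model) for batch in group]

        pending = start(groups[0], phi)
        for i in range(len(groups)):
            # Launch the next group first, so workers stay busy while the
            # main thread merges and renormalizes.
            upcoming = start(groups[i + 1], phi) if i + 1 < len(groups) else []
            parts = [f.result() for f in pending]       # await, in batch order
            for w, row in enumerate(n_wt):
                for t in range(len(row)):
                    new = sum(part[w][t] for part in parts)
                    row[t] = (1.0 - rho) * row[t] + rho * new
            phi = normalize_columns(n_wt)   # new object; old snapshots untouched
            pending = upcoming
    return phi

random.seed(2)
batches = [[{0: 5, 1: 3}], [{2: 4, 3: 6}], [{0: 2, 1: 4}],
           [{0: 1, 2: 2, 3: 2}], [{1: 2, 3: 3}]]
start_phi = normalize_columns([[random.random() + 0.1 for _ in range(2)]
                               for _ in range(4)])
phi_1 = fit_det_async(batches, start_phi, num_workers=1)
phi_4 = fit_det_async(batches, start_phi, num_workers=4)
```

In this sketch each group is processed with the Φ that existed when the group was submitted, so a group lags one update behind. Since the grouping, the model snapshot passed to each batch, and the order of summation are all fixed in advance, the result is the same for any number of workers.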
DetAsync: schema
[Diagram: MasterModel exposes Transform({D_b}), FitOffline({D_b}) and FitOnline({D_b}). Processor threads: D_b = LoadBatch(b), then ProcessBatch(D_b, φ_wt). Main thread: recalculates φ_wt. Shared data: n_wt and φ_wt.]
DetAsync: Gantt chart
[Gantt chart: timeline from 0 s to 36 s for the Main thread and Proc-1 to Proc-6; legend: Odd batch, Even batch, Norm, Merge.]
Experiments
Datasets: Wikipedia (|D| = 3.7M articles, |W| = 100K words), Pubmed (|D| = 8.2M abstracts, |W| = 141K words).
Node: Intel Xeon CPU E5-2650 v2 system with 2 processors, 16 physical cores in total (32 with hyper-threading).
Metric: perplexity P achieved in the allotted time.
Time: each algorithm was time-boxed to run for 30 minutes.
Peak memory usage (GB):
Dataset  |T|   Offline  Online  DetAsync  Async (v0.6)
Pubmed   1000  5.17     4.68    8.18      13.4
Pubmed   100   1.86     1.62    2.17      3.71
Wiki     1000  1.74     2.44    3.93      7.9
Wiki     100   0.54     0.53    0.83      1.28
Reached perplexity value
[Plots: perplexity versus time (min) for Offline, Online, Async and DetAsync; Wikipedia (left), Pubmed (right).]
DetAsync achieves the best perplexity within the given time box.
Mining ethnic-related content from blogosphere
Development of a concept and methodology for multi-level monitoring of the state of inter-ethnic relations using data from social media.
The objectives of Topic Modeling in this project:
1 Identify ethnic topics in social media big data
2 Identify event and permanent ethnic topics
3 Identify spatio-temporal patterns of the ethnic discourse
4 Estimate the sentiment of the ethnic discourse
5 Develop the monitoring system of inter-ethnic discourse
The Russian Science Foundation grant 15-18-00091 (2015–2017)
(Higher School of Economics, St. Petersburg School of Social Sciences and
Humanities, Internet Studies Laboratory LINIS)
Example ethnonyms for semi-supervised topic modeling
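османский (Ottoman, adj.)                      русич (Rusich, archaic word for a Rus' person)
восточноевропейский (Eastern European, adj.)   сингапурец (Singaporean)
эвенк (Evenk)                                  перуанский (Peruvian, adj.)
швейцарская (Swiss, adj., fem.)                словенский (Slovenian, adj.)
аланский (Alan, adj.)                          вепсский (Veps, adj.)
саамский (Sami, adj.)                          ниггер (racial slur)
латыш (Latvian)                                адыги (Adyghe people)
литовец (Lithuanian)                           сомалиец (Somali)
цыганка (Romani woman)                         абхаз (Abkhaz)
ханты-мансийский (Khanty-Mansi, adj.)          темнокожий (dark-skinned)
карачаевский (Karachay, adj.)                  нигериец (Nigerian)
кубинка (Cuban woman)                          лягушатник (derogatory term for a French person)
гагаузский (Gagauz, adj.)                      камбоджиец (Cambodian)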
Regularization for finding ethnic topics
smoothing ethnonyms in ethnic topics
sparsing ethnonyms in background topics
smoothing non-ethnonyms in background topics
decorrelating ethnic topics
adding ethnonyms modality and decorrelating their topics
Experiment
LiveJournal collection: 1.58M documents
860K words in the raw vocabulary after lemmatization
90K words after filtering out:
short words with length ≤ 2,
rare words with n_w < 20, including non-Russian words
250 ethnonyms
Semi-supervised ARTM for ethnic topic modeling
The number of ethnic topics found by the model:
model   ethnic |S|  background |B|  ++  +−  −+  coh20²  tfidf20
PLSA    400 topics in total         12  15  17  -1447   -1012
LDA     400 topics in total         12  15  17  -1540   -1121
ARTM-4  250         150             21  27  20  -1651   -1296
ARTM-5  250         150             38  42  30  -1342   -908
ARTM-4:
ethnic topics: sparsing and decorrelating, ethnonyms smoothing
background topics: smoothing, ethnonyms sparsing
ARTM-5:
ARTM-4 + ethnonyms as an additional modality
² Coherence and TF-IDF coherence are metrics that match human judgment of topic quality. A topic is better if it has a higher coherence value.
Ethnic topics examples
Top words of each topic, translated from Russian lemmas; tokens that look like lemmatization artifacts are transliterated.
(Russians): Russian, prince, Russia, Tatar, great, to reign, tsar, Ivan, emperor, empire, to threaten, sovereign, century, Moscow (adj.), Catherine, Moscow, ...
(Russians): rally, organization, demonstration, movement, active, event, council, Russian, participant, Moscow, opposition, Russia, picket, protest, holding (of an event), nationalist, support, public, to hold, participation, ...
(Slavs, Byzantines): Slavic, Sviatoslav, pagan priest, ancient, writing system, Rurik, chronicle, Byzantium, Methodius, Khazar, Russian, alphabet, ...
(Syrians): Syrian, Assad, militant, district, terrorist, to destroy, armed group, Damascus, weapons, alesio, opposition, operation, village, USA, Nusra, Turkey, ...
(Turks): Turkey, Turkish, Kurdish, Erdogan, Istanbul, country, Caucasus, Gorin, police, prime minister, region, Kurdistan, Atatürk, party, ...
(Iranians): Iran, Iranian, USA, Russia, nuclear, president, Tehran, Syria, UN, Israel, negotiations, Obama, sanction, Islamic, ...
(Palestinians): terrorist, Israel, to lose, Palestinian (adj.), Palestinian (noun), terrorist (adj.), Palestine, explosion, territory, country, state, security, Arab, organization, Jerusalem, military, police, gas, ...
(Lebanese): Lebanese, militant, district, Lebanon, army, terrorist, Ali, military, Hezbollah, wounded, to destroy, Syria, unit, city quarter, army (adj.), ...
(Libyans): Lebanon, democracy, country, Libyan, Gaddafi, state, Algeria, war, government, USA, Arab, Ali, Muammar, Syria, ...
(Jews): Israel, Israeli, country, izrail, war, Netanyahu, Tel Aviv, time, USA, Syria, Egypt, case, airplane, Jewish, military, near, ...
Conclusions
BigARTM is an open-source library supporting the multimodal ARTM theory.
The fast implementation of the underlying online EM-algorithm was improved further, and its memory usage was reduced.
A combination of 8 regularizers in the task of ethnic topic extraction showed the superiority of the ARTM approach.
BigARTM is used to process more than 20 collections in several different projects.
Join our community!
Contacts: bigartm.org, great-mel@yandex.ru
Murat Apishev great-mel@yandex.ru AIST 2016 33 / 33

More Related Content

What's hot (20)

PDF
VAE-type Deep Generative Models
Kenta Oono
 
PDF
D143136
IJRES Journal
 
PDF
Hyperparameter optimization with approximate gradient
Fabian Pedregosa
 
PDF
Safe and Efficient Off-Policy Reinforcement Learning
mooopan
 
PPTX
Machine learning applications in aerospace domain
홍배 김
 
PDF
Data-Driven Recommender Systems
recsysfr
 
PDF
Time Series Forecasting Using Recurrent Neural Network and Vector Autoregress...
Databricks
 
PPTX
Differential privacy without sensitivity [NIPS2016読み会資料]
Kentaro Minami
 
PDF
Neural Networks: Radial Bases Functions (RBF)
Mostafa G. M. Mostafa
 
PDF
Matrix and Tensor Tools for Computer Vision
Andrews Cordolino Sobral
 
PDF
PR-272: Accelerating Large-Scale Inference with Anisotropic Vector Quantization
Sunghoon Joo
 
PDF
Improving Variational Inference with Inverse Autoregressive Flow
Tatsuya Shirakawa
 
PPTX
Introduction of "TrailBlazer" algorithm
Katsuki Ohto
 
PDF
Lecture 6: Convolutional Neural Networks
Sang Jun Lee
 
PDF
Variational Autoencoder
Mark Chang
 
PDF
CSC446: Pattern Recognition (LN8)
Mostafa G. M. Mostafa
 
PDF
safe and efficient off policy reinforcement learning
Ryo Iwaki
 
PDF
The Gaussian Process Latent Variable Model (GPLVM)
James McMurray
 
PPTX
Bidirectional graph search techniques for finding shortest path in image base...
Navin Kumar
 
PDF
Dictionary Learning for Massive Matrix Factorization
recsysfr
 
VAE-type Deep Generative Models
Kenta Oono
 
D143136
IJRES Journal
 
Hyperparameter optimization with approximate gradient
Fabian Pedregosa
 
Safe and Efficient Off-Policy Reinforcement Learning
mooopan
 
Machine learning applications in aerospace domain
홍배 김
 
Data-Driven Recommender Systems
recsysfr
 
Time Series Forecasting Using Recurrent Neural Network and Vector Autoregress...
Databricks
 
Differential privacy without sensitivity [NIPS2016読み会資料]
Kentaro Minami
 
Neural Networks: Radial Bases Functions (RBF)
Mostafa G. M. Mostafa
 
Matrix and Tensor Tools for Computer Vision
Andrews Cordolino Sobral
 
PR-272: Accelerating Large-Scale Inference with Anisotropic Vector Quantization
Sunghoon Joo
 
Improving Variational Inference with Inverse Autoregressive Flow
Tatsuya Shirakawa
 
Introduction of "TrailBlazer" algorithm
Katsuki Ohto
 
Lecture 6: Convolutional Neural Networks
Sang Jun Lee
 
Variational Autoencoder
Mark Chang
 
CSC446: Pattern Recognition (LN8)
Mostafa G. M. Mostafa
 
safe and efficient off policy reinforcement learning
Ryo Iwaki
 
The Gaussian Process Latent Variable Model (GPLVM)
James McMurray
 
Bidirectional graph search techniques for finding shortest path in image base...
Navin Kumar
 
Dictionary Learning for Massive Matrix Factorization
recsysfr
 

Viewers also liked (6)

ODP
Topic Modeling
Karol Grzegorczyk
 
PDF
Topic model
saireya _
 
PDF
Topic Models, LDA and all that
Zhibo Xiao
 
POTX
LDA Beginner's Tutorial
Wayne Lee
 
PPTX
20151221 public
Katsuhiko Ishiguro
 
PPT
Topic Models
Claudia Wagner
 
Topic Modeling
Karol Grzegorczyk
 
Topic model
saireya _
 
Topic Models, LDA and all that
Zhibo Xiao
 
LDA Beginner's Tutorial
Wayne Lee
 
20151221 public
Katsuhiko Ishiguro
 
Topic Models
Claudia Wagner
 
Ad

Similar to Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algorithm for Online Topic Modeling (20)

PPT
Parallel algorithms
guest084d20
 
PPT
Parallel algorithms
guest084d20
 
PPT
Parallel algorithms
guest084d20
 
PDF
Gk3611601162
IJERA Editor
 
PDF
Massive Matrix Factorization : Applications to collaborative filtering
Arthur Mensch
 
PPTX
autoTVM
Yi-Wen Hung
 
PDF
cis97003
perfj
 
PDF
Safety Verification of Deep Neural Networks_.pdf
larbaoui
 
PPTX
A Tale of Data Pattern Discovery in Parallel
Jenny Liu
 
PPTX
Chapter two
mihiretu kassaye
 
PDF
Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...
romovpa
 
PPT
Research Away Day Jun 2009
German Terrazas
 
PDF
Second order traffic flow models on networks
Guillaume Costeseque
 
PDF
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...
Johann Petrak
 
PDF
Does PostgreSQL respond to the challenge of analytical queries?
Andrey Lepikhov
 
PPT
Stacks queues lists
Harry Potter
 
PPT
Stack squeues lists
James Wong
 
PPT
Stacksqueueslists
Fraboni Ec
 
PPT
Stacks queues lists
Tony Nguyen
 
PPT
Stacks queues lists
Luis Goldster
 
Parallel algorithms
guest084d20
 
Parallel algorithms
guest084d20
 
Parallel algorithms
guest084d20
 
Gk3611601162
IJERA Editor
 
Massive Matrix Factorization : Applications to collaborative filtering
Arthur Mensch
 
autoTVM
Yi-Wen Hung
 
cis97003
perfj
 
Safety Verification of Deep Neural Networks_.pdf
larbaoui
 
A Tale of Data Pattern Discovery in Parallel
Jenny Liu
 
Chapter two
mihiretu kassaye
 
Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...
romovpa
 
Research Away Day Jun 2009
German Terrazas
 
Second order traffic flow models on networks
Guillaume Costeseque
 
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...
Johann Petrak
 
Does PostgreSQL respond to the challenge of analytical queries?
Andrey Lepikhov
 
Stacks queues lists
Harry Potter
 
Stack squeues lists
James Wong
 
Stacksqueueslists
Fraboni Ec
 
Stacks queues lists
Tony Nguyen
 
Stacks queues lists
Luis Goldster
 
Ad

More from AIST (20)

PDF
Alexey Mikhaylichenko - Automatic Detection of Bone Contours in X-Ray Images
AIST
 
PDF
Алена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоны
AIST
 
PDF
Станислав Кралин, Сайтсофт - Связанные открытые данные федеральных органов ис...
AIST
 
PDF
Павел Браславский,Velpas - Velpas: мобильный визуальный поиск
AIST
 
PDF
Евгений Цымбалов, Webgames - Методы машинного обучения для задач игровой анал...
AIST
 
PDF
Александр Москвичев, EveResearch - Алгоритмы анализа данных в маркетинговых и...
AIST
 
PDF
Петр Ермаков, HeadHunter - Модерация резюме: от людей к роботам. Машинное обу...
AIST
 
PPTX
Иосиф Иткин, Exactpro - TBA
AIST
 
PPTX
Nikolay Karpov - Evolvable Semantic Platform for Facilitating Knowledge Exchange
AIST
 
PDF
George Moiseev - Classification of E-commerce Websites by Product Categories
AIST
 
PDF
Elena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation
AIST
 
PDF
Marina Danshina - The methodology of automated decryption of znamenny chants
AIST
 
PDF
Edward Klyshinsky - The Corpus of Syntactic Co-occurences: the First Glance
AIST
 
PPTX
Galina Lavrentyeva - Anti-spoofing Methods for Automatic Speaker Verification...
AIST
 
PDF
Kaytoue Mehdi - Finding duplicate labels in behavioral data: an application f...
AIST
 
PPTX
Valeri Labunets - The bichromatic excitable Schrodinger metamedium
AIST
 
PPTX
Valeri Labunets - Fast multiparametric wavelet transforms and packets for ima...
AIST
 
PDF
Alexander Karkishchenko - Threefold Symmetry Detection in Hexagonal Images Ba...
AIST
 
PPTX
Artyom Makovetskii - An Efficient Algorithm for Total Variation Denoising
AIST
 
PPT
Olesia Kushnir - Reflection Symmetry of Shapes Based on Skeleton Primitive Ch...
AIST
 
Alexey Mikhaylichenko - Automatic Detection of Bone Contours in X-Ray Images
AIST
 
Алена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоны
AIST
 
Станислав Кралин, Сайтсофт - Связанные открытые данные федеральных органов ис...
AIST
 
Павел Браславский,Velpas - Velpas: мобильный визуальный поиск
AIST
 
Евгений Цымбалов, Webgames - Методы машинного обучения для задач игровой анал...
AIST
 
Александр Москвичев, EveResearch - Алгоритмы анализа данных в маркетинговых и...
AIST
 
Петр Ермаков, HeadHunter - Модерация резюме: от людей к роботам. Машинное обу...
AIST
 
Иосиф Иткин, Exactpro - TBA
AIST
 
Nikolay Karpov - Evolvable Semantic Platform for Facilitating Knowledge Exchange
AIST
 
George Moiseev - Classification of E-commerce Websites by Product Categories
AIST
 
Elena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation
AIST
 
Marina Danshina - The methodology of automated decryption of znamenny chants
AIST
 
Edward Klyshinsky - The Corpus of Syntactic Co-occurences: the First Glance
AIST
 
Galina Lavrentyeva - Anti-spoofing Methods for Automatic Speaker Verification...
AIST
 
Kaytoue Mehdi - Finding duplicate labels in behavioral data: an application f...
AIST
 
Valeri Labunets - The bichromatic excitable Schrodinger metamedium
AIST
 
Valeri Labunets - Fast multiparametric wavelet transforms and packets for ima...
AIST
 
Alexander Karkishchenko - Threefold Symmetry Detection in Hexagonal Images Ba...
AIST
 
Artyom Makovetskii - An Efficient Algorithm for Total Variation Denoising
AIST
 
Olesia Kushnir - Reflection Symmetry of Shapes Based on Skeleton Primitive Ch...
AIST
 

Recently uploaded (20)

DOCX
Discover the Key Benefits of Implementing Data Mesh Architecture.docx
ajaykumar405166
 
PDF
Introduction to Data Science_Washington_
StarToon1
 
PPTX
UVA-Ortho-PPT-Final-1.pptx Data analytics relevant to the top
chinnusindhu1
 
PPTX
UPS Case Study - Group 5 with example and implementation .pptx
yasserabdelwahab6
 
PDF
Dr. Robert Krug - Chief Data Scientist At DataInnovate Solutions
Dr. Robert Krug
 
PPTX
apidays Munich 2025 - Agentic AI: A Friend or Foe?, Merja Kajava (Aavista Oy)
apidays
 
PDF
apidays Munich 2025 - Integrate Your APIs into the New AI Marketplace, Senthi...
apidays
 
PPTX
Slide studies GC- CRC - PC - HNC baru.pptx
LLen8
 
PPT
introdution to python with a very little difficulty
HUZAIFABINABDULLAH
 
PDF
apidays Munich 2025 - The life-changing magic of great API docs, Jens Fischer...
apidays
 
PPTX
Introduction to Artificial Intelligence.pptx
StarToon1
 
PPTX
materials that are required to used.pptx
drkaran1421
 
PDF
apidays Munich 2025 - Making Sense of AI-Ready APIs in a Buzzword World, Andr...
apidays
 
PPTX
apidays Munich 2025 - Federated API Management and Governance, Vince Baker (D...
apidays
 
PDF
apidays Munich 2025 - The Physics of Requirement Sciences Through Application...
apidays
 
PPTX
Insurance-Analytics-Branch-Dashboard (1).pptx
trivenisapate02
 
PDF
apidays Munich 2025 - Let’s build, debug and test a magic MCP server in Postm...
apidays
 
PDF
apidays Munich 2025 - Developer Portals, API Catalogs, and Marketplaces, Miri...
apidays
 
PDF
apidays Munich 2025 - Geospatial Artificial Intelligence (GeoAI) with OGC API...
apidays
 
PPTX
SRIJAN_Projecttttt_Report_Cover_PPT.pptx
SakshiLodhi9
 
Discover the Key Benefits of Implementing Data Mesh Architecture.docx
ajaykumar405166
 
Introduction to Data Science_Washington_
StarToon1
 
UVA-Ortho-PPT-Final-1.pptx Data analytics relevant to the top
chinnusindhu1
 
UPS Case Study - Group 5 with example and implementation .pptx
yasserabdelwahab6
 
Dr. Robert Krug - Chief Data Scientist At DataInnovate Solutions
Dr. Robert Krug
 
apidays Munich 2025 - Agentic AI: A Friend or Foe?, Merja Kajava (Aavista Oy)
apidays
 
apidays Munich 2025 - Integrate Your APIs into the New AI Marketplace, Senthi...
apidays
 
Slide studies GC- CRC - PC - HNC baru.pptx
LLen8
 
introdution to python with a very little difficulty
HUZAIFABINABDULLAH
 
apidays Munich 2025 - The life-changing magic of great API docs, Jens Fischer...
apidays
 
Introduction to Artificial Intelligence.pptx
StarToon1
 
materials that are required to used.pptx
drkaran1421
 
apidays Munich 2025 - Making Sense of AI-Ready APIs in a Buzzword World, Andr...
apidays
 
apidays Munich 2025 - Federated API Management and Governance, Vince Baker (D...
apidays
 
apidays Munich 2025 - The Physics of Requirement Sciences Through Application...
apidays
 
Insurance-Analytics-Branch-Dashboard (1).pptx
trivenisapate02
 
apidays Munich 2025 - Let’s build, debug and test a magic MCP server in Postm...
apidays
 
apidays Munich 2025 - Developer Portals, API Catalogs, and Marketplaces, Miri...
apidays
 
apidays Munich 2025 - Geospatial Artificial Intelligence (GeoAI) with OGC API...
apidays
 
SRIJAN_Projecttttt_Report_Cover_PPT.pptx
SakshiLodhi9
 

Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algorithm for Online Topic Modeling

  • 1. Parallel Non-blocking Deterministic Algorithm for Online Topic Modeling Murat Apishev [email protected] Oleksandr Frei [email protected] HSE, MSU, MIPT April 8, 2016
  • 2. Contents 1 Introduction Topic modeling ARTM BigARTM 2 Parallel implementation Synchronous algorithms Asynchronous algorithms Comparison 3 Applications The RSF project Conclusions
  • 3. Introduction Parallel implementation Applications Topic modeling ARTM BigARTM Topic modeling Topic modeling — an application of machine learning to statistical text analysis. Topic — a specific terminology of the subject area, the set of terms (unigrams or n−grams) frequently appearing together in documents. Topic model uncovers latent semantic structure of a text collection: topic t is a probability distribution p(w|t) over terms w document d is a probability distribution p(t|d) over topics t Applications — information retrieval for long-text queries, classification, categorization, summarization of texts. Murat Apishev [email protected] AIST 2016 3 / 33
  • 4. Introduction Parallel implementation Applications Topic modeling ARTM BigARTM Topic modeling task Given: W — set (vocabulary) of terms (unigrams or n−grams), D — set (collection) of text documents d ⊂ W , ndw — how many times term w appears in document d. Find: model p(w|d) = ∑︀ t∈T 𝜑wt 𝜃td with parameters Φ W×T и Θ T×D : 𝜑wt =p(w|t) — term probabilities w in each topic t, 𝜃td =p(t|d) — topic probabilities t in each document d. Criteria log-likelihood maximization: ∑︁ d∈D ∑︁ w∈d ndw ln ∑︁ t∈T 𝜑wt 𝜃td → max 𝜑,𝜃 ; 𝜑wt 0; ∑︀ w 𝜑wt = 1; 𝜃td 0; ∑︀ t 𝜃td = 1. Issue: the problem of stochastic matrix factorization is ill-posed: ΦΘ = (ΦS)(S−1Θ) = Φ′Θ′. Murat Apishev [email protected] AIST 2016 4 / 33
  • 5. Introduction Parallel implementation Applications Topic modeling ARTM BigARTM PLSA and EM-algorithm Log-likelihood maximization: ∑︁ d∈D ∑︁ w∈W ndw ln ∑︁ t 𝜑wt 𝜃td → max Φ,Θ EM-algorithm: the simple iteration method for the set of equations E-шаг: M-шаг: ⎧ ⎪⎪⎪⎪⎨ ⎪⎪⎪⎪⎩ ptdw = norm t∈T (︀ 𝜑wt 𝜃td )︀ 𝜑wt = norm w∈W (︀ nwt )︀ , nwt = ∑︀ d∈D ndw ptdw 𝜃td = norm t∈T (︀ ntd )︀ , ntd = ∑︀ w∈d ndw ptdw where norm i∈I xi = max{xi ,0}∑︀ j∈I max{xj ,0} Murat Apishev [email protected] AIST 2016 5 / 33
  • 6. ARTM and regularized EM-algorithm. Log-likelihood maximization with an additive regularization criterion $R$: $\sum_{d \in D} \sum_{w \in W} n_{dw} \ln \sum_{t} \varphi_{wt}\theta_{td} + R(\Phi, \Theta) \to \max_{\Phi,\Theta}$. EM-algorithm: a simple-iteration method for the following system of equations. E-step: $p_{tdw} = \operatorname{norm}_{t \in T}(\varphi_{wt}\theta_{td})$. M-step: $\varphi_{wt} = \operatorname{norm}_{w \in W}\left(n_{wt} + \varphi_{wt}\frac{\partial R}{\partial \varphi_{wt}}\right)$ with $n_{wt} = \sum_{d \in D} n_{dw} p_{tdw}$, and $\theta_{td} = \operatorname{norm}_{t \in T}\left(n_{td} + \theta_{td}\frac{\partial R}{\partial \theta_{td}}\right)$ with $n_{td} = \sum_{w \in d} n_{dw} p_{tdw}$.
  • 7. Examples of regularizers. Many Bayesian models can be reinterpreted as regularizers in ARTM. Some examples of regularizers: (1) smoothing of Φ / Θ (leads to the popular LDA model); (2) sparsing of Φ / Θ; (3) decorrelation of topics in Φ; (4) semi-supervised learning; (5) topic coherence maximization; (6) topic selection; (7) ...
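To make the first two regularizers concrete: a form commonly used in the ARTM literature is $R(\Phi) = \tau \sum_{t}\sum_{w} \beta_w \ln \varphi_{wt}$, for which $\varphi_{wt}\,\partial R/\partial\varphi_{wt} = \tau\beta_w$, so the regularized M-step becomes $\varphi_{wt} = \operatorname{norm}_{w \in W}(n_{wt} + \tau\beta_w)$. The sketch below assumes that form; `tau`, `beta` and `regularized_phi` are illustrative names, not BigARTM identifiers. A positive `tau` smooths, a negative `tau` sparses, because the norm operator clips negative values to zero.

```python
def norm(xs):
    """Clip negatives to zero, then rescale to sum to 1 (the slides' norm operator)."""
    clipped = [max(x, 0.0) for x in xs]
    total = sum(clipped)
    return [c / total for c in clipped] if total > 0 else clipped


def regularized_phi(nwt, tau, beta):
    """M-step for Phi under the assumed regularizer: phi_wt = norm_w(n_wt + tau * beta_w).

    nwt[w][t] are the counters from the E-step, beta[w] is a fixed distribution
    over words, tau > 0 smooths and tau < 0 sparses.
    """
    W, T = len(nwt), len(nwt[0])
    phi = [[0.0] * T for _ in range(W)]
    for t in range(T):
        column = norm([nwt[w][t] + tau * beta[w] for w in range(W)])
        for w in range(W):
            phi[w][t] = column[w]
    return phi
```

With counters `[5.0, 0.5, 4.5]` for one topic and `beta = [1, 1, 1]`, `tau = -1` zeroes out the second word entirely, while `tau = +1` pulls its probability up.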
  • 8. Multimodal Topic Model. A multimodal topic model finds topical distributions for terms $p(w|t)$, authors $p(a|t)$, time $p(y|t)$, objects in images $p(o|t)$, linked documents $p(d'|t)$, advertising banners $p(b|t)$, and users $p(u|t)$, and binds all these modalities into a single topic model. [Figure: text documents (doc1, doc2, doc3, doc4, ...) together with metadata (Authors, Data Time, Conference, Organization, URL, etc.) and the modalities Ads, Images, Links, Users go into Topic Modeling, which produces the topics of documents and the words and keyphrases of topics.]
  • 9. M-ARTM and multimodal regularized EM-algorithm. $W^m$ is the vocabulary of terms of the $m$-th modality, $m \in M$, and $W = \bigsqcup_{m \in M} W^m$ is the joint vocabulary of all modalities. Multimodal log-likelihood maximization with an additive regularization criterion $R$: $\sum_{m \in M} \lambda_m \sum_{d \in D} \sum_{w \in W^m} n_{dw} \ln \sum_{t} \varphi_{wt}\theta_{td} + R(\Phi, \Theta) \to \max_{\Phi,\Theta}$. EM-algorithm: a simple-iteration method for the following system of equations. E-step: $p_{tdw} = \operatorname{norm}_{t \in T}(\varphi_{wt}\theta_{td})$. M-step: $\varphi_{wt} = \operatorname{norm}_{w \in W^m}\left(n_{wt} + \varphi_{wt}\frac{\partial R}{\partial \varphi_{wt}}\right)$ with $n_{wt} = \sum_{d \in D} \lambda_{m(w)} n_{dw} p_{tdw}$, and $\theta_{td} = \operatorname{norm}_{t \in T}\left(n_{td} + \theta_{td}\frac{\partial R}{\partial \theta_{td}}\right)$ with $n_{td} = \sum_{w \in d} \lambda_{m(w)} n_{dw} p_{tdw}$.
  • 10. BigARTM project. BigARTM features: fast [1] parallel and online processing of Big Data; multimodal and regularized topic modeling; a built-in library of regularizers and quality measures. BigARTM community: open source at https://github.com/bigartm; documentation at http://bigartm.org. License and programming environment: freely available for commercial usage (BSD 3-Clause license); cross-platform (Windows, Linux, Mac OS X; 32 bit and 64 bit); programming APIs: command line, C++, Python. [1] Vorontsov K., Frei O., Apishev M., Romov P., Dudarenko M. BigARTM: Open Source Library for Regularized Multimodal Topic Modeling of Large Collections. Analysis of Images, Social Networks and Texts, 2015.
  • 11. BigARTM vs. Gensim vs. Vowpal Wabbit LDA. Collection: 3.7M articles from Wikipedia, 100K unique words.
      Framework     procs  train    inference  perplexity
      BigARTM       1      35 min   72 sec     4000
      LdaModel      1      369 min  395 sec    4161
      VW.LDA        1      73 min   120 sec    4108
      BigARTM       4      9 min    20 sec     4061
      LdaMulticore  4      60 min   222 sec    4111
      BigARTM       8      4.5 min  14 sec     4304
      LdaMulticore  8      57 min   224 sec    4455
      procs = number of parallel threads; inference = time to infer $\theta_d$ for 100K held-out documents; perplexity $P$ is calculated on held-out documents: $P(D) = \exp\left(-\frac{1}{n} \sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \varphi_{wt}\theta_{td}\right)$, $n = \sum_{d} n_d$.
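The perplexity formula on slide 11 translates directly into code. The following is a small sketch on dense nested lists; the function name and the matrix layout (`ndw[d][w]`, `phi[w][t]`, `theta[t][d]`) are assumptions for illustration.

```python
import math


def perplexity(ndw, phi, theta):
    """P(D) = exp(-(1/n) * sum_{d,w} n_dw * ln(sum_t phi_wt * theta_td)), n = sum of all counts."""
    log_lik = 0.0
    n = 0
    for d, row in enumerate(ndw):
        for w, count in enumerate(row):
            if count == 0:
                continue
            p_wd = sum(phi[w][t] * theta[t][d] for t in range(len(theta)))
            log_lik += count * math.log(p_wd)
            n += count
    return math.exp(-log_lik / n)
```

A useful reference point: a model that assigns every one of $|W|$ words the same probability has perplexity exactly $|W|$, so lower values mean the model predicts held-out words better than uniform guessing.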
  • 12. Offline algorithm. The collection is split into batches. The offline algorithm performs repeated scans over the collection. Each thread processes one batch at a time, inferring $n_{wt}$ and $\theta_{td}$ (using Θ regularization). After each scan the algorithm recalculates the Φ matrix and applies the Φ regularizers according to the equation $\varphi_{wt} = \operatorname{norm}_{w \in W}\left(n_{wt} + \varphi_{wt}\frac{\partial R}{\partial \varphi_{wt}}\right)$. The implementation never stores the entire Θ matrix at any given time.
  • 13. Offline algorithm: Gantt chart. [Figure: Gantt chart over 0 s to 36 s with rows Main and Proc-1 to Proc-6; legend: Batch processing, Norm.] This and the following Gantt charts were created using the NYTimes dataset: https://archive.ics.uci.edu/ml/datasets/Bag+of+Words. The dataset contains ≈ 300K documents, but each algorithm was run on a subset (from 70% to 100%) to achieve a working time of ≈ 36 s.
  • 14. Online algorithm. The algorithm is a generalization of the online variational Bayes algorithm for the LDA model. Online ARTM improves the convergence rate of Offline ARTM by recalculating the Φ matrix after every $\eta$ batches. It is better suited for large and heterogeneous text collections. A weighted sum of $n_{wt}$ from the previous and the current $\eta$ batches controls the importance of new information. Issue: the threads have no useful work to do while the Φ matrix is being updated.
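The weighted sum of counters on slide 14 can be sketched as follows. The slide does not give the weights, so the schedule here is an assumption borrowed from online variational Bayes for LDA (a step size of the form $(\tau_0 + i)^{-\kappa}$); `merge_counters`, `weights_for_update`, `tau0` and `kappa` are illustrative names and values.

```python
def merge_counters(nwt_old, nwt_new, decay_weight, apply_weight):
    """n_wt := decay_weight * n_wt (previous) + apply_weight * n_wt (current eta batches)."""
    return [
        [decay_weight * old + apply_weight * new for old, new in zip(row_old, row_new)]
        for row_old, row_new in zip(nwt_old, nwt_new)
    ]


def weights_for_update(i, tau0=64.0, kappa=0.7):
    """Assumed schedule for the i-th update (i = 0, 1, 2, ...).

    apply_weight = (tau0 + i) ** -kappa shrinks as more updates are made,
    and decay_weight = 1 - apply_weight keeps the two weights summing to 1.
    """
    apply_weight = (tau0 + i) ** (-kappa)
    return 1.0 - apply_weight, apply_weight
```

After each group of $\eta$ batches, the merged counters would be passed through the regularized M-step to obtain the new Φ; a larger `apply_weight` makes the model react faster to new batches at the cost of forgetting older ones.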
  • 15. Online algorithm: Gantt chart. [Figure: Gantt chart over 0 s to 36 s with rows Main and Proc-1 to Proc-6; legend: Odd batch, Even batch, Norm, Merge.]
  • 16. Async: asynchronous online algorithm. [Diagram: batches $D_b$ pass through Queue $\{D_b\}$ to the Processor threads, which run ProcessBatch($D_b$, $\varphi_{wt}$) and push increments $\tilde n_{wt}$ to Queue $\{\tilde n_{wt}\}$; the Merger thread accumulates $\tilde n_{wt}$ and recalculates $\varphi_{wt}$; Sync().] A faster asynchronous implementation (this is the version that was compared with Gensim and VW LDA). Issue: the Merger and DataLoader threads can become a bottleneck. Issue: the result of such an algorithm is non-deterministic.
  • 17. Async: Gantt chart in the normal case. [Figure: Gantt chart over 0 s to 36 s with rows Merger and Proc-1 to Proc-6; legend: Odd batch, Even batch, Norm, Merge matrix, Merge increments.]
  • 18. Async: Gantt chart in the bad case. [Figure: Gantt chart over 0 s to 36 s with rows Merger and Proc-1 to Proc-6; legend: Odd batch, Even batch, Norm, Merge matrix, Merge increments.]
  • 19. DetAsync: deterministic asynchronous online algorithm. To avoid non-deterministic behavior, replace the update after the first $\eta$ batches (whichever happen to finish first) with an update after a given, fixed set of $\eta$ batches. Remove the Merger and DataLoader threads: each Processor thread reads batches and writes results into the $n_{wt}$ matrix by itself. Processor threads get a set of batches to process, start processing, and immediately return a future object to the main thread. The main thread can update the Φ matrix while the Processor threads work, and then obtains the result by passing the received future object to the Await function.
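The future/await pattern described on slide 19 can be illustrated with Python's `concurrent.futures`. This is a sketch of the idea only, not BigARTM's implementation: `det_async_fit`, `process_batch` and `update_phi` are hypothetical names, and the choice to process each group of batches with the Φ that was current when the group was submitted is an assumption about how the overlap works. What matters is that every update consumes a fixed group of $\eta$ batches in a fixed order, so the result does not depend on thread timing.

```python
from concurrent.futures import ThreadPoolExecutor


def det_async_fit(batches, eta, phi, process_batch, update_phi, workers=4):
    """Overlap batch processing with Phi updates while keeping the result deterministic.

    process_batch(batch, phi) returns the increments for one batch;
    update_phi(phi, increments) must return a NEW phi and not mutate its
    argument, because worker threads may still be reading the old one.
    """
    chunks = [batches[i:i + eta] for i in range(0, len(batches), eta)]
    if not chunks:
        return phi
    with ThreadPoolExecutor(max_workers=workers) as pool:

        def submit(chunk, phi_snapshot):
            # Returns immediately with one future per batch in the chunk.
            return [pool.submit(process_batch, b, phi_snapshot) for b in chunk]

        pending = submit(chunks[0], phi)
        for chunk in chunks[1:]:
            upcoming = submit(chunk, phi)               # workers start on the next chunk
            increments = [f.result() for f in pending]  # "Await" the fixed previous chunk
            phi = update_phi(phi, increments)           # main thread updates meanwhile
            pending = upcoming
        phi = update_phi(phi, [f.result() for f in pending])
    return phi
```

Because the increments are collected in submission order rather than completion order, running with one worker or with many gives the same Φ.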
  • 20. DetAsync: schema. [Diagram: MasterModel exposes Transform($\{D_b\}$), FitOffline($\{D_b\}$) and FitOnline($\{D_b\}$); the Processor threads run $D_b$ = LoadBatch($b$) and ProcessBatch($D_b$, $\varphi_{wt}$); the main thread recalculates $\varphi_{wt}$; the matrices $n_{wt}$ and $\varphi_{wt}$ are shared between them.]
  • 21. DetAsync: Gantt chart. [Figure: Gantt chart over 0 s to 36 s with rows Main and Proc-1 to Proc-6; legend: Odd batch, Even batch, Norm, Merge.]
  • 22. Experiments. Datasets: Wikipedia ($|D|$ = 3.7M articles, $|W|$ = 100K words), Pubmed ($|D|$ = 8.2M abstracts, $|W|$ = 141K words). Node: Intel Xeon CPU E5-2650 v2 system with 2 processors, 16 physical cores in total (32 with hyper-threading). Metric: perplexity $P$ achieved in the allotted time. Time: each algorithm was time-boxed to run for 30 minutes. Peak memory usage (GB):
      Dataset  |T|   Offline  Online  DetAsync  Async (v0.6)
      Pubmed   1000  5.17     4.68    8.18      13.4
      Pubmed   100   1.86     1.62    2.17      3.71
      Wiki     1000  1.74     2.44    3.93      7.9
      Wiki     100   0.54     0.53    0.83      1.28
  • 23. Reached perplexity value. [Figure: two plots of perplexity against time (min) for Offline, Online, Async and DetAsync; Wikipedia (left), Pubmed (right).] DetAsync achieves the best perplexity within the given time-box.
  • 24. Mining ethnic-related content from the blogosphere. Development of a concept and methodology for multi-level monitoring of the state of inter-ethnic relations using data from social media. The objectives of topic modeling in this project: (1) identify ethnic topics in social media big data; (2) identify event and permanent ethnic topics; (3) identify spatio-temporal patterns of the ethnic discourse; (4) estimate the sentiment of the ethnic discourse; (5) develop a monitoring system for inter-ethnic discourse. The Russian Science Foundation grant 15-18-00091 (2015–2017) (Higher School of Economics, St. Petersburg School of Social Sciences and Humanities, Internet Studies Laboratory LINIS).
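  • 25. Example ethnonyms for semi-supervised topic modeling (Russian vocabulary items with English glosses): османский (Ottoman), русич (Rusich), восточноевропейский (East European), сингапурец (Singaporean), эвенк (Evenk), перуанский (Peruvian), швейцарская (Swiss, fem.), словенский (Slovenian), аланский (Alanian), вепсский (Veps), саамский (Sami), ниггер (a racial slur), латыш (Latvian), адыги (Adyghe), литовец (Lithuanian), сомалиец (Somali), цыганка (Roma woman), абхаз (Abkhaz), ханты-мансийский (Khanty-Mansi), темнокожий (dark-skinned), карачаевский (Karachay), нигериец (Nigerian), кубинка (Cuban woman), лягушатник (a derogatory term for the French), гагаузский (Gagauz), камбоджиец (Cambodian).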
  • 26. Regularization for finding ethnic topics: smoothing ethnonyms in ethnic topics; sparsing ethnonyms in background topics.
  • 27. Regularization for finding ethnic topics: smoothing ethnonyms in ethnic topics; sparsing ethnonyms in background topics; smoothing non-ethnonyms in background topics.
  • 28. Regularization for finding ethnic topics: smoothing ethnonyms in ethnic topics; sparsing ethnonyms in background topics; smoothing non-ethnonyms in background topics; decorrelating ethnic topics.
  • 29. Regularization for finding ethnic topics: smoothing ethnonyms in ethnic topics; sparsing ethnonyms in background topics; smoothing non-ethnonyms in background topics; decorrelating ethnic topics; adding an ethnonyms modality and decorrelating its topics.
  • 30. Experiment. LiveJournal collection: 1.58M documents; 860K words in the raw vocabulary after lemmatization; 90K words after filtering out short words (length $\le 2$) and rare words ($n_w < 20$), including non-Russian words; 250 ethnonyms.
  • 31. Semi-supervised ARTM for ethnic topic modeling. The number of ethnic topics found by the model:
      model   ethnic |S|  background |B|  ++  +−  −+  coh20 [2]  tfidf20
      PLSA    400 (no split)              12  15  17  -1447      -1012
      LDA     400 (no split)              12  15  17  -1540      -1121
      ARTM-4  250         150             21  27  20  -1651      -1296
      ARTM-5  250         150             38  42  30  -1342      -908
      ARTM-4: ethnic topics use sparsing, decorrelating and ethnonym smoothing; background topics use smoothing and ethnonym sparsing. ARTM-5: ARTM-4 plus ethnonyms as an additional modality. [2] Coherence and TF-IDF coherence are metrics that match human judgment of topic quality; a topic is better if it has a higher coherence value.
  • 32. Ethnic topics examples (top lemmas of each topic, translated from Russian).
      (Russians): Russian, prince, Russia, Tatar, great, reign, tsar, Ivan, emperor, empire, threaten, sovereign, century, Moscow (adj.), Catherine, Moscow, ...
      (Russians): action, organization, rally, movement, active, event, council, Russian, participant, Moscow, opposition, Russia, picket, protest, holding, nationalist, support, public, hold, participation, ...
      (Slavs, Byzantines): Slavic, Svyatoslav, pagan priest, ancients, writing system, Rurik, chronicle, Byzantium, Methodius, Khazar, Russian, alphabet, ...
      (Syrians): Syrian, Assad, militant, district, terrorist, destroy, faction, Damascus, weapons, alesio, opposition, operation, village, USA, Nusra, Turkey, ...
      (Turks): Turkey, Turkish, Kurdish, Erdogan, Istanbul, country, Caucasus, gorin, police, prime minister, region, Kurdistan, Ataturk, party, ...
      (Iranians): Iran, Iranian, USA, Russia, nuclear, president, Tehran, Syria, UN, Israel, negotiations, Obama, sanction, Islamic, ...
      (Palestinians): terrorist, Israel, lose, Palestinian (adj.), Palestinian (noun), terrorist (adj.), Palestine, explosion, territory, country, state, security, Arab, organization, Jerusalem, military, police, gas, ...
      (Lebanese): Lebanese, militant, district, Lebanon, army, terrorist, Ali, military, Hezbollah, wounded, destroy, Syria, unit, quarter, army (adj.), ...
      (Libyans): Lebanon, democracy, country, Libyan, Gaddafi, state, Algeria, war, government, USA, Arab, Ali, Muammar, Syria, ...
      (Jews): Israel, Israeli, country, izrail, war, Netanyahu, Tel Aviv, time, USA, Syria, Egypt, case, airplane, Jewish, military, near, ...
  • 33. Conclusions. BigARTM is an open-source library supporting the multimodal ARTM theory. The already fast implementation of the underlying online EM-algorithm was improved further, and memory usage was reduced. A combination of 8 regularizers in the task of ethnic topic extraction showed the superiority of the ARTM approach. BigARTM is used to process more than 20 collections in several different projects. Join our community! Contacts: bigartm.org.