Text mining of Beauty Blogs:
What do women talk about?
Artem Prosvetov
Data Scientist, CleverDATA
cleverdata.ru |		info@cleverdata.ru
Raw blog data
Raw data: 98,496 pages, stored as ~1,000,000 files.
Ready for analysis: 58,719 English pages (59.6%).
Excluded (40.4%): empty pages and pages with errors, non-English pages (23,461), photo/video pages without text (2,315), and articles from techcrunch.com (3,402).

Pages → Authors
From ~60k pages → ~2,000 authors.

Mean blog post size (in words)
One can distinguish 2 populations of bloggers:
• 'Twitter-style' authors with short posts (~20%)
• full-length bloggers with a mean of 200-500 words per post (~80%)

Sentiment analysis
Used APIs and services:
- Sentity (https://sentity.io/)
- Twinword (https://www.twinword.com/)
- Textualinsights (http://www.textualinsights.com/)
- VivekN (https://github.com/vivekn/sentiment-web)

Sentiment analysis
• The resulting sentiment rate is based on 4 independent rating systems.
• The majority of the blogs have a positive emotion rate.
• The mean sentiment rate is 0.72 ("positive warm").
• All these results are intuitively consistent and agree well with manual tests.

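The slides do not say how the four ratings are merged into a single sentiment rate. As a minimal sketch, assuming each service's score has already been rescaled to the range 0..1 and that a plain average is used (both are assumptions, as are the function name and the sample values), the combination could look like this:

```python
from statistics import mean


def combined_sentiment(scores):
    """Average per-service sentiment scores, each assumed to lie in [0, 1]."""
    if not scores:
        raise ValueError("at least one score is required")
    for service, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{service} score {value} is outside [0, 1]")
    return mean(scores.values())


# Hypothetical scores for one blog post from the four services.
example = {
    "sentity": 0.80,
    "twinword": 0.70,
    "textualinsights": 0.60,
    "vivekn": 0.78,
}
print(round(combined_sentiment(example), 2))
```

Averaging keeps any single service from dominating; a weighted mean or a median would be equally plausible choices.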
Estimation of blog efficiency
We used several traffic rank systems:
• Alexa Rank, which audits and publishes the frequency of visits to various websites.
• Yandex Thematic Citation Index (TIC), which determines the "credibility" of Internet resources based on a qualitative assessment of links from other sites.
• Google PageRank, which counts the number and quality of links to a blog to produce a rough estimate of how important the website is.

Relevance rate
The content relevance rate is based on fuzzy string matching:
- Every company product name was string-matched against all of the blogs.
- String matching is based on the Levenshtein metric.
- Pages with a 90% matching rate were marked.
- Tests with direct brand-name matching showed about 90-100% accuracy for each product name, depending on the words in the title.
- The resulting relevance rate for each author is the sum of the marks of all of his or her pages.

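The marking and summing steps can be sketched as follows. This is only an illustration under stated assumptions: product names are compared against page titles with a sliding window of words, the matching rate is normalized by the longer string, and all function names and sample data are hypothetical.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def match_rate(name, text):
    """Best similarity (0-100) between `name` and any word window of `text`."""
    target = " ".join(name.lower().split())
    words = text.lower().split()
    n = len(target.split())
    best = 0.0
    for i in range(max(1, len(words) - n + 1)):
        window = " ".join(words[i:i + n])
        longest = max(len(target), len(window))
        if longest == 0:
            continue
        score = 100.0 * (1 - edit_distance(target, window) / longest)
        best = max(best, score)
    return best


def relevance_by_author(pages, product_names, threshold=90):
    """Sum one mark per (page, product) pair whose match rate meets the threshold."""
    marks = defaultdict(int)
    for author, title in pages:
        for name in product_names:
            if match_rate(name, title) >= threshold:
                marks[author] += 1
    return dict(marks)


# Hypothetical pages as (author, title) pairs.
pages = [
    ("Allison", "Review: BlaBlaBla Body Oil"),
    ("Allison", "My morning routine"),
    ("Poonam", "BlaBlaBla Bodi Oil first impressions"),
]
print(relevance_by_author(pages, ["BlaBlaBla Body Oil"]))
```

The misspelled "Bodi" still clears the 90% threshold, which is the point of fuzzy rather than exact matching.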
Levenshtein distance
Levenshtein distance is a string metric for measuring the difference between two sequences.
Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
Example: the Levenshtein-based similarity score between 'beer' and 'bread' is 44/100 (the raw edit distance is 3).

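A minimal sketch of the metric and of one common way to turn it into a 0-100 score. The normalization used here (a substitution counted as a deletion plus an insertion, divided by the total length of both strings) is an assumption about how a figure like 44/100 can arise; the slides do not state which library or formula was used.

```python
def levenshtein(a, b, sub_cost=1):
    """Minimum total cost of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else sub_cost
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Normalized similarity in 0..100, with substitutions costing 2 edits."""
    total = len(a) + len(b)
    if total == 0:
        return 100
    return round(100 * (total - levenshtein(a, b, sub_cost=2)) / total)


print(levenshtein("beer", "bread"))
print(similarity("beer", "bread"))
```

For 'beer' and 'bread' this gives an edit distance of 3 and a similarity of 44.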
Sentiments vs Blog size
The most active authors write within a narrow sentiment-rate range: 0.74 +/- 0.03.
[Plot axes: Sentiment rate, Blog size (pages)]

Discussion vs Blog size
The most discussed blogs belong to authors with mid-sized blogs.
[Plot axes: Log(Blog size), Mean discussion]

Words vs Pages
Again, 2 kinds of bloggers:
- 'Twitter-style' authors with short posts
- full-length bloggers
[Plot axes: Log(mean words per page), Log(Blog size)]

Discussion vs Sentiments
If you want to spark a big discussion, you should praise something.
All highly discussed authors have a positive sentiment rate (>= 0.4).
[Plot axes: Sentiment rate, Mean discussion]

Using the Klout score for bloggers
We use the Klout service to rank authors according to their online social influence.
Klout measures the size of a user's social media network and correlates the content created to measure how other users interact with that content.
- The median Klout score is 40.1.

Sentiments vs Klout score
One can distinguish a population of beginner bloggers with a low Klout score who tend to amplify sentiments.
[Plot axes: Sentiment rate, Klout score]

The final author rating is based on:
• Number of blog pages
• Mean discussion size
• Alexa Rank + Yandex TIC + Google PageRank
• Relevance rate
• Sentiment rate
• Klout score

Make your data clever
Sentiment analysis: 4 independent sentiment rating systems are combined; the resulting sentiment rate is fully consistent with tests.
Blog efficiency rating: Alexa Rank, Yandex Thematic Citation Index, Google PageRank.
Blog relevance rating: based on fuzzy string matching; blogs are rated according to mentions of company products in the text.
• list of the most PR-effective authors
• pragmatic statistical information
• key recommendations for bloggers

Top-rated authors
Name | URL | Sentiment | Pages | Mean comments
Hayley Carr | http://www.londonbeautyqueen.com | 0.71 | 229 | 10.9
Luzanne | http://pinkpeonies.co.za | 0.77 | 66 | 68.3
Allison | http://www.neversaydiebeauty.com | 0.70 | 182 | 42.9
Mica Kelly, Beth, Jessica Diner | http://blog.birchbox.co.uk | 0.74 | 196 | 0.26
Poonam | http://beautyandmakeupmatters.com | 0.78 | 142 | 4.3
Silvie | http://mysillylittlegang.com | 0.74 | 571 | 0.64

Testing the result
Hayley Carr (top-rated author):
"BlaBlaBla is definitely a brand to be reckoned with... All of the BlaBlaBla products have multiple purposes, as well as smelling and feeling fabulous; the packaging is clean and fresh whilst still looking great in your bathroom, as well as having unique application methods that only aid the product performance... It's definitely worth checking out this growing brand, before it starts taking over the world."

Authors	←→		Products
In order to associate a blogger with a product we must:
• Find products for promotion
• Find the main topics of each blogger
• Match the topics of each blogger with product names
• Find the best combinations of blogger and product

Finding the most promising products for promotion

Finding topics: the document-term matrix
Let's build a document-term matrix, where each row is a document, each column is a term, and color intensity indicates that a term appears in a document at least once.
We can use the TF-IDF method to get the document-term matrix.

Finding topics: TF-IDF
• Term frequency TF(t,d) is the number of times that term t occurs in document d.
• The inverse document frequency (IDF) is a measure of how much information the word provides, that is, whether the term is common or rare across all documents.
• Term frequency–inverse document frequency (TF-IDF) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

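The definitions above can be turned into a small TF-IDF document-term matrix. This is a sketch using one common IDF variant, log(N / df), with whitespace tokenization; both choices and the sample documents are assumptions, and library implementations typically add smoothing and normalization.

```python
import math
from collections import Counter


def tf_idf_matrix(docs):
    """Return (vocabulary, matrix) with one row per document, one column per term."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    matrix = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts; missing terms count as 0
        matrix.append([tf[term] * idf[term] for term in vocab])
    return vocab, matrix


# Hypothetical mini-corpus of post titles.
docs = ["face oil review", "body oil review", "lipstick swatches"]
vocab, matrix = tf_idf_matrix(docs)
for term, weight in zip(vocab, matrix[0]):
    print(f"{term:10s} {weight:.3f}")
```

In the first document, "face" gets a higher weight than "oil" because "oil" also appears in another document.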
Topic extraction: NMF
• NMF (non-negative matrix factorization) is a variant of matrix factorization where we start with the document-term matrix D, approximate it as a product D ≈ W·T, and constrain the elements of W and T to be non-negative.
• This lets us interpret each row of the T matrix as a topic.

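To make the factorization concrete, here is a minimal pure-Python sketch of NMF using the classic multiplicative update rules. The slides do not say which solver was used, so the update scheme, iteration count, random initialization and the toy matrix are all assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(d, k, iters=200, seed=0, eps=1e-9):
    """Factor d (n x m, non-negative) into w (n x k) and t (k x m), both non-negative."""
    rng = random.Random(seed)
    n, m = len(d), len(d[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    t = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # T <- T * (W^T D) / (W^T W T)
        wt = transpose(w)
        num = matmul(wt, d)
        den = matmul(matmul(wt, w), t)
        t = [[t[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (D T^T) / (W T T^T)
        tt = transpose(t)
        num = matmul(d, tt)
        den = matmul(w, matmul(t, tt))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, t


def squared_error(d, w, t):
    approx = matmul(w, t)
    return sum((x - y) ** 2 for rd, ra in zip(d, approx) for x, y in zip(rd, ra))


# Toy document-term matrix: documents 0-1 use terms 0-1, documents 2-3 use terms 2-3.
D = [
    [2, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 1, 1],
]
W, T = nmf(D, k=2)
print("reconstruction error:", round(squared_error(D, W, T), 4))
for topic in T:
    print([round(x, 2) for x in topic])
```

Each row of T weights the terms of one topic, and each row of W says how strongly a document expresses each topic.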
Topic extraction
• For each author we build a document-term matrix.
• For each document-term matrix we perform matrix factorization and find the main topics.
• For each product we match the product name with the author's main topics and find the rate of intensity.
• If an author has the exact product name in one of his or her titles, we set the rate of intensity to 0 (the author has already reviewed the product).

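The slides do not define how the rate of intensity is computed from the topics. The sketch below assumes one simple possibility: for each topic, sum the weights of the words in the product name and take the best topic. The zeroing rule for already-reviewed products follows the last bullet above. The function name, the topic representation (one term-to-weight dict per topic) and the sample data are hypothetical.

```python
def intensity(product_name, topics, titles):
    """Rate how well a product name fits an author's topics (assumed scoring)."""
    name = product_name.lower()
    # The author has already reviewed the product: exact name found in a title.
    if any(name in title.lower() for title in titles):
        return 0.0
    words = name.split()
    best = 0.0
    for topic in topics:
        # Sum of topic weights over the words of the product name.
        score = sum(topic.get(word, 0.0) for word in words)
        best = max(best, score)
    return best


# Hypothetical topics for one author, as term -> weight mappings.
topics = [
    {"oil": 0.6, "face": 0.3, "serum": 0.1},
    {"lipstick": 0.9, "swatches": 0.4},
]
print(intensity("Face Oil", topics, ["My lipstick haul"]))
print(intensity("Face Oil", topics, ["Face Oil review"]))
```

Computing this for every author-product pair yields a matrix with one intensity value per pair.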
The intensity matrix
Thus, for each author-product pair we find the rate of intensity, and we can visualize it as a heatmap where products are sorted by mean rate of intensity and authors are sorted by author rating.
Note: the top-rated authors show high intensity in the matrix.

The intensity matrix
Next we extract the most prominent peaks from the product-author intensity matrix. After each peak is extracted, the column containing it is dropped, so each author gets only one product.
We need to build recommendations for only 4 products, and we can select the 40 best-rated authors for this task.

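The peak extraction described above reads like a greedy procedure, which can be sketched as follows. The matrix layout (rows are products, columns are authors), the function name, the `n_peaks` parameter and the sample values are assumptions made for illustration.

```python
def extract_peaks(intensity, products, authors, n_peaks):
    """Greedily pick the largest remaining cell, then drop that author's column."""
    if not products:
        return []
    remaining = list(range(len(authors)))  # author columns still available
    pairs = []
    while remaining and len(pairs) < n_peaks:
        value, p, a = max(
            ((intensity[p][a], p, a)
             for p in range(len(products))
             for a in remaining),
            key=lambda item: item[0],
        )
        pairs.append((products[p], authors[a], value))
        remaining.remove(a)  # each author is matched to at most one product
    return pairs


# Hypothetical 2 products x 3 authors intensity matrix.
products = ["Face Oil", "Body Oil"]
authors = ["A", "B", "C"]
matrix = [
    [0.9, 0.2, 0.4],
    [0.8, 0.7, 0.1],
]
print(extract_peaks(matrix, products, authors, n_peaks=2))
```

Because only columns are dropped, a product can still be matched to several authors; dropping the row as well would force one author per product.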
In order to associate a blogger with a product we must:
• Find products for promotion
• Find the main topics of each blogger
• Match the topics of each blogger with product names
• Find the best combinations of blogger and product
• Profit!

The resulting associations
Product | Author | URL
BlaBlaBla Body Oil | Allison | http://www.neversaydiebeauty.com
BlaBlaBla Wrinkle Repair | Cindy Batchelor | http://mystylespot.net
BlaBlaBla Face Serum | Marie Papachatzis | http://iamthemakeupjunkie.blogspot.ru
BlaBlaBla Face Oil | Emily - Style Lobster | http://stylelobster.com

Data Science Weekend 2017. CleverDATA. Text mining of beauty blogs: what do women talk about?
