Analyze radio stations broadcasts
with Apache Spark SQL,
Spotify, and Databricks
Spark User Group Paris - May 2017
Galenki, Russia
1. Spark SQL
a. Dataset API
b. Parquet
c. Databricks
2. Data extraction
3. Data exploration
Paul Leclercq
@polomarcus
Ad tech for 3 years at :
Data Engineer
● Spark: Streaming, SQL, MLlib
● Scala
● Kafka
● NoSQL
Looking for his dream job in data,
in the music/sport/cool-stuff industry :)
Data people
Engineer: stores and indexes high volumes of raw data, implements machine learning algorithms
Hadoop, Amazon S3, Kafka, RabbitMQ, Spark, Flink, Beam, Drill, Druid, NoSQL DBs: Cassandra, Redis, Aerospike
Scientist: PhD or mathematics degree; builds machine learning algorithms that can predict business actions
Machine learning/Statistics tools: Scikit-learn, MLlib
Business Analyst: uses the data provided for business purposes
Tools with UI: Excel, Chart.io, Talend, Superset, Pivot
Why I love Spark
“Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.”
Scalable: ops and code
A unified distributed engine to process data: batch, streaming, and ML
Spark Usage 2016 Survey
Spark SQL
“Apache Spark's module for working with structured data”
● Access a variety of data sources: Hive, Avro, Parquet, ORC, JSON, JDBC
● Plug in Tableau, Chart.io, Power BI, Excel… thanks to the JDBC or ODBC driver
● ~ ANSI SQL:2003
● DataFrame / Dataset
○ Since Spark 2.0, the primary Machine Learning API
○ Also used in Structured Streaming (still ALPHA in Spark 2.1)
Spark SQL - RDD and Dataset (and Dataframe)
RDD = strong typing, lambda functions, DAG
Dataset = built on top of RDDs + optimized execution engine + in-memory columnar storage + convenient access to data by column name: ds.map(_.myColumn)
DataFrame = Dataset[Row]
From “High performance Spark” by Holden Karau and Databricks’ blog
Spark SQL - RDD and Dataset
Plain SQL Query or Dataset API
spark.sql("""
SELECT title, artist
FROM datasetTable
"""
)
dataset.select($"title",$"artist")
Spark SQL - Catalyst Queries Optimizer
● General tree transformation framework : Scala’s abstract syntax tree (AST)
● Let the optimizer do the hard work : optimizations happen as late as possible
● Read as little data as possible: partitions, columnar format, statistics metadata (min, max, dictionary), predicate pushdown into the storage system (e.g. a Postgres-specific query)
Protip: spark.sql(SQL_QUERY).explain(extended = true) or Spark UI SQL page
Spark SQL - Catalyst Queries Optimizer
● No language jealousy: Spark’s Dataset APIs have the same performance in every supported language
Storage: Parquet
● Columnar storage
● Optimized I/O
○ Column pruning
○ Predicate pushdown (statistics filters: size, max, min, dictionary)
● Popular and interoperable, supported by many other data processing systems
● Supports schema evolution, nullable=true
● Simple to use with Spark
○ df.write.format("parquet").save("nrjnovavirginskyrock.parquet")
○ spark.read.parquet("nrjnovavirginskyrock.parquet")
○ df.write.partitionBy("radio").parquet("radioPartitionedByRadio.parquet")
Protips:
● For your test jobs:
○ df.write.mode(SaveMode.Overwrite).save("test.parquet")
○ Otherwise they can fail because file already exists
● Learn from the best
○ Parquet’s Julien Le Dem: How to use Parquet
○ Netflix’s Ryan Blue: Parquet performance tuning: the missing guide
What’s awesome about Databricks?
● Collaboration via notebooks
● Free community edition with a 6 GB RAM server, ready to go: https://community.cloud.databricks.com/
● Awesome and simple data viz
And also:
● Mixing languages in a notebook, including Markdown (see demo later)
● Cost management (AWS Spot instances)
● REST API, Jobs, Security...
What about an open-source solution?
● Notebooks: Apache Zeppelin
● Managed Spark clusters on AWS or GCP
Getting the radio stations data - Scala scraper
From “what was this title?” HTML pages or REST APIs:
● http://www.nrj.fr/chansons-diffusees?__postedForm=broadcastedhitdate&date=1970/01/01 00:00
● http://www.novaplanet.com/radionova/cetaitquoicetitre/$timestamp
● https://www.virginradio.fr/cetait-quoi-ce-titre?date=1970-01-01&hour=00&minute=00
● http://skyrock.fm/api/v3/sound?search_date=1970-01-01&search_hour=00:00
A good real-life experience of extracting data:
● Slow or fast servers
● Different conventions for joining several artists: "&", "AND", or "/"
● Different formats: HTML page / JSON
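The stations join several artists with different separators ("&", "AND", "/"), so the scraped artist field has to be normalized before songs can be compared across radios. The following is only a minimal sketch of that idea in plain Scala, not the scraper's actual code: the object name, the function name, and the regular expression are all assumptions made for illustration.

```scala
object ArtistNames {
  // Hypothetical separator pattern: "&", "/", or the standalone word "and" (any case).
  // Assumption: no artist name legitimately contains one of these separators.
  private val Separator = """(?i)\s*(?:&|/|\band\b)\s*""".r

  // Split a raw artist field into a list of trimmed, non-empty artist names.
  def splitArtists(raw: String): List[String] =
    Separator.split(raw).map(_.trim).filter(_.nonEmpty).toList

  def main(args: Array[String]): Unit = {
    println(splitArtists("Artist 1 & Artist 2"))
    println(splitArtists("Artist 1 AND Artist 2"))
    println(splitArtists("Artist 1 / Artist 2"))
  }
}
```

The stated assumption is a real limitation: a band such as Earth Wind & Fire would be split in two, so a production version would likely need a list of exceptions.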
Data from the radio stations
case class Song(timestamp:Int, humanDate:Long, year:Int, month:Int, day:Int,
hour:Int, minute: Int, artist:String, allArtists: String, title:String, radio:String)
val dataset = spark.read.myformat("myfile").as[Song]
dataset.show() or display(dataset) on Databricks:
Data from the radio stations
dataset.show()
dataset.show(numberOfRows, truncate = false)
https://developer.spotify.com/web-api/console/
● Audio features of a track : danceability, positiveness, energy
● Artist : music genre
● Search a track
Positiveness/Valence: September — Earth Wind & Fire, Ska-Boo-Da-Ba — The Skatalites or Hey Ya! — OutKast
Danceability: Trick Me — Kelis, Around the world — Daft Punk or Anaconda — Nicki Minaj
Energy : We Are Your Friends - JUSTICE, Steppin’ stone - Davy Jones, Jerk It Out — Caesars
Number of songs * (Artist + track + audiofeatures) = 24K requests
→ Avoid surprises: always think about how large your data is before performing an action
● Is the destination server’s disk big enough? Is the server powerful enough?
● Is there a 3rd-party rate limit? Will other applications need this service too?
● Network cost?
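The 24K figure comes from roughly 8K distinct songs times three lookups per song (artist, track, audio features). A small sketch of this back-of-the-envelope estimate is given here; the rate of 10 requests per second is a purely hypothetical value chosen for illustration, not a documented Spotify limit, and the object and function names are made up.

```scala
object RequestBudget {
  // Total API calls = number of distinct songs * lookups needed per song.
  def totalRequests(songs: Int, callsPerSong: Int): Int = songs * callsPerSong

  // Minutes needed to send `requests` calls at a sustained rate (requests per second).
  def minutesAt(requests: Int, requestsPerSecond: Double): Double =
    requests / requestsPerSecond / 60.0

  def main(args: Array[String]): Unit = {
    // 8K songs, 3 lookups each: artist + track + audio features.
    val total = totalRequests(songs = 8000, callsPerSong = 3)
    println(s"requests: $total")
    // Hypothetical rate, only to show how the estimate is used.
    println(s"minutes at 10 req/s: ${minutesAt(total, 10.0)}")
  }
}
```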
Data from Spotify
dataframe.show() / display(dataframe) on Databricks
Why a DataFrame and not a Dataset? → dataframe.printSchema
root
|-- tracks: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- album: struct (nullable = true)
| | | |-- album_type: string (nullable = true)
| | | |-- artists: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- external_urls: struct (nullable = true)
| | | | | | |-- spotify: string (nullable = true)
| | | | | |-- href: string (nullable = true)
| | | | | |-- id: string (nullable = true)
| | | | | |-- name: string (nullable = true)
| | | | | |-- type: string (nullable = true)
| | | | | |-- uri: string (nullable = true)
| | | |-- available_markets: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- external_urls: struct (nullable = true)
| | | | |-- spotify: string (nullable = true)
| | | |-- href: string (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- images: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- height: long (nullable = true)
| | | | | |-- url: string (nullable = true)
| | | | | |-- width: long (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uri: string (nullable = true)
dataframe.printSchema
Not really big data… and I am ok with that!
+300K rows of broadcasts of 8K different songs
● Nova : 95K broadcasts of 5000 different songs
● NRJ : 50K broadcasts of 800 different songs
● Virgin: 60K broadcasts of 1200 different songs
● Skyrock: 100K broadcasts of 1000 different songs
Protip: dataset.sample(withReplacement, fraction)
How many songs by day ?
SELECT COUNT(*) AS number_songs_broadcasted, DATE_FORMAT(CAST(timestamp AS timestamp), 'yyyy-MM-dd') AS date, radio
FROM nrjnova
GROUP BY DATE_FORMAT(CAST(timestamp AS timestamp), 'yyyy-MM-dd'), radio
ORDER BY date
DataFrame API
// requires: import org.apache.spark.sql.functions.date_format
nrjnova.select(date_format($"timestamp".cast("timestamp"), "yyyy-MM-dd").alias("date"), $"radio")
.groupBy($"date", $"radio") // group on the columns kept by the select
.count()
.orderBy($"date".asc) // sort after aggregating, like the SQL version
How many songs by day ?
How many different songs by month?
Radio brainwashing ?
Same song by day
Music genres by radio
Genre info by artist only → ["alternative dance","chamber pop","dance-punk","electronic","garage rock","indie pop","indie r&b","indie rock","indietronica","new rave","synthpop"]
import org.apache.spark.sql.functions.explode
val genres = TrackArtistAudioFeature.select($"name", explode($"genres"),
$"tracks.name",$"radio").toDF("artist", "genres","title","radio")
genres.createOrReplaceTempView("genres")
genres.cache()
Music genres by radio
SELECT COUNT(DISTINCT genres) AS number_of_genres, radio
FROM genres
GROUP BY radio
ORDER BY number_of_genres DESC
Music genres by radio
Is Skyrock really “first on rap” ?
SELECT COUNT(genres) AS number_of_hip_hop_songs, genres, radio
FROM genres
WHERE genres LIKE '%rap%' OR genres LIKE '%hip%' OR genres LIKE '%hop%'
GROUP BY genres, radio
HAVING COUNT(genres) > 50
ORDER BY number_of_hip_hop_songs DESC
Is Skyrock really “first on rap” ?
Songs duration distribution
SELECT ROUND( (COUNT(t.*) / subTotal.total_radio * 100),2) AS percentage_of_songs, subTotal.total_radio,
FLOOR((duration_ms / 1000 ) / 60) AS minute, ROUND( (((duration_ms / 1000 ) % 60)) / 10) * 10 AS second,
t.radio
FROM AudioFeatureArtistTrackRadios t
JOIN (
SELECT count(*) AS total_radio, radio
FROM AudioFeatureArtistTrackRadios
GROUP BY radio
) AS subTotal
ON subTotal.radio = t.radio
GROUP BY 2, 3, 4, 5 -- column 1 is the aggregate itself, so it cannot be a grouping key
ORDER BY minute, second
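The bucketing done by the minute and second expressions in this query can be checked outside Spark. The sketch below mirrors the two SQL expressions in plain Scala; the object and function names are illustrative assumptions, and the rounding matches Spark's ROUND only for the non-negative values used here.

```scala
object DurationBuckets {
  // Mirrors FLOOR((duration_ms / 1000) / 60) AS minute and
  // ROUND(((duration_ms / 1000) % 60) / 10) * 10 AS second.
  def bucket(durationMs: Long): (Long, Long) = {
    val seconds = durationMs / 1000.0                    // SQL "/" is a floating-point division
    val minute  = math.floor(seconds / 60).toLong        // whole minutes
    val second  = math.round((seconds % 60) / 10) * 10   // remaining seconds, rounded to the nearest 10
    (minute, second)
  }

  def main(args: Array[String]): Unit = {
    println(bucket(215000L)) // 3 min 35 s
    println(bucket(238000L)) // 3 min 58 s
  }
}
```

One thing this makes visible: because the seconds are rounded rather than floored, a song lasting 3 min 58 s lands in the bucket (3, 60) rather than (4, 0).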
Songs duration distribution
Percentage of music by day
SELECT AVG(number_songs_broadcasted) * 3.3 / (24 * 60) * 100 AS percent_of_music, radio
FROM (
SELECT COUNT(*) AS number_songs_broadcasted, DATE_FORMAT(CAST(timestamp AS timestamp), 'yyyy-MM-dd') AS date, radio
FROM nrjnova
GROUP BY DATE_FORMAT(CAST(timestamp AS timestamp), 'yyyy-MM-dd'), radio
HAVING COUNT(*) > 0 -- avoid radio stations’ system bug
ORDER BY date
)
GROUP BY radio
where 3.3 = average song duration in minutes, and 24 * 60 = total minutes by day
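The formula in this query is simple arithmetic: average songs per day, times the average song duration (3.3 minutes), divided by the 24 * 60 minutes in a day. A worked sketch in plain Scala follows; the object name, the function name, and the figure of 300 songs per day are illustrative assumptions, not values taken from the dataset.

```scala
object MusicShare {
  val AvgSongMinutes: Double = 3.3 // average song duration in minutes, as in the query
  val MinutesPerDay: Int = 24 * 60 // total minutes by day

  // Percentage of a day filled with music, given the average number of songs aired per day.
  def percentOfMusic(avgSongsPerDay: Double): Double =
    avgSongsPerDay * AvgSongMinutes / MinutesPerDay * 100

  def main(args: Array[String]): Unit = {
    // Hypothetical station airing 300 songs a day: 300 * 3.3 = 990 minutes of music out of 1440.
    println(f"${percentOfMusic(300)}%.2f %% of the day is music")
  }
}
```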
Spark SQL - Percentage of music by day
What’s an average monday ?
SELECT ROUND(AVG(number_of_tracks)) AS number_of_tracks, radio, hour
FROM (
SELECT COUNT(*) AS number_of_tracks, weekofyear(CAST(timestamp AS timestamp)) AS week_number, CAST(DATE_FORMAT(CAST(timestamp AS timestamp), 'k') AS int) AS hour, radio
FROM nrjnova
WHERE DATE_FORMAT(CAST(timestamp AS timestamp), 'EEEE') = "Monday"
GROUP BY weekofyear(CAST(timestamp AS timestamp)), DATE_FORMAT(CAST(timestamp AS timestamp), 'k'), radio
HAVING COUNT(*) > 0 -- avoid radio stations’ system bug
)
GROUP BY hour, radio
ORDER BY hour
What’s an average monday ?
How many minutes of advertising?
Windowing query example - Most broadcasted songs
SELECT COUNT(*), n.title, n.artist, n.radio, rank, month, year
FROM (
SELECT title, artist, radio, number_of_broadcast, dense_rank() OVER (PARTITION BY radio ORDER BY number_of_broadcast DESC) AS rank
FROM (
SELECT COUNT(*) AS number_of_broadcast, title, artist, radio
FROM nrjnova
GROUP BY title, artist, radio
) tmp
) top10
JOIN nrjnova n
ON top10.title = n.title AND top10.artist = n.artist AND top10.radio = n.radio
WHERE rank <= 2
GROUP BY n.title, n.artist, n.radio, rank, month, year
ORDER BY month
Windowing query example - Most broadcasted songs
Similarities between radio stations with unidirectional inequality
SELECT COUNT(DISTINCT n1.artist, n1.title) AS number_of_similar_songs, CONCAT(n1.radio, "-",
n2.radio) AS radios, n1.radio AS radio_1, ROUND(COUNT(DISTINCT n1.artist, n1.title) /
number_of_song_radio_1 * 100) AS percent_radio_1, number_of_song_radio_1, n2.radio as radio_2,
ROUND(COUNT(DISTINCT n1.artist, n1.title) / number_of_song_radio_2 * 100) as percent_radio_2,
number_of_song_radio_2
FROM nrjnova n1
JOIN nrjnova n2
ON n1.radio < n2.radio AND LOWER(n1.artist)=LOWER(n2.artist) AND LOWER(n1.title)=LOWER(n2.title)
GROUP BY n1.radio, n2.radio, number_of_song_radio_1, number_of_song_radio_2
ORDER BY number_of_similar_songs DESC
Similarities between radio stations with unidirectional inequality
JOIN radio n2 ON n1.radio <> n2.radio →
● (nova, virgin)
● (virgin, nova)
JOIN radio n2 ON n1.radio < n2.radio →
● (nova, virgin)
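The effect of the join condition can be reproduced with plain Scala collections. This is only an illustrative sketch (the object and function names are made up): a "not equal" condition yields every pair of stations in both orders, while the strict "less than" condition keeps each pair once.

```scala
object RadioPairs {
  val radios: List[String] = List("nova", "nrj", "skyrock", "virgin")

  // Self-join on "a != b": each pair of different stations appears twice, once per order.
  def orderedPairs(rs: List[String]): List[(String, String)] =
    for { a <- rs; b <- rs if a != b } yield (a, b)

  // Self-join on "a < b": each pair of different stations appears exactly once.
  def uniquePairs(rs: List[String]): List[(String, String)] =
    for { a <- rs; b <- rs if a < b } yield (a, b)

  def main(args: Array[String]): Unit = {
    println(orderedPairs(radios).size) // 4 * 3 ordered pairs
    println(uniquePairs(radios))       // 6 unordered pairs
  }
}
```

With four stations this halves the join output from 12 pairs to 6, which is why the similarity query uses the strict inequality.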
Similarities between radio stations with unidirectional inequality
Common songs between our 4 radios ?
4 joins ??? → Nope
Common songs between our 4 radios ?
SELECT LOWER(title) as Title, LOWER(artist) as Artist, COUNT(DISTINCT (radio))
FROM nrjnova
GROUP BY LOWER(title), LOWER(artist)
HAVING COUNT(DISTINCT (radio)) = ( -- 4, because we have 4 different radios
SELECT MAX (count)
FROM (
SELECT COUNT(DISTINCT (radio)) as count, LOWER(title), LOWER(artist)
FROM nrjnova
GROUP BY LOWER(title), LOWER(artist)
)
)
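The same "group, count distinct radios, keep the groups that reach the maximum" logic can be sketched with plain Scala collections, which makes it clear why no multi-way join is needed. The case class, object name, and sample rows below are illustrative assumptions; only the approach follows the query.

```scala
object CommonSongs {
  final case class Broadcast(title: String, artist: String, radio: String)

  // Songs aired by the largest number of distinct stations (all 4 stations in the deck's data).
  def playedEverywhere(broadcasts: Seq[Broadcast]): Set[(String, String)] = {
    // Distinct stations per (title, artist), compared in lower case like LOWER(...) in the query.
    val radiosPerSong: Map[(String, String), Set[String]] =
      broadcasts
        .groupBy(b => (b.title.toLowerCase, b.artist.toLowerCase))
        .map { case (song, bs) => song -> bs.map(_.radio).toSet }

    if (radiosPerSong.isEmpty) Set.empty
    else {
      // Equivalent of the MAX(...) subquery: the highest number of stations sharing one song.
      val maxRadios = radiosPerSong.values.map(_.size).max
      radiosPerSong.filter { case (_, radios) => radios.size == maxRadios }.keySet
    }
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample rows, for illustration only.
    val sample = Seq(
      Broadcast("Kiss", "Prince", "nova"),
      Broadcast("KISS", "Prince", "nrj"),
      Broadcast("Kiss", "prince", "virgin"),
      Broadcast("Kiss", "Prince", "skyrock"),
      Broadcast("Happy", "C2C", "nova"),
      Broadcast("Happy", "C2C", "virgin")
    )
    println(playedEverywhere(sample))
  }
}
```

Using the maximum instead of a hard-coded 4 keeps the logic valid if a station is added or removed, at the cost of returning the "most shared" songs even when no song is aired everywhere.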
Common songs between radios ?
Prince — Kiss
C2C — Happy
Stromae — Formidable
Spark SQL - Case statement
SELECT CASE artist
WHEN "Drake"
THEN "New drake name"
ELSE artist END AS artist,
title, radio
FROM nrjnova
Resources
Demo’s Notebook available here
“Terra Data” exhibition at Cité des sciences, Paris
EPFL Spark Intro from Heather Miller
Deep Dive into Spark SQL’s Catalyst Optimizer
Mastering Apache Spark 2 by Jacek Laskowski
Unsplash: copyrightless-HD-picture platform
Bonus - Spotify Playlists
~200 most broadcasted songs in 2016 for each radio :
● “Radio Nova Top 2016” with Calypso Rose, Kaytranada, The Roots, M.I.A...
● “Skyrock Top 2016” with Drake, Major Lazer, Timberlake, Soprano, PNL, Jul…
● “Virgin Top 2016” with Imany, Twenty One Pilots, Sia, Kungs, Julian Perretta…
● “NRJ top 2016” with Enrique Iglesias, Soprano, Coldplay, Kungs, Amir, MHD, Tal