Vegas
The Missing Matplotlib for Scala/Spark
DB Tsai
Roger Menezes
Homepage Kids Page Downloads Page
Netflix Recommendations
Every aspect of the Experience is Machine Learned
2017
> 100M members
> 190 countries
Multiple Devices
Genres: 23 rows/page average
Sims: 10 rows/page average
My List:
Continue Watching:
Popular on Netflix:
Trending Now:
Watch It Again:
Top Picks:
Because You Watched:
Genres:
New Releases:
Recently Added:
Originals Row
Billboard:
Machine Learning at Netflix
● Optimize for the experimentation use case vs. productionization
● Experimentation
○ Opportunity sizing, Data Exploration
○ Feature Identification and Selection
○ Tweaks to ML algos
○ Model Evaluation
Experimenter’s loop
Problem
Explore Data
Identify Features
Produce Model
Evaluate Model
Share Findings
Notebooks
● Optimal for Experimentation
● Sharing reproducible research
○ Facilitates feedback loop with Product Managers
● End to end ML experiment.
○ Interactivity drives productivity
Python Notebooks
● Seamless Experience - ML experimentation
● Well known Scientific computing libraries
● Huge catalog of Visualization plotting libraries
○ Matplotlib, Seaborn, Bokeh, BQPlot, Lightning, etc.
Scala Notebooks
● Zeppelin, Jupyter, Databricks, Spark-Notebooks, ...
● Gap in computing libraries is closing
● Lack of Visualization Libraries
○ Main friction point in adoption
○ End to End ML use case not convincing
Introducing Vegas
● Visualization Library in Scala
● Mainly built for the notebook use case
● Scala wrapper around Vega-Lite
○ The missing Matplotlib for the Scala/Spark world.
DECLARATIVE STATISTICAL VISUALIZATION GRAMMAR IN SCALA
You tell it WHAT should be done with the data, and it knows HOW to do it!
Operations such as filtering, aggregation, and faceting are built into the visualization, rather than putting the burden on the user to massage the data into shape.
Complex visualizations can be built with a few high-level abstractions:
DATA
TRANSFORMS
SCALES
GUIDES
MARKS
cf.: Altair talk by Brian Granger at PyData 2016, https://youtu.be/v5mrwq7yJc4
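The idea that filtering and aggregation live inside the plot description, rather than in data-massaging code written by the user, can be sketched in plain Scala. This is a toy illustration only and not the Vegas API: `Row`, `BarSpec`, and `evaluate` are hypothetical names, and the sample rows are made up for the example.

```scala
object DeclarativeSketch {
  // Hypothetical record type; the numbers below are illustrative, not real data.
  final case class Row(country: String, year: Int, population: Double)

  // A declarative bar-chart description: it names WHAT to show
  // (grouping field, measured field, aggregate, filter) and nothing about HOW.
  final case class BarSpec[R](
    x: R => String,
    y: R => Double,
    aggregate: Seq[Double] => Double,
    filter: R => Boolean
  )

  // The "engine" decides HOW: filter, group by x, aggregate y, sort bars by label.
  def evaluate[R](spec: BarSpec[R], data: Seq[R]): Seq[(String, Double)] =
    data
      .filter(spec.filter)
      .groupBy(spec.x)
      .map { case (label, rows) => label -> spec.aggregate(rows.map(spec.y)) }
      .toSeq
      .sortBy(_._1)

  val rows: Seq[Row] = Seq(
    Row("USA", 2000, 282.0), Row("USA", 2010, 309.0),
    Row("UK", 2000, 59.0), Row("UK", 2010, 63.0)
  )

  // Total population per country, restricted to the year 2010.
  val spec: BarSpec[Row] =
    BarSpec[Row](_.country, _.population, _.sum, _.year == 2010)

  def main(args: Array[String]): Unit =
    evaluate(spec, rows).foreach { case (label, value) => println(s"$label: $value") }
}
```

Changing the chart means changing the description (a different filter or aggregate), not rewriting the data preparation.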
Added Bonus of Declarative Visualizations: INTERACTIVITY!
D3JS
VEGAS
VEGAS CODE EXPANDS OUT TO D3JS CODE!
Anatomy of a plot: Channels
X/Y channel
Shape Channel
Size Channel
Color Channel
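In Vega-Lite, each channel becomes a key in the spec's "encoding" object, pointing at a data field and its type. The sketch below shows that mapping in plain Scala; it is not the Vegas API. `Channel`, `Mapping`, and `encodingJson` are hypothetical names, and the column names are placeholders.

```scala
object ChannelSketch {
  // The channels named on the slide, with the keys Vega-Lite uses for them.
  sealed abstract class Channel(val key: String)
  case object X extends Channel("x")
  case object Y extends Channel("y")
  case object Shape extends Channel("shape")
  case object Size extends Channel("size")
  case object Color extends Channel("color")

  // One mapping from a data column to a channel; dataType is a Vega-Lite
  // type name such as "quantitative" or "nominal".
  final case class Mapping(channel: Channel, field: String, dataType: String)

  private def quote(s: String): String = "\"" + s + "\""

  // Render the mappings as a Vega-Lite "encoding" object.
  // No JSON escaping is done: field names are assumed to be plain identifiers.
  def encodingJson(mappings: Seq[Mapping]): String =
    mappings.map { m =>
      quote(m.channel.key) + ": {" +
        quote("field") + ": " + quote(m.field) + ", " +
        quote("type") + ": " + quote(m.dataType) + "}"
    }.mkString("{", ", ", "}")

  def main(args: Array[String]): Unit =
    println(encodingJson(Seq(
      Mapping(X, "Horsepower", "quantitative"),   // placeholder column names
      Mapping(Y, "Miles_per_Gallon", "quantitative"),
      Mapping(Color, "Origin", "nominal")
    )))
}
```

Modelling channels as a sealed type means a misspelled channel is a compile error rather than a silently ignored JSON key.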
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger Menezes
Features…
1. Supports most plot types
2. Trellis plots
3. Layers
Layer 1.
Layer 2.
Layer 3.
4. Notebooks and consoles
5. Built-in Spark support
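import vegas._
import vegas.sparkExt._            // adds withDataFrame

Vegas("Population by age")         // description string is illustrative
  .withDataFrame(myDataFrame)      // pass in the DataFrame
  .encodeX("population", Quant)    // mapped columns; Quant is assumed, the slide omits types
  .encodeY("age", Quant)
  .mark(Point)                     // mark is assumed, the slide omits it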
6. Visual statistics
● Advanced Binning
● Sorting
● Scaling
● Custom Transforms
● Time Series
● Aggregation
● Filtering
● Math functions (log, etc.)
● Descriptive Statistics
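To make two of these items concrete, here is a plain-Scala sketch of what binning with a count aggregate computes (the numbers behind a histogram) and of one descriptive statistic. It only illustrates the computations and is not how Vegas or Vega-Lite implement them; `binCounts` and `mean` are hypothetical helpers.

```scala
object BinningSketch {
  // Bin values into fixed-width buckets and count each bucket:
  // the computation behind a histogram (binned x, count on y).
  def binCounts(values: Seq[Double], step: Double): Seq[(Double, Int)] = {
    require(step > 0, "step must be positive")
    values
      .groupBy(v => math.floor(v / step) * step) // lower edge of the bucket
      .map { case (lowerEdge, vs) => lowerEdge -> vs.size }
      .toSeq
      .sortBy(_._1)
  }

  // Descriptive statistic: arithmetic mean, or None for empty input.
  def mean(values: Seq[Double]): Option[Double] =
    if (values.isEmpty) None else Some(values.sum / values.size)

  def main(args: Array[String]): Unit = {
    val ages = Seq(1.0, 4.0, 12.0, 17.0, 25.0) // made-up sample values
    binCounts(ages, 10.0).foreach { case (edge, n) => println(s"[$edge, ${edge + 10.0}): $n") }
    println(s"mean = ${mean(ages)}")
  }
}
```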
How It Works!
1. Specify in Scala
2. Embed HTML (iframe)
3. Render within the iframe using JS
MORE ABSTRACTION (most abstract first)
VEGAS: A SCALA DSL FOR VEGA-LITE
VEGA-LITE*
VEGA
D3JS
SCALA DSL EMITS TYPE-CHECKED VEGA-LITE JSON
VEGA-LITE CONVERTS INTERNALLY TO VEGA JSON SPEC
VEGA TRANSLATES JSON TO D3JS CODE THAT CAN BE VERY VERBOSE
* Vega-Lite
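The three steps can be sketched with plain string building. This is a minimal illustration under stated assumptions, not Vegas's actual renderer: the object and method names are hypothetical, the CDN URLs and versions are assumptions, and the spec is hand-written where a DSL would generate it. `vegaEmbed` is the entry point of the vega-embed JavaScript library.

```scala
object EmbedSketch {
  // Assumed CDN locations for the JS runtimes; pin versions that match your spec.
  val scripts: Seq[String] = Seq(
    "https://cdn.jsdelivr.net/npm/vega@5",
    "https://cdn.jsdelivr.net/npm/vega-lite@5",
    "https://cdn.jsdelivr.net/npm/vega-embed@6"
  )

  // A standalone HTML page that renders a Vega-Lite spec with vegaEmbed.
  // The spec is inlined as a JS object literal, so it must be valid JSON.
  def page(specJson: String): String = {
    val tags = scripts.map(u => "<script src=\"" + u + "\"></script>").mkString
    "<html><head>" + tags + "</head><body><div id=\"vis\"></div>" +
      "<script>vegaEmbed('#vis', " + specJson + ");</script></body></html>"
  }

  // Wrap the page in an iframe. srcdoc is an HTML attribute,
  // so ampersands and double quotes in the page must be escaped.
  def iframe(html: String): String = {
    val escaped = html.replace("&", "&amp;").replace("\"", "&quot;")
    "<iframe srcdoc=\"" + escaped + "\" style=\"border: none; width: 100%\"></iframe>"
  }

  // Stand-in for step 1: a hand-written Vega-Lite bar chart spec.
  val spec: String =
    """{"data": {"values": [{"a": "A", "b": 28}, {"a": "B", "b": 55}]}, "mark": "bar", "encoding": {"x": {"field": "a", "type": "nominal"}, "y": {"field": "b", "type": "quantitative"}}}"""

  def main(args: Array[String]): Unit =
    println(iframe(page(spec))) // a notebook would display this string as HTML
}
```

Rendering inside an iframe keeps the chart's scripts and styles isolated from the notebook page that hosts it.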
What’s coming
1. Interactive selections
2. Selections transforms
Contributors
Aish DB Roger
Sudeep Jeremy
Thank you.
@NetflixResearch
@rogermenezes @dbtsai
The missing Matplotlib for Scala/Spark
http://vegas-viz.org
