Use of standards and related
issues in predictive analytics
KDD 2016, SF 2016-08-16
Paco Nathan, @pacoid

Dir, Learning Group @ O’Reilly Media
PMML referenced by 86 publications in Safari, 2001-2016

https://www.safaribooksonline.com/search/?query=PMML
Pattern: PMML for Cascading and Hadoop

P Nathan, G Kathalagiri (2013-08-11)

https://goo.gl/jk7829
[Diagram: Cascading flow with the labels Customer Orders, Classify, Scored Orders, GroupBy token, Count, PMML Model, M, R, Failure Traps, Assert, Confusion Matrix]
Pattern – score a model, using pre-defined Cascading app
cascading.org/projects/pattern
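
As a rough illustration of what "scoring a model" from PMML involves, here is a minimal Python sketch. It is not the Pattern or Cascading API. The PMML fragment, the field names `order_total` and `item_count`, and the helpers `load_regression` and `score` are all hypothetical, and the fragment omits the XML namespace and DataDictionary that real PMML files carry.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written PMML fragment for illustration only.
# Real PMML declares an XML namespace and a DataDictionary; both are omitted.
PMML_DOC = """
<PMML version="4.2">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="order_total" coefficient="0.02"/>
      <NumericPredictor name="item_count" coefficient="-0.1"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_regression(pmml_text):
    """Read the intercept and per-field coefficients from the fragment."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find("./RegressionModel/RegressionTable")
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("NumericPredictor")
    }
    return intercept, coefficients


def score(model, record):
    """Apply the linear model: intercept + sum(coefficient * field value)."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


model = load_regression(PMML_DOC)
print(score(model, {"order_total": 100.0, "item_count": 5.0}))
```

The point of the sketch is that the model travels as data: the scoring code knows nothing about how or where the coefficients were trained.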
[Diagram, circa 2010, with phases labeled representation, optimization, evaluation. Labels: ETL into cluster/cloud; data; Data Prep; Features; Learners, Parameters; Unsupervised Learning; Explore; train set; test set; models; Evaluate; Optimize; Scoring; production data; visualize, reporting; use cases; data pipelines; actionable results; decisions, feedback; bar developers; foo algorithms]
Algorithms and developer-centric template thinking
only go so far in real-world workflows…
Results shown in blue, hard problems highlighted in red
Generalized Workflow for ML Use Cases in Big Data
Portable Format for Analytics (PFA)
PFA updates the standards w.r.t. more contemporary issues of
system architectures used for predictive analytics: distributed
processing, in-memory computing, serialization, etc.
http://dmg.org/pfa/docs/motivation/
• much more support for distributed systems
• Avro data types
• forward-looking toward more streaming applications
• fits well with higher layers of abstraction, success of
DSLs, etc.
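
To make the points above concrete, here is a minimal sketch of a PFA-style document and a toy evaluator, in Python. A PFA document is JSON that names Avro types for `input` and `output` and gives an `action` as nested function calls. The evaluator (`evaluate`, `run`, `OPS`) is hypothetical and handles only `+` and `*`; it is not a real PFA engine and does no type checking.

```python
import json

# A small PFA-style document: Avro type names for input and output, and an
# action written as nested JSON function calls (here: input * 2.0 + 1.0).
PFA_DOC = json.loads("""
{
  "input": "double",
  "output": "double",
  "action": [
    {"+": [{"*": ["input", 2.0]}, 1.0]}
  ]
}
""")

# Toy evaluator (hypothetical): supports two binary functions and the
# "input" symbol only. A real engine implements the full function library.
OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}


def evaluate(expr, value):
    if expr == "input":
        return value
    if isinstance(expr, (int, float)):
        return float(expr)
    if isinstance(expr, dict) and len(expr) == 1:
        (name, args), = expr.items()
        left, right = (evaluate(arg, value) for arg in args)
        return OPS[name](left, right)
    raise ValueError(f"unsupported expression: {expr!r}")


def run(doc, value):
    """Evaluate each action expression in order; the last result is the output."""
    result = None
    for expr in doc["action"]:
        result = evaluate(expr, value)
    return result


print(run(PFA_DOC, 3.0))
```

Because the whole model is plain JSON, it serializes trivially and can be shipped to each worker in a distributed or streaming system.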
Tuning Spark Streaming for Throughput
Gerard Maas, Virdata (2014-12-22)
“One Size Fits All” Doesn’t Anymore

This common architectural pattern requires interchange…
bits.blogs.nytimes.com/2013/06/19/g-e-makes-the-machine-and-then-uses-sensors-to-listen-to-it/
IoT alters “velocity” and “volume” dramatically

This growing category of use cases requires interchange…
Lessons from the success of Apache Spark…
interchange is necessary for the ecosystem
major use cases tend to build their own ML libraries, even though a
majority of committers support a common vision and encourage use of
a canonical library (MLlib with DataFrames)
when a successful business grows over time, challenges arise by
definition: managing separated teams, mergers and acquisitions,
increased audits, regulations, etc.
therefore, lack of interchange for analytics represents a serious
technical debt and potential liability
[Diagram: Tungsten Execution. Labels: Python, SQL, R, Streaming, Advanced Analytics; DataFrame; Physical Execution: CPU Efficient Data Structures; Keep data closer to CPU cache]
Lessons from the success of Apache Spark…
direct use of “compilers” becomes atypical as abstraction layers
become smarter for deferred optimization
What to suggest for existing standards?
microservices: how to compose models + parameters
from multiple/distinct services
support for API definitions in Swagger http://swagger.io/
consider the benefits of Parquet, e.g., how pushdown
predicates enable better optimization of workflows
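
The pushdown idea can be sketched with a toy model in Python. This is not Parquet's reader or any library API. It assumes each row group stores min/max statistics for a column, so a filter can skip whole groups without reading their rows. `RowGroup`, `make_row_group`, and `scan_greater_than` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """A chunk of rows plus the min/max statistics kept in its metadata."""
    min_value: int
    max_value: int
    rows: list


def make_row_group(rows):
    return RowGroup(min(rows), max(rows), list(rows))


def scan_greater_than(row_groups, threshold):
    """Return (matching rows, number of row groups actually read)."""
    matches, groups_read = [], 0
    for group in row_groups:
        if group.max_value <= threshold:
            continue  # skipped from statistics alone, rows never touched
        groups_read += 1
        matches.extend(r for r in group.rows if r > threshold)
    return matches, groups_read


groups = [make_row_group(chunk)
          for chunk in ([1, 4, 7], [10, 12, 15], [20, 22, 30])]
rows, read = scan_greater_than(groups, 18)
print(rows, read)
```

In this toy run only one of the three groups is read. An interchange standard that exposes its predicates to the storage layer lets the workflow optimizer make the same kind of saving.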
What to suggest for existing standards?
additional standards emerging for other aspects of
workflow definition:
Jupyter http://jupyter.org/
create and share documents that contain live code,
equations, visualizations and explanatory text;
a network protocol suite, at heart, for distributed REPL
environments, often along with containerization
see usage in Oriole http://oreilly.com/oriole/index.html

Dat http://dat-data.com/
shares versioned data through a decentralized network
What to suggest for existing standards?
other lingering issues:
• data lineage / provenance
• metadata drift
• public dialog and law:
https://public.resource.org/about/
presenter:
Just Enough Math
O’Reilly (2014)
justenoughmath.com
monthly newsletter for updates, events, conf summaries, etc.:
liber118.com/pxn/
