BigPetStore-Flink
A Comprehensive Blueprint for Apache
Flink.
Suneel Marthi
Flink Forward 2015, Berlin
About Me
• Senior Principal Engineer, Office of Technology, Red Hat
• Committer and PMC member on Apache Mahout
• Contributor to DeepLearning4J and Oryx 2.0
• Co-Organizer of Washington DC Apache Flink Meetup
• Founder of Boston Apache Flink Meetup
Outline Of Talk
• What is BigPetStore?
• Why BigPetStore?
• Synthetic Data
• BigPetStore - MapReduce, Spark
• BigPetStore - Flink
• Future possibilities
What is BigPetStore?
• Blueprints for Big Data
applications
• Consists of:
– Data Generators
– Examples using tools in
Big Data ecosystem to
process data
– Build system and tests for
integrating tools and
multiple JVM languages
• Part of Apache Bigtop
• Used for:
– Templates for infrastructure
(build, integration, testing)
– Educational examples
– Testing
– Demos
– Benchmarking
Why BigPetStore?(1)
As a developer, I want an application blueprint that…
• scales to a size approximating my data-domain
• includes idiomatic unit and integration testing
• demonstrates ETL as well as analytics
In other words…
Word count was great for MapReduce, but we need
something more to demonstrate the advanced capabilities
of newer processing engines
Why BigPetStore?(2)
PetStores have been around for a while to showcase
different technologies starting with Sun’s Web Petstore in
the early days of J2EE
Everyone knows what a PetStore is, hence it’s intuitive to
non-developers
What about a Big Data PetStore?
Vision
• Bigtop Data Generators - a resource for all Apache
projects!
• To build more sophisticated blueprints for users and
developers
• Useful for smoke testing infrastructure and applications!
Case for Synthetic Data
• Most company data is private and confidential
• Licensing concerns with sharing the data
• Secure data cannot be moved out of production
• Enable more realistic example applications
• Enable more comprehensive testing than regular
wordcount or TeraSort
Bigtop Data Generators
• BigPetStore Data Generator
• Bigtop Weatherman
• Bigtop Bazaar
• Locations Library
• Sampler Library
• Name Generator
• Product Generator
BigPetStore-Mapreduce (BIGTOP-1270)
• Originally, a MapReduce
application for demonstrating
MapReduce, Pig, Mahout.
• Primitive “hierarchical” data
generator for generating fake
petstore transactions (at any scale).
• Part of ASF Bigtop; used at Red
Hat and other companies for
testing the Hadoop ecosystem.
New Data Generator for BigPetStore
• Motivation: realistic ML/analytics examples
• Goal: More complex patterns embedded in data
• Mathematical modeling and simulation
– Sampling from PDFs
– (Hidden) Markov Models
– Poisson processes
– Stochastic differential equations
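One of the techniques listed above, a Poisson process, can be sketched in a few lines. This is a minimal illustration only, not the BigPetStore generator's code: the function name, the visit rate, and the one-year horizon are all assumptions made for the example. It draws exponentially distributed gaps between events, which is the standard way to simulate a homogeneous Poisson process.

```python
import random


def poisson_arrival_times(rate_per_day, horizon_days, rng):
    """Return event times (in days) of a homogeneous Poisson process.

    Gaps between consecutive events are exponentially distributed
    with mean 1 / rate_per_day. All parameters are hypothetical.
    """
    times = []
    t = rng.expovariate(rate_per_day)  # time of the first event
    while t < horizon_days:
        times.append(t)
        t += rng.expovariate(rate_per_day)  # add the next gap
    return times


# Hypothetical customer who visits the store about once every two days.
rng = random.Random(42)
visits = poisson_arrival_times(rate_per_day=0.5, horizon_days=365.0, rng=rng)
print(len(visits), "simulated store visits in one year")
```

Seeding the random generator keeps the synthetic data reproducible, which matters when the output is used for tests.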
Next Step: A Platform Independent Data
Generator.
Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in 2014 IEEE Fourth
International Conference on Big Data and Cloud Computing (BDCloud), pp. 49-55, 3-5 Dec. 2014
BigPetStore Data Model
• Generative Model leveraging well-known mathematical
modeling techniques to simulate factors influencing
customers’ purchasing habits.
• In several cases, real data is used to parameterize the model
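To make the idea of a generative purchasing model concrete, here is a small Markov-chain sketch. It is not the model from the Nowling and Vyas paper: the product categories and transition probabilities below are invented for illustration, whereas the slides say the real model is parameterized from real data in several cases.

```python
import random

# Hypothetical transition probabilities: given the category a customer
# bought last, how likely each category is to be bought next.
TRANSITIONS = {
    "dog food": {"dog food": 0.6, "dog treats": 0.3, "cat litter": 0.1},
    "dog treats": {"dog food": 0.5, "dog treats": 0.2, "cat litter": 0.3},
    "cat litter": {"dog food": 0.2, "dog treats": 0.1, "cat litter": 0.7},
}


def simulate_purchases(start, steps, rng):
    """Walk the Markov chain for `steps` transitions, starting at `start`."""
    state = start
    sequence = [state]
    for _ in range(steps):
        options = TRANSITIONS[state]
        # Sample the next category in proportion to its probability.
        state = rng.choices(list(options), weights=list(options.values()), k=1)[0]
        sequence.append(state)
    return sequence


rng = random.Random(2015)
purchases = simulate_purchases("dog food", steps=10, rng=rng)
print(purchases)
```

Swapping the hand-written table for probabilities estimated from real data is what "parameterizing the model" would amount to in this toy setting.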
BigPetStore Data Model
BigPetStore-TransactionQueue
• No need for API calls; just use Docker
• Generate load for any app, not just JVM apps
• docker run -t -i smarthi/bigpetstore-transaction-queue
BigPetStore-Spark (BIGTOP-1535)
• RJ Nowling rewrote the BigPetStore
data generator components to generate
more complex data sets, with patterns
varying in many dimensions.
• BigPetStore-Spark was then added to
ASF Bigtop, demonstrating that the
data generator could be used in a
distributed context.
BigPetStore-Flink (BIGTOP-1927 & BIGTOP-1928)
• A Flink application blueprint.
• Generates data at any scale.
• Uses Flink streams to write generated data to disk.
• Uses Flink DataStream transformations to transform data
sets for analytics.
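The generate-then-transform flow described above can be sketched without Flink. The snippet below is plain Python standing in for the pipeline shape (generate records, then aggregate per key); it does not use the Flink API, and the catalog, field names, and record count are assumptions made for the example.

```python
import random
from collections import defaultdict

# Hypothetical product catalog with unit prices.
PRODUCTS = {"dog food": 20.0, "cat litter": 12.5, "fish flakes": 4.0}


def generate_transactions(n, rng):
    """Yield n synthetic transactions as dictionaries."""
    names = sorted(PRODUCTS)
    for tx_id in range(n):
        product = rng.choice(names)
        yield {
            "id": tx_id,
            "customer": rng.randrange(100),  # hypothetical customer id
            "product": product,
            "price": PRODUCTS[product],
        }


def revenue_by_product(transactions):
    """Group transactions by product and sum their prices."""
    totals = defaultdict(float)
    for tx in transactions:
        totals[tx["product"]] += tx["price"]
    return dict(totals)


rng = random.Random(7)
transactions = list(generate_transactions(1000, rng))
totals = revenue_by_product(transactions)
print(totals)
```

In a streaming engine the same logic would typically be expressed as a keyed aggregation over the transaction stream rather than a loop over a list.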
BigPetStore Flink
Future Endeavors
• How to help users build their own models?
• How to use the Bigtop Data Generators for load testing?
• How to produce synthetic copies from real datasets?
• Better libraries and abstractions to reduce boilerplate
• Research: Investigating Probabilistic Programming
Languages which provide advanced sampling and
inference algorithms combined with high-level DSLs for
model specifications
Future: BigPetStore - Flink
A BigPetStore Blueprint for:
• Flink Batch
• Flink Table API
• Flink ML algorithms
Resources
Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet
Store," in 2014 IEEE Fourth International Conference on Big Data and Cloud
Computing (BDCloud), pp. 49-55, 3-5 Dec. 2014
https://github.com/apache/bigtop/tree/master/bigtop-data-generators
https://github.com/apache/bigtop/tree/master/bigtop-bigpetstore
Bigtop Data Generators available as a library:
http://dl.bintray.com/rnowling/bigpetstore
TL;DR
• Bigtop Data Generators - a resource for all Apache Big Data projects
• Comprehensive Blueprints
• Smoke and integration testing
• Load testing
• BigPetStore-Flink soon to be part of Apache Bigtop (BIGTOP-1927 &
BIGTOP-1928)
• Future Endeavors
• Expand BigPetStore Flink as new Flink features become available
• Make models easier to build
• Easier ways to generate synthetic data from models built on real data
