Starschema
Experience and Innovation
• Who we are and what we are doing
• Big Data era
• BSP (Bulk synchronous parallel)
• Apache Giraph
• Storm
• Our use case
• Conclusion
Topics today
Continuous growth
25 FTE plus external resources,
over $1.5 million EBIT
Open source projects
Share the knowledge with the public.
Open source projects in the ETL and data
warehousing fields.
Founded in 2006
Company was founded by private
owners with a decade of BI and data
warehouse background
R&D
Cooperation with Obuda University,
NKE, EU co-funded technology
research and development
COMPANY Data
Facts about Starschema
Big Data era
The rise of Hadoop
Google                Year of white paper   Apache      Year
GFS                   2003                  HDFS        2007
MapReduce             2004                  Hadoop MR   2007
BigTable              2006                  HBase       2007
Chubby Lock Service   2006                  ZooKeeper   2007
Pregel                2009                  Giraph      2011
Dremel                2010                  Drill       2012 ?
Which is next? (Curator, Falcon, MRQL, etc.)
• Leslie Valiant - article in Aug. 1990
• Supersteps
• Data stored in local memory
• Asynchronous data processing
• Barrier sync
• Optimal load balancing (more logical processes
than physical processors, random allocation of
processes)
• Solution differences (protocols, buffer
management, routing strategies)
• No deadlocks or race conditions
(since there are no circular dependencies)
• Use cases
BSP (Bulk synchronous parallel)
What is it? What is it good for?
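The superstep cycle listed above (local computation, message exchange, barrier sync) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the talk: the ring layout, the `run_bsp` name and the "find the global maximum" task are all made up for the example, and threads stand in for the processors of a real cluster.

```python
import threading

def run_bsp(values):
    """Toy BSP run: every worker ends up knowing the global maximum.

    Hypothetical example: workers form a ring and, in each superstep,
    forward the largest value seen so far to their right neighbour.
    """
    n = len(values)
    local = list(values)               # each worker's private (local-memory) state
    inbox = [[] for _ in range(n)]     # messages readable in the current superstep
    outbox = [[] for _ in range(n)]    # messages delivered only after the barrier

    def deliver():
        # Runs once per barrier, in a single thread, before anyone is released:
        # messages sent in superstep k become visible in superstep k + 1.
        for i in range(n):
            inbox[i], outbox[i] = outbox[i], []

    barrier = threading.Barrier(n, action=deliver)

    def worker(i):
        for _ in range(n - 1):
            # 1. Local computation on data received in the previous superstep.
            local[i] = max([local[i]] + inbox[i])
            # 2. Communication: send to the right neighbour's outbox.
            outbox[(i + 1) % n].append(local[i])
            # 3. Barrier synchronisation: nobody proceeds until all have sent.
            barrier.wait()
        # Consume the messages from the final superstep.
        local[i] = max([local[i]] + inbox[i])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return local

print(run_bsp([3, 9, 2, 7]))
```

Because a message only becomes readable after the barrier, no worker ever waits on another worker inside a superstep, which is the "no circular dependency" argument from the slide.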
Storm Apache Giraph
Apache Giraph
• A loose implementation of Pregel
• Avery Ching: We can't use it at Yahoo, that's too bad
• Developed at Yahoo
• Runs on existing MapReduce infrastructure
• Netty-based communication instead of Hadoop RPC
• In-memory
• Fault tolerant
• Internal state is saved at user-defined intervals
• Master/slave architecture
What is it?
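To make the Pregel model behind Giraph more concrete, here is a minimal single-process sketch of a vertex-centric computation (single-source shortest paths). It does not use the Giraph API; `pregel_sssp`, the dictionary-based graph and the message buffers are assumptions made for the illustration. A vertex that receives no messages stays inactive, which plays the role of "voting to halt".

```python
def pregel_sssp(edges, source):
    """Single-source shortest paths in the Pregel style.

    edges: {vertex: [(neighbour, weight), ...]}; every vertex must be a key.
    Hypothetical helper for illustration, not part of Giraph.
    """
    dist = {v: float("inf") for v in edges}
    inbox = {source: [0]}          # superstep 0: only the source has a message

    while inbox:                   # stop when no messages are in flight
        outbox = {}
        for vertex, messages in inbox.items():   # only vertices with mail are active
            best = min(messages)
            if best < dist[vertex]:
                dist[vertex] = best
                # Tell the neighbours about the shorter path found.
                for neighbour, weight in edges[vertex]:
                    outbox.setdefault(neighbour, []).append(best + weight)
            # Otherwise: send nothing, i.e. vote to halt.
        inbox = outbox             # barrier: messages arrive in the next superstep
    return dist

graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2)],
    "c": [("d", 1)],
    "d": [],
}
print(pregel_sssp(graph, "a"))
```

In a real deployment the vertices are partitioned across workers and kept in memory; the loop body is what each worker would run for its own partition between two barriers.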
Storm
• Storm is a free and open-source distributed real-time
computation system
• Developed at BackType, open-sourced by Twitter in 2011
• Guaranteed data processing
• Horizontal scalability
• Fault tolerance
• ZeroMQ for message passing
• Processing unbounded sequences of tuples
• Groupings
What is it?
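The terms on this slide (tuples, groupings) are easier to picture with a toy topology. The sketch below is plain Python, not the Storm API: `sentence_spout`, `SplitBolt`, `CountBolt` and the two grouping helpers are hypothetical stand-ins. A grouping decides which task of the downstream bolt receives a tuple; here round-robin stands in for Storm's shuffle grouping, and a hash of one field stands in for fields grouping, so equal keys always reach the same task.

```python
import itertools
import zlib
from collections import Counter

def sentence_spout():
    """Stand-in for a spout: a finite sample of an otherwise unbounded stream."""
    for sentence in ["the cow jumped", "the moon", "the cow"]:
        yield (sentence,)

class SplitBolt:
    def process(self, tup):
        # Emit one tuple per word.
        return [(word,) for word in tup[0].split()]

class CountBolt:
    def __init__(self):
        self.counts = Counter()

    def process(self, tup):
        self.counts[tup[0]] += 1
        return []

def shuffle_grouping(num_tasks):
    # Assumption: round-robin as a deterministic stand-in for random distribution.
    tasks = itertools.cycle(range(num_tasks))
    return lambda tup: next(tasks)

def fields_grouping(num_tasks, field_index):
    # Same field value -> same task (crc32 keeps the mapping stable across runs).
    return lambda tup: zlib.crc32(str(tup[field_index]).encode()) % num_tasks

def run_topology():
    splitters = [SplitBolt() for _ in range(2)]
    counters = [CountBolt() for _ in range(3)]
    to_splitter = shuffle_grouping(len(splitters))
    to_counter = fields_grouping(len(counters), 0)
    for tup in sentence_spout():
        for word_tup in splitters[to_splitter(tup)].process(tup):
            counters[to_counter(word_tup)].process(word_tup)
    return counters

for task_id, bolt in enumerate(run_topology()):
    print(task_id, dict(bolt.counts))
```

Fields grouping is what makes the per-word counts correct: since each word is routed to exactly one counter task, no two tasks hold partial counts for the same word.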
Storm
What is it for?
• Analyze, clean, normalize
• Real-time calculation
• Real-time ETL
• Failure detection from log files
• Machine data analysis
• IT early-warning systems, security and fraud detection
• Traffic information, DoS attacks
• Stream processing - Continuous computation -
Distributed RPC
Our use case
• Real-time calculation
• Error detection
• Horizontal scalability
• Fast implementation
• High-availability
• Error prediction
POC: Processing machine data from sensors
Requirements
Our use case part 2
• Chosen tool: Storm
• One spout for each sensor
• Dynamic addition and removal of spouts
• Error detection based on statistical calculations
• ~200 lines of code
• HA capability of Storm
POC: Processing machine data from sensors
Solution:
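The slides do not say which statistical calculation the POC used. As one plausible illustration, the sketch below flags a sensor reading as an error when it lies more than a few standard deviations from that sensor's recent mean. `AnomalyBolt`, the window size and the threshold are all assumptions, and the class is plain Python rather than a Storm bolt.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev

class AnomalyBolt:
    """Hypothetical bolt-like class: per-sensor rolling z-score check."""

    def __init__(self, window=20, threshold=3.0, min_samples=5):
        self.threshold = threshold
        self.min_samples = min_samples
        # State is created on first sight of a sensor id, so sensors
        # (and their spouts) can appear or disappear at runtime.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def process(self, sensor_id, value):
        past = self.history[sensor_id]
        is_error = False
        if len(past) >= self.min_samples:
            mu = mean(past)
            sigma = pstdev(past)
            if sigma > 0 and abs(value - mu) > self.threshold * sigma:
                is_error = True
        if not is_error:
            past.append(value)     # keep outliers out of the baseline
        return is_error

bolt = AnomalyBolt()
for reading in [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 25.0, 10.1]:
    print(reading, bolt.process("sensor-1", reading))
```

Keeping the state keyed by sensor id means the same bolt instance can serve any number of sensors, provided tuples from one sensor are always routed to the same task.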
Conclusion
• Extend existing infrastructure
• Answers to new questions
• Re-think old problems
• New solutions, new features
• Happy customers/users
• $$$
What else to use?
• Yahoo S4 (Apache Incubator project)
• Apache Hama (Top level Apache project)
• GoldenOrb
• Signal/Collect
QUESTIONS & ANSWERS
Q…A
borosg@starschema.net
www.starschema.net


Budapest Big Data Meetup Real-time stream processing
