A Visual Workbench for Big Data
 Analytics on Hadoop




bigdata.pervasive.com •+1.855.356.DATA
Visual Workbench for Hadoop

• Agenda
   –   Pervasive Software
   –   History of DataRush
   –   Dataflow Concepts
   –   Hadoop Integration
   –   Demo
   –   Performance Testing




Who is Pervasive?

Global Software Company
   •   Tens of thousands of users across the globe
   •   Operations in Americas, EMEA, Asia
   •   ~260 employees

Strong Financials
   •   $51 million revenue (trailing 12-month)
   •   48 consecutive quarters of profitability
   •   $46 million in the bank
   •   NASDAQ:PVSW since 1997

Leader in Data Innovation
   • 25% of top-line revenue re-invested in R&D
   • Software to manage, integrate and analyze data, in the cloud or on-premises,
     throughout the entire data lifecycle




History of DataRush

• Initially developed as next-gen data engine for
  integration
• Requirements
   –   High data throughput
   –   Scalable (data, multicore)
   –   Based on dataflow concepts
   –   Component based architecture
   –   Easy to extend
   –   Easily fits in visual development environment
• Embedded in Pervasive products (DataProfiler)
• Extended with SDK for more general use




Dataflow Concepts

 •   Operators (nodes) linked together in a directed graph
 •   Data flows along edges
 •   Shared nothing architecture
 •   Provides pipeline parallelism
 •   Supports data parallelism
 •   Data scalable
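
As a rough illustration of these concepts (not DataRush's API), the following Python sketch links three operators into a small graph, with bounded queues acting as the edges. Each operator runs in its own thread and communicates only through its edges, so the stages overlap in time, which is pipeline parallelism. The operator names, the queue-based edges, and the end-of-stream marker are all assumptions made for this example.

```python
import queue
import threading

END = object()  # hypothetical end-of-stream marker passed along each edge


def reader(rows, out_edge):
    """Source operator: pushes each input row onto its output edge."""
    for row in rows:
        out_edge.put(row)
    out_edge.put(END)


def filter_rows(predicate, in_edge, out_edge):
    """Forwards only the rows for which predicate(row) is true."""
    for row in iter(in_edge.get, END):
        if predicate(row):
            out_edge.put(row)
    out_edge.put(END)


def writer(in_edge, sink):
    """Sink operator: appends every row it receives to a list."""
    for row in iter(in_edge.get, END):
        sink.append(row)


def run_graph(rows, predicate):
    """Wire reader -> filter_rows -> writer and run all three concurrently."""
    edge_a = queue.Queue(maxsize=4)  # bounded edges apply backpressure
    edge_b = queue.Queue(maxsize=4)
    sink = []
    operators = [
        threading.Thread(target=reader, args=(rows, edge_a)),
        threading.Thread(target=filter_rows, args=(predicate, edge_a, edge_b)),
        threading.Thread(target=writer, args=(edge_b, sink)),
    ]
    for op in operators:
        op.start()
    for op in operators:
        op.join()
    return sink


if __name__ == "__main__":
    print(run_graph(range(10), lambda r: r % 2 == 0))
```

Data parallelism would come from running several copies of the same graph over different partitions of the input; that part is left out here to keep the sketch short.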




Compilation to Execution Plan

[Diagram: the graph is compiled to a set of physical graphs, drawn as four identical parallel pipelines split into two phases]

Phase 1:  Reader -> FilterRows -> DeriveFields -> Group(partial)
Phase 2:  Repartition -> Group(final) -> Writer
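
The phase split in this plan can be sketched in a few lines of Python. This is an illustrative model, not DataRush code: each input partition is grouped locally (partial), the partial results are repartitioned by a hash of the key so that all values for a key land in one place, and a final group combines them. The function names, the choice of a sum as the aggregate, and the partition counts are assumptions.

```python
from collections import defaultdict


def group_partial(rows):
    """Phase 1: sum (key, value) rows within a single partition."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


def repartition(partials, num_partitions):
    """Route every partial total to the bucket chosen by its key's hash."""
    buckets = [[] for _ in range(num_partitions)]
    for partial in partials:
        for key, value in partial.items():
            buckets[hash(key) % num_partitions].append((key, value))
    return buckets


def group_final(bucket):
    """Phase 2: combine the partial totals that share a key."""
    return group_partial(bucket)


def two_phase_sum(partitions, num_partitions=2):
    """Run partial group, repartition, and final group over all partitions."""
    partials = [group_partial(p) for p in partitions]
    buckets = repartition(partials, num_partitions)
    result = {}
    for bucket in buckets:
        result.update(group_final(bucket))
    return result


if __name__ == "__main__":
    parts = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)], [("b", 5)]]
    print(two_phase_sum(parts))
```

The partial step shrinks the data before it crosses partitions. This works directly for decomposable aggregates such as sums and counts; an average would need to carry a sum and a count through both phases.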




Operator Library




KNIME

• KNIME
   – Open source analytics workflow tool for the desktop
   – Web site: www.knime.org
   – Supports team collaboration and resource sharing:
      • KNIME Teamspace
      • KNIME Server
      • KNIME Report
• Integrated with DataRush
   – DataRush dataflow executor integrated as a plug-in extension
   – Includes DataRush operators
   – Product: RushAnalytics for KNIME




DataRush + KNIME




Integration with Hadoop

• Data Level
   – HDFS access
      • File system abstraction – works with all I/O operators
      • Distributed execution – uses splits much like MR
   – HBase
      • Temporal key-value data store based on column families
      • Fast loading using HFile integration
      • Fast temporal queries
• Execution
   – Distributed execution uses distributed DataRush engines (not
     MapReduce)
   – Integrating with YARN for resource sharing




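Distributed Execution

[Diagram. Components: Perf Monitor (Web Browser), Cluster Manager, Node Manager, Client (Local Phase Graph), Executor (Phase Graph), HDFS.
 Edge labels: "Initiates Job" (Client, Cluster Manager); "Allocates Resources" (Cluster Manager, Node Manager); "Spawns" (Node Manager, Executor); "Data" (Executor, HDFS).]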


Distributed I/O

[Diagram: AssignSplits feeding four parallel ReadSplit operators]

 • Allows downstream operators to be parallelized
 • Parallelization concepts are the same whether the graph is run
   locally or distributed
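
A minimal sketch of the split idea, assuming a local file stands in for HDFS and records are newline-delimited: an assign step divides the file into byte ranges, and each read step returns the records that start inside its range. The functions `assign_splits`, `read_split`, and `read_parallel` borrow the operator names from the slide but are hypothetical Python functions, not the DataRush operators.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def assign_splits(path, split_size):
    """Divide the file into contiguous (start, end) byte ranges."""
    size = os.path.getsize(path)
    return [(start, min(start + split_size, size))
            for start in range(0, size, split_size)]


def read_split(path, start, end):
    """Return the lines whose first byte lies within [start, end)."""
    lines = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and read to the next newline, so that we
            # land on the first line that starts at or after `start`.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            lines.append(line.rstrip(b"\n").decode("utf-8"))
    return lines


def read_parallel(path, split_size, workers=4):
    """Read all splits concurrently and return the lines in file order."""
    splits = assign_splits(path, split_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(lambda s: read_split(path, *s), splits)
        return [line for chunk in chunks for line in chunk]
```

Because a record that straddles a boundary is owned by the split in which it starts, every record is read exactly once regardless of the split size, and each split's output can feed its own copy of the downstream operators.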




Demo




Performance Test

• DataRush versus PIG
      – Used TPC-H data
      – Generated 1TB data set in HDFS
      – Ran several “queries” coded in DataRush and PIG
      – Run times in seconds (smaller is better)

TPC-H : 1 Terabyte Test : Run times (seconds)

      Query   DataRush    PIG
      Q21          892   3528
      Q18          543   1742
      Q10          626   1027
      Q9          1198   2356
      Q6           273    363
      Q3           660   1414
      Q1           401   2036

Cluster Configuration:

•    5 worker nodes
•    2 X Intel E5-2650 (8 core)
•    64GB RAM
•    24 X 1TB SATA 7200 rpm
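
To put the run times in relative terms, they can be turned into speedup ratios (PIG time divided by DataRush time). The sketch below assumes that the first number in each pair on the chart is DataRush and the second is PIG, which matches the legend order; it is arithmetic on the published figures, not an additional measurement.

```python
# Run times in seconds as read from the chart: query -> (DataRush, PIG).
# The pairing of each bar with a system is an assumption (see the note above).
RUN_TIMES = {
    "Q21": (892, 3528),
    "Q18": (543, 1742),
    "Q10": (626, 1027),
    "Q9": (1198, 2356),
    "Q6": (273, 363),
    "Q3": (660, 1414),
    "Q1": (401, 2036),
}


def speedups(run_times):
    """Per-query ratio of PIG run time to DataRush run time."""
    return {query: pig / dr for query, (dr, pig) in run_times.items()}


def overall_speedup(run_times):
    """Ratio of total PIG time to total DataRush time across all queries."""
    total_dr = sum(dr for dr, _ in run_times.values())
    total_pig = sum(pig for _, pig in run_times.values())
    return total_pig / total_dr


if __name__ == "__main__":
    for query, ratio in speedups(RUN_TIMES).items():
        print(f"{query}: {ratio:.2f}x")
    print(f"All seven queries: {overall_speedup(RUN_TIMES):.2f}x")
```

Under that reading, the per-query ratios range from about 1.3x (Q6) to about 5.1x (Q1), and the seven queries together take about 2.7 times as long in PIG as in DataRush.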



DataRush/RushAnalytics Solutions

• Opera Solutions
   – Data science solutions provider
   – Embedding DataRush in engineered solutions
• Healthcare
   – Claims cleansing & processing
• Retail
   – Market basket analysis
   – Product category resolution (MDM)
• Telecom
   – CDR processing & analysis


“Pervasive DataRush’s efficiency and ability to automatically
scale, whether on a single server or a Hadoop cluster, supports our
vision for consistent, reusable, scalable Big Data analytics.”
                  – Armando Escalante, Chief Operating Officer, Opera Solutions



Summary

• Easy development of Hadoop workloads
   – Using drag-and-drop desktop GUI
   – Team oriented - Supports collaboration with others
   – No code to write - MapReduce included
• Scalable Execution
   – Executes within Hadoop cluster
   – Scales from desktop to server to cluster with no workflow
     changes
   – Scales as cluster does
   – Handles small to very large data sizes
   – TPC-H performance testing shows improved performance over
     comparable PIG scripts



Questions?

• My contact info:

  jfalgout@pervasive.com
  @jimfalgout

• Website

   bigdata.pervasive.com





Feb 2013 HUG: A Visual Workbench for Big Data Analytics on Hadoop
