Big Data Architectural Overview
HPCC Systems vs Hadoop
By Fujio Turner (@FujioTurner)
Who is LexisNexis? What is HPCC Systems?
LexisNexis is a provider of legal, tax, regulatory, news, business information, and analysis to legal, corporate, government, accounting, and academic markets.

LexisNexis has been in business since 1977 and has over 30,000 employees worldwide. LexisNexis Risk is the division of LexisNexis that focuses on data, Big Data processing, linking, and vertical expertise, and supports HPCC Systems as an open source project under the Apache 2.0 License.
http://hpccsystems.com/
Comparison
Hadoop: block based; Java; petabytes; 1-80,000 jobs/day; since 2005
HPCC Systems (Thor, Roxie): file based; C++; exabytes; since 2000
HPCC Systems throughput: Indexed: 2K-3K jobs/sec*; In-Memory: 30-40 jobs/min*; Non-Indexed: 4-1,040,000 jobs/day
*based on job (size / result set / complexity)
Non-Indexed Full Data Set 
1 20 
Customers Development Business 
http://hpccsystems.com/why-hpcc/benchmarks
“I’m sub-second fast.”
“I can query all or part of your data.”
Cluster Architecture
Thor: single-threaded; hard disk; index (optional)
Roxie: multi-threaded; hard disk; index (optional); in-memory, SSD, or either/both
How do the platforms handle the same data?
Example: 300GB file (Customer Data, May 2010)
Name   State  Age
Kevin  CA     45
Mark   MI     27
Sara   FL     64
Store Data
Name Node
Data Nodes: blocks a, b, c (big blocks)
Kevin CA 45
Mark MI 27
Sara FL 64
Data is stored in random blocks.
Store Data
Name Node: block a = server 1, block b = server 2, block c = server 3
Data Nodes: blocks a, b, c (big blocks)
Kevin CA 45
Mark MI 27
Sara FL 64
Block locations are stored in memory.
Store Data
Kevin CA 45
Mark MI 27
Sara FL 64
Data is distributed evenly in the cluster with replica copies and is seen as a single file (example below).
Thor Master
Thor Slaves: K.. CA 45 | M.. MI 27 | S.. FL 64
File Name: ~/customers_2010-05
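The idea of spreading one logical file evenly over the Thor slaves can be sketched in a few lines of Python. This is only an illustration of the concept: the function names `distribute` and `read_logical_file` are hypothetical, replica copies are left out, and nothing here reflects how HPCC Systems actually lays out file parts on disk.

```python
from typing import List

def distribute(records: List[str], num_slaves: int) -> List[List[str]]:
    """Split records into near-equal contiguous parts, one part per slave."""
    base, extra = divmod(len(records), num_slaves)
    parts = []
    start = 0
    for slave in range(num_slaves):
        # The first `extra` slaves take one additional record each.
        size = base + (1 if slave < extra else 0)
        parts.append(records[start:start + size])
        start += size
    return parts

def read_logical_file(parts: List[List[str]]) -> List[str]:
    """Present the parts as one file by reading them back in slave order."""
    return [record for part in parts for record in part]

records = ["Kevin CA 45", "Mark MI 27", "Sara FL 64"]
parts = distribute(records, num_slaves=3)
```

With three records and three slaves, each slave holds exactly one record, and reading the parts in order gives back the original file.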
Store Data
Kevin CA 45
Mark MI 27
Sara FL 64
File locations are stored on disk.
Dali: File Location & Job Scheduler
Thor Master
Thor Slaves: K.. CA 45 | M.. MI 27 | S.. FL 64
File Name: ~/customers_2010-05
What state do most people live in?
Name Node
Data Nodes: blocks a, b, c
Blocks are scanned for wanted data.
What state do most people live in?
Name Node
Data Nodes: blocks a, b, c
Mapper: CA 1, FL 1, MI 1, FL 1, CA 1, MI 1, MI ..
Found data is sent to Mapper(s) in Key/Value pairs and stored.
What state do most people live in?
Name Node
Data Nodes: blocks a, b, c
Mapper: CA 1, FL 1, MI 1, FL 1, CA 1, MI 1, MI ..
Reducer: CA 120, MI 500, FL 7
Stored data is sent to Reducer(s) to be aggregated.
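The map and reduce steps described on these slides can be sketched in plain Python. This is a single-process illustration, not Hadoop code: the functions `map_block` and `reduce_pairs` are hypothetical names, and the records for Ann and Raj are made-up sample rows added so that one state wins.

```python
from collections import defaultdict

def map_block(block):
    """Mapper: scan one block and emit a (state, 1) pair per record."""
    pairs = []
    for line in block:
        _name, state, _age = line.split()
        pairs.append((state, 1))
    return pairs

def reduce_pairs(pairs):
    """Reducer: sum the counts for each state key."""
    totals = defaultdict(int)
    for state, count in pairs:
        totals[state] += count
    return dict(totals)

# Three blocks, as in the slides (Ann and Raj are hypothetical extra rows).
blocks = {
    "a": ["Kevin CA 45", "Mark MI 27"],
    "b": ["Sara FL 64", "Ann MI 31"],
    "c": ["Raj MI 52"],
}

# Every block is scanned in full, and the key/value pairs are collected.
intermediate = [pair for block in blocks.values() for pair in map_block(block)]
totals = reduce_pairs(intermediate)
most_common_state = max(totals, key=totals.get)
```

In a real cluster the intermediate pairs are written out and shuffled to the reducers between the two phases, which is the storage step the slide refers to.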
What state do most people live in?
Name Node
Data Nodes: blocks a, b, c
Mapper: CA 1, FL 1, MI 1, FL 1, CA 1, MI 1, MI ..
Reducer: CA 120, MI 500, FL 7
Cannot use SSD in Mapper or Reducer due to too many writes.
What state do most people live in?
Dali: File Location & Job Scheduler
ESP
Thor Master
Thor Slaves: K CA 45 | M MI 27 | S FL 64
1a. A pre-compiled query is triggered. (Mostly used in Roxie)
1b. Ad-hoc query.
2. Query is sent to Dali to get file locations.
What state do most people live in?
Dali: File Location & Job Scheduler
ESP
Thor Master
Thor Slaves: K CA 45 | M MI 27 | S FL 64
3. Job is placed in a queue to be sent to the Thor Master. The Thor Master coordinates job execution on the Thor Slave nodes.
What state do most people live in?
Dali: File Location & Job Scheduler
ESP
Thor Master
Thor Slaves: K CA 45 | M MI 27 | S FL 64
Jobs are done locally on the slaves and/or coordinated globally by the master.
What state do most people live in?
Dali: File Location & Job Scheduler
ESP
Thor Master
Thor Slaves: K CA 45 | M MI 27 | S FL 64
Result: MI 500, CA 120, FL 7
4. Job result is returned, optionally grouped and sorted at run time.
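The pattern of doing work locally on each slave and then combining it globally on the master can be sketched in Python. This is an illustration of the idea only and not ECL or HPCC Systems code: `local_count` and `global_merge` are hypothetical names, and the rows for Ann, Raj, and Lee are made-up sample data.

```python
from collections import Counter

def local_count(part):
    """Runs on one slave: count states in the records that slave holds."""
    return Counter(line.split()[1] for line in part)

def global_merge(local_counts):
    """Runs on the master: merge per-slave counts, sort by count descending."""
    total = Counter()
    for counts in local_counts:
        total.update(counts)
    return sorted(total.items(), key=lambda item: item[1], reverse=True)

# One list of records per slave (Ann, Raj, and Lee are hypothetical rows).
slave_parts = [
    ["Kevin CA 45", "Mark MI 27"],
    ["Sara FL 64", "Ann MI 31"],
    ["Raj MI 52", "Lee CA 38"],
]

result = global_merge(local_count(part) for part in slave_parts)
```

Only the small per-slave counts travel to the master here, rather than every key/value pair, which is the contrast with the mapper and reducer flow shown earlier.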
What state do most people live in?
Dali: File Location & Job Scheduler
ESP
Thor Master
Thor Slaves: K CA 45 | M MI 27 | S FL 64
Result: MI 500, CA 120, FL 7
SORT, GROUP, DEDUP, JOIN, MERGE, BETWEEN, LENGTH, REGEX, ROUND, SUM, COUNT, TRIM, WHEN, AVE, CASE, NORMALIZE, DENORMALIZE, K-MEANS, and more.
Multiple other actions can be done on the data in a single job.
Closer Look at Finding Data
Blocks: a, b, c
K CA 45
Full block is scanned to find your data. Blocks can be many terabytes in size.
Closer Look at Finding Data
Blocks: a, b, c
K CA 45
When data is found, it is sent to the mapper: CA, 1
Closer Look at Finding Data
a b c
K CA 45
Data location is known.
“Apply Schema on Read” at query time.
Data is processed locally.
Name State Age
Closer Look at Finding Data
a b c
K CA 45
File size can be a few bytes to 4 exabytes, with no limit on the total number of files that can be stored.
Speed
Memory: 128GB - 1TB
Disk: 8TB - 16TB or more
(2013)
1.5 - 12.5% of data is in memory, and only recently used data is in memory.
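The percentage range on this slide appears to follow from dividing the memory figures by the disk figure. The short Python check below works under that assumption: that 128GB to 1TB is memory, that 8TB is the disk capacity being compared against, and that 1TB equals 1024GB. None of these is stated on the slide itself, and the helper name `memory_share_percent` is hypothetical.

```python
GB = 1
TB = 1024 * GB  # assumption: binary units

def memory_share_percent(memory_gb: float, disk_gb: float) -> float:
    """Percentage of on-disk data that could fit in memory at once."""
    return 100.0 * memory_gb / disk_gb

low = memory_share_percent(128 * GB, 8 * TB)   # 1.5625, about 1.5%
high = memory_share_percent(1 * TB, 8 * TB)    # 12.5
```

Against 16TB of disk the same memory sizes cover a smaller share, so the slide's range reads as a best case.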
Speed - Part 1: Indexing
File Name: ~/customers_2010-05
Kevin CA 45
Mark MI 27
Sara FL 64
File Name: ~/customers_2010-05_index
• index per file
• customize by field(s)
Thor Master
Thor Slaves: K CA 45 | M MI 27 | S FL 64, each with its own index
Index entries: CA row #3, MI row #17, MI row #4, FL row #5
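A per-file index on a chosen field, mapping each value to the row numbers where it occurs, can be sketched in Python as follows. This is a toy in-memory illustration, not HPCC Systems' index format: the functions `build_index` and `lookup` and the dictionary layout are assumptions made for the example.

```python
from collections import defaultdict

def build_index(rows, field):
    """Map each value of `field` to the list of row numbers containing it."""
    index = defaultdict(list)
    for row_number, row in enumerate(rows):
        index[row[field]].append(row_number)
    return dict(index)

def lookup(rows, index, value):
    """Fetch matching rows through the index instead of scanning every row."""
    return [rows[row_number] for row_number in index.get(value, [])]

rows = [
    {"name": "Kevin", "state": "CA", "age": 45},
    {"name": "Mark", "state": "MI", "age": 27},
    {"name": "Sara", "state": "FL", "age": 64},
]

state_index = build_index(rows, "state")
michigan_rows = lookup(rows, state_index, "MI")
```

The index is built once and can be customized by passing a different field, for example `build_index(rows, "age")`. A query for one state then touches only the rows listed under that key.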
1 40 
Non-Indexed 
1 200 
To 
Indexed
Example Index Example Index 
1 40 
Non-Indexed 
1 200+ 
To 
Indexed 
male row #345 
female row #4 
male row #97 
female row #267 
CA row #3 
MI row #17 
MI row #4 
FL row #5
Speed - Part 2: Roxie
Index In-Memory
or
Index In-Memory & Part or All Data
Roxie is multi-threaded.
Roxie Master
Roxie Slaves: K CA 45 | M MI 27 | S FL 64, each with its own index
SSDs are OK: write few / read many
2004
Common Cluster
Dali, ESP
Thor Master, Thor Slaves
Roxie Master, Roxie Slaves
Data is mostly unstructured. Use Thor to do ETL & create indexes. Send results to Roxie for user queries.
High Speed Cluster
Dali, ESP
Roxie Master, Roxie Slaves
Data is mostly structured. Main goal is to have fast queries all the time.
Storage Cluster
Dali, ESP
Thor Master, Thor Slaves
Data is structured or unstructured. Main goal is to store lots of data and query using indexes on all or part of the data in the cluster.
Complex or Multi-Step Queries
Name Node
Data Nodes / Task Tracker: blocks a, b, c
Mapper
Reducer
Job Tracker
Job Tracker coordinates multi-step jobs.
Job Tracker 
3 hours 1 hours 1 hours 6 hours 
CA 120 
MI 500 
FL 7 
Food 31 
Water 99 
Candy 84 
Wed 80 
Fri 73 
Sun 96 
1 2 3 
4 5 6 
7 8 9 
1 hours 
Sum 80 
Count 73
How do I Query HPCC Systems?
ECL (Enterprise Control Language) is a C++ based query language for use with the HPCC Systems Big Data platform. ECL's syntax and format are very simple and easy to learn.

Note: ECL is very similar to Hadoop’s Pig, but more expressive and feature rich.
ECL (Enterprise Control Language) 
C++ based query language 
SQL w/ JOINS 
Map/Reduce 
GraphDB 
Machine Learning
Simple to Complex Queries
Query is Completed in a Single Job! 
Asynchronously 
Count 
Sort 
Group 
Classification 
Country = ‘US’ 
Country = ‘US’ 
Join 
Index of 
~/facebook_2013 
~/twitter_2013 
~/facebook_2013 
(ROXIE) 0.27 seconds to (THOR) a few hours
SORT, GROUP, DEDUP, JOIN, MERGE, BETWEEN, LENGTH, REGEX, ROUND, SUM, COUNT, TRIM, WHEN, AVE, CASE, NORMALIZE, DENORMALIZE, K-MEANS, and more.
Machine Learning Built-in
http://hpccsystems.com/ml
Regression: Linear Regression
Classification: Naive Bayes, Perceptron, Decision Trees, Logistic Regression
Clustering: K-Means, KD Trees, Agglomerative/Hierarchical
Association Analysis: AprioriN, EclatN, Rules
Michael Payne, of Clemson University, on high speed machine learning with PB-BLAS in HPCC Systems.
http://youtu.be/s_HWlMwi6iI
Un-Structured Data?
Example: 300GB file, lots of text
Lorem Ipsum is
simply dummy text
of the printing
Un-Structured Data
Lorem Ipsum is
simply dummy text
of the printing
HPCC Systems: Regular Expression in C++ or Pattern Match in ECL
Hadoop: Regular Expression in Java
Reg Ex + meta data stored only
Filtered Data + Indexes
Full Text Search 
Lorem Ipsum is 
simply dummy text 
of the printing 
Pattern Match in ECL and/or Reg Ex
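Filtering unstructured text with a regular expression, keeping only the matching lines and their positions, can be sketched in Python. This uses Python's standard `re` module purely for illustration; it is not ECL pattern matching, and the function name `find_matches` is hypothetical.

```python
import re

def find_matches(lines, pattern):
    """Return (line_number, line) for every line the pattern matches."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return [(number, line) for number, line in enumerate(lines)
            if compiled.search(line)]

text = [
    "Lorem Ipsum is",
    "simply dummy text",
    "of the printing",
]

# Whole-word search for "dummy", ignoring case.
hits = find_matches(text, r"\bdummy\b")
```

Storing only the hits (the filtered data plus where each was found) is a small version of keeping metadata and filtered results rather than the full text.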
Management & Administration 
vs 
More Moving Parts = More Downtime
“I want sub-second speed but made investment in HDFS.”
Roxie Master
Roxie Slaves: K CA 45 | M MI 27 | S FL 64, each with its own index
Hadoop / HPCC Transport Plug-in
Name Node
Data Nodes / Task Tracker: blocks a, b, c
http://hpccsystems.com/products-and-services/products/modules/hadoop-integration
Migrating from Hadoop to HPCC Systems 
Roxie Master K CA 45 M MI 27 S FL 64 
Index Index Index 
Roxie Slaves 
Name Node 
Data Nodes / Task Tracker 
Thor Master 
Thor Slaves 
Slowly replace Hadoop with Thor.
Alternative Query Methods
HPCC Systems Security 
User / Group Authentication 
Third Party Authentication 
Kerberos OK 
Encrypt Data on Disk optional
For More HPCC “How To’s”
Go to SlideShare
http://www.slideshare.net/FujioTurner/
Watch how to install HPCC Systems in 5 Minutes
http://www.youtube.com/watch?v=8SV43DCUqJg
Download HPCC Systems Open Source Community Edition
http://hpccsystems.com/download/
or Source Code
https://github.com/hpcc-systems

HPCC Systems vs Hadoop
