✕ 
CloverETL versus Hadoop
Transforming very large data sets in parallel:
a deathmatch, or happy together?
= 
similarities 
• Both technologies use data parallelism - input data are split into
“partitions”, which are then processed in parallel.
• Each partition is processed the same way (the same algorithm is used).
• At the end of the processing, the results of the individually processed
partitions need to be merged to produce the final result.
[Diagram: input data are split into Part 1, Part 2 and Part 3; each part is processed in parallel; the partial results are merged into the final result.]
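To make the shared pattern concrete, below is a minimal Java sketch (toy data and names, not Clover or Hadoop API): the partitions are processed in parallel with the same algorithm, and the collector merges the partial results into the final one.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SplitProcessMerge {
    // process: count occurrences of each record value; every partition is
    // handled by the same algorithm, in parallel
    public static Map<String, Long> run(List<List<String>> partitions) {
        return partitions.parallelStream()
                .flatMap(List::stream)
                .collect(Collectors.groupingBy(r -> r, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("NY", "VA"),   // Part 1
                Arrays.asList("NY", "MA"),   // Part 2
                Arrays.asList("VA", "NY"));  // Part 3
        System.out.println(run(partitions)); // merged final result, e.g. {NY=3, VA=2, MA=1}
    }
}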
✕ 
differences 
• Hadoop uses the Map->Reduce pattern, originally developed by Google for
web indexing and searching. Processing is divided into a Map phase
(filtering & sorting) and a Reduce phase (summary operation).
The Hadoop approach expects the initially large volume of data to be reduced to a
much smaller result - e.g. searching for pages with a certain keyword.
• CloverETL is based on the pipeline-parallelism pattern, where individual
specialized components perform various operations on a flow of data
records - parsing, filtering, joining, aggregating, de-duping...
Clover is optimized for large volumes of data flowing through it and
being transformed on the fly.
= 
similarities 
Both technologies use partitioned & distributed storage of data (a filesystem).
• Hadoop uses HDFS (Hadoop Distributed File System), with individual
DataNodes residing on the physical nodes of the Hadoop/HDFS cluster.
• CloverETL uses a Partitioned Sandbox, where data are spread over the
physical nodes of a CloverETL Cluster. Each node is also a data processing
node, typically (but not exclusively) processing locally stored data. One node
can be part of more than one Partitioned Sandbox.
✕ 
differences 
HDFS operates at the byte level (data are read & written as streams of
bytes). It includes data loss prevention through data redundancy.
HDFS is based on the “write-once, read-many-times” pattern.
CloverETL’s Partitioned Sandbox operates at the record level (data are
read & written as complete records). Data loss prevention is left to the
underlying file system storage. Clover’s Partitioned
Sandbox supports the very high I/O throughput needed for massive data
transformations.
CloverETL ✕ Hadoop HDFS
HDFS stores, splits and distributes data at the byte level:
[Diagram: the record “456,NY,JOHN\n” is split byte by byte, mid-record.]
CloverETL stores, splits and distributes data at the record level:
[Diagram: the records “456,NY,JOHN\n”, “457,VA,BILL\n” and “458,MA,SUE\n” are split only at record boundaries.]
Hadoop HDFS
organises files into large blocks of bytes (64 MB or more),
which are then physically stored on different nodes of the
Hadoop cluster.
[Diagram: an HDFS data file is cut into 64 MB data blocks regardless of record boundaries, splitting one record between Block 1 and Block 2; blocks 1, 3, 5, 7, ... are stored on Node 1 and blocks 2, 4, 6, 8, ... on Node 2.]
Hadoop HDFS
partitions, distributes and stores data at the byte level.
[Diagram: the record “456,NY,JOHN\n” is split mid-record; the 1st part is stored on Node 1, the 2nd part on Node 2.]
☛ One data record in the source data can end up being split between two different nodes.
☛ Writing or reading such a record requires accessing two different nodes via the network.
☛ HDFS presents files as a single continuous stream of data (similar to any local filesystem).
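As an aside on that last point, a short sketch using the standard org.apache.hadoop.fs client API shows HDFS hiding block boundaries behind a single stream (the namenode URI and file path below are made-up placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStreamRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        // fs.open() returns one continuous stream; which DataNode each block
        // lives on, and where the block boundaries fall, is invisible here
        try (FSDataInputStream in = fs.open(new Path("/data/customers.csv"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // a record split across two blocks arrives whole
            }
        }
    }
}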
Hadoop HDFS
☛ Parallel writing to one HDFS file is impossible.
Two processes cannot write to one data block at the same time.
Two processes trying to write in parallel to one HDFS file (two different blocks)
will face the block-boundary issue, with potential collisions: when one process
needs to write its n-th record, the target block may already be partially filled
by the other process, leaving not enough space and no clear answer to “where to write?”.
[Diagram: an output file grows as 64 MB blocks are added (Block 1 on Node 1, Block 2 on Node 2); one process executed on Node 1 and another executed on Node 2 each write to both nodes, and they collide at the block boundary when the second process starts writing to Block 2.]
CloverETL Partitioned Sandbox
partitions, distributes and stores data at the record level.
[Diagram: the records “456,NY,JOHN\n”, “457,VA,BILL\n” and “458,MA,SUE\n” are split at record boundaries; each complete record gets stored on Node 1 or Node 2.]
☛ Nodes contain complete records.
☛ Writing or reading records means accessing locally stored data only.
☛ Partitioned data are located in multiple files on the individual nodes. Clover offers a
unified user view over those files. During processing, the partition files are accessed individually.
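A minimal sketch of the record-level idea (plain Java I/O, illustrative only - this is not CloverETL's actual API): complete records are dealt round-robin into per-partition files, so no record ever straddles a partition boundary. The file names are hypothetical.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class RecordPartitioner {
    public static void main(String[] args) throws IOException {
        final int n = 2; // one partition per cluster node
        BufferedWriter[] partitions = new BufferedWriter[n];
        for (int i = 0; i < n; i++) {
            partitions[i] = new BufferedWriter(new FileWriter("partition" + i + ".csv"));
        }
        try (BufferedReader in = new BufferedReader(new FileReader("input.csv"))) {
            String record;
            long count = 0;
            while ((record = in.readLine()) != null) {
                int idx = (int) (count++ % n); // the unit of distribution is a whole record
                partitions[idx].write(record);
                partitions[idx].newLine();
            }
        } finally {
            for (BufferedWriter w : partitions) w.close();
        }
    }
}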
CloverETL Partitioned Sandbox
☛ Parallel writing to a Partitioned Sandbox is easy.
Two processes write to two independent partitions of a Clover sandbox.
Each process writes to the partition which is local to the node where it runs - no
collisions.
[Diagram: the 1st process, executed on Node 1, writes to Node 1 only, appending records 456,NY,JOHN\n 458,VA,WILLIAM\n 460,MA,MAG\n to Partition 1; the 2nd process, executed on Node 2, writes to Node 2 only, appending records 457,NJ,ANN\n 459,IL,MEGAN\n 461,WA,RYAN\n to Partition 2.]
Fault resiliency
☛ HDFS implements fault tolerance
HDFS replicates individual data blocks across cluster nodes, thus ensuring fault
tolerance.
☛ Clover delegates fault resiliency to the local file system
Clover provides a unified view of data stored locally on the nodes. It is the nodes’
setup (OS, filesystem) that is responsible for fault resiliency.
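On the HDFS side this is a single configuration knob; dfs.replication is the standard HDFS property, and 3 is its usual default (a minimal sketch, not cluster-ready setup code):

import org.apache.hadoop.conf.Configuration;

public class ReplicationFactor {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // dfs.replication controls how many DataNodes hold a copy of each block
        conf.set("dfs.replication", "3");
        System.out.println("replication = " + conf.get("dfs.replication"));
    }
}

On the Clover side, the equivalent protection would come from RAID or a replicated filesystem configured at the OS level on each node.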
Hadoop transformation logic is written as Java code - the canonical WordCount example (Mapper, Reducer and job setup):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token of every input line
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the 1s emitted for each word
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
How Hadoop processes data
[Diagram: Blocks 1-3 of the input data file are read by processes 1-3, each running map() to map data to key->value pairs; the temp data are sorted and (partially) merged; processes 4 and 5 run reduce() and write output.part1 and output.part2.]
• Hadoop concentrates transformation logic into 2 stages: map & reduce.
• Complex logic must be split into multiple map & reduce phases, with temporary data stored
in between.
• Intense network communication happens when reducers (one or more) merge data from multiple
mappers (mappers and reducers may run on different nodes).
• If multiple reducers are used (to accelerate processing), the resulting data end up in multiple
output files (which need to be merged again to produce a single final result).
How CloverETL processes data
[Diagram: the input data file is split into Partition 1 (456,NY,JOHN\n 458,VA,WILLIAM\n ...) and Partition 2 (457,NJ,ANN\n 459,IL,MEGAN\n ...); each partition flows through the transformation logic with pipeline-parallelism; the results are merged into output.full.]
• Clover processes data via a set of transformation components running in pipeline-parallelism mode.
• Even complex transformations can be performed without temporarily storing data.
• Individual processing nodes obey data locality: each cluster node processes only its locally stored
data partition.
• Clover allows partitioned output data to be automatically presented as one single result.
Wikipedia > Pipeline parallelism: multiple components run on the same data set, i.e. while a record is
processed in one component, the previous record is being processed in another component.
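A minimal sketch of pipeline parallelism with two stages connected by a bounded queue (plain Java threads, illustrating the pattern rather than CloverETL's internals): while stage 2 transforms record N, stage 1 is already reading record N+1, and no intermediate file is ever written.

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineSketch {
    private static final String EOF = "\u0000"; // end-of-stream marker

    public static void main(String[] args) throws InterruptedException {
        List<String> input = Arrays.asList("456,NY,JOHN", "457,VA,BILL", "458,MA,SUE");
        BlockingQueue<String> edge = new ArrayBlockingQueue<>(1024); // the "pipe" between components

        Thread reader = new Thread(() -> {        // stage 1: read/parse records
            try {
                for (String rec : input) edge.put(rec);
                edge.put(EOF);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread transformer = new Thread(() -> {   // stage 2: transform/write records
            try {
                while (true) {
                    String rec = edge.take();
                    if (rec.equals(EOF)) break;
                    System.out.println(rec.toLowerCase()); // runs while stage 1 reads ahead
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        reader.start();
        transformer.start();
        reader.join();
        transformer.join();
    }
}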
✕ 
differences 
☛ HDFS optimizes for storage
HDFS optimizes for storing vast amounts of data across hundreds of cluster
nodes. It follows the “write-once, read-many-times” pattern.
☛ Clover optimizes for I/O throughput
Clover optimizes for very fast writing or reading of data in parallel on dozens
of cluster nodes. This lends itself nicely to read & process & write (a.k.a. ETL).
Which approach is better?
It depends…
CloverETL is better for typical data transformation/integration
tasks, where all or most input data records get transformed
and written out.
Clover’s Partitioned Sandbox expects short-term storage of data.
HDFS is better for storing vast amounts of data which
are written by a single process and potentially read by
several processes.
HDFS expects long-term storage of data.
?
which one
Wouldn’t it be nice to have the best of both worlds?
It’s possible!
• Clover is able to read & write data from/to HDFS
• Clover can read and process HDFS-stored data in parallel
• Clover can write the results of processing to its Partitioned Sandbox in
parallel, or store them back to HDFS as a serial file
• Data processing tasks can be visually designed in CloverETL
…thus taking advantage of both worlds.
CloverETL parallel reading from HDFS
[Diagram: an input data file on HDFS (Blocks 1-3) is read by multiple instances of the Parallel Reader, which access HDFS to read data in parallel; data processing is performed by standard CloverETL components, with standard CloverETL debugging available; the final result is written as a single serial file to the local filesystem.]
In this scenario:
• HDFS serves as a storage system for raw source data
• CloverETL is the data processing engine
+ 
Benchmarks
The (simple) scenario 
• Apache log stored on HDFS 
• ~274 million web log records 
• Extract year, month and IP address 
• Aggregate the data to get the number of unique visitors per
month
• Running on a cluster of 4 hardware nodes, using:
• Hadoop only 
• Hadoop+Hive 
• CloverETL only 
• CloverETL + Hadoop/HDFS
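The core of the benchmark logic, sketched in plain Java (an illustration assuming the standard Apache common-log layout “ip - - [dd/MMM/yyyy:HH:mm:ss ...]”, not the actual benchmark code; each benchmarked setup expressed this same logic in its own form - a MapReduce job, a Hive query, a Clover transformation graph):

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class UniqueVisitors {
    // month key ("Oct/2011") -> number of distinct IPs seen in that month
    public static Map<String, Integer> countUnique(List<String> logLines) {
        Map<String, Set<String>> perMonth = new HashMap<>();
        for (String line : logLines) {
            String ip = line.substring(0, line.indexOf(' '));   // host field
            int open = line.indexOf('[');                       // "[10/Oct/2011:13:55:36 ..."
            String month = line.substring(open + 4, open + 12); // -> "Oct/2011"
            perMonth.computeIfAbsent(month, m -> new HashSet<>()).add(ip);
        }
        Map<String, Integer> counts = new HashMap<>();
        perMonth.forEach((m, ips) -> counts.put(m, ips.size()));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countUnique(java.util.Arrays.asList(
                "10.0.0.1 - - [10/Oct/2011:13:55:36 -0700] \"GET / HTTP/1.1\" 200 2326",
                "10.0.0.2 - - [11/Oct/2011:09:01:02 -0700] \"GET /a HTTP/1.1\" 200 99")));
        // prints {Oct/2011=2}
    }
}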
The (simple) scenario results

Setup                                                            Time (sec)
Hadoop only (8 reducers)                                                329
Hadoop Hive query                                                       127
CloverETL only (Partitioned Sandbox)                                     59
CloverETL + Hadoop/HDFS (segmented parallel reading from HDFS)           72
+ 
synergy 
CloverETL brings
• fast parallel processing
• visual design & debugging
• support for formats and
communication protocols
• process automation & monitoring
Hadoop/HDFS brings
• low-cost storage of big data
• fault resiliency through
controllable data replication
“Happy Together” 
song by 
The Turtles
+ 
synergy 
For more information on 
• CloverETL Cluster architecture: 
http://www.cloveretl.com/products/server/cluster 
http://www.slideshare.net/cloveretl/cloveretl-cluster 
• CloverETL in general: 
http://www.cloveretl.com
