Building a Hadoop Connector
pastiaro.wordpress.com 
@rpastia
Building a connector – The Wrong Way
[Diagram: Mapper, Reducer]
Building a connector – The Right Way
[Diagram: Mapper, Partitioner, Reducer; InputFormat with InputSplit and RecordReader; OutputFormat with RecordWriter]
The InputFormat: From Input to Mapper
--range 2014-09-01;2014-09-20
--number_of_mappers 4
[Diagram: the range expands to the individual days 2014-09-01 … 2014-09-20; with 4 mappers, Input Split 1 covers 2014-09-01 through 2014-09-05; Record Reader 1 emits the pairs (2014-09-01-A; record A), (2014-09-01-B; record B), … for each day in its split; each pair is handed to the Mapper]

Editor's Notes

  • #2: Hi guys! My name is Radu and I would like to show you how to write a connector for Hadoop in MapReduce, and how easy it actually is.
  • #3: Quickly, about myself: I am a software developer in the Big Data team at Orange Romania (just started, actually) and I have been working with Hadoop for about two years. Before this I worked on backends, data processing and batch jobs, and my passion for this kind of work is what eventually led me to Hadoop.
  • #4: Now let’s jump straight into the topic: why are Hadoop connectors an important topic? Because Hadoop is very often paired with another system that is better suited for real-time operations, and no matter the setup, you will eventually need to transfer data between the two. We’ll use MapReduce to do this in an optimal way. Let’s start!
  • #5: First, avoid the pitfalls! You might be tempted to connect to the other system either in the mapper or in the reducer, since you’re already handling the data within these objects. This is not a good idea!
  • #6: These classes are not supposed to handle I/O by themselves. If you do this, you will lose some of the features that come straight out of the MapReduce framework, the classes will be harder to test, and the code will be less reusable.
  • #7: So then, how do you build a Hadoop connector in MapReduce? What else is there besides the Mapper and the Reducer? We have the InputFormat with its InputSplit and RecordReader; we have the Partitioner; and we have the OutputFormat with its RecordWriter. I’m pretty sure the colors already gave it away: we’ll use the InputFormat to import data, and the OutputFormat to export it. And now I’ll show you how to do each one.
  • #8: Let’s start with importing. Our data source will probably be some type of NoSQL DB, so the first thing we’ll have to do is think about how to find all the data, all the keys, and how to partition them so that we can query different partitions in parallel. There are several ways to do this; in the next slides I’ll assume that our data store allows us to easily get all records from a given date. Next we’ll need to define our configuration parameters and make sure we get them into the Configuration object. Finally we’ll have to implement the InputFormat with the InputSplit and the RecordReader.
  • #9: About configuration parameters: we can of course use the Hadoop ToolRunner class to handle them, but I recommend you also check out the Apache Commons CLI library because it provides a few nice extra features (see the command-line parsing sketch after these notes). You end up with a command line like this; notice how we’re importing 20 days’ worth of data and specifying 4 mappers. This means we’ll get four processes importing this data in parallel.
  • #10: The first class that we are going to look at is the InputFormat. We can have it split our input data into as many Input Splits as we want. Then, we’ll use it to create Record Readers that actually connect to the data source and provide us with the data.
  • #11: Let’s see how this whole process works. We are importing 20 days of data. First, our range is expanded to the actual days. Now, we want 4 mappers, which means 4 input splits, so we’ll create input splits of 5 days each. Next, each input split gets a record reader that reads each record from each of the five days in succession. Finally, the mapper is called for each record.
  • #12: Here’s what the code looks like (a minimal sketch follows these notes). We’ll extend the base Hadoop class and override the getSplits method. Inside, we create InputSplit objects, set the partitions in each one and add them to the list of InputSplits until we’ve covered all partitions.
  • #13: Next, the InputSplit class. Once all input splits have been constructed, map tasks are fired up throughout the Hadoop cluster and each task gets an input split. The InputSplits therefore need to be serialized, so we must implement the Writable interface. We’ll need to store the data partitions, and we’ll need to override four methods of the base class: getLength, getLocations and two more for the serialization.
  • #14: Let’s look at an example implementation to make things clearer. Storing the dataPartitions: we use an ArrayList as a class member, with a proper setter and getter. The length reported to the framework can be the size of this array if we can’t otherwise determine the precise data size. The getLocations method is used by the framework to select where to run tasks so that data locality is achieved. If the data store we’re connecting to is on a different cluster, as it usually is, we simply return an empty array.
  • #15: Last, the serialization. This can be easily implemented by leveraging the Writable classes built into Hadoop. Here, we load our data partitions into an ArrayWritable and call the write method on that object. Similarly, we deserialize data by calling the readFields method on a new ArrayWritable object (a full InputSplit sketch follows these notes).
  • #16: To finish our import, the last piece we need is the RecordReader; this is where we actually connect to the data source (see the sketch after these notes). We can override the initialize method to fire up our database client and load the partitions from the input split. Here, we create a queue out of all the partitions, which will then be queried one after the other. Then we override the method that iterates over the data and the methods that make it available to the mapper; this is pretty straightforward and specific to the data source, so we won’t go into more detail.
  • #17: That’s it! We now have our data reaching the mapper. Here, we can use a simple identity mapper to save the unaltered data, or we can even run a full MapReduce job before sending it to the output.
  • #18: A few words on exporting. This is done in a very similar way to the import, but it’s even simpler. Still, specific to the export is that we must first decide what operation we want to perform on the data we’re exporting. We can of course simply add, or store, the data, but we could also replace or even delete existing records. When exporting we don’t have to deal with splits anymore, so we just have to implement an OutputFormat and a RecordWriter.
  • #19: The OutputFormat simply provides a RecordWriter. Since this class does not have an initialize method, we’ll use the constructor to connect to the data store. Then, the write method can be used to perform the operations on the data (see the OutputFormat sketch after these notes).
  • #20: That’s all there is to exporting! Now that we know how to implement both import and export, we could even use them together in the same job to move data between two different databases (the driver sketch after these notes wires everything together).
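
Code Sketches

To make note #9 concrete, here is a minimal sketch of argument parsing with Apache Commons CLI. The option names match the slides; the ConnectorOptions class and the Configuration keys (connector.range, connector.mappers) are placeholders invented for this sketch, not part of the original deck. DefaultParser is available from Commons CLI 1.3 onwards; older versions used GnuParser.

    import org.apache.commons.cli.CommandLine;
    import org.apache.commons.cli.DefaultParser;
    import org.apache.commons.cli.Options;
    import org.apache.commons.cli.ParseException;
    import org.apache.hadoop.conf.Configuration;

    public class ConnectorOptions {

        // Parse the connector's arguments and copy them into the Hadoop Configuration,
        // so the InputFormat can read them back on the cluster side.
        public static void parseInto(String[] args, Configuration conf) throws ParseException {
            Options options = new Options();
            options.addOption(null, "range", true, "date range, e.g. 2014-09-01;2014-09-20");
            options.addOption(null, "number_of_mappers", true, "how many parallel import tasks");

            CommandLine cmd = new DefaultParser().parse(options, args);
            conf.set("connector.range", cmd.getOptionValue("range"));
            conf.set("connector.mappers", cmd.getOptionValue("number_of_mappers", "1"));
        }
    }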
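
Note #12 describes the getSplits implementation. Below is a minimal sketch assuming date-based partitions and the configuration keys from the CLI sketch above; DateInputSplit and DateRecordReader are the hypothetical classes sketched further down.

    import java.util.ArrayList;
    import java.util.List;
    import java.time.LocalDate;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public class DateInputFormat extends InputFormat<Text, Text> {

        @Override
        public List<InputSplit> getSplits(JobContext context) {
            Configuration conf = context.getConfiguration();
            List<String> days = expandRange(conf.get("connector.range"));
            int numMappers = conf.getInt("connector.mappers", 1);

            // Distribute the days over the requested number of splits.
            List<InputSplit> splits = new ArrayList<>();
            int perSplit = (int) Math.ceil((double) days.size() / numMappers);
            for (int i = 0; i < days.size(); i += perSplit) {
                DateInputSplit split = new DateInputSplit();
                split.setDataPartitions(new ArrayList<>(
                        days.subList(i, Math.min(i + perSplit, days.size()))));
                splits.add(split);
            }
            return splits;
        }

        @Override
        public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
            return new DateRecordReader();
        }

        // Expand "2014-09-01;2014-09-20" into the individual days of the range.
        private static List<String> expandRange(String range) {
            String[] endpoints = range.split(";");
            List<String> days = new ArrayList<>();
            for (LocalDate d = LocalDate.parse(endpoints[0]);
                 !d.isAfter(LocalDate.parse(endpoints[1])); d = d.plusDays(1)) {
                days.add(d.toString());
            }
            return days;
        }
    }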
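
Notes #13–#15 cover the custom InputSplit: an ArrayList of partitions, getLength and getLocations, and serialization through ArrayWritable. A minimal sketch, assuming the partitions are date strings; the class and member names are placeholders.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.hadoop.io.ArrayWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.InputSplit;

    public class DateInputSplit extends InputSplit implements Writable {

        private List<String> dataPartitions = new ArrayList<>();

        public List<String> getDataPartitions() { return dataPartitions; }

        public void setDataPartitions(List<String> partitions) { this.dataPartitions = partitions; }

        @Override
        public long getLength() {
            // We can't know the real byte size up front, so report the partition count (note #14).
            return dataPartitions.size();
        }

        @Override
        public String[] getLocations() {
            // The data store is on a different cluster, so no data-locality hints.
            return new String[0];
        }

        @Override
        public void write(DataOutput out) throws IOException {
            // Serialize by loading the partitions into an ArrayWritable (note #15).
            Text[] values = new Text[dataPartitions.size()];
            for (int i = 0; i < values.length; i++) {
                values[i] = new Text(dataPartitions.get(i));
            }
            ArrayWritable array = new ArrayWritable(Text.class);
            array.set(values);
            array.write(out);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // Deserialize by reading the fields back into a fresh ArrayWritable.
            ArrayWritable array = new ArrayWritable(Text.class);
            array.readFields(in);
            dataPartitions = new ArrayList<>(Arrays.asList(array.toStrings()));
        }
    }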
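
Note #16 describes the RecordReader. This sketch shows the queue-of-partitions structure; the queryDay method is a placeholder where a real implementation would call its database client.

    import java.util.ArrayDeque;
    import java.util.Collections;
    import java.util.Iterator;
    import java.util.Queue;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public class DateRecordReader extends RecordReader<Text, Text> {

        private Queue<String> partitions;     // days still to be read
        private Iterator<String[]> records;   // (key, record) pairs of the current day
        private final Text key = new Text();
        private final Text value = new Text();
        private int totalPartitions;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            // Fire up the database client here, then queue the split's partitions.
            DateInputSplit dateSplit = (DateInputSplit) split;
            partitions = new ArrayDeque<>(dateSplit.getDataPartitions());
            totalPartitions = partitions.size();
        }

        @Override
        public boolean nextKeyValue() {
            // Drain the current day's records, then move on to the next queued day.
            while (records == null || !records.hasNext()) {
                String day = partitions.poll();
                if (day == null) {
                    return false;   // all partitions have been consumed
                }
                records = queryDay(day);
            }
            String[] pair = records.next();
            key.set(pair[0]);
            value.set(pair[1]);
            return true;
        }

        @Override
        public Text getCurrentKey() { return key; }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() {
            return totalPartitions == 0 ? 1f : 1f - (float) partitions.size() / totalPartitions;
        }

        @Override
        public void close() {
            // Close the database client here.
        }

        // Placeholder: fetch one day's (key, record) pairs from the data store.
        private Iterator<String[]> queryDay(String day) {
            return Collections.<String[]>emptyList().iterator();
        }
    }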
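
Notes #18–#19 cover the export side. A minimal sketch: the connector.endpoint key and the empty method bodies are placeholders, and the no-op committer borrowed from NullOutputFormat is this sketch's choice, reflecting that we write straight to the external store rather than to HDFS.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.OutputCommitter;
    import org.apache.hadoop.mapreduce.OutputFormat;
    import org.apache.hadoop.mapreduce.RecordWriter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class DataStoreOutputFormat extends OutputFormat<Text, Text> {

        @Override
        public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext context) {
            return new DataStoreRecordWriter(context.getConfiguration().get("connector.endpoint"));
        }

        @Override
        public void checkOutputSpecs(JobContext context) {
            // Validate the configuration here, e.g. fail fast if the endpoint is missing.
        }

        @Override
        public OutputCommitter getOutputCommitter(TaskAttemptContext context) {
            // No files are written to HDFS, so a no-op committer is enough.
            return new NullOutputFormat<Text, Text>().getOutputCommitter(context);
        }

        // RecordWriter has no initialize method, so the constructor opens the connection (note #19).
        public static class DataStoreRecordWriter extends RecordWriter<Text, Text> {

            public DataStoreRecordWriter(String endpoint) {
                // Connect the database client to the endpoint here.
            }

            @Override
            public void write(Text key, Text record) {
                // Perform the chosen operation: add, replace or delete the record (note #18).
            }

            @Override
            public void close(TaskAttemptContext context) {
                // Flush pending writes and close the client.
            }
        }
    }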
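
Finally, notes #17 and #20 suggest wiring import and export together. A possible map-only driver, assuming the classes sketched above; the base Mapper is already an identity mapper, which matches the "save the unaltered data" case from note #17.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ConnectorJob {

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            ConnectorOptions.parseInto(args, conf);   // from the CLI sketch above

            Job job = Job.getInstance(conf, "db-to-db connector");
            job.setJarByClass(ConnectorJob.class);
            job.setInputFormatClass(DateInputFormat.class);
            job.setOutputFormatClass(DataStoreOutputFormat.class);
            // The base Mapper's default map() forwards each pair unchanged (identity mapper).
            job.setMapperClass(Mapper.class);
            job.setNumReduceTasks(0);                 // map-only: records go straight to the writer
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }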