A Non-Standard Use Case of Hadoop: High Scale Image Processing and Analytics


1. The Hadoop Image Processing (HIP) pipeline acquires vehicle images, identifies updates, generates URLs, crops and resizes images, copies them to asset servers, and removes duplicates.
2. It uses HBase for image storage and archiving, MapReduce as the computation engine, Kafka for publishing to asset servers, OpenCV for image processing, and Avro for data serialization.
3. Performance testing showed that HIP scales linearly and is at least 10x faster than the previous system, and that cascading downloads provided a 20% performance gain.


Hadoop Image Processing Pipeline (HIP)
June 10, 2015
Russell Foltz-Smith
Anil Gupta

Image Processing Pipeline
● Acquire Images of Vehicle
● Identify updates/deletes to Images
● Generate unique URL for Images
● Crop and Resize Images
● Copy images to Asset Servers
● Dedupe Images

Why Hadoop?
● High Scalability
● Store historical data of Images
● Fault tolerance
● Identify updates to images on the basis of the content of the URL

HIP Components
1. HBase: Datastore for Images and archiving Images
2. MapReduce: Computation engine for Image Processor
3. Kafka: Publisher/Subscriber for pushing images to Asset Servers
4. OpenCV Java: Image Processing library
5. Avro: Serialization library for storing data on HDFS

HBase Data Model
Tables:
1. IMAGE: Stores the current set of Images with some metadata
2. IMAGE_ARCHIVE: Stores historical data of Vehicles and Original Images

HBase Data Model
Table: IMAGE
RowKey: <Vin_Number>

Column Family | Description | Versions
I | Stores all images of a vehicle, one Image in each Column | 1
H | Stores metadata of all Images | 1

Read patterns for “I” and “H” are mutually exclusive

HBase Data Model
Table: IMAGE_ARCHIVE
RowKey: <Provider_id><Dealer_Id><vehicle_vin><Image_Index>

Column Family | Description | Versions
I | Stores original images of a vehicle. Only 1 column is stored. | 10
A | Stores fields of the Avro Object of Vehicle and Image for analytics | 10
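
The IMAGE_ARCHIVE row key concatenates four fields. A minimal sketch of composing such a key follows. It assumes zero-padded, fixed-width numeric ids so that keys sort consistently; the field widths, the padding, the placeholder VIN, and the class name ArchiveRowKey are illustrative assumptions, since the slides do not say how the fields are encoded or delimited.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for <Provider_id><Dealer_Id><vehicle_vin><Image_Index>.
final class ArchiveRowKey {

    // Assumed layout: 6-digit provider id, 6-digit dealer id, VIN as-is, 3-digit image index.
    static String compose(int providerId, int dealerId, String vin, int imageIndex) {
        if (providerId < 0 || dealerId < 0 || imageIndex < 0) {
            throw new IllegalArgumentException("ids and index must be non-negative");
        }
        if (vin == null || vin.isEmpty()) {
            throw new IllegalArgumentException("VIN must not be empty");
        }
        return String.format(Locale.ROOT, "%06d%06d%s%03d", providerId, dealerId, vin, imageIndex);
    }

    // HBase row keys are byte arrays; UTF-8 keeps ASCII keys in lexicographic order.
    static byte[] toBytes(String rowKey) {
        return rowKey.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "TESTVIN0000000001" is a made-up placeholder, not a real VIN.
        System.out.println(compose(10584, 15, "TESTVIN0000000001", 3));
    }
}
```

Leading the key with the provider and dealer ids keeps all images of one dealer adjacent, which would suit range scans per dealer; whether that was the authors' motivation is not stated.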

HBase Tuning
● Pre-split tables
● Keep Column names short (2-8 letters)
● Region size 8-10 GB
● Asynchronous clients should buffer Put operations (autoFlush=false)
● Disable periodic Major Compaction
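
Pre-splitting means choosing region boundaries before any data is loaded. A rough sketch of picking evenly spaced split points is shown here, assuming row keys start with a character drawn uniformly from a known, sorted alphabet; that uniformity, the alphabet, and the class name SplitPoints are assumptions for illustration only. In HBase the resulting keys would be converted to a byte[][] and supplied when the table is created.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes (regions - 1) split points over a sorted alphabet.
final class SplitPoints {

    static List<String> evenSplits(String sortedAlphabet, int regions) {
        if (regions < 1 || regions > sortedAlphabet.length()) {
            throw new IllegalArgumentException("regions must be between 1 and the alphabet size");
        }
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < regions; i++) {
            // Boundary i sits at the i-th fraction of the alphabet.
            int index = i * sortedAlphabet.length() / regions;
            splits.add(String.valueOf(sortedAlphabet.charAt(index)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Four regions over hexadecimal first characters.
        System.out.println(evenSplits("0123456789ABCDEF", 4)); // prints [4, 8, C]
    }
}
```

Real key distributions are rarely uniform, so split points are better derived from a sample of actual row keys when one is available.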

Pipeline Dataflow Overview
1. InventoryProcessor Output
2. [Mapper] Parse & Validate Records
3. [Reducer] Identify CRUD Operation
4. Reducer output goes to HBase and to Kafka
5. Kafka pushes images to the Asset Servers

CRUD in Reducer
Start: Is Deleted?
● Yes: Delete Row in HBase
● No: Is Insert?
  ● Yes: Download Images, Generate 6 Sizes of Image, then 1. Write to HBase 2. Write to Kafka
  ● No: Get HTTP Headers of ImageURL and Compare with Existing. Header Mismatch?
    ● No: Do Nothing
    ● Yes: process the image again (presumably the same download, resize, and write path as an insert)
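
The decision flow in the reducer can be sketched as a small pure function. This is only an illustration under stated assumptions: the slides do not say which HTTP headers are compared, so ETag, Last-Modified, and Content-Length are guesses, header names are assumed to be normalized already, and CrudDecision and its Action values are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch of the reducer's delete / insert / update decision.
final class CrudDecision {

    enum Action { DELETE_ROW, DOWNLOAD_AND_WRITE, DO_NOTHING }

    // Assumed set of headers; the slides only say "HTTP Headers of ImageURL".
    static final List<String> COMPARED_HEADERS =
            Arrays.asList("ETag", "Last-Modified", "Content-Length");

    static Action decide(boolean deleted, boolean isInsert,
                         Map<String, String> storedHeaders,
                         Map<String, String> currentHeaders) {
        if (deleted) {
            return Action.DELETE_ROW;            // delete the row in HBase
        }
        if (isInsert) {
            return Action.DOWNLOAD_AND_WRITE;    // download, resize, write to HBase and Kafka
        }
        // Existing image: only reprocess when the headers no longer match.
        return headersDiffer(storedHeaders, currentHeaders)
                ? Action.DOWNLOAD_AND_WRITE
                : Action.DO_NOTHING;
    }

    static boolean headersDiffer(Map<String, String> stored, Map<String, String> current) {
        for (String name : COMPARED_HEADERS) {
            if (!Objects.equals(stored.get(name), current.get(name))) {
                return true;
            }
        }
        return false;
    }
}
```

Comparing headers instead of image bytes lets the job skip unchanged images without downloading them, which appears to be the point of the header check in the flow.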

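Cascading Downloads
All stages run in One JVM Process via [ChainReducer]: ImageProcessorReducer, ImageProcessorMapper, ImageProcessorRetryMapper
1. First attempt: Socket timeout in 500 milliseconds?
  ● No: 1. Write to HBase 2. Write to Kafka
  ● Yes: hand the URL to ImageProcessorRetryMapper
2. Retry: Socket timeout in 5 seconds?
  ● No: 1. Write to HBase 2. Write to Kafka
  ● Yes: Mark URL as “Cannot Process”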
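
The two-tier timeout idea can be sketched in plain Java. This sketch collapses both attempts into one method, whereas the pipeline runs the retry in a separate chained stage; the Fetcher interface, the Outcome values, and the class name are hypothetical, and only the 500 millisecond and 5 second limits come from the slide.

```java
import java.net.SocketTimeoutException;

// Hypothetical sketch: try a short timeout first, then retry once with a long one.
final class CascadingDownload {

    static final int FAST_TIMEOUT_MS = 500;     // first attempt, from the slide
    static final int RETRY_TIMEOUT_MS = 5_000;  // retry attempt, from the slide

    // Stand-in for the real HTTP download; throws when the socket times out.
    interface Fetcher {
        byte[] fetch(String url, int timeoutMs) throws SocketTimeoutException;
    }

    enum Outcome { FETCHED_FAST, FETCHED_ON_RETRY, CANNOT_PROCESS }

    static Outcome download(String url, Fetcher fetcher) {
        try {
            fetcher.fetch(url, FAST_TIMEOUT_MS);
            return Outcome.FETCHED_FAST;        // would be written to HBase and Kafka
        } catch (SocketTimeoutException slow) {
            // Fall through to the slower retry path.
        }
        try {
            fetcher.fetch(url, RETRY_TIMEOUT_MS);
            return Outcome.FETCHED_ON_RETRY;    // would be written to HBase and Kafka
        } catch (SocketTimeoutException dead) {
            return Outcome.CANNOT_PROCESS;      // URL is marked "Cannot Process"
        }
    }
}
```

The short first timeout keeps responsive hosts from waiting behind slow ones, and only the slow URLs pay the longer timeout.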

Kafka Producer
● One message per Image file
● Producer Message Format:
  ● Key: ImageFileName (kafka.serializer.StringEncoder)
  ● Value: Image (kafka.serializer.DefaultEncoder)
Example:
Key: /inventory/10584/15/5YJSA1DP0DFP1156/6ZBQHFKBVMY7OTBO-251.jpg
Value: the image file itself
Kafka Producer Tuning

Property | Value | Default Value
request.required.acks | 1 | 0
message.send.max.retries | 30 | 3
retry.backoff.ms | 5000 | 100
client.id | HIP | “”

For the Producer to sustain a NODE failure:
retry.backoff.ms * message.send.max.retries > Zookeeper Timeout (default: 60000)
With the defaults (100 * 3), that means failure recovery in 300 ms. Really?
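
The inequality above is simple arithmetic over the producer settings. The sketch below puts the tuned values from the table into a java.util.Properties object and computes the retry window; the class and method names are hypothetical, and the 60000 ms timeout is the figure quoted on the slide rather than a value checked against any particular Kafka release.

```java
import java.util.Properties;

// Hypothetical sketch: the tuned producer settings and the retry-window check.
final class ProducerTuning {

    // Values from the tuning table.
    static Properties hipProducerProperties() {
        Properties props = new Properties();
        props.setProperty("request.required.acks", "1");
        props.setProperty("message.send.max.retries", "30");
        props.setProperty("retry.backoff.ms", "5000");
        props.setProperty("client.id", "HIP");
        return props;
    }

    // Total time the producer keeps retrying before giving up on a message.
    static long retryWindowMs(Properties props) {
        long backoffMs = Long.parseLong(props.getProperty("retry.backoff.ms"));
        long retries = Long.parseLong(props.getProperty("message.send.max.retries"));
        return backoffMs * retries;
    }

    // True when retries outlast the timeout after which a failed node is detected.
    static boolean survivesNodeFailure(Properties props, long zookeeperTimeoutMs) {
        return retryWindowMs(props) > zookeeperTimeoutMs;
    }

    public static void main(String[] args) {
        Properties tuned = hipProducerProperties();
        System.out.println(retryWindowMs(tuned));              // prints 150000
        System.out.println(survivesNodeFailure(tuned, 60_000)); // prints true
    }
}
```

With the tuned values the window is 5000 * 30 = 150000 ms, comfortably above 60000 ms, whereas the defaults give only 100 * 3 = 300 ms.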

Kafka Brokers Tuning

Property | Value | Default Value
log.retention.bytes | 24 GB | -1 (unlimited)
socket.send.buffer.bytes | 10485760 | 1048576
socket.receive.buffer.bytes | 10485760 | 1048576

1. Data is purged when either log.retention.bytes or log.retention.hours is exceeded.
2. log.retention.bytes = disk space / number of partitions on each node
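
The sizing rule for log.retention.bytes is a single division. As a worked example, the sketch below uses a hypothetical node with 240 GB of log disk and 10 partitions, numbers chosen only because they reproduce the 24 GB in the table; the actual disk size and partition count are not given in the slides.

```java
// Hypothetical sketch of: log.retention.bytes = disk space / partitions on each node.
final class RetentionSizing {

    static final long GIB = 1024L * 1024L * 1024L;

    static long retentionBytesPerPartition(long diskBytesPerNode, int partitionsPerNode) {
        if (partitionsPerNode <= 0) {
            throw new IllegalArgumentException("partitionsPerNode must be positive");
        }
        return diskBytesPerNode / partitionsPerNode;
    }

    public static void main(String[] args) {
        // Assumed inputs: 240 GiB of disk, 10 partitions on the node.
        long bytes = retentionBytesPerPartition(240 * GIB, 10);
        System.out.println(bytes);        // prints 25769803776
        System.out.println(bytes / GIB);  // prints 24
    }
}
```

Because the limit applies per partition, dividing the disk by the partition count keeps the combined logs on a node from outgrowing its disk.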

OpenCV
● Used the Java bindings of OpenCV to avoid using Hadoop Streaming
● The Java API is quite straightforward for encoding, decoding, cropping, and resizing.
Memory Leak:
Mat.release() has to be called to free the memory used by a Mat.

Performance
[Chart: Hours (0 to 400) versus Images in Millions (3 to 18), comparing HIP and ImageProcessor1.0]
HIP scales linearly and is at least 10x faster

Cascading Downloads
[Chart: Hours (0 to 14) versus Images in Millions (3 to 18), comparing HIP with Cascading and HIP without Cascading]
20% performance gain

[Slide 3: Image Processing Pipeline Example, HIP (figure)]

FUTURE… Machine Learning!

Thanks! Questions?
