Casual mass parallel data processing in Java

Alexey Ragozin

Mar 2014

Build Vs. Buy
Build
• No dedicated team to support infrastructure
• Very specific tasks
• Exclusive use of infrastructure
• Reasonable scale

Buy
• Product can be bought as a service (internal or external)
• Large scale
• Multi-tenancy
• You are going to use advanced features (e.g. map/reduce)

“Casual” computing
• Small computation farms (< 100 servers)
• Team owns both application and grid
• Java platform
• Reasonably short batches (< 24 hours)
• Reasonably small data sets (< 10 TiB)

Simple master slave topology
[Diagram: a master process containing a scheduler and a task queue; arrows labeled Advertise, Task and Report connect it to three slaves]

Simple master slave topology
Control plane
  - RMI

Queue / scheduler
  - Simple in-memory queue
  - May be more complex than just a task queue

Data plane
…
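
A minimal sketch of the "simple in-memory queue" idea, assuming threads stand in for slave processes and no RMI is involved. The class name `InMemoryMaster`, the `STOP` marker and the report format are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch: a master with an in-memory task queue; threads play the slaves. */
class InMemoryMaster {
    private static final int STOP = -1; // marker telling a slave to exit

    private final BlockingQueue<Integer> tasks = new LinkedBlockingQueue<>();
    private final BlockingQueue<String> reports = new LinkedBlockingQueue<>();

    /** Queues tasks 0..taskCount-1, lets the slaves process them, and collects one report per task. */
    List<String> runBatch(int taskCount, int slaveCount) {
        List<Thread> slaves = new ArrayList<>();
        for (int s = 0; s < slaveCount; s++) {
            Thread slave = new Thread(this::slaveLoop, "slave-" + s);
            slave.start();
            slaves.add(slave);
        }
        List<String> collected = new ArrayList<>();
        try {
            for (int t = 0; t < taskCount; t++) {
                tasks.put(t);
            }
            for (int s = 0; s < slaveCount; s++) {
                tasks.put(STOP); // one stop marker per slave, queued after all tasks
            }
            for (int t = 0; t < taskCount; t++) {
                collected.add(reports.take());
            }
            for (Thread slave : slaves) {
                slave.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("batch interrupted", e);
        }
        return collected;
    }

    /** Slave side: take a task, do the work (here: squaring), report the result. */
    private void slaveLoop() {
        try {
            while (true) {
                int task = tasks.take();
                if (task == STOP) {
                    return;
                }
                reports.put("task " + task + " -> " + (task * task));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Calling `new InMemoryMaster().runBatch(10, 3)` returns ten reports in whatever order the slaves finish. In a real grid the two queues would sit behind the control plane rather than in shared memory.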

Data plane
Never, ever, try to send data over RMI

File system
  - Avoid network mounts!

In-memory key-value
  - Client-side sharding works best

Disk database (RDBMS or NoSQL)
  - Consider prefetching data

Direct socket streaming
…
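
Client-side sharding can be sketched as follows, assuming each `Map` stands in for one remote key-value node. `ShardedStore` and `shardFor` are hypothetical names, and plain hash-modulo placement is only one possible scheme.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: each Map stands in for one remote in-memory key-value node. */
class ShardedStore {
    private final List<Map<String, String>> shards = new ArrayList<>();

    ShardedStore(int shardCount) {
        for (int i = 0; i < shardCount; i++) {
            shards.add(new HashMap<>());
        }
    }

    /** The client alone picks the shard, so nodes need not know about each other. */
    int shardFor(String key) {
        return Math.floorMod(key.hashCode(), shards.size());
    }

    void put(String key, String value) {
        shards.get(shardFor(key)).put(key, value);
    }

    String get(String key) {
        return shards.get(shardFor(key)).get(key);
    }
}
```

Because every client computes the same shard for the same key, no coordination service is needed as long as the shard count stays fixed for the duration of a batch.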

Distributed objects revised
Pitfalls of CORBA/RMI
• IDL – functional contract
• IDL – protocol

Separating concerns
• Functional contract – wrapper object
• Protocol – hidden remote interface
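
One way to read this separation, as a sketch only: application code sees a plain interface, while a wrapper object owns the remote stub and any protocol decisions such as batching or caching. All names here (`PriceSource`, `RemotePriceProtocol`, `PriceSourceWrapper`) are hypothetical.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.HashMap;
import java.util.Map;

/** Functional contract: what application code programs against (hypothetical). */
interface PriceSource {
    double priceOf(String symbol);
}

/** Protocol: the remote interface, kept out of application code (hypothetical). */
interface RemotePriceProtocol extends Remote {
    double[] fetchPrices(String[] symbols) throws RemoteException;
}

/** Wrapper object: implements the contract and hides the protocol behind it. */
class PriceSourceWrapper implements PriceSource {
    private final RemotePriceProtocol remote;
    private final Map<String, Double> cache = new HashMap<>();

    PriceSourceWrapper(RemotePriceProtocol remote) {
        this.remote = remote;
    }

    @Override
    public double priceOf(String symbol) {
        Double cached = cache.get(symbol);
        if (cached != null) {
            return cached; // caching is a protocol concern, invisible to the caller
        }
        try {
            double price = remote.fetchPrices(new String[] {symbol})[0];
            cache.put(symbol, price);
            return price;
        } catch (RemoteException e) {
            // Callers of the functional contract never see RMI exception types
            throw new IllegalStateException("remote call failed", e);
        }
    }
}
```

The remote interface can change (for example, to batch more symbols per call) without touching any code that uses `PriceSource`.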

Distributed objects revised
Renewed distributed objects paradigm
Strong
• Polymorphism
• Encapsulation
  - Network protocol, caching aspects, etc.

Weak
• Homogeneous code base required
• Synchronous network communications

Deployment problem
Brute force
  - Build / package
  - Deploy / SCP
  - Restart slaves
  - Start batch
  - Change code, repeat

Computation grid software
  - Compile and run batch

Behind the scenes
  - Your classes are collected
  - Associated with the batch
  - Deployed on participating slaves

Central scheduler topology
[Diagram: two batch controllers add tasks to, and consume reports from, a queue server holding the task queue; arrows labeled Task, Report and Pull connect the queue server to three slaves]

Flavors of parallel processing
Flow-organized tasks
• Input data available before task starts
• e.g. Map/Reduce

Collaborative tasks
• Tasks communicate intermediate results to each other
• e.g. physics simulations

Get back to data plane
Rules of thumb
• Insert / delete – never update
• Write locally (reducing risks)
• Read remotely (retry on error)
• Store input as is
  - File system
  - Document / column-oriented NoSQL
• Input and temporary data are different
  - Choose the right store for each
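
"Read remotely (retry on error)" could look like the sketch below. It assumes data is never updated in place, so repeating a read is harmless. `RetryingReader` and `readWithRetry` are hypothetical names; a real implementation would probably add a delay between attempts.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch: retry a remote read a bounded number of times. */
class RetryingReader {
    static <T> T readWithRetry(Callable<T> read, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return read.call();
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
        }
        throw new IllegalStateException("read failed after " + maxAttempts + " attempts", last);
    }
}
```

The retry is safe only because of the first rule: with insert / delete and no updates, a repeated read returns either the same data or nothing at all.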

Exploiting file system
Avoid network file systems
• The file system concept is not designed to be distributed
• A good network file system cannot exist
• Use simple remote file access protocols
  - SCP (unencrypted data transfer options added by CERN guys)
  - HTTP (if you really do not want SCP)

A cheap SAN can be built from open source

Algorithmic optimization
Parallel computing
• An N-times speed-up will increase your OPEX and CAPEX costs by a factor of N*lg(N)
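
To get a feel for the N*lg(N) figure, here is a tiny calculation. It assumes lg means the base-2 logarithm, which the slide does not state. `ParallelCost` is a hypothetical name.

```java
/** Hypothetical sketch: cost multiplier N * lg(N), reading lg as log base 2 (an assumption). */
class ParallelCost {
    static double costFactor(int speedUp) {
        return speedUp * (Math.log(speedUp) / Math.log(2));
    }
}
```

Under that reading, an 8-times speed-up costs roughly 8 * 3 = 24 times more, and a 64-times speed-up roughly 64 * 6 = 384 times more.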

Algorithmic optimization
• Up-front costs only
• Orders-of-magnitude optimization opportunities
• Exciting coding
• Ecological way of computing

Streaming algorithms
Finding N most frequent elements
• Count-Min sketch

Estimating number of unique values
• HyperLogLog

Distribution histograms
https://github.com/addthis/stream-lib

https://github.com/rwl/ParallelColt
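
To illustrate the streaming idea, below is a minimal, self-contained Count-Min sketch for frequency estimation in fixed memory. It is a sketch under assumptions, not the API of the libraries linked above: `SimpleCountMin` and its hash mixing are hypothetical, and a production version would size the table from target error bounds.

```java
/** Hypothetical minimal Count-Min sketch: fixed memory, estimates never undercount. */
class SimpleCountMin {
    private final int depth;        // number of hash rows
    private final int width;        // counters per row
    private final long[][] counts;

    SimpleCountMin(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.counts = new long[depth][width];
    }

    /** Derives a per-row bucket from the item's hashCode (simple mixing, for illustration). */
    private int bucket(Object item, int row) {
        int h = item.hashCode() ^ (row * 0x9E3779B9);
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        return Math.floorMod(h, width);
    }

    /** Records one occurrence of the item in every row. */
    void add(Object item) {
        for (int row = 0; row < depth; row++) {
            counts[row][bucket(item, row)]++;
        }
    }

    /** Returns the minimum counter across rows: at least the true count, possibly more on collisions. */
    long estimate(Object item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, counts[row][bucket(item, row)]);
        }
        return min;
    }
}
```

Tracking the N most frequent elements would additionally need a small candidate set (for example, a bounded heap) updated from these estimates; that part is left out here.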

NanoCloud – drastically simplified coding for computing clusters

As easy as …

@Test
public void hello_remote_world() {
    // Create a cloud whose nodes are reached over SSH
    Cloud cloud = CloudFactory.createSimpleSshCloud();
    // Ship the callable to the remote host and execute it there
    cloud.node("myserver.acme.com").exec(new Callable<Void>() {
        @Override
        public Void call() throws Exception {
            String localhost = InetAddress.getLocalHost().toString();
            System.out.println("Hi! I'm running on " + localhost);
            return null;
        }
    });
}

All you need is …
NanoCloud requirements
  - SSHd
  - Java (1.6 and above) present
  - Works through NAT and firewalls
  - Works on Amazon EC2
  - Works everywhere SSH works

Master – slave communications

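[Diagram: the master process connects over SSH (single TCP connection) to an agent on the slave host; labels on the link: diag, multiplexed slave streams (std out, std err, std in); on the slave host, slave controllers sit in front of the slaves; a separate RMI (TCP) link connects to the slaves]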

Links
NanoCloud
• https://code.google.com/p/gridkit/wiki/NanoCloudTutorial
• Maven Central: org.gridkit.lab:telecontrol-ssh:0.7.23
• http://blog.ragozin.info/2013/01/remote-code-execution-in-java-made.html

ANT task
• https://github.com/gridkit/gridant

Thank you
http://blog.ragozin.info
- my articles
http://code.google.com/p/gridkit
http://github.com/gridkit
- my open source code
http://aragozin.timepad.ru
- community events in Moscow

Alexey Ragozin
alexey.ragozin@gmail.com
