MapReduce In The Cloud Infinispan Distributed Task Execution Framework

MapReduce In The Cloud
Infinispan Distributed Task Execution Framework

Manik Surtani
May 3rd 2011, JUDCon - Boston

Background

• Emergence of data beyond human scale
• Outgrows current platforms in scale, structure, processing time
• Abundance of unstructured, machine generated data
• Does not fit into current software paradigms
• Not confined to Twitter, Facebook and Google only
• Need new platforms built for Big Data from ground up

Big Data

• How did Big Data happen?
• Exponential changes in storage, bandwidth and data creation
• Data exhaust
• Drowning in data and not knowing what to do with it
• Leverage, not discard, Big Data
• We are not even aware of the data revolution around us

Hard disk cost & capacity history

[Chart: cost per GB in US$ and capacity in MB, 1980 to 2010, logarithmic scales]

Source: http://en.wikipedia.org/wiki/History_of_hard_disk_drives

“There was 5 exabytes of
information created between the
dawn of civilization through 2003....

... that much information is
now created every two days”

Eric Schmidt, August 4th, 2010

Source: http://www.readwriteweb.com/archives/google_ceo_schmidt_people_arent_ready_for_the_tech.php

Big Data challenges

• Powerful platforms available today, however...
• Need to be a genius to build systems on top of them
• Equivalent to programming client/server apps in assembly language
• Is there a need for a new language?
• Confusion is as big as Big Data
• Simpler yet equally powerful solutions needed

“Infinispan without
MapReduce is like owning a
Ferrari without a
Driving License”
- Vladimir Blagojevic

Infinispan Big Data goals

• Why waste a Ferrari?
• Humble beginnings
• Simplicity without sacrificing power
• Capitalize on existing Infinispan infrastructure
• Reuse of current programming abstractions
• Two frameworks: distributed executors and MapReduce

Infinispan Distributed Execution Framework

• Leverage familiar ExecutorService, Callable abstractions
• Expand it to distributed, parallel computing paradigm
• Looks like a regular ExecutorService
• Feels like a regular ExecutorService
• The magic that goes on within Infinispan is completely transparent to users
• Nevertheless, users can experience it :-)

So simple it fits on one slide
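public interface DistributedExecutorService extends ExecutorService {

    // Run the task on a single node, using the given input keys
    <T, K> Future<T> submit(Callable<T> task, K... input);

    // Run the task on every node in the cluster
    <T> List<Future<T>> submitEverywhere(Callable<T> task);

    // Run the task on every node, using the given input keys
    <T, K> List<Future<T>> submitEverywhere(Callable<T> task, K... input);
}

public interface DistributedCallable<K, V, T> extends Callable<T> {

    // Called on the executing node before call(): hands the task
    // the local cache and the input keys it should work on
    void setEnvironment(Cache<K, V> cache, Set<K> inputKeys);

}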
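To make the shape of this API concrete, here is a minimal single-JVM sketch of the submitEverywhere idea. It does not use Infinispan at all: LocalDistributedCallable, SumTask and SimulatedGrid are hypothetical names invented for illustration, a plain Map stands in for Cache, and each "node" is simply a Map served by a thread from a local pool. A Supplier creates one task instance per node, which is an assumption standing in for the task being shipped to each node.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

/** Hypothetical stand-in for DistributedCallable; a Map replaces Infinispan's Cache. */
interface LocalDistributedCallable<K, V, T> extends Callable<T> {
    void setEnvironment(Map<K, V> cache, Set<K> inputKeys);
}

/** Illustrative task: sums the values stored under the keys it was handed. */
final class SumTask implements LocalDistributedCallable<String, Integer, Integer> {
    private Map<String, Integer> cache;
    private Set<String> keys;

    @Override
    public void setEnvironment(Map<String, Integer> cache, Set<String> inputKeys) {
        this.cache = cache;
        this.keys = inputKeys;
    }

    @Override
    public Integer call() {
        int sum = 0;
        for (String key : keys) {
            sum += cache.get(key);
        }
        return sum;
    }
}

/** Hypothetical simulation of a grid: every "node" is a local Map. */
final class SimulatedGrid {
    private final List<Map<String, Integer>> nodes;

    SimulatedGrid(List<Map<String, Integer>> nodes) {
        this.nodes = nodes;
    }

    /** Mimics submitEverywhere: one task per node, then adds up the partial sums. */
    int sumEverywhere(Supplier<SumTask> taskFactory) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, nodes.size()));
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (Map<String, Integer> node : nodes) {
                SumTask task = taskFactory.get();
                task.setEnvironment(node, node.keySet()); // node-local data only
                futures.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> future : futures) {
                total += future.get(); // blocks until that node's task finishes
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> nodeA = new HashMap<>();
        nodeA.put("a", 1);
        nodeA.put("b", 2);
        Map<String, Integer> nodeB = new HashMap<>();
        nodeB.put("c", 3);
        List<Map<String, Integer>> nodes = new ArrayList<>();
        nodes.add(nodeA);
        nodes.add(nodeB);
        System.out.println(new SimulatedGrid(nodes).sumEverywhere(SumTask::new));
    }
}
```

The caller only sees futures and a task that implements Callable, which is the point of the slide: the distribution happens behind a familiar ExecutorService-style surface.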

Do not forget Gene Amdahl

Speedup = 1 / ((1 - p) + p/n)

However, problems that increase the percentage of
parallel time with their size are more scalable than
problems with fixed percentage of parallel time

p = parallel fraction
n = number of processors

Source: https://computing.llnl.gov/tutorials/parallel_comp/
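A small worked example of Amdahl's law, using the slide's symbols p (parallel fraction) and n (number of processors). The class name AmdahlSpeedup and the sample value p = 0.95 are assumptions chosen for illustration only.

```java
/** Illustrative helper (not part of Infinispan) for Amdahl's law. */
final class AmdahlSpeedup {

    /**
     * Speedup = 1 / ((1 - p) + p / n).
     *
     * @param p parallel fraction of the work, between 0.0 and 1.0
     * @param n number of processors, at least 1
     */
    static double speedup(double p, int n) {
        if (p < 0.0 || p > 1.0 || n < 1) {
            throw new IllegalArgumentException("need 0 <= p <= 1 and n >= 1");
        }
        return 1.0 / ((1.0 - p) + p / n);
    }

    /** Upper bound on speedup as n grows without limit: 1 / (1 - p). */
    static double limit(double p) {
        return 1.0 / (1.0 - p);
    }

    public static void main(String[] args) {
        double p = 0.95; // assumed parallel fraction, for illustration only
        for (int n : new int[] {1, 8, 64, 1024}) {
            System.out.printf("n=%d speedup=%.2f%n", n, speedup(p, n));
        }
        System.out.printf("limit=%.2f%n", limit(p));
    }
}
```

With a fixed p, adding processors gives diminishing returns because the serial part (1 - p) dominates; the speedup can never exceed 1 / (1 - p). This is why problems whose parallel fraction grows with their size scale better.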

Infinispan MapReduce

• We already have a data grid!
• Leverages Infinispan’s DIST mode
• Cache data is input for MapReduce tasks
• Task components: Mapper, Reducer, Collator
• MapReduceTask ties them together

MapReduce model

Source: http://labs.google.com/papers/mapreduce.html

Mapper, Reducer, Collator
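public interface Mapper<KIn, VIn, KOut, VOut> extends Serializable {

    // Called once per input cache entry; emits intermediate pairs via the collector
    void map(KIn key, VIn value, Collector<KOut, VOut> collector);

}

public interface Reducer<KOut, VOut> extends Serializable {

    // Reduces all intermediate values for one key to a single value
    VOut reduce(KOut reducedKey, Iterator<VOut> iter);

}

public interface Collator<KOut, VOut, R> {

    // Combines all reduced results into the final result returned to the caller
    R collate(Map<KOut, VOut> reducedResults);

}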
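How the three interfaces fit together can be sketched with a word count that runs entirely in one JVM. This is not Infinispan's MapReduceTask: the interfaces are redeclared locally so the example compiles on its own, WordCountSketch and its run method are hypothetical names, and the Collector interface with a single emit method is an assumption, since the slides do not show its definition.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local copies of the slide's interfaces; Collector's shape is an assumption.
interface Collector<K, V> {
    void emit(K key, V value);
}

interface Mapper<KIn, VIn, KOut, VOut> extends Serializable {
    void map(KIn key, VIn value, Collector<KOut, VOut> collector);
}

interface Reducer<KOut, VOut> extends Serializable {
    VOut reduce(KOut reducedKey, Iterator<VOut> iter);
}

interface Collator<KOut, VOut, R> {
    R collate(Map<KOut, VOut> reducedResults);
}

/** Hypothetical single-JVM driver standing in for a distributed MapReduce task. */
final class WordCountSketch {

    /** Runs the map, group-by-key, reduce and collate phases in sequence. */
    static <KIn, VIn, KOut, VOut, R> R run(Map<KIn, VIn> input,
                                           Mapper<KIn, VIn, KOut, VOut> mapper,
                                           Reducer<KOut, VOut> reducer,
                                           Collator<KOut, VOut, R> collator) {
        // Map phase: collect every emitted value under its intermediate key.
        Map<KOut, List<VOut>> grouped = new HashMap<>();
        Collector<KOut, VOut> collector =
                (k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v);
        input.forEach((k, v) -> mapper.map(k, v, collector));

        // Reduce phase: one reduce call per intermediate key.
        Map<KOut, VOut> reduced = new HashMap<>();
        grouped.forEach((k, values) -> reduced.put(k, reducer.reduce(k, values.iterator())));

        // Collate phase: turn the reduced map into the final result.
        return collator.collate(reduced);
    }

    /** Counts words across all documents; the result is sorted by word. */
    static Map<String, Integer> wordCounts(Map<String, String> docs) {
        Mapper<String, String, String, Integer> mapper = (docId, text, c) -> {
            for (String word : text.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    c.emit(word, 1);
                }
            }
        };
        Reducer<String, Integer> reducer = (word, counts) -> {
            int sum = 0;
            while (counts.hasNext()) {
                sum += counts.next();
            }
            return sum;
        };
        Collator<String, Integer, Map<String, Integer>> collator = TreeMap::new;
        return run(docs, mapper, reducer, collator);
    }

    public static void main(String[] args) {
        Map<String, String> docs = new HashMap<>();
        docs.put("d1", "a b a");
        docs.put("d2", "B a c");
        System.out.println(wordCounts(docs));
    }
}
```

In a real grid the map and reduce phases would run on the nodes that own the data, and only the collate step would run where the task was invoked; the sketch keeps everything local to show the data flow.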

Roadmap

• Improve task execution container
• Failover, execution policies
• Make sure it scales to terabytes and petabytes
• Integration with Hibernate OGM
• Analytics and BI tools
• Do we need a data analysis language?

Parting thoughts

• Data revolution is here, today!
• Profound socio-economic impact
• Do not sleep through it
• Infinispan as a platform for Big Data
• Join us in these exciting endeavours

Questions?
http://www.infinispan.org
http://blog.infinispan.org
http://www.twitter.com/infinispan
#infinispan on FreeNode

Rate this talk!
http://spkr8.com/t/7384
