Cloud computing and Hadoop introduction

BioCloud

Random large-scale tools that you
can use

Disclaimer

I work in computer security research... there is no biology
background anywhere in my field, not even on computer viruses ;)

While working, I stumbled across Hadoop for scalable web
spidering purposes.

I'm not a bioinformatician (yet)... but I saw a powerful tool that
could be useful in your research field(s):

"biodatacrunching" ?

Glossary

• Cluster (Beowulf)
• Grid
• Cloud

Biology and computer science

• Increasingly resource-hungry applications
o Nowadays, they can be approached by "brute force"
o More data means more "iron" to crunch it
• Neither the local IT team nor the budget can keep up with this pace
o €€€ spent on new hardware
o €€€ spent on IT personnel
o Isn't it wiser to scale one machine at a time ?
• Developers get angry or frustrated with
o Delays in software installation and configuration
o Unscheduled downtime
o Delays caused by insufficient computing power

What is cloud computing ?

In plain English:
http://www.youtube.com/watch?v=XdBd14rjcs0

Infrastructure

• Amazon
o EC2
o S3
o AMI
 Recently added bioinformatics appliances
 Public data sets
• Eucalyptus
o EC2 + AMI server-side open source implementation
o We run it for our internal projects
• Enomalism
• RightScale & Service Cloud
o Tools/Consultants for the upcoming cloud issues

Application layer
• Technologies for parallelizing
applications

Application layer

• Hadoop
o Open source MapReduce implementation
o Java based, but any language can be used
• Cloudburst-bio
o Fine-tuned MapReduce implementation for bio (XXX)
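
"Any language can be used" usually means Hadoop Streaming: the mapper and the reducer are ordinary programs that read lines on stdin and write tab-separated key/value lines on stdout. Below is a minimal sketch in Python, assuming one DNA read per input line; the function names and the "map"/"reduce" command-line argument are hypothetical conventions, not part of Hadoop.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'base<TAB>1' record per nucleotide found in each input line."""
    for line in lines:
        for base in line.strip().upper():
            if base in "ACGT":
                yield f"{base}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive records sharing a key (input must be sorted)."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for base, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{base}\t{sum(int(count) for _, count in group)}"


# Hypothetical convention: the script is called with "map" or "reduce".
if __name__ == "__main__" and len(sys.argv) == 2:
    step = {"map": mapper, "reduce": reducer}[sys.argv[1]]
    for record in step(sys.stdin):
        print(record)
```

The same pipeline can be imitated locally, with hypothetical file names, as: cat reads.txt | python count_bases.py map | sort | python count_bases.py reduce. On a cluster, the framework does the sorting between the two steps.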

What is Hadoop ?

Quotation from official web page:

"Hadoop is a software platform that lets one easily write and
run applications that process vast amounts of data."

"vast amounts of data (ATGTTAG...)" + "easily" = sounds good

isn't it ? or is it vaporware ?

What is it used for ?

• Attack problems that involve several GB, TB or even PB of data
• The programmer does not have to care about job management
o The focus is on data transformation, piping (useful work)

• Not intended for realtime processing
• Suitable for offloading long batch jobs from databases

What is MapReduce

Joel on Software explanation
Useful to crunch *tons* of data, parallelized by design
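
To make the idea concrete, here is a toy, single-process sketch of the three phases (map, shuffle, reduce) counting 3-mers in a couple of short reads. It only illustrates the programming model; it is not how Hadoop is implemented, and all function names are made up for this example. In a real cluster, each map call and each reduce call can run on a different machine.

```python
from collections import defaultdict


def map_phase(read, k=3):
    """Map: emit a (k-mer, 1) pair for every window of length k in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all the values of one key into a single result."""
    return key, sum(values)


def map_reduce(reads, k=3):
    """Run the three phases in sequence over an iterable of reads."""
    pairs = (pair for read in reads for pair in map_phase(read, k))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = map_reduce(["ATGTTAG", "TTAGATG"])
print(counts["ATG"], counts["TGT"])  # prints "2 1"
```

Because map_phase looks at one read at a time and reduce_phase at one key at a time, neither needs to know how many machines are involved; that is what "parallelized by design" buys you.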

HDFS: Hadoop Distributed FileSystem

Who is using it ?

• Google
o Lots of internal projects (proprietary MapReduce)
 GMail spam machine learning
 Google maps
 ...

• Yahoo
o Internal web graph (powers search engine)
o Pig (SQL-like abstraction)
o Sort 1 terabyte of data in 209 seconds

• Facebook
o Big user graph, used for data mining (Hive)
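
To put the Yahoo sort figure in perspective, a quick back-of-the-envelope calculation. It assumes a decimal terabyte (10^12 bytes) and, purely for comparison, a hypothetical single disk streaming at 100 MB/s; neither number comes from the benchmark itself.

```python
TERABYTE = 10**12          # assumption: decimal terabyte, in bytes
SORT_SECONDS = 209         # figure quoted on the slide

# Aggregate rate at which the cluster moved data through the sort.
throughput_gb_s = TERABYTE / SORT_SECONDS / 10**9

# Hypothetical single disk, only to give a sense of scale.
SINGLE_DISK_BYTES_S = 100 * 10**6
single_disk_seconds = TERABYTE / SINGLE_DISK_BYTES_S

print(f"cluster throughput: {throughput_gb_s:.2f} GB/s")
print(f"one disk just reading 1 TB: {single_disk_seconds / 3600:.1f} hours")
```

Under these assumptions, a single disk would need almost three hours just to read the data once, which is why spreading the work over many machines matters.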

Hadoop has (lots of) new friends

• Nutch
• Mahout
• HBase
• Hama
• Pig
• ZooKeeper
• SmartFrog
• ...

Next steps ?

Identify resource-hungry applications (batch vs interactive)
Migrate apps to cloud
1) Allocate a certain fixed amount of money
2) Give Amazon EC2 a try
3) Optional: Build a (local) Rocks cluster with a Eucalyptus cloud

Test, deploy, automate, automate and automate ... Puppet ?

(a few) References

http://www.cloudera.com/hadoop-training-thinking-at-scale
http://www.slideshare.net/tag/hadoop
http://sourceforge.net/projects/cloudburst-bio/
http://hadoop.apache.org/core/
http://people.apache.org/~rdonkin/hadoop-talk/hadoop.html
