An Introduction to
Data Intensive Computing

Chapter 2: Data Management

Robert Grossman
University of Chicago
Open Data Group

Collin Bennett
Open Data Group

November 14, 2011

1. Introduction (0830-0900)
   a. Data clouds (e.g. Hadoop)
   b. Utility clouds (e.g. Amazon)
2. Managing Big Data (0900-0945)
   a. Databases
   b. Distributed File Systems (e.g. Hadoop)
   c. NoSQL databases (e.g. HBase)
3. Processing Big Data (0945-1000 and 1030-1100)
   a. Multiple Virtual Machines & Message Queues
   b. MapReduce
   c. Streams over distributed file systems
4. Lab using Amazon's Elastic MapReduce (1100-1200)

What Are the Choices?

• Databases (SQL Server, Oracle, DB2)
• File Systems
• Distributed File Systems (Hadoop, Sector)
• Clustered File Systems (GlusterFS, …)
• NoSQL Databases (HBase, Accumulo, Cassandra, SimpleDB, …)
• Applications (R, SAS, Excel, etc.)

What Is the Fundamental Trade-off?

Scale up vs. scale out …

Section 2.1
Databases

Advice From Jim Gray

1. Analyzing big data requires scale-out solutions, not scale-up solutions (GrayWulf).
2. Move the analysis to the data.
3. Work with scientists to find the most common "20 queries" and make them fast.
4. Go from "working to working."

Pattern 1: Put the metadata in a database and point to files in a file system.
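
A minimal sketch of this pattern, using SQLite and a hypothetical image catalog; the table layout, column names and file paths are illustrative and not the schema of any of the systems described below.

import sqlite3

conn = sqlite3.connect("catalog.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS images (
        image_id   INTEGER PRIMARY KEY,
        ra         REAL,           -- right ascension of the field
        dec        REAL,           -- declination of the field
        band       TEXT,           -- photometric band
        file_path  TEXT NOT NULL   -- pointer to the file in the file system
    )
""")
conn.execute(
    "INSERT INTO images (ra, dec, band, file_path) VALUES (?, ?, ?, ?)",
    (185.0, 15.8, "r", "/archive/run42/frame-r-000042.fits"),
)
conn.commit()

# Queries run against the small metadata table; only the matching files are
# ever opened from the (possibly much larger) file system.
for ra, dec, path in conn.execute(
    "SELECT ra, dec, file_path FROM images WHERE band = 'r'"
):
    print(ra, dec, path)

The metadata stays small enough for a relational database to index and query quickly, while the bulk data lives in whatever file system scales best.
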
  	
  
Example: Sloan Digital Sky Survey

• Two surveys in one
  – Photometric survey in 5 bands
  – Spectroscopic redshift survey
• Data is public
  – 40 TB of raw data
  – 5 TB of processed catalogs
  – 2.5 terapixels of images
• The catalog uses Microsoft SQL Server.
• Started in 1992, finished in 2008.
• The JHU SkyServer serves millions of queries.

Example: Bionimbus Genomics Cloud
www.bionimbus.org

[Diagram: Bionimbus architecture: a GWT-based front end over database services, analysis pipeline and re-analysis services, data cloud services, data ingestion services, utility cloud services, and intercloud services.]

[Diagram: the same architecture annotated with technologies: database services (PostgreSQL), analysis pipeline and re-analysis services, a GWT-based front end, large data cloud services (Hadoop, Sector/Sphere), data ingestion services, elastic cloud services (Eucalyptus, OpenStack), an ID service, and intercloud services (UDT, replication).]

Section 2.2
Distributed File Systems
Sector/Sphere

Hadoop's Large Data Cloud

[Diagram: Hadoop's stack, bottom to top: the Hadoop Distributed File System (HDFS) provides storage services, Hadoop's MapReduce provides compute services, NoSQL databases provide data services, and applications sit on top.]

Pattern 2: Put the data into a distributed file system.

Hadoop Design

• Designed to run over commodity components that fail.
• Data is replicated, typically three times.
• Block-based storage.
• Optimized for efficient scans with high throughput, not low-latency access.
• Designed for write once, read many.
• An append operation is planned for the future.

Hadoop Distributed File System (HDFS) Architecture

[Diagram: a Name Node exchanges control messages with clients and with Data Nodes spread across racks; clients read and write data directly from and to the Data Nodes.]

• HDFS is block-based.
• Written in Java.
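
A minimal sketch of putting a file into HDFS and listing it, assuming a working Hadoop client configuration on the machine; it simply shells out to the standard "hadoop fs" commands rather than using a native client library, and the paths are examples.

import subprocess

def hdfs_put(local_path: str, hdfs_path: str) -> None:
    """Copy a local file into HDFS, where it is stored as replicated blocks."""
    subprocess.run(["hadoop", "fs", "-put", local_path, hdfs_path], check=True)

def hdfs_ls(hdfs_path: str) -> str:
    """List an HDFS directory and return the raw listing."""
    result = subprocess.run(
        ["hadoop", "fs", "-ls", hdfs_path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout

if __name__ == "__main__":
    hdfs_put("dataset1.txt", "/user/tutorial/dataset1.txt")
    print(hdfs_ls("/user/tutorial"))
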
  
Sector Distributed File System (SDFS) Architecture

• Broadly similar to the Google File System and the Hadoop Distributed File System.
• Uses the native file system; it is not block-based.
• Has a security server that provides authorizations.
• Has multiple master name servers, so there is no single point of failure.
• Uses UDT to support wide area operations.

Sector Distributed File System (SDFS) Architecture

[Diagram: a Security Server and Master Nodes exchange control messages with clients; Slave Nodes spread across racks hold the data, which clients read and write directly.]

• SDFS is file-based.
• Written in C++.
• Security server.
• Multiple masters.

GlusterFS Architecture

• No metadata server.
• No single point of failure.
• Uses algorithms to determine the location of data (see the sketch after this list).
• Can scale out by adding more bricks.
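
A minimal illustration of algorithmic placement with no metadata server: every client hashes the file name and derives, on its own, which brick holds the file. This is only a sketch of the idea; it is not GlusterFS's actual elastic hashing scheme, and the brick names are made up.

import hashlib

BRICKS = ["brick-0:/data", "brick-1:/data", "brick-2:/data"]

def brick_for(filename: str) -> str:
    # Every client computes the same digest, so they all agree on the location
    # without asking a central metadata server.
    digest = hashlib.md5(filename.encode()).hexdigest()
    return BRICKS[int(digest, 16) % len(BRICKS)]

print(brick_for("genome-001.fastq"))
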
  
GlusterFS Architecture

[Diagram: a client talks directly to GlusterFS server bricks spread across racks; data flows between the client and the bricks with no metadata server in the path. File-based.]

Section 2.3
NoSQL Databases

Evolution

• Standard architecture for simple web applications:
  – Presentation: front-end, load-balanced web servers
  – Business logic layer
  – Backend database
• The database layer does not scale with large numbers of users or large amounts of data.
• Alternatives arose:
  – Sharded (partitioned) databases or master-slave databases
  – memcache

Scaling an RDBMS

• Master-slave database systems
  – Writes go to the master.
  – Reads go to the slaves.
  – Writing to the slaves can be a bottleneck, and the slaves can be inconsistent.
• Sharded databases
  – Applications and queries must understand the sharding schema (see the sketch after this list).
  – Both reads and writes scale.
  – No native, direct support for joins across shards.
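
A minimal sketch of application-level sharding: the application hashes a user id over a fixed set of shards and routes each query itself. The host names are examples, and a real deployment also needs a plan for resharding.

import hashlib

SHARDS = ["db-shard-0.example.com", "db-shard-1.example.com",
          "db-shard-2.example.com", "db-shard-3.example.com"]

def shard_for(user_id: str) -> str:
    # Reads and writes for the same user always land on the same shard.
    h = int(hashlib.sha1(user_id.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

print(shard_for("alice"), shard_for("bob"))

A join across users on different shards has to be assembled by the application, which is why cross-shard joins are not supported natively.
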
  
NoSQL Systems

• The name suggests no SQL support; it is also read as "Not Only SQL."
• One or more of the ACID properties is not supported.
• Joins are generally not supported.
• Usually flexible schemas.
• Some well-known examples: Google's BigTable, Amazon's Dynamo and Facebook's Cassandra.
• Quite a few recent open source systems.

Pattern 3: Put the data into a NoSQL application.

  
CAP – Choose Two Per Operation

[Diagram: a triangle with Consistency (C), Availability (A) and Partition-resiliency (P) at its corners.]

• CA: available and consistent, unless there is a partition.
• AP: a reachable replica provides service even in a partition, but may be inconsistent (Dynamo, Cassandra).
• CP: always consistent, even in a partition, but a reachable replica may deny service without quorum (BigTable, HBase).

CAP Theorem

• Proposed by Eric Brewer, 2000.
• Three properties of a system: consistency, availability and partitions.
• You can have at most two of these three properties for any shared-data system.
• Scale-out requires partitions.
• Most large web-based systems choose availability over consistency.

Reference: Brewer, PODC 2000; Gilbert/Lynch, SIGACT News 2002.

Eventual Consistency

• If no updates occur for a while, all updates eventually propagate through the system and all the nodes will be consistent.
• Eventually, a node is either updated or removed from service.
• Can be implemented with a Gossip protocol.
• Amazon's Dynamo popularized this approach.
• Sometimes this is called BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID.

Different Types of NoSQL Systems

• Distributed key-value systems
  – Amazon's S3 key-value store (Dynamo)
  – Voldemort
  – Cassandra
• Column-based systems
  – BigTable
  – HBase
  – Cassandra
• Document-based systems
  – CouchDB

HBase Architecture

[Diagram: many clients (including a Java client and a REST API) talk to an HBaseMaster and to multiple HRegionServers, each backed by its own disk.]

Source: Raghu Ramakrishnan

HRegion Server

• Records are partitioned by column family into HStores.
  – Each HStore contains many MapFiles.
• All writes to an HStore are applied to a single memcache.
• Reads consult the MapFiles and the memcache.
• Memcaches are flushed to disk as MapFiles (HDFS files) when full.
• Compactions limit the number of MapFiles.

[Diagram: inside an HRegionServer, writes go to the HStore's memcache, which is flushed to disk as MapFiles; reads consult both the MapFiles and the memcache.]

Source: Raghu Ramakrishnan

Facebook's Cassandra

• Modeled after BigTable's data model.
• Modeled after Dynamo's eventual consistency.
• Peer-to-peer storage architecture using consistent hashing (Chord hashing); a sketch of the idea follows.
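
A minimal consistent-hashing sketch in the spirit of Chord and Dynamo: nodes and keys are hashed onto the same ring, and a key is stored on the first node at or after its position. Virtual nodes, replication and failure handling are deliberately left out, and the node names are made up.

import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node owns the arc of the ring that ends at its hash value.
        self.points = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        h = ring_hash(key)
        idx = bisect.bisect_left(self.points, (h, "")) % len(self.points)
        return self.points[idx][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("row-1234"))

Adding or removing a node only remaps the keys on the neighboring arc, which is what lets a peer-to-peer store scale out without a central directory.
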
  
Databases vs. NoSQL Systems

• Scalability: hundreds of TB (databases) vs. hundreds of PB (NoSQL systems).
• Functionality: full SQL-based queries, including joins, vs. optimized access to sorted tables (tables with single keys).
• Optimization: databases are optimized for safe writes; NoSQL clouds are optimized for efficient reads.
• Consistency model: ACID (Atomicity, Consistency, Isolation & Durability), so the database is always consistent, vs. eventual consistency, where updates eventually propagate through the system.
• Parallelism: difficult for databases because of the ACID model, although shared-nothing designs are possible, vs. a basic design that incorporates parallelism over commodity components.
• Scale: racks vs. a data center.

Section 2.3
Case Study: Project Matsu

Zoom Levels / Bounds

• Zoom level 1: 4 images
• Zoom level 2: 16 images
• Zoom level 3: 64 images
• Zoom level 4: 256 images

Source: Andrew Levine

Build Tile Cache in the Cloud: Mapper

• Step 1, input to the mapper: key = a bounding box (e.g. minx = -135.0, miny = 45.0, maxx = -112.5, maxy = 67.5), value = the image.
• Step 2, processing in the mapper: the mapper resizes and/or cuts up the original image into pieces, one per output bounding box.
• Step 3, mapper output: a set of records whose key is a bounding box and whose value is the corresponding image tile.

Source: Andrew Levine

Build Tile Cache in the Cloud: Reducer

• Step 1, input to the reducer: key = a bounding box (e.g. minx = -45.0, miny = -2.8125, maxx = -43.59375, maxy = -2.109375), values = the image tiles for that bounding box.
• Step 2, reducer output: the tiles are assembled into images based on the bounding box.
• Output goes to HBase.
• Builds up layers for WMS for various datasets.

Source: Andrew Levine
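
A minimal Hadoop Streaming-style sketch of the key/value flow described on the two slides above. The real Project Matsu mapper cuts and resizes the image bytes themselves; here the image is just an opaque reference, and a bounding box "minx,miny,maxx,maxy" is split into its four child tiles at the next zoom level.

#!/usr/bin/env python3
import sys

def children(bbox):
    # Split a bounding box into its four quadrants (the next zoom level).
    minx, miny, maxx, maxy = map(float, bbox.split(","))
    midx, midy = (minx + maxx) / 2, (miny + maxy) / 2
    return [(minx, miny, midx, midy), (midx, miny, maxx, midy),
            (minx, midy, midx, maxy), (midx, midy, maxx, maxy)]

def mapper(stdin=sys.stdin, stdout=sys.stdout):
    # Input records: "bounding_box<TAB>image_reference"
    for line in stdin:
        bbox, image_ref = line.rstrip("\n").split("\t", 1)
        for child in children(bbox):
            child_key = ",".join(str(v) for v in child)
            stdout.write(f"{child_key}\t{image_ref}\n")

def reducer(stdin=sys.stdin, stdout=sys.stdout):
    # Streaming delivers mapper output sorted by key, so all tiles that share
    # a bounding box arrive together and can be assembled into one image.
    current, pieces = None, []
    for line in stdin:
        bbox, image_ref = line.rstrip("\n").split("\t", 1)
        if current is not None and bbox != current:
            stdout.write(f"{current}\t{'|'.join(pieces)}\n")
            pieces = []
        current = bbox
        pieces.append(image_ref)
    if current is not None:
        stdout.write(f"{current}\t{'|'.join(pieces)}\n")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()

Run under Hadoop Streaming, the same script can serve as the mapper (invoked with the "map" argument) and as the reducer; the assembled tiles would then be written to HBase as described on the next slide.
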
  
HBase Tables

• An Open Geospatial Consortium (OGC) Web Mapping Service (WMS) query translates to the HBase scheme:
  – Layers, styles, projection, size
• Table name: WMS layer (a sketch of the corresponding reads and writes follows this list)
  – Row ID: bounding box of the image
  – Column family: style name and projection
  – Column qualifier: width x height
  – Value: buffered image
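
A minimal sketch of writing and reading a tile with this schema through the third-party happybase client (which talks to HBase's Thrift gateway). The host name, layer name, style, projection and tile file are all illustrative, and the column family is assumed to already exist in the table.

import happybase

connection = happybase.Connection("hbase-thrift-host")   # assumed Thrift server
table = connection.table("landsat")                      # table name = WMS layer

row_key = b"-135.0,45.0,-112.5,67.5"                     # row id = bounding box
with open("tile.png", "rb") as f:
    tile_bytes = f.read()

# Column family = style and projection, qualifier = width x height,
# value = the buffered image bytes.
table.put(row_key, {b"default_EPSG4326:256x256": tile_bytes})

# A WMS request for that layer/style/bbox/size becomes a single-cell read.
row = table.row(row_key, columns=[b"default_EPSG4326:256x256"])
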
  
Section 2.4
Distributed Key-Value Stores
S3

Pattern 4: Put the data into a distributed key-value store.

S3 Buckets

• S3 bucket names must be unique across AWS.
• A good practice is to use a pattern like
    tutorial.osdc.org/dataset1.txt
  for a domain you own.
• The file is then referenced as
    tutorial.osdc.org.s3.amazonaws.com/dataset1.txt
• If you own osdc.org, you can create a DNS CNAME entry to access the file as
    tutorial.osdc.org/dataset1.txt

S3 Keys

• Keys must be unique within a bucket.
• Values can be as large as 5 TB (formerly 5 GB).
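
A minimal sketch of writing and reading an S3 value with the boto3 client; the bucket follows the domain-style naming pattern from the previous slide and is an example, not a real bucket. Credentials (the access key and secret key discussed on the next slide) are assumed to come from the environment or ~/.aws/credentials.

import boto3

s3 = boto3.client("s3")

bucket = "tutorial.osdc.org"
key = "dataset1.txt"

# upload_file switches to multipart uploads for large objects, which is how
# values up to 5 TB are handled.
s3.upload_file("dataset1.txt", bucket, key)

obj = s3.get_object(Bucket=bucket, Key=key)   # the key is unique within the bucket
print(obj["Body"].read()[:100])
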
  
S3 Security

• AWS access key (user name)
  – This functions as your S3 username. It is an alphanumeric text string that uniquely identifies users.
• AWS secret key (functions as your password)

AWS Account Information

[Screenshot: the AWS account "Access Keys" page; the access key plays the role of the user name and the secret key the role of the password.]

Other Amazon Data Services

• Amazon Simple Database Service (SDS)
• Amazon's Elastic Block Storage (EBS)

Section 2.5
Moving Large Data Sets

The Basic Problem

• TCP was never designed to move large data sets over wide area, high performance networks.
• As a general rule, reading data off disks is slower than transporting it over the network.

[Figure: TCP throughput (Mb/s) versus round trip time (ms) for packet loss rates from 0.01% to 0.5%; throughput collapses as RTT and loss grow across LAN, US, US-EU and US-Asia distances. Source: Yunhong Gu, 2007, experiments over a wide area 1G network.]
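
A back-of-the-envelope model of the effect in the figure is the well-known Mathis et al. estimate, throughput <= (MSS / RTT) x (1.22 / sqrt(loss)). The packet size below is an assumption for illustration, and real links are also capped by their capacity.

from math import sqrt

def tcp_throughput_mbps(rtt_ms: float, loss: float, mss_bytes: int = 1460) -> float:
    # Mathis model: sustained TCP throughput falls with RTT and with sqrt(loss).
    rtt_s = rtt_ms / 1000.0
    return (mss_bytes * 8 / rtt_s) * (1.22 / sqrt(loss)) / 1e6

for rtt in (1, 10, 100, 200):   # roughly LAN, regional, US-EU, US-Asia
    print(rtt, "ms RTT:", round(tcp_throughput_mbps(rtt, loss=0.0001), 1),
          "Mb/s at 0.01% loss")
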
  
The Solution

• Use parallel TCP streams
  – GridFTP
• Use specialized network protocols
  – UDT, FAST, etc.
• Use RAID to stripe data across disks to improve throughput when reading.
• These techniques are well understood in HEP and astronomy, but not yet in biology.

Case Study: Bio-mirror

"[The open source GridFTP] from the Globus project has recently been improved to offer UDP-based file transport, with long-distance speed improvements of 3x to 10x over the usual TCP-based file transport."

-- Don Gilbert, August 2010, bio-mirror.net

Moving 113 GB of Bio-mirror Data

Site     RTT (ms)   TCP (min)   UDT (min)   TCP/UDT   Km
NCSA     10         139         139         1         200
Purdue   17         125         125         1         500
ORNL     25         361         120         3         1,200
TACC     37         616         120         5.1       2,000
SDSC     65         750         475         1.6       3,300
CSTNET   274        3722        304         12        12,000

GridFTP TCP and UDT transfer times for 113 GB from gridip.bio-mirror.net/biomirror/blast/ (Indiana, USA). All TCP and UDT times are in minutes. Source: http://gridip.bio-mirror.net/biomirror/
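
A quick worked check of the table above, converting the 113 GB transfer times (in minutes) into approximate effective throughput in Mb/s; the figures are rounded and use decimal gigabytes.

def effective_mbps(gigabytes: float, minutes: float) -> float:
    # gigabytes * 8,000 megabits per GB, divided by the transfer time in seconds
    return gigabytes * 8 * 1000 / (minutes * 60)

for site, tcp_min, udt_min in [("ORNL", 361, 120), ("CSTNET", 3722, 304)]:
    print(site, round(effective_mbps(113, tcp_min)), "Mb/s over TCP vs",
          round(effective_mbps(113, udt_min)), "Mb/s over UDT")

For ORNL this works out to roughly 42 Mb/s over TCP versus about 126 Mb/s over UDT, and the gap widens as the round trip time grows.
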
  
Case Study: CGI 60 Genomes

• Trace by Complete Genomics showing the performance of moving 60 complete human genomes from Mountain View to Chicago using the open source Sector/UDT.
• Approximately 18 TB at about 0.5 Gb/s on a 1G link.

Source: Complete Genomics.

Resource Use

Protocol        CPU Usage*    Memory*
GridFTP (UDT)   1.0% - 3.0%   40 MB
GridFTP (TCP)   0.1% - 0.6%   6 MB

*CPU and memory usage collected by Don Gilbert. He reports that rsync uses more CPU than GridFTP with UDT. Source: http://gridip.bio-mirror.net/biomirror/.

Sector/Sphere

• Sector/Sphere is a platform for data intensive computing built over UDT and designed to support geographically distributed clusters.

Questions?

For the most current version of these notes, see rgrossman.com
