Big Data Loading:
Project Voldemort
Big Data Loading
●   So you've processed your data...
●   Now, how do you get it to people quickly?

●   Project Voldemort's read-only stores
    ●   Simple key-value store
    ●   Based on Amazon Dynamo
    ●   Simple Java interface and operation
    ●   Immutable, read-only stores
Read Only Stores
●   Precompute in Hadoop or elsewhere
●   Creates an indexed key-value store
    ●   One reducer (or file) per node
    ●   Replicated data for failover


●   Atomically loads into nodes
    ●   Copy from HDFS or other HTTP source
    ●   Very fast, limited only by network or storage I/O
    ●   Can be throttled so live services are not affected
●   Can also roll back to previous versions
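
The atomic load and rollback behaviour can be pictured with a small in-memory sketch. This is not Voldemort code: the class name VersionedReadOnlyStore and its methods are hypothetical, and a real store swaps on-disk files rather than Java maps. It only illustrates the idea that readers see either the old version or the new one, and that earlier versions are kept for rollback.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

/** Illustrative only: an in-memory stand-in for a read-only store with versioned swaps. */
final class VersionedReadOnlyStore {
    // The version currently served to readers; replaced in a single atomic step.
    private final AtomicReference<Map<String, String>> current;
    // Earlier versions, most recent first, kept so a bad load can be undone.
    private final Deque<Map<String, String>> previous = new ArrayDeque<>();

    VersionedReadOnlyStore(Map<String, String> initial) {
        current = new AtomicReference<>(Map.copyOf(initial));
    }

    /** Reads never block: they see either the old version or the new one, never a mix. */
    String get(String key) {
        return current.get().get(key);
    }

    /** Publishes a fully built data set as the new live version. */
    synchronized void swap(Map<String, String> next) {
        previous.push(current.getAndSet(Map.copyOf(next)));
    }

    /** Restores the most recent earlier version; returns false if there is none. */
    synchronized boolean rollback() {
        if (previous.isEmpty()) {
            return false;
        }
        current.set(previous.pop());
        return true;
    }

    public static void main(String[] args) {
        VersionedReadOnlyStore store = new VersionedReadOnlyStore(Map.of("alice", "v1"));
        store.swap(Map.of("alice", "v2", "bob", "v2"));
        System.out.println(store.get("alice")); // v2
        store.rollback();
        System.out.println(store.get("alice")); // v1
        System.out.println(store.get("bob"));   // null
    }
}
```

Because each version is immutable once built, the swap is just a reference change, which is why the expensive work can happen offline in Hadoop.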
Example Hadoop Store Builder
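import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
import voldemort.store.readonly.mr.AbstractHadoopStoreBuilderMapper;

// Assumes the json-simple library; the slide does not name its JSON parser.
public class JsonStoreBuilder
   extends AbstractHadoopStoreBuilderMapper<LongWritable, Text> {

    private final JSONParser parser = new JSONParser();

    // Key: the "name" field of the JSON object on this input line.
    @Override
    public Object makeKey(LongWritable lineNo, Text line) {
        try {
            JSONObject json = (JSONObject) parser.parse(line.toString());
            return json.get("name");
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid JSON at input key " + lineNo, e);
        }
    }

    // Value: the whole JSON line, stored unchanged.
    @Override
    public Object makeValue(LongWritable lineNo, Text line) {
        return line.toString();
    }
}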
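
To see what the mapper produces without a Hadoop cluster, here is a rough standalone sketch of the same key/value split. The class JsonLineIndexer and its methods are hypothetical and not part of Voldemort or pigJsonUtils. It uses a regular expression in place of a JSON parser, so it only handles simple string fields with no escaped quotes and does not distinguish nested objects; a real builder should use a proper parser as above.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: mimics makeKey/makeValue for one line of JSON text. */
final class JsonLineIndexer {

    /**
     * Returns the string value of the named field, or null if it is absent.
     * Simplification: matches "field":"value" textually, without escape handling.
     */
    static String extractField(String jsonLine, String field) {
        Pattern p = Pattern.compile(
            "\"" + Pattern.quote(field) + "\"\\s*:\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(jsonLine);
        return m.find() ? m.group(1) : null;
    }

    /** Key is the chosen field; value is the whole line, as in the slide's builder. */
    static Map.Entry<String, String> toEntry(String jsonLine, String field) {
        String key = extractField(jsonLine, field);
        if (key == null) {
            throw new IllegalArgumentException("Missing field: " + field);
        }
        return new SimpleEntry<>(key, jsonLine);
    }

    public static void main(String[] args) {
        String line = "{\"name\":\"alice\",\"age\":30}";
        Map.Entry<String, String> entry = toEntry(line, "name");
        // Prints: alice -> {"name":"alice","age":30}
        System.out.println(entry.getKey() + " -> " + entry.getValue());
    }
}
```

Passing a different field name to toEntry indexes the same records by another field.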
Example Hadoop Job
$VOLDEMORT_HOME/bin/hadoop-build-readonly-store.sh \
  --input hdfs/JsonFile.json \
  --output hdfs/StoreOut \
  --tmpdir hdfs/temp_dir \
  --mapper uk.co.danharvey.hadoop.JsonStoreBuilder \
  --jar hadoop-core.jar \
  --cluster config/cluster.xml \
  --storename example_store \
  --storedefinitions config/store.xml \
  --chunksize 1073741824 \
  --replication 1
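
The --chunksize value above, 1073741824, is 2^30 bytes, or 1 GiB. As a rough sketch of what that implies, the snippet below estimates how many files a given amount of data would be split into. It assumes the builder writes files no larger than the chunk size; the class ChunkEstimate and the 5 GB figure are made up for illustration, and the exact splitting rule may differ between Voldemort versions.

```java
/** Illustrative only: back-of-the-envelope chunk count for a given chunk size. */
final class ChunkEstimate {

    /** Ceiling division, with at least one chunk even for very small inputs. */
    static long chunksFor(long dataBytes, long chunkSizeBytes) {
        if (dataBytes < 0 || chunkSizeBytes <= 0) {
            throw new IllegalArgumentException("Sizes must be positive");
        }
        return Math.max(1, (dataBytes + chunkSizeBytes - 1) / chunkSizeBytes);
    }

    public static void main(String[] args) {
        long chunkSize = 1_073_741_824L;                 // value passed to --chunksize
        System.out.println(chunkSize == (1L << 30));     // true: exactly 1 GiB
        // Hypothetical 5 GB of data destined for one node.
        System.out.println(chunksFor(5_000_000_000L, chunkSize)); // 5
    }
}
```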
Pig to JSON Index
●   Output JSON from Pig
        STORE bag INTO 'data.json' USING JsonStorage();


●   JsonStoreBuilder
    ●   Extends Voldemort StoreBuilder
    ●   Easily index any field


●   Code is up here:
    http://github.com/danharvey/pigJsonUtils
