PNUTS: Yahoo!’s Hosted Data Serving Platform. B.F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H. Jacobsen, N. Puz, D. Weaver and R. Yerneni, Yahoo! Research. Seminar presentation for CSE 708 by Ruchika Mehresh, Department of Computer Science and Engineering, 22nd February, 2011
Motivation To design a distributed database for Yahoo!’s web applications that is scalable, flexible, and available. (Colored animation slides borrowed from an openly available Yahoo! presentation on PNUTS.)
What does Yahoo! need? Web applications demand: Scalability; Response time and geographic scope; High availability and fault tolerance. Characteristics of Web traffic: Simple query needs, manipulating one record at a time. Relaxed consistency guarantees: serializable transactions vs. eventual consistency. Serializability
PNUTS Data Storage and Retrieval Features Data and Query Model System Architecture Consistency (Yahoo! Message Broker) Query Processing Experiments Recovery Structure Future Work
Features Record-level operations Asynchronous operations Novel Consistency model Massively Parallel Geographically distributed Flexible access: Hashed or ordered, indexes, views; flexible schemas. Centrally managed Delivery of data management as hosted service.
PNUTS Data Storage and Retrieval Features Data and Query Model System Architecture Consistency (Yahoo! Message Broker) Query Processing Experiments Recovery Structure Future Work
Data and Query Model Data representation Table of records with attributes Additional data types: Blob Flexible Schemas Point Access Vs Range Access Hash tables Vs Ordered tables
PNUTS Data Storage and Retrieval Features Data and Query Model System Architecture Consistency (Yahoo! Message Broker) Query Processing Experiments Recovery Structure Future Work
System Architecture Animation
PNUTS Data Storage and Retrieval Features Data and Query Model System Architecture Consistency (Yahoo! Message Broker) Query Processing Experiments Recovery Structure Future Work
Data Storage and Retrieval Storage unit: Stores tablets; responds to get(), scan() and set() requests. Tablet controller: Owns the tablet-to-storage-unit mapping (routers poll it periodically for mapping updates); performs load balancing and recovery.
Data Storage and Retrieval Router: Determines which tablet contains the record, and which storage unit has that tablet. Interval mapping: binary search of a B+ tree of tablet boundaries (a lookup sketch follows below). The mapping is soft state, so the tablet controller does not become a bottleneck.
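A minimal sketch of the router's interval-mapping lookup, assuming an in-memory sorted list of tablet boundaries; the boundary values and storage-unit names are illustrative (borrowed from the range-query slide later in this deck), not PNUTS code:

import bisect

# Sorted upper boundaries of the tablet intervals, and the storage unit
# owning each interval (one more unit than boundaries: MIN..b1, b1..b2, ...).
boundaries = ["Canteloupe", "Lime", "Strawberry"]
storage_units = ["SU1", "SU3", "SU2", "SU1"]

def lookup(key):
    """Binary-search the interval mapping for the storage unit holding `key`."""
    return storage_units[bisect.bisect_right(boundaries, key)]

assert lookup("Grape") == "SU3"    # falls in [Canteloupe, Lime)
assert lookup("Tomato") == "SU1"   # falls in [Strawberry, MAX)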
PNUTS Data Storage and Retrieval Features Data and Query Model System Architecture Consistency (Yahoo! Message Broker) Query Processing Experiments Recovery Structure Future Work
Consistency model Per-record timeline consistency: all replicas apply updates to a record in the same order. Events (insert/update/delete) are for a particular primary key. Reads return a consistent version from this timeline. One of the replicas is the master; all updates are forwarded to the master. The record master is chosen based on where the majority of write requests originate. Every record has a sequence number: a generation (new insert) and a version (update). Related Question. Timeline example (from the animation): inserts, updates and a delete move a record through versions v.1.0-v.1.3 in generation 1; a re-insert starts generation 2 (v.2.0-v.2.2).
Consistency model API Calls Read-any Read-critical (required version) Read-latest Write Test-and-set-write (required version) Animation
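A minimal sketch of what these call types mean against a single record's version timeline; the class and the error handling are illustrative assumptions, not the PNUTS API:

class Record:
    def __init__(self):
        self.versions = []                 # timeline; version i is versions[i-1]

    def write(self, value):
        """Append a new version to the timeline; returns its version number."""
        self.versions.append(value)
        return len(self.versions)

    def read_any(self):
        """May return any version, possibly stale (fast, no master contact)."""
        return self.versions[0]

    def read_latest(self):
        """Always returns the current (most recent) version."""
        return self.versions[-1]

    def read_critical(self, required):
        """Return a version no older than `required`."""
        if len(self.versions) < required:
            raise RuntimeError("replica has not yet seen version %d" % required)
        return self.versions[-1]

    def test_and_set_write(self, required, value):
        """Write only if the current version equals `required`, else error."""
        if len(self.versions) != required:
            raise RuntimeError("version mismatch: record is at v.%d, not v.%d"
                               % (len(self.versions), required))
        return self.write(value)

r = Record()
r.write("v1"); r.write("v2")
assert r.test_and_set_write(2, "v3") == 3   # succeeds: record was at v.2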
Yahoo! Message broker Topic-based publish/subscribe system, used for logging and replication. PNUTS + YMB = the Sherpa data services platform. Data updates are considered committed when published to YMB. Updates are asynchronously propagated to other regions after publishing. A message is purged once it has been applied to all replicas. Per-record mastership mechanism.
Yahoo! Message broker Mastership is assigned on a record-by-record basis, and all write requests are directed to the master (a forwarding sketch follows below). Different records in the same table can be mastered in different clusters; the basis is write-request locality. Each record stores its master as metadata. A tablet master enforces primary-key constraints, preventing concurrent inserts from creating multiple values for the same primary key. Related Question (Write locality). Related Question (Tablet Master)
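A minimal sketch, assuming an in-memory record, of how per-record mastership might direct writes; the region names, class, and migration rule are illustrative assumptions, not PNUTS code:

from collections import Counter

class RecordMaster:
    def __init__(self, master="west1"):
        self.master = master            # master region, stored with the record
        self.origins = Counter()        # where recent writes originated

    def route_write(self, origin_region, update):
        """Commit locally at the master; otherwise pay a wide-area forward."""
        self.origins[origin_region] += 1
        if origin_region == self.master:
            return "commit via local-region YMB: %r" % (update,)
        return "forward to master region %s: %r" % (self.master, update)

    def maybe_migrate(self):
        """Move mastership to wherever most writes now originate."""
        busiest, _ = self.origins.most_common(1)[0]
        if busiest != self.master:
            self.master = busiest       # the handoff is itself a mastered write

r = RecordMaster("west1")
r.route_write("east", {"status": "busy"})   # wide-area hop to west1
r.route_write("east", {"status": "free"})
r.maybe_migrate()
assert r.master == "east"                   # mastership migrated to "east"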
PNUTS Data Storage and Retrieval Features Data and Query Model System Architecture Consistency (Yahoo! Message Broker) Query Processing Experiments Recovery Structure Future Work
Recovery Any committed update is recoverable from a remote replica. Three-step recovery (a pseudocode sketch follows below): 1. The tablet controller requests a copy from a remote (source) replica. 2. A “checkpoint message” is published to YMB, to account for in-flight updates. 3. The source tablet is copied to the destination region. Support for recovery: synchronized tablet boundaries; tablets split at the same time (two-phase commit); backup regions within the same region. Related Question
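A minimal pseudocode sketch of this three-step protocol; all function names and the tablet/region identifiers are illustrative assumptions, not PNUTS APIs:

def request_copy(region, tablet):
    print("tablet controller -> %s: send a copy of %s" % (region, tablet))

def publish_checkpoint(topic):
    # Ensures in-flight updates queued in YMB are applied before cut-over.
    print("publish checkpoint message on YMB topic %s" % topic)

def copy_tablet(tablet, src, dst):
    print("copy tablet %s from %s to %s" % (tablet, src, dst))

def recover_tablet(tablet, source_region, dest_region):
    """Three-step tablet recovery, following the slide above (sketch only)."""
    request_copy(source_region, tablet)              # step 1
    publish_checkpoint("tablet/" + tablet)           # step 2
    copy_tablet(tablet, source_region, dest_region)  # step 3

recover_tablet("T17", "west1", "east")               # "T17" is a made-up tablet id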
PNUTS Data Storage and Retrieval Features Data and Query Model System Architecture Consistency (Yahoo! Message Broker) Query Processing Experiments Recovery Structure Future Work
Bulk load Upload large blocks of records into the database; bulk inserts are done in parallel to multiple storage units. Hash table: natural load balancing. Ordered table: must avoid hot spots (a planning sketch follows below). Avoiding hot spots in ordered table
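A sketch of the planning-phase idea from the bulk-insertion reference at the end of this deck; the splitting heuristic and names here are illustrative assumptions, not the published algorithm:

def plan_bulk_load(keys, num_tablets, storage_units):
    """Pre-split the sorted key range into tablets and spread them across
    storage units, so sequential inserts do not all hit one SU (hot spot)."""
    keys = sorted(keys)
    step = max(1, len(keys) // num_tablets)
    tablets = [keys[i:i + step] for i in range(0, len(keys), step)]
    # Round-robin assignment spreads consecutive key ranges over different SUs.
    return {(t[0], t[-1]): storage_units[i % len(storage_units)]
            for i, t in enumerate(tablets)}

plan = plan_bulk_load(["apple", "banana", "grape", "kiwi", "lime", "mango"],
                      num_tablets=3, storage_units=["SU1", "SU2"])
# {('apple', 'banana'): 'SU1', ('grape', 'kiwi'): 'SU2', ('lime', 'mango'): 'SU1'}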
Query Processing Scatter-gather engine: receives a multi-record request, splits it into multiple individual requests (single-record gets or tablet scans), initiates the requests in parallel, then gathers the results and passes them to the client (a sketch follows below). Why a server-side design? It avoids many parallel client connections and enables server-side optimization (grouping requests to the same storage unit). Range scan: continuation object.
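A minimal sketch of the scatter-gather pattern, assuming the router has already resolved each key to a storage unit; get_from_su is a stand-in, not a PNUTS call:

from concurrent.futures import ThreadPoolExecutor

def get_from_su(su, key):
    """Stand-in for a single-record get() against one storage unit."""
    return (key, "record@%s" % su)

def scatter_gather(requests):
    """Split a multi-record request, issue the pieces in parallel, gather results."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(get_from_su, su, key)
                   for key, su in requests.items()]
        return dict(f.result() for f in futures)

print(scatter_gather({"k1": "SU1", "k2": "SU3", "k3": "SU2"}))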
Notifications Service to notify external systems of updates to data. Example: populating a keyword search-engine index. Clients subscribe to all topics (tablets) for a table; clients need no knowledge of the tablet organization. Creation of a new topic (tablet split) triggers automatic subscription. Subscriptions of slow notification clients are broken.
PNUTS Data Storage and Retrieval Features Data and Query Model System Architecture Consistency (Yahoo! Message Broker) Query Processing Experiments Recovery Structure Future Work
Experimental setup Production PNUTS code, enhanced with the ordered table type. Three PNUTS regions: 2 west coast, 1 east coast; 5 storage units, 2 message brokers, 1 router. West: dual 2.8 GHz Xeon, 4 GB RAM, 6-disk RAID-5 array. East: quad 2.13 GHz Xeon, 4 GB RAM, 1 SATA disk. Workload: 1200-3600 requests/second, 0-50% writes, 80% locality.
Experiments Inserts required 75.6 ms per insert in West 1 (the tablet master), 131.5 ms per insert into the non-master West 2, and 315.5 ms per insert into the non-master East.
Experiments Zipfian Distribution
Experiments
Bottlenecks Disk seek capacity on storage units; message brokers. Different PNUTS customers are assigned different clusters of storage units and message-broker machines; they can share routers and tablet controllers.
PNUTS Data Storage and Retrieval Features Data and Query Model System Architecture Consistency (Yahoo! Message Broker) Query Processing Experiments Recovery Structure Future Work
Future Work Consistency Referential integrity Bundled update Relaxed consistency Data Storage and Retrieval: Fair sharing of storage units and message brokers Query Processing Query optimization: Maintain statistics Expansion of query language: join/aggregation Batch-query processing Indexes and Materialized views Related Question
Question 1 (Dolphia Nandi) Q. In section 3.3.2 (Consistency via YMB and mastership): it says that 85% of writes to a record originate from the same datacenter, which obviously justifies locating the master there. My question: the remaining 15% (with timeline consistency) must go across the wide area to the master, making it difficult to enforce an SLA, which is often set to 99%. What is your thought about this? A. SLAs define a requirement of 99% availability. The remaining 15% of traffic that needs wide-area communication increases latency but does not affect availability. Back
Question 2 (Dolphia Nandi) Q. Can we develop a strongly consistent system that delivers acceptable performance and maintains availability in the face of any single replica failure? Or is it still a fancy dream? A. Strong consistency means ACID, and ACID properties trade off against availability, graceful degradation and performance; see references 1, 2 and 17. This trade-off is fundamental: the CAP theorem, outlined by Eric Brewer in 2000, says that a distributed database system can have at most two of the following three characteristics: Consistency, Availability, Partition tolerance. Also see the NRW notation, which describes at a high level how a distributed database trades off consistency, read performance and write performance: http://www.infoq.com/news/2008/03/ebaybase .
Question 3 (Dr. Murat Demirbas) Q. Why would users need table scans, scatter-gather operations, bulk loading? Are these just best-effort operations or is some consistency provided? A. Some social applications may need these functionalities, for instance searching through all the available users to find the ones that reside in Buffalo, NY. Bulk loading may be needed in user-database-like applications, for uploading statistics such as page views. At the time this paper was written, technology existed to optimize bulk loading. However, for consistency, the paper says: “In particular, we make no guarantees as to consistency for multi-record transactions. Our model can provide serializability on a per-record basis. In particular, if an application reads or writes the same record multiple times in the same “transaction,” the application must use record versions to validate its own reads and writes to ensure serializability for the “transaction.”” Scatter-gather operations are best effort. They propose bundled-update-like consistency operations as future work; no major literature has appeared since.
Question 4a  (Dr. Murat Demirbas) Q. How does primary replica recovery and replica recovery work?  A.  In slides. Q. Is PNUTS (or components of PNUTS) available as open source? A. No. (Source:  http://wiki.apache.org/hadoop/Hbase/PNUTS ) Back
Question 4b (Dr. Murat Demirbas) Q. Can PNUTS work in a partitioned network? How does conflict resolution work when the network is united again? A. This is mentioned as future work in the paper: “Under normal operation, if the master copy of a record fails, our system has protocols to fail over to another replica. However, if there are major outages, e.g., the entire region that had the master copy for a record becomes unreachable, updates cannot continue at another replica without potentially violating record-timeline consistency. We will allow applications to indicate, per-table, whether they want updates to continue in the presence of major outages, potentially branching the record timeline. If so, we will provide automatic conflict resolution and notifications thereof. The application will also be able to choose from several conflict resolution policies: e.g., discarding one branch, or merging updates from branches, etc.” Back
Question 5 (Dr. Murat Demirbas) Q. How do you compare the PNUTS algorithm with the Paxos algorithm? A. PNUTS, though it uses leadership, is not entirely based on the Paxos algorithm: it does not need a quorum to agree before deciding to execute/commit a query. The query is first executed at the master and then applied to the rest of the replicas. That said, Paxos does seem to be used in publishing to YMB (unless YMB uses chain replication, or something else).
Question 6 (Fatih) Q. Is there any comparison with other data storage systems like Facebook’s Cassandra? The latency seems quite high in the figures (e.g., Figure 3). In addition, when I was reading I was surprised by the quality of the machines used in the experiment (Table 1). Could that be a reason for the high latency? A. SLA guarantees generally start from 50 ms and extend up to 150 ms across geographically diverse regions, so the first few graphs are within the SLA, unless the number of clients is increased (in which case a production environment can load-balance to provide a better SLA). It is an experimental setup, hence the limitations. Here is a link to such comparisons (Cassandra, HBase, Sherpa and MySQL): http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf . Note that these systems are not all similar; for instance, Cassandra is P2P.
Question 7 (Hanifi Güneş) Q. PNUTS leverages locality of updates. Given the typical applications running over PNUTS, we can say the system is designed for single-write-multiple-read applications, say Flickr or even Facebook, where only a single user is permitted to write and all others to read; locality of updates thus seems a good fit for this design. How can we translate what we have learned from PNUTS into multiple-read-multiple-write applications like a wide-area file system? Any ideas? A. PNUTS is not strictly single-write-multiple-read, but good conflict resolution is the key if it needs to handle multiple writers more effectively. PNUTS serves a different function than a wide-area file system: it is designed for the simple queries of web traffic. Q. Also, in case of a primary replica failure or partition, where/how exactly are incoming requests routed? And how is data consistency ensured in case of “branching” (the term is used in the same context as in the paper, pg 4)? Can you throw some light on this? A. The record has soft-state metadata. Resolving branching is future work.
Question 8 (Yong Wang) Q. In section 2.2 about per-record timeline consistency, it seems that the consistency is serializability. All updates are forwarded to the master; suppose two replicas update their record at the same time. According to the paper they would forward the update to the master, but since their timelines are similar (updated at the same time), how does the master recognize that scenario? A. Tablet masters resolve this scenario. Back
Question 9 (Santosh) Q. Sec 3.2.1: How are changes in data handled while the previous data is yet to be propagated to all the systems? A. Versioning of updates for timeline consistency. Q. How are hotspots related to user profiles handled? A. In slides. Back (Consistency Model) Back (Bulk load)
Question 10 Q. The author states in section 3.2.1: “While stronger ordering guarantees would simplify this protocol (published messaging), global ordering is too expensive to provide when different brokers are located in geographically separated datacenters.” Since this paper was published in 2008, has any work been done to provide global ordering and mitigate the disadvantage of its being too expensive? Wouldn’t simplifying the protocol make it easier to achieve robustness and further advancement? A. They have done optimizations of Sherpa but have not really changed the basic PNUTS design, or at least have not published them. Global ordering being expensive reflects the fundamental trade-off of consistency vs. performance. Simplifying how? This is as simple as Yahoo! could keep the protocol while still meeting the demands of their applications.
Thank you !!
Additional Definitions JSON  (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. InnoDB  is a transactional storage engine for the MySQL open source database. Soft state  expires unless it is refreshed.  A transaction schedule has the  Serializability  property, if its outcome (e.g., the resulting database state, the values of the database's data) is equal to the outcome of its transactions executed serially, i.e., sequentially without overlapping in time. Source:  http://en.wikipedia.org Back
Zipfian distribution A distribution of probabilities of occurrence that follows Zipf’s Law. Zipf’s Law: the probability of occurrence of words or other items starts high and tapers off; thus, a few occur very often while many others occur rarely. Formal definition: P_n ∝ 1/n^a, where P_n is the frequency of occurrence of the nth-ranked item and a is close to 1. Source: http://xw2k.nist.gov Back
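A small sketch of generating a Zipfian request workload like the one used in the experiments; the parameters here are illustrative assumptions, since the paper's actual generator is not given:

import random

def zipf_weights(n, a=1.0):
    """Unnormalized Zipf weights, P_k ∝ 1/k^a for ranks 1..n."""
    return [1.0 / (k ** a) for k in range(1, n + 1)]

def sample_keys(n_items, n_requests, a=1.0, seed=0):
    """Draw request keys with Zipfian popularity over item ranks 0..n_items-1."""
    rng = random.Random(seed)
    return rng.choices(range(n_items), weights=zipf_weights(n_items, a),
                       k=n_requests)

hits = sample_keys(1000, 10000)
# A few top-ranked keys dominate: rank 0 alone draws roughly 13% of requests.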
Bulk loading support An approach in which a planning phase is invoked before the actual insertions. By creating new partitions and intelligently distributing partitions across machines, the planning phase ensures that the insertion load will be well-balanced. Source: A. Silberstein, B. F. Cooper, U. Srivastava, E. Vee, R. Yerneni, and R. Ramakrishnan. Efficient bulk insertion into a distributed ordered table. In Proc. SIGMOD, 2008. Related Question
Goal: make it easier for applications to reason about updates and cope with asynchrony. What happens to a record with primary key “Brian”? Consistency model (animation): over time the record is inserted, updated repeatedly, and deleted, producing versions v.1 through v.8 in Generation 1; v.8 is the current version, and older cached copies are stale versions.
Read (read-any): may return a stale version.
Read up-to-date (read-latest): returns the current version.
Read-critical(required version): e.g., “Read ≥ v.6” returns a version no older than v.6.
Write: appends a new version to the timeline.
Test-and-set-write(required version): “Write if = v.7” returns ERROR when the current version is not v.7.
Mechanism: per-record mastership. Back
What is PNUTS? Parallel database; geographic replication; structured, flexible schema; hosted, managed infrastructure. (Diagram: the same table of records, e.g. A 42342, B 42521, …, F 15677, each tagged with its master region E/W/C, replicated across three regions.)
Detailed architecture (data-path components): clients issue requests through a REST API to routers, which forward them to storage units; the tablet controller maintains the mapping, and the message broker carries updates.
Detailed architecture (across regions): each local region contains clients, a REST API, routers, a tablet controller and storage units; YMB connects the local region to remote regions.
Accessing data (animation): (1) the client sends “Get key k” to the router; (2) the router forwards the get to the storage unit holding k; (3) the SU returns the record for key k; (4) the router passes the record back to the client.
Bulk read (animation): (1) the client sends a multi-key request {k1, k2, …, kn} to the scatter/gather server; (2) the server issues “Get k1”, “Get k2”, … in parallel to the storage units holding each key.
Range queries (animation): the router’s interval mapping (MIN-Canteloupe → SU1, Canteloupe-Lime → SU3, Lime-Strawberry → SU2, Strawberry-MAX → SU1) lets it split a scan such as “Grapefruit…Pear?” into “Grapefruit…Lime?” for SU3 and “Lime…Pear?” for SU2; each storage unit holds a sorted run of keys (Apple, Avocado, Banana, … Watermelon).
Updates (animation, steps 1-8): the client’s “Write key k” goes through a router to the storage unit holding the record’s master copy; the SU publishes the write to the message broker, which commits it and returns a sequence number for key k; SUCCESS then propagates back through the SU and router to the client.
Asynchronous replication Back
