2010 NoSQL Summer Reading List



Original article: http://www.empiricalreality.com/2010/09/22/2010-nosql-summer-reading-list/

The NoSQL Summer Reading List captures many of the relevant papers for understanding the importance and underlying technology of the NoSQL approach to data storage. The list was put together by Tim Anglade, and spawned local reading groups all over the world that meet to discuss the papers. Pointers to the local reading groups are listed at http://nosqlsummer.org/.

Krishna Sankar then organized the list into topical groups and added a few topics as extra reading, or as the beginnings of a winter reading list. A couple of the topics are currently placeholders for additional readings.

Personally, I read many of these papers in roughly chronological order over the years.

The above two lists conveniently pointed to download locations for the papers, but they didn’t contain the formal paper citations I wanted. I therefore collected that information and provide it here to save others the effort.

Several files are available here for download:


2010_nosql_summer_reading_list_citations.txt
Contains the list of paper citations in ACM Ref format, in topical group order.


2010_nosql_summer_reading_list_bibtex.txt
Contains the list of paper citations in BibTeX format, again in topical group order.


2010_nosql_summer_reading_list_curl.txt
Contains a curl script that downloads all of the papers to the current directory. The downloaded files are renamed from their source names to a uniform format that I use for my personal library, “author year paper title”; the file name is derived from the citation. (Note: for additional background, I added the responses to Henry Baker’s ACM Forum letter to the reading list here. Accessing this download requires logging in to your ACM account; I did not find the responses available elsewhere.)


2010_nosql_summer_reading_list_json.txt
2010_nosql_summer_reading_list_json_schema.txt
The first file contains all of the collected data in JSON form; the structure of the data should be apparent. The second file contains a presumed-accurate schema for the first (note: I don’t use this schema file programmatically). Two fields of note: (a) the seqoriginal field gives the sequence in the original list of papers, and (b) the seqbytopic field gives the sequence in topical list order. Sorting the data on these fields yields either the original-order or the topical-order list.

Below is the expanded 2010 NoSQL Summer Reading List in topical group order, with abstracts, citations, etc. It was generated from the JSON data file above.

Thanks again to Tim and Krishna.

Enjoy.

* * * * *


1 Core NoSQL
2 CAP Theorem
3 SQL
4 Distributed Storage
5 Distributed Time
6 Algorithmics
7 Internet-Scale Systems
8 Vector Clocks
9 Bloom Filter
10 Schemes for the usage of memory & disk
11 Gossip Protocol
12 Consistent Hashing
13 Failure Detection
A Appendix



* * * * *




1 Core NoSQL



1.1 Amazon’s Dynamo


by Giuseppe DeCandia, et al.

Abstract

Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust. The Amazon.com platform, which provides services for many web sites worldwide, is implemented on top of an infrastructure of tens of thousands of servers and network components located in many datacenters around the world. At this scale, small and large components fail continuously and the way persistent state is managed in the face of these failures drives the reliability and scalability of the software systems.

This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.

Citation

DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. 2007. Dynamo: Amazon’s highly available key-value store. In Proceedings of Twenty-First ACM SIGOPS Symposium on Operating Systems Principles (Stevenson, Washington, USA, October 14-17, 2007). SOSP ’07. ACM, New York, NY, 205-220. DOI= http://doi.acm.org/10.1145/1294261.1294281

Difficulty: ++


Download Link.


1.2 Cassandra — A Decentralized Structured Storage System


by Avinash Lakshman and Prashant Malik

Abstract

Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure. Cassandra aims to run on top of an infrastructure of hundreds of nodes (possibly spread across different data centers). At this scale, small and large components fail continuously. The way Cassandra manages the persistent state in the face of these failures drives the reliability and scalability of the software systems relying on this service. While in many ways Cassandra resembles a database and shares many design and implementation strategies therewith, Cassandra does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format. Cassandra system was designed to run on cheap commodity hardware and handle high write throughput while not sacrificing read efficiency.

Citation

Lakshman, A. and Malik, P. 2010. Cassandra: a decentralized structured storage system. SIGOPS Oper. Syst. Rev. 44, 2 (Apr. 2010), 35-40. DOI= http://doi.acm.org/10.1145/1773912.1773922

Difficulty: +


Download Link.


1.3 Google’s BigTable


by Fay Chang, et al.

Abstract

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.

Citation

Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. 2006. Bigtable: a distributed storage system for structured data. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (Seattle, Washington, November 06-08, 2006). Operating Systems Design and Implementation. USENIX Association, Berkeley, CA, 205-218.

Chang, F., Dean, J., Ghemawat, S., Hsieh, W., Wallach, D., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. 2008. Bigtable: A Distributed Storage System for Structured Data. ACM Trans. Comput. Syst. 26, 2 (Jun. 2008), 1-26. DOI= http://doi.acm.org/10.1145/1365815.1365816

Difficulty: +++


Download Link.


1.4 The Google File System


by Sanjay Ghemawat, et al.

Abstract

We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.

Citation

Ghemawat, S., Gobioff, H., and Leung, S. 2003. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (Bolton Landing, NY, USA, October 19-22, 2003). SOSP ’03. ACM, New York, NY, 29-43. DOI= http://doi.acm.org/10.1145/945445.945450

Difficulty:


Download Link.


1.5 Google’s MapReduce


by Jeffrey Dean and Sanjay Ghemawat

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.

Citation

Dean, J. and Ghemawat, S. 2004. MapReduce: simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation – Volume 6 (San Francisco, CA, December 06-08, 2004). Operating Systems Design and Implementation. USENIX Association, Berkeley, CA, 10-10.

Difficulty: ++


Download Link.
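As a concrete illustration of the programming model the abstract describes (not Google’s implementation), here is a minimal single-process Python sketch of word count; the function names and the in-memory shuffle step are my own simplifications.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: emit an intermediate (word, 1) pair per word occurrence."""
    for word in text.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: merge all intermediate values for one key."""
    yield word, sum(counts)

def map_reduce(inputs, map_fn, reduce_fn):
    # Shuffle: group intermediate values by key (the runtime's job in a
    # real MapReduce system, done here with a plain dict).
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    results = []
    for k, vs in sorted(groups.items()):
        results.extend(reduce_fn(k, vs))
    return results

print(map_reduce([("d1", "the cat sat"), ("d2", "the cat")],
                 map_fn, reduce_fn))
# [('cat', 2), ('sat', 1), ('the', 2)]
```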


2 CAP Theorem


2.1 Towards Robust Distributed Systems (Brewer’s 2000 PODC Keynote)


by Eric A. Brewer

Abstract

Current distributed systems, even the ones that work, tend to be very fragile: they are hard to keep up, hard to manage, hard to grow, hard to evolve, and hard to program. In this talk, I look at several issues in an attempt to clean up the way we think about these systems. These issues include the fault model, high availability, graceful degradation, data consistency, evolution, composition, and autonomy.

These are not (yet) provable principles, but merely ways to think about the issues that simplify design in practice. They draw on experience at Berkeley and with giant-scale systems built at Inktomi, including the system that handles 50% of all web searches.

Citation

Brewer, E. A. 2000. Towards robust distributed systems (abstract). In Proceedings of the Nineteenth Annual ACM Symposium on Principles of Distributed Computing (Portland, Oregon, United States, July 16-19, 2000). PODC ’00. ACM, New York, NY, 7. DOI= http://doi.acm.org/10.1145/343477.343502

Difficulty:


Download Link.


2.2 BASE: an Acid Alternative


by Dan Pritchett

Abstract

In partitioned databases, trading some consistency for availability can lead to dramatic improvements in scalability.

Citation

Pritchett, D. 2008. BASE: An Acid Alternative. Queue 6, 3 (May. 2008), 48-55. DOI= http://doi.acm.org/10.1145/1394127.1394128

Difficulty: +


Download Link.


2.3 The Byzantine Generals Problem


by Leslie Lamport, et al.

Abstract

Reliable computer systems must handle malfunctioning components that give conflicting information to different parts of the system. This situation can be expressed abstractly in terms of a group of generals of the Byzantine army camped with their troops around an enemy city. Communicating only by messenger, the generals must agree upon a common battle plan. However, one or more of them may be traitors who will try to confuse the others. The problem is to find an algorithm to ensure that the loyal generals will reach agreement. It is shown that, using only oral messages, this problem is solvable if and only if more than two-thirds of the generals are loyal; so a single traitor can confound two loyal generals. With unforgeable written messages, the problem is solvable for any number of generals and possible traitors. Applications of the solutions to reliable computer systems are then discussed.

Citation

Lamport, L., Shostak, R., and Pease, M. 1982. The Byzantine Generals Problem. ACM Trans. Program. Lang. Syst. 4, 3 (Jul. 1982), 382-401. DOI= http://doi.acm.org/10.1145/357172.357176

Difficulty: +


Download Link.


2.4 The CAP Theorem


by Seth Gilbert and Nancy Lynch

Abstract

When designing distributed web services, there are three properties that are commonly desired: consistency, availability, and partition tolerance. It is impossible to achieve all three. In this note, we prove this conjecture in the asynchronous network model, and then discuss solutions to this dilemma in the partially synchronous model.

Citation

Gilbert, S. and Lynch, N. 2002. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News 33, 2 (Jun. 2002), 51-59. DOI= http://doi.acm.org/10.1145/564585.564601

Difficulty: ++


Download Link.


2.5 Eventually Consistent


by Werner Vogels

Abstract

Building reliable distributed systems at a worldwide scale demands trade-offs — between consistency and availability.

At the foundation of Amazon’s cloud computing are infrastructure services such as Amazon’s S3 (Simple Storage Service), SimpleDB, and EC2 (Elastic Compute Cloud) that provide the resources for constructing Internet-scale computing platforms and a great variety of applications. The requirements placed on these infrastructure services are very strict; they need to score high marks in the areas of security, scalability, availability, performance, and cost effectiveness, and they need to meet these requirements while serving millions of customers around the globe, continuously.

Under the covers these services are massive distributed systems that operate on a worldwide scale. This scale creates additional challenges, because when a system processes trillions and trillions of requests, events that normally have a low probability of occurrence are now guaranteed to happen and need to be accounted for up front in the design and architecture of the system. Given the worldwide scope of these systems, we use replication techniques ubiquitously to guarantee consistent performance and high availability. Although replication brings us closer to our goals, it cannot achieve them in a perfectly transparent manner; under a number of conditions the customers of these services will be confronted with the consequences of using replication techniques inside the services.

One of the ways in which this manifests itself is in the type of data consistency that is provided, particularly when many widespread distributed systems provide an eventual consistency model in the context of data replication. When designing these large-scale systems at Amazon, we use a set of guiding principles and abstractions related to large-scale data replication and focus on the trade-offs between high availability and data consistency. In this article I present some of the relevant background that has informed our approach to delivering reliable distributed systems that need to operate on a global scale. An earlier version of this text appeared as a posting on the All Things Distributed weblog and was greatly improved with the help of its readers.

Citation

Vogels, W. 2009. Eventually consistent. Commun. ACM 52, 1 (Jan. 2009), 40-44. DOI= http://doi.acm.org/10.1145/1435417.1435432

Difficulty: +


Download Link.


2.6 The End of an Architectural Era


by Michael Stonebraker, et al.

Abstract

In previous papers, some of us predicted the end of “one size fits all” as a commercial relational DBMS paradigm. These papers presented reasons and experimental evidence that showed that the major RDBMS vendors can be outperformed by 1-2 orders of magnitude by specialized engines in the data warehouse, stream processing, text, and scientific database markets.

Assuming that specialized engines dominate these markets over time, the current relational DBMS code lines will be left with the business data processing (OLTP) market and hybrid markets where more than one kind of capability is required. In this paper we show that current RDBMSs can be beaten by nearly two orders of magnitude in the OLTP market as well. The experimental evidence comes from comparing a new OLTP prototype, H-Store, which we have built at M.I.T., to a popular RDBMS on the standard transactional benchmark, TPC-C.

We conclude that the current RDBMS code lines, while attempting to be a “one size fits all” solution, in fact, excel at nothing. Hence, they are 25 year old legacy code lines that should be retired in favor of a collection of “from scratch” specialized engines. The DBMS vendors (and the research community) should start with a clean sheet of paper and design systems for tomorrow’s requirements, not continue to push code lines and architectures designed for yesterday’s needs.

Citation

Stonebraker, M., Madden, S., Abadi, D. J., Harizopoulos, S., Hachem, N., and Helland, P. 2007. The end of an architectural era: (it’s time for a complete rewrite). In Proceedings of the 33rd international Conference on Very Large Data Bases (Vienna, Austria, September 23-27, 2007). Very Large Data Bases. VLDB Endowment, 1150-1160.

Difficulty: +


Download Link.


2.7 Harvest, Yield, and Scalable Tolerant Systems


by Armando Fox and Eric Brewer

Abstract

The cost of reconciling consistency and state management with high availability is highly magnified by the unprecedented scale and robustness requirements of today’s Internet applications. We propose two strategies for improving overall availability using simple mechanisms that scale over large applications whose output behavior tolerates graceful degradation. We characterize this degradation in terms of harvest and yield, and map it directly onto engineering mechanisms that enhance availability by improving fault isolation, and in some cases also simplify programming. By collecting examples of related techniques in the literature and illustrating the surprising range of applications that can benefit from these approaches, we hope to motivate a broader research program in this area.

Citation

Fox, A. and Brewer, E. A. 1999. Harvest, Yield, and Scalable Tolerant Systems. In Proceedings of the Seventh Workshop on Hot Topics in Operating Systems (March 28-30, 1999). HOTOS. IEEE Computer Society, Washington, DC, 174.

Difficulty: +


Download Link.


2.8 Life beyond Distributed Transactions: an Apostate’s Opinion


by Pat Helland

Abstract

Many decades of work have been invested in the area of distributed transactions including protocols such as 2PC, Paxos, and various approaches to quorum. These protocols provide the application programmer a façade of global serializability. Personally, I have invested a non-trivial portion of my career as a strong advocate for the implementation and use of platforms providing guarantees of global serializability.

My experience over the last decade has led me to liken these platforms to the Maginot Line. In general, application developers simply do not implement large scalable applications assuming distributed transactions. When they attempt to use distributed transactions, the projects founder because the performance costs and fragility make them impractical. Natural selection kicks in.

Instead, applications are built using different techniques which do not provide the same transactional guarantees but still meet the needs of their businesses.

This paper explores and names some of the practical approaches used in the implementations of large-scale mission-critical applications in a world which rejects distributed transactions. We discuss the management of fine-grained pieces of application data which may be repartitioned over time as the application grows. We also discuss the design patterns used in sending messages between these repartitionable pieces of data.

The reason for starting this discussion is to raise awareness of new patterns for two reasons. First, it is my belief that this awareness can ease the challenges of people hand-crafting very large scalable applications. Second, by observing the patterns, hopefully the industry can work towards the creation of platforms that make it easier to build these very large applications.

Citation

Pat Helland. 2007. Life beyond Distributed Transactions: an Apostate’s Opinion. 3rd Biennial Conference on Innovative DataSystems Research (CIDR) January 7-10 2007, Asilomar, California USA.

Difficulty: +


Download Link.


3 SQL


3.1 The Transaction Concept: Virtues and Limitations


by Jim Gray

Abstract

A transaction is a transformation of state which has the properties of atomicity (all or nothing), durability (effects survive failures) and consistency (a correct transformation). The transaction concept is key to the structuring of data management applications. The concept may have applicability to programming systems in general. This paper restates the transaction concepts and attempts to put several implementation approaches in perspective. It then describes some areas which require further study: (1) the integration of the transaction concept with the notion of abstract data type, (2) some techniques to allow transactions to be composed of sub-transactions, and (3) handling transactions which last for extremely long times (days or months).

Citation

Gray, J. 1988. The transaction concept: virtues and limitations. In Readings in Database Systems Morgan Kaufmann Publishers, San Francisco, CA, 140-150.

Difficulty: ++


Download Link.


3.2 Relational Databases Considered Harmful


by Henry G. Baker

Abstract

I had great difficulty in controlling my mirth while I read the self-congratulatory article “Database Systems: Achievements and Opportunities” in the October, 1991, issue of the Communications, because its authors consider relational databases to be one of the three major achievements of the past two decades. As a designer of commercial manufacturing applications on IBM mainframes in the late 1960′s and early 1970′s, I can categorically state that relational databases set the commercial data processing industry back at least ten years and wasted many of the billions of dollars that were spent on data processing. With the recent arrival of object-oriented databases, the industry may finally achieve some of the promises which were made 20 years ago about the capabilities of computers to automate and improve organizations.

Citation

Henry G. Baker. 1992. Relational Databases considered harmful (relative to object-oriented databases.) ACM Forum. Comm. of the ACM 35, 4 (April 1992), 16,18.

Difficulty:


Download Link.


3.2.1 Responses to Baker’s Forum Letter


by Albert D’Andrea and others

Abstract

In his “Forum” letter (Apr. 1992, pp. 16, 18) concerning the article, “Database Systems: Achievements and Opportunities” (Oct. 1991, p. 110), Henry Baker rightly asserts the well-known problems of relational database systems. Notwithstanding, Baker’s amusing portrayal of the relational era as the “Dark Ages of commercial data processing” is simply not correct. In fact, future historians may well view relational technology as the primordial soup from which a far superior class of database systems evolved.

[N.B.: Added to the lists. Requires ACM login to access. -BJG]

Citation

Albert D’Andrea and others. 1992. More on Relational Database Systems. ACM forum. Commun. ACM 35, 8 (Aug. 1992), 13. DOI= http://doi.acm.org/10.1145/135226.376116

Difficulty:


Download Link.


3.3 The 1995 SQL Reunion: People, Projects, and Politics


by Paul McJones

Abstract

A reunion of people who worked on System R and its derivatives, including SQL/DS, DB2, and R*, was held at Asilomar on May 29, 1995. This is an edited transcript of the day’s discussions, incorporating changes provided by the speakers. It provides an informal but first-hand account of the birth of SQL, the history of System R, and the origins of a number of other relational systems inside and outside IBM.

Recommended paired reading: Henry Baker’s 1991 letter to the ACM.

Citation

Paul McJones (editor). 1997. The 1995 SQL Reunion: People, Projects, and Politics. August 20, 1997 (2nd edition).

Difficulty: +


Download Link.


3.4 Access Path Selection in an RDBMS


by P. Griffiths Selinger, et al.

Abstract

In a high level query and data manipulation language such as SQL, requests are stated non-procedurally, without reference to access paths. This paper describes how System R chooses access paths for both simple (single relation) and complex queries (such as joins), given a user specification of desired data as a boolean expression of predicates. System R is an experimental database management system developed to carry out research on the relational model of data. System R was designed and built by members of the IBM San Jose Research Laboratory.

Citation

Selinger, P. G., Astrahan, M. M., Chamberlin, D. D., Lorie, R. A., and Price, T. G. 1979. Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD international Conference on Management of Data (Boston, Massachusetts, May 30-June 01, 1979). SIGMOD ’79. ACM, New York, NY, 23-34. DOI= http://doi.acm.org/10.1145/582095.582099

Difficulty: +++


Download Link.


3.5 Codd’s Relational Model


by E.F. Codd

Abstract

Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation). A prompting service which supplies such information is not a satisfactory solution. Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed. Changes in data representation will often be needed as a result of changes in query, update, and report traffic and natural growth in the types of stored information.

Citation

Codd, E. F. 1970. A relational model of data for large shared data banks. Commun. ACM 13, 6 (Jun. 1970), 377-387. DOI= http://doi.acm.org/10.1145/362384.362685

Difficulty: ++


Download Link.


4 Distributed Storage


4.1 Stasis: Flexible Transactional Storage


by Russell Sears and Eric Brewer

Abstract

An increasing range of applications requires robust support for atomic, durable and concurrent transactions. Databases provide the default solution, but force applications to interact via SQL and to forfeit control over data layout and access mechanisms. We argue there is a gap between DBMSs and file systems that limits designers of data-oriented applications.

Stasis is a storage framework that incorporates ideas from traditional write-ahead logging algorithms and file systems. It provides applications with flexible control over data structures, data layout, robustness, and performance. Stasis enables the development of unforeseen variants on transactional storage by generalizing write-ahead logging algorithms. Our partial implementation of these ideas already provides specialized (and cleaner) semantics to applications.

We evaluate the performance of a traditional transactional storage system based on Stasis, and show that it performs favorably relative to existing systems. We present examples that make use of custom access methods, modified buffer manager semantics, direct log file manipulation, and LSN-free pages. These examples facilitate sophisticated performance optimizations such as zero-copy I/O. These extensions are composable, easy to implement and significantly improve performance.

Citation

Sears, R. and Brewer, E. 2006. Stasis: flexible transactional storage. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (Seattle, Washington, November 06-08, 2006). Operating Systems Design and Implementation. USENIX Association, Berkeley, CA, 29-44.

Difficulty: ++


Download Link.


4.2 A History of the Virtual Synchrony Replication Model


by Ken Birman

Abstract

A “Cloud Computing” revolution is underway, supported by massive data centers that often contain thousands (if not hundreds of thousands) of servers. In such systems, scalability is the mantra and this, in turn, compels application developers to replicate various forms of information. By replicating the data needed to handle client requests, many services can be spread over a cluster to exploit parallelism. Servers also use replication to implement high availability and fault-tolerance mechanisms, ensure low latency, implement caching, and provide distributed management and control. On the other hand, replication is hard to implement, hence developers typically turn to standard replication solutions, packaged as sharable libraries.

Virtual synchrony, the technology on which this article will focus, was created by the author and his colleagues in the early 1980’s to support these sorts of applications, and was the first widely adopted solution in the area. Viewed purely as a model, virtual synchrony defines rules for replicating data or a service that will behave in a manner indistinguishable from the behavior of some non-replicated reference system running on a single non-faulty node. The model is defined in the standard asynchronous network model for crash failures. This turns out to be ideal for the uses listed above.

Citation

Ken Birman. 2010. History of the Virtual Synchrony Replication Model. Appears in Replication: Theory and Practice. B. Charron-Bost, F. Pedone, A. Schiper (Eds) Springer Verlag, 2010. Replication, LNCS 5959, pp. 91–120, 2010.

Difficulty: +


Download Link.


5 Distributed Time


5.1 Paxos Made Simple


by Leslie Lamport

Abstract

The Paxos algorithm, when presented in plain English, is very simple.

Citation

Leslie Lamport. 2001. Paxos Made Simple. ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001) 51-58.

Difficulty: ++


Download Link.


5.2 Paxos Made Practical


by David Mazières

Abstract

Paxos is a simple protocol that a group of machines in a distributed system can use to agree on a value proposed by a member of the group. If it terminates, the protocol reaches consensus even if the network was unreliable and multiple machines simultaneously tried to propose different values.

Citation

David Mazières. 2007. Paxos Made Practical. Unpublished.

Difficulty:


Download Link.


5.3 Time, Clocks, and the Ordering of Events in a Distributed System


by Leslie Lamport

Abstract

The concept of one event happening before another in a distributed system is examined, and is shown to define a partial ordering of the events. A distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events. The use of the total ordering is illustrated with a method for solving synchronization problems. The algorithm is then specialized for synchronizing physical clocks, and a bound is derived on how far out of synchrony the clocks can become.

Citation

Lamport, L. 1978. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7 (Jul. 1978), 558-565. DOI= http://doi.acm.org/10.1145/359545.359563

Difficulty: ++


Download Link.
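A minimal Python sketch of the logical-clock rule the abstract describes: increment on local events, and on receipt set the clock to one more than the maximum of the local and received timestamps. The class and method names here are mine, not Lamport’s.

```python
class LamportClock:
    """Logical clock per Lamport (1978): ticks on local events;
    on receive, takes max(local, received) + 1."""
    def __init__(self):
        self.time = 0

    def tick(self):               # local event
        self.time += 1
        return self.time

    def send(self):
        return self.tick()        # the timestamp travels with the message

    def receive(self, msg_time):
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.send()            # a: 1
b.receive(t)            # b: 2 -- the receive is ordered after the send
print(a.time, b.time)   # 1 2
```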


5.4 Timestamps in Message-Passing Systems That Preserve the Partial Ordering


by Colin J. Fidge

Abstract

Timestamping is a common method of totally ordering events in concurrent programs. However, for applications requiring access to the global state, a total ordering is inappropriate. This paper presents algorithms for timestamping events in both synchronous and asynchronous message-passing programs that allow for access to the partial ordering inherent in a parallel system. The algorithms do not change the communications graph or require a central timestamp issuing authority.

Citation

Fidge, C. J. 1988. Timestamps in message-passing systems that preserve the partial ordering. In K. Raymond, editor, Proceedings of the 11th Australian Computer Science Conference (ACSC’88), pages 56-66, February 3-5 1988.

Difficulty: ++


Download Link.
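A small Python sketch of the vector-timestamp idea (my own simplification, not Fidge’s pseudocode): each process keeps one counter per process, merges componentwise maxima on receipt, and componentwise comparison recovers the partial ordering the abstract mentions.

```python
class VectorClock:
    """Per-process vector timestamp: component i counts events observed
    at process i; comparison of vectors is a partial order."""
    def __init__(self, n, pid):
        self.v, self.pid = [0] * n, pid

    def tick(self):
        self.v[self.pid] += 1
        return list(self.v)             # attach a copy to outgoing messages

    def receive(self, other):
        self.v = [max(a, b) for a, b in zip(self.v, other)]
        self.v[self.pid] += 1

def happened_before(u, v):
    """u -> v iff u <= v componentwise and u != v; otherwise concurrent."""
    return all(a <= b for a, b in zip(u, v)) and u != v

p0, p1 = VectorClock(2, 0), VectorClock(2, 1)
m = p0.tick()                          # p0: [1, 0]
p1.receive(m)                          # p1: [1, 1]
print(happened_before([1, 0], p1.v))   # True
```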


5.5 Virtual Time and Global States of Distributed Systems


by Friedemann Mattern

Abstract

A distributed system can be characterized by the fact that the global state is distributed and that a common time base does not exist. However, the notion of time is an important concept in every day life of our decentralized “real world” and helps to solve problems like getting a consistent population census or determining the potential causality between events. We argue that a linearly ordered structure of time is not (always) adequate for distributed systems and propose a generalized non-standard model of time which consists of vectors of clocks. These clock-vectors are partially ordered and form a lattice. By using timestamps and a simple clock update mechanism the structure of causality is represented in an isomorphic way. The new model of time has a close analogy to Minkowski’s relativistic spacetime and leads among others to an interesting characterization of the global state problem. Finally, we present a new algorithm to compute a consistent global snapshot of a distributed system where messages may be received out of order.

Citation

Friedemann Mattern. 1989. Virtual Time and Global States of Distributed Systems. In: Cosnard M. et al. (Ed.): Proc. Workshop on Parallel and Distributed Algorithms. pp. 215-226, North-Holland / Elsevier, 1989. (Reprinted in: Z. Yang, T.A. Marsland (Eds.), “Global States and Time in Distributed Systems”, IEEE, 1994, pp. 123-133.)

Difficulty: ++


Download Link.


5.6 Google’s Chubby


by Mike Burrows

Abstract

We describe our experiences with the Chubby lock service, which is intended to provide coarse-grained locking as well as reliable (though low-volume) storage for a loosely-coupled distributed system. Chubby provides an interface much like a distributed file system with advisory locks, but the design emphasis is on availability and reliability, as opposed to high performance. Many instances of the service have been used for over a year, with several of them each handling a few tens of thousands of clients concurrently. The paper describes the initial design and expected use, compares it with actual use, and explains how the design had to be modified to accommodate the differences.

Citation

Burrows, M. 2006. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (Seattle, Washington, November 06-08, 2006). Operating Systems Design and Implementation. USENIX Association, Berkeley, CA, 335-350.

Difficulty: ++


Download Link.


6 Algorithmics


6.1 CRDTs: Consistency without concurrency control


by Mihai Letia, Nuno Preguiça, and Marc Shapiro

Abstract

A CRDT is a data type whose operations commute when they are concurrent. Replicas of a CRDT eventually converge without any complex concurrency control. As an existence proof, we exhibit a non-trivial CRDT: a shared edit buffer called Treedoc. We outline the design, implementation and performance of Treedoc. We discuss how the CRDT concept can be generalized, and its limitations.

Citation

Mihai Letia and Nuno M. Preguica and Marc Shapiro. 2009. CRDTs: Consistency without concurrency control. CoRR abs/0907.0929 2009. http://arxiv.org/abs/0907.0929

Difficulty: ++


Download Link.
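The paper’s example CRDT is Treedoc, a shared edit buffer; a much smaller example of the same commutativity idea is a grow-only counter. The following Python sketch is mine, not from the paper.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica; merge = elementwise
    max. Merges commute and are idempotent, so replicas converge without
    any concurrency control."""
    def __init__(self, n, rid):
        self.slots, self.rid = [0] * n, rid

    def increment(self):
        self.slots[self.rid] += 1

    def value(self):
        return sum(self.slots)

    def merge(self, other):
        self.slots = [max(a, b) for a, b in zip(self.slots, other.slots)]

r0, r1 = GCounter(2, 0), GCounter(2, 1)
r0.increment(); r1.increment(); r1.increment()
r0.merge(r1); r1.merge(r0)        # anti-entropy, in either order
print(r0.value(), r1.value())     # 3 3 -- converged
```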


6.2 The Graph Traversal Pattern


by Marko A. Rodriguez and Peter Neubauer

Abstract

To many onlookers, it may seem that the NoSQL hype is solely focused on scaling data. Many NoSQL databases are designed such that they can horizontally scale with relative ease. This is accomplished by making use of data structures that are optimized for sharding. Such data have limited to no direct references between each other. Therefore, the problem of referential integrity does not exist and data can be massively parallelized. Examples of such systems include Amazon’s Dynamo, Google’s Big Table, Apache’s CouchDB, and so on.

In stark contrast to this design choice, on the other side of the NoSQL spectrum, there exists another design choice—the highly interconnected, direct referent data structure of the graph database. Graph databases allow users to solve problems by moving through their data in intelligent/directed ways and with an arbitrary depth. This style of data processing is known as the graph traversal pattern. This pattern is difficult to efficiently achieve with systems that only allow for the joining of data through the use of global indices. The graph traversal pattern is all about local, index-free traversals.

Citation

Marko A. Rodriguez and Peter Neubauer. 2010. The Graph Traversal Pattern. AT&T and NeoTechnology Technical Report, April 2010. CoRR abs/1004.1001 2010. http://arxiv.org/abs/1004.1001

Difficulty: +++


Download Link.


6.3 The Log-Structured Merge-Tree (LSM-Tree)


by Patrick O’Neil, et al.

Abstract

High-performance transaction system applications typically insert rows in a History table to provide an activity trace; at the same time the transaction system generates log records for purposes of system recovery. Both types of generated information can benefit from efficient indexing. An example in a well-known setting is the TPC-A benchmark application, modified to support efficient queries on the History for account activity for specific accounts. This requires an index by account-id on the fast-growing History table. Unfortunately, standard disk-based index structures such as the B-tree will effectively double the I/O cost of the transaction to maintain an index such as this in real time, increasing the total system cost up to fifty percent. Clearly a method for maintaining a real-time index at low cost is desirable. The Log-Structured Merge-tree (LSM-tree) is a disk-based data structure designed to provide low-cost indexing for a file experiencing a high rate of record inserts (and deletes) over an extended period. The LSM-tree uses an algorithm that defers and batches index changes, cascading the changes from a memory-based component through one or more disk components in an efficient manner reminiscent of merge sort. During this process all index values are continuously accessible to retrievals (aside from very short locking periods), either through the memory component or one of the disk components. The algorithm has greatly reduced disk arm movements compared to traditional access methods such as B-trees, and will improve cost-performance in domains where disk arm costs for inserts with traditional access methods overwhelm storage media costs. The LSM-tree approach also generalizes to operations other than insert and delete. However, indexed finds requiring immediate response will lose I/O efficiency in some cases, so the LSM-tree is most useful in applications where index inserts are more common than finds that retrieve the entries. This seems to be a common property for History tables and log files, for example. The conclusions of Section 6 compare the hybrid use of memory and disk components in the LSM-tree access method with the commonly understood advantage of the hybrid method to buffer disk pages in memory.

Citation

O’Neil, P., Cheng, E., Gawlick, D., and O’Neil, E. 1996. The log-structured merge-tree (LSM-tree). Acta Inf. 33, 4 (Jun. 1996), 351-385. DOI= http://dx.doi.org/10.1007/s002360050048

Difficulty: +++


Download Link.
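A toy Python sketch of the two-component idea (mine, and far simpler than the paper’s rolling merge): batch inserts in a memory component, flush it as an immutable sorted run when it fills, and serve reads from the memtable and then the runs, newest first.

```python
import bisect

class TinyLSM:
    """Toy LSM-tree: writes go to an in-memory component (memtable); when
    full it is flushed as an immutable sorted run. Reads check the
    memtable, then the runs newest-first. A real LSM-tree also merges
    runs in the background to bound read cost."""
    def __init__(self, memtable_limit=4):
        self.memtable, self.runs, self.limit = {}, [], memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.runs.append(sorted(self.memtable.items()))  # flush
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):          # newest run wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = TinyLSM()
for i in range(10):
    db.put(f"k{i}", i)
print(db.get("k3"), db.get("k9"))  # 3 9
```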


7 Internet-Scale Systems


7.1 On Designing and Deploying Internet-Scale Services


by James Hamilton

Abstract

The system-to-administrator ratio is commonly used as a rough metric to understand administrative costs in high-scale services. With smaller, less automated services this ratio can be as low as 2:1, whereas on industry leading, highly automated services, we’ve seen ratios as high as 2,500:1. Within Microsoft services, Autopilot is often cited as the magic behind the success of the Windows Live Search team in achieving high system-to-administrator ratios. While auto-administration is important, the most important factor is actually the service itself. Is the service efficient to automate? Is it what we refer to more generally as operations-friendly? Services that are operations-friendly require little human intervention, and both detect and recover from all but the most obscure failures without administrative intervention. This paper summarizes the best practices accumulated over many years in scaling some of the largest services at MSN and Windows Live.

Citation

Hamilton, J. 2007. On designing and deploying internet-scale services. In Proceedings of the 21st Conference on Large installation System Administration Conference (Dallas, November 11-16, 2007). P. Anderson, Ed. USENIX Association, Berkeley, CA, 1-12.

Difficulty:


Download Link.


7.2 The Process Group Approach to Reliable Distributed Computing


by Kenneth P. Birman

Abstract

One might expect the reliability of a distributed system to correspond directly to the reliability of its constituents, but this is not always the case. The mechanisms used to structure a distributed system and to implement cooperation between components play a vital role in determining the reliability of the system. Many contemporary distributed operating systems have placed emphasis on communication performance, overlooking the need for tools to integrate components into a reliable whole. The communication primitives supported give generally reliable behavior, but exhibit problematic semantics when transient failures or system configuration changes occur. The resulting building blocks are, therefore, unsuitable for facilitating the construction of systems where reliability is important.

This article reviews 10 years of research on ISIS, a system that provides tools to support the construction of reliable distributed software. The thesis underlying ISIS is that development of reliable distributed software can be simplified using process groups and group programming tools. This article describes the approach taken, surveys the system, and discusses experiences with real applications.

Citation

Birman, K. P. 1993. The process group approach to reliable distributed computing. Commun. ACM 36, 12 (Dec. 1993), 37-53. DOI= http://doi.acm.org/10.1145/163298.163303

Difficulty: +


Download Link.


8 Vector Clocks


9 Bloom Filter


9.1 Space/time trade-offs in hash coding with allowable errors


by Burton H. Bloom

Abstract

In this paper trade-offs among certain computational factors in hash coding are analyzed. The paradigm problem considered is that of testing a series of messages one-by-one for membership in a given set of messages. Two new hash-coding methods are examined and compared with a particular conventional hash-coding method. The computational factors considered are the size of the hash area (space), the time required to identify a message as a nonmember of the given set (reject time), and an allowable error frequency.

Citation

Bloom, B. H. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (Jul. 1970), 422-426. DOI= http://doi.acm.org/10.1145/362686.362692

Difficulty:


Download Link.
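A minimal Python sketch of the data structure now known as a Bloom filter (the parameter choices and hashing scheme are mine): k hash functions set k bits on insert, and a membership query reports “possibly present” only if all k bits are set. False positives are possible; false negatives are not.

```python
import hashlib

class BloomFilter:
    """Bit array plus k hash functions; 'add' sets k bits and
    'might_contain' checks them."""
    def __init__(self, m_bits=1024, k=3):
        self.m, self.k, self.bits = m_bits, k, bytearray(m_bits)

    def _positions(self, item):
        # Derive k positions by salting one cryptographic hash (a
        # convenience choice for the sketch, not the paper's method).
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.add("cassandra")
print(bf.might_contain("cassandra"))  # True
print(bf.might_contain("dynamo"))     # False (with high probability)
```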


9.2 Scalable Bloom Filters


by Paulo Sérgio Almeida, et al.

Abstract

Bloom Filters provide space-efficient storage of sets at the cost of a probability of false positives on membership queries. The size of the filter must be defined a priori based on the number of elements to store and the desired false positive probability, being impossible to store extra elements without increasing the false positive probability. This leads typically to a conservative assumption regarding maximum set size, possibly by orders of magnitude, and a consequent space waste. This paper proposes Scalable Bloom Filters, a variant of Bloom Filters that can adapt dynamically to the number of elements stored, while assuring a maximum false positive probability.

Citation

Almeida, P. S., Baquero, C., Preguiça, N., and Hutchison, D. 2007. Scalable Bloom Filters. Inf. Process. Lett. 101, 6 (Mar. 2007), 255-261. DOI= http://dx.doi.org/10.1016/j.ipl.2006.10.007

Difficulty:


Download Link.
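A rough Python sketch of the growth mechanism (mine; it reuses the BloomFilter class from the sketch under 9.1, and a crude fill heuristic stands in for the paper’s careful error-probability bookkeeping): when the active filter fills up, freeze it and start a larger one, and have queries check every slice.

```python
class ScalableBloom:
    """Scalable Bloom filter sketch. Assumes the BloomFilter class from
    the 9.1 sketch is in scope. Per the paper, each new slice should get
    a tighter false-positive bound so the compound rate stays capped;
    here a fixed size growth factor merely gestures at that."""
    def __init__(self, initial_bits=256, growth=2):
        self.slices = [BloomFilter(initial_bits)]
        self.counts, self.growth = [0], growth

    def add(self, item):
        # Crude capacity heuristic: ~m/10 items per slice (assumption).
        if self.counts[-1] >= self.slices[-1].m // 10:
            self.slices.append(BloomFilter(self.slices[-1].m * self.growth))
            self.counts.append(0)
        self.slices[-1].add(item)
        self.counts[-1] += 1

    def might_contain(self, item):
        return any(s.might_contain(item) for s in self.slices)

sbf = ScalableBloom()
for i in range(100):                  # far beyond the first slice's capacity
    sbf.add(f"key-{i}")
print(sbf.might_contain("key-42"), len(sbf.slices))  # True, several slices
```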


9.4 Less Hashing, Same Performance: Building a Better Bloom Filter


by Adam Kirsch and Michael Mitzenmacher

Abstract

A standard technique from the hashing literature is to use two hash functions h1(x) and h2(x) to simulate additional hash functions of the form gi(x) = h1(x) + ih2(x). We demonstrate that this technique can be usefully applied to Bloom filters and related data structures. Specifically, only two hash functions are necessary to effectively implement a Bloom filter without any loss in the asymptotic false positive probability. This leads to less computation and potentially less need for randomness in practice.

Citation

Kirsch, A. and Mitzenmacher, M. 2006. Less Hashing, Same Performance: Building a Better Bloom Filter. In Proceedings of the 14th Annual European Symposium on Algorithms (ESA), pp. 456-467.

Difficulty:


Download Link.
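The abstract’s gi(x) = h1(x) + i·h2(x) trick is easy to show directly. In this Python sketch (mine), the two base hashes are carved out of a single SHA-256 digest, and h2 is forced odd so that with a power-of-two table size m the k probe positions stay distinct.

```python
import hashlib

def double_hash_positions(item, m, k):
    """Simulate k hash functions from two, per Kirsch and Mitzenmacher:
    g_i(x) = (h1(x) + i * h2(x)) mod m."""
    d = hashlib.sha256(str(item).encode()).digest()
    h1 = int.from_bytes(d[:8], "big")
    h2 = int.from_bytes(d[8:16], "big") | 1   # odd stride covers a 2^n table
    return [(h1 + i * h2) % m for i in range(k)]

print(double_hash_positions("dynamo", m=1024, k=5))  # 5 bit positions
```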


9.5 Cache Efficient Bloom Filter


by Felix Putze, Peter Sanders, and Johannes Singler

Abstract

A Bloom filter is a very compact data structure that supports approximate membership queries on a set, allowing false positives.

We propose several new variants of Bloom filters and replacements with similar functionality. All of them have a better cache-efficiency and need less hash bits than regular Bloom filters. Some use SIMD functionality, while the others provide an even better space efficiency. As a consequence, we get a more flexible trade-off between false positive rate, space-efficiency, cache-efficiency, hash-efficiency, and computational effort. We analyze the efficiency of Bloom filters and the proposed replacements in detail, in terms of the false positive rate, the number of expected cache-misses, and the number of required hash bits. We also describe and experimentally evaluate the performance of highly-tuned implementations. For many settings, our alternatives perform better than the methods proposed so far.

Citation

Putze, F., Sanders, P., and Singler, J. 2009. Cache-, hash-, and space-efficient bloom filters. J. Exp. Algorithmics 14 (Dec. 2009), 4.4-4.18. DOI= http://doi.acm.org/10.1145/1498698.1594230

Difficulty:


Download Link.


10 Schemes for the usage of memory & disk


11 Gossip Protocol


11.1 Efficient Reconciliation And Flow Control For Anti-Entropy Protocols


by Robbert van Renesse, Dan Dumitriu, Valient Gough, and Chris Thomas

Abstract

The paper shows that anti-entropy protocols can process only a limited rate of updates, and proposes and evaluates a new state reconciliation mechanism as well as a flow control scheme for anti-entropy protocols.

Citation

van Renesse, R., Dumitriu, D., Gough, V., and Thomas, C. 2008. Efficient reconciliation and flow control for anti-entropy protocols. In Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware (Yorktown Heights, New York, September 15-17, 2008). LADIS ’08, vol. 341. ACM, New York, NY, 1-7. DOI= http://doi.acm.org/10.1145/1529974.1529983

Difficulty:


Download Link.
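A tiny Python sketch of one anti-entropy style of reconciliation (mine, far simpler than the paper’s protocols): in each round every replica push-pull gossips with one random peer, and for each key the higher-versioned value wins.

```python
import random

def anti_entropy_round(stores):
    """One push-pull gossip round: each node reconciles its key/value map
    with one random peer, keeping the higher version per key."""
    for node in stores:
        peer = random.choice([s for s in stores if s is not node])
        for key in set(node) | set(peer):
            best = max(node.get(key, (0, None)), peer.get(key, (0, None)))
            node[key] = peer[key] = best

# Three replicas; values are (version, payload) pairs.
replicas = [{"x": (1, "old")}, {"x": (2, "new")}, {}]
for _ in range(4):                # a few rounds suffice w.h.p.
    anti_entropy_round(replicas)
print([r["x"] for r in replicas]) # all (2, 'new') with high probability
```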


11.2 A Weakly Coupled Adaptive Gossip Protocol for Application Level Active Networks


by Ibiso Wokoma, Ioannis Liabotis, Ognjen Prnjat, Lionel Sacks, and Ian Marshall

Abstract

With the sharp increase in heterogeneity and distribution of elements in wide-area networks, more flexible, efficient and autonomous approaches for management and information distribution are needed. This paper proposes a novel approach, based on gossip protocols and firefly synchronization theory, for the management policy distribution and synchronization over a number of nodes in an Application Level Active Network (ALAN). The work is presented in the context of the IST project ANDROID (Active Network Distributed Open Infrastructure Development), which is developing an autonomous policy-based management system for ALAN. The preliminary simulation results suggest that with the appropriately optimized parameters, the algorithms developed are scalable, can work effectively in a realistic random network, and allow the policy updates to be distributed efficiently throughout the active network with a lower latency than other similar types of gossip protocols.

Citation

Wokoma, I., Liabotis, I., Prnjat, O., Sacks, L., and Marshall, I. 2002. A Weakly Coupled Adaptive Gossip Protocol for Application Level Active Networks. In Proceedings of the 3rd international Workshop on Policies For Distributed Systems and Networks (Policy’02) (June 05-07, 2002). POLICY. IEEE Computer Society, Washington, DC, 244.

Difficulty:


Download Link.


12 Consistent Hashing


12.1 Consistent Hashing And Random Trees


by David Karger, Eric Lehman, Tom Leighton, Matthew Levine, Daniel Lewin, and Rina Panigrahy

Abstract

We describe a family of caching protocols for distributed networks that can be used to decrease or eliminate the occurrence of hot spots in the network. Our protocols are particularly designed for use with very large networks such as the Internet, where delays caused by hot spots can be severe, and where it is not feasible for every server to have complete information about the current state of the entire network. The protocols are easy to implement using existing network protocols such as TCP/IP, and require very little overhead. The protocols work with local control, make efficient use of existing resources, and scale gracefully as the network grows.

Our caching protocols are based on a special kind of hashing that we call consistent hashing. Roughly speaking, a consistent hash function is one which changes minimally as the range of the function changes. Through the development of good consistent hash functions, we are able to develop caching protocols which do not require users to have a current or even consistent view of the network. We believe that consistent hash functions may eventually prove to be useful in other applications such as distributed name servers and/or quorum systems.

Citation

Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., and Lewin, D. 1997. Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the Twenty-Ninth Annual ACM Symposium on theory of Computing (El Paso, Texas, United States, May 04-06, 1997). STOC ’97. ACM, New York, NY, 654-663. DOI= http://doi.acm.org/10.1145/258533.258660

Difficulty:


Download Link.
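A compact Python sketch of a consistent-hash ring (mine; the virtual-node count and hash choice are arbitrary): nodes are hashed to many points on a ring, and a key is owned by the first node point at or after the key’s hash. Adding or removing a node moves only the keys in its arcs, which is the “minimal change” property the abstract describes.

```python
import bisect, hashlib

def _hash(key):
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class HashRing:
    """Consistent hashing with virtual nodes for smoother balance."""
    def __init__(self, nodes, vnodes=16):
        points = sorted((_hash(f"{n}#{v}"), n)
                        for n in nodes for v in range(vnodes))
        self.hashes = [h for h, _ in points]
        self.nodes = [n for _, n in points]

    def node_for(self, key):
        # First ring point clockwise of the key's hash, wrapping around.
        i = bisect.bisect(self.hashes, _hash(key)) % len(self.hashes)
        return self.nodes[i]

ring = HashRing(["a", "b", "c"])
print(ring.node_for("user:42"))   # one of 'a', 'b', 'c', stable per key
```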


13 Failure Detection


13.1 The φ Accrual Failure Detector


by Naohiro Hayashibara, Xavier Defago, Rami Yared, and Takuya Katayama

Abstract

The detection of failures is a fundamental issue for fault-tolerance in distributed systems. Recently, many people have come to realize that failure detection ought to be provided as some form of generic service, similar to IP address lookup or time synchronization. However, this has not been successful so far; one of the reasons being the fact that classical failure detectors were not designed to satisfy several application requirements simultaneously.

We present a novel abstraction, called accrual failure detectors, that emphasizes flexibility and expressiveness and can serve as a basic building block to implementing failure detectors in distributed systems. Instead of providing information of a binary nature (trust vs. suspect), accrual failure detectors output a suspicion level on a continuous scale. The principal merit of this approach is that it favors a nearly complete decoupling between application requirements and the monitoring of the environment.

In this paper, we describe an implementation of such an accrual failure detector, which we call the φ failure detector. The particularity of the φ failure detector is that it dynamically adjusts to current network conditions the scale on which the suspicion level is expressed. We analyzed the behavior of our φ failure detector over an intercontinental communication link for a week. Our experimental results show that the φ failure detector performs equally well as other known adaptive failure detection mechanisms, with an improved flexibility.

Citation

Hayashibara, N., Defago, X., Yared, R., and Katayama, T. 2004. The φ Accrual Failure Detector. In Proceedings of the 23rd IEEE international Symposium on Reliable Distributed Systems (October 18-20, 2004). SRDS. IEEE Computer Society, Washington, DC, 66-78. DOI= http://doi.ieeecomputersociety.org/10.1109/RELDIS.2004.1353004

Difficulty:


Download Link.
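A back-of-the-envelope Python sketch of the accrual idea (mine; the paper estimates the heartbeat inter-arrival distribution more carefully): model the gaps as roughly normal and report φ = -log10 of the probability that the awaited heartbeat is merely late rather than lost.

```python
import math, statistics

def phi(now, last_arrival, intervals):
    """Suspicion level: phi = -log10(P(next heartbeat arrives later than
    now)), with inter-arrival times modeled as a normal distribution
    (a common simplification of the paper's estimator)."""
    mean = statistics.mean(intervals)
    std = statistics.pstdev(intervals) or 1e-6
    t = now - last_arrival
    # Normal tail probability via the complementary error function.
    p_later = 0.5 * math.erfc((t - mean) / (std * math.sqrt(2)))
    return -math.log10(max(p_later, 1e-300))

gaps = [1.0, 1.1, 0.9, 1.0, 1.05]   # observed inter-arrival gaps (seconds)
print(phi(now=11.0, last_arrival=10.0, intervals=gaps))  # low suspicion
print(phi(now=15.0, last_arrival=10.0, intervals=gaps))  # very high suspicion
```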


13.2 Unreliable Failure Detectors For Reliable Distributed Systems


by Tushar D. Chandra and Sam Toueg

Abstract

We introduce the concept of unreliable failure detectors and study how they can be used to solve Consensus in asynchronous systems with crash failures. We characterize unreliable failure detectors in terms of two properties: completeness and accuracy. We show that Consensus can be solved even with unreliable failure detectors that make an infinite number of mistakes, and determine which ones can be used to solve Consensus despite any number of crashes, and which ones require a majority of correct processes. We prove that Consensus and Atomic Broadcast are reducible to each other in asynchronous systems with crash failures; thus the above results also apply to Atomic Broadcast. A companion paper shows that one of the failure detectors introduced here is the weakest failure detector for solving Consensus [Chandra et al. 1992].

Citation

Chandra, T. D. and Toueg, S. 1996. Unreliable failure detectors for reliable distributed systems. J. ACM 43, 2 (Mar. 1996), 225-267. DOI= http://doi.acm.org/10.1145/226643.226647

Chandra, T. D. and Toueg, S. 1991. Unreliable failure detectors for asynchronous systems (preliminary version). In Proceedings of the Tenth Annual ACM Symposium on Principles of Distributed Computing (Montreal, Quebec, Canada, August 19-21, 1991). PODC ’91. ACM, New York, NY, 325-340. DOI= http://doi.acm.org/10.1145/112600.112627

Difficulty:


Download Link.


13.3 The Weakest Failure Detector for Solving Consensus


by Tushar D. Chandra, Vassos Hadzilacos, and Sam Toueg

Abstract

We determine what information about failures is necessary and sufficient to solve Consensus in asynchronous distributed systems subject to crash failures. In Chandra and Toueg [1996], it is shown that ◇W, a failure detector that provides surprisingly little information about which processes have crashed, is sufficient to solve Consensus in asynchronous systems with a majority of correct processes. In this paper, we prove that to solve Consensus, any failure detector has to provide at least as much information as ◇W. Thus, ◇W is indeed the weakest failure detector for solving Consensus in asynchronous systems with a majority of correct processes.

Citation

Chandra, T. D., Hadzilacos, V., and Toueg, S. 1996. The weakest failure detector for solving consensus. J. ACM 43, 4 (Jul. 1996), 685-722. DOI= http://doi.acm.org/10.1145/234533.234549

Chandra, T. D., Hadzilacos, V., and Toueg, S. 1992. The weakest failure detector for solving consensus. In Proceedings of the Eleventh Annual ACM Symposium on Principles of Distributed Computing (Vancouver, British Columbia, Canada, August 10-12, 1992). PODC ’92. ACM, New York, NY, 147-158. DOI= http://doi.acm.org/10.1145/135419.135451

Difficulty:


Download Link.


13.4 Optimal Implementation of the Weakest Failure Detector for Solving Consensus


by Mikel Larrea, Antonio Fernández, and Sergio Arévalo

Abstract

The concept of unreliable failure detector was introduced by Chandra and Toueg [2] as a mechanism that provides information about process failures. Depending on the properties the failure detectors guarantee, they proposed a taxonomy of failure detectors. It has been shown that one of the classes of this taxonomy, namely Eventually Strong (◇S), is the weakest class allowing to solve the Consensus problem. In this paper, we present a new algorithm implementing ◇S. Our algorithm guarantees that eventually all the correct processes agree on a common correct process. This property trivially allows us to provide the accuracy and completeness properties required by ◇S. We show, then, that our algorithm is better than any other proposed implementation of ◇S in terms of the number of messages and the total amount of information periodically sent. In particular, previous algorithms require periodically exchanging at least a quadratic amount of information, while ours only requires O(n log n) (where n is the number of processes). However, we also propose a new measure to evaluate the efficiency of this kind of algorithm, the eventual monitoring degree, which does not rely on a periodic behavior and better expresses the degree of processing required by the algorithms. We show that the runs of our algorithm have optimal eventual monitoring degree.

Citation

Larrea, M., Fernández, A., and Arévalo, S. 2000. Optimal Implementation of the Weakest Failure Detector for Solving Consensus. In Proceedings of the 19th IEEE Symposium on Reliable Distributed Systems (October 16-18, 2000). SRDS. IEEE Computer Society, Washington, DC, 52.

Difficulty:


Download Link.


A Appendix


A.1 PNUTS: Yahoo!’s Hosted Data Serving Platform


by Brian F. Cooper, et al.

Abstract

We describe PNUTS, a massively parallel and geographically distributed database system for Yahoo!’s web applications. PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistency guarantees. It is a hosted, centrally managed, and geographically distributed service, and utilizes automated load-balancing and failover to reduce operational complexity. The first version of the system is currently serving in production. We describe the motivation for PNUTS and the design and implementation of its table storage and replication layers, and then present experimental results.

Citation

Cooper, B. F., Ramakrishnan, R., Srivastava, U., Silberstein, A., Bohannon, P., Jacobsen, H., Puz, N., Weaver, D., and Yerneni, R. 2008. PNUTS: Yahoo!’s hosted data serving platform. Proc. VLDB Endow. 1, 2 (Aug. 2008), 1277-1288. DOI= http://doi.acm.org/10.1145/1454159.1454167

Difficulty: +


Download Link.


A.2 Benchmarking Cloud Serving Systems with YCSB


by Brian F. Cooper, et al.

Abstract

While the use of MapReduce systems (such as Hadoop) for large scale data analysis has been widely recognized and studied, we have recently seen an explosion in the number of systems developed for cloud data serving. These newer systems address “cloud OLTP” applications, though they typically do not support ACID transactions. Examples of systems proposed for cloud serving use include BigTable, PNUTS, Cassandra, HBase, Azure, CouchDB, SimpleDB, Voldemort, and many others. Further, they are being applied to a diverse range of applications that differ considerably from traditional (e.g., TPC-C like) serving workloads. The number of emerging cloud serving systems and the wide range of proposed applications, coupled with a lack of apples-to-apples performance comparisons, makes it difficult to understand the tradeoffs between systems and the workloads for which they are suited. We present the Yahoo! Cloud Serving Benchmark (YCSB) framework, with the goal of facilitating performance comparisons of the new generation of cloud data serving systems. We define a core set of benchmarks and report results for four widely used systems: Cassandra, HBase, Yahoo!’s PNUTS, and a simple sharded MySQL implementation. We also hope to foster the development of additional cloud benchmark suites that represent other classes of applications by making our benchmark tool available via open source. In this regard, a key feature of the YCSB framework/tool is that it is extensible—it supports easy definition of new workloads, in addition to making it easy to benchmark new systems.

Citation

Cooper, B. F., Silberstein, A., Tam, E., Ramakrishnan, R., and Sears, R. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing (Indianapolis, Indiana, USA, June 10-11, 2010). SoCC ’10. ACM, New York, NY, 143-154. DOI= http://doi.acm.org/10.1145/1807128.1807152

Difficulty: +


Download Link.