David Chappell



Cloud Platform Storage: Relational vs. Scale-Out  
# Friday, February 27, 2009
All of today’s most visible cloud platforms provide a scalable storage mechanism. Google’s AppEngine has the Datastore, Amazon Web Services has SimpleDB, Microsoft’s Windows Azure has Tables, and Salesforce.com’s Force platform has an object store. All of them offer similar things: hierarchical rather than table-based storage, a straightforward schema-less approach, and a simple query language.

But why? Why don’t these platforms just provide ordinary relational storage? While their hierarchical approach has some strengths—it’s simple and flexible—it also has a host of weaknesses. Here are some of the big ones:
  • Because they don’t provide ordinary tables, these storage technologies are harder for developers to understand and use. It also requires work to move data between familiar on-premises relational databases and hierarchical cloud datastores. For example, achieving good performance probably means organizing your data hierarchy to optimize for your app’s most common queries, something that differs from the usual relational approach.
  • None of them support standard SQL. This adds to the unfamiliarity, and it also means that useful things like joins and aggregates aren’t generally available. And each platform has its own query language, making life more difficult for developers and increasing platform lock-in.
  • The lack of standard relational data means that existing tools for working with that data, such as reporting services, can’t easily be used.
  • Because there’s no schema, programs will contain more errors. Rather than relying on the database to catch attempts to, say, store a string in an integer field, you’ll have to find these yourself.
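To make that last point concrete, here is a minimal sketch of the checking an application ends up doing itself when the store has no schema. Everything in it is an assumption for illustration: `CUSTOMER_SCHEMA`, `validate`, and `put` are hypothetical names, and a plain Python dict stands in for the datastore. None of it is the API of any real cloud platform.

```python
# Hypothetical schema: field name -> required Python type.
CUSTOMER_SCHEMA = {"name": str, "age": int}


def validate(entity, schema):
    """Return a list of problems; an empty list means the entity conforms."""
    problems = []
    for field, expected in schema.items():
        if field not in entity:
            problems.append(f"missing field: {field}")
        elif not isinstance(entity[field], expected):
            problems.append(
                f"{field}: expected {expected.__name__}, "
                f"got {type(entity[field]).__name__}"
            )
    return problems


def put(store, key, entity, schema):
    """Store the entity only if it passes the application-level check."""
    problems = validate(entity, schema)
    if problems:
        raise ValueError("; ".join(problems))
    store[key] = entity


store = {}  # stand-in for a schema-less datastore (assumption)
put(store, "c1", {"name": "Ada", "age": 36}, CUSTOMER_SCHEMA)
```

A relational database would reject a string in an integer column on its own. Here, the same mistake gets through unless every code path that writes data remembers to call something like `validate`.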

These are significant limitations, and they raise a big question: Why would anybody use these things? Why don’t the cloud platform vendors just give us relational storage with SQL rather than these limited approaches?

The reason is that nobody seems to know how to make a relational DBMS scale to hold really massive amounts of data. You certainly can make a relational system handle more and more data by running it on ever-larger machines, but it’s much harder to do this by replicating relational data across multiple machines. In other words, traditional relational databases scale up, but they’re hard to scale out.

One way to scale out with a relational DBMS is to divide your data across multiple instances of the DBMS. Maybe all customers whose names start with “A” are in one instance, all whose names start with “B” are in the next instance, and so on. This approach, sometimes referred to as sharding, can work. But it’s hard to administer, and think about what it gives up: You lose the familiar all-in-one relational world, you can’t do SQL queries across different instances (which means you lose joins and aggregates), you can’t easily use reporting tools across different instances, and you no longer have an automatically maintained common schema across your entire database.
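A rough sketch of what this looks like in code, assuming two in-memory SQLite databases as stand-ins for separate DBMS instances. The shard names, the `customers` table, and the helper functions are all made up for illustration. Real sharding schemes usually route on something more even than the first letter of a name.

```python
import sqlite3

# One in-memory SQLite database per shard; a real deployment would use
# separate DBMS instances on separate machines (assumption for the sketch).
SHARD_KEYS = ["A-M", "N-Z"]
shards = {key: sqlite3.connect(":memory:") for key in SHARD_KEYS}

# The schema has to be created, and kept in sync, on every shard by hand.
for conn in shards.values():
    conn.execute("CREATE TABLE customers (name TEXT, balance INTEGER)")


def shard_for(name):
    """Route a customer to a shard by the first letter of the name."""
    return shards["A-M"] if name[0].upper() <= "M" else shards["N-Z"]


def add_customer(name, balance):
    conn = shard_for(name)
    conn.execute("INSERT INTO customers VALUES (?, ?)", (name, balance))


def total_balance():
    """Cross-shard aggregate: no single SQL query spans the shards, so the
    application queries each one and combines the partial sums itself."""
    total = 0
    for conn in shards.values():
        (partial,) = conn.execute(
            "SELECT COALESCE(SUM(balance), 0) FROM customers"
        ).fetchone()
        total += partial
    return total


add_customer("Alice", 100)
add_customer("Bob", 50)
add_customer("Zed", 25)
```

Within one shard, ordinary SQL still works. The moment a question spans shards, as `total_balance` does, the aggregation moves out of the database and into application code, and a join across shards would have to be hand-written the same way.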

Does this list of problems sound familiar? It should, since it mirrors what you lose with the hierarchical storage mechanisms provided by cloud platforms. These hierarchical stores are all focused on providing massively scalable storage, which means scale-out storage. Just like sharded databases, they trade off functionality for scalability.

Perhaps we’ll one day see cloud platforms offer scale-out storage that provides everything we now get in traditional relational databases. At the moment, however, the state of the art in scale-out storage seems to require giving up much of what we’re used to with SQL and relational databases.

So when should you use a cloud platform’s scale-out storage? In some cases, such as Google AppEngine, there’s no choice: All persistent data is kept in the Datastore. But other platforms have options. Amazon Web Services, for instance, offers both SimpleDB for scale-out storage and the ability to run a standard relational DBMS such as Oracle or SQL Server in a virtual machine. This latter option won’t scale as well, but it does give you a full relational system.

I’d argue that scale-out storage is so limited that it should be used only when your app requires enormous scalability. Giving up the advantages of relational databases makes sense only when the trade-off between functionality and scalability is worth it. So far, my sense is that not many cloud apps require this—using SimpleDB appears to be much less popular than running a relational database in a VM, for example.

Relational storage is a wonderful thing. It won’t always solve your problem, and so embracing the limitations of scale-out storage is sometimes necessary. But unless your cloud app needs massive scale, the relational option still makes plenty of sense.

5 comments



There is the question of transport from the client application to the cloud data service. The familiar model that developers follow today is REST (GET/POST/PUT/DELETE) operations on documents in the data service. These operations can cover much of what a data-driven application needs, and in those cases a simple data service can meet the application's requirements. So I would add the Rich Internet Application use case to the list of applications that make sense to build with a simple data service approach. Use a simple data service when you want huge scalability and/or when you want to build Rich Internet Applications with technologies like Ajax, Flash, Silverlight, and JavaFX.

I think the scale-out factor is only one axis of cloud-based storage. Another very important parameter is its higher resilience compared to an RDBMS. With cloud-based storage you can distribute data quite easily across multiple geographical locations, with as many copies as you need to meet the resilience numbers in the SLA. Additionally, you can achieve all of that on commodity servers, which lowers the provider’s cost even further. I doubt this is possible at even a fraction of that scale on current RDBMS systems, where higher resilience is bought with specialized and very costly hardware. The complexity of doing the same with an RDBMS would be simply enormous. The main question for me, therefore, is whether the majority of enterprise applications really need this kind of support, enough to justify a hierarchy-based data model that usually does not feel natural.

A couple of thoughts:

Jerry, you're right that a RESTful interface is useful. You don't have to give up a standard relational store to get this, however. On Windows, for instance, ADO.NET Data Services can put a RESTful head on a variety of different storage bodies, including relational stores. (In fact, Windows Azure tables use ADO.NET Data Services to expose their RESTful interface.) A cloud relational store that also provided a RESTful interface would address your concern without giving up the benefits of our familiar relational world.

And Libor, scale-out storage certainly does provide higher reliability than traditional relational databases. Still, the great majority of apps today are happy with the reliability they get from existing relational storage technologies. Why give up all of the benefits of relational storage for more of something that you don't really need? I'd be surprised if most developers do this.

Simple question: does the recent announcement from the Microsoft SDS team change your view on this?


No. It's a great thing for the Azure platform to include relational storage--it's an essential service. But I'd expect the SDS relational storage to be significantly less scalable than Windows Azure tables, Microsoft's scale-out storage offering.
