David Chappell

Opinari

Cloud Platform Storage: Relational vs. Scale-Out  
# Friday, February 27, 2009
 
All of today’s most visible cloud platforms provide a scalable storage mechanism. Google’s AppEngine has the Datastore, Amazon Web Services has SimpleDB, Microsoft’s Windows Azure has Tables, and Salesforce.com’s Force platform has an object store. All of them offer similar things: hierarchical rather than table-based storage, a straightforward schema-less approach, and a simple query language.

But why? Why don’t these platforms just provide ordinary relational storage? While their hierarchical approach has some strengths—it’s simple and flexible—it also has a host of weaknesses. Here are some of the big ones:
  • Because they don’t provide ordinary tables, these storage technologies are harder for developers to understand and use. For example, achieving good performance probably means organizing your data hierarchy to optimize for your app’s most common queries, something that differs from the usual relational approach. Moving data between familiar on-premises relational databases and hierarchical cloud datastores also takes work.
  • None of them support standard SQL. This adds to the unfamiliarity, and it also means that useful things like joins and aggregates aren’t generally available. And each platform has its own query language, making life more difficult for developers and increasing platform lock-in.
  • The lack of standard relational data means that existing tools for working with that data, such as reporting services, can’t easily be used.
  • Because there’s no schema, the store won’t catch type errors for you, so more of them are likely to slip into programs. Rather than relying on the database to catch attempts to, say, store a string in an integer field, you’ll have to find these mistakes yourself.
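
To make the last point concrete, here is a minimal sketch of the checking an application ends up doing for itself when the store has no schema. Everything in it is hypothetical: the `CUSTOMER_SCHEMA` mapping, the `validate` and `put` helpers, and the in-memory list standing in for a datastore are illustrations, not any platform’s actual API.

```python
# Hypothetical application-side schema: field name -> expected Python type.
CUSTOMER_SCHEMA = {"name": str, "age": int}

# Stand-in for a schema-less store; a real one would accept any value.
datastore = []


def validate(entity, schema):
    """Return a list of problems; an empty list means the entity matches."""
    errors = []
    for field, expected in schema.items():
        if field not in entity:
            errors.append(f"missing field: {field}")
        elif not isinstance(entity[field], expected):
            actual = type(entity[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    return errors


def put(entity):
    """Store the entity only if it passes the application's own checks."""
    errors = validate(entity, CUSTOMER_SCHEMA)
    if errors:
        raise ValueError("; ".join(errors))
    datastore.append(entity)


put({"name": "Ann", "age": 41})        # accepted
# put({"name": "Bob", "age": "41"})    # would raise ValueError: a string in an integer field
```

A relational database would reject the second call on its own; here the check exists only because the application wrote it.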

These are significant limitations, and they raise a big question: Why would anybody use these things? Why don’t the cloud platform vendors just give us relational storage with SQL rather than these limited approaches?

The reason is that nobody seems to know how to make a relational DBMS scale to hold really massive amounts of data. You certainly can make a relational system handle more and more data by running it on ever-larger machines, but it’s much harder to do this by spreading relational data across multiple machines. In other words, traditional relational databases scale up, but they’re hard to scale out.

One way to scale out with a relational DBMS is to divide your data across multiple instances of the DBMS. Maybe all customers whose names start with “A” are in one instance, all whose names start with “B” are in the next instance, and so on. This approach, sometimes referred to as sharding, can work. But it’s hard to administer, and think about what it gives up: You lose the familiar all-in-one relational world, you can’t do SQL queries across different instances (which means you lose joins and aggregates), you can’t easily use reporting tools across different instances, and you no longer have an automatically maintained common schema across your entire database.
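
The first-letter scheme described above can be sketched in a few lines. This is an illustration under stated assumptions, not a recommended design: in-memory SQLite databases stand in for separate DBMS instances, the `customers` table and the helper names are made up, and customer names are assumed to start with an ASCII letter.

```python
import sqlite3
import string

# One in-memory SQLite database per letter stands in for a separate DBMS instance.
shards = {letter: sqlite3.connect(":memory:") for letter in string.ascii_uppercase}
for conn in shards.values():
    conn.execute("CREATE TABLE customers (name TEXT, balance INTEGER)")


def shard_for(name):
    """Route a customer to an instance by the first letter of the name."""
    return shards[name[0].upper()]


def add_customer(name, balance):
    shard_for(name).execute("INSERT INTO customers VALUES (?, ?)", (name, balance))


def total_balance():
    """No single SQL statement spans instances, so the application
    runs the aggregate on every shard and combines the results itself."""
    total = 0
    for conn in shards.values():
        (subtotal,) = conn.execute(
            "SELECT COALESCE(SUM(balance), 0) FROM customers"
        ).fetchone()
        total += subtotal
    return total


add_customer("Alice", 100)
add_customer("Adam", 50)
add_customer("Bob", 250)
print(total_balance())  # sum across all 26 instances
```

Queries that stay within one instance still use ordinary SQL. Anything that crosses instances, such as the total above or a join between customers in different shards, has to be written and maintained in application code.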

Does this list of problems sound familiar? It should, since it mirrors what you lose with the hierarchical storage mechanisms provided by cloud platforms. These hierarchical stores are all focused on providing massively scalable storage, which means scale-out storage. Just like sharded databases, they trade off functionality for scalability.

Perhaps we’ll one day see cloud platforms offer scale-out storage that provides everything we now get in traditional relational databases. At the moment, however, the state of the art in scale-out storage seems to require giving up much of what we’re used to with SQL and relational databases.

So when should you use a cloud platform’s scale-out storage? In some cases, such as Google AppEngine, there’s no choice: All persistent data is kept in the Datastore. But other platforms have options. Amazon Web Services, for instance, offers both SimpleDB for scale-out storage and the ability to run a standard relational DBMS such as Oracle or SQL Server in a virtual machine. This latter option won’t scale as well, but it does give you a full relational system.

I’d argue that scale-out storage is so limited that it should be used only when your app requires enormous scalability. Giving up the advantages of relational databases makes sense only when the trade-off between functionality and scalability is worth it. So far, my sense is that not many cloud apps require this—using SimpleDB appears to be much less popular than running a relational database in a VM, for example.

Relational storage is a wonderful thing. It won’t always solve your problem, and so embracing the limitations of scale-out storage is sometimes necessary. But unless your cloud app needs massive scale, the relational option still makes plenty of sense.


