These are significant limitations, and they raise a big question: Why would anybody use these things? Why don’t the cloud platform vendors just give us relational storage with SQL rather than these limited approaches?
The reason is that nobody seems to know how to make a relational DBMS scale to hold really massive amounts of data. You certainly can make a relational system handle more and more data by running it on ever-larger machines, but it’s much harder to do this by spreading relational data across multiple machines. In other words, traditional relational databases scale up, but they’re hard to scale out.
One way to scale out with a relational DBMS is to divide your data across multiple instances of the DBMS. Maybe all customers whose names start with “A” are in one instance, all whose names start with “B” are in the next instance, and so on. This approach, sometimes referred to as sharding, can work. But it’s hard to administer, and think about what it gives up: You lose the familiar all-in-one relational world, you can’t do SQL queries across different instances (which means you lose joins and aggregates), you can’t easily use reporting tools across different instances, and you no longer have an automatically maintained common schema across your entire database.
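To make the trade-off concrete, here is a minimal sketch of first-letter sharding in Python. Everything in it is an assumption made for illustration: in-memory SQLite databases stand in for separate DBMS instances, and the `customers` table, the three shard keys, and the helper names (`shard_for`, `add_customer`, `total_balance`) are all hypothetical. Note how the aggregate cannot be written as one SQL statement; the application has to query each shard and combine the partial results itself.

```python
import sqlite3

# Hypothetical setup: each "shard" is a separate in-memory SQLite database
# standing in for an independent DBMS instance.
SHARD_KEYS = ["A", "B", "C"]


def make_shard():
    """Create one shard. The schema must be created (and kept in sync)
    on every instance by hand; nothing maintains it automatically."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (name TEXT PRIMARY KEY, balance REAL)")
    return conn


shards = {key: make_shard() for key in SHARD_KEYS}


def shard_for(name):
    """Route a customer to a shard by the first letter of the name."""
    return shards[name[0].upper()]


def add_customer(name, balance):
    conn = shard_for(name)
    conn.execute("INSERT INTO customers VALUES (?, ?)", (name, balance))
    conn.commit()


def total_balance():
    """Cross-shard aggregate: no single SQL statement spans the instances,
    so the application queries each shard and sums the partial results."""
    total = 0.0
    for conn in shards.values():
        (partial,) = conn.execute(
            "SELECT COALESCE(SUM(balance), 0) FROM customers"
        ).fetchone()
        total += partial
    return total


add_customer("Alice", 100.0)
add_customer("Bob", 250.0)
add_customer("Carol", 50.0)
add_customer("Adam", 25.0)
```

Calling `total_balance()` here walks all three shards in application code. A join between customers on different shards would need the same kind of hand-written scatter-and-gather logic, which is exactly the functionality a single relational instance would have provided for free.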
Does this list of problems sound familiar? It should, since it mirrors what you lose with the hierarchical storage mechanisms provided by cloud platforms. These hierarchical stores are all focused on providing massively scalable storage, which means scale-out storage. Just like sharded databases, they trade off functionality for scalability.
Perhaps we’ll one day see cloud platforms offer scale-out storage that provides everything we now get in traditional relational databases. At the moment, however, the state of the art in scale-out storage seems to require giving up much of what we’re used to with SQL and relational databases.
So when should you use a cloud platform’s scale-out storage? In some cases, such as Google App Engine, there’s no choice: All persistent data is kept in the Datastore. But other platforms have options. Amazon Web Services, for instance, offers both SimpleDB for scale-out storage and the ability to run a standard relational DBMS such as Oracle or SQL Server in a virtual machine. This latter option won’t scale as well, but it does give you a full relational system.
I’d argue that scale-out storage is so limited that it should be used only when your app requires enormous scalability. Giving up the advantages of relational databases makes sense only when the trade-off between functionality and scalability is worth it. So far, my sense is that not many cloud apps require this. Using SimpleDB, for example, appears to be much less popular than running a relational database in a VM.
Relational storage is a wonderful thing. It won’t always solve your problem, and so embracing the limitations of scale-out storage is sometimes necessary. But unless your cloud app needs massive scale, the relational option still makes plenty of sense.