Recently Cloudera published the results of an internal benchmark comparing its own SQL-on-Hadoop implementation (Impala) against a carefully selected competition, composed of Hive and an undisclosed RDBMS, and showing that Impala outperforms both. As Gigaom’s Derrick Harris was quick to point out, beating Hive is nothing to write home about, since Hive is something of a reference (as in vanilla) SQL-on-Hadoop implementation these days.
I do think, however, that beating RDBMSs at their own game deserves some credit and can be a game changer. That is not to say that RDBMSs are going away anytime soon: I would definitely not throw away my transactional store and replace it with Hadoop to run my web store, thank you very much. I would, however, love to leverage that cheap and reliable HDFS storage for my reporting and analytics needs, using good old familiar SQL and adding some Hadoop processing while at it.
There is another point Derrick makes that I completely agree with: SQL-on-Hadoop is becoming a crowded and diverse space, and for good reason. So a meaningful benchmark would be one that compares apples to apples – SQL-on-Hadoop flavors against each other. There have been some interesting thoughts on that already, and while I agree with most of the points that Ofir makes, I have to point out the obvious: even if an industry benchmark is hard to imagine, that does not mean there can’t be a benchmark at all.
See how this works for RDF stores: there is a widely accepted benchmark for SPARQL, called the Berlin SPARQL Benchmark, organized and conducted by an independent academic third party that defines and controls it. All vendors are asked to participate and work with the organizers to fine-tune their implementations, and all benchmark definitions, data sets and results are made public.
What’s more, the benchmark goes beyond isolated TPC-X style metrics to compose meaningful use cases that can give end users an indication of which system would better fit their intended use. I think this is exactly the kind of thing that’s needed for SQL-on-Hadoop, so I hope somebody picks up the idea.
Truth be told, however, there’s more to be done before we get there. As much as I love the idea of SQL-on-Hadoop, I would not use it today. Why? Simple: ANSI SQL-92 compliance is still missing. Trivial, you say? Sure, why don’t you try writing your own JDBC connector then and see how that works for you, because the ones out there at the moment have some trouble dealing with that. And that is something that will hamper, if not break, your application. Yes, I know it’s close, but near compliance does not cut it. I do hope we’re getting there though, and that it’s just a matter of time.