A small step for Impala, a big step for SQL-on-Hadoop. More to come, hopefully.
Cloudera recently published the results of an internal benchmark comparing its own SQL-on-Hadoop implementation (Impala) against a carefully selected competition, composed of Hive and an undisclosed RDBMS, and showing that Impala outperforms both. As Gigaom’s Derrick Harris was quick to point out, beating Hive is nothing to write home about, as Hive is something of a reference (as in vanilla) SQL-on-Hadoop implementation these days.
I do think, however, that beating RDBMSs at their own game deserves some credit and can be a game changer. That is not to say that RDBMSs are going away anytime soon: I would definitely not throw away my transactional store and replace it with Hadoop to run my web store, thank you very much. I would, however, love to leverage that cheap and reliable HDFS storage for my reporting and analytics needs, using good old familiar SQL and adding some Hadoop processing while at it.
There is another point Derrick makes that I completely agree with: SQL-on-Hadoop is becoming a crowded and diverse space, and for good reasons. So a meaningful benchmark would be one that compares apples to apples: SQL-on-Hadoop flavors against each other. There have been some interesting thoughts on that already, and while I agree with most of the points that Ofir makes, I have to point out the obvious: even if an industry benchmark is hard to imagine, that does not mean there can’t be a benchmark at all.
See how this works for RDF stores: there is a widely accepted SPARQL benchmark, the Berlin SPARQL Benchmark (BSBM), organized and conducted by an independent academic third party that defines and controls it. All vendors are asked to participate and work with the organizers to fine-tune their implementations, and all benchmark definitions, data sets and results are made public.
What’s more, the benchmark goes beyond isolated TPC-X style metrics to compose meaningful use cases that give end users an indication of which system would better fit their intended use. I think this is exactly the kind of thing that’s needed for SQL-on-Hadoop, so I hope somebody picks up the idea.
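To make the "use case" idea a bit more concrete, here is a minimal sketch of what running a query mix as a unit could look like. Everything in it is an assumption for illustration: the `sales` table, the two queries, the names `QUERY_MIX`, `build_dataset` and `run_mix`, and the use of an in-memory SQLite database as a stand-in for whatever SQL engine is under test. It is not how BSBM or any vendor benchmark is actually implemented.

```python
import sqlite3
import time

# Hypothetical "reporting" use case: a small mix of queries that is
# always executed together, rather than timed as isolated statements.
QUERY_MIX = {
    "top_products": (
        "SELECT product, SUM(amount) AS total FROM sales "
        "GROUP BY product ORDER BY total DESC LIMIT 3"
    ),
    "daily_revenue": (
        "SELECT day, SUM(amount) FROM sales GROUP BY day ORDER BY day"
    ),
}


def build_dataset(conn, rows=1000):
    """Create and fill a toy sales table (synthetic data, assumption)."""
    conn.execute("CREATE TABLE sales (day INTEGER, product TEXT, amount REAL)")
    data = [(i % 30, f"p{i % 7}", float(i % 13)) for i in range(rows)]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", data)
    conn.commit()


def run_mix(conn, mix, repetitions=5):
    """Run the whole mix several times; report throughput and per-query means."""
    per_query = {name: 0.0 for name in mix}
    start = time.perf_counter()
    for _ in range(repetitions):
        for name, sql in mix.items():
            t0 = time.perf_counter()
            conn.execute(sql).fetchall()  # fetch so the query fully executes
            per_query[name] += time.perf_counter() - t0
    elapsed = time.perf_counter() - start
    return {
        # Throughput of complete mixes, extrapolated to one hour.
        "mixes_per_hour": repetitions * 3600.0 / elapsed if elapsed > 0 else float("inf"),
        "mean_seconds": {name: total / repetitions for name, total in per_query.items()},
    }


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_dataset(conn)
    print(run_mix(conn, QUERY_MIX))
```

The point of reporting a figure for the whole mix next to the per-query means is that a system can win on one isolated query and still be the wrong fit for the use case as a whole.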
Truth be told, however, there’s more to be done before we get there. As much as I love the idea of SQL-on-Hadoop, I would not use it today. Why? Simple: ANSI SQL-92 compliance is still missing. Trivial, you say? Sure, why don’t you try writing your own JDBC connector then and see how that works for you, because the ones out there at the moment have some trouble dealing with that. And that is something that will hamper, if not break, your application. Yes, I know it’s close, but near compliance does not cut it. I do hope we’re getting there though, and that it’s just a matter of time.
3 comments to “A small step for Impala, a big step for SQL-on-Hadoop. More to come, hopefully.”
Thanks for the link!
The SPARQL benchmark looks very interesting, especially the division into three scenarios. I guess that in a narrow domain, with strong academic involvement and relatively little money on the table, cooperation is possible 🙂
Thank you as well for the comment and the initial thought-provoking post, Ofir.
I think we owe the status quo in the RDF realm pretty much to the tenacity and heartfelt enthusiasm of certain people in this community. And to give credit where credit is due, Chris Bizer and the people he works with have done a fantastic job there organizing the BSBM.
I also think that if somebody came along and started doing something similar for SQL-on-Hadoop, it would only take a short while before vendors started to cooperate with them. Once results were published, vendors could not possibly ignore them, so it would make sense for them to work with the organizers in order to make sure they do as well as possible. If you can’t beat them, join them! So we just need to find that somebody. Offers, anyone? 🙂