Indexing Hadoop: If it’s so simple, how come not everyone’s doing it?
Is Hadoop really the best thing since sliced bread? You’d probably get this idea, if you have been talking to any of the (proliferating) Hadoop advocates / vendors. Hadoop and its expanding ecosystem are being touted as the ideal solution to any organization’s data management needs – and admittedly, for good reason.
Hadoop offers a number of clear advantages, the most obvious of which are cheap and scalable storage coupled with an infrastructure built to become the poor man’s parallel programming platform (although some would call that a contradiction in terms). But there are also a couple of things that are wrong about it*, the most prominent of which is its inability to do interactive querying and SQL. In this post we discuss some of the options to counter this drawback.
To be clear, we must note that Hadoop is basically a batch-oriented parallel processing system that was not designed to work interactively. And to add insult to injury, it does not support SQL either. That is hardly surprising, as its storage system (HDFS) was not designed to support a relational db. So what is the fuss all about, and why are people trying to twist Hadoop’s arm (and theirs) to make it become something it was never meant to be? Hadoop offers a very attractive solution for storing all sorts of unstructured data for cheap, plus a framework to process that data, also for cheap. As such, it did not take long for it to become established in many organisations.
What was the next logical step? As data started being accumulated in Hadoop deployments (thus becoming the de-facto data hub of the organisation), the need to answer questions on that data became pronounced. As Hadoop was not designed to do this, at least in an interactive way, some approaches were developed to deal with that need. One approach was to ETL data stored in Hadoop to a structured schema / format, offload it to a (relational, typically) db and do all the querying there. This works, and lets analysts and developers use all the tools they are familiar with on top of relational dbs (SQL and BI / Visualisation suites), but is cumbersome and resource-intensive to build and maintain. Is there a better way?
Enter SQL-on-Hadoop. It’s a reasonable (albeit brave) approach to take: instead of the cumbersome ETL process, a number of vendors have embarked on an effort to inject SQL capabilities into Hadoop. After all, if the mountain won’t come to Muhammad, then Muhammad must go to the mountain. This is a very lively and crowded space, as the benefits of this effort will be substantial. Of course it is suffering from a number of growing pains: some have to do with FUD, some with the fact that this is a new space where the rules for measuring and comparing performance are not quite clear yet, and some with the nature of the game itself.
What most SQL-on-Hadoop vendors are really doing is using the MapReduce infrastructure to run batch jobs that do a full scan of the data every time in order to answer queries. Because Hadoop is so good at this, and because of the engineering vigour and resources that the vendors are putting into this, this actually works, and it’s even getting to the point where some claim it works just as well as a relational db. This is pretty impressive, considering how inefficient this approach can be. To use the library metaphor, this is the equivalent of hiring 1000 librarians and keeping them around so that you can send them all off to look for books by scanning the library each time (blunt metaphor, but you get the point). So, again: is there a better way?
Why, sure. There has been for a number of years, actually. It’s the reason RDBMSs are so good at what they do, and it’s called indexing: instead of sending off your librarians each time to get your books, just keep track of everything that’s going in and out of your library in an index, and use that to quickly fetch what you need. Simple, right? Well, if it’s so simple, how come not every SQL-on-Hadoop contestant is doing it? As you may have guessed, the answer is “because it’s not so simple, actually”. As any seasoned DBA or proficient db researcher will tell you, indexing is tricky.
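To make the librarian metaphor concrete, here is a minimal sketch in Python. Everything in it (the BOOKS list, the function names, the idea of counting "examined" records) is a made-up illustration, not how any SQL-on-Hadoop engine or RDBMS actually works internally. It only shows the difference in how many records each approach has to touch to answer the same question.

```python
from collections import defaultdict

# Hypothetical toy "library": each record is (shelf_position, author, title).
# shelf_position is assumed to equal the record's position in the list.
BOOKS = [
    (0, "Knuth", "The Art of Computer Programming"),
    (1, "Codd", "The Relational Model for Database Management"),
    (2, "Knuth", "Concrete Mathematics"),
    (3, "Gray", "Transaction Processing"),
]


def full_scan(books, author):
    """Look at every record on every query (the '1000 librarians' approach)."""
    examined = 0
    hits = []
    for _, book_author, title in books:
        examined += 1
        if book_author == author:
            hits.append(title)
    return hits, examined


def build_index(books):
    """Build an author -> shelf positions map once, up front."""
    index = defaultdict(list)
    for position, author, _ in books:
        index[author].append(position)
    return index


def indexed_lookup(books, index, author):
    """Consult the index, then touch only the matching records."""
    positions = index.get(author, [])
    return [books[p][2] for p in positions], len(positions)
```

The scan examines all four records to find the two Knuth titles, while the indexed lookup touches only those two. The catch, of course, is that build_index has to be paid for somewhere, and kept up to date as books come and go.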
It’s complex to implement, and maintaining it has performance side-effects that penalize your db in the long run as its size grows. To demonstrate that point, let us mention that for large databases it may make more sense performance-wise to drop indexes before performing updates/inserts and then rebuild them from scratch, instead of updating them and keeping them consistent along the way. Now consider the volume of data that Hadoop is used for, and you will get why taking an indexing approach for SQL-on-Hadoop is not for the faint of heart. Yes, but still... is there a better way?
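The drop-and-rebuild pattern mentioned above can be sketched with Python's built-in sqlite3 module. The table and index names (events, idx_events_user) are hypothetical, and whether the rebuild variant actually wins depends on the engine, the data volume and the index type, so treat this as an illustration of the two options rather than a performance claim.

```python
import sqlite3


def make_db():
    """Create an in-memory database with one table and one index (names are made up)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, payload TEXT)")
    conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
    return conn


def bulk_load_in_place(conn, rows):
    """Insert while the index is live, so it is maintained on every insert."""
    conn.executemany("INSERT INTO events (user_id, payload) VALUES (?, ?)", rows)
    conn.commit()


def bulk_load_with_rebuild(conn, rows):
    """Drop the index, insert the rows, then rebuild the index from scratch."""
    conn.execute("DROP INDEX IF EXISTS idx_events_user")
    conn.executemany("INSERT INTO events (user_id, payload) VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
    conn.commit()
```

Both functions leave the database in the same state; the difference is only in when the cost of indexing is paid. Timing the two on a large batch is the usual way to decide which one fits a given load.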
Well, there might be. RDBMS indexes are expensive because they live in a world that requires them to support the ACID properties, and that comes at a cost. The thing is, however, that these requirements do not necessarily apply in the Hadoop world. It’s one thing to support a transactional store, and a very different thing to support an analytical store. I do not know anyone using Hadoop as a transactional store, although if there is someone doing that I would very much like to meet them 🙂 HBase can be used for this purpose, but it’s a different beast altogether – a key-value store rather than an RDBMS really.
For the typical “data hub” scenario that Hadoop is well-suited for, indexes can be reinvented by loosening some of the requirements imposed on RDBMS indexes: no updates, only insert/append operations. Allow duplicates, and go for eventual consistency instead of a strict one (an approach also taken by NoSQL solutions).
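As a rough illustration of what those loosened requirements could look like, here is a toy Python sketch. The AppendOnlyIndex class and its methods are invented for this post and do not describe any vendor's implementation. It supports appends only, keeps duplicates, and makes new entries visible to queries only after a separate publish step, which is a simplistic stand-in for eventual consistency.

```python
from collections import defaultdict


class AppendOnlyIndex:
    """Toy index: inserts only, duplicates allowed, readers may lag behind writers."""

    def __init__(self):
        self._rows = []                      # append-only data store
        self._published = defaultdict(list)  # index entries visible to queries
        self._pending = []                   # index entries not yet visible

    def append(self, key, value):
        """Store a row and queue its index entry; there is no update or delete."""
        row_id = len(self._rows)
        self._rows.append((key, value))
        self._pending.append((key, row_id))
        return row_id

    def publish(self):
        """Make all queued index entries visible (e.g. run periodically in batch)."""
        for key, row_id in self._pending:
            self._published[key].append(row_id)
        self._pending.clear()

    def lookup(self, key):
        """Return values for key using only the published part of the index."""
        return [self._rows[row_id][1] for row_id in self._published.get(key, [])]
```

Because nothing is ever updated in place, there is no need to lock or rewrite existing index entries; the price is that a query issued between append and publish will not see the newest rows.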
If this sounds like a promising approach, you may be interested to know that there actually is somebody doing just that. They are called JethroData, and even though they are not enjoying the kind of popularity seen by the likes of Cloudera or Hortonworks, I think they are definitely worth keeping an eye on. A few days ago I had the chance to talk with Eli Singer, their CEO, and I found their take on things pretty interesting. Besides the purely technical part, I also like the way they approach the issue of benchmarks; this was, after all, how we got in touch in the first place. Being a relatively small player, JethroData relies on benchmarks to get some traction and attract some attention.
Jethro participated in last week’s Strata conference, and Eli was kind enough to share the following:
“We did show a live benchmark between Jethro and Impala and the results are below. It was very basic test that included a table from the TPC-DS benchmark with 2.5B records. It ran on an Amazon cluster of 3 M1.Large machines with Cloudera CDH 4.5. We tested Jethro pre-beta version 0.65 vs the latest Impala. In addition, the Impala data was using the new Parquet data format which significantly improves its performance. The test is very unscientific and we will run more complete tests once we’re more advanced into our beta program”.
Eli also shared some of their benchmark results, which I am citing below. While I obviously cannot and will not endorse this, and I have no intention of finding myself in the middle of vendor and analyst flame wars, I find their approach interesting and expect other vendors to take note soon. My point of view is let 1000 flowers bloom, and let each vendor find their own sweet spot in the market. I think any approach that pushes things forward for the community as a whole is a good thing, and as such I wish JethroData good luck moving forward.
Full disclosure: having written about the recently released Cloudera benchmark, I got a reply to my blog post from Ofir Manor, who has just been appointed as JethroData product manager. Ofir put me in touch with Eli Singer, with whom I had a short discussion on their work and who was kind enough to provide some material, used here with permission. There is no affiliation or contractual relationship whatsoever between myself and JethroData.
* To get an idea of some of the things that are great and some others that are not so great about Hadoop, you can check the slide deck from the analyst session I organized this September at Structure Europe 2013.