This week the world’s biggest event for all things Hadoop takes place, the Strata and Hadoop World conference. Vendors announce and showcase new releases and features in their offerings, and Gigaom covered the extensive array of news. Let’s try to decipher the announcements and see their impact on the accessibility, scalability and security of Hadoop distributions.
First of all, by Hadoop distributions we mean software platforms that include not only Hadoop itself, but also a number of projects that build on top of it, adding features such as stream processing, SQL access and security. All three major Hadoop distribution vendors, Cloudera, Hortonworks and MapR, announced either new versions of their distributions or new features.
Hadoop relies on its file system, HDFS, as the main means to store and access data. HDFS is efficient as a file system, but working with files directly is a rather cumbersome way to access data. Therefore, additional layers have been developed on top of Hadoop to give access to data via the most common data access language: SQL.
The SQL-on-Hadoop domain has seen a major breakthrough, as both Cloudera and Hortonworks have enabled their implementations (Impala 2.0 and Hive Stinger respectively) to support SQL write access. Previously, SQL-on-Hadoop solutions only supported read access to data stored in Hadoop, so this is an important step towards maturity.
Now Hadoop is closer to being able to operate as a traditional database, with the additional benefits of scaling out on commodity hardware and supporting data processing tools and pipelines. What is still missing is full ANSI SQL support, but progress has been made on this front too, so we can be optimistic that it will be achieved soon.
Even though Hadoop is designed to scale out on commodity hardware, sometimes what is available on premises is not enough to run demanding data processing jobs. In addition, configuring and managing a Hadoop cluster requires expertise that not every organization is willing or able to keep in-house. For these reasons, running Hadoop in the cloud is becoming increasingly popular.
While some cloud vendors offer preconfigured Hadoop nodes, such as Amazon’s Elastic MapReduce or Microsoft’s HDInsight, not all have this option. Furthermore, some organizations would like to have more control over their Hadoop nodes. Therefore, the announcements from both Cloudera and Hortonworks of new support for deploying their distributions in the cloud are important.
Cloudera announced a new tool called Director that will enable clients to deploy its distribution on major cloud providers, while Hortonworks announced the ability to perform data replication and archiving on Microsoft Azure or Amazon. In addition, the Hortonworks distribution can now also be deployed on both Azure and Amazon. MapR is also offered as an option on Amazon EMR.
Security is key for Hadoop to become truly enterprise-ready. Security comprises a number of important features, and there are a number of competing initiatives, described in Gigaom’s note on Hadoop security.
Two of the key projects are Apache Sentry, initiated by Cloudera, and Apache Argus, based on the codebase of XASecure, which was acquired by Hortonworks. Argus has just been proposed for incubation at the Apache Software Foundation, while Sentry is already well under way in the incubation process.
Sentry addresses fine-grained data access and has been enhanced with a plug-in architecture that enables it to work across the Hadoop stack, rather than just with Hive, Impala and Cloudera Search as is currently the case. Argus, on the other hand, is a more comprehensive framework that also offers key management and transparent encryption for HDFS. Argus has been integrated with Apache Knox, Storm, Hive and HBase. It remains to be seen just how much of XASecure is open sourced and which project gets more traction.