A standard for storing big data? Apache Spark creators release open-source Delta Lake
From data lakes to data swamps and back again. Data reliability, as in transactional support, is one of the pain points keeping organizations from getting the most out of their data lakes. Delta Lake is here to address this.
In theory, data lakes sound like a good idea: One big repository to store all the data your organization needs to process, unifying a myriad of data sources. In practice, most data lakes are a mess in one way or another, earning them the “data swamp” moniker. Databricks says part of the reason is lack of transactional support, and they have just open sourced Delta Lake, a solution to address this.
Historically, data lakes have been a euphemism for Hadoop. Historical Hadoop, that is: On-premises, using HDFS as the storage layer. The reason is simple. HDFS offers cost-efficient, reliable storage for data of all shapes and sizes, and Hadoop’s ecosystem offers an array of processing options for that data.
The data times are a changin’ though, and data lakes follow. The main idea of having one big data store for everything remains, but that store is not necessarily on premises anymore, and not necessarily Hadoop either. Cloud storage is becoming the de facto data lake, and Hadoop itself is evolving to utilize cloud storage and work in the cloud.
Databricks is the company founded by the creators of Apache Spark. Spark has complemented, or superseded, traditional Hadoop to a large extent, thanks to the higher-level abstraction of its APIs and its faster, in-memory processing. Databricks itself offers a managed version of open source Spark in the cloud, along with a set of proprietary extensions called Delta. Delta is cloud-only, and is used by a number of big clients worldwide.
In a conversation, Matei Zaharia, Apache Spark co-creator and Databricks CTO, noted that sometimes Spark users migrate to the Databricks platform, while other times it’s line-of-business requirements that dictate a cloud-first approach. It seems that having to deal with data lakes that span on-premises and cloud storage prompted Databricks to address one of the main issues with data lakes: reliability.
“Today nearly every company has a data lake they are trying to gain insights from, but data lakes have proven to lack data reliability. Delta Lake has eliminated these challenges for hundreds of enterprises. By making Delta Lake open source, developers will be able to easily build reliable data lakes and turn them into ‘Delta Lakes’,” said Ali Ghodsi, cofounder and CEO at Databricks.
Knowing where this is coming from, we had to wonder what exactly this means, and what kind of data storage Delta Lake supports.