Unifying cloud storage and data warehouses: Delta Lake project hosted by the Linux Foundation
Mixing open source foundation governance with commercial adoption, the strategy Databricks has chosen could set Delta Lake on its way to becoming a standard for storing data in the cloud
Going cloud for your storage needs comes with some baggage. On the one hand, it’s cheap, elastic, and convenient – it just works. On the other hand, it’s messy, especially if you are used to working with data management systems like databases and data warehouses.
Unlike those systems, cloud storage was not designed with things such as transactional support or metadata in mind. If you work with data at scale, these are pretty important features. This is why Databricks introduced Delta Lake to add those features on top of cloud storage back in 2017.
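To make that concrete, here is a minimal sketch of what this looks like from Spark, assuming a cluster with the open source Delta Lake package on its classpath; the s3a:// bucket path below is purely illustrative. Writing a DataFrame in the "delta" format produces ordinary Parquet files plus a transaction log, which is what layers ACID transactions and queryable table metadata on top of plain object storage.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session with the Delta Lake package (io.delta:delta-core)
# available; "s3a://example-bucket/events" is a hypothetical path.
spark = SparkSession.builder.appName("delta-sketch").getOrCreate()

events = spark.createDataFrame(
    [(1, "signup"), (2, "purchase")], ["user_id", "event"]
)

# Writing in the "delta" format records the change in a transaction log
# (_delta_log) alongside the Parquet files, giving object storage
# ACID semantics and table metadata.
events.write.format("delta").mode("append").save("s3a://example-bucket/events")

# Readers always see a consistent snapshot of the table, even while
# another job is appending to it.
spark.read.format("delta").load("s3a://example-bucket/events").show()
```

The same table can then be read by any Delta Lake-aware engine, with every reader seeing a consistent snapshot of the data.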
Earlier in 2019, Delta Lake was open sourced. Today, the Linux Foundation, the non-profit organisation enabling mass innovation through open source, announced that it will host Delta Lake. ZDNet discussed this development, and how it fits in the overall data and analytics landscape, with Matei Zaharia, Databricks Chief Technologist and co-founder.
Zaharia and Databricks CEO and co-founder Ali Ghodsi are among the original creators of the open source Apache Spark project, the unified analytics engine that has become a de facto standard for large-scale data processing.
Early on, Databricks decided to focus on offering a data management platform, based on a hardened version of Apache Spark with additional proprietary elements, as a managed cloud-based service. At the same time, Databricks remains the driving force behind the evolution of Spark.
This open core strategy is typical for many companies that act both as the stewards of open source projects and as commercial for-profit entities. It’s a way to balance the benefits of open source with the need to be commercially sustainable. It can, however, lead to unintended side effects.
Competition from cloud vendors has forced some companies offering open source products to react. Their response was to change the licenses of their open source components to prohibit cloud vendors from taking the open source core and offering it as a service themselves. This, in turn, has caused controversy in the open source community, and beyond.
Databricks is aware of this, and decided to take a different approach with Delta Lake. As Zaharia explained, they want Delta Lake to be as widely adopted as possible, which is why they open sourced it. At the same time, they want it to take on a life of its own, independent of Databricks, which is why they are handing it over to the Linux Foundation.
Databricks wants to send a clear message to the community, said Zaharia, which is why they chose the Linux Foundation, an umbrella foundation for open source projects, as the steward for Delta Lake. Although it’s been only 6 months since Delta Lake was open sourced, data shared by Databricks suggest strong uptake.
Since its launch in October 2017, Delta Lake has been adopted by over 4,000 organisations and processes over two exabytes of data each month. Adopters include the likes of Alibaba, Booz Allen Hamilton, Intel, and Starburst. Coupled with an open governance model that encourages participation and technical contribution, this may mean that Delta Lake does indeed become a standard for storing big data.
Delta Lake aims at nothing short of unifying cloud storage and data warehouses, and this theme is reflected across the board for Databricks. Take the recent announcement of its partnership with Tableau, for example. As Zaharia explained, the partnership itself is not new. What is new is the Databricks connector for Tableau, which is faster and easier to use than the generic Spark connector previously available.
For Databricks, having strong business intelligence and visualization partners like Tableau makes sense. It enables Databricks to go the last mile to business users, which is something Zaharia said they did not have in mind at the beginning of their journey. This is an interesting point, as it sheds light on the thinking behind Delta Lake.
Zaharia said their original aim was to serve data scientists and their workloads, with a focus on machine learning. But as they were met with increasing demand for “vanilla” data access, the kind of workload typically served by data warehouses, unifying the two became a priority. This is how Delta Lake came to be.
For a vendor like Tableau, on the other hand, being able to access and integrate data that lives in cloud storage vastly expands the reach of its users. It’s a win for everyone. So even though Tableau may have a more direct way to access data in the Databricks platform via the partnership, the idea is that any tool should be able to access data on any cloud via Delta Lake.