Data management in 2024. Open data formats and a common language for a sixth data platform
What data management in 2024 and beyond will look like hangs on one question. Can open data formats lead to a best-of-breed data management platform? It will take interoperability across clouds and formats, as well as at the semantics and governance layer.
Sixth Platform. Atlas. Debezium. DCAT. Egeria. Nessie. Mesh. Paimon. Transmogrification.
This veritable word soup sounds like something that jumped out of a role-playing game. In reality, these are all terms related to data management. These terms may tell us something about the mindscape of people involved in data management tools and nomenclature, but that’s a different story.
Data management is a long-standing and ever-evolving practice. The days of “Big Data” have long faded and data has now taken a back seat in terms of hype and attention. Generative AI is the new toy people are excited about, shining a light on AI and machine learning for the masses.
Data management may seem dull in comparison. However, there is something people who have been into AI before it was cool understand well, and people who are new to this eventually realize too: AI is only as good as the data it operates on.
There is an evolutionary chain leading from data to analytics to AI. Some organizations were well aware of this a decade ago. Others will have to learn the hard way today.
Exploring and understanding how, where and when data is stored, managed, integrated and used as well as related aspects of data formats, interoperability, semantics and governance may be hard and unglamorous work. But it is what generates value for organizations, along with RPG-like word soups.
Peter Corless and Alex Merced understand this. Their engagement with data goes back well before their current roles as Director of Product Marketing at StarTree and Developer Advocate at Dremio, respectively. We caught up to talk about the state of data management in 2024, and where it may be headed.
The sixth data platform
For many organizations today, data management comes down to handing over their data to one of the “Big 5” data vendors: Amazon, Microsoft Azure and Google, plus Snowflake and Databricks. But analysts David Vellante and George Gilbert believe that the needs of modern data applications coupled with the evolution of open storage management may lead to the emergence of a “sixth data platform”.
The notion of the “sixth data platform” was the starting point for this conversation with Corless and Merced. The sixth data platform hypothesis is that open data formats may enable interoperability, leading the transition away from vertically integrated vendor-controlled platforms towards independent management of data storage and permissions.
It’s an interesting scenario, and one that would benefit users by forcing vendors to compete for every workload based on the business value delivered, irrespective of lock-in. But how close are we to realizing this?
In order to answer this question, we need to examine open data formats and their properties. In turn, in order to understand data formats a brief historical overview is needed.
We might call this table stakes, and it’s not just a pun. Historically, organizations would resort to using data warehouses for their analytics needs. These were databases specifically designed for analytics, as opposed to transactional applications. Traditional data warehouses worked, but scaling them was an issue as Merced pointed out.
Scaling traditional data warehouses was expensive and cumbersome, because it meant buying new hardware with storage and compute bundled. This was the problem Hadoop was meant to solve when it was introduced in the mid-2000s, by separating storage and compute.
Hadoop made scaling easier, but it was cumbersome to use. This is why a SQL interface was eventually created for Hadoop via Apache Hive, making it more accessible. Hive used metadata and introduced a protocol for mapping files, which are what Hadoop operates on, to tables, which are what SQL operates on. That was the beginning of open data formats.
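To make the files-to-tables idea concrete, here is a minimal sketch of the directory convention commonly associated with Hive, where partition values are encoded in the path as key=value segments. It is an illustration only, not Hive's implementation: the file listing, the function names and the index structure are all assumptions made for this example.

```python
from __future__ import annotations

from collections import defaultdict
from pathlib import PurePosixPath


def parse_partitioned_path(path: str) -> tuple[str, dict[str, str]]:
    """Split 'table/key=value/.../file' into (table name, partition values)."""
    parts = PurePosixPath(path).parts
    table = parts[0]
    partitions = {}
    for segment in parts[1:-1]:  # skip the table name and the file name
        key, sep, value = segment.partition("=")
        if sep:  # only key=value segments describe partitions
            partitions[key] = value
    return table, partitions


def build_table_index(paths: list[str]) -> dict[str, list[dict]]:
    """Group files by table and record each file's partition values."""
    index = defaultdict(list)
    for path in paths:
        table, partitions = parse_partitioned_path(path)
        index[table].append({"file": path, "partitions": partitions})
    return dict(index)


def files_for(index: dict[str, list[dict]], table: str, **wanted: str) -> list[str]:
    """Return only the files whose partition values match the filter."""
    return [
        entry["file"]
        for entry in index.get(table, [])
        if all(entry["partitions"].get(k) == v for k, v in wanted.items())
    ]


# Hypothetical file listing, as a data lake might expose it.
FILES = [
    "sales/year=2023/month=12/part-0000.parquet",
    "sales/year=2024/month=01/part-0000.parquet",
    "sales/year=2024/month=01/part-0001.parquet",
    "customers/part-0000.parquet",
]

INDEX = build_table_index(FILES)
JAN_2024 = files_for(INDEX, "sales", year="2024", month="01")
```

A query engine that knows this mapping can answer a filter on year and month by reading two files instead of the whole directory tree, which is the kind of shortcut the metadata makes possible.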
Data lakes, data lakehouses, and open data formats
Hadoop also signified the emergence of the data lake. Since storage was now cheap and easy to add, that opened up the possibility of storing all data in one big Hadoop store: the data lake. Eventually, however, Hadoop and on-premises compute and storage gave way to the cloud, which is the realm the “Big 5” operate in.
“The idea of a data lakehouse was, hey, you know what? We love this decoupling of compute and storage. We love the cloud. But wouldn’t it be nice if we didn’t have to duplicate the data and we could just operate over the data store you already have, your data lake, and then just take all that data warehouse functionality and start trying to move it on there?”, as Merced put it.
However, as he added, that turned out to be a tall order, because there were elements of traditional data management missing. Other cloud-based data management platforms introduced notions like automated sharding, partitioning, distribution and replication. That came as part of the move away from data centers and file systems towards the cloud and APIs to access data. But for data lakehouses, these things are not a given.
This is where open data formats such as Apache Iceberg, Apache Hudi and Delta Lake come in. They all have some things in common, such as the use of metadata to abstract and optimize file operations and being agnostic as to storage. However, Hudi and Delta Lake both continued to build on what Hive started, while Iceberg departed from that.
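What "using metadata to abstract file operations" can mean in practice is easier to see in a toy model. The sketch below assumes a table is a log of immutable snapshots, each listing the data files that make up the table at that point. It is a generic illustration and does not reproduce the actual metadata layout of Iceberg, Hudi or Delta Lake; the class names and file paths are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table: the set of files valid at one point."""
    snapshot_id: int
    data_files: tuple[str, ...]


@dataclass
class Table:
    name: str
    snapshots: list[Snapshot] = field(default_factory=list)

    def current(self) -> Snapshot | None:
        return self.snapshots[-1] if self.snapshots else None

    def commit(self, add=(), remove=()) -> Snapshot:
        """Record a new snapshot; existing snapshots are never modified."""
        base = self.current()
        files = set(base.data_files) if base else set()
        files -= set(remove)
        files |= set(add)
        snap = Snapshot(len(self.snapshots) + 1, tuple(sorted(files)))
        self.snapshots.append(snap)
        return snap

    def scan(self, snapshot_id: int | None = None) -> list[str]:
        """List the files to read, for the latest or an earlier snapshot."""
        if snapshot_id is None:
            snap = self.current()
        else:
            snap = next(s for s in self.snapshots if s.snapshot_id == snapshot_id)
        return list(snap.data_files) if snap else []


# File paths are opaque strings, so the model does not care where they live.
orders = Table("orders")
orders.commit(add=["s3://bucket/orders/a.parquet", "s3://bucket/orders/b.parquet"])
orders.commit(add=["s3://bucket/orders/c.parquet"], remove=["s3://bucket/orders/a.parquet"])

LATEST = orders.scan()
FIRST = orders.scan(snapshot_id=1)
```

Readers never list directories in this model; they ask the metadata which files belong to the table. That is also why the approach is storage-agnostic, and why reading an older snapshot comes almost for free.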
A detailed open data format comparison is a nuanced exercise, but perhaps the most important question is: Does interoperability exist between open data formats, and if yes, on what level? For the sixth data platform vision to become reality, interoperability should exist among clouds and formats, as well as on the semantics and governance level.
Interoperability of the same format between different cloud storage vendors is possible, but not easy. It can be done, but it’s not an inherent feature of the data formats themselves. It’s done on the application level, either via custom integration or by using something like Dremio. And there will be egress and network costs from transferring files outside the cloud where they reside.
Interoperability of different formats is also possible, but not perfect. It can be done by using a third-party application, but there are also a couple of other options like Delta UniForm or OneTable. Neither works 100%, as Delta UniForm batches transactions and OneTable is for migration purposes.
As Merced noted, creating a solution that works 100% would probably incur a lot of overhead and complexity as metadata would have to be synchronized across formats. Merced thinks these solutions are motivated at least partially by Iceberg’s growth and the desire to ensure third-party tools can work with the Iceberg ecosystem.
To federate or not to federate?
What’s certain is that there will always be sprinkles of data in other systems or in multiple clouds. That’s why federation and virtualization that can work at scale are needed, as Corless noted. Schema management is complicated enough on one system, and trying to manage changes across three different systems would not make things easier.
Whether data is left where it originally resides and used via federated queries, or ingested in one storage location is a key architectural question. As such, there are tradeoffs involved that need to be understood. Even within federation, there are different ways to go about it – Dremio, GraphQL, Pinot, Trino, and more.
“How do data consumers want to size up these problems? How can we give them predictive ways to plan for these trade-offs? They could use federated queries, or a hybrid table of real time and batch data. We don’t even have a grammar to describe those kinds of hybrid or complex data products these days”, Corless said.
Corless has a slightly different point of view, focusing on real-time data and processing. This is what StarTree does, as it builds on Apache Pinot. Pinot is a real-time distributed OLAP datastore, designed to answer OLAP queries with low latency.
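The "hybrid table of real time and batch data" that Corless mentions can be sketched in a few lines. The version below assumes a simple time-boundary rule: rows at or before the boundary are served from the batch copy, later rows from the real-time copy, so overlapping data is not counted twice. This is a simplified illustration, not Pinot's implementation; the function name, row shape and sample data are assumptions.

```python
def query_hybrid(batch_rows, realtime_rows, time_boundary):
    """Combine batch and real-time rows without double counting.

    Rows with ts <= time_boundary come from the batch side,
    rows with ts > time_boundary come from the real-time side.
    """
    from_batch = [row for row in batch_rows if row["ts"] <= time_boundary]
    from_realtime = [row for row in realtime_rows if row["ts"] > time_boundary]
    return from_batch + from_realtime


# Hypothetical data: the two sides overlap at ts == 3.
BATCH = [
    {"ts": 1, "clicks": 10},
    {"ts": 2, "clicks": 20},
    {"ts": 3, "clicks": 30},
]
REALTIME = [
    {"ts": 3, "clicks": 30},
    {"ts": 4, "clicks": 40},
    {"ts": 5, "clicks": 50},
]

# Assumption: the boundary is the latest timestamp covered by batch data.
BOUNDARY = max(row["ts"] for row in BATCH)
ROWS = query_hybrid(BATCH, REALTIME, BOUNDARY)
TOTAL_CLICKS = sum(row["clicks"] for row in ROWS)
```

Even in this toy form the trade-off shows up: the consumer gets one logical table, but someone has to decide where the boundary sits and which copy wins when the two disagree.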
Corless also noted that the choice of data format depends on a number of parameters. Some formats are optimized for in-memory use, like Apache Arrow. Others like Apache Parquet are optimized for disk storage.
There are attempts to have common representations of data both in-memory and on storage, motivated by the need to leverage tiered storage. Tiered storage may utilize different media, ranging from in-memory to SSD to some sort of blob storage.
“People want flexibility in where and how they store their data. If they have to do transmogrification in real time, that largely defeats the purpose of what they’re trying to do with tiered storage. Is there something like the best, most universal format?
Whenever you’re optimizing for something, you’re optimizing away from something else. But I’m very eager to see where that kind of universal representation of data is taking us right now.”, Corless said.
Semantics and governance
Regardless of where or how data is stored, however, true interoperability should also include semantics and governance. This is where the current state of affairs leaves a lot to be desired.
As Merced shared, the way all table formats work is that the metadata has information about the table, but not about the overall catalog of tables. That means they are not able to document the semantics of tables and how they relate to each other. A standard format of doing that doesn’t exist yet.
Merced also noted that there is something that addresses this gap, but it only works for Iceberg tables. Project Nessie is an open source platform that creates a catalog, thus making tracking and versioning Iceberg tables and views possible.
Nessie also incorporates elements of governance, and Merced noted that Nessie will eventually have Delta Lake support too. Databricks on its part offers Unity Catalog which works with Delta Lake, but it’s a proprietary product. There is no lack of data catalog products in the market, but none of them can really be considered the solution to semantics and data interoperability.
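To illustrate what a catalog adds on top of per-table metadata, here is a toy sketch of a versioned catalog with Git-like branches. It assumes the catalog maps table names to metadata pointers and keeps a history of immutable states per branch. It does not use or reproduce Project Nessie's API; the class, its methods and the pointer paths are all hypothetical.

```python
class Catalog:
    """Toy versioned catalog: branch name -> history of catalog states.

    Each state is a dict mapping a table name to a metadata pointer.
    States are never mutated; every commit appends a new one.
    """

    def __init__(self):
        self.branches = {"main": [{}]}

    def head(self, branch="main"):
        return self.branches[branch][-1]

    def create_branch(self, name, from_branch="main"):
        # Copy the history list; the states themselves are shared, not changed.
        self.branches[name] = list(self.branches[from_branch])

    def commit(self, branch, table, metadata_pointer):
        new_state = dict(self.head(branch))
        new_state[table] = metadata_pointer
        self.branches[branch].append(new_state)

    def merge(self, source, target="main"):
        """Naive merge: pointers on the source branch overwrite the target's."""
        new_state = dict(self.head(target))
        new_state.update(self.head(source))
        self.branches[target].append(new_state)


catalog = Catalog()
catalog.commit("main", "orders", "orders/metadata/v1.json")

# Work on several tables in isolation, then publish them together.
catalog.create_branch("etl")
catalog.commit("etl", "orders", "orders/metadata/v2.json")
catalog.commit("etl", "customers", "customers/metadata/v1.json")

MAIN_BEFORE_MERGE = dict(catalog.head("main"))
catalog.merge("etl")
MAIN_AFTER_MERGE = dict(catalog.head("main"))
```

The point of the sketch is that changes to several tables become visible together, which per-table metadata alone cannot express.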
Corless on his part noted that there is a standard called DCAT. DCAT, which stands for Data Catalog Vocabulary, is an RDF vocabulary designed to facilitate interoperability between data catalogs published on the Web. DCAT has been around for over a decade, was recently updated to v.3, and is precisely aimed at interoperability.
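For a sense of what a DCAT description looks like, the sketch below builds a small dataset record as JSON-LD using only the Python standard library. The dcat: and dct: terms are taken from the published vocabulary, while the dataset identifier, title and download URL are hypothetical placeholders, and the record is deliberately minimal rather than a complete or validated DCAT profile.

```python
import json


def dcat_dataset(dataset_iri, title, download_url, media_type_iri):
    """Build a minimal DCAT dataset description as a JSON-LD dictionary."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@id": dataset_iri,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dcat:downloadURL": {"@id": download_url},
                "dcat:mediaType": {"@id": media_type_iri},
            }
        ],
    }


# Hypothetical dataset; the example.org IRIs are placeholders.
DOC = dcat_dataset(
    dataset_iri="https://example.org/datasets/orders",
    title="Orders",
    download_url="https://example.org/files/orders.csv",
    media_type_iri="https://www.iana.org/assignments/media-types/text/csv",
)

SERIALIZED = json.dumps(DOC, indent=2)
```

Because the record is plain RDF expressed as JSON, any catalog that understands the vocabulary can read it, regardless of which tool produced it.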
As per Corless, the fact that DCAT is not widely used probably has more to do with vendors reinventing the wheel or aiming for lock-in, and with clients not being proactive enough in requiring interoperability standards, than with DCAT itself. Unfortunately, it seems like what data is to AI, governance is to data for most organizations: a distant second at best.
Forty-two percent of data and analytics leaders do not assess, measure or monitor their data and analytics governance, according to a 2020 Gartner survey. Those who said they measured their governance activity mainly focused on achieving compliance-oriented goals.
The onset of GDPR in 2018 marked the opportunity for governance to leverage metadata and semantics to rise up. Apache Atlas was an effort to standardize governance for data lakes leveraging DCAT and other metadata vocabularies. Today, however, both data lakes and Atlas seem to have fallen out of fashion.
A new project called Egeria was spun out of Atlas, aiming to address more than data lakes. Furthermore, there is another open standard for metadata and data lineage called OpenLineage and a number of lineage platforms that support it, including Egeria.
Conclusion
So, what is the verdict? Is the sixth data platform possible for data management in 2024 or is it a pipe dream?
Merced thinks that we’re already starting to see it. In his view, using open formats like Apache Iceberg, Apache Arrow, and Apache Parquet can create a greater level of interoperability. This interoperability is still imperfect, he noted, but it’s a much better state of things than in the past. Dremio’s thesis is to be an open data lakehouse, not just operating on the data lake, but across tools and data sources.
Corless thinks that we’re going to see a drive towards clusters of clusters, or systems of systems. There is going to be a reinforced drive towards the automation of integration. More like Lego bricks that snap together easily, rather than Ikea furniture that comes with a hex wrench. But for that to happen, he noted, we’ll need a language and a grammar so that systems and people can both understand each other.
The ingredients for a sixth data platform are either there, or close enough. As complexity and fragmentation explode, a language and a grammar for interoperability sounds like a good place to start. Admittedly, interoperability and semantics are hard. Even that, however, may be more of a people and market issue than a technical one.
DCAT and OpenLineage are just some of the vocabularies out there. Even things as infamously hard to define as data mesh and data products have a vocabulary of their own – the Data Product Descriptor Specification.
Perhaps then the right stance here would be cautious optimism.