Trends in data and AI: Ben Lorica on cloud, platforms, models and Pegacorns
As Ben Lorica will readily admit, at the risk of dating himself, he belongs to the first generation of data scientists. In addition to having served as Chief Data Scientist for the likes of Databricks and O’Reilly, Lorica advises and works with a number of venture capital firms, startups and enterprises, conducts surveys, and chairs some of the top data and AI events in the world. That gives him a unique vantage point to identify developments in this space.
Having worked in academia teaching applied mathematics and statistics for years, at some point Lorica realized that he wanted his work to have more practical implications. At that point the term “data science” had not yet been coined, and Lorica’s exit strategy was to become a quant. Fast forward to today, and Lorica still has friends in the venture capital world.
That includes Intel Capital’s Assaf Araki, with whom Lorica co-authored two recent posts on data management and AI trends. We caught up with Lorica to discuss those, as well as new areas for growth and the trouble with unicorns and what to do about it.
Data management is going to the cloud, but opportunities are everywhere
Starting out with data management, Lorica has some advice to dispense for wannabe startup founders. Lorica believes that there are more opportunities to be found for startups working in the cloud rather than on premise, choosing to address either analytics or operational workloads rather than taking on both, and addressing integration as a means to an end, rather than a goal in itself.
While broadly speaking the advice makes sense, we were curious as to the degree to which Lorica was open to discussion around it. Unsurprisingly, he is well aware of the nuances in all of those aspects.
The cloud is the first choice for most workloads today. Obviously this includes data workloads, and cloud data gravity is real. But this does not mean that on premise is a lost cause, or that there is no point for anyone to keep any workloads on premise anymore. In addition, multi-cloud is a complex reality, and that complexity is taking its toll in terms of cost, too.
Data repatriation, i.e. the counter-movement of moving data from the cloud back on premise, is gaining traction. Dropbox’s move from the public cloud reportedly resulted in $75M in savings over two years and a doubling of its gross margin. Keeping cost and complexity at bay, in addition to regulatory and other requirements, are all good reasons for enterprises to not go all-in on cloud.
That is all very real, and Lorica is well aware of the fact that there are opportunities to be found in data management on premise, too. The way this advice should be read is more in the spirit of: “If you were to start a company, what area should you focus on? That’s all there is to it. We’re not saying that there’s no need for on premise databases, or there’s no opportunities there. It’s just much easier to iterate. The cloud market is big enough, you can move faster”, Lorica said.
Lorica identified multicloud as a unique challenge which is bigger than just databases. “If you look at the group in Berkeley that started Spark with AMPLab and then Ray with RISELab, their new lab is called Sky Computing Lab and it’s aimed squarely at multicloud, making cloud as simple and commoditized as possible. If you are willing to build a startup that will require a lot more technology and work and maybe a bit of a longer development cycle, yes, there are definitely opportunities”, he added.
Lorica also noted that startups focused on simplifying the relationship of enterprises to cloud providers would essentially act as middlemen. An enterprise’s relationship would be with the startup, with cloud computing staying in the background as more of a commodity. This is the Sky Lab vision, and one for which we have a blueprint today in the data management world: database-as-a-service (DBaaS). In DBaaS, users don’t have to know or care about the details of where data is stored or where workloads are running, as long as they get the service they are paying for.
The advice on choosing to address either analytics or operational workloads comes with the same caveat: it’s not that it’s impossible to do both, but if you’re a startup, it’s probably wiser to focus your efforts. It’s not so long ago that Hybrid Transactional/Analytical Processing (HTAP) was featured in Gartner’s Hype Cycles. Today, even though some vendors still use the term, it has all but disappeared, giving way to the perhaps more grounded term “In-DBMS Analytics”.
The latest example of an operational DBMS providing some analytics capabilities is MongoDB. The fact that it takes different types of technical architectures to do well in operational and analytics workloads is well understood. However, while operational databases may offer some analytics capabilities to their users, Lorica thinks it’s the Databricks and Snowflakes of the world who may eventually be able to crack the unification of operational and analytics workloads with “infinite compute and storage in the cloud”.
Lorica refers to cloud data warehouses and lakehouses as Modern Data Platforms (MDP), and notes that their pull is getting stronger. He also notes that the cloud database market is growing faster than the overall database market, and that growth is led by DBaaS. Furthermore, most of the DBaaS are based on open source databases.
Case in point: PostgreSQL. Besides the eponymous database, hyperscalers all run their own version of PostgreSQL, with Google being the latest to join the club. PostgreSQL also serves as the API of choice for the likes of Yugabyte and CockroachDB, and exemplifies the dominance of the open source ecosystem. By choosing to build on PostgreSQL, products benefit from its ecosystem and lower the friction of adoption, Lorica noted.
AI and Pegacorns
In his closing thoughts on data management, and as a way of transitioning to trends in machine learning, Lorica also noted opportunities in autonomous DBaaS, databases for computer vision, and the shift towards data-centric AI.
Lorica highlights the data-centric direction in AI spearheaded by Andrew Ng, and presents evidence to make the case. Practitioners have long known that better data (not just more data) leads to better models. Data augmentation is key even in state-of-the-art models. According to the 2022 AI Index, the state-of-the-art AI systems on nine out of ten benchmarks are trained with extra data.
But there’s another interesting, although perhaps unsurprising, finding in Lorica’s research. In the world of AI practitioners and experts, people talk a lot about the platform plays, i.e. infrastructure and horizontal platforms. As it turns out, however, startups that focus on specific problems and use cases tend to be more successful. To qualify this finding, Lorica and Kenn So introduced the notion of the Pegacorn. As they note, nearly 600 companies became unicorns in 2021.
Unicorn status used to mean that a startup had graduated into a mature company, backed by revenue and worth listing on the public stock market. In today’s climate, becoming a unicorn is increasingly a signal of investor enthusiasm rather than of how mature a company is, with companies with $1-5 million in revenue being awarded the status. In addition, 75% of venture capitalists think unicorns are significantly overvalued.
To weed out overvalued unicorns, Lorica and So defined flying unicorns, aka Pegacorns, to be AI companies that have reached the $100M revenue milestone and have graduated from startups to mature companies. $100M was the traditional benchmark for becoming a unicorn back when multiples were still 10x revenue, but reaching $100M in revenue also implies real scale that few startups ever achieve, Lorica notes: “If you can convince enough people to give you that much money cumulatively, then you must be onto something”.
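The arithmetic behind that benchmark is easy to spell out. The sketch below is only an illustration: the function names are made up for this example, and the $1B figure is simply the conventional unicorn valuation threshold. It shows how $100M in revenue at a 10x multiple lands exactly on unicorn status, and what multiples are implied when that status goes to a company with $1-5 million in revenue.

```python
UNICORN_VALUATION = 1_000_000_000  # conventional $1B unicorn threshold


def implied_valuation(revenue, multiple):
    """Valuation implied by applying a revenue multiple (hypothetical helper)."""
    return revenue * multiple


def implied_multiple(valuation, revenue):
    """Revenue multiple implied by a valuation (hypothetical helper)."""
    return valuation / revenue


# Traditional benchmark: $100M in revenue at a 10x multiple.
print(implied_valuation(100_000_000, 10))              # 1000000000

# A $1B valuation on $5M or $1M in revenue implies a far higher multiple.
print(implied_multiple(UNICORN_VALUATION, 5_000_000))  # 200.0
print(implied_multiple(UNICORN_VALUATION, 1_000_000))  # 1000.0
```

In other words, a unicorn with $1-5 million in revenue is priced at roughly 200x to 1000x revenue, against the 10x that used to be the norm.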
What Lorica found when investigating AI Pegacorns was that there are more successful AI application startups than successful AI infrastructure startups. Lorica thinks that this is not very surprising: “On the application side, the budget and the need is much more pronounced and specific. On the platform side, you have to have enough usage of AI and machine learning and you have enough people who can use these tools to justify such a big purchase”, he said.
An infrastructure area which Lorica sees as ripe for growth is databases for computer vision. Lorica related the findings of a team of people with long standing experience in computer vision who have used all the tools out there. They talked to about 90 computer vision teams and team leads, and they dissected numerous state-of-the-art computer vision datasets.
What they found was that common problems such as corrupted images, outliers, wrong labels, and duplicated images can reach a level of up to 46% in those datasets. As for the feedback they got from conversations — it turns out that the major pain point is not really models, it’s data and working with data. That team is working on developing fastdup, which aims to address those issues.
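To give a flavor of what two of those checks involve, here is a minimal sketch in plain Python. It is not how fastdup works: it only catches byte-identical duplicates via content hashing and flags files that are empty or do not start with a JPEG or PNG signature. The function names and the sample data are invented for illustration.

```python
import hashlib
from collections import defaultdict

JPEG_MAGIC = b"\xff\xd8"           # JPEG files start with these two bytes
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"   # fixed 8-byte PNG signature


def find_exact_duplicates(images):
    """Group image names whose raw bytes are identical (exact duplicates only)."""
    groups = defaultdict(list)
    for name, data in images.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def looks_corrupted(data):
    """Crude check: empty, or not starting with a known JPEG/PNG signature."""
    return not (data.startswith(JPEG_MAGIC) or data.startswith(PNG_MAGIC))


# Made-up stand-ins for image files: name -> raw bytes.
images = {
    "cat_1.jpg": JPEG_MAGIC + b"cat pixels",
    "cat_copy.jpg": JPEG_MAGIC + b"cat pixels",
    "dog.png": PNG_MAGIC + b"dog pixels",
    "broken.jpg": b"",
}

print(find_exact_duplicates(images))
# [['cat_1.jpg', 'cat_copy.jpg']]
print([name for name, data in images.items() if looks_corrupted(data)])
# ['broken.jpg']
```

Near-duplicates, outliers and wrong labels are much harder to catch, since they require comparing what the images depict rather than their bytes.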
At the same time, however, we are witnessing the emergence of vector databases. Vector databases enable storing data, including images, as vectors, i.e. in the form machine learning models work with. We asked Lorica whether vector databases could potentially be used to fill in the gap. His answer — yes, but first, it will probably be slower. And second, you may not be able to do some of the analysis, data cleaning, and all of the things that people take for granted if you mostly work in structured data.
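The core operation a vector database offers is similarity search over those vectors. The sketch below illustrates the idea with a brute-force cosine similarity scan in plain Python. It is a stand-in, not a real product's API: actual vector databases typically rely on approximate nearest-neighbor indexes, and the three-dimensional embeddings and file names here are invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query, index, k=1):
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Made-up image embeddings: name -> vector.
index = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "kitten.jpg": [0.8, 0.2, 0.1],
    "truck.jpg": [0.0, 0.1, 0.9],
}

print(nearest([1.0, 0.0, 0.0], index, k=2))
# ['cat.jpg', 'kitten.jpg']
```

Finding similar images this way is straightforward. What the sketch does not give you is the rest of the toolbox Lorica alludes to, such as filtering, profiling and cleaning the underlying data.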
“Believe it or not, there’s not been a lot of investment in data management solutions for visual data. For me, the results of the survey that he did were kind of a big AHA moment in terms of the team leads. I helped reach out to the team leads for many of these companies; if they are telling us that the tools out there are insufficient, then there must be an opportunity”, Lorica said.
Lorica sees automation and democratization in AI as being on the rise, and thinks that in domains where model hubs and foundation models (e.g. language models) are available, pre-trained models and embeddings reduce the need to train models from scratch. Increasingly, this also applies to multimodal models which combine text and other modalities such as images. Lorica believes we will begin to see commercial applications based on multimodal models in the next couple of years.
He expects a range of solutions including those available through simple declarative programming interfaces, and tools that utilize graphical user interfaces designed for domain experts and business users. With the global talent pool of analysts dwarfing that of data scientists and data engineers (2 million analysts versus only 83,000 data scientists, according to Lorica’s research on LinkedIn), Lorica notes we will likely see more machine learning tools designed specifically for analysts.
Overall, Lorica’s outlook for the future of AI is on the positive side. He sees AI becoming more affordable and more efficient, the pipeline from research to industry remaining robust, and Responsible AI gaining traction.