Data management is a critically important foundation for enabling applications, analytics, business intelligence and machine learning.
Over the course of 2020, a number of key trends emerged as data management vendors and users alike were affected by the global coronavirus pandemic and the need to accelerate data insights cost-effectively.
Among the clear trends that have emerged is the need for organizations to make better use of cloud storage to enable data lakes that are more than just data swamps. Multiple vendors and open source projects took up the challenge of optimizing data lakes in 2020, with different data lake engines and query technologies.
2021: Lakehouses and Iceberg on the horizon
Another key data management trend in 2020 was the concept of the data lakehouse. The data lakehouse is a technical architecture that combines the best elements of data lake and data warehouse models.
The lakehouse concept was pioneered by Databricks in 2019 with the vendor’s open source Delta Lake project. In 2020, the lakehouse concept became commercially available with the San Francisco-based vendor’s Delta Engine technology introduced in June and further expanded in the Databricks Unified Data Analytics Platform released in November.
“Databricks has long been known for supporting data science workloads, but it stepped up on the business intelligence and data warehousing side in 2020 with its lakehouse,” commented Doug Henschen, an analyst at Constellation Research.
Henschen added that it’s no simple matter meeting mission-critical needs for business intelligence and analytics at scale. While Databricks likes to tout query speed performance stats, in Henschen’s view that is just half the story. For 2021, he’s looking to see how Databricks’ technology is adopted by customers with high concurrency among users and queries.
While the lakehouse concept has its set of adherents, with Databricks and the open source Delta Lake project, a rival effort emerged in 2020 that is set to have a big year in 2021. That is the open source Apache Iceberg project, originally developed at streaming media giant Netflix.
“Iceberg is actually an open table format for huge analytic data sets,” explained Daniel Weeks, engineering manager for big data compute at Netflix, at the Subsurface virtual conference in July. “It’s an open community standard with a specification to ensure compatibility across languages and implementations.”
Beyond Netflix, both Apple and Expedia are early users of Iceberg, which is positioned to break out for wider adoption in 2021. To this point, Iceberg has been an open source community effort, but that will change in 2021 as enterprise-supported tools emerge. The earliest commercially supported platform that will integrate Iceberg is likely to be from Dremio, a data lake engine vendor based in Santa Clara, Calif.
Dremio was busy in 2020 building out its platform that enables users to query data lakes in an optimized system for business intelligence and analytics.
Dremio has been an active participant and contributor in the open source Iceberg project and is the host of the Subsurface conference. In 2021, the company plans to integrate Iceberg into its platform, which will provide an alternative to the Databricks lakehouse approach.
Whether an Iceberg-based method to enable easier data management in a data lake will be faster or more efficient than a lakehouse model remains to be seen, but it will be a key trend to watch in 2021.
Spark vs. Presto
Another emerging trend for data management in 2021 will be competition among data query engines.
The open source Apache Spark query engine had a major release in 2020 with its 3.0 milestone, which became generally available on June 18. Spark 3.0 introduced the Adaptive Query Execution (AQE) feature to accelerate data queries.
Challenging Spark in 2020 was the open source Presto project, which gained the support of multiple commercial vendors, all vying to take workload share from Spark.
Among the vendors that emerged in 2020 with Presto is Starburst, which raised $42 million in funding on June 16. The company’s core platform is Starburst Enterprise Presto, which was updated in July 2020 with capabilities to support data queries on Hadoop workloads and cloud data lakes.
Another vendor that emerged in 2020 to bring Presto to enterprises is Ahana, which raised $4.8 million in seed funding on Sept. 22. Alongside the financing, the company introduced its Ahana Cloud for Presto system, providing a managed service for organizations using Presto.
Adding further momentum to the growing use of Presto, on Dec. 8 the Varada Data Platform became generally available. Varada’s data virtualization platform embeds Presto as the engine that helps to enable data queries against disparate sources of data.
Presto is not likely to displace Spark as the dominant SQL query engine in 2021, but it will undoubtedly attract new users and vendors as enterprises seek to optimize data management queries.
Personal data management in 2021
While enabling organizations to more effectively use data is a key trend for 2021, so too is the need for improved personal data management.
Enterprise Strategy Group (ESG) analyst Mike Leone noted that the market for personal data management is made up of a collection of vendors, including new entrants such as Dataswift and Inrupt that are focused on enabling end users to control their own personal data.
“I think throughout this year, we’ll see end users demand more control of their own data and we’ll see governing bodies step up their game to address end-user data privacy concerns,” Leone said.