The Evolution of Enterprise Data Systems

The Evolution of Analytical Databases: From Monoliths to Decoupled Innovation

Explore how analytical databases have evolved from monoliths and how generative AI fits in.

You might have seen several pieces of news in recent weeks: Snowflake announcing plans to release Polaris, an open-source catalog for Iceberg; Databricks acquiring Tabular, a company founded by the creators of Iceberg, for over a billion dollars; and OpenAI acquiring Rockset to empower “companies to transform their data into actionable intelligence.” Here is an attempt to explain how these moves fit together by tracing the history of how enterprises have stored their data.

The Early Days: Monolithic Systems and Challenges

Cast your mind back to the 1990s. Databases were like fortresses—impenetrable and unyielding. Need to store data? You’d better have a crystal ball to predict your future needs! These early systems required upfront schema declarations and resource provisioning, enough to drive even the most patient infrastructure manager to despair.

As businesses expanded, so too did their data requirements. Scaling these titans? A monumental challenge. Flexibility and efficiency were mere fantasies in this rigid landscape.

A Paradigm Shift: The Rise of Open File Formats and Object Storage

Then the deluge of data arrived. Enterprises found themselves awash in a sea of unstructured and semi-structured data. This flood heralded a pivotal shift—storage costs plummeted, and the radical notion of “store now, analyze later” took root.

This era welcomed object storage solutions like Amazon S3, revolutionizing how data was stored and accessed—cheaply, conveniently, and reliably. It also saw the emergence of columnar storage formats such as Parquet and ORC. These open file formats not only supported schema evolution, efficient data compression, and complex nested data structures, but also made it possible to run analytical queries with acceptable performance directly against files in object storage.
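
To make this concrete, here is a minimal sketch, assuming the pyarrow library, of what these formats offer: nested data, compressed columnar storage, and reads that touch only the columns a query needs. The file name and data are illustrative.

import pyarrow as pa
import pyarrow.parquet as pq

# A small table with a nested column: each order carries a list of item structs.
orders = pa.table({
    "order_id": [1, 2],
    "items": [
        [{"sku": "A1", "qty": 2}],
        [{"sku": "B7", "qty": 1}, {"sku": "C3", "qty": 5}],
    ],
})

# Columnar layout plus compression keeps storage cheap.
pq.write_table(orders, "orders.parquet", compression="zstd")

# An analytical query scans only the columns it touches.
print(pq.read_table("orders.parquet", columns=["order_id"]))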

Meanwhile, the use cases for data began to diversify, extending beyond traditional SQL to accommodate machine learning workloads.

The Modern Era: Decoupling Compute from Storage and Data Lakes

The launch of platforms like Google’s BigQuery (Dremel) and Snowflake marked a significant evolution in analytical databases. These platforms separated compute from storage, enabling dynamic scaling and on-demand resource allocation. This shift eliminated the need for upfront resource provisioning, ushering in a new era of flexibility and efficiency.

Scaling became as effortless as adjusting a dial—need more processing power? Activate it on demand. Done crunching numbers? Scale back down.
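
As a rough sketch of that separation (not any particular vendor’s API), imagine the data sitting untouched in object storage while a compute engine exists only for the duration of a query. The example below uses DuckDB as a stand-in engine and a hypothetical S3 path; credentials and configuration are omitted.

import duckdb

con = duckdb.connect()  # ephemeral, on-demand compute
con.execute("INSTALL httpfs")  # extension for reading directly from S3
con.execute("LOAD httpfs")

# The data is scanned in place; nothing is loaded into the engine's own storage.
result = con.sql("""
    SELECT region, SUM(revenue) AS total_revenue
    FROM read_parquet('s3://example-bucket/sales/*.parquet')  -- hypothetical bucket
    GROUP BY region
""").fetchall()

con.close()  # the compute goes away; the data stays in object storage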

Around the same time, Databricks popularized data lakes, which allowed enterprises to store vast amounts of raw data in object storage such as S3 while using a table format to get table and transaction semantics on top of it. This approach provided a foundation for more flexible and scalable data architectures. Iceberg and Delta Lake were the two table formats that began gaining popularity.
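
Here is a minimal sketch of what a table format layers on top of raw files, assuming the deltalake Python package (from the delta-rs project) and a local path standing in for object storage: writes become atomic commits in a transaction log, and readers get a consistent, versioned snapshot.

import pandas as pd
from deltalake import DeltaTable, write_deltalake

events = pd.DataFrame({"user_id": [1, 2], "action": ["login", "purchase"]})

# Each write is an atomic commit recorded in the table's transaction log.
write_deltalake("./events_delta", events, mode="append")

# Readers see a consistent snapshot, with a schema and version history.
table = DeltaTable("./events_delta")
print(table.version())
print(table.to_pandas())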

Large data warehouses, including Snowflake and BigQuery, which had their own native storage formats, started supporting these table formats. At the recent Snowflake Summit, Snowflake even announced Polaris, an open-source catalog for Iceberg. At the same time (and while the Snowflake Summit was still underway), Databricks announced its acquisition of Tabular, the company founded by Iceberg’s creators, signaling its intention to unify table formats.

Beyond Storage: The Advent of Large Language Models

Another recent trend reshaping the enterprise data landscape is the adoption of generative AI. While ChatGPT has gained significant traction among consumers, the real monetization potential lies in helping enterprises unlock the value of their data using generative AI technologies. In this context, OpenAI’s acquisition of Rockset is a strategic move aimed at enhancing data analytics capabilities. As OpenAI stated:

“We’ll integrate Rockset’s technology across our products, empowering companies to transform their data into actionable intelligence.”

Meanwhile, major data platforms like Databricks, Snowflake, Google, Microsoft, and Amazon are not only developing or integrating large language models (LLMs) but are also incorporating vector databases. This development ensures that enterprises can effectively leverage the power of generative AI while adhering to the strict governance and security standards essential for business operations.
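
Stripped to its core, what a vector database adds is the ability to store embeddings and retrieve the ones closest to a query. The sketch below uses made-up three-dimensional embeddings in place of real model output; production systems layer approximate indexes, filtering, and governance on top.

import numpy as np

# Hypothetical embeddings for three internal documents (real ones come from a model).
doc_embeddings = np.array([
    [0.9, 0.1, 0.0],   # "quarterly revenue report"
    [0.1, 0.8, 0.1],   # "employee onboarding guide"
    [0.2, 0.1, 0.9],   # "incident postmortem"
])
query = np.array([0.85, 0.15, 0.05])  # embedding of the user's question

# Cosine similarity between the query and every stored document.
scores = doc_embeddings @ query / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query)
)
print("best match:", int(np.argmax(scores)))  # index of the most relevant document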

The race to integrate generative AI highlights a broader trend: both generative-AI-first companies and leading data platform companies are striving to position themselves at the forefront of enterprise data solutions. Their goal is to become indispensable partners in the journey toward data-driven business transformations.

Looking Ahead: A Bright and Data-Driven Future

As we continue to witness innovations in enterprise data systems, it is clear that the landscape is anything but static. Generative AI, combined with enterprises owning their data in open formats, promises a new data-driven future.