These days, powerful cloud data platforms like Snowflake, Databricks, and BigQuery empower business-critical use cases from petabyte-scale analytics to cross-cloud data lakes and machine learning. But how did we get here?
Many young data engineers might be surprised to learn that the journey of databases as we know them today dates back to the 1970s. In fact, much of the groundwork for what we do today, from the way we conceptualize data to the SQL we use for querying it, was laid decades before cloud computing, distributed computing, and even the internet.
This timeline highlights how foundational research in the 1970s laid the groundwork for widespread commercial adoption, subsequent standardization, and ongoing innovation, ultimately transforming relational databases into powerful and versatile cornerstones of modern data management.
For more detailed information about the milestones listed in this infographic, please see below.
1970 – E. F. Codd’s Seminal Paper
- Key event: Edgar F. Codd publishes “A Relational Model of Data for Large Shared Data Banks.”
- Significance: Lays the theoretical groundwork for relational databases, introducing the concepts of relations, tuples, and normalization.
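To put those terms in today's vocabulary, here is a minimal, hypothetical SQL sketch (SQL itself came a few years later): each table is a relation, each row a tuple, and keeping customer attributes in their own table rather than repeating them on every order is a simple example of normalization. All names are invented for illustration.

```sql
-- Hypothetical example: each table is a relation, each row a tuple.
-- Storing customer attributes once, instead of repeating them on every
-- order, is a simple act of normalization.
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        VARCHAR(100),
    city        VARCHAR(100)
);

CREATE TABLE purchase_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer (customer_id),
    order_date  DATE,
    amount      DECIMAL(10, 2)
);
```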
1976 – Chen’s Entity-Relationship (ER) Model
- Key event: Peter Chen’s Publication: “The Entity-Relationship Model—Toward a Unified View of Data.”
- Significance: Chen notation introduces ER diagrams as a way to visually represent entities, relationships, and attributes, becoming a foundational approach for conceptual data modeling.
Late 1970s / Early 1980s – System R, INGRES, and SQL
- Key event: IBM’s System R builds on Codd’s ideas and pioneers SQL (originally called SEQUEL), while UC Berkeley develops INGRES, a precursor to Postgres.
- Significance: These influential research projects solidify SQL as the primary means of communicating with relational databases and pave the way for future systems such as PostgreSQL.
1979 – Oracle’s First Commercial RDBMS
- Key Event: Oracle (then Relational Software, Inc.) releases the first commercially available relational database leveraging SQL.
- Significance: Proves that relational theory can be successfully commercialized and broadly adopted in businesses.
1981 – Barker Notation
- Key event: Richard Barker develops this notation while working on CASE tools for data modeling.
- Significance: Barker notation focuses on entity-relationship modeling, featuring a cleaner and more streamlined diagram style that emphasizes clarity in large-scale database design.
Mid-1980s – IDEF1X Notation
- Key event: Developed under U.S. Air Force projects for data modeling.
- Significance: Standardizes data modeling techniques, specifying entities, relationships, and key constraints in a way particularly suitable for relational schema design and government/enterprise documentation.
1983 – IBM DB2
- Key event: IBM introduces DB2 on mainframe systems.
- Significance: Solidifies the enterprise use of relational databases on a large scale.
1985 – Microsoft Excel
- Key event: Microsoft releases Excel.
- Significance: Excel, a spreadsheet app that organizes data in rows and columns, competes with the market leader, Lotus 1-2-3. Its popularity reinforces the public’s familiarity with tabular data structures similar to those found in databases.
Mid-1980s – SQL Standardization
- Key event: American National Standards Institute (ANSI) SQL (1986) and International Organization for Standardization (ISO) SQL (1987) standards adopted.
- Significance: Solidifies SQL as the standard relational query language, facilitating interoperability across different vendors.
1992 – Bill Inmon and the Data Warehouse Concept
- Key event: Bill Inmon introduces the Data Warehouse Concept
- Significance: Bill Inmon introduced the term “data warehouse” in a series of articles and papers published in the late 1980s and formalized the concept in his influential book, “Building the Data Warehouse,” first published in 1992. In the book, Inmon defined a data warehouse as “a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process.” This definition and the architectural principles around it became the cornerstone of enterprise data warehousing strategies for decades.
- Methodology:
- Top-down, enterprise-wide approach (Corporate Information Factory).
- Emphasizes a centralized data warehouse with normalized data, supporting broad historical analysis.
Mid-1990s – Proliferation of Commercial and Open-Source RDBMS
- Key event: SQL Server, MySQL, and PostgreSQL emerge as popular RDBMS solutions for both enterprise and open-source projects.
- Significance: Microsoft SQL Server emerges (building on Sybase code), becoming a major competitor on Windows platforms. In 1995, MySQL is released, offering a lightweight, open-source relational database that fuels the growth of database-backed websites. PostgreSQL evolves from the POSTGRES project at Berkeley, introducing advanced features such as object-relational capabilities.
1996 – Ralph Kimball and Dimensional Modeling
- Key event: Ralph Kimball introduces dimensional modeling
- Significance: Kimball formalizes his bottom-up, dimensional modeling philosophy in publications and training, most notably in his 1996 book, The Data Warehouse Toolkit (later editions co-authored with Margy Ross). His approach contrasts with Inmon’s enterprise-focused one, emphasizing star schemas, data marts, and business-process-centric design (a minimal star-schema sketch follows the methodology list below).
- Methodology:
- Bottom-up approach with a focus on star schemas and data marts (facts and dimensions).
- Iterative development of data warehouses aligned with specific business processes.
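To make the star-schema idea concrete, here is a minimal, hypothetical sketch in SQL: one fact table for a retail-sales business process surrounded by descriptive dimension tables. The table and column names are invented for illustration, not taken from Kimball’s book.

```sql
-- Hypothetical star schema for a retail-sales business process:
-- a central fact table of measurements that references
-- descriptive dimension tables.
CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,
    calendar_date DATE,
    month_name    VARCHAR(20),
    year_number   INTEGER
);

CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(100),
    category     VARCHAR(50)
);

CREATE TABLE dim_store (
    store_key  INTEGER PRIMARY KEY,
    store_name VARCHAR(100),
    region     VARCHAR(50)
);

CREATE TABLE fact_sales (
    date_key      INTEGER REFERENCES dim_date (date_key),
    product_key   INTEGER REFERENCES dim_product (product_key),
    store_key     INTEGER REFERENCES dim_store (store_key),
    quantity_sold INTEGER,
    sales_amount  DECIMAL(12, 2)
);
```

Analytic queries then join the fact table to whichever dimensions a business question needs, for example sales by region and month.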
1997 – Unified Modeling Language (UML)
- Key event: Created by Grady Booch, Ivar Jacobson, and James Rumbaugh, standardized by the OMG (Object Management Group) around 1997.
- Significance: Although primarily used for object-oriented software design, UML class diagrams are often employed to model data structures conceptually, bridging application development and database design. UML never fully lives up to its promise of a single, “unified” modeling standard, however.
2001 – Data Vault
- Key event: Dan Linstedt introduces the Data Vault methodology
- Significance: Dan Linstedt introduced Data Vault in the early 2000s through white papers, conference talks, and consulting engagements. The methodology was expanded in 2015 with Data Vault 2.0, which incorporated concepts such as big data and NoSQL.
- Methodology:
- A modeling methodology for data warehouses designed to handle change, scalability, and historical tracking.
- Uses Hubs (business keys), Links (relationships), and Satellites (context/history) to decouple structure and improve agility.
- Ideal for enterprise data warehouses that require auditability, versioning, and agility under evolving business rules (see the sketch below).
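As a rough illustration of these structures, the hypothetical SQL below sketches a Hub, a Link, and a Satellite for a customer/order example; the hash-key style and all names are assumptions made for this sketch, not a prescribed Data Vault standard.

```sql
-- Hypothetical Data Vault sketch: Hubs hold business keys, Links record
-- relationships between Hubs, and Satellites hold descriptive context
-- and its history.
CREATE TABLE hub_customer (
    customer_hk   CHAR(32) PRIMARY KEY,  -- hash of the business key
    customer_bk   VARCHAR(50),           -- business key from the source system
    load_date     TIMESTAMP,
    record_source VARCHAR(50)
);

CREATE TABLE hub_order (
    order_hk      CHAR(32) PRIMARY KEY,
    order_bk      VARCHAR(50),
    load_date     TIMESTAMP,
    record_source VARCHAR(50)
);

CREATE TABLE link_customer_order (
    customer_order_hk CHAR(32) PRIMARY KEY,
    customer_hk       CHAR(32) REFERENCES hub_customer (customer_hk),
    order_hk          CHAR(32) REFERENCES hub_order (order_hk),
    load_date         TIMESTAMP,
    record_source     VARCHAR(50)
);

CREATE TABLE sat_customer_details (
    customer_hk   CHAR(32) REFERENCES hub_customer (customer_hk),
    load_date     TIMESTAMP,
    name          VARCHAR(100),
    email         VARCHAR(100),
    record_source VARCHAR(50),
    PRIMARY KEY (customer_hk, load_date)  -- each load adds a new history row
);
```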
2006 – Hadoop and Distributed Computing
- Key event: Hadoop and MapReduce popularized
- Significance: Hadoop, inspired by Google’s MapReduce paper, enables large-scale, batch-oriented data processing on commodity hardware. Hadoop challenges the dominance of traditional RDBMS for certain analytic workloads and sparks the broader big data movement.
2006 – Amazon Web Services (AWS) Officially Launched
- Key event: Amazon offers services like S3 (Simple Storage Service) and EC2 (Elastic Compute Cloud)
- Significance: Introduces the world to scalable, on-demand cloud infrastructure, laying the groundwork for cloud-native databases and data warehouses.
Mid-2000s – The NoSQL Movement
- Key event: NoSQL (Not Only SQL) databases released to meet the demands of big data
- Significance: The demand for high availability, horizontal scalability, and flexible schemas in web-scale applications drives innovations such as MongoDB, Cassandra, CouchDB, and Redis. NoSQL databases often drop rigid schemas in favor of document, key-value, or wide-column models.
2010 – Google BigQuery
- Key event: Google releases BigQuery in beta, with general availability in 2011
- Significance: A fully-managed, serverless data warehouse for large-scale analytics, using columnar storage and distributed querying. Showcases how cloud-native, on-demand services can disrupt traditional on-prem data warehousing.
2010 – Data Lake Concept Introduced
- Key event: James Dixon introduces the concept of a data lake.
- Significance: In contrast to a data mart, a data lake is a centralized repository designed to store and process large amounts of structured, semi-structured, and unstructured data in its native format, enabling various types of analytics, including big data processing, real-time analytics, and machine learning.
2012 – Snowflake Founded
- Key event: Snowflake is founded in 2012 and becomes generally available in 2015.
- Significance: A cloud-native data platform designed for near-zero maintenance, Snowflake offers scalable compute and storage, near-infinite scalability for concurrent workloads, and simplified administration.
2013 – Amazon Redshift
- Key event: Redshift is released
- Significance: A cloud data warehousing platform based on PostgreSQL, Redshift uses massively parallel processing and column-oriented storage to handle analytic workloads on large data sets.
2013 – Databricks
- Key event: Databricks founded
- Significance: A unified analytics platform combining data engineering, data science, and warehousing. Databricks later pioneers the lakehouse concept, which merges data lake scalability with data warehouse performance, blurring traditional boundaries.
2018 – dbt Launches Commercial Offering
- Key event: Fishtown Analytics releases dbt as a commercial product
- Significance: dbt (data build tool) Core has been available as open source since 2016. In 2018, the dbt Labs team (then called Fishtown Analytics) released a commercial product on top of dbt Core. The tool lets engineers write transformations as modular, version-controlled SQL models and use templating to decouple transformation logic from environment- and object-specific parameters, enabling dynamic data pipelines.
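As a rough sketch of what this looks like in practice, the hypothetical dbt model below is a single SQL file that dbt compiles and runs; the `{{ config() }}` and `{{ ref() }}` macros are real dbt constructs, while the model and column names (`stg_orders`, `stg_customers`, and so on) are invented for the example.

```sql
-- models/customer_orders.sql (hypothetical dbt model)
-- dbt compiles the Jinja templating into plain SQL for the target warehouse;
-- ref() resolves each upstream model to the correct database and schema,
-- keeping the transformation logic independent of physical object names.
{{ config(materialized='table') }}

with orders as (
    select * from {{ ref('stg_orders') }}
),

customers as (
    select * from {{ ref('stg_customers') }}
)

select
    customers.customer_id,
    count(orders.order_id) as order_count,
    sum(orders.amount)     as lifetime_value
from customers
left join orders
    on orders.customer_id = customers.customer_id
group by customers.customer_id
```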
2019 – Data Mesh
- Key event: Zhamak Dehghani introduces the Data Mesh framework
- Significance: In a blog post titled “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh”, Dehghani introduces the world to Data Mesh. The framework takes a socio-technical approach to data architecture and promotes decentralized, domain-oriented ownership of data. Data Mesh treats data as a product, with cross-functional teams owning pipelines, quality, and access.
- Methodology: Data Mesh includes four key principles:
- Domain-oriented ownership
- Data as a product
- Self-serve infrastructure
- Federated computational governance
2019 – Data Lakehouse Concept
- Key event: The Lakehouse concept blurs the line between data lake and warehouse
- Significance: Cloud data platforms like Databricks pioneer a hybrid approach that can ingest a variety of raw data formats, similar to a data lake, yet provide ACID transactions and enforce data quality, much like a data warehouse.
2022 – ChatGPT Released by OpenAI
- Key event: ChatGPT is introduced to the public
- Significance: ChatGPT, a generative artificial intelligence chatbot developed by OpenAI, raises the bar for natural language processing and demonstrates that AI is far more capable than many previously thought.
2023 – Generative AI Meets Structured Data
- Key event: Cloud platforms like Databricks and Snowflake introduce AI copilots
- Significance: LLMs are integrated into data platforms for code generation, SQL generation, semantic querying, and autonomous agents for data exploration. The move reframes how users interact with data warehouses, shifting from hand-written structured queries to conversational, AI-assisted analytics.