How Does Databricks SQL Support Normalization and Denormalization?
The Lakehouse inherits the flexibility of data lakes in terms of data modeling. In essence, it is modeling-agnostic. You can implement any modeling technique with Databricks SQL and the Lakehouse. For example, you can implement Kimball-style star schema data models, Inmon-style data marts, or a Data Vault.
This is because, at the end of the day, the data model is implemented via table DDLs or in semantic layers on the reporting side. Databricks SQL supports primary keys, surrogate keys, and foreign keys. That said, primary keys and foreign keys are currently not enforced – they are supported to allow interrogation from other tools.
ETL best practices for big data and Online Analytical Processing (OLAP) systems recommend disabling these keys for faster ingestion and independent ingestion of datasets. The key definition is mainly maintained in data modeling tools or semantic layers on the reporting side.
Databricks SQL fully supports SQL syntax, so loading a normalized or denormalized table will happen just like in any other SQL database.
Star schemas denormalize the data, which means adding redundant columns to some dimension tables to make querying and working with the data faster and easier. The purpose is to trade some redundancy (duplication of data) in the data model for increased query speed, by avoiding computationally expensive join operations.
In this model, the fact table is normalized but the dimensions tables are not. That is, data from the fact table exists only on the fact table, but dimensional tables may hold redundant data.