Delta Lake: Introduction


What is Delta Lake?

Delta Lake is the optimized storage layer that provides the foundation for storing data and tables in the Databricks Lakehouse Platform. Delta Lake is open source software that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling.

With Delta Lake’s tight integration with Structured Streaming and its support for incremental processing at scale, you can easily use a single copy of data for both batch and streaming operations. Delta Lake is fully compatible with Apache Spark APIs.

Delta Lake is the default storage format for all operations on Databricks; unless otherwise specified, all tables on Databricks are Delta tables. The Delta Lake protocol was originally developed by Databricks, which continues to contribute actively to the open source project. Many of the optimizations and products in the Databricks Lakehouse Platform build on the guarantees provided by Apache Spark and Delta Lake.
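
As a minimal sketch (assuming a Spark session with Delta Lake configured, which is the default on Databricks; the path /tmp/delta/events is only an illustrative choice), writing and reading a Delta table looks like this:

  # Minimal sketch: write a DataFrame as a Delta table, then read it back.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  df = spark.range(0, 5)  # toy data
  df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

  # The transaction log guarantees a consistent snapshot on read.
  spark.read.format("delta").load("/tmp/delta/events").show()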

Why should Delta Lake be used?

Delta Lake provides snapshot isolation for concurrent read/write operations and enables efficient inserts, updates, deletes, and rollbacks. It also allows background file optimization through compaction and Z-ordering, which improves query performance.
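
For instance, the compaction and Z-ordering mentioned above can be triggered manually with Databricks SQL; the table name events and the column event_date below are illustrative assumptions, and spark is the active SparkSession (predefined in Databricks notebooks):

  # Sketch: compact small files, then co-locate data by a column with Z-ordering.
  spark.sql("OPTIMIZE events")
  spark.sql("OPTIMIZE events ZORDER BY (event_date)")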

Converting and ingesting data to Delta Lake

Databricks provides a number of products to accelerate and simplify loading data to your lakehouse.

  • Delta Live Tables
  • COPY INTO
  • Auto Loader
  • Add data UI
  • Convert Parquet or Iceberg data to Delta Lake
  • Third-party partners
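
As one hedged example from the list above, COPY INTO loads files from a path into an existing Delta table; the table name my_table and the source path are illustrative assumptions:

  # Sketch: load JSON files into a Delta table with COPY INTO (Databricks SQL).
  spark.sql("""
      COPY INTO my_table
      FROM '/data/raw/events'
      FILEFORMAT = JSON
  """)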

Updating and modifying Delta Lake tables

Atomic transactions with Delta Lake provide many options for updating data and metadata. Databricks recommends that you avoid interacting directly with data and transaction log files in Delta Lake file directories, to prevent corrupting your tables.

  • Delta Lake supports upserts using the merge operation (see the sketch after this list).
  • Delta Lake provides numerous options for selective overwrites based on filters and partitions.
  • You can manually or automatically update your table schema without rewriting data.
  • Column mapping enables columns to be renamed or deleted without rewriting data.
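
A sketch of the merge-based upsert referenced in the list, using the delta-spark Python API; the table name customers, the source DataFrame updates_df, and the join key id are illustrative assumptions:

  # Sketch: upsert rows from updates_df into the "customers" Delta table.
  from delta.tables import DeltaTable

  target = DeltaTable.forName(spark, "customers")
  (target.alias("t")
      .merge(updates_df.alias("s"), "t.id = s.id")  # updates_df is assumed to exist
      .whenMatchedUpdateAll()                       # update rows that match on id
      .whenNotMatchedInsertAll()                    # insert rows that do not
      .execute())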

Incremental and streaming workloads on Delta Lake

On Databricks, Delta Lake is optimized for Structured Streaming. Delta Live Tables extends these native capabilities with simplified infrastructure deployment, enhanced scaling, and managed data dependencies.
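
As a hedged sketch of a Delta Live Tables dataset (this code only runs inside a DLT pipeline on Databricks; the source path is an illustrative assumption):

  # Sketch: a streaming DLT table fed by Auto Loader.
  import dlt

  @dlt.table(comment="Raw events ingested incrementally")
  def raw_events():
      return (spark.readStream.format("cloudFiles")   # Auto Loader source
              .option("cloudFiles.format", "json")
              .load("/data/raw/events"))              # assumed path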

Table streaming reads and writes

Delta Lake is deeply integrated with Spark Structured Streaming through readStream and writeStream. Delta Lake overcomes many of the limitations typically associated with streaming systems and files, including:

  • Coalescing small files produced by low latency ingest
  • Maintaining “exactly-once” processing with more than one stream (or concurrent batch jobs)
  • Efficiently discovering which files are new when using files as the source for a stream
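
A minimal sketch of this readStream/writeStream integration (the table names and checkpoint path are illustrative assumptions):

  # Sketch: stream from one Delta table into another with Structured Streaming.
  stream = spark.readStream.table("events")               # Delta table as streaming source

  (stream.writeStream
      .option("checkpointLocation", "/chk/events_copy")   # assumed checkpoint path
      .toTable("events_copy"))                            # Delta table as streaming sink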

The following sections cover two related capabilities: using the Delta Lake change data feed on Databricks and enabling idempotent writes across jobs.

Use Delta Lake change data feed on Databricks

Databricks can track row-level changes between versions of a Delta table using the change data feed. When it is enabled on a Delta table, the runtime records change events for all data written to the table. This includes the row data along with metadata indicating whether the given row was inserted, deleted, or updated.

You can read the change events in batch queries using Spark SQL, Apache Spark DataFrames, and Structured Streaming.
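
A hedged sketch of enabling the change data feed and reading the change events in batch; my_table and the starting version are illustrative assumptions:

  # Sketch: enable the change data feed, then read row-level changes in batch.
  spark.sql("ALTER TABLE my_table "
            "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

  changes = (spark.read.format("delta")
             .option("readChangeFeed", "true")
             .option("startingVersion", 0)        # read changes from version 0 onward
             .table("my_table"))
  changes.show()  # adds _change_type, _commit_version, _commit_timestamp columns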

Enable idempotent writes across jobs

Sometimes a job that writes data to a Delta table is restarted (for example, after a failure). The failed job may or may not have written its data to the Delta table before terminating. If the data was already written, the restarted job writes the same data again, resulting in duplicates.

To address this, Delta tables support the following DataFrameWriter options to make writes idempotent:

  • txnAppId: A unique string that you can pass on each DataFrame write. For example, this can be the name of the job.
  • txnVersion: A monotonically increasing number that acts as a transaction version. This number needs to be unique for the data being written to the Delta table(s). For example, this can be the epoch seconds of the instant when the query is first attempted. Any subsequent restarts of the same job need to use the same value for txnVersion.

This combination of options needs to be unique for each new batch of data ingested into the Delta table, and the txnVersion needs to be higher than the txnVersion of the last data ingested into the table. For example:

  • Last successfully written data contains option values as dailyETL:23423 (txnAppId:txnVersion).
  • Next write of data should have txnAppId = dailyETL and txnVersion of at least 23424 (one more than the txnVersion of the last written data).
  • Any attempt to write data with txnAppId = dailyETL and txnVersion of 23422 or less is ignored, because the txnVersion is less than the last recorded txnVersion in the table.

Attempting to write data with txnAppId:txnVersion as anotherETL:23424 succeeds in writing data to the table, because it uses a different txnAppId from the one in the last ingested data.
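
A sketch of the dailyETL example using these DataFrameWriter options; the source DataFrame df and the target path are illustrative assumptions:

  # Sketch: idempotent append keyed by txnAppId/txnVersion.
  (df.write.format("delta")
      .option("txnAppId", "dailyETL")
      .option("txnVersion", 23424)   # must exceed the last recorded version, 23423
      .mode("append")
      .save("/delta/daily_etl"))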

Querying previous versions of a table

Each write to a Delta table creates a new table version. You can use the transaction log to review modifications to your table and to query previous table versions. See Delta Lake Operations for more details.
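
A hedged sketch of time travel queries; my_table and version 5 are illustrative assumptions:

  # Sketch: inspect table history, then query an earlier version (time travel).
  spark.sql("DESCRIBE HISTORY my_table").show()

  old = (spark.read.format("delta")
         .option("versionAsOf", 5)   # assumed earlier version number
         .table("my_table"))

  # Equivalent SQL form:
  spark.sql("SELECT * FROM my_table VERSION AS OF 5").show()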
