Delta Live Tables


Delta Live Tables is a framework for building reliable, maintainable, and testable data processing pipelines. Delta Live Tables controls task orchestration, cluster management, monitoring, data quality, and error handling while you specify the data transformations to be applied to your data.

Instead of building your data pipelines from a series of separate Apache Spark tasks, you define a target schema for each processing stage and Delta Live Tables manages how your data is transformed to match it. You can also enforce data quality standards with Delta Live Tables: expectations let you declare the expected level of data quality and specify how to handle records that fail to meet it.

Delta Live Tables (DLT) makes it easy to build and manage reliable batch and streaming data pipelines that deliver high-quality data on the Databricks Lakehouse Platform. With declarative pipeline construction, automatic data testing, and deep visibility for monitoring and recovery, DLT assists data engineering teams in streamlining ETL development and management.

Delta Live Table Concepts

This section introduces the fundamental concepts you should understand to use Delta Live Tables effectively.

Pipelines

You implement Delta Live Tables pipelines in Databricks notebooks. You can implement pipelines in a single notebook or in multiple notebooks. All queries in a single notebook must be implemented in either Python or SQL, but you can configure multiple-notebook pipelines with a mix of Python and SQL notebooks. Each notebook shares a storage location for output data and is able to reference datasets from other notebooks in the pipeline.

You can use Databricks Repos to store and manage your Delta Live Tables notebooks. To make a notebook managed with Databricks Repos available when you create a pipeline:

  • Add the comment line -- Databricks notebook source at the top of a SQL notebook.
  • Add the comment line # Databricks notebook source at the top of a Python notebook.

You can also use a Databricks repo to store your Python code and import it as modules in your pipeline notebook. 
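As a rough sketch, assuming a helper file named transforms.py with a clean_events function checked into the same repo (both names are hypothetical), a pipeline notebook might import and apply it like this:

import dlt
from transforms import clean_events  # hypothetical module stored in the same Databricks repo

@dlt.table(comment="Events cleaned by a shared helper function from the repo.")
def cleaned_events():
    # dlt.read references a dataset defined elsewhere in the same pipeline;
    # "raw_events" is a placeholder name.
    return clean_events(dlt.read("raw_events"))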

Queries

Queries implement data transformations by defining a data source and a target dataset. Delta Live Tables queries can be implemented in Python or SQL.
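For illustration, here is a minimal Python query, assuming raw JSON files land in a cloud storage path (the path and dataset name below are placeholders). The query names its source and declares the target dataset, and Delta Live Tables materializes the result:

import dlt

@dlt.table(comment="Raw events ingested from cloud storage.")
def raw_events():
    # `spark` is the SparkSession provided by the Databricks notebook;
    # the input path is a placeholder.
    return spark.read.format("json").load("/mnt/landing/events/")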

Expectations

You use expectations to specify data quality controls on the contents of a dataset. Unlike a CHECK constraint in a traditional database which prevents adding any records that fail the constraint, expectations provide flexibility when processing data that fails data quality requirements. This flexibility allows you to process and store data that you expect to be messy and data that must meet strict quality requirements.

You can define expectations to retain records that fail validation, drop records that fail validation, or halt the pipeline when a record fails validation.
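A sketch of those three behaviors in the Python API (the constraints and dataset names are illustrative): @dlt.expect records violations but retains the rows, @dlt.expect_or_drop discards failing rows, and @dlt.expect_or_fail halts the update.

import dlt

@dlt.table(comment="Cleaned events with data quality expectations applied.")
@dlt.expect("valid_timestamp", "event_time IS NOT NULL")           # keep failing rows, but track violations
@dlt.expect_or_drop("valid_user", "user_id IS NOT NULL")           # drop failing rows
@dlt.expect_or_fail("valid_schema_version", "schema_version = 2")  # halt the update on failure
def clean_events():
    return dlt.read("raw_events")  # "raw_events" is a placeholder upstream dataset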

Pipeline settings

Pipeline settings are defined in JSON and include the parameters required to run the pipeline, including:

  • Libraries (in the form of notebooks) that contain the queries that describe the tables and views to create the target datasets in Delta Lake.
  • A cloud storage location where the tables and metadata required for processing will be stored. This location is either DBFS or another location you provide.
  • Optional configuration for a Spark cluster where data processing will take place.
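As a sketch only (the name, paths, and worker count below are placeholders), a minimal settings document combining these parameters might look like the following; the optional target field names a database to which the pipeline's tables are published:

{
  "name": "example-pipeline",
  "libraries": [
    { "notebook": { "path": "/Repos/user@example.com/project/dlt_queries" } }
  ],
  "storage": "/pipelines/example-pipeline",
  "clusters": [
    { "label": "default", "num_workers": 2 }
  ],
  "target": "analytics",
  "continuous": false
}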

Datasets

There are two types of datasets in a Delta Live Tables pipeline: views and tables.

  • Views are similar to a temporary view in SQL and are an alias for some computation. A view allows you to break a complicated query into smaller or easier-to-understand queries. Views also allow you to reuse a given transformation as a source for more than one table. Views are available within a pipeline only and cannot be queried interactively.
  • Tables are similar to traditional materialized views. The Delta Live Tables runtime automatically creates tables in the Delta format and ensures those tables are updated with the latest result of the query that creates the table.
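To illustrate the distinction (the dataset names and path are hypothetical), a view can hold an intermediate transformation that two tables reuse, while only the tables are persisted in Delta format:

import dlt
from pyspark.sql.functions import col, countDistinct

@dlt.view(comment="Intermediate transformation; available only within the pipeline.")
def events_with_dates():
    # `spark` is the notebook's SparkSession; the input path is a placeholder.
    return spark.read.format("json").load("/mnt/landing/events/") \
        .withColumn("event_date", col("event_time").cast("date"))

@dlt.table(comment="Persisted Delta table derived from the shared view.")
def daily_event_counts():
    return dlt.read("events_with_dates").groupBy("event_date").count()

@dlt.table(comment="A second table reusing the same view as its source.")
def daily_distinct_users():
    return dlt.read("events_with_dates").groupBy("event_date") \
        .agg(countDistinct("user_id").alias("distinct_users"))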

You can define a live or streaming live view or table:

A live table or view always reflects the results of the query that defines it, including when the query defining the table or view is updated or an input data source is updated. Like a traditional materialized view, the contents of a live table or view may be entirely recomputed when possible to optimize computation resources and time.

A streaming live table or view processes data that has been added only since the last pipeline update. Streaming tables and views are stateful; if the defining query changes, new data will be processed based on the new query and existing data is not recomputed.
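A sketch of both flavors in Python (the dataset names are placeholders): the first table is recomputed from its complete input on each update, while the second uses dlt.read_stream and processes only data appended since the previous update.

import dlt

@dlt.table(comment="Live table: recomputed from the full input on each update.")
def customer_summary():
    return dlt.read("customers").groupBy("region").count()

@dlt.table(comment="Streaming live table: processes only newly appended input rows.")
def events_incremental():
    return dlt.read_stream("raw_events").where("event_type IS NOT NULL")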

Streaming live tables are valuable for a number of use cases, including:

  • Data retention: a streaming live table can preserve data indefinitely, even when an input data source has low retention, for example, a streaming data source such as Apache Kafka or Amazon Kinesis.
  • Data source evolution: data can be retained even if the data source changes, for example, moving from Kafka to Kinesis.

You can publish your tables to make them available for discovery and querying by downstream consumers.

Continuous and triggered pipelines

Delta Live Tables supports two different modes of execution:

  • Triggered pipelines update each table with whatever data is currently available and then stop the cluster running the pipeline. Delta Live Tables automatically analyzes the dependencies between your tables and starts by computing those that read from external sources. Tables within the pipeline are updated after their dependent data sources have been updated.
  • Continuous pipelines update tables continuously as input data changes. Once an update is started, it continues to run until manually stopped. Continuous pipelines require an always-running cluster but ensure that downstream consumers have the most up-to-date data.

Triggered pipelines can reduce resource consumption and expense since the cluster runs only long enough to execute the pipeline. However, new data won’t be processed until the pipeline is triggered. Continuous pipelines require an always-running cluster, which is more expensive but reduces processing latency.

The continuous flag in the pipeline settings controls the execution mode. Pipelines run in triggered execution mode by default. Set continuous to true if you require low latency refreshes of the tables in your pipeline.

{
  ...
  "continuous": true,
  ...
}

The execution mode is independent of the type of table being computed. Both live and streaming live tables can be updated in either execution mode.

If some tables in your pipeline have weaker latency requirements, you can configure their update frequency independently by setting the pipelines.trigger.interval setting:

spark_conf={"pipelines.trigger.interval": "1 hour"}

This option does not turn off the cluster in between pipeline updates, but can free up resources for updating other tables in your pipeline.
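For example, the interval can be applied per table through the spark_conf argument of the table decorator in Python (the table below is hypothetical):

import dlt

@dlt.table(
    comment="Updated at most once per hour, even in a continuous pipeline.",
    spark_conf={"pipelines.trigger.interval": "1 hour"}
)
def hourly_summary():
    return dlt.read("events").groupBy("event_type").count()  # "events" is a placeholder upstream dataset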
