What is Delta Lake?
Delta Lake is an open source data storage format that combines Apache Parquet data files with a robust metadata log. The Delta Lake format brings key data management functions, such as ACID transactions and data versioning, to data lakes, making it the basis for many data lakehouses.
First developed by Databricks in 2016, Delta Lake is an open table format, an open source framework for tabular data that builds a metadata layer on top of existing file formats. Delta Lake specifically uses Parquet tables for data storage. Other open table formats include Apache Iceberg and Apache Hudi.
The metadata layer allows Delta Lake and other open tables to optimize search queries and support advanced data operations that many standard table formats cannot. Organizations often use Delta Lake to make their data lakes more reliable and intuitive.
The creation of Delta Lake was a critical step in the development of data lakehouse architecture, which combines the storage of a data lake with the performance of a data warehouse.
What’s the difference between Delta Lake and a data lake?
Delta Lake and data lakes are often discussed together, but it’s important to know that these technologies are distinct from one another.
A data lake is a low-cost data storage environment designed to handle massive datasets of any data type and format. Most data lakes use cloud object storage platforms such as Amazon Simple Storage Service (S3), Microsoft Azure Blob Storage or IBM Cloud® Object Storage.
Delta Lake is a tabular data storage format that an organization can use in a data lake or other data store.
Delta Lake is not a type of data lake, nor is it an alternative to a data lake. Rather, one can think of a data lake as the “where” and Delta Lake as the “how”:
- Where is the data stored? In a data lake.
- How is the data stored? As a Delta Lake table.
The Delta Lake format can help make data lakes more manageable and efficient.
Data lakes have many benefits, but they typically lack built-in data quality controls, and directly querying data lakes can be difficult. Organizations must often take data from a lake, clean it up and load it into separate data warehouses and data marts before it can be used.
By introducing a metadata layer, Delta Lake gives organizations a way to enforce schemas, track and roll back changes, and support ACID transactions.
Users can run structured query language (SQL) queries, analytics workloads and other activities right on a data lake, streamlining business intelligence (BI), data intelligence (DI), artificial intelligence (AI) and machine learning (ML).
How Delta Lake works
Delta Lake has 2 core components: the data files that contain data and the transaction log that houses metadata about those data files.
- **Data files** in Delta Lake are called “Delta Lake tables” or “Delta tables” and use the columnar Parquet file format.
- **The transaction log** stores recent activity in JSON log files and archives older metadata in Parquet files. The log is stored in the data lake alongside the data files.
The transaction log records information about the data files (such as column names and minimum and maximum values) and changes made to the files (what was changed, when, how and by whom).
The log is what makes Delta Lake different from a standard Parquet file. The transaction log essentially acts as a management layer and a set of instructions for all activity in the data lake, enabling features such as ACID transactions, time travel and schema evolution.
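The sketch below shows what this looks like on disk for a small table. It is a minimal example, assuming the open source delta-spark Python package and an illustrative local path; in practice the table would usually live in cloud object storage.

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Configure a local Spark session with Delta Lake support (delta-spark package).
builder = (
    SparkSession.builder.appName("delta-quickstart")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write a tiny Delta table to an illustrative local path.
spark.range(5).write.format("delta").save("/tmp/events")

# Resulting layout on disk:
# /tmp/events/
#   part-00000-....snappy.parquet        <- Parquet data files
#   _delta_log/
#     00000000000000000000.json          <- commit 0 of the transaction log
```

The later sketches in this article assume a Spark session configured this way.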
Regular Parquet files are immutable, meaning they can’t be changed after they are created; they can only be rewritten. Delta Lake’s transaction log makes Parquet files functionally, if not literally, mutable by separating physical actions (actions performed directly on the data) from logical actions (actions performed on the metadata).
For example, a user cannot remove a single column from a Parquet table without rewriting the whole file. In Delta Lake, the user can effectively remove that column by changing the table’s metadata to mark that column as deleted. The column remains in the file, but all subsequent queries, updates and writes see the metadata and treat the column as nonexistent.
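As a hedged illustration of such a metadata-only change, the following sketch drops a column with Spark SQL. The table and column names are hypothetical, and dropping columns requires the table’s column mapping mode to be enabled first.

```python
# Hypothetical table and column names; assumes the Delta-enabled Spark
# session configured in the earlier sketch.
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
        'delta.columnMapping.mode' = 'name',
        'delta.minReaderVersion' = '2',
        'delta.minWriterVersion' = '5'
    )
""")

# The column's bytes remain in the existing Parquet files; only the table
# metadata in the transaction log changes, so no data files are rewritten.
spark.sql("ALTER TABLE events DROP COLUMN user_agent")
```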
Key features of Delta Lake
ACID transactions
“ACID” stands for “atomicity, consistency, isolation and durability”—key properties of a reliable data transaction.
- **Atomicity** means that all changes to data are performed as if they are a single operation.
- **Consistency** means that data is in a consistent state when a transaction starts and when it ends.
- **Isolation** means that the intermediate state of a transaction is invisible to other transactions.
- **Durability** means that changes made to data persist and are not undone.
Standard data lakes cannot support ACID transactions. Without ACID guarantees, data lakes are susceptible to failed transactions, partial writes and other issues that can corrupt data.
Delta Lake’s transaction log can record transaction information in accordance with the ACID principles, making data lakes more reliable for streaming data pipelines, business intelligence, analytics and other use cases.
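For a concrete, if minimal, view of this: every successful write appears as one atomic commit in the table’s history, and an interrupted write never becomes visible to readers. The sketch assumes the Delta-enabled Spark session and illustrative table path from the earlier example.

```python
from delta.tables import DeltaTable

# Each append is committed atomically: readers see either all of it or none of it.
spark.range(5, 10).write.format("delta").mode("append").save("/tmp/events")

# One row per committed transaction: version, timestamp, operation and more.
events = DeltaTable.forPath(spark, "/tmp/events")
events.history().select("version", "timestamp", "operation").show()
```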
Schema enforcement and schema evolution
Administrators can set schema requirements in the transaction log, and these requirements apply to all data upon ingestion. Data that does not meet schema requirements is rejected.
Admins can also use the transaction log to change an existing table’s schema, such as adding new columns or changing column types. This process is called “schema evolution.”
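The sketch below shows both behaviors using the widely used mergeSchema write option; the table path and column names are illustrative, and it assumes the Delta-enabled Spark session from the earlier example.

```python
# Create an illustrative table with two columns.
df_v1 = spark.createDataFrame([(1, "login")], ["user_id", "event"])
df_v1.write.format("delta").mode("overwrite").save("/tmp/clicks")

# Schema enforcement: appending data with an unexpected column is rejected.
df_v2 = spark.createDataFrame([(2, "logout", "mobile")],
                              ["user_id", "event", "device"])
try:
    df_v2.write.format("delta").mode("append").save("/tmp/clicks")
except Exception as err:
    print("Write rejected:", type(err).__name__)

# Schema evolution: explicitly allow the new column to be merged into the schema.
(df_v2.write.format("delta")
      .mode("append")
      .option("mergeSchema", "true")
      .save("/tmp/clicks"))
```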
Query optimization
While not a traditional index, the transaction log can help queries retrieve data faster and more efficiently.
For example, say that a user is searching for a certain value in a column. Using the minimum and maximum column values recorded in the transaction log, the query can skip any file where the target value cannot possibly exist: if a file’s recorded minimum is greater than the target value, or its maximum is less than it, that file is skipped.
The transaction log also stores file paths. Instead of scanning the entire data lake, queries can use these file paths to head directly to relevant files.
Delta Lake can use techniques such as Z-ordering to store similar data closer together on disk, which makes it easier to skip irrelevant files and find relevant ones.
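A minimal sketch of both ideas, assuming the Delta-enabled Spark session and the illustrative /tmp/events table from the earlier sketches:

```python
from delta.tables import DeltaTable

events = DeltaTable.forPath(spark, "/tmp/events")

# Compact small files and cluster rows with similar `id` values together,
# which tightens each file's min/max statistics and improves data skipping.
events.optimize().executeZOrderBy("id")

# A query that filters on `id` can now skip files whose min/max range
# cannot contain the target value.
spark.read.format("delta").load("/tmp/events").where("id = 3").show()
```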
Data operations
Normal Parquet files are immutable, but users can manipulate Delta tables through the metadata layer. Delta Lake supports all kinds of data operations, including adding or dropping columns, updating entries and merging files.
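The sketch below shows common update, delete and merge patterns through the DeltaTable API. It assumes the Delta-enabled Spark session from earlier; the table path and the user_id and email columns are hypothetical.

```python
from delta.tables import DeltaTable
from pyspark.sql.functions import col, lit

users = DeltaTable.forPath(spark, "/tmp/users")  # hypothetical existing table

# Update rows that match a predicate. New data files are written and the old
# ones are marked as removed in the transaction log, not modified in place.
users.update(condition=col("user_id") == 1, set={"email": lit("new@example.com")})

# Delete rows that match a predicate.
users.delete(col("user_id") == 2)

# Merge (upsert) a batch of changes into the table.
changes = spark.createDataFrame(
    [(1, "a@example.com"), (3, "c@example.com")], ["user_id", "email"])
(users.alias("t")
      .merge(changes.alias("c"), "t.user_id = c.user_id")
      .whenMatchedUpdateAll()
      .whenNotMatchedInsertAll()
      .execute())
```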
Data versioning
Because the transaction log records everything that happens in the Delta tables, it effectively maintains version histories for each table. Users can query past versions and even time travel, that is, roll back changes to restore previous table versions.
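A minimal time-travel sketch, again assuming the Delta-enabled Spark session and the illustrative /tmp/events table:

```python
from delta.tables import DeltaTable

# Read the table as it existed at an earlier version (a timestamp works too,
# via the "timestampAsOf" option).
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events")
v0.show()

# Roll the live table back to that version.
DeltaTable.forPath(spark, "/tmp/events").restoreToVersion(0)
```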
Connectors
Delta Lake has a robust ecosystem of connectors. The format can be used with various compute engines, such as Apache Spark, Apache Hive, Apache Flink or Trino. Delta Lake also has application programming interfaces (APIs) for Python, Java, Scala and other languages, enabling developers to manage and query Delta tables programmatically.
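For example, the sketch below reads a Delta table without Spark at all, using the open source deltalake (delta-rs) Python package; the path is the illustrative one used earlier.

```python
from deltalake import DeltaTable  # pip install deltalake

table = DeltaTable("/tmp/events")
print(table.version())   # current version, read from the transaction log
print(table.files())     # the underlying Parquet data files
df = table.to_pandas()   # load the table into a pandas DataFrame
```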
Access and governance controls
While Delta Lake does not natively enforce security controls, it can integrate with data security and data governance tools. These tools can then use metadata from the transaction log to audit activity, track changes and enforce role-based access control (RBAC) policies.
Support for batch and streaming data
Delta Lake can accept both streaming and batch data, and data can be sent from Delta tables to connected services as a stream or in batches.
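The sketch below uses a Delta table as both a streaming source and a streaming sink with Spark Structured Streaming; the paths and checkpoint location are illustrative, and it assumes the Delta-enabled Spark session from earlier.

```python
# Read an existing Delta table as a stream of changes...
stream = spark.readStream.format("delta").load("/tmp/events")

# ...and continuously append the stream to another Delta table.
query = (stream.writeStream
               .format("delta")
               .option("checkpointLocation", "/tmp/checkpoints/events_copy")
               .outputMode("append")
               .start("/tmp/events_copy"))

# The same target table remains readable (and writable) in batch mode.
spark.read.format("delta").load("/tmp/events_copy").count()
```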
Recent developments
Delta Lake 4.0, the next scheduled major release, is expected to add features such as:
- Coordinated commits to streamline the process of writing to multiple tables and from multiple engines at once.
- A new “variant” data type for storing semistructured data, which is traditionally difficult to store in table format.
- Type widening, which enables users to widen a column’s data type (for example, from integer to long) without rewriting existing data files.
Delta Lake vs. other open table formats
Delta Lake vs. Apache Iceberg
Apache Iceberg is a high-performance open source format for massive analytic tables. Like Delta Lake, Iceberg builds a metadata layer on top of existing table formats to support ACID transactions and other operations in a data lake.
Iceberg can store data in Parquet, ORC or Avro files, whereas Delta Lake uses Parquet exclusively. Iceberg also uses a three-tiered metadata layer rather than a single transaction log like Delta Lake’s.
Iceberg integrates natively with many different query engines, and it is a common choice for SQL-based analytics in a data lake.
Delta Lake vs. Apache Hudi
Like Delta Lake and Iceberg, Hudi maintains a metadata layer on top of a data layer. Hudi can use Parquet, HFile and ORC file formats, and its metadata layer takes the form of a “timeline” that records everything that happens in the data layer.
Hudi is designed for incremental data processing, in which small batches of data are processed frequently. This focus on incremental processing makes Hudi a common choice for real-time analytics and change data capture (CDC).
Delta Lake’s role in the data lakehouse
The development of the Delta Lake format helped pave the way for the creation of data lakehouses.
For a long time, organizations primarily managed their data in data warehouses. While useful for analytics and BI, warehouses require strict schemas. They don’t work well with unstructured or semistructured data, which has become more prevalent and more important as organizations ramp up their investments in AI and ML.
The rise of data lakes in the early 2010s gave organizations a way to aggregate all kinds of data from all kinds of data sources in one location.
However, data lakes have their own issues. They often lack quality controls. They don’t support ACID transactions, and it’s not easy to query them directly.
To make data usable, organizations often needed to build separate extract, transform, load (ETL) data pipelines to move data from a lake to a warehouse.
Delta Lake emerged in 2016, adding ACID transactions, schema enforcement and time travel to data lakes, making them more reliable for direct querying and analytics.
Open sourced in 2019, Delta Lake played a key role in shaping the data lakehouse architecture, which combines the flexibility of data lakes with the performance of data warehouses.
Many organizations create data lakehouses by building a Delta Lake storage layer on top of an existing data lake and integrating it with a data processing engine such as Spark or Hive.
Data lakehouses help support data integration and streamline data architecture by eliminating the need to maintain separate data lakes and warehouses, which can lead to data silos.
In turn, these streamlined architectures help ensure that data scientists, data engineers and other users can access the data they need when they need it. AI and ML workloads are common use cases for Delta Lake-powered data lakehouses.
Data lakes are, on their own, already useful for these workloads because they can house massive amounts of structured, unstructured and semistructured data.
By adding features such as ACID transactions and schema enforcement, Delta Lake helps ensure training data quality and reliability in ways that standard data lakes cannot.