DELTA LAKE ESSENTIALS

Delta Lake 101 — Part 1

Level: Intermediate

RK Iyer · Published in Microsoft Azure · Jul 19, 2022


❑ Background

Most companies today understand the power of data and are in relentless pursuit of making the best use of it to take informed business decisions. In this quest to become data-driven organizations, they have started creating data lakes for data storage due to their inherent flexibility and advantages over data warehouses.

Data lakes give us the flexibility to store all types of data, be it structured, semi-structured, or unstructured, at a relatively low cost compared to data warehouses. Also, the data schema is decided on read rather than on load or write. You can always change the schema if there is extra information or structure you need from the raw data, leading to greater organizational agility.

Organizations love this because it lets them collect all kinds of data, be it sensor readings or logs, and extract value from it in the future. However, the flexibility that data lakes bring to data management and storage is also one of their biggest shortcomings. Because they store all types of data for long periods, data lakes can quickly become data swamps.

A data swamp is a collection of faulty, outdated, or irrelevant data with no data governance or automation, making it difficult to navigate and manage.

Delta Lake was designed to bring the governance, reliability and structure of data warehouses into data lakes to, above all else, ensure that the data in an organization’s data lake is reliable for use.

In this two-part blog series, we will walk through the challenges of building a data lake, what Delta Lake is, its different elements and features, and how Delta Lake helps overcome the data lake challenges mentioned above.

❑ Challenges of building a Data Lake

I have faced a lot of challenges while implementing data lakes in the past. Below are some of these challenges along with the workarounds we used.

Performing updates & deletions —

Challenge — Data lakes/Hadoop started with the concept of write once, read many times, but the main challenge was handling updates and deletes in an efficient and performant manner.

Solution — One method of handling updates and deletes was to read the entire table, perform the update, delete, and insert operations in a DataFrame, and write the entire table back. This was not suitable for large tables.

Another method, used for large transactional tables, was to create a separate table (an update-delete table) holding the updated and deleted records (with a delete flag set to true) so that writes stayed fast. At read time, the main table and the update-delete table were joined to get the latest records. Compaction was performed weekly during off-peak hours to merge the two tables.
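To make the read-time merge concrete, here is a minimal PySpark sketch of that workaround; the paths, the key column (id), the ordering column (updated_at), and the delete_flag column are hypothetical placeholders, and both tables are assumed to share the same schema.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("read-time-merge").getOrCreate()

# Hypothetical paths: the large, rarely rewritten main table and the small
# table holding updated/deleted records (delete_flag = true marks deletes).
main_df = spark.read.parquet("/lake/main_table")
upd_del_df = spark.read.parquet("/lake/update_delete_table")

# Union both tables, keep only the most recent record per key, then drop
# records whose latest version is flagged as deleted.
latest_first = Window.partitionBy("id").orderBy(F.col("updated_at").desc())

current_df = (
    main_df.unionByName(upd_del_df)
    .withColumn("rn", F.row_number().over(latest_first))
    .filter("rn = 1")
    .filter("delete_flag = false")
    .drop("rn")
)
current_df.show()
```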

Consistency of data —

Challenge — Jobs would fail intermittently, causing end users to see inconsistent data.

Solution — A data pipeline consisting of cleanup and reprocessing jobs had to be designed to ensure record-level consistency.
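One common way to make such reprocessing idempotent, shown below purely as an illustration rather than the exact pipeline we built, is to overwrite only the partitions a job touches, so a failed run can simply be re-executed without leaving duplicates or partial output; the paths and columns are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("idempotent-reprocess").getOrCreate()

# Reprocess a single day of data (hypothetical layout, partitioned by dt).
day_df = spark.read.parquet("/lake/raw/events").where("dt = '2022-07-18'")
cleaned_df = day_df.dropDuplicates(["event_id"])   # hypothetical cleanup step

# Dynamic partition overwrite replaces only the partitions present in the
# output, so rerunning the job after a failure is safe.
(cleaned_df.write
    .mode("overwrite")
    .option("partitionOverwriteMode", "dynamic")
    .partitionBy("dt")
    .parquet("/lake/curated/events"))
```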

Historical versioning of data —

Challenge — In some industries it is important to keep multiple historical versions of data. In a plain data lake this becomes extremely costly and difficult to manage.

Solution — For one of our customers, we used HBase, which has an out-of-the-box feature to maintain multiple versions of data. Please refer to the Apache HBase™ Reference Guide for more details.

Data Quality/Schema Validation —

Challenge — Hadoop/big data systems have no built-in schema enforcement or validation checks (permissible data types and so on) the way an RDBMS or data warehouse does. Schema-on-read and schema evolution are great features of Hadoop and NoSQL, but you can land in big trouble if you allow good and bad data to mix.

Solution — To solve this, we created a separate module that checked the quality of incoming data and sent a report back to the source system, thereby keeping bad data out of the silver and gold layers.
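A highly simplified sketch of such a quality gate is shown below; the rules, column names, and paths are hypothetical, and the real module also produced a detailed report for the source team.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-check").getOrCreate()

raw_df = spark.read.json("/lake/bronze/orders")   # hypothetical raw/bronze path

# Hypothetical rules: the key must be present and amount must be numeric.
is_bad = F.col("order_id").isNull() | F.col("amount").cast("double").isNull()

good_df = raw_df.filter(~is_bad)
bad_df = raw_df.filter(is_bad)

# Only clean records move on to the silver layer; rejects are parked aside
# so a quality report can be sent back to the source system.
good_df.write.mode("append").parquet("/lake/silver/orders")
bad_df.write.mode("append").json("/lake/rejects/orders")

print(f"rows passed: {good_df.count()}, rows rejected: {bad_df.count()}")
```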

Same data, different use cases — In many cases the same data is processed multiple times for real-time and batch-processing use cases, increasing cost and complexity. This was quite difficult to solve since, in most cases, a Lambda architecture was preferred.

Delta Lake addresses some of these pain points. Let’s understand what Delta Lake is and how it helps address them.

❑ What is Delta Lake?

Delta Lake is a file-based, open-source storage format that provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lakes and is compatible with Apache Spark and other processing engines like PrestoDB, Flink, Trino, and Hive.

It was developed at Databricks and open-sourced in early 2019. For more details, see the Delta Lake documentation: https://delta.io
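As a quick taste, here is a minimal sketch of writing and reading a Delta table with PySpark, assuming the open-source delta-spark package is available on the cluster (the package version and paths are illustrative).

```python
from pyspark.sql import SparkSession

# Assumes the Delta Lake package is on the classpath, e.g. launched with
#   pyspark --packages io.delta:delta-core_2.12:2.0.0
spark = (SparkSession.builder.appName("delta-quickstart")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

df = spark.range(0, 5)   # a tiny sample DataFrame with a single "id" column

# Write as a Delta table and read it back, just like any other format.
df.write.format("delta").mode("overwrite").save("/tmp/delta/numbers")
spark.read.format("delta").load("/tmp/delta/numbers").show()
```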

❑ What are the features of Delta Lake?

❐ It is open source and based on an open format (Parquet). Please refer to the GitHub link for more details.

❐ It has connectors for Apache Spark, Flink, Hive, PrestoDB, Power BI, Trino, Redshift, Athena, and other systems. Refer to Integrations | Delta Lake for more details.

ACID Transactions — Delta Lake brings ACID (Atomicity, Consistency, Isolation, Durability) transactions to data lakes. Delta Lake ensures that there are no partial or corrupt files, so readers never see inconsistent data. All data changes written to storage are committed for durability and made visible to readers atomically.

Each new write will create a new version of a data table, and new data won’t be read until the transaction completes. This means that data jobs that fail midway can be disregarded entirely. It also simplifies the process of deleting and updating records — many changes can be applied to the data in a single transaction, eliminating the possibility of incomplete deletes or updates.
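Each of those committed versions is recorded as a numbered JSON file in the table’s _delta_log directory, which you can inspect directly. A tiny sketch, assuming the sample table from the previous snippet sits on a local filesystem:

```python
import os

# Every committed write appears as a new, ordered commit file such as
# 00000000000000000000.json, 00000000000000000001.json, ...
log_dir = "/tmp/delta/numbers/_delta_log"
print(sorted(f for f in os.listdir(log_dir) if f.endswith(".json")))
```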

Scalable Metadata Handling — In big data, the volume of metadata can also be large, so Delta Lake treats the metadata just like data, leveraging Spark’s distributed processing power. Delta Lake can easily handle petabyte-scale tables with billions of partitions and files.

Time Travel — Delta Lake provides snapshots of data, enabling developers to access and revert to earlier versions of data for audits, rollbacks, or to reproduce experiments. The transaction log makes it easy to retain historical versions of the data to fulfill compliance requirements such as GDPR and CCPA.
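A short sketch using the sample table from earlier: the commit history can be inspected, and any previous version can be read back by version number or timestamp (the timestamp must fall within the table’s retained history).

```python
from delta.tables import DeltaTable

path = "/tmp/delta/numbers"   # the sample table created above

# Inspect the table's commit history (version, timestamp, operation, ...).
DeltaTable.forPath(spark, path).history() \
    .select("version", "timestamp", "operation").show()

# Read the table as it looked at version 0, or as of a given timestamp.
v0_df = spark.read.format("delta").option("versionAsOf", 0).load(path)
old_df = (spark.read.format("delta")
          .option("timestampAsOf", "2022-07-19 10:00:00").load(path))
```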

Unified Batch and Streaming Source and Sink — A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
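For example, the same Delta table written by the batch job above can be consumed as a streaming source and written to another Delta table as a streaming sink; the checkpoint path below is a hypothetical placeholder.

```python
# Treat the Delta table as a streaming source: newly committed data is
# picked up incrementally as it arrives.
stream_df = spark.readStream.format("delta").load("/tmp/delta/numbers")

# Write the stream out to another Delta table (the sink), with a checkpoint
# location so the query can restart from where it left off.
query = (stream_df.writeStream
         .format("delta")
         .option("checkpointLocation", "/tmp/delta/_checkpoints/numbers_copy")
         .start("/tmp/delta/numbers_copy"))
```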

Schema Enforcement — Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.

Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table’s schema.
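For instance, appending a DataFrame whose schema does not match the table’s schema is rejected; a sketch against the sample table from earlier (which only has an id column):

```python
from pyspark.sql import functions as F
from pyspark.sql.utils import AnalysisException

# This DataFrame carries an extra column the table has never seen.
bad_df = spark.range(0, 5).withColumn("extra_col", F.lit("oops"))

try:
    bad_df.write.format("delta").mode("append").save("/tmp/delta/numbers")
except AnalysisException as e:
    # Delta rejects the write instead of silently corrupting the table.
    print("Write rejected:", e)
```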

Schema Evolution — Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
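Continuing the previous sketch, the same append succeeds once schema evolution is requested with the mergeSchema option, and the new column is added to the table automatically (existing rows get nulls for it).

```python
# Ask Delta to evolve the table schema to include the new column.
(bad_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/tmp/delta/numbers"))

spark.read.format("delta").load("/tmp/delta/numbers").printSchema()
```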

Fully compatible with Apache Spark — Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark.

Delete and insert support — Traditionally, data lakes fall under the “save once, never change” way of thinking. Delta Lake changes that, as it supports insertion, updates, and deletion of data within a dataset.
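A short, self-contained sketch of deletes, updates, and upserts using the Delta Lake Python API (delta-spark); the table path and data are hypothetical.

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# A small sample table at a hypothetical path.
path = "/tmp/delta/people"
spark.createDataFrame(
    [(1, "Asha"), (2, "Bala"), (3, "Chen")], ["id", "name"]
).write.format("delta").mode("overwrite").save(path)

table = DeltaTable.forPath(spark, path)

# Delete rows matching a predicate and update values in place.
table.delete("id = 3")
table.update(condition=F.col("id") == 2, set={"name": F.lit("Balaji")})

# Upsert (merge) incoming records by key: update matches, insert the rest.
incoming = spark.createDataFrame([(1, "Asha R"), (4, "Devi")], ["id", "name"])
(table.alias("t")
    .merge(incoming.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

spark.read.format("delta").load(path).show()
```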

I hope this blog helped you understand the challenges of building a data lake along with Delta Lake’s features.

Subsequent blogs will cover how Delta Lake can be used to overcome the above challenges, along with code snippets. Till then, Happy Learning!!!

Please Note — All opinions expressed here are my personal views and not of my employer.

Thought of the moment —

Never give up something you believe in — Steve Scalise
