Azure Databricks Auto Loader: Plugging the Gap for Batch ETL

Amit Damle
May 30, 2021


What is Auto Loader?

Auto Loader is a utility provided by Databricks that automatically picks up new files landing in Azure Storage and loads them into a sink, e.g., a Delta Lake table.
Auto Loader introduces a new source called cloudFiles that is built on Structured Streaming. It identifies newly arrived files using whichever file discovery mode is configured and processes them. Once a file is picked up by the loader, you can apply your processing logic and store the data in virtually any sink, e.g., Azure Storage, Azure SQL DB, Synapse, etc.
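
For reference, the discovery mode is controlled by the cloudFiles.useNotifications option (directory listing when false, file notification when true). Below is a minimal sketch; the file format, schema, and path are illustrative placeholders of my own, not values from this post, and `spark` is the SparkSession already available in a Databricks notebook.

# Minimal sketch: the option names are Auto Loader's documented cloudFiles options;
# the file format, schema, and path below are illustrative placeholders.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")            # format of the incoming files
    .option("cloudFiles.useNotifications", "false") # false = directory listing mode,
                                                    # true  = file notification mode (Event Grid + queue)
    .schema("id STRING, payload STRING")
    .load("abfss://landing@<storage-account>.dfs.core.windows.net/events/")
)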

Quick Summary of How Auto Loader Works

Auto Loader Workflow

Process Overview

  1. The source system uploads files into Azure Storage.
  2. Storage publishes a 'Blob Created' event on Event Grid.
  3. An Event Grid subscription pushes these events onto a storage queue. The storage queue and event subscription are created automatically by Databricks when the Auto Loader job is initialized (see the sketch after this list).
  4. The Databricks Auto Loader job receives the file upload events (Blob Created events).
  5. The Databricks streaming job reads the newly created files from Azure Storage and processes them based on user-written logic.
  6. The processed data can be stored in Azure Storage or any of the databases shown.
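
Steps 2 to 4 work because Auto Loader is allowed to create the Event Grid subscription and storage queue on your behalf. Here is a hedged sketch of the file notification options involved; the service principal, subscription, and storage identifiers are placeholders I have added for illustration, not values from this post.

# Sketch only: these are Auto Loader's documented file notification options for Azure;
# every identifier below is a placeholder.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.useNotifications", "true")   # let Auto Loader create the Event Grid subscription and queue
    .option("cloudFiles.subscriptionId", "<azure-subscription-id>")
    .option("cloudFiles.resourceGroup", "<storage-resource-group>")
    .option("cloudFiles.tenantId", "<aad-tenant-id>")
    .option("cloudFiles.clientId", "<service-principal-client-id>")
    .option("cloudFiles.clientSecret", "<service-principal-secret>")
    .schema("id STRING, payload STRING")
    .load("abfss://landing@<storage-account>.dfs.core.windows.net/incoming/")
)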

Sample Code for Initializing Auto Loader and Reading Data

Initialize Auto Loader

df = spark.readStream.format("cloudFiles") \
.option(<cloudFiles-option>, <option-value>) \
.schema(<schema>) \
.load(<input-path>)

Store Processed Data

df.writeStream \
.format("delta") \
.option("checkpointLocation", "<YOUR CHECKPOINT LOCATION>") \
.start("<OUTPUT PATH>")
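
Putting the two templates together, a filled-in version could look like the following; the file format, schema, paths, and storage account names are made up purely for illustration.

# Read newly arrived JSON files from a landing folder (directory listing mode by default).
orders = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .schema("order_id STRING, amount DOUBLE, order_ts TIMESTAMP")
    .load("abfss://landing@mystorageacct.dfs.core.windows.net/orders/")
)

# Continuously append the records to a Delta location, tracking progress via the checkpoint.
(
    orders.writeStream
    .format("delta")
    .option("checkpointLocation", "abfss://lake@mystorageacct.dfs.core.windows.net/_checkpoints/orders")
    .start("abfss://lake@mystorageacct.dfs.core.windows.net/silver/orders")
)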

How to use it efficiently for Batch ETLs?

Auto Loader is without doubt one of the best utilities for automatically moving data from storage to a desired sink after processing. It makes your ETL pipelines simpler to build and easier to read.

One obvious gap I noticed while working with Auto Loader is the nature of the job itself: it runs as a streaming job to read files, so the cluster never terminates, whether or not new files are arriving in the source folder. This drives up cost when data loading is periodic rather than real time or near real time.
Still, we cannot ignore the advantages Auto Loader provides:

  1. Out-of-the-box support for event-based processing
  2. Automatic creation of the Event Grid subscription, the event consumer (storage queue), the logic for reading new files, etc.
  3. A clean data-loading architecture

To work around this gap, we can make the following changes:

1. Change the Auto Loader job to use Trigger.Once (trigger(once=True) in PySpark)
2. Use an ADF pipeline with a Databricks job activity to trigger the Auto Loader job (a rough sketch of this trigger call follows the list)
3. Schedule the ADF pipeline to run periodically, or attach an event-based trigger, to kick off the Auto Loader job
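
For context, ADF's Azure Databricks activities start runs through the Databricks Jobs API; the sketch below shows the equivalent trigger issued directly from Python. The workspace URL, token, and job ID are placeholders, and in ADF itself you would normally use the built-in activity rather than a hand-rolled REST call.

import requests

# Placeholders: replace with your workspace URL, an access token, and the Auto Loader job's ID.
WORKSPACE_URL = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<databricks-access-token>"
JOB_ID = 12345  # the job that runs the Trigger.Once Auto Loader notebook

# Kick off one run of the job; with Trigger.Once the cluster starts, drains the backlog
# of new files, and terminates, so you only pay while files are actually being processed.
resp = requests.post(
    f"{WORKSPACE_URL}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"job_id": JOB_ID},
)
resp.raise_for_status()
print("Started run:", resp.json()["run_id"])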

Store Processed Data for Batch

# With foreachBatch, the Delta write (format and output path) is handled inside the processing function.
df.writeStream \
.foreachBatch(<YOUR DATA PROCESSING FUNCTION>) \
.trigger(once=True) \
.option("checkpointLocation", "<YOUR CHECKPOINT LOCATION>") \
.start()
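
As a usage example, the function handed to foreachBatch receives each micro-batch as a regular DataFrame plus a batch ID. Here is a minimal sketch; the deduplication key, paths, and function name are placeholders of my own.

def process_and_store(batch_df, batch_id):
    # batch_df is a plain (non-streaming) DataFrame holding the rows from newly arrived files,
    # so any ordinary batch transformation or write can be applied here.
    (
        batch_df.dropDuplicates(["order_id"])
        .write.format("delta")
        .mode("append")
        .save("abfss://lake@mystorageacct.dfs.core.windows.net/silver/orders")
    )

(
    df.writeStream
    .foreachBatch(process_and_store)
    .trigger(once=True)  # drain the backlog of new files, then stop
    .option("checkpointLocation", "abfss://lake@mystorageacct.dfs.core.windows.net/_checkpoints/orders")
    .start()
)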

Architecture with Azure Data Factory


Disclaimer: Ideas / views expressed are my personal opinions
