Azure Databricks Auto Loader — Plugging the Gap for Batch ETL
What is Auto Loader?
Auto Loader is a utility provided by Databricks that automatically picks up new files as they land in Azure Storage and loads them into a sink, e.g. a Delta Lake table.
Auto Loader introduces a new Structured Streaming source called cloudFiles. It identifies newly arrived files using whichever file discovery mode is configured (directory listing or file notification) and processes them. Once a file is picked up by the loader, you can apply your processing logic and store the data into virtually any sink, e.g. Azure Storage, Azure SQL DB, Synapse, etc.
Quick Summary of How Auto Loader Works
Process Overview
- The source system uploads files into Azure Storage.
- Azure Storage publishes a 'Blob Created' event to Event Grid.
- An Event Grid subscription pushes these events to a storage queue. The storage queue and the event subscription are created automatically by Databricks when the Auto Loader job is initialized (see the sketch after this list).
- The Databricks Auto Loader job receives the file upload ('Blob Created') events.
- The Databricks streaming job reads the newly created files from Azure Storage and processes them based on the logic written by the user.
- The processed data can be stored in Azure Storage or in any database, as shown.
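For file notification mode, Auto Loader needs permission to create the Event Grid subscription and the storage queue on your behalf. Below is a minimal sketch of the cloudFiles options involved in that setup; the service principal values, file format, and input path are placeholders of my own, and the exact option set can vary with the Databricks Runtime version.

# Sketch of file-notification mode options (all values are placeholders).
# With these set, Auto Loader can create the Event Grid subscription and
# the storage queue automatically when the stream starts.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")               # format of the landed files
      .option("cloudFiles.useNotifications", "true")     # enable file notification mode
      .option("cloudFiles.subscriptionId", "<AZURE SUBSCRIPTION ID>")
      .option("cloudFiles.tenantId", "<AAD TENANT ID>")
      .option("cloudFiles.clientId", "<SERVICE PRINCIPAL CLIENT ID>")
      .option("cloudFiles.clientSecret", "<SERVICE PRINCIPAL SECRET>")
      .option("cloudFiles.resourceGroup", "<STORAGE ACCOUNT RESOURCE GROUP>")
      .schema("id STRING, event_time TIMESTAMP, payload STRING")  # assumed schema
      .load("<input-path>"))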
Sample Code for Initializing Auto Loader and Reading Data
Initialize Auto Loader
df = spark.readStream.format("cloudFiles") \
  .option(<cloudFiles-option>, <option-value>) \
  .schema(<schema>) \
  .load(<input-path>)
Store Processed Data
df.writeStream \
  .format("delta") \
  .option("checkpointLocation", "<YOUR CHECKPOINT LOCATION>") \
  .start("<OUTPUT PATH>")
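Putting the two templates together, a minimal filled-in sketch might look like the following; the storage account, container names, schema, and paths are placeholders of my own:

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Assumed schema of the incoming JSON files
input_schema = StructType([
    StructField("id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("payload", StringType()),
])

# Read new files from the landing container with the cloudFiles source
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .schema(input_schema)
      .load("abfss://landing@<storage-account>.dfs.core.windows.net/events/"))

# Write to a Delta table; the checkpoint location is where Structured Streaming
# tracks which files have already been processed, so keep it stable across runs
(df.writeStream
   .format("delta")
   .option("checkpointLocation", "abfss://bronze@<storage-account>.dfs.core.windows.net/_checkpoints/events/")
   .start("abfss://bronze@<storage-account>.dfs.core.windows.net/events/"))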
How to use it efficiently for Batch ETLs?
Auto Loader is without doubt one of the best utilities for automatically moving data from storage to the desired sink after processing. It makes your ETL pipelines simpler to build and more readable.
One obvious gap I noticed while working with Auto Loader is the nature of the job itself: it spawns a streaming job to read files, so the cluster will not terminate regardless of whether new files are arriving in the source folder. This increases cost when data loading is periodic rather than real-time or near real-time.
Still, we cannot ignore the advantages Auto Loader provides:
- Out-of-the-box support for event-based processing
- Automatic creation of the Event Grid subscription and event consumer (storage queue), automatic discovery of new files, etc.
- A clean data-loading architecture
To work around this gap, we can make the following changes:
1. Change the Auto Loader job to run with "Trigger.Once" (trigger(once=True) in PySpark), so it processes the available files and then stops.
2. Use an Azure Data Factory (ADF) pipeline with a Databricks activity to trigger the Auto Loader job.
3. Schedule the ADF pipeline to run periodically, or use an event-based trigger, to kick off the Auto Loader job.
Store Processed Data for Batch
df.writeStream \
  .foreachBatch(<YOUR DATA PROCESSING FUNCTION>) \
  .trigger(once=True) \
  .option("checkpointLocation", "<YOUR CHECKPOINT LOCATION>") \
  .start()  # with foreachBatch, the sink (e.g. Delta at <OUTPUT PATH>) is written inside the batch function
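For reference, the function passed to foreachBatch receives each micro-batch as a regular DataFrame along with a batch id, so normal batch writers (Delta, JDBC to Azure SQL DB, etc.) can be used inside it. A minimal sketch, with a hypothetical function name and placeholder output path:

# Hypothetical batch-processing function for foreachBatch; the output path is a placeholder
def process_batch(batch_df, batch_id):
    (batch_df.write
        .format("delta")
        .mode("append")
        .save("<OUTPUT PATH>"))

Pass it to the writer shown above as .foreachBatch(process_batch).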
Architecture with Azure Data Factory -
Disclaimer: Ideas / views expressed are my personal opinions