5 Simple Steps to Plan Databricks Cluster Capacity
Co-author — RK Iyer
I have seen many implementations struggle to find the right size for their Spark (Databricks) cluster. In my view, capacity planning is influenced by two factors:
1. Infra — if the infrastructure is not chosen based on the workload type, performance may suffer.
2. Workload — even after choosing the right infrastructure, the desired performance is sometimes not achieved because of poorly written or unoptimized code.
Though the process is iterative and a bit complex, there is a way to get quick insight into #1, i.e. how much memory, CPU, or storage is required for a given dataset. About a year and a half ago, Azure Databricks released Ganglia monitoring for its clusters. It can provide valuable input to get you started with capacity planning. This approach is not complete, but it helps set a base configuration for further tuning.
In my experience, once the basic capacity is identified using Ganglia, only about 30% of the work remains to further tune your workloads.
The process is really simple; just follow the 5 steps below:
- First, take a subset of your dataset (a sampling sketch follows this list)
- Start with a basic cluster size, i.e. a 3-node cluster: 1 driver + 2 worker nodes (4 cores + 14 GB each); a matching cluster spec is sketched after this list
- Run your job containing the business logic (choose the job with the most complex logic)
- Identify the type of workload, i.e. whether it is CPU bound, memory bound, or network (N/W) bound. Monitor the performance of your cluster using the Ganglia metrics (available on the cluster's Monitoring tab); Fig#2 shows where to find them.
CPU Bound Job — if your job executions push CPU usage up to 70–80%, it is a CPU-intensive workload.
Memory Bound — if CPU stays well below 60% but memory usage reaches 70–80%, it is a memory-intensive workload.
N/W Bound — if CPU and memory stay moderate while the network in/out graphs in Ganglia stay consistently high, it is a network-intensive workload.
- Tune your cluster and rerun the job
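For steps 1 and 3, here is a minimal notebook-style sketch of carving out a subset and running the job's core logic on it while you watch Ganglia. The paths, the 10% sample fraction, the column names, and the aggregation are placeholder assumptions standing in for your real pipeline; `spark` is the session that Databricks notebooks provide.

```python
from pyspark.sql import functions as F

# Step 1: take a representative subset of the full dataset.
full_df = spark.read.parquet("/mnt/raw/transactions")        # hypothetical source path
subset_df = full_df.sample(fraction=0.1, seed=42)            # ~10% sample; adjust as needed
subset_df.write.mode("overwrite").parquet("/mnt/raw/transactions_sample")

# Step 3: run the job's business logic (pick your most complex job)
# against the subset while watching the Ganglia metrics for the cluster.
sample_df = spark.read.parquet("/mnt/raw/transactions_sample")
result = (sample_df
          .filter(F.col("amount") > 0)                       # stand-in for real business logic
          .groupBy("customer_id")
          .agg(F.sum("amount").alias("total_amount")))
result.write.mode("overwrite").parquet("/mnt/curated/transactions_agg")
```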
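For step 2, the baseline cluster can be created by hand in the UI or scripted against the Databricks Clusters API. The sketch below assumes an Azure workspace: Standard_DS3_v2 is the 4-core / 14 GB node size there, and the runtime version, workspace URL, and token are placeholders you would replace.

```python
import requests

# Baseline capacity-planning cluster: 1 driver + 2 workers, 4 cores / 14 GB each.
cluster_spec = {
    "cluster_name": "capacity-planning-baseline",
    "spark_version": "7.3.x-scala2.12",      # pick a runtime available in your workspace
    "node_type_id": "Standard_DS3_v2",       # 4 cores, 14 GB RAM per node on Azure
    "num_workers": 2,                        # plus one driver node of the same size
    "autotermination_minutes": 60,
}

resp = requests.post(
    "https://<databricks-instance>/api/2.0/clusters/create",   # placeholder workspace URL
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=cluster_spec,
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])
```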
[Figure: Ganglia Metrics in Databricks]
[Figure: Workload Types]
[Figure: CPU Bound Workload Sample]
[Figure: Memory Bound Sample]
[Figure: Network Bound]
[Figure: Cluster Load Distribution]
Pay attention to the second load-distribution graph: it shows that one of the nodes is overloaded, which helps identify skewed dataset processing.
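To confirm that an overloaded node really comes from skew rather than from the cluster itself, it helps to look at how rows are spread across partitions and keys. A minimal sketch, assuming a hypothetical input path and a grouping key called `customer_id`:

```python
from pyspark.sql import functions as F

df = spark.read.parquet("/mnt/raw/transactions_sample")      # hypothetical input

# Rows per Spark partition: one partition dwarfing the others means the
# executor holding it does most of the work, which is exactly the uneven
# load distribution that Ganglia surfaces.
(df.withColumn("partition_id", F.spark_partition_id())
   .groupBy("partition_id").count()
   .orderBy(F.desc("count"))
   .show(10, truncate=False))

# Rows per key: a handful of very heavy keys is the usual root cause of skew.
df.groupBy("customer_id").count().orderBy(F.desc("count")).show(10)
```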
Once the workload type is identified, use the following guidance to tune the infrastructure.
A few pointers to solve common performance issues related to jobs
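As a rough illustration of the kind of job-level fixes that typically come up here (these are general Spark practices, not necessarily the author's exact list), a hedged sketch of three of them: right-sizing shuffle partitions, broadcasting a small dimension table, and caching a reused DataFrame. The table names and the partition count are assumptions.

```python
from pyspark.sql import functions as F

# 1. Right-size shuffle partitions: the default of 200 is often wrong for the
#    data volume at hand; tune it to roughly match the cluster's total cores
#    and the size of the shuffled data.
spark.conf.set("spark.sql.shuffle.partitions", "64")

# 2. Broadcast a small lookup table instead of shuffling both sides of a join.
facts = spark.read.parquet("/mnt/raw/transactions_sample")    # hypothetical tables
dims = spark.read.parquet("/mnt/raw/customers")
joined = facts.join(F.broadcast(dims), on="customer_id", how="left")

# 3. Cache a DataFrame only when several downstream actions reuse it.
joined.cache()
joined.count()    # materialize the cache once
```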
References