5 Simple Steps to Plan Databricks Cluster Capacity

Amit Damle
3 min readMar 21, 2021

--

Co-author — RK Iyer

I had seen lots of implementations struggled to find the right size for their Spark cluster(Databricks). In my view Capacity planning is influenced by 2 factors
1. Infra — If infra is not chosen based on the workload type then it may affect the performance

2. Workload— Even after choosing right infrastructure sometimes desired performance is not achieved due to poorly written / unoptimized code

Though the process is an iterative and bit of a complex, there is a way to get quick insight on #1 i.e. how much memory, CPU or storage is required for the given dataset. Year & half back Azure Databricks has released Ganglia monitoring for their clusters. It can provide valuable input to get you started with capacity planning. This approach though not complete help to set a base configuration for further tuning.
In my experience once basic capacity is identified using Ganglia there is only 30% work required to further tune your workloads.

The process is really simple, you just need to follow 5 steps mentioned below

Fig -1 Steps
  1. First take a subset of your dataset
  2. Start with basic cluster size i.e. 3 Node cluster — 1 Master + 2 Worker Nodes (4Core+14GB each)
  3. Run your job containing business logic (choose the job that has complex logic)
  4. Identify type of workload i.e. whether workload is CPU bound or Memory Bound or N/W Bound. Fig#2 shows where to check for Ganglia metrics Monitor performance of your cluster using Ganglia metrics (Present on monitoring Tab of cluster)
    CPU Bound Job — If you observed your Job executions consuming CPU up to 70–80% that means its a CPU intense workload
    Memory Bound — CPU is well below the 60% but memory usage is reaching 70–80% that means its a Memory intense workload
    N/W Bound Cluster — Check
  5. Tune your cluster and rerun the job

Ganglia Metrics in Databricks

Fig-2 Ganglia Metrics

Workload Types

CPU Bound Workload Sample

Fig-3 CPU Bound Workload

Memory Bound Sample

Fig-4 Memory Bound Workload

Network Bound

Fig-5 Network Bound

Cluster Load Distribution

Fig-6 Load distribution and Legends

Pay attention to the second load distribution which shows one of the node is overloaded this can help us identify skewed dataset processing

Once the workload type is identified then please use following guidance to tune the infrastructure

Fig-6 Tuning

Few Pointers to solve the common performance issues related to Jobs

References

Spark Metrics

Databricks Monitoring using Grafana

--

--