5 Simple Steps to Plan Databricks Cluster Capacity
Co-author — RK Iyer
I have seen many implementations struggle to find the right size for their Spark (Databricks) cluster. In my view, capacity planning is influenced by two factors:
1. Infra — if the infrastructure is not chosen based on the workload type, performance may suffer.
2. Workload — even after choosing the right infrastructure, the desired performance is sometimes not achieved because of poorly written or unoptimized code.
Though the process is iterative and a bit complex, there is a way to get quick insight into #1, i.e. how much memory, CPU, or storage is required for a given dataset. About a year and a half ago, Azure Databricks released Ganglia monitoring for its clusters. It can provide valuable input to get you started with capacity planning. This approach is not complete, but it helps set a base configuration for further tuning.
In my experience, once the basic capacity is identified using Ganglia, only about 30% of the work remains to further tune your workloads.
The process is really simple; just follow the 5 steps below:
- First, take a subset of your dataset (a sampling sketch follows this list)
- Start with a basic cluster size, i.e. a 3-node cluster: 1 driver + 2 worker nodes (4 cores + 14 GB each); a matching cluster spec is sketched after this list
- Run your job containing the business logic (choose the job with the most complex logic)
- Identify the type of workload, i.e. whether it is CPU bound, memory bound, or network (N/W) bound. Monitor the performance of your cluster using the Ganglia metrics (available on the cluster's Monitoring tab); Fig#2 shows where to find them.
CPU Bound Job — if your job executions push CPU usage up to 70–80%, it is a CPU-intensive workload.
Memory Bound — if CPU stays well below 60% but memory usage reaches 70–80%, it is a memory-intensive workload.
N/W Bound — if CPU and memory stay moderate while the network in/out graphs in Ganglia stay consistently high, it is a network-intensive workload.
- Tune your cluster and rerun the job
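For steps 1 and 3, here is a minimal notebook-style sketch of carving out a subset and running the job's core logic on it while you watch Ganglia. The paths, the 10% sample fraction, the column names, and the aggregation are placeholder assumptions standing in for your real pipeline; `spark` is the session that Databricks notebooks provide.

```python
from pyspark.sql import functions as F

# Step 1: take a representative subset of the full dataset.
full_df = spark.read.parquet("/mnt/raw/transactions")        # hypothetical source path
subset_df = full_df.sample(fraction=0.1, seed=42)            # ~10% sample; adjust as needed
subset_df.write.mode("overwrite").parquet("/mnt/raw/transactions_sample")

# Step 3: run the job's business logic (pick your most complex job)
# against the subset while watching the Ganglia metrics for the cluster.
sample_df = spark.read.parquet("/mnt/raw/transactions_sample")
result = (sample_df
          .filter(F.col("amount") > 0)                       # stand-in for real business logic
          .groupBy("customer_id")
          .agg(F.sum("amount").alias("total_amount")))
result.write.mode("overwrite").parquet("/mnt/curated/transactions_agg")
```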
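For step 2, the baseline cluster can be created by hand in the UI or scripted against the Databricks Clusters API. The sketch below assumes an Azure workspace: Standard_DS3_v2 is the 4-core / 14 GB node size there, and the runtime version, workspace URL, and token are placeholders you would replace.

```python
import requests

# Baseline capacity-planning cluster: 1 driver + 2 workers, 4 cores / 14 GB each.
cluster_spec = {
    "cluster_name": "capacity-planning-baseline",
    "spark_version": "7.3.x-scala2.12",      # pick a runtime available in your workspace
    "node_type_id": "Standard_DS3_v2",       # 4 cores, 14 GB RAM per node on Azure
    "num_workers": 2,                        # plus one driver node of the same size
    "autotermination_minutes": 60,
}

resp = requests.post(
    "https://<databricks-instance>/api/2.0/clusters/create",   # placeholder workspace URL
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=cluster_spec,
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])
```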
[Figure: Ganglia Metrics in Databricks]
[Figure: Workload Types]
[Figure: CPU Bound Workload Sample]
[Figure: Memory Bound Sample]
[Figure: Network Bound]
[Figure: Cluster Load Distribution]
Pay attention to the second load-distribution graph: it shows that one of the nodes is overloaded, which helps identify skewed dataset processing.
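To confirm that an overloaded node really comes from skew rather than from the cluster itself, it helps to look at how rows are spread across partitions and keys. A minimal sketch, assuming a hypothetical input path and a grouping key called `customer_id`:

```python
from pyspark.sql import functions as F

df = spark.read.parquet("/mnt/raw/transactions_sample")      # hypothetical input

# Rows per Spark partition: one partition dwarfing the others means the
# executor holding it does most of the work, which is exactly the uneven
# load distribution that Ganglia surfaces.
(df.withColumn("partition_id", F.spark_partition_id())
   .groupBy("partition_id").count()
   .orderBy(F.desc("count"))
   .show(10, truncate=False))

# Rows per key: a handful of very heavy keys is the usual root cause of skew.
df.groupBy("customer_id").count().orderBy(F.desc("count")).show(10)
```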
Once the workload type is identified, use the following guidance to tune the infrastructure.
A few pointers to solve common performance issues related to jobs
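As a rough illustration of the kind of job-level fixes that typically come up here (these are general Spark practices, not necessarily the author's exact list), a hedged sketch of three of them: right-sizing shuffle partitions, broadcasting a small dimension table, and caching a reused DataFrame. The table names and the partition count are assumptions.

```python
from pyspark.sql import functions as F

# 1. Right-size shuffle partitions: the default of 200 is often wrong for the
#    data volume at hand; tune it to roughly match the cluster's total cores
#    and the size of the shuffled data.
spark.conf.set("spark.sql.shuffle.partitions", "64")

# 2. Broadcast a small lookup table instead of shuffling both sides of a join.
facts = spark.read.parquet("/mnt/raw/transactions_sample")    # hypothetical tables
dims = spark.read.parquet("/mnt/raw/customers")
joined = facts.join(F.broadcast(dims), on="customer_id", how="left")

# 3. Cache a DataFrame only when several downstream actions reuse it.
joined.cache()
joined.count()    # materialize the cache once
```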
References