Reduce the Cost of Lake House Processing by Migrating to Amazon EMR Spark
More and more customers are adopting Lake House implementations that use cloud storage as the foundation for both their data lake and data warehouse.
The main objectives of this approach are -
A. Get the data into one centralized location, and
B. Reduce data warehouse cost, based on the common understanding that cloud storage is cheaper than cloud data warehouses such as Snowflake, Redshift and Synapse
Soon, customers start realizing the drawbacks of this approach, as listed below -
- Cloud storage services are simple object stores and do not support CRUD operations out of the box; this means adopting open table formats such as Hudi, Iceberg or Delta Lake
- Though cloud storage is cost effective, it needs constant monitoring and file compaction to keep the number of small files, and therefore List and Get API costs, in check
- Meeting enterprise SLAs for reporting and analytical queries requires compute engines that can contribute to higher cost
When it comes to implementing a Lake House architecture on AWS, S3 is the de-facto standard for data lake implementations. Most customers choose leading Spark ISV solutions because of their ease of use and the one-stop solution they offer for data processing, exploratory / ad-hoc analysis and reporting. Eventually these solutions can lead to increased cost due to premium offerings such as SQL querying on the data lake and a commercial version of Spark, and customers are forced to start looking for cost-effective options to process and analyze data stored on cloud storage.
One such option evaluated for backend processing is Amazon EMR, which can significantly reduce cost.
In most cases, however, customers who migrate to EMR Spark complain about negligible or no cost benefit. In about 90% of these cases, customers migrated their implementation from a Spark ISV to EMR Spark using like-for-like infrastructure and cluster sizing. EMR Spark, especially EMR on EC2, provides various cost optimization levers that not all customers are aware of.
In this blog, I have listed 7 cost optimization options for EMR on EC2 Spark clusters. These techniques are centered around the EMR and Spark platform rather than code-level optimizations, so they help customers save cost quickly without making too many changes.
But before we jump into the optimization options, let's look at the typical migration process followed by most customers -
First, for those who are new to EMR: Amazon EMR is a managed distributed processing platform that simplifies running big data frameworks such as Hadoop, Hive, Spark, Flink, HBase, Presto and Trino.
EMR Deployment Options -
EMR on EC2, EMR on EKS, EMR Serverless and EMR on Outposts
Cost Optimization Options—
Option 1: Resize Master Node
Analyze the usage metrics for the EMR master node and reduce its size. In most migration scenarios, the master node size (T-shirt sizing) is kept at large, yet about 95% of scenarios work fine with a small master node such as m5.xlarge or m5a.xlarge. A quick way to pull these metrics is sketched below, after the exceptions list.
A few exceptions where the master node needs to be larger than m5.xlarge are -
- EMR Spark jobs running in client mode, where large amounts of data need to be fetched onto the node running the Spark driver
- Using the Thrift server for dbt (at the time of writing, EMR does not have a native dbt plugin, so Spark JDBC via the Thrift server needs to be used)
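To validate that the master node really is oversized, pull its CPU utilization from CloudWatch before resizing. Below is a minimal boto3 sketch; the region, instance ID and time window are placeholders, and the master node's instance ID can be looked up in the EMR console or via the ListInstances API.

```python
from datetime import datetime, timedelta, timezone
import boto3

# Placeholder region and master node instance ID for your cluster
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
master_instance_id = "i-0abc1234def567890"

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": master_instance_id}],
    StartTime=datetime.now(timezone.utc) - timedelta(days=14),
    EndTime=datetime.now(timezone.utc),
    Period=3600,                        # one datapoint per hour
    Statistics=["Average", "Maximum"],
)

# Consistently low averages and maximums suggest the master node can be downsized
for dp in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
    print(dp["Timestamp"], f"avg={dp['Average']:.1f}%", f"max={dp['Maximum']:.1f}%")
```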
Option 2: Select an Appropriate EC2 Family Based on Workload Type
Identify the workload type and use matching EC2 instance families for the EMR core and task nodes - typically general purpose (m family) for balanced workloads, compute optimized (c family) for CPU-bound transformations, memory optimized (r family) for joins, aggregations and caching, and storage optimized (i family) for shuffle-heavy jobs.
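As an illustration, here is how that choice shows up in the cluster definition. This is a minimal sketch of the InstanceGroups passed to the EMR run_job_flow API for an assumed memory-heavy Spark ETL workload; the instance types and counts are placeholders, not a sizing recommendation.

```python
# Instance groups for a hypothetical memory-heavy Spark ETL workload, to be
# passed as Instances={"InstanceGroups": instance_groups, ...} to run_job_flow
instance_groups = [
    {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
     "InstanceType": "m5.xlarge", "InstanceCount": 1},      # small master node (Option 1)
    {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
     "InstanceType": "r5.2xlarge", "InstanceCount": 4},     # memory optimized for wide joins / aggregations
    {"Name": "task", "InstanceRole": "TASK", "Market": "ON_DEMAND",
     "InstanceType": "r5.2xlarge", "InstanceCount": 4},     # same family for task nodes
]
```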
Option 3: Use Graviton Instances
In general, existing EMR cluster node types can be safely replaced with equivalent Graviton instances to get similar or better performance at reduced cost. If you have already selected node types based on workload as described in Option 2, try the equivalent Graviton instances and compare both performance and cost for your workload (see the sketch after the exceptions below).
Exceptions -
1. Third-party libraries that are not compatible with the ARM architecture
2. No significant cost gain for jobs with large shuffle data (> 2 TB)
3. Workloads that rely on I3 / I3en instances for heavy shuffle
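A minimal sketch of the swap, reusing the instance_groups list from the Option 2 sketch: replace each x86 type with its Graviton counterpart (m5 to m6g, r5 to r6g, c5 to c6g), re-run the workload, and compare runtime and cost before standardizing on the new types.

```python
# Map assumed x86 instance types to their Graviton (ARM) equivalents
graviton_equivalent = {
    "m5.xlarge": "m6g.xlarge",
    "r5.2xlarge": "r6g.2xlarge",
    "c5.4xlarge": "c6g.4xlarge",
}

# Rewrite the instance groups from the previous sketch in place; a single EMR
# cluster cannot mix x86 and ARM nodes, so swap all groups together
for group in instance_groups:
    group["InstanceType"] = graviton_equivalent.get(group["InstanceType"], group["InstanceType"])
```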
Option 4: Use Spot Instances
Based on the scenario, keep core nodes On-Demand and use Spot instances for task nodes. This can significantly reduce cost.
If you are using Graviton instances, make sure the specific region has Graviton Spot capacity available.
Spot interruptions can sometimes cause jobs to breach SLAs and run longer than expected, adding cost overhead. In such scenarios, choose one of the following options -
- Monitor and adjust the cluster's On-Demand and Spot mix to balance Spot interruptions against cost, or
- Use instance fleets with diverse instance types, specify subnets across all AZs, and set the allocation strategy to "capacity-optimized-prioritized" so that EMR can launch instances in the AZ with maximum availability while honoring the priority of the instance types defined when creating the cluster, as sketched below
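Here is a minimal run_job_flow sketch of the second approach, assuming a Graviton-based, memory-oriented fleet: On-Demand core nodes, Spot task nodes with diverse instance types, subnets spread across AZs, and the capacity-optimized-prioritized allocation strategy. The cluster name, release label, subnet IDs, capacities and IAM roles are placeholders.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-etl-fleet",                              # hypothetical cluster name
    ReleaseLabel="emr-6.15.0",                           # assumed release label
    Applications=[{"Name": "Spark"}],
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        # EMR picks the subnet (AZ) with the best capacity from this list
        "Ec2SubnetIds": ["subnet-aaa111", "subnet-bbb222", "subnet-ccc333"],
        "KeepJobFlowAliveWhenNoSteps": True,
        "InstanceFleets": [
            {"Name": "master-fleet", "InstanceFleetType": "MASTER",
             "TargetOnDemandCapacity": 1,
             "InstanceTypeConfigs": [{"InstanceType": "m6g.xlarge"}]},
            {"Name": "core-fleet", "InstanceFleetType": "CORE",
             "TargetOnDemandCapacity": 4,                # core stays On-Demand
             "InstanceTypeConfigs": [{"InstanceType": "r6g.2xlarge"}]},
            {"Name": "task-fleet", "InstanceFleetType": "TASK",
             "TargetSpotCapacity": 8,                    # task nodes run on Spot
             "InstanceTypeConfigs": [
                 # Lower Priority value = higher priority under the
                 # prioritized allocation strategy
                 {"InstanceType": "r6g.2xlarge", "WeightedCapacity": 1, "Priority": 1.0},
                 {"InstanceType": "r6gd.2xlarge", "WeightedCapacity": 1, "Priority": 2.0},
                 {"InstanceType": "m6g.2xlarge", "WeightedCapacity": 1, "Priority": 3.0},
             ],
             "LaunchSpecifications": {
                 "SpotSpecification": {
                     "TimeoutDurationMinutes": 20,
                     "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                     "AllocationStrategy": "capacity-optimized-prioritized",
                 }
             }},
        ],
    },
)
print("Cluster:", response["JobFlowId"])
```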
Option 5: Managed Scaling
For long-running clusters or scheduled jobs with unpredictable demand, use managed scaling to scale out EMR nodes based on load and scale back in when the demand dies off. Adding nodes speeds up job execution and thereby helps reduce cost, and managed scaling can also take advantage of Spot instances as mentioned above to reduce cost further.
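A minimal sketch of attaching a managed scaling policy to an existing cluster with boto3. The cluster ID and capacity limits are placeholders; keeping MaximumOnDemandCapacityUnits below MaximumCapacityUnits pushes scale-out beyond that point onto Spot task nodes, combining Options 4 and 5.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.put_managed_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",                  # placeholder cluster ID
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",              # limits apply to core + task nodes
            "MinimumCapacityUnits": 2,            # never scale in below 2 instances
            "MaximumCapacityUnits": 20,           # hard ceiling for scale-out
            "MaximumOnDemandCapacityUnits": 6,    # beyond 6 instances, scale out on Spot
            "MaximumCoreCapacityUnits": 6,        # capacity beyond 6 goes to task nodes
        }
    },
)
```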
Option 6: Cluster Grouping
Identify and group clusters based on usage.
Grouping could be based on -
- Environment, e.g. Dev, QA, Prod
- Department or team
- Workload type - interactive, batch ETL
- Scheduling frequency - hourly, daily, every 'x' minutes, etc.
Grouping helps isolate workloads from each other, e.g. Dev and QA clusters are isolated from Prod and can be shut down after 'x' hours of idle time to save cost.
For workloads with short SLAs (a few minutes), transient EMR on EC2 clusters incur extra cost due to startup time, so EMR on EKS or EMR Serverless can provide better cost benefits. If EMR on EC2 is your preferred deployment, then persistent (long-running) clusters combined with Options 4 and 5 will help save cost. Based on frequency, workloads can be grouped to execute on a few clusters and take advantage of already running resources rather than spawning new clusters. For SLA-sensitive workloads, either parallel invocation on the same cluster or spawning a few parallel clusters can meet the SLA and still save cost.
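Two small housekeeping calls make the grouping actionable: tag each cluster for cost attribution by environment, team and workload, and attach an auto-termination policy so non-prod clusters shut down after sitting idle. This is a minimal boto3 sketch; the cluster ID, tag values and idle timeout are placeholders.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")
cluster_id = "j-XXXXXXXXXXXXX"                    # placeholder cluster ID

# Cost-allocation tags so the bill can be split by group
emr.add_tags(
    ResourceId=cluster_id,
    Tags=[
        {"Key": "environment", "Value": "dev"},
        {"Key": "team", "Value": "data-platform"},
        {"Key": "workload", "Value": "batch-etl"},
    ],
)

# Shut the cluster down automatically after 2 hours of inactivity
emr.put_auto_termination_policy(
    ClusterId=cluster_id,
    AutoTerminationPolicy={"IdleTimeout": 7200},  # seconds
)
```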
Option 7: Code Tuning
Code tuning is an extensive process that needs repetitive validation. Here are a few tips for tuning with minimal or zero code changes (a combined sketch follows the list) -
- Use S3 as the persistent data store instead of EBS-backed HDFS. This promotes loose coupling and allows compute to scale independently of storage.
- Monitor Spark jobs to identify straggler tasks that cause a job to overrun its expected execution time. Enable "spark.speculation" so that slow-running tasks are identified, killed and rescheduled on other nodes.
- Check the amount of data processed per task; if the data is unevenly distributed, repartition the dataset for a uniform distribution.
- File format and compaction:
File format - if the data is append-only, standard Parquet or ORC works well, but if updates and deletes happen regularly, open table formats (OTF) such as Delta Lake, Iceberg and Hudi are advisable to handle reads and writes efficiently.
Compaction - check for large numbers of small files in the S3 path and merge them into a smaller number of larger files (~128 MB) to reduce IO and the cost of S3 API calls.
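Here is a combined, minimal PySpark sketch of the speculation, repartitioning and compaction tips above. The S3 paths, partition column and partition counts are placeholders; the right numbers depend on your data volume (roughly total output size divided by 128 MB for the compaction step).

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-etl")
    .config("spark.speculation", "true")                  # re-launch straggler tasks on other nodes
    .getOrCreate()
)

# Hypothetical S3 input path
df = spark.read.parquet("s3://my-bucket/raw/events/")

# Even out skewed partitions before wide transformations
df = df.repartition(200, "event_date")

# ... transformations ...

# Compaction: write a small number of ~128 MB files instead of many tiny ones
df.coalesce(50).write.mode("overwrite").parquet("s3://my-bucket/curated/events/")
```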
Conclusion -
Customers who followed these optimization techniques were able to reduce cost quickly with minimal effort. On average, the observed cost reduction was ~20-30% of the overall Lake House processing cost, though these numbers differ depending on the use case and implementation.
References —
Reduce Cost using Managed Scaling
EMR Cost Optimization Perspective Guidance
Disclaimer: Ideas / views expressed are my personal opinion