3 and 1/2 Ways to Optimize AWS MSK for Predictable Cost

Amit Damle
Sep 26, 2024

Customers are adopting Kafka for real-time scenarios, inter-service communication and many other messaging needs. Though it is easy to get started with Kafka, managing it can prove cumbersome without a team that has the relevant expertise.

Most customers on self-managed Kafka eventually look for managed alternatives, e.g. Confluent, AWS MSK and others. More and more customers are choosing Amazon MSK as their managed Kafka platform due to its high availability, ease of setup, deployment flexibility and low maintenance overhead.

What is Amazon MSK (Managed Streaming for Apache Kafka) -

Amazon MSK is a fully managed service that offers open-source Apache Kafka on managed infrastructure. Amazon MSK is offered in 2 forms: Provisioned and Serverless.

This blog provides 3 and 1/2 options to reduce the cost of an MSK cluster. The last one counts as a half because it may or may not have an impact compared to the other options.

Option-1: Right size your broker
Choosing the correct broker size is important from a performance as well as a cost perspective. If the broker size is too large it costs more than necessary, and if it is too small performance may suffer. Finding the optimal broker size can be an iterative process, but AWS provides an MSK sizing calculator that helps expedite choosing the right broker size. Please refer to this Excel sheet, which is publicly available. After you provide the necessary inputs, e.g. ingress rate, egress rate, replication factor, retention hours and number of AZs, it suggests the number of brokers and the broker type. Always look out for available Graviton instances, which provide better performance at an affordable cost.
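Before opening the calculator, a quick back-of-the-envelope estimate helps sanity-check its output. The sketch below is not the formula used by the MSK sizing sheet; it is only the basic arithmetic of ingress rate times retention times replication factor. The function names and the 20% headroom value are assumptions made for illustration.

```python
def estimate_storage_gb(ingress_mb_per_sec, retention_hours, replication_factor):
    """Rough total storage (GB) the cluster must hold for the retention window."""
    seconds_retained = retention_hours * 3600
    total_mb = ingress_mb_per_sec * seconds_retained * replication_factor
    return total_mb / 1000  # MB -> GB, decimal units


def storage_per_broker_gb(total_gb, broker_count, headroom=0.2):
    """Spread the total evenly across brokers and add free-space headroom.

    The 20% default headroom is an illustrative assumption, not an AWS figure.
    """
    return total_gb / broker_count * (1 + headroom)


# Example: 10 MB/s ingress, 24 hours retention, replication factor 3, 3 brokers.
total = estimate_storage_gb(ingress_mb_per_sec=10, retention_hours=24, replication_factor=3)
per_broker = storage_per_broker_gb(total, broker_count=3)
print(f"total: {total:.0f} GB, per broker: {per_broker:.0f} GB")
```

The estimate ignores compression, uneven partition distribution and egress load, so treat it as a starting point for the iterative sizing described above.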

Option-2: Use tiered storage —

MSK supports tiered storage for long-term data retention. If you need to retain data for weeks or months, consider using tiered storage. MSK provides managed secondary storage (S3) to store data for longer periods at an affordable cost. Messages are evicted from primary storage (EBS volumes) after the configured primary retention time, which allows the EBS storage size to be kept small and reduces overall cluster cost by at least 10–15%.
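As a minimal sketch of how the two retention windows relate, the snippet below builds the topic-level settings for a tiered topic: `remote.storage.enable`, `local.retention.ms` (time kept on primary storage) and `retention.ms` (total retention). The helper names are hypothetical, and the reduction figure assumes a steady ingest rate; it describes data volume held on EBS, not the cost saving itself.

```python
DAY_MS = 24 * 60 * 60 * 1000  # milliseconds in one day


def tiered_topic_config(local_retention_days, total_retention_days):
    """Build topic-level configs for a topic that uses tiered storage."""
    if local_retention_days > total_retention_days:
        raise ValueError("local retention cannot exceed total retention")
    return {
        "remote.storage.enable": "true",
        "local.retention.ms": str(int(local_retention_days * DAY_MS)),
        "retention.ms": str(int(total_retention_days * DAY_MS)),
    }


def primary_storage_reduction(local_retention_days, total_retention_days):
    """Fraction of retained data that no longer sits on primary (EBS) storage.

    Assumes a constant ingest rate over the whole retention window.
    """
    return 1 - local_retention_days / total_retention_days


# Example: keep 1 day on EBS, 30 days in total.
for key, value in tiered_topic_config(1, 30).items():
    print(f"{key}={value}")
print(f"reduction in EBS-resident data: {primary_storage_reduction(1, 30):.0%}")
```

The resulting key/value pairs can be applied with whatever tool you already use to manage topic configuration.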

Option-3: Monitor and Cleanup —

Most customers on non-tiered-storage MSK clusters enable the storage auto-scaling feature or add brokers to handle increased demand during specific events such as Black Friday, Big Billion Days or seasonal sports events. After the event is over, the scaled-up resources, e.g. the extra storage or additional compute power, may no longer be needed. This can be one of the factors contributing to higher MSK cost.
It is advisable to follow a routine as below -

  1. Keep monitoring cluster resources.
  2. Periodically identify unused topics and remove them to reclaim space.
  3. If the storage requirement is less than the available storage, remove the additional broker nodes (assuming the compute requirement is also lower than what was needed during the event). With storage auto-scaling, the added storage cannot be removed and you keep paying for it even when it is unused, so migrate to a smaller cluster as part of a maintenance activity.
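Step 2 of the routine above can be sketched as a small filter over per-topic traffic totals. How you collect the totals is up to you (for example, summing the topic-level `BytesInPerSec` CloudWatch metric over a review window); the input dictionary, the function name and the threshold here are hypothetical and only illustrate the idea.

```python
def find_idle_topics(bytes_in_by_topic, min_bytes=0, protected_prefixes=("__",)):
    """Return topics whose ingested bytes over the review window are <= min_bytes.

    Topics starting with a protected prefix (Kafka internal topics such as
    __consumer_offsets) are never reported.
    """
    idle = []
    for topic, total_bytes in sorted(bytes_in_by_topic.items()):
        if topic.startswith(protected_prefixes):
            continue  # skip internal topics
        if total_bytes <= min_bytes:
            idle.append(topic)
    return idle


# Example with made-up totals for a 30-day review window.
observed = {
    "orders": 5_000_000_000,
    "black-friday-clicks": 0,
    "legacy-audit": 0,
    "__consumer_offsets": 0,
}
print(find_idle_topics(observed))
```

Treat the result as a list of candidates to review with the owning teams before deleting anything, since a topic with no recent writes may still have consumers reading retained data.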

Option-3.5: Open Monitoring —

Use the open monitoring feature of MSK, which enables the JMX exporter on brokers so that telemetry can be scraped with Prometheus and visualized in Grafana. The out-of-the-box CloudWatch monitoring is also sufficient and can provide detail down to the partition level, but enabling topic-level and partition-level telemetry incurs additional cost.
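One way to turn open monitoring on programmatically is the `update_monitoring` call of the boto3 `kafka` client. The sketch below assumes boto3 is installed and AWS credentials are configured; the cluster ARN and region are placeholders, and the helper names are made up for this example. The request is built in a separate function so it can be inspected without touching AWS.

```python
def open_monitoring_request(cluster_arn, current_version, node_exporter=True):
    """Build keyword arguments for the MSK UpdateMonitoring API call."""
    return {
        "ClusterArn": cluster_arn,
        "CurrentVersion": current_version,
        "OpenMonitoring": {
            "Prometheus": {
                "JmxExporter": {"EnabledInBroker": True},
                "NodeExporter": {"EnabledInBroker": node_exporter},
            }
        },
    }


def enable_open_monitoring(cluster_arn, region_name):
    """Enable open monitoring on an existing cluster (performs real AWS calls)."""
    import boto3  # imported lazily so the request builder works without boto3

    client = boto3.client("kafka", region_name=region_name)
    # UpdateMonitoring requires the cluster's current version string.
    info = client.describe_cluster(ClusterArn=cluster_arn)["ClusterInfo"]
    request = open_monitoring_request(cluster_arn, info["CurrentVersion"])
    return client.update_monitoring(**request)


# Inspect the request for a placeholder cluster without calling AWS.
print(open_monitoring_request("arn:aws:kafka:REGION:ACCOUNT:cluster/NAME/ID", "K1EXAMPLE"))
```

Once enabled, Prometheus can scrape the exporters on each broker; the update is applied as a cluster operation, so plan it alongside other maintenance.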

References —

MSK Best Practices

Right size your MSK Cluster

Disclaimer: Ideas / views expressed here are my personal opinions
