Kafka Cheat Sheet

General Information

Sizing & Best Practices

  • Kafka Best Practices (HCC article), plus additional HWX Consulting field experience:
    • Use the newer consumer API, which commits offsets to Kafka itself rather than ZooKeeper, to reduce load on ZooKeeper, especially when the same ZooKeeper ensemble is shared with other services such as NameNode HA or HBase (see the consumer sketch after this list).
    • Use multiple, dedicated disks for Kafka log directories to improve throughput.
    • Kafka parallelism is tightly coupled to the number of partitions, so choose the partition count carefully. A general recommendation is one partition per physical disk and one consumer per partition; the maximum number of active consumers in a group for a topic equals its partition count, and consumers in the same consumer group split the partitions among themselves (see the topic-creation sketch after this list).
    • Sizing depends on message size, daily volume, retention period, and I/O requirements (a back-of-envelope calculation follows this list).
  • Unofficial Storm and Kafka Best Practices Guide (HCC article)
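
A minimal sketch of the newer (Java) consumer API referenced above: it bootstraps directly from the brokers and commits offsets to Kafka's internal __consumer_offsets topic instead of ZooKeeper. The broker address, group id, and topic name below are placeholder values, not values from this document.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class NewApiConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // The new consumer API talks to the brokers directly; committed offsets
            // live in the __consumer_offsets topic, so ZooKeeper sees no extra load.
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // assumed broker address
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");          // assumed consumer group
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("example-topic")); // assumed topic name
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }

Running a second instance of this consumer with the same group.id causes Kafka to split the topic's partitions between the two instances, which is the parallelism behavior described in the list above.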
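
A sketch of creating a topic with an explicit partition count via the AdminClient API. The broker address, topic name, partition count, and replication factor are illustrative assumptions; the point is that a 12-partition topic supports at most 12 active consumers in a single consumer group.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // assumed broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // 12 partitions caps the group at 12 active consumers for this topic;
                // partition count and replication factor are example values only.
                NewTopic topic = new NewTopic("example-topic", 12, (short) 3);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }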
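
A back-of-envelope sizing calculation under assumed numbers (average message size, daily volume, retention period, replication factor). The figures below are hypothetical; real sizing should use measured values and leave headroom for I/O, partition rebalancing, and growth.

    public class KafkaSizingEstimate {
        public static void main(String[] args) {
            // Illustrative assumptions, not measured values.
            long avgMessageBytes   = 1_024;          // average message size
            long messagesPerDay    = 500_000_000L;   // daily volume
            int  retentionDays     = 7;              // topic retention period
            int  replicationFactor = 3;

            long rawBytesPerDay = avgMessageBytes * messagesPerDay;
            long retainedBytes  = rawBytesPerDay * retentionDays * replicationFactor;

            System.out.printf("Raw ingest per day        : %.2f TiB%n",
                    rawBytesPerDay / Math.pow(1024, 4));
            System.out.printf("Disk needed (7d, RF=3)    : %.2f TiB (before headroom)%n",
                    retainedBytes / Math.pow(1024, 4));
        }
    }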