Publish-subscribe systems are essential for organizing and managing large flows of data, depending on your business requirements. To get the most out of your Apache Kafka cluster, it helps to know the variety of purposes Kafka is commonly used for.
Using Apache Kafka to decouple processes via messaging creates a highly scalable, easily managed system without the need to build one large application. Applications connected through Kafka messaging can be written in multiple languages, evolve independently, and be maintained by separate developer teams, and the resulting system is very fault tolerant.
Tracking customer activity on websites, such as page views, searches, or other actions, often requires high throughput, since many activity messages are generated for each user. Kafka is also a popular log aggregation solution, where physical log files from servers are collected and placed in a central location for processing. With Apache Kafka, you can publish a record for every event occurring in your application, and other parts of the system can then subscribe to these records and take the necessary actions.
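As a sketch of that pattern, here is a minimal producer using the Kafka Java client. The broker address, the topic name "page-views", and the JSON payload are illustrative assumptions, not part of any fixed convention:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ActivityTracker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // One record per user action; key by user so one user's events stay ordered.
            producer.send(new ProducerRecord<>("page-views", "user-42",
                    "{\"action\":\"page_view\",\"url\":\"/blog/kafka\"}"));
        }
    }
}
```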
Apache Kafka is designed to run in a cluster with at least three brokers and is meant to scale horizontally. However, it can also scale vertically, making it ideal for businesses that prioritize scalability. Scaling vertically means giving an existing machine more resources: more CPU, RAM, and/or disk. Scaling horizontally means adding brokers to the Kafka cluster without necessarily increasing the resources of individual brokers.
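As a rough illustration of horizontal scaling, a new broker joins an existing cluster by pointing at the same coordination layer with a unique id. The values below are illustrative, not a recommended production setup:

```
# server.properties for the new, fourth broker (illustrative values)
broker.id=3                                   # must be unique within the cluster
listeners=PLAINTEXT://broker3.example.com:9092
log.dirs=/var/lib/kafka/data
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181  # same ensemble as the existing brokers
```

Note that existing partitions are not moved onto the new broker automatically; the kafka-reassign-partitions tool can be used to redistribute them.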
This use case follows a scenario with a simple website where users can click around, sign in, write blog articles, upload images, and publish articles.
When an event happens in the blog (e.g., when someone logs in, presses a button, or uploads an image to an article), a tracking event and information about the event are placed into a record, and the record is written to a specified Kafka topic. One topic is named "click" and another is named "upload".
Partitioning is set up based on the user's id: a user with id 0 maps to partition 0, a user with id 1 to partition 1, and so on. The "click" topic will be split into three partitions (three users) on two different machines.
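A sketch of how such a click record could be produced, mapping the user id directly to a partition number. The broker address and payload are assumptions, and error handling is omitted:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ClickProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            long userId = 1L;
            int partition = (int) (userId % 3); // user id 0 -> partition 0, 1 -> 1, 2 -> 2
            producer.send(new ProducerRecord<>("click", partition,
                    String.valueOf(userId), "{\"action\":\"click\",\"target\":\"publish\"}"));
        }
    }
}
```

In practice it is more common to set the user id as the record key and let Kafka's default partitioner hash it to a partition, which preserves per-user ordering without hard-coding partition numbers.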
Consider a webshop with a ‘similar products’ feature on the site. To make this work, each action performed by a consumer is recorded and sent to Kafka. A separate application consumes these messages, filters out the products the consumer has shown an interest in, and gathers information on similar products. This ‘similar products’ information is then sent back to the webshop for it to display to the consumer in real time.
Alternatively, since all data is persisted in Kafka, a batch job can run overnight on the ‘similar products’ information gathered by the system, generating an email for the customer with product suggestions.
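A minimal consumer sketch for such an application, assuming the customer actions land on a topic called "shop-activity" (the topic and group id are illustrative). With auto.offset.reset set to earliest, the same approach also lets an overnight batch job re-read the persisted history from the beginning:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimilarProductsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "similar-products");   // illustrative group id
        props.put("auto.offset.reset", "earliest");  // re-read persisted history if no offset exists
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("shop-activity")); // assumed topic of customer actions
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Filter for product-interest events and look up similar products here.
                    System.out.printf("user=%s event=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```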
Servers can be monitored and set to trigger alarms in case of rapid changes in usage or system faults. Information from server agents can be combined with the server syslog and sent to a Kafka cluster. Through Kafka Streams, these topics can be joined and set to trigger alarms based on usage thresholds, with alarms carrying full information for easier troubleshooting of system problems before they become catastrophic.
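A sketch of that idea with the Kafka Streams API, assuming both topics are keyed by hostname and carry string values; the topic names, window size, and threshold check are placeholders:

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.StreamJoined;

public class ServerAlerts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "server-alerts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> metrics = builder.stream("server-metrics"); // assumed topic
        KStream<String, String> syslog = builder.stream("server-syslog");   // assumed topic

        // Join metrics with syslog lines from the same host within a one-minute window,
        // then keep only records that exceed a usage threshold.
        metrics.join(syslog,
                        (metric, log) -> metric + " | " + log,
                        JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(1)),
                        StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()))
               .filter((host, joined) -> joined.startsWith("cpu=9")) // placeholder threshold check
               .to("alerts");

        new KafkaStreams(builder.build(), props).start();
    }
}
```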
Apache Kafka has another interesting feature not found in RabbitMQ: log compaction. Log compaction ensures that Kafka always retains the last known value for each record key. Kafka simply keeps the latest version of a record and deletes the older versions with the same key.
Log compaction is useful, for example, when displaying the latest status of one cluster among thousands of running clusters. The current status of each cluster is written into Kafka, and the topic is configured to compact the records. When this topic is consumed, it first displays the latest status of each cluster and then a continuous stream of new statuses.
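Creating such a compacted topic only requires setting cleanup.policy on the topic; the topic name and sizing below are illustrative:

```
kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic cluster-status \
  --partitions 1 --replication-factor 3 \
  --config cleanup.policy=compact
```

Producers then write each cluster's status keyed by its cluster id, and compaction eventually discards older records that share a key.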
Choosing Kafka can be beneficial where large amounts of data need to be processed at high speed. Communicating through Kafka as a broker decouples processes and creates a highly scalable system.
Instead of building one large application, decoupling involves splitting an application into parts that communicate with each other only asynchronously, with messages. That way, different parts of the application can evolve independently, be written in different languages, and/or be maintained by separate developer teams. Compared to many messaging systems, Kafka has better throughput, and its built-in partitioning, replication, and fault tolerance make it a good solution for large-scale message-processing applications.
A wide range of examples, use cases, and details are also available in the Apache Kafka documentation.