👉 Kafka is open source software that provides a framework for storing, reading and analysing streaming data. Being open source means that it is essentially free to use and has a large network of users and developers who contribute updates, new features and support for new users.💯
Kafka is designed to be run in a “distributed” environment, which means that rather than sitting on one user’s computer, it runs across several (or many) servers, leveraging the additional processing power and storage capacity that this brings.
- Kafka takes information, which can be read from a huge number of data sources, and organises it into “topics”. As a very simple example, one of these data sources could be a transactional log where a grocery store records every sale.
- Kafka would process this stream of information and organise it into “topics”, such as “number of apples sold” or “number of sales between 1pm and 2pm”, which could then be analysed by anyone needing insights into the data.
- This may sound similar to how a conventional database lets you store or sort information, but in the case of Kafka it would be suitable for a national chain of grocery stores processing thousands of apple sales every minute.
- This is achieved using a component known as a Producer, which is an interface between applications (e.g. the software monitoring the grocery store’s structured but unsorted transaction database) and the topics, which are Kafka’s own store of ordered, segmented data, known as the Kafka Topic Log.
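To make the producer and topic log idea above more concrete, here is a minimal in-memory sketch. It is a toy model and not the Kafka client API: the `TopicLog` and `Producer` classes, the `grocery-sales` topic name and the record fields are all assumptions made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TopicLog:
    """Toy stand-in for a topic: an append-only, ordered list of records."""
    name: str
    records: list = field(default_factory=list)

    def append(self, record: dict) -> int:
        """Append a record and return its offset (its position in the log)."""
        self.records.append(record)
        return len(self.records) - 1

    def read_from(self, offset: int) -> list:
        """Return every record at or after the given offset, in log order."""
        return self.records[offset:]


class Producer:
    """Hypothetical producer: forwards application records to a topic log."""

    def __init__(self, topic: TopicLog):
        self.topic = topic

    def send(self, record: dict) -> int:
        return self.topic.append(record)


sales = TopicLog("grocery-sales")
producer = Producer(sales)
for item, qty in [("apple", 3), ("pear", 1), ("apple", 2)]:
    producer.send({"item": item, "qty": qty})

# A consumer reads the log in order and derives its own aggregate.
sold = Counter()
for rec in sales.read_from(0):
    sold[rec["item"]] += rec["qty"]
print(sold["apple"])  # 5
```

The point of the sketch is that the log only ever grows at the end, so any reader can replay it from a chosen offset and compute whatever summary it needs, such as “number of apples sold”.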
👉Advantages of Kafka
- High performance: Kafka processes messages at very high speed. Throughput can exceed 100,000 messages per second with low latency, and performance stays stable under extreme data loads (terabytes of stored messages). The data is processed in a partitioned and ordered fashion.
- Scalability: Kafka is a distributed system that can handle large volumes of data and scale quickly without downtime. It provides scalability by allowing partitions to be distributed across different servers.
- Fault Tolerance: Kafka is a distributed system consisting of several nodes running together to serve the cluster. This distribution makes it resistant to a node or machine failure within the cluster.
- Durability: Kafka is highly durable. Messages are persisted on disk as quickly as possible.
- Easy accessibility: Because all the data is stored in Kafka, it is easily accessible to anyone who needs it.
- Eliminates multiple integrations: It eliminates multiple data source integrations as all a producer’s data goes to Kafka. This reduces complexity, time and cost.
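The partitioning and scalability points above can be sketched in a few lines. This is an illustration under stated assumptions, not how Kafka itself is implemented: `zlib.crc32` is a stand-in hash chosen for the sketch, and the broker names and the round-robin placement in `assign_partitions` are hypothetical.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition; the same key always gets the same one."""
    # crc32 is a stand-in hash for this sketch, not Kafka's own partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions: int, brokers: list) -> dict:
    """Spread partitions across brokers round-robin (hypothetical placement)."""
    return {p: brokers[p % len(brokers)] for p in range(num_partitions)}


brokers = ["broker-a", "broker-b", "broker-c"]
placement = assign_partitions(6, brokers)

# Each partition is its own ordered log; records with one key stay together.
partitions = {p: [] for p in placement}
for store, sale in [("store-1", "apple"), ("store-2", "pear"), ("store-1", "plum")]:
    partitions[partition_for(store, 6)].append((store, sale))

p = partition_for("store-1", 6)
store_1_sales = [s for s in partitions[p] if s[0] == "store-1"]
print(store_1_sales)
```

Because a key always hashes to the same partition, the sales of one store keep their order, while different partitions can live on different servers and be processed in parallel.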
👉Disadvantages of Kafka
- Not suitable for historical data: Kafka is not designed as a long-term store for historical data. Messages are kept only for a configurable retention period (seven days by default in the broker settings), after which they are deleted.
- Slow behavior: Kafka can become slow as the number of queues (topics and partitions) in a cluster increases.
- Lack of monitoring tools: Kafka doesn’t ship with a complete set of monitoring and management tools. To overcome this, we can use third-party tools such as Kafka Monitor (developed by LinkedIn), Datadog and Prometheus to monitor Kafka clusters. Many other open-source and commercial options are also available.
- No wildcard topic support: Kafka addresses topics by exact name and doesn’t support glob-style wildcard topics. So, for example, if you have the topics metric_2022_01_01 and metric_2022_01_02, you can’t select both with a wildcard like metric_2022_*. (Consumer clients can, however, subscribe to topics using a regular expression.)
- Reduces performance: As message sizes grow, brokers and consumers spend more time compressing and decompressing messages. This can reduce the Kafka system’s performance and affect its throughput.
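As a follow-up to the wildcard limitation above, one possible workaround is to expand a pattern into exact topic names on the client side before using them. The sketch below assumes the list of existing topic names has already been obtained somehow (that step is not shown), and `expand_topic_pattern` is a hypothetical helper, not part of any Kafka library.

```python
from fnmatch import fnmatchcase


def expand_topic_pattern(pattern: str, known_topics: list) -> list:
    """Return the exact topic names that match a glob-style pattern."""
    # fnmatchcase does case-sensitive shell-style matching on plain strings.
    return sorted(t for t in known_topics if fnmatchcase(t, pattern))


# Assumed list of topic names; in practice it would come from the cluster.
known_topics = ["metric_2022_01_02", "orders", "metric_2022_01_01"]

matched = expand_topic_pattern("metric_2022_*", known_topics)
print(matched)  # ['metric_2022_01_01', 'metric_2022_01_02']
```

The resulting list contains only exact names, so it can be passed to anything that expects them.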