Apache Kafka: Introduction

codechef vaibhav kashyap
4 min read · Oct 23, 2020

Welcome, readers, to yet another series. This time we're going to turn over some big rocks in data streaming, so without wasting much time let me introduce you to Apache Kafka.

Let’s begin with the theoretical definition: Apache Kafka is an open-source stream-processing software platform developed by the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.

Let's try to understand it better with the help of an example. Assume you have a source system and a target system that need to exchange data.

After a while, as your setup grows, you have many source systems and many target systems, and they all have to exchange data with one another & things become far more complicated.

So this becomes a challenging problem for organizations as they grow with the previous architecture. Imagine you have 4 source systems & 6 target systems: you need to write 4 × 6 = 24 integrations. Each integration comes with its own difficulties: the protocol, i.e. how the data is transported (TCP, HTTP, FTP, etc.); the data format, i.e. how the data is parsed (binary, JSON, CSV, Avro, etc.); and the data schema & its evolution, i.e. how the data is shaped & how it may change over time. Additionally, every source system you integrate with a target system adds load from the extra connections. So how do we solve this? Well, this is where Apache Kafka comes to the rescue.
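The arithmetic above can be sketched in a few lines of Python (a toy illustration, not Kafka code): point-to-point integrations grow multiplicatively with the number of systems, while routing everything through a broker grows only additively.

```python
def point_to_point(sources, targets):
    # every source needs its own integration with every target
    return sources * targets

def via_broker(sources, targets):
    # each system only needs one connection: to the broker
    return sources + targets

print(point_to_point(4, 6))  # 24 direct integrations
print(via_broker(4, 6))      # 10 connections through Kafka
```

With 4 sources and 6 targets the broker already saves you 14 integrations, and the gap only widens as the architecture grows.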

So Apache Kafka allows you to decouple your data streams and your systems. Your source systems now publish their data into Apache Kafka, while your target systems read their data from Apache Kafka.

For example, follow the diagram below:

Once the data is in Kafka, you can feed it into any system you like: your database, your analytics systems, or your audit systems.
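To make the decoupling idea concrete, here is a minimal sketch in plain Python (a toy in-memory stand-in, not the real Kafka client): sources publish to a named topic on a broker, and each target reads from that topic independently, keeping its own offset. The names `MiniBroker` and `website_events` are made up for illustration.

```python
from collections import defaultdict

class MiniBroker:
    """Toy in-memory stand-in for a Kafka broker: topics are append-only logs."""
    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> list of messages
        self.offsets = defaultdict(int)   # (topic, consumer) -> next offset to read

    def produce(self, topic, message):
        # a source system appends to the topic; it never talks to targets directly
        self.topics[topic].append(message)

    def consume(self, topic, consumer):
        # each consumer tracks its own offset, so targets read independently
        offset = self.offsets[(topic, consumer)]
        messages = self.topics[topic][offset:]
        self.offsets[(topic, consumer)] = len(self.topics[topic])
        return messages

broker = MiniBroker()
broker.produce("website_events", {"user": "alice", "action": "click"})
print(broker.consume("website_events", "analytics"))  # analytics sees the event
print(broker.consume("website_events", "audit"))      # audit sees the same event
```

Notice that the producer never knows who consumes the event: adding a new target system is just one more call to `consume`, with no change to any source.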

But Why Apache Kafka?

It was created at LinkedIn & is now an open-source project under Apache stewardship, with Confluent as its main commercial maintainer. Its distributed, resilient, fault-tolerant architecture makes it easy to scale, and it scales horizontally: there are Kafka clusters with over 100 brokers, and LinkedIn & many other companies have proven it can handle millions of messages per second. It is extremely high-performing: the latency to move data from one system to another is usually less than 10 milliseconds, which is close to real-time. Currently, more than 2,000 firms, including 35% of the Fortune 500, use Kafka, such as LinkedIn, Airbnb, Netflix, Uber, etc. Everyone uses Kafka or is thinking about using Kafka.

Let’s discuss a few use-cases for Apache Kafka :

a) Messaging System

b) Activity Tracking

c) Gather metrics from many different locations, for example from IoT devices

d) Gather logs from your applications

e) Stream processing (with the Kafka Streams API or Apache Spark, for example)

f) Decoupling of system dependencies, to reduce the load on your databases and your systems

g) Perform big data integrations for example with Spark, Storm, Hadoop, and other big data technologies.
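As a taste of use-case (e), here is a toy sketch of what stream processing means (pure Python, not the Kafka Streams API): instead of waiting for a batch, the processor updates its state and emits a result as each record arrives. The event names are made up for illustration.

```python
def process_stream(events):
    """Toy stateful stream processor: emits a running count per key, per event."""
    counts = {}
    for event in events:
        counts[event] = counts.get(event, 0) + 1
        # emit the updated count immediately, as each record arrives
        yield event, counts[event]

stream = ["login", "click", "login", "click", "click"]
for key, count in process_stream(stream):
    print(key, count)
```

Kafka Streams applies the same idea continuously to records flowing through a topic, with the state kept fault-tolerant across restarts.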

So this is why companies are rushing to introduce Apache Kafka into their architecture.

Let's find out where Apache Kafka shows up in our lives:

1) Netflix uses Kafka to apply recommendations in real-time while you’re watching TV shows.

2) Uber uses Kafka to gather user, taxi & trip data in real-time to compute and forecast demand, and compute surge pricing in real-time.

3) LinkedIn uses Kafka to prevent spam and to collect user interactions in order to make better connection recommendations in real-time.

All these companies use Kafka to make real-time recommendations and real-time decisions, and to give real-time insights to their users; this is why it's so good.

One point I would like to highlight here is that Kafka is only a transportation mechanism: people still write their own applications or web applications to make things work, and what Kafka is really good at is moving your data really fast, at scale.

This gives you a brief idea of how Kafka works, where it sits in companies & how they use it.

This is just the beginning; in subsequent posts we're going to learn how to set up Kafka on a local machine & how to use it. So stay tuned.

Happy Reading!!!
