Sometimes the interviewer will ask about your previous project background, and you don't want them to hear something like, "You know what? I used a lot of fancy technology on that project, but I know nothing about those technologies except how to find the APIs I needed."
The notes below were written during my last internship, after reading the Kafka documentation. They are presented as interview questions to help those who are still struggling to find a nice job. (AKA me)
What is Kafka?
Kafka is a message broker project written in Scala. It uses ZooKeeper to store the offset values of messages. It uses TCP to manage communication between the clients and the servers. It relies heavily on the filesystem and disk for storing and caching messages. The most popular client for Kafka is written in Java.
What are the main components of Kafka?
Topic
The stream of messages that belong to the same category.
Producer
It publishes messages to a topic.
Broker
A set of servers where all published data is stored.
Consumer
It subscribes to different topics and fetches data from the brokers.
What is the data structure of a Kafka stream?
A record represents a single data entry pushed by a producer. A topic partition is an ordered, immutable sequence of records that is continually appended to, forming a structured commit log. A partition has a data structure very similar to a queue. The records in a partition are each assigned a sequential id number called the offset, which uniquely identifies each record within the partition. A topic contains multiple partitions; it can be seen as a channel that categorizes different producers' input.
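As a mental model, here is a toy sketch of a partition as an append-only log with sequential offsets. This is only an illustration of the data structure, not Kafka's actual implementation (real partitions live on disk and hold binary records):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a topic partition: an append-only log where each record
// is assigned the next sequential offset and is never modified afterwards.
class ToyPartition {
    private final List<String> log = new ArrayList<>();

    // Appending returns the offset assigned to the new record.
    long append(String record) {
        log.add(record);
        return log.size() - 1;
    }

    // Consumers keep their own position (offset) and read from it.
    String read(long offset) {
        return log.get((int) offset);
    }
}
```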
How does data streaming work in Kafka?
A topic partition can be seen as the unit of parallelism in Kafka. Therefore, the throughput of most operations is bounded by the number of partitions.
The Kafka producer first declares a list of bootstrap servers and a topic; then it can start uploading data to that topic. We can customize which partition each record is sent to, but we should be careful to balance the number of records among the different partitions. Sending can be done asynchronously. We can also choose when a record counts as successfully sent: it can be determined by whether the leader partition has committed it or by whether all replicas have committed it.
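A minimal producer sketch in Java, assuming a broker at localhost:9092 and a topic named my-topic (both placeholders). The acks setting controls the trade-off mentioned above: acks=1 waits only for the leader, while acks=all waits for all in-sync replicas:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker list
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all"); // wait until all in-sync replicas have the record

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() is asynchronous; the callback runs once the broker responds.
            producer.send(new ProducerRecord<>("my-topic", "key", "value"),
                    (metadata, exception) -> {
                        if (exception != null) exception.printStackTrace();
                        else System.out.println("stored at offset " + metadata.offset());
                    });
        }
    }
}
```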
Kafka Streams can manage data from an input topic to an output topic. The pipeline is conceptually similar to MapReduce and can perform most common database operations, such as filtering, mapping, joining, and aggregating.
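A small Kafka Streams sketch, assuming placeholder topic names input-topic and output-topic, that reads records, transforms each value, and writes the result to another topic:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app"); // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Consume from the input topic, map each value, produce to the output topic.
        KStream<String, String> input = builder.stream("input-topic");
        input.mapValues(value -> value.toUpperCase()).to("output-topic");

        new KafkaStreams(builder.build(), props).start();
    }
}
```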
The Kafka consumer pulls data from the brokers. It can subscribe to some specific topics or retrieve data from all of them. Pulling always starts from the last committed offset. The commit operation updates the consumer's offset in the partition. When and how we call commit determines what kind of message delivery guarantee we get. We can also set a consumer group name for each consumer: consumers within the same group share one copy of the data, while two different consumer groups each read their own copy.
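A consumer sketch with manual commits, again assuming the placeholder broker and topic, plus a group id of my-group. Two instances started with the same group.id would split the topic's partitions between them:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "my-group"); // consumers sharing this id split the partitions
        props.put("enable.auto.commit", "false"); // commit manually to control delivery semantics
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                // poll() fetches records starting from the last committed offset.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
                consumer.commitSync(); // advance this group's offset in each partition
            }
        }
    }
}
```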
What kinds of message delivery guarantees does Kafka have?
At Most Once
We either successfully get our message or we lose it. If we commit right after pulling the data, the server might go down before we finish processing it. When we restart, the consumer continues reading from the already-updated offset, so the unprocessed records are lost.
At Least Once
We either successfully get our message or we get duplicated data. If we commit after finishing processing the data, the server might go down between the moment we finish processing and the moment we commit the offset. When we restart, the consumer starts reading from the last committed offset and processes those records again, so some data is processed twice.
Exactly Once
We always get our message exactly once. To achieve that, we need to wrap both the offset commit and the data processing in a single transaction. Although exactly once sounds like the perfect solution, holding a transaction can reduce processing speed. The most common strategy right now is at least once.
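The difference between the first two comes down to where the commit sits relative to the processing. A sketch, reusing the consumer setup from above (handle() stands in for hypothetical application logic):

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CommitPlacement {
    // At most once: commit *before* processing. If we crash after the commit
    // but before handle() finishes, those records are never seen again.
    static void atMostOnce(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
        consumer.commitSync();
        records.forEach(CommitPlacement::handle);
    }

    // At least once: commit *after* processing. If we crash between handle()
    // and the commit, a restart re-reads and re-processes the same records.
    static void atLeastOnce(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
        records.forEach(CommitPlacement::handle);
        consumer.commitSync();
    }

    // Hypothetical application logic.
    static void handle(ConsumerRecord<String, String> record) {
        System.out.println(record.value());
    }
}
```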
How to choose the number of partitions in one topic?
Usually, more partitions mean higher throughput. However, more partitions in one topic also mean more leader partitions for that topic. When a broker is shut down uncleanly, the partitions whose leader lives on that broker become temporarily unavailable, and they need to wait until Kafka moves their leaders to other replicas. The observed unavailability is therefore proportional to the number of partitions. Although unclean failures are rare in general, for those who care about availability in those cases, the number of partitions can't be too high.
Also, if the replication factor is high, we might experience some latency between pushing data and polling it, since every write needs to be synced among all the replicas.
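Partition count and replication factor are both chosen at topic creation. A sketch using the AdminClient API, with placeholder names and counts:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions bounds the consumer parallelism; replication factor 3
            // lets the topic survive up to 2 broker failures.
            NewTopic topic = new NewTopic("my-topic", 6, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```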
How does Kafka handle a server breakdown? (TODO)
We can set the replication factor for each topic partition. Each partition has a single leader and zero or more followers. All writes go to the leader of the partition, and reads are served by the leader as well, while the followers replicate it. Leaders should usually be distributed evenly among brokers, so that a single broken server won't affect too much.
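To check how leaders and replicas are spread across brokers, one option is to describe the topic through the AdminClient (a sketch, assuming the my-topic placeholder from before):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

public class DescribeTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singletonList("my-topic"))
                    .all().get().get("my-topic");
            // Each partition reports its current leader and replica list, so we
            // can verify that leaders are spread evenly across the brokers.
            desc.partitions().forEach(p ->
                    System.out.printf("partition=%d leader=%s replicas=%s%n",
                            p.partition(), p.leader(), p.replicas()));
        }
    }
}
```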
What does Kafka guarantee at a high level?
Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
A consumer instance sees records in the order they are stored in the log.
For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.
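The first guarantee holds per partition, and records with the same key are routed to the same partition by the default partitioner. Reusing the producer from the earlier sketch, with a hypothetical key "user-42":

```java
// Both records share a key, so they land in the same partition: M1 is sent
// first, gets the lower offset, and is read first by any consumer.
producer.send(new ProducerRecord<>("my-topic", "user-42", "M1"));
producer.send(new ProducerRecord<>("my-topic", "user-42", "M2"));
```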