Apache Kafka Interview Questions

What is Apache Kafka?

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log: a high-throughput, distributed messaging system. Kafka is a general-purpose publish-subscribe messaging system that offers strong durability, scalability, and fault-tolerance. It is not specifically designed for Hadoop; the Hadoop ecosystem is just one of its possible consumers.

Kafka Vs Flume
Compared to Flume, Kafka wins on its superb scalability and message durability.

Kafka is very scalable. One of the key benefits of Kafka is that it is very easy to add a large number of consumers without affecting performance and without downtime. That’s because Kafka does not track which messages in a topic have been consumed by consumers. It simply keeps all messages in the topic for a configurable retention period. It is the consumers’ responsibility to track their own position through an offset.

In contrast, adding more consumers to Flume means changing the topology of the Flume pipeline design, replicating the channel to deliver the messages to a new sink. It is not really a scalable solution when you have a huge number of consumers. Also, since the Flume topology needs to be changed, it requires some downtime.

Kafka’s scalability is also demonstrated by its ability to handle spikes in events. This is where Kafka truly shines, because it acts as a “shock absorber” between the producers and consumers. Kafka can handle events coming from producers at rates of 100k+ per second. Because Kafka consumers pull data from the topic, different consumers can consume the messages at different paces. Kafka also supports different consumption models: you can have one consumer processing the messages in real time and another consumer processing the messages in batch mode.
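The offset model described above can be illustrated with a toy simulation (plain Java, no Kafka dependency; the `Broker` and `Consumer` classes here are illustrative stand-ins, not Kafka APIs):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of Kafka's offset-based consumption: the broker retains
// messages and does no per-consumer tracking; each consumer keeps its
// own offset and pulls at its own pace.
class Broker {
    private final List<String> log = new ArrayList<>();
    void publish(String msg) { log.add(msg); }
    // A pull simply reads at the given offset; nothing is deleted.
    String read(int offset) { return offset < log.size() ? log.get(offset) : null; }
}

class Consumer {
    private int offset = 0;            // consumer-side state, not broker-side
    String poll(Broker b) {
        String msg = b.read(offset);
        if (msg != null) offset++;     // the consumer advances its own offset
        return msg;
    }
}

public class OffsetDemo {
    public static void main(String[] args) {
        Broker broker = new Broker();
        broker.publish("m1");
        broker.publish("m2");

        Consumer fast = new Consumer();
        fast.poll(broker);             // fast consumer reads m1
        fast.poll(broker);             // ...and m2

        // A consumer added later starts at offset 0 and still sees every
        // retained message; the fast consumer is unaffected.
        Consumer late = new Consumer();
        System.out.println(late.poll(broker)); // m1
    }
}
```

Because the broker keeps no per-consumer state, adding the second consumer required no change to the broker or to the first consumer, which is the scalability property described above.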

Bolt:- Bolts represent the processing logic units in Storm. One can utilize bolts to do any kind of processing, such as filtering, aggregating, joining, interacting with data stores, talking to external systems, etc. Bolts can also emit tuples (data messages) for subsequent bolts to process. Additionally, bolts are responsible for acknowledging tuples once they are done processing them.

Spout:- Spouts represent the source of data in Storm. You can write spouts to read data from data sources such as databases, distributed file systems, messaging frameworks, etc. Spouts can broadly be classified as follows –

-Reliable – These spouts have the capability to replay tuples (a tuple is a unit of data in the data stream). This helps applications achieve ‘at least once’ message processing semantics, since in case of failure the tuples can be replayed and processed again. Spouts that fetch data from messaging frameworks are generally reliable, as these frameworks provide a mechanism to replay messages.

-Unreliable – These spouts do not have the capability to replay tuples. Once a tuple is emitted, it cannot be replayed, irrespective of whether it was processed successfully or not. This type of spout follows ‘at most once’ message processing semantics.
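The replay behaviour of a reliable spout can be sketched as follows (a simplified model with hypothetical names, not the real Storm ISpout API): emitted tuples stay pending until acknowledged, and a failure puts the tuple back for re-emission, giving at-least-once semantics.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sketch of a "reliable" source: tuples remain pending after
// emission; ack() discards them for good, fail() re-queues them for
// replay (at-least-once processing).
class ReliableSource {
    private final Deque<String> queue = new ArrayDeque<>();
    private final Deque<String> pending = new ArrayDeque<>();

    void offer(String tuple) { queue.add(tuple); }

    String nextTuple() {
        String t = queue.poll();
        if (t != null) pending.add(t);   // remember until acknowledged
        return t;
    }

    void ack(String tuple) { pending.remove(tuple); }      // done for good
    void fail(String tuple) {                              // replay it
        if (pending.remove(tuple)) queue.addFirst(tuple);
    }
}

public class ReliableSpoutDemo {
    public static void main(String[] args) {
        ReliableSource s = new ReliableSource();
        s.offer("event-1");
        String t = s.nextTuple();          // emitted once
        s.fail(t);                         // downstream processing failed
        System.out.println(s.nextTuple()); // event-1 again: replayed
    }
}
```

An unreliable spout, by contrast, would simply drop the pending bookkeeping: once `nextTuple()` returns, the tuple is gone regardless of the outcome.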

Tuple:- The tuple is the main data structure in Storm. A tuple is a named list of values, where each value can be of any type. Tuples are dynamically typed; the types of the fields do not need to be declared. Tuples have helper methods like getInteger and getString to get field values without having to cast the result. Storm needs to know how to serialize all the values in a tuple. By default, Storm knows how to serialize the primitive types, strings, and byte arrays. If you want to use another type, you’ll need to implement and register a serializer for that type.
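As a rough sketch, a tuple can be modelled as a named list of values with typed helper getters (this `SimpleTuple` class is an illustration only, not Storm’s actual Tuple implementation):

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of Storm's dynamically typed tuple: a named list of
// values plus typed helper getters, so callers need no explicit casts.
class SimpleTuple {
    private final List<String> fields;
    private final List<Object> values;

    SimpleTuple(List<String> fields, List<Object> values) {
        this.fields = fields;
        this.values = values;
    }

    Object getValue(String field)    { return values.get(fields.indexOf(field)); }
    Integer getInteger(String field) { return (Integer) getValue(field); }
    String getString(String field)   { return (String) getValue(field); }
}

public class TupleDemo {
    public static void main(String[] args) {
        SimpleTuple t = new SimpleTuple(
            Arrays.asList("word", "count"),
            Arrays.<Object>asList("kafka", 3));
        // Typed getters hide the cast from the caller.
        System.out.println(t.getString("word") + " " + t.getInteger("count")); // kafka 3
    }
}
```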

What are the key benefits of using Storm for Real Time Processing?

Easy to operate: Operating Storm is quite easy

Real fast: It can process up to a million tuples per second per node

Fault tolerant: It detects faults automatically and restarts the failed workers

Reliable: It guarantees that each unit of data will be processed at least once or exactly once

Scalable: It runs across a cluster of machines

Does Apache act as a Proxy server?

Yes, it acts as a proxy as well, using the mod_proxy module. This module implements a proxy, gateway, or cache for Apache. It implements proxying capability for AJP13 (Apache JServ Protocol version 1.3), FTP, CONNECT (for SSL), HTTP/0.9, HTTP/1.0, and (since Apache 1.3.23) HTTP/1.1. The module can be configured to connect to other proxy modules for these and other protocols.

While installing, why does Apache have three config files – srm.conf, access.conf and httpd.conf?

The first two are remnants from the NCSA times; generally you should be fine if you delete the first two and stick with httpd.conf.

What is ZeroMQ?

ZeroMQ is “a library which extends the standard socket interfaces with features traditionally provided by specialized messaging middleware products”. Storm originally relied on ZeroMQ primarily for task-to-task communication in running Storm topologies; later releases (0.9 onwards) offer Netty as the transport instead.

How many distinct layers are there in Storm’s codebase?

There are three distinct layers to Storm’s codebase.

-First, Storm was designed from the very beginning to be compatible with multiple languages. Nimbus is a Thrift service and topologies are defined as Thrift structures. The usage of Thrift allows Storm to be used from any language.

-Second, all of Storm’s interfaces are specified as Java interfaces. So even though there’s a lot of Clojure in Storm’s implementation, all usage must go through the Java API. This means that every feature of Storm is always available via Java.

-Third, Storm’s implementation is largely in Clojure. Line-wise, Storm is about half Java code, half Clojure code. But Clojure is much more expressive, so in reality the great majority of the implementation logic is in Clojure.

When do you call the cleanup method?

The cleanup method is called when a Bolt is being shut down and should clean up any resources that were opened. There’s no guarantee that this method will be called on the cluster: for instance, if the machine the task is running on blows up, there’s no way to invoke the method. The cleanup method is intended for when you run topologies in local mode (where a Storm cluster is simulated in-process), and you want to be able to run and kill many topologies without suffering any resource leaks.

How can we kill a topology?

To kill a topology, simply run:

storm kill {stormname}

Pass the same name to storm kill as you used when submitting the topology. Storm won’t kill the topology immediately. Instead, it deactivates all the spouts so that they don’t emit any more tuples, and then waits Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS seconds before destroying all the workers. This gives the topology enough time to finish processing any tuples it was working on when it was killed.

What is a CombinerAggregator?

A CombinerAggregator is used to combine a set of tuples into a single field. It has the following signature:

public interface CombinerAggregator<T> extends Serializable {
    T init(TridentTuple tuple);
    T combine(T val1, T val2);
    T zero();
}

Storm calls the init() method with each input tuple, and then repeatedly calls the combine() method until the partition is fully processed. The values passed into the combine() method are partial aggregations, the result of combining the values returned by calls to init().
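A minimal concrete example is the classic Count aggregator. The interface is stubbed inline here (with Object standing in for TridentTuple) so the sketch compiles without Storm on the classpath; in a real topology you would implement storm.trident.operation.CombinerAggregator instead.

```java
// Stub of Trident's CombinerAggregator, included only so this example
// is self-contained; the real interface takes a TridentTuple in init().
interface CombinerAggregator<T> {
    T init(Object tuple);
    T combine(T val1, T val2);
    T zero();
}

// Count: init() maps every tuple to 1, combine() sums the partial
// aggregations, and zero() supplies the identity for empty partitions.
class Count implements CombinerAggregator<Long> {
    public Long init(Object tuple)             { return 1L; }
    public Long combine(Long val1, Long val2)  { return val1 + val2; }
    public Long zero()                         { return 0L; }
}

public class CountDemo {
    public static void main(String[] args) {
        Count count = new Count();
        // Fold three tuples the way Storm would within one partition.
        Long total = count.zero();
        for (Object tuple : new Object[]{"a", "b", "c"}) {
            total = count.combine(total, count.init(tuple));
        }
        System.out.println(total); // 3
    }
}
```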

Is it necessary to kill the topology while updating the running topology?

Yes, to update a running topology, the only option currently is to kill the current topology and resubmit a new one. A planned feature is to implement a Storm swap command that swaps a running topology with a new one, ensuring minimal downtime and no chance of both topologies processing tuples at the same time.

Explain how to write output to a file using Storm?

In a spout, when you are reading a file, create the FileReader object in the open() method, because open() runs when the task is initialized on the worker node; then use that reader object in the nextTuple() method. Writing output works the same way: a bolt should open its file writer in prepare() (which also runs on the worker node) and write in execute().

Mention what is the difference between Apache Kafka and Apache Storm?

Apache Kafka: It is a distributed and robust messaging system that can handle huge amounts of data and allows passing messages from one end-point to another.

Apache Storm: It is a real-time message processing system in which you can edit or manipulate data in real time. Apache Storm can pull the data from Kafka and apply the required manipulation.

Explain: when using fields grouping in Storm, is there any time-out or limit to known field values?

Fields grouping in Storm uses a mod-hash function to decide which task a tuple is sent to, ensuring that tuples with the same field value are always processed by the same task. Because this is a pure function of the field value, no cache of known values is required, so there is no time-out or limit on known field values.
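The mod-hash routing can be sketched in a few lines (a hypothetical helper for illustration, not Storm’s internal code):

```java
// Sketch of fields-grouping routing: the target task is chosen by
// hashing the grouping field and taking the result modulo the task
// count, so a given field value always lands on the same task and no
// lookup table or cache is needed.
public class FieldsGroupingDemo {
    static int taskFor(Object fieldValue, int numTasks) {
        // floorMod keeps the index in [0, numTasks) even for negative hashes
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int tasks = 4;
        // The same word is always routed to the same task index.
        System.out.println(taskFor("kafka", tasks) == taskFor("kafka", tasks)); // true
    }
}
```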

In which folder are Java Applications stored in Apache?

Java applications are not stored in Apache; Apache can only be connected to another webserver hosting the Java webapp (such as Tomcat) using the mod_jk connector.

What is mod_vhost_alias?

This module creates dynamically configured virtual hosts, by allowing the IP address and/or the Host: header of the HTTP request to be used as part of the pathname to determine what files to serve. This allows for easy use of a huge number of virtual hosts with similar configurations.

What is Struts and what is its purpose?

Struts is an open-source framework for creating Java web applications.

Is running Apache as root a security risk?

No. The root process opens port 80, but never listens on it, so no user will actually enter the site with root rights. If you kill the root process, you will see the other child processes disappear as well.

 
