Archive for the ‘Cluster’ Category

Apache Kafka – Introduction & Installation

Apache Kafka is a distributed streaming platform. What exactly does that mean, and why Kafka?

Most traditional messaging systems don’t scale to handle big data in real time. So engineers at LinkedIn built and open-sourced Kafka: a distributed messaging framework that meets the demands of big data by scaling on commodity hardware.

In this post we will start with how to install and run Apache Kafka in your development environment.

What is Kafka?

Apache Kafka is a messaging system built to scale for big data. Similar to Apache ActiveMQ or RabbitMQ, Kafka enables applications built on different platforms to communicate via asynchronous message passing. But Kafka differs from these more traditional messaging systems in key ways:

  • It’s designed to scale horizontally, by adding more commodity servers.
  • It provides much higher throughput for both producer and consumer processes.
  • It can be used to support both batch and real-time use cases.
  • It doesn’t support JMS, Java’s message-oriented middleware API.

Kafka’s basic terminology:

  • A producer is a process that can publish messages to a topic.
  • A consumer is a process that can subscribe to one or more topics and consume the messages published to them.
  • A topic is the name of the feed (or category) to which messages are published.
  • Each record consists of a key, a value, and a timestamp.
  • A broker is a process running on a single machine.
  • A cluster is a group of brokers working together.
  • Kafka is run as a cluster on one or more servers.
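
To make the terminology concrete, here is a toy in-memory model in Python. It is only a sketch and not the Kafka client API: the names `Record`, `Topic`, `publish` and `poll` are made up for illustration, and a real topic is stored on brokers rather than in a Python list.

```python
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """A record as described above: key, value, and timestamp."""
    key: str
    value: str
    timestamp: float


@dataclass
class Topic:
    """Hypothetical stand-in for a topic: a named, append-only list of records."""
    name: str
    log: list = field(default_factory=list)
    offsets: dict = field(default_factory=dict)  # consumer name -> next index to read

    def publish(self, key, value):
        """Producer side: append a record to the end of the log."""
        self.log.append(Record(key, value, time.time()))

    def poll(self, consumer):
        """Consumer side: return the records this consumer has not seen yet."""
        start = self.offsets.get(consumer, 0)
        records = self.log[start:]
        self.offsets[consumer] = len(self.log)
        return records


topic = Topic("javainsider")
topic.publish("user-1", "hello")
topic.publish("user-2", "world")

# In this toy model each consumer tracks its own position in the log,
# so two independent consumers both see every message.
first = [r.value for r in topic.poll("consumer-a")]
second = [r.value for r in topic.poll("consumer-b")]
again = topic.poll("consumer-a")  # nothing new since the last poll
```

The point of the sketch is that publishing only appends to the log, while each consumer keeps its own read position.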

Kafka’s APIs:

  • The Producer API allows an application to publish a stream of records to one or more Kafka topics.
  • The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
  • The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
  • The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.

More details about Apache Kafka: https://kafka.apache.org/intro.

Kafka Architecture:

[Figure: Kafka cluster architecture]

Kafka Installation:

Before installing Apache Kafka, you need to install Apache ZooKeeper, a service that Kafka requires for maintaining its configuration information and for providing distributed synchronization.

Download Apache ZooKeeper from the project’s download page.

Extract zookeeper-3.4.10.tar.gz to a local drive, e.g. C:\apache\zookeeper-3.4.10\

  • Once you have extracted ZooKeeper, locate the conf folder, e.g. C:\apache\zookeeper-3.4.10\conf
  • Rename “zoo_sample.cfg” to “zoo.cfg” inside the conf folder.
  • Create a data directory for ZooKeeper on your local drive, e.g. C:\zk_data
  • Set “dataDir=C:/zk_data” in the “zoo.cfg” file.
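
The rename-and-edit steps above can also be scripted. This is a minimal Python sketch, assuming the example paths used in this post; the helper name `write_zoo_cfg` is made up for illustration. Note that it copies the sample file instead of renaming it, so the original stays in place.

```python
from pathlib import Path


def write_zoo_cfg(conf_dir, data_dir):
    """Create zoo.cfg from zoo_sample.cfg, pointing dataDir at data_dir."""
    conf = Path(conf_dir)
    lines = (conf / "zoo_sample.cfg").read_text().splitlines()

    # Replace the existing dataDir line, or append one if the sample has none.
    new_line = f"dataDir={Path(data_dir).as_posix()}"
    found = False
    for i, line in enumerate(lines):
        if line.strip().startswith("dataDir="):
            lines[i] = new_line
            found = True
    if not found:
        lines.append(new_line)

    # Make sure the data directory exists before ZooKeeper starts.
    Path(data_dir).mkdir(parents=True, exist_ok=True)
    target = conf / "zoo.cfg"
    target.write_text("\n".join(lines) + "\n")
    return target


# Example call with the paths from this post (adjust to your install location):
# write_zoo_cfg(r"C:\apache\zookeeper-3.4.10\conf", r"C:\zk_data")
```

On Windows, `as_posix()` writes the path with forward slashes, which matches the “dataDir=C:/zk_data” form used above.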

Download Apache Kafka from the project’s download page.

Extract “kafka_2.10-0.10.2.1.tgz” to a local drive, e.g. C:\apache\kafka-2.10\

  • After extracting, you will find the server.properties file inside the config folder; in my case it’s: C:\apache\kafka-2.10\config\

Let’s now start ZooKeeper, followed by Kafka:

  • Start the Zookeeper server by executing the command:

       C:\apache\zookeeper-3.4.10\bin>zkServer.cmd            

  • Start the Apache Kafka server by executing the command:

  C:\apache\kafka-2.10\bin\windows>kafka-server-start.bat C:/apache/kafka-2.10/config/server.properties 

  • Create a topic named “javainsider” that you can use for testing:

         C:\apache\kafka-2.10\bin\windows>kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic javainsider

  • Start a simple console consumer that can consume messages published to a given topic, such as “javainsider”:

    C:\apache\kafka-2.10\bin\windows>kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic javainsider --from-beginning    

  • Start a simple console producer that can publish messages to the test topic:

     C:\apache\kafka-2.10\bin\windows>kafka-console-producer.bat --broker-list localhost:9092 --topic javainsider   

  • Try typing one or two messages into the producer console. Your messages should show in the consumer console.
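
If the messages do not appear, a first thing to check is whether both servers are actually listening. Here is a small Python sketch, assuming the ports used in the commands above (2181 for ZooKeeper, 9092 for Kafka); the helper names `is_listening` and `report` are made up for illustration. An open port only shows that something accepted the connection, not that the broker is healthy.

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def report(services, host="localhost"):
    """Map each service name to whether its port accepts connections."""
    return {name: is_listening(host, port) for name, port in services.items()}


# Ports taken from the commands in this post.
status = report({"ZooKeeper": 2181, "Kafka": 9092})
for name, up in status.items():
    print(f"{name}: {'listening' if up else 'not reachable'}")
```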

Congratulations! You have installed Apache Kafka and tested a Kafka instance with the out-of-the-box console producer and consumer.

In my next blog post I will explain how to use Apache Kafka from Java programs. Happy learning 🙂

Clustering vs. Load Balancing – What is the difference?

The two terms, clustering and load balancing, are used interchangeably by a majority of IT people with relative impunity.

Clustering has a formal meaning. A cluster is a group of resources that are trying to achieve a common objective and are aware of one another. Clustering usually involves setting up the resources (usually servers) to keep exchanging their states over a particular channel (port), so that a resource’s state is replicated in other places as well. It usually also includes load balancing, in which each request is routed to one of the resources in the cluster according to the load balancing policy.

Load balancing can also happen without clustering, when we have multiple independent servers that have the same setup but are otherwise unaware of each other. We can then use a load balancer to forward requests to one server or another, but one server does not use another server’s resources, and no server shares its state with the others.

Each load balancer basically does the following tasks:

  1. Continuously check which servers are up.
  2. When a new request is received, send it to one of the servers as per the load balancing policy.
  3. When a request is received for a user who already has a session, send the user to the *same* server. This part is important, as otherwise the user would keep bouncing between different servers without being able to do any real work. It is not required for serving static pages, since in that case there are no user sessions.
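
The three tasks can be sketched in a few lines of Python. This is a toy model, not how any particular load balancer product works: the class and method names are made up for illustration, round-robin is just one possible load balancing policy, and the health check is a function you pass in rather than a real network probe.

```python
import itertools


class LoadBalancer:
    """Toy round-robin balancer with health checks and sticky sessions."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self.sessions = {}  # session id -> server that owns the session
        self._cycle = itertools.cycle(self.servers)

    def check_health(self, is_up):
        """Task 1: record which servers are currently up."""
        self.healthy = {s for s in self.servers if is_up(s)}

    def route(self, session_id=None):
        """Tasks 2 and 3: choose a server for a request."""
        # Task 3: keep an existing session on its server while that server is up.
        owner = self.sessions.get(session_id)
        if owner in self.healthy:
            return owner
        if not self.healthy:
            raise RuntimeError("no healthy servers")
        # Task 2: round-robin over the servers, skipping any that are down.
        for server in self._cycle:
            if server in self.healthy:
                if session_id is not None:
                    self.sessions[session_id] = server
                return server


lb = LoadBalancer(["app1", "app2", "app3"])
a = lb.route("alice")   # first request: picked by round-robin
b = lb.route("bob")
a2 = lb.route("alice")  # same session: same server as before

lb.check_health(lambda s: s != a)  # simulate alice's server going down
a3 = lb.route("alice")             # re-routed to a server that is still up
```

In this sketch the balancer remembers only *where* a session lives, not what is in it, so after the re-route the new server knows nothing about the user.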

What does it mean from a user’s perspective? Which one is better?
Every time someone asks a generic “which one is better?” question, the answer is invariably “It depends”. This isn’t political equivocation, but simply a restatement of the fact that if one were better than the other in all circumstances, the other wouldn’t exist.

Clustering saves the user’s state and is more transparent to the user, but it is harder to set up and is very resource specific. Different application servers have different clustering protocols, and they don’t necessarily work out of the box (don’t believe any of that marketing fluff). Load balancing is comparatively painless and relatively independent of the application servers.

From a user’s perspective, it means that if the user is doing something in the application and that server goes down, the user observes different behavior depending on whether the system is clustered or only load balanced. If the system is clustered, the user may be able to continue the transaction and may not even realize that the server has gone down. If the system is load balanced without clustering, the user’s state will likely be lost, and the user will simply be sent to another server to restart the transaction. The user has lost some work.
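
The difference can be illustrated with a toy Python model. Everything here is an assumption made for the sake of the example: the `Server` class and the helper functions are hypothetical, and real application servers replicate session state over the network using their own protocols, not by writing into each other’s dictionaries.

```python
class Server:
    """Toy application server holding per-user session state in memory."""

    def __init__(self, name):
        self.name = name
        self.sessions = {}
        self.peers = []  # other servers that receive copies of our state

    def update(self, user, state):
        """Store a user's state and, if clustered, copy it to every peer."""
        self.sessions[user] = state
        for peer in self.peers:
            peer.sessions[user] = state


def make_servers(clustered):
    """Build two servers; in a cluster each one knows about the other."""
    s1, s2 = Server("s1"), Server("s2")
    if clustered:
        s1.peers, s2.peers = [s2], [s1]
    return s1, s2


def state_after_failover(clustered):
    """The user works on s1, s1 goes down, and the next request lands on s2."""
    s1, s2 = make_servers(clustered)
    s1.update("alice", {"cart": ["book"]})
    # s1 is now considered down; return whatever s2 knows about the user.
    return s2.sessions.get("alice")


clustered_state = state_after_failover(clustered=True)       # state survived on s2
balanced_only_state = state_after_failover(clustered=False)  # s2 has never seen alice
```

With clustering, the second server already holds a copy of the cart; with load balancing alone it has nothing, which is the lost work described above.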

Categories: Cluster, Load Balancing