JMS vs Apache Bahir: Choosing library for Apache Spark (Structured Streaming) ActiveMQ Connector

jms Jan 19, 2021

What is ActiveMQ?

Apache ActiveMQ™ is the most popular open-source, multi-protocol, Java-based messaging server. It supports industry-standard protocols so users get the benefits of client choices across a broad range of languages and platforms [https://activemq.apache.org/].

In simple terms ActiveMQ is an open-source protocol that functions as an implementation of ‘message-oriented-middleware’ (MOM), it’s basic function is to send messages between different applications. ActiveMQ translates messages from sender to receiver, it can connect multiple clients and allows messages to be held in a queue, instead of requiring both the client and server to be available simultaneously in order to communicate, messaging can still happen even if one application is temporarily unavailable.

Features offered by ActiveMQ.

  1. It provides message load-balancing and high availability for the data
  2. Multiple connected "master" brokers can dynamically respond to consumer demand by moving messages between the nodes in the background.
  3. Brokers can also be paired together in a master-slave configuration so that if a master fails then the slave takes over ensuring clients can get to their important data and eliminating costly downtime.
  4. Provides Asynchronous messaging feature.
  5. ActiveMQ has great scheduler support, which means you can schedule sending your message to be delivered at a particular time.

So, How do we read data from ActiveMQ?


While using ActiveMQ, we are provided with a broker which stores the : queues, topics and messages. But in order to consume information from the broker a client is required, for the context of this blog we will consider a spark-streaming application as our client.

Now to use a “spark-streaming”[more info here] we need to specify a `format` in its readStream method. This is where Bahir and JMS come into play. We can pass the Bahir package or our JMS package as a source for the streaming application in the format function. This makes the spark application read the received input in accordance to the format specified by the developer.

Sample code to demonstrate spark connection

What is Bahir?

Apache Bahir provides extensions to multiple distributed analytic platforms, extending their reach with a diversity of streaming connectors and SQL data sources. Currently, Bahir provides extensions for Apache Spark and Apache Flink. [https://bahir.apache.org/]

Apache Bahir strengthens Big Data processing by serving as a home for existing connectors that were initiated under Apache Spark, as well as provide additional extensions/plugins for other related distributed systems, storage, and query execution systems.

What is JMS?

JMS (Java Message Service) is an API that provides the facility to create, send and read messages. It provides loosely coupled, reliable and asynchronous communication. The message is a piece of information. It can be a text, XML document, JSON data or an Entity (Java Object), etc.

JMS API is a Java API which contains a common set of interfaces to implement enterprise based messaging systems. JMS API is used to implement Messaging systems in Java-based applications only, it does not support other languages.

A typical JMS System contains the following components:

  1. JMS Client: Java program used to send (or produce or publish) or receive (or consume or subscribe) messages.
  2. JMS Sender: JMS Client which is used to send messages to the destination system. JMS sender is also known as JMS Producer or JMS Publisher.
  3. JMS Receiver: JMS Client which is used to receive messages from the Source system. JMS Receiver is also known as JMS Consumer or JMS Subscriber.
  4. JMS Provider: JMS API is a set of common interfaces, which does not contain any implementation. JMS Provider is a third-party system who is responsible to implement the JMS API to provide messaging features to the clients. JMS Provider is also known as MOM (Message Oriented Middleware) software or Message Broker or JMS Server or Messaging Server. JMS Provider also provides some UI components to administrate and control this MOM software. Example: ActiveMQ, RabbitMQ, HornetQ etc.
  5. ConnectionFactory: ConnectionFactory object is used to create a connection between Java Application and JMS Provider. It is used by Application to communicate with JMS Provider.
  6. Destination: Destinations are also JMS Objects used by a JMS Client to specify the destination of messages it is sending and the source of messages it receives. There are two types of Destinations: Queue and Topic.(vii) JMS Message: an object that contains the data being transferred between JMS clients.

Comparing JMS with Bahir

Bahir has the honor of being an `apache` project, and it comes pre-baked with tons of features including support for apache-spark and apache-flink with (i) streaming-MQTT (ii) streaming-twitter (iii) Akka. Apart from this Bahir provides all the necessary features to establish a client connection to the broker i.e the user is not required to write a single line of extra code to establish the connection. On the other hand, unlike Bahir, JMS is not a  ‘pre-built-module’ . To establish a connection from a JMS client to a broker, the developer is required to code everything from scratch. The silver lining being that the developer gets full-access over the client-side code.

So what are the advantages of writing everything from scratch in case of JMS?

Though writing everything from scratch can sometimes be a hectic job considering the fact that there is already a pre-built module `Bahir` out there that serves the same purpose, but the overall advantage of not using Bahir and going for JMS dominates the whole conversation. Although the choice between JMS and Bahir solely depends upon the use case of the client application.

One of the main advantages of using JMS is ensuring fail-safe option in case of client-side application failure, the way an ActiveMQ broker works is that it pushes messages to the client subscribed to the topic/queue. The client then receives the message and acknowledges the broker for that specific message, the broker after receiving acknowledgment from the client for a particular message dequeues that specific message. Now there are three different modes of acknowledgment, out of which AUTO_ACKNOWLEDGMENT and CLIENT_ACKNOWLEDGMENT hold our interest. In AUTO_ACKNOWLEDGMENT the message acknowledgment receipt is sent back to the broker for a message as soon as the client receives the message, Bahir uses this “kind of” acknowledgment mode. On the other hand CLIENT_ACKNOWLEDGMENT gives the developer the power to acknowledge a message at his own convenience i.e. the developer can acknowledge the message after all the processing on the message is completed, this assures that if the application fails in between the processing, the message will still remain enqueued in the broker, and can be delivered to the client once it restarts. This can be implemented with the help of a JMS Client.

Moreover using JMS on the client side, equips us with the ability to restart the application from the point it failed with the help of checkpointing. [Note: Although Bahir claims to provide similar functionality with the use of the persistence option, but I was not able to make it work, moreover they have not provided an example of using this option in the official git-repo [https://github.com/apache/bahir].] As stated earlier “The silver lining being that the developer gets full-access over the client-side code”, JMS allows the developer to specify the readInterval, this means that the developer has access over the waiting-time after each record-fetch.

Conclusion

Thus we can conclude that, although writing a custom JMS client for the streaming application is a tedious task, but the overall flexibility we get is much more than a pre-built module like Bahir.




Tags

Ayush

Among with Sanjay

Tookitaki