Real-time Streaming Architectures

spark Apr 16, 2021

With advances in technology and the ability to process Big Data, there has been a lot of emphasis on near real-time analytics. A wide range of businesses are trying to derive real-time insights and use them to optimise their operations.

Where does Real-time data processing come into the picture?

Real-time data processing comes into the picture in systems that follow a decoupled architecture. In monoliths such as IBM Mainframe systems, an upstream system pushes data files to downstream analytical systems at regular intervals. Decoupled systems instead use message queues or pub-sub messaging systems such as Kafka: upstream systems asynchronously push messages into queues or topics, and downstream systems consume them as they become available.
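The decoupling described above can be illustrated with a minimal sketch. It assumes an in-memory `queue.Queue` as a stand-in for a Kafka topic; the names `topic`, `upstream`, `downstream` and the `amount` field are hypothetical and chosen only for illustration.

```python
import queue
import threading

# Hypothetical in-memory stand-in for a Kafka topic (assumption, not a real broker).
topic = queue.Queue()
END_OF_STREAM = None  # sentinel used only so this toy example can terminate


def upstream(events):
    """Producer: pushes messages to the topic without waiting for a consumer."""
    for event in events:
        topic.put(event)
    topic.put(END_OF_STREAM)


def downstream(results):
    """Consumer: pulls messages whenever they become available."""
    while True:
        event = topic.get()  # blocks until a message is available
        if event is END_OF_STREAM:
            break
        results.append(event["amount"] * 2)


results = []
consumer = threading.Thread(target=downstream, args=(results,))
consumer.start()
upstream([{"amount": 1}, {"amount": 2}, {"amount": 3}])
consumer.join()
print(results)  # [2, 4, 6]
```

The producer never calls the consumer directly; the two sides only share the topic, which is the property that lets each side run and scale at its own pace.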

A streaming architecture is a defined set of technologies that work together to handle ‘stream processing’. Some streaming architectures include workflows for both batch processing and stream processing. Let’s briefly discuss a few of the streaming architectures.

Lambda Architecture:

Lambda architecture is designed to handle large quantities of data by taking advantage of both batch and stream-processing methods. It attempts to balance latency, throughput, and fault tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation.

[Figure: Lambda Architecture with Apache Spark. Source: DZone Big Data]

Lambda architecture consists of three layers:

  • batch layer
  • real-time processing (speed) layer
  • serving layer

Batch Layer:

The batch layer processes data using distributed processing frameworks. It aims to create a highly accurate data set by recomputing from the complete data set where required and then updating the existing views.

Speed Layer:

The speed layer processes data in real time and aims to reduce latency. It is responsible for filling the gap caused by the batch layer's lag in providing views based on the most recent data.

Serving Layer:

Output from the batch and speed layers is stored in the serving layer, which responds to ad-hoc queries by returning precomputed views or building views from the processed data.
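As a rough illustration of how the three layers fit together, the sketch below counts page views. It is a toy, single-process model under stated assumptions: the function and class names (`batch_view`, `SpeedLayer`, `serve`) and the `page` field are hypothetical, and a real deployment would run each layer on a distributed framework.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute the view from the complete data set."""
    return Counter(event["page"] for event in master_dataset)


class SpeedLayer:
    """Speed layer: incrementally updates a view for events the batch
    layer has not processed yet."""

    def __init__(self):
        self.view = Counter()

    def on_event(self, event):
        self.view[event["page"]] += 1


def serve(page, batch, speed):
    """Serving layer: answer a query by joining the batch and real-time views."""
    return batch[page] + speed[page]


# Events already covered by the last batch run
master_dataset = [{"page": "home"}, {"page": "home"}, {"page": "pricing"}]
batch = batch_view(master_dataset)

# Events that arrived after the last batch run
speed_layer = SpeedLayer()
for event in [{"page": "home"}, {"page": "docs"}]:
    speed_layer.on_event(event)

print(serve("home", batch, speed_layer.view))  # 3
print(serve("docs", batch, speed_layer.view))  # 1
```

Note that the counting logic appears twice, once in the batch function and once in the speed layer, which hints at the maintenance cost of keeping two code paths consistent.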

Pros:

Benefits of Lambda Architecture include high availability, flexibility in scaling, and business agility. The batch pipeline can also be used to enrich the output of the streaming layer.

Cons:

The main challenge with Lambda is complexity. Separate codebases need to be maintained for the batch and streaming processes, which makes it difficult to keep them in sync.

Kappa Architecture:

Kappa Architecture proposes the use of streams for both batch and stream processing. It is essentially Lambda Architecture without the separate batch layer: historical data is reprocessed through the same streaming pipeline. This allows the same technology stack and code base to be used for both batch processing and real-time processing.

[Figure: Applying the Kappa architecture in the telco industry. Source: O'Reilly]

Kappa architecture revolutionises database migrations: you can delete the tables in the serving layer and recreate them from scratch by replaying the messages stored in Kafka through the same code. Because historical data is reprocessed through the streaming path, Kappa architecture requires a fast distributed stream processing framework.
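The replay idea can be sketched as follows. This is a minimal illustration, assuming a plain Python list as a stand-in for a retained Kafka topic; `process`, `rebuild`, and the `user` and `amount` fields are hypothetical names, not part of any Kafka API.

```python
def process(view, event):
    """Single code path used for both live processing and reprocessing."""
    view[event["user"]] = view.get(event["user"], 0) + event["amount"]
    return view


def rebuild(log):
    """Recreate the serving table from scratch by replaying the whole log."""
    view = {}
    for event in log:  # replay from the first retained message
        process(view, event)
    return view


# Hypothetical append-only log standing in for a retained Kafka topic
log = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": 7},
]

# Live path: the serving table is kept up to date as events arrive
live_table = rebuild(log)
new_event = {"user": "b", "amount": 1}
log.append(new_event)
process(live_table, new_event)

# Migration path: discard the table and rebuild it with the same code
rebuilt_table = rebuild(log)
print(rebuilt_table)                # {'a': 17, 'b': 6}
print(rebuilt_table == live_table)  # True
```

The sketch relies on the log retaining every message needed for the rebuild; if older messages have been discarded, the replayed table would not match the live one.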

Conclusion:

Rather than competing with each other, these two architectures can be seen as complementary. Organisations can use a combination of the two, depending on their requirements.

Tags

Sanjay

Tookitaki