By Roshan Kumar, InfoWorld |

About |

Emerging tech dissected by technologists

How to use Redis for real-time stream processing

An in-depth guide to overcoming fast data ingestion challenges with Redis Pub/Sub, Redis Lists, and Redis Sorted Sets

Real-time streaming data ingest is a common requirement for many big data use cases. In fields like IoT, e-commerce, security, communications, entertainment, finance, and retail, where so much depends on timely and accurate data-driven decision making, real-time data collection and analysis are in fact core to the business.

However, collecting, storing and processing streaming data in large volumes and at high velocity presents architectural challenges. An important first step in delivering real-time data analysis is ensuring that adequate network, compute, storage, and memory resources are available to capture fast data streams. But a company’s software stack must match the performance of its physical infrastructure. Otherwise, businesses will face a massive backlog of data, or worse, missing or incomplete data.

Redis has become a popular choice for such fast data ingest scenarios. A lightweight in-memory database platform, Redis achieves throughput in the millions of operations per second with sub-millisecond latencies, while drawing on minimal resources. It also offers simple implementations, enabled by its multiple data structures and functions.

In this article, I will show how Redis Enterprise can solve common challenges associated with the ingestion and processing of large volumes of high velocity data. We’ll walk through three different approaches (including code) to processing a Twitter feed in real time, using Redis Pub/Sub, Redis Lists, and Redis Sorted Sets, respectively. As we’ll see, all three methods have a role to play in fast data ingestion, depending on the use case.

Challenges in designing fast data ingest solutions

High-speed data ingestion often involves several different types of complexity:

Large volumes of data sometimes arriving in bursts. Bursty data requires a solution that is capable of processing large volumes of data with minimal latency. Ideally, it should be able to perform millions of writes per second with sub-millisecond latency, using minimal resources.

Data from multiple sources. Data ingest solutions must be flexible enough to handle data in many different formats, retaining source identity if needed and transforming or normalizing in real-time.

Data that needs to be filtered, analyzed, or forwarded. Most data ingest solutions have one or more subscribers who consume the data. These are often different applications that function in the same or different locations with a varied set of assumptions. In such cases, the database not only needs to transform the data, but also filter or aggregate depending on the requirements of the consuming applications.

Data coming from geographically distributed sources. In this scenario, it is often convenient to distribute the data collection nodes, placing them close to the sources. The nodes themselves become part of the fast data ingest solution, to collect, process, forward, or reroute ingest data.

Handling fast data ingest in Redis

Many solutions supporting fast data ingest today are complex, feature-rich, and over-engineered for simple requirements. Redis, on the other hand, is extremely lightweight, fast, and easy to use. With clients available in more than 60 languages, Redis can be easily integrated with the popular software stacks.

Redis offers data structures such as Lists, Sets, Sorted Sets, and Hashes that offer simple and versatile data processing. Redis delivers more than a million read/write operations per second, with sub-millisecond latency on a modestly sized commodity cloud instance, making it extremely resource-efficient for large volumes of data. Redis also supports messaging services and client libraries in all of the popular programming languages, making it well-suited for combining high-speed data ingest and real-time analytics. Redis Pub/Sub commands allow it to play the role of a message broker between publishers and subscribers, a feature often used to send notifications or messages between distributed data ingest nodes.

Redis Enterprise enhances Redis with seamless scaling, always-on availability, automated deployment, and the ability to use cost-effective flash memory as a RAM extender so that the processing of large datasets can be accomplished cost-effectively.

In the sections below, I will outline how to use Redis Enterprise to address common data ingest challenges.

Redis at the speed of Twitter

To illustrate the simplicity of Redis, we’ll explore a sample fast data ingest solution that gathers messages from a Twitter feed. The goal of this solution is to process tweets in real-time and push them down the pipe as they are processed.

Twitter data ingested by the solution is then consumed by multiple processors down the line. As shown in Figure 1, this example deals with two processors – the English Tweet Processor and the Influencer Processor. Each processor filters the tweets and passes them down its respective channels to other consumers. This chain can go as far as the solution requires. However, in our example, we stop at the third level, where we aggregate popular discussions among English speakers and top influencers.

redis twitter stream — Figure 1. Flow of the Twitter stream

Note that we are using the example of processing Twitter feeds because of the velocity of data arrival and simplicity. Note also that Twitter data reaches our fast data ingest via a single channel. In many cases, such as IoT, there could be multiple data sources sending data to the main receiver.

There are three possible ways to implement this solution using Redis: ingest with Redis Pub/Sub, ingest with the List data structure, or ingest with the Sorted Set data structure. Let’s examine each of these options.

Ingest with Redis Pub/Sub

This is the simplest implementation of fast data ingest. This solution uses Redis’s Pub/Sub feature, which allows applications to publish and subscribe to messages. As shown in Figure 2, each stage processes the data and publishes it to a channel. The subsequent stage subscribes to the channel and receives the messages for further processing or filtering.

Pros

Easy to implement.
Works well when the data sources and processors are distributed geographically.

Cons

The solution requires the publishers and subscribers to be up all the time. Subscribers lose data when stopped, or when the connection is lost.
It requires more connections. A program cannot publish and subscribe to the same connection, so each intermediate data processor requires two connections – one to subscribe and one to publish. If running Redis on a DBaaS platform, it is important to verify whether your package or level of service has any limits to the number of connections.

A note about connections

If more than one client subscribes to a channel, Redis pushes the data to each client linearly, one after the other. Large data payloads and many connections may introduce latency between a publisher and its subscribers. Although the default hard limit for maximum number of connections is 10,000, you must test and benchmark how many connections are appropriate for your payload.

Redis maintains a client output buffer for each client. The default limits for the client output buffer for Pub/Sub are set as:

client-output-buffer-limit pubsub 32mb 8mb 60

With this setting, Redis will force clients to disconnect under two conditions: if the output buffer grows beyond 32MB, or if the output buffer holds 8MB of data consistently for 60 seconds.

These are indications that clients are consuming the data more slowly than it is published. Should such a situation arise, first try optimizing the consumers such that they do not add latency while consuming the data. If you notice that your clients are still getting disconnected, then you may increase the limits for the client-output-buffer-limit pubsub property in redis.conf. Please keep in mind that any changes to the settings may increase latency between the publisher and subscriber. Any changes must be tested and verified thoroughly.

Code design for the Redis Pub/Sub solution

redis pubsub class diagram — Figure 3. Class diagram of the fast data ingest solution with Redis Pub/Sub

This is the simplest of the three solutions described in this paper. Here are the important Java classes implemented for this solution. Download the source code with full implementation here: https://github.com/redislabsdemo/IngestPubSub.

The Subscriber class is the core class of this design. Every Subscriber object maintains a new connection with Redis.

class Subscriber extends JedisPubSub implements Runnable{
       private String name ="Subscriber";
       private RedisConnection conn = null;
       private Jedis jedis = null;

       private String subscriberChannel ="defaultchannel";

       public Subscriber(String subscriberName, String channelName) throws Exception{
              name = subscriberName;
              subscriberChannel = channelName;
              Thread t = new Thread(this);
              t.start();
       }

       @Override
       public void run(){
              try{
                      conn = RedisConnection.getRedisConnection();
                      jedis = conn.getJedis();
                      while(true){
                             jedis.subscribe(this, this.subscriberChannel);
                      }
              }catch(Exception e){
                      e.printStackTrace();
              }
       }

       @Override
       public void onMessage(String channel, String message){
              super.onMessage(channel, message);
       }
}

The Publisher class maintains a separate connection to Redis for publishing messages to a channel.

public class Publisher{

       RedisConnection conn = null;
       Jedis jedis = null;

       private String channel ="defaultchannel";

       public Publisher(String channelName) throws Exception{
              channel = channelName;
              conn = RedisConnection.getRedisConnection();
              jedis = conn.getJedis();
       }

       public void publish(String msg) throws Exception{
              jedis.publish(channel, msg);
       }
}

The EnglishTweetFilter, InfluencerTweetFilter, HashTagCollector, and InfluencerCollector filters extend Subscriber, which enables them to listen to the inbound channels. Since you need separate Redis connections for subscribe and publish, each filter class has its own RedisConnection object. Filters listen to the new messages in their channels in a loop. Here is the sample code of the EnglishTweetFilter class:

public class EnglishTweetFilter extends Subscriber
{

       private RedisConnection conn = null;
       private Jedis jedis = null; 
       private String publisherChannel = null;

public EnglishTweetFilter(String name, String subscriberChannel, String publisherChannel) throws Exception{
              super(name, subscriberChannel);
              this.publisherChannel = publisherChannel;
              conn = RedisConnection.getRedisConnection();
              jedis = conn.getJedis();           
       }

       @Override
       public void onMessage(String subscriberChannel, String message){
              JsonParser jsonParser = new JsonParser();
              JsonElement jsonElement = jsonParser.parse(message);
              JsonObject jsonObject = jsonElement.getAsJsonObject();

              //filter messages: publish only English tweets           
if(jsonObject.get(“lang”) != null &&
       jsonObject.get(“lang”).getAsString().equals(“en”)){
                      jedis.publish(publisherChannel, message);
              }
       }
}

The Publisher class has a publish method that publishes messages to the required channel.

public class Publisher{
.
.     
       public void publish(String msg) throws Exception{
              jedis.publish(channel, msg);
       }
.
}

The main class reads data from the ingest stream and posts it to the AllData channel. The main method of this class starts all of the filter objects.

public class IngestPubSub
{
.
       public void start() throws Exception{
       .
       .
              publisher = new Publisher(“AllData”);

              englishFilter = new EnglishTweetFilter(“English Filter”,”AllData”,
                                           “EnglishTweets”);
              influencerFilter = new InfluencerTweetFilter(“Influencer Filter”,
                                           “AllData”, “InfluencerTweets”);
              hashtagCollector = new HashTagCollector(“Hashtag Collector”, 
                                           “EnglishTweets”);
              influencerCollector = new InfluencerCollector(“Influencer Collector”,
                                           “InfluencerTweets”);
       .
       .
}

Ingest with Redis Lists

The List data structure in Redis makes implementing a queueing solution easy and straightforward. In this solution, the producer pushes every message to the back of the queue, and the subscriber polls the queue and pulls new messages from the other end.

Pros

This method is reliable in cases of connection loss. Once data is pushed into the lists, it is preserved there until the subscribers read it. This is true even if the subscribers are stopped or lose their connection with the Redis server.
Producers and consumers require no connection between them.

Cons

Once data is pulled from the list, it is removed and cannot be retrieved again. Unless the consumers persist the data, it is lost as soon as it is consumed.
Every consumer requires a separate queue, which requires storing multiple copies of the data.

Code design for the Redis Lists solution

redis lists class diagram — Figure 5. Class diagram of the fast data ingest solution with Redis Lists

You can download the source code for the Redis Lists solution here: https://github.com/redislabsdemo/IngestList. This solution’s main classes are explained below.

MessageList embeds the Redis List data structure. The push() method pushes the new message to the left of the queue, and pop() waits for a new message from the right if the queue is empty.

public class MessageList{

       protected String name = “MyList”; // Name
.
.     
       public void push(String msg) throws Exception{
              jedis.lpush(name, msg); // Left Push
       }

       public String pop() throws Exception{
              return jedis.brpop(0, name).toString();
       }
.
.
}

MessageListener is an abstract class that implements listener and publisher logic. A MessageListener object listens to only one list, but can publish to multiple channels (MessageFilter objects). This solution requires a separate MessageFilter object for each subscriber down the pipe.

class MessageListener implements Runnable{
       private String name = null;
       private MessageList inboundList = null;
       Map<String, MessageFilter> outBoundMsgFilters = new HashMap<String, MessageFilter>();
.
.     
       public void registerOutBoundMessageList(MessageFilter msgFilter){
              if(msgFilter != null){
                      if(outBoundMsgFilters.get(msgFilter.name) == null){
                             outBoundMsgFilters.put(msgFilter.name, msgFilter);
                      }
              }
       }

.
.
       @Override
       public void run(){
.
                      while(true){
                             String msg = inboundList.pop();
                             processMessage(msg);
                      }                                  
.
       }

.
       protected void pushMessage(String msg) throws Exception{
              Set<String> outBoundMsgNames = outBoundMsgFilters.keySet();
              for(String name : outBoundMsgNames ){
                      MessageFilter msgList = outBoundMsgFilters.get(name);
                      msgList.filterAndPush(msg);
              }
       }
}

MessageFilter is a parent class facilitating the filterAndPush() method. As data flows through the ingest system, it is often filtered or transformed before being sent to the next stage. Classes that extend the MessageFilter class override the filterAndPush() method, and implement their own logic to push the filtered message to the next list.

public class MessageFilter{

       MessageList messageList = null;
.
.
       public void filterAndPush(String msg) throws Exception{
              messageList.push(msg);
       }
.
.     
}

AllTweetsListener is a sample implementation of a MessageListener class. This listens to all tweets on the AllData channel, and publishes the data to EnglishTweetsFilter and InfluencerFilter.

public class AllTweetsListener extends MessageListener{
.
.     
       public static void main(String[] args) throws Exception{
              MessageListener allTweetsProcessor = AllTweetsListener.getInstance();

allTweetsProcessor.registerOutBoundMessageList(new
              EnglishTweetsFilter(“EnglishTweetsFilter”, “EnglishTweets”));
              allTweetsProcessor.registerOutBoundMessageList(new
                             InfluencerFilter(“InfluencerFilter”, “Influencers”));

              allTweetsProcessor.start();
       }
.
.
}

EnglishTweetsFilter extends MessageFilter. This class implements logic to select only those tweets that are marked as English tweets. The filter discards non-English tweets and pushes English tweets to the next list.

public class EnglishTweetsFilter extends MessageFilter{

       public EnglishTweetsFilter(String name, String listName) throws Exception{
              super(name, listName);
       }

       @Override
       public void filterAndPush(String message) throws Exception{
              JsonParser jsonParser = new JsonParser();

              JsonElement jsonElement = jsonParser.parse(message);
              JsonArray jsonArray = jsonElement.getAsJsonArray();
              JsonObject jsonObject = jsonArray.get(1).getAsJsonObject();
              if(jsonObject.get(“lang”) != null &&
jsonObject.get(“lang”).getAsString().equals(“en”)){
                             Jedis jedis = super.getJedisInstance();
                             if(jedis != null){
                                    jedis.lpush(super.name, jsonObject.toString());
    
                             }
              }
       }
}

1 2 Page 1 Next

Page 1 of 2