Red Hat
Aug 14, 2015
by Christina Lin
Very often in the integration space we need to deal with large amounts of data. When designing an integration solution, we really need to stop and take a good look at how to deal with this data.
You may find yourself having to handle large data in the following situations:

  • Incoming data 
  • Processing data
  • Providing output 

In my experience, a large amount of incoming data means messages arriving at very high frequency as well as in high volume, so the risk of too much content flooding our application is high. My approach is to restrict the data coming into the application, so that it runs at maximum capacity while avoiding jamming the system. In this case I will:

  • Try to use a Polling Consumer if possible. Many components in JBoss Fuse support a polling mechanism, such as File, FTP, JPA, and Quartz2, and they let you configure how frequently the polling should happen.

  • If no polling consumer is available, or the data is simply coming in too fast, try the Throttler EIP. Besides setting the time period, it also lets you control how many messages are let through in each period.

When applying the above patterns, make sure the place where you receive the input is large enough to temporarily hold the incoming data. Try to extend the capacity of the service, either by having more service instances share the load or simply by allocating more hardware resources.
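
To make the throttling idea concrete, here is a minimal sketch in plain Python. It is not the JBoss Fuse or Camel API; the `Throttler` class, its parameter names, and the sliding-window approach are assumptions chosen only to illustrate the idea of letting at most a fixed number of messages through per time period.

```python
import time
from collections import deque


class Throttler:
    """Let at most `max_messages` pass per `period` seconds (sliding window)."""

    def __init__(self, max_messages, period, clock=time.monotonic, sleep=time.sleep):
        self.max_messages = max_messages
        self.period = period
        self.clock = clock    # injectable so the behaviour can be checked without waiting
        self.sleep = sleep
        self._sent = deque()  # timestamps of recently released messages

    def acquire(self):
        """Block until the next message may be released downstream."""
        while True:
            now = self.clock()
            # Forget releases that have fallen out of the current window.
            while self._sent and now - self._sent[0] >= self.period:
                self._sent.popleft()
            if len(self._sent) < self.max_messages:
                self._sent.append(now)
                return
            # Window is full: wait until the oldest release expires.
            self.sleep(self.period - (now - self._sent[0]))


def throttle(messages, throttler):
    """Yield messages no faster than the throttler allows."""
    for message in messages:
        throttler.acquire()
        yield message
```

With `Throttler(100, 1.0)`, for example, a consumer looping over `throttle(incoming, throttler)` would see at most 100 messages in any one-second window, while everything else waits upstream. That is why the holding area in front of the throttle has to be large enough.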

Another situation with large amounts of data is handling a single large message. Because JBoss Fuse loads messages into memory as it processes them, a message that gets too big will soon cause an out-of-memory problem. So, when dealing with one large chunk of data, my approach is:

  • Split your content. When dealing with large files, it is always best to split them into smaller chunks if possible, for several reasons: it avoids consuming a large amount of memory at once, and the smaller chunks can even be processed in parallel instead of one after the other. (For large XML files, using the XML tokenizer significantly reduces memory usage.)

  • Enable streaming. When this is turned on, instead of holding the entire message in memory, it writes big streams to a file. You can even configure the StreamCachingStrategy to customize the size, location, buffer size, memory limits, etc.

  • Filter the content. With large data it is often the case that not every part of the file is needed for further processing, yet the original message is needed later. I would then use the Claim Check pattern: first filter the data that is sent on, and then retrieve the original data when needed.
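
The splitting and streaming ideas can be sketched together in plain Python. This is an illustration under stated assumptions, not Camel's splitter or stream caching: the function names are hypothetical, the input is assumed to be line-oriented, and `process_chunk` is a placeholder for whatever per-chunk work is needed.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def split_in_chunks(lines, chunk_size):
    """Lazily yield lists of up to `chunk_size` lines; the whole input is never loaded."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def process_chunk(chunk):
    """Hypothetical per-chunk work: here, just count the non-empty records."""
    return sum(1 for line in chunk if line.strip())


def process_stream(stream, chunk_size=1000, workers=4):
    """Split a line-oriented stream and process the chunks in parallel."""
    total = 0
    pending = deque()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in split_in_chunks(stream, chunk_size):
            pending.append(pool.submit(process_chunk, chunk))
            # Keep at most `workers` chunks in flight so memory stays bounded.
            if len(pending) >= workers:
                total += pending.popleft().result()
        while pending:
            total += pending.popleft().result()
    return total
```

Passing an open file object as `stream` reads it line by line, so memory use depends on `chunk_size` and `workers` rather than on the size of the file.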

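The Claim Check idea can be sketched in a few lines of plain Python. The `ClaimCheckStore` class and `to_light_message` function are hypothetical names, and the dictionary stands in for whatever persistent store would really hold the original message.

```python
import uuid


class ClaimCheckStore:
    """In-memory stand-in for a persistent store (database, file system, ...)."""

    def __init__(self):
        self._items = {}

    def check_in(self, payload):
        """Store the full payload and return a small claim ticket."""
        ticket = str(uuid.uuid4())
        self._items[ticket] = payload
        return ticket

    def claim(self, ticket):
        """Return the original payload and remove it from the store."""
        return self._items.pop(ticket)


def to_light_message(message, store, needed_fields):
    """Keep only the fields the next steps need, plus the claim ticket."""
    light = {field: message[field] for field in needed_fields}
    light["claim_ticket"] = store.check_in(message)
    return light
```

Only the light message travels through the intermediate steps. Whichever step needs the full content calls `store.claim(light["claim_ticket"])` to get the original back.
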
Last but not least is providing large amounts of data to a broad audience of clients. We try to send as little data as possible, so the most sensible way is to place a buffer between the clients and the output producer.

  • Publish and subscribe (Topic) in messaging. This is probably the first scenario that comes to my mind: it guarantees that the subscribers get the message, while the producer only has to write it once.

  • Caching medium. When the messages need to be read repeatedly by clients, placing the content into a caching medium such as a database, or an even faster one like an in-memory cache, is better than messaging, as messages will be gone as soon as all the recipients have received the data.
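
The difference between the two options can be shown with a toy sketch in plain Python. Both classes are hypothetical in-memory stand-ins, not a real message broker or cache product. They only illustrate that a topic delivers each message once per subscriber, while a cache keeps the content available for repeated reads.

```python
from collections import deque


class Topic:
    """Publish and subscribe: the producer writes once, every subscriber gets a copy."""

    def __init__(self):
        self._queues = {}

    def subscribe(self, name):
        self._queues[name] = deque()

    def publish(self, message):
        for queue in self._queues.values():
            queue.append(message)

    def receive(self, name):
        """Consume the next message; once read, it is gone for this subscriber."""
        queue = self._queues[name]
        return queue.popleft() if queue else None


class Cache:
    """Caching medium: content stays available for repeated reads."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)
```

A subscriber that calls `receive` a second time gets nothing back, whereas `cache.get` returns the same content as often as the client asks for it.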

As you can see, there are many possible ways to handle large data in JBoss Fuse. Because of its flexibility, you have the freedom to choose the right strategy for each situation, and there are many more options and combinations available. What is your approach when dealing with large data? I am curious about all the ingenious ways people solve their problems, so let me know!