Red Hat
Oct 17, 2014
by Ramesh
Users/customers always ask us about the sizing of their Data Virtaulization infrastructure based on Teiid or the JDV product from Redhat. Typically this is very involved question and not a very easy one answer in plain terms. This is due to fact that it involves taking into consideration questions like:
  • What kind of sources that user is working with? Relational, file, CRM, NoSQL etc.
  • How many sources they are trying to integrate? 10, 20, 100?
  • What are the volumes of data they are working with? 10K, 100K, 1M+?
  • What are the query latency times from the sources? 
  • How you are using Teiid to implement the data integration/virtualization solution. What kind of queries that user is executing? Even small federated results may take a lot of server side processing - especially if the plan needs tweaking.
  • Is materializing being used?
  • Is query written in optimal way?
  • and so on..
Each and every one of the question affects the performance profoundly, and if you got mixture of those then it become that much more harder to give a specific configuration.

Before you start to thinking about beefing up your DV infrastructure, the first thing you want to check is:
  • Is my current infrastructure serving my current needs and future expectations?
  • What kind changes your are expecting?
  • Is there a change in type of sources  coming, like using Hadoop or using cloud based solutions?
We need to build the DV infrastructure on based on these available resources combined with mandated requirements for your usecase. Since Teiid is real time data virtualization engine, it heavily depends upon the underlying sources for data retrieval (there are caching strategies to minimize this). If Teiid is working with slow data sources, no matter much hardware you throw at it, you still going to get a slower response.  The place where the more memory and faster hardware can help DV is, when Teiid engine doing lots of aggregations, filtering, grouping and sorting as result of a user query over large sets of rows of results. That means all the above questions I raised may directly impact based on each individual query in terms of CPU and memory.

There are some limitations that Teiid engine itself has:

1.  hard limits which breaks down along several lines in terms of # of storage objects tracked, disk storage, streaming data size/row limits, etc.
  • Internal tables and result sets are limited to 2^31 rows. 
  • The buffer manager has a max addressable space of 16 terabytes - but due to fragmentation you'd expect that the max usable would be less (this is relatively easy to scale up with a larger block size when we need to).  This is the maximum amount of storage available to Teiid for all temporary lobs, internal tables, intermediate results, etc.
  • The max size of an object (batch or table page) that can be serialized by the buffer manager is 32 GB - but no one should ever get near that (the default limit is 8 MB). A batch is set or rows that are flowing through Teiid engine.
Handling a source that has tera/petabytes of data doesn't by itself impact Teiid in any way.  What matters is the processing operations that are being performed and/or how much of that data do we need to store on a temporary basis in Teiid.  With a simple forward-only query, as long as the result row count is less than 2^31, Teiid be perfectly happy to return a petabyte of data.

2. what are the soft limits for Teiid based upon the configuration such that it could impact sizing

Each batch/table page requires an in memory cache entry of approximately ~ 128 bytes - thus the total tracked max batches are limited by the heap and is also why we recommend to increase the processing batch size on larger memory or scenarios making use of large internal materializations. The actual batch/table itself is managed by buffer manager, which has layered memory buffer structure with spill over facility to disk. 

3. There are open file handle and other resource considerations (such as buffers being allocated by drivers) that are somewhat indirect from Teiid depending upon the particulars of the data source configurations that can have an impact as well.

4. Using internal materialization is based on buffer manager, it is directly dependent upon it.

5. When using XA the source access is serialized, otherwise source access happens in parallel. This can be controlled using # source threads/per user query.

Some scenarios may not be appropriate for Teiid.  Something contrived, such as 1M x 1M rows cross-join in Teiid, may not be a good fit for the vituralization layer.  But is that a real usecase where you are going to cursor over trillion rows to find what you are looking for? Is there a better targeted query? These are the kind of questions you need to be asking yourself when designing a data virtualization layer. 

Take look at query plan, command log and record the source latencies for a given query and see if your Teiid instance is performing optimally for your usecase. Is it CPU bound vs IO bound (larger source results and large source wait times). See if your submitted queries has been waiting in queue (you can check queue depth). Depending upon where you see the fallout is that is where you may need additional resources.

Our basic hardware recommendation is for smaller departmental use case is (double if you need HA or for disaster recovery) 
  • 16 Core processor
  • Minimum of 32 GB RAM
  • 100+ GB of buffer manager temp disk  (may be use of SSD based device will get better results when lot of cache miss or swapping of results)
  • Redhat Linux 6+
  • gigabit Ethernet
Then do a simple pilot with your own usecase(s) with your own data in your infrastructure with anticipated load. If you think that a DV server is totally CPU bound and queries are being delayed due to that, then you can consider adding additional cores to server or additional nodes in a cluster. Note again, to make to sure your source infrastructure is built to handle the load that DV is executing against it.

What would be really great would be sharing your hardware profiles that you selected for your Teiid environments, and techniques you used to get to the decision.

Thank you.

Ramesh & Steve.
Original Post