Scaling-up GPUs for Big Data Analytics – MapReduce and Fat Nodes

Over the last two years large clusters comprising 1,000s of commodity CPUs, running Hadoop MapReduce, have powered the analytical processing of “Big data” involving hundreds of terabytes of information.

Now a new generation of CUDA GPUs, based on the Fermi architecture (see figure 1) and on Kepler (due in 2011), has created the potential for a new disruptive technology for “Big data” analytics based on the use of much smaller hybrid CPU-GPU clusters.

These small to medium size hybrid CPU-GPU clusters will be available at 1/10 the hardware cost and 1/20 the power consumption cost, and will deliver processing speed-ups of up to 500x or more when compared with regular CPU clusters running Hadoop MapReduce.  This disruptive technological advance will enable small business units and organisations to compete with the much larger businesses that can afford to deploy very large Hadoop CPU clusters for “Big data” analytics.


Figure 1: Nvidia Next Generation Fermi GPU with 512 processing cores



In an earlier post, “GPU and Large Scale Data Mining”, I outlined the substantial performance speed-ups that have been achieved using single GPU node workstations for data mining. I also posed the question “What happens if you wanted to scale up to a cluster of GPUs?”  I shall now attempt to answer this question within the context of “Big data” analytics.

However, before discussing GPU clusters and “Big data” analytics, it would be useful to give a quick overview of MapReduce as implemented by Hadoop (an open source project that implements the concepts of Google MapReduce, the original MapReduce system) and see how it has been applied to “Big data” processing.

Hadoop MapReduce Architecture

The Hadoop MapReduce Framework is designed to run on large clusters of hundreds, or thousands, of commodity machines.  The Hadoop MapReduce runtime partitions the input data, schedules the program’s execution over a set of machines, and handles all inter-machine communication and machine failures, removing from the programmer the need to handle these issues.

The programmer does not have to know anything about distributed and parallel programming. The programmer writes just two programs for each job: a Map program and a Reduce program. The Map program takes the input data assigned to it by the MapReduce runtime, processes it and emits intermediate results. These intermediate results are initially stored to disk and are subsequently processed by the Reduce program, which collects them to produce the final output for the job.
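
To make the two-program model concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names (`map_words`, `reduce_counts`, `run_job`) are hypothetical, and `run_job` is only a toy stand-in for the grouping that the real runtime performs across machines and via disk.

```python
from collections import defaultdict


def map_words(document):
    """Map: emit an intermediate (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce: collapse all intermediate values for one key into a final result."""
    return word, sum(counts)


def run_job(documents):
    """Toy stand-in for the runtime: group map output by key, then reduce each group."""
    grouped = defaultdict(list)
    for document in documents:
        for key, value in map_words(document):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


result = run_job(["big data on GPUs", "big clusters for big data"])
```

The programmer only supplies the first two functions; partitioning, scheduling and failure handling stay with the runtime.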

The MapReduce model helps reduce software development costs because the programmer’s software is not required to address distributed parallel programming concerns.  This is all handled by the MapReduce runtime.

The simplicity of the programming model and the quality of services attached to the Hadoop MapReduce ecosystem have created a lot of enthusiasm and many success stories amongst the “Big data” processing communities.

Constraints of the Hadoop MapReduce Model

Whilst the Hadoop MapReduce architecture has successfully met its original design goals it suffers from the following constraints:

  • The Hadoop MapReduce framework is typically applied to large batch-oriented computations that are primarily concerned with time to job completion and not real-time computations.
  • All the intermediate output from each map and reduce stage is materialized to disk before it can be consumed by the next stage or produce output.  This strategy is a simple and elegant checkpoint/restart fault tolerance mechanism that is essential for large clusters of commodity machines with high failure rates, but it comes with a performance price.
  • Each node within a MapReduce cluster is typically a 2 – 4 core CPU and the network links between nodes are typically 1Gb/s Ethernet links.  No GPU co-processors are attached to the nodes, so hundreds or thousands of servers must be used to get the required processing speed-up.   This leads to a substantial up-front investment to build your own private large-scale MapReduce cluster, plus high on-going power consumption costs.
  • Moving in-house MapReduce computations to an external MapReduce cloud service, whilst eliminating initial upfront hardware build and operational costs, may still result in substantial usage fees if hundreds or thousands of machines are required. Furthermore there is still the problem of moving large data sets to the cloud if your MapReduce jobs consume hundreds of terabytes of data.  This is a data transfer issue that is sometimes overlooked.
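
The data transfer issue in the last point is easy to put a number on. The sketch below is a rough estimate only: the 100 TB data set and the link speeds are illustrative assumptions, and the function name `transfer_days` is hypothetical. It also assumes the link runs at full line rate unless `efficiency` is lowered.

```python
def transfer_days(terabytes, link_gbps, efficiency=1.0):
    """Days needed to move `terabytes` (decimal TB) over a `link_gbps` gigabit/s link."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86400  # seconds per day


# Illustrative figures only: 100 TB over a 1 Gb/s and a 10 Gb/s link.
days_1g = transfer_days(100, 1)
days_10g = transfer_days(100, 10)
```

Under these assumptions a 1 Gb/s link needs a little over nine days for 100 TB, before any protocol overhead is taken into account.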

Other MapReduce Variants

There are many variants of the MapReduce model that have attempted to address some of these constraints whilst still using commodity CPUs.  These MapReduce variants include:

  • Twister is a MapReduce variant that extends the MapReduce model to support iterative computations, making it much easier to implement the iterative analytical processing required for activities such as data clustering, machine learning and computer vision.  For more information visit
  • Hadoop Online Prototype (HOP) extends MapReduce to support continuous queries and online aggregation which allows users to see “early returns” from a job as it is executed. HOP programs can be written for applications such as event monitoring and stream processing. For additional information visit
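
To show why iterative support matters, here is a minimal sketch of one-dimensional k-means clustering written as a repeated map/reduce pair. It does not use the Twister API; all names are hypothetical and the driver loop runs in a single process. The point is only that the same Map and Reduce programs are re-run until the result stops changing.

```python
from collections import defaultdict


def map_assign(point, centroids):
    """Map: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
    return nearest, point


def reduce_mean(points):
    """Reduce: the new centroid is the mean of the points assigned to it."""
    return sum(points) / len(points)


def kmeans(points, centroids, max_iterations=20):
    """Driver loop: repeat the map/reduce pair until the centroids stop moving."""
    for _ in range(max_iterations):
        groups = defaultdict(list)
        for point in points:
            key, value = map_assign(point, centroids)
            groups[key].append(value)
        # A centroid with no assigned points keeps its previous position.
        updated = [reduce_mean(groups[i]) if groups[i] else centroids[i]
                   for i in range(len(centroids))]
        if updated == centroids:
            break
        centroids = updated
    return centroids


centroids = kmeans([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0])
```

In plain Hadoop MapReduce each pass of this loop would be a separate job with its intermediate results written to disk, which is the overhead an iterative variant aims to avoid.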

The key point to note with the above variants is that the MapReduce model is not “set in stone”. Many researchers have been exploring ways to extend the capabilities of MapReduce and overcome some of its design constraints.

The Big Question

Now High Performance Computing (HPC) GPU clusters are typically available at 1/10 the price and 1/20 the power consumption cost of traditional large CPU clusters. So the big question is this: “Given the constraints of Hadoop MapReduce, can we leverage the parallel processing power of the latest generation of Fermi GPUs, coupled with the simplicity of the MapReduce model, to create much smaller and more affordable CPU-GPU clusters that can be used for real-time ‘Big data’ analytics?”

Now some readers may not think that this is possible because of the view that GPUs can only be used for mathematical and scientific processing. However, at the heart of “Big data” analytics is a substantive core of mathematical algorithms that are required to do the actual data mining work.

The key to the efficient use of the GPU is to make sure that there is a sufficient amount of computationally intensive work for the GPU to do, and that this work is done in such a way as to hide memory and network latencies.
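
Latency hiding can be illustrated with a simple timing model. The sketch below is an idealised assumption, not a measurement: it compares processing chunks of data strictly one after the other with a double-buffered pipeline in which the next chunk is transferred while the current one is being computed. The function names and the timings are hypothetical.

```python
def serial_time(chunks, transfer_s, compute_s):
    """No overlap: every chunk is transferred, then computed, one after the other."""
    return chunks * (transfer_s + compute_s)


def overlapped_time(chunks, transfer_s, compute_s):
    """Double-buffered pipeline: chunk i+1 is transferred while chunk i is computed."""
    if chunks == 0:
        return 0.0
    # First transfer and last compute cannot be hidden; the rest run at the slower rate.
    return transfer_s + (chunks - 1) * max(transfer_s, compute_s) + compute_s


# Hypothetical figures: 100 chunks, 2 ms to transfer and 8 ms to compute each.
no_overlap = serial_time(100, 0.002, 0.008)
with_overlap = overlapped_time(100, 0.002, 0.008)
```

In this model the transfer cost almost disappears once the compute time per chunk exceeds the transfer time, which is why the GPU needs enough computationally intensive work per byte moved.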

Fortunately the CUDA GPU has been designed to address these issues, and these capabilities have been substantially enhanced in the latest generation of GPUs – the Fermi.

So the answer to the “big question” is yes, provided you select and correctly implement the right parallel data mining algorithms for your “Big data” analytical processing.

Now it would be useful to look briefly at the early MapReduce GPU research work that has been conducted over the last two years before examining the potential for the development of new MapReduce variants designed for the new generation of Fermi GPU clusters.

Early GPU and MapReduce Implementations

Early GPU implementations of the MapReduce model include the following research projects:

  • In 2008 Bingsheng He published the seminal research paper “Mars: A MapReduce Framework on Graphics Processors”.  The Mars MapReduce variant was developed for just a single Nvidia G80 GPU (which contains 120 processors) and was found to be up to 16 times faster than a 4-core CPU-based implementation for six common analytical web applications.  Whilst Mars was not designed to scale beyond a single GPU, it was the first project to show the GPU’s potential for MapReduce.
  • In 2009 Alok Mooley published a paper called “DisMaRC: A Distributed Map Reduce Framework on CUDA” about a project that implemented a MapReduce variant on a small distributed cluster of commodity CPUs, where each CPU had 2 attached Nvidia FX5600 GPUs.  This was a significant advance over the Mars project: a 2 node cluster (each node having a CPU with 2 attached GPUs) achieved a speed-up of more than 4x over the Mars performance. The DisMaRC work did not conduct performance tests against a standard cluster running Hadoop MapReduce.  However, it showed that the initial Mars research work was on the right track.
  • In late 2009 Reza Farivar published a significant and groundbreaking paper called “MITHRA: Multiple data Independent Tasks on a Heterogeneous Resource Architecture”. This paper demonstrated that a version of Hadoop MapReduce, when “ported” to a small 4-node GPU cluster, could outperform a regular 62 node Hadoop CPU cluster, achieving a 508x speed-up per cluster node when performing Black-Scholes option pricing. It should be noted that the Black-Scholes algorithm is an analytical algorithm.  The GPU cluster configuration comprised 5 nodes, each comprising a quad-core CPU with two 9800 GX2 GPUs (each with 128 processor cores), connected to a Gigabit Ethernet router, and one control node also connected to the Gigabit Ethernet router.  This work demonstrated, for the first time, that you could have a small hybrid CPU-GPU cluster less than 1/10 the size of a regular cluster and yet still achieve a 508x speed-up over a regular Hadoop cluster 10 times its size.
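
Black-Scholes pricing suits this style of processing because every option can be priced independently of all the others, which makes it a natural Map function. The sketch below uses the standard closed-form price of a European call option; it is not code from the MITHRA paper, the function names are hypothetical and the option parameters are illustrative only.

```python
from math import erf, exp, log, sqrt


def norm_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def black_scholes_call(spot, strike, rate, volatility, years):
    """Closed-form Black-Scholes price of a European call option."""
    d1 = (log(spot / strike) + (rate + 0.5 * volatility ** 2) * years) / (
        volatility * sqrt(years))
    d2 = d1 - volatility * sqrt(years)
    return spot * norm_cdf(d1) - strike * exp(-rate * years) * norm_cdf(d2)


# Illustrative inputs: (spot, strike, risk-free rate, volatility, years to expiry).
options = [(100.0, 100.0, 0.05, 0.2, 1.0), (100.0, 110.0, 0.05, 0.2, 0.5)]

# The "map" step: each option is priced with no dependence on any other option.
prices = [black_scholes_call(*option) for option in options]
```

On a GPU each option would typically be handled by its own thread, so millions of options can be priced in a single pass with no communication between them.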

It should be noted that all of this initial GPU MapReduce research work used the older GPUs (G80 and GT200), and I expect to start seeing further papers by next year reporting on new GPU MapReduce research developments based on the disruptive power of the Fermi GPU.

However, a fast Fermi is by itself not sufficient to build the type of hybrid GPU clusters that I believe are required to compete with the large Hadoop clusters. So before I present my predictions for hybrid CPU-GPU MapReduce clusters, I would like to talk briefly about Fat Nodes.

Scaling the GPU – using Fat Nodes

One approach to scaling up from a single GPU workstation is to use the concept of Fat Nodes. The purpose of Fat Nodes is to keep as much processing as possible on the local node, using the following architectural design features:

  • Use dual 12 core CPUs, each with 64GB or more of RAM, giving 24 CPU cores and 128GB of RAM on each node.
  • Connect 10 or more Fermi GPUs to the dual CPUs to provide 4,800 GPU processing cores, delivering over 10 TFLOPs of processing power on each node.
  • Replace local hard disks with high-speed solid-state drives (SSDs) attached via PCI Express, each delivering 200K input/output operations per second (IOPS) or more. Multiple SSDs can be combined to run in parallel to achieve more than 2.2 million read IOPS on a single node. There are other design options; for example, the SSDs could be shared between multiple nodes using PCIe IO virtualisation.  The key point, however, is to replace local hard disks with SSDs for high data retrieval performance. For example, a single SSD with 200K IOPS on a single PCIe card is equivalent to having between 50 – 100 hard disks running in parallel.
  • Use, where possible, 40 Gb/s InfiniBand network connections (5Gb/s and 10Gb/s Ethernet are alternatives but with much higher network latencies) for inter-node network traffic. When this is combined with Nvidia’s GPU Direct, GPUs can transfer data from their local device memory to another CPU on another node without first putting the data into local CPU memory. This, coupled with a network transfer rate of up to 90M MPI messages per second across the PCIe 2 bus to another node, substantially exceeds the message-passing capabilities of a large Hadoop cluster.
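
The per-node totals in the list above follow from simple multiplication, sketched below. Two of the inputs are my own assumptions rather than stated figures: 480 cores per GPU (implied by 4,800 cores across 10 GPUs) and 11 SSDs (the number needed to reach 2.2 million IOPS at 200K IOPS each). The class name `FatNode` is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FatNode:
    cpus: int = 2
    cores_per_cpu: int = 12
    ram_gb_per_cpu: int = 64
    gpus: int = 10
    cores_per_gpu: int = 480      # assumption: 4,800 cores / 10 GPUs
    ssds: int = 11                # assumption: 2.2M IOPS / 200K IOPS per SSD
    iops_per_ssd: int = 200_000

    def cpu_cores(self):
        return self.cpus * self.cores_per_cpu

    def ram_gb(self):
        return self.cpus * self.ram_gb_per_cpu

    def gpu_cores(self):
        return self.gpus * self.cores_per_gpu

    def read_iops(self):
        return self.ssds * self.iops_per_ssd


node = FatNode()
summary = (node.cpu_cores(), node.ram_gb(), node.gpu_cores(), node.read_iops())
```

With the defaults this gives 24 CPU cores, 128GB of RAM, 4,800 GPU cores and 2.2 million read IOPS per node; changing any field shows how the node scales.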

The ability to easily design and implement Fat Nodes has been further enhanced by a company called NextIO, which has released a reconfigurable GPU IO appliance designed to dynamically implement and connect the Fat Node architecture outlined above (see figure 2).


Figure 2: NextIO High Performance Compute IO Appliance


Next Generation MapReduce CPU-GPU clusters

Over the next 18 months expect to see small to medium size Fat Node CPU-GPU clusters based on the Fermi and its successors. These will revolutionise the cost and speed of doing “Big data” analytics.   All the hardware needed to build Fat Node GPU clusters is already available from vendors such as Appro, NextIO and SuperMicro.

What is really required is a middleware runtime that provides software functionality similar to that which the “Big data” community has come to love and associate with Hadoop MapReduce. Also, if these GPU clusters are going to provide real-time production analytical services, they will need to be integrated with legacy production IT systems (possibly using an enterprise service bus (ESB)), so different layers of middleware will be required.

Clearly there is still some work to be done to implement this vision of MapReduce on the GPU.  I am also expecting that over the next 12 months we will see substantial software improvements in the CUDA SDK to take advantage of the new GPUs coming in 2011 and 2013 (Kepler and Maxwell).  For example, it would be nice to have true virtual memory on the CUDA main device memory.  Based on the hints given at GTC 2010 we can expect to see this and further exciting developments in the coming months. As the old Chinese curse says, “May you live in interesting times”; I think that we do.


Designing a MapReduce variant that is able to leverage the impressive GPU technology that is available now will significantly lower the upfront cluster build and power consumption costs for “Big data” analytics, whilst at the same time taking advantage of the MapReduce model’s ability to lower software development costs.

Smaller companies or business units within large companies will be able to conduct “Big data” analytical processing without having to spend $500,000+ to build a Hadoop cluster.  The ability to have in-house small GPU departmental clusters for “Big data” processing will enable business managers to conduct more computationally intensive “Big data” analytics at substantially lower capital and operational costs.

Posted by Suleiman Shehu – CEO Azinta Systems