GPU and Large Scale Data Mining

The GPU (Graphics Prossessing Unit) is changing the face of large scale data mining by significantly speeding up the processing of data mining algorithms.  For example, using the K-Means clustering algorithm, the GPU-accelerated version was found to be 200x-400x faster than the popular benchmark program MimeBench running on a single core CPU, and 6x-12x faster than a highly optimised CPU-only version running on an 8 core CPU workstation.

These GPU-accelerated performance results also hold for large data sets.  For example in 2009 data set with 1 billion 2-dimensional data points and 1,000 clusters, the GPU-accelerated K-Means algorithm took 26 minutes (using a GTX 280 GPU with 240 cores) whilst the CPU-only version running on a single-core CPU workstation, using MimeBench, took close to 6 days (see research paper “Clustering Billions of Data Points using GPUs” by Ren Wu, and Bin Zhang, HP Laboratories).  Substantial additional speed-ups are expected were the tests conducted today on the latest Fermi GPUs with 480 cores and 1 TFLOPS performance.

Over the last two years hundreds of research papers have been published, all confirming the substantial improvement in data mining that the GPU delivers.   I will identify a further 7 data mining algorithms where substantial GPU acceleration  have been achieved in the hope that it will stimulate your interest to start using GPUs to accelerate your data mining projects:

  • Hidden Markov Models (HMM) have many data mining applications such as financial economics, computational biology, addressing the challenges of financial time series modelling (non-stationary and non-linearity), analysing network intrusion logs, etc.   Using parallel HMM algorithms designed for the GPU, researchers (see cuHMM: a CUDA Implementation of Hidden Markov Model Training and Classification by Chaun Lin, May 2009) were able to achieve performance speedup of up to 800x on a GPU compared with the time taken on a single-core CPU workstation.
  • Sorting is a very important part of many data mining application.  Last month Duane Merrill and Andrew Grinshaw (from University of Virginia) reported achieving a very fast implementation of the radix sorting method and was able to exceed 1G keys/sec average sort rate on an the GTX480 (NVidia Fermi GPU).  See
  • Density-based Clustering is an important paradigm in clustering since typically it is noise and outlier robust and very good at searching for clusters of arbitrary shape in metric and vector spaces.  Tests have shown that the GPU speed-up ranged from 3.5x for 30k points to almost 15x for 2 million data points. A guaranteed GPU speedup factor of at least 10x was obtained on data sets consisting of more than 250k points. (See “Density-based Clustering using Graphics Processors” by Christian Bohm et al).
  • Similarity Join is an important building block for similarity search and data mining algorithms. Researchers using a special algorithm called Index-supported similarity join for the GPU to outperform the CPU by a factor of 15.9x on 180 Mbytes of data (See “Index-supported Similarity Join on Graphics Processors” by Christian Bohm et al).
  • Bayesian Mixture Models has applications in many areas and of particular interest is the Bayesian analysis of structured massive multivariate mixtures with large data sets.  Recent research work (see “Understanding the GPU Programming for Statistical Computation: Studies in Massively Massive Mixtures” by Marc Suchard et al.) has demonstrated that an old generation GPU (GeForce GTX285 with 240 cores) was able to achieve a 120x speed-up over a quad-core CPU version.
  • Support Vector Machines (SVM) has many diverse data mining uses including classification and regression analysis.   Training SVM and using them for classification remains computationally intensive.  The GPU version of a SVM algorithm was found to be 43x-104x faster than SVM CPU version for building classification models and 112x-212x faster over SVM CPU version for building regression models.  See “GPU Accelerated Support Vector Machines for Mining High-Throughput Screening Data” by Quan Liao, Jibo Wang, et al.
  • Kernel Machines. Algorithms based on kernel methods play a central part in data mining including modern machine learning and non-parametric statistics. Central to these algorithms are a number of linear operations on matrices of kernel functions which take as arguments the training and testing data.   Recent work (See “GPUML: Graphical processes for speeding up kernel machines”  by Balaji Srinivasan et al. 2009) involves transforming these Kernel Machines into parallel kernel algorithms on a GPU and the following are two example where considerable speed-ups were achieved;  (1) To estimate the densities of 10,000 data points on 10,000 samples. The CPU implementation took 16 seconds whilst the GPU implementation took 13ms which is a significant speed-up will in excess of 1,230x;  (2) In a Gaussian process regression, for regression 8 dimensional data the GPU took 2 seconds to make predictions whist the CPU version took hours to make the same prediction which again is a significant speed-up over the CPU version.
Using GPUs on Workstations

The majority of the current GPU data mining research was conducted on workstations where a single CPU is connected, via PCI Express bus, to between 1 and 4 CUDA GPUs.  A typical use-case involved using only one GPU connected to one CPU. CUDA (an acronym for Compute Unified Device Architecture) is a parallel computing architecture developed by Nvidia.  A CUDA program comprises a main host program (written in an extended C/C++) that is executed on the CPU and which calls out to programs that will be loaded and executed on the GPU.   CUDA requires that each GPU is “managed” by a unique CPU thread.  So to use 4 GPUs with single CPU will require at least a 4-core CPU.

There are other types of GPUs (e.g. AMD-ATI) however the CUDA GPU is the market leader and we will only be referring to the Nvidia CUDA GPUs in this post.

Now you may be asking the following questions “What happens when you want to scale-up beyond a single workstation to a cluster of GPUs?” and “Can we use small GPU clusters for Big Data Analytics instead of the traditional large commodity clusters running Hadoop MapReduce?” Answers to these questions will be presented in a subsequent post.

GPU for Data Mining – 10 Things To Do

If you have not yet started experimenting with GPU’s then here are 10 things to do to get started:

  1. Understand the CUDA GPU concepts using the web-based training provided by Nvidia at There are many good courses on this site including screencasts.  Also very useful if you want to learn how to develop CUDA GPU programs.  For those who prefer to read books instead of online training material I would recommend the book “CUDA by Example – An Introduction to General-Purpose GPU Programming” by Jason Sanders and Edward Kandrot.


  1. To get a feel as to the range of CUDA applications (not just data mining) by visiting the CUDA Community Showcase


  1. Visit the GTC 2009 Conference archive to see all the screen-casts and video recordings showing how the CUDA GPU has been applied in science, manufacturing, finance and various research projects. The link is Later in October 2010 the GTC 2010 Conference archive will be available online.  Look out for  webcasts such as “2069 – GPU Accelerated Business Intelligence Analytics” and “2111 – Using R for High-Performance Data Analysis”


  1. If you want to use the GPUs but you do not want to get your hands “dirty” writing CUDA C/C++ code (or other languages bindings such as Python, Java, .NET, Fortran, Perl, or Lau) then consider using MATLAB Parallel Computing Toolbox.  This is a powerful solution for those who know MATLAB.   Alternatively R now has GPU plugins.   A subsequent post will cover using MATLAB and R for GPU accelerated data mining.


  1. It is highly likely that your PC may have an Nvidia GPU graphics card that can be used to start to learn CUDA programming.  For full information on setting up a CUDA development environment and downloading the free CUDA SDK visit


  1. Attempting to move your existing data mining algorithms to the CUDA GPU without fundamentally re-designing your algorithms, to take advantage of the ability of the GPU to execute many thousands of concurrent threads, is likely to result in very poor performance.


  1. Re-design your algorithms to take account of the limited on-device GPU memory (the Fermi now supports 6GB of global on device memory, up from 1GB- 4GB on previous GPUs).  Also you need to understand how to co-ordinate the transfer of data from the CPU to the GPU, with processing on the GPU, and the transfer of the results back to the CPU.


  1. Do not be afraid to forget all the ideas you learnt when you were developing your sequential data mining algorithms for the CPUs.  Instead look for computationally dense parallel algorithms that are can be applied to the CUDA GPU architectures that will give you the required speed-up.


  1. Conduct a Google search looking for recent data mining papers showing how researchers have used the GPU in your area of interest.  You will gain many useful insights by reading these papers, including sample CUDA algorithms.  However do NOT accept that the speed-up they achieved is the last word on the matter.


  1. If you do not want to develop the GPU algorithms yourself then either get your IT department (if they have the expertise) or get an external CUDA consultancy company to develop the GPU algorithms to meet your requirements.



When looking at recently published GPU research papers here are at least three reasons why you can expect today  to achieve even better speed-ups :

  • They may have been using an older version of the GPU such as the G80 chip (instead of the Fermi GTX 480 which at this time is the fastest and latest CUDA GPU).  The older GPUs have less cores, slower processing speeds and some software limitations that are not found in the newer generation of GPUs such as the Fermi.
  • The architecture and capabilities of the Fermi is much greater than earlier generations so do not assume that all the  limitations found in these papers applies to the Fermi.  These means that it is now possible to develop better and faster parallel algorithms than was possible in 2008 and 2009.
  • You may find that the researchers are using data mining algorithms that will only achieve a relatively modest speed-up due to poor parallel design.  It may be possible to  achieve a much better speed-up if you use a more computationally dense variant of the same family of data mining algorithms.

Finally have fun!

Posted by Suleiman Shehu – CEO Azinta Systems