Date of Award


Document Type


Degree Name

Master of Science (MS)

Legacy Department

Computer Engineering


Smith, Melissa

Committee Member

Birchfield , Stanley

Committee Member

Gowdy , John


General-Purpose Graphics Processing Units (GPGPUs) have massively parallel computational capabilities. Low cost and ease of programming make them a popular choice over other parallel architectures such as large clusters and accelerators such as Field-Programmable Gate Arrays (FPGAs). Mature programming frameworks for GPGPUs, such as CUDA from Nvidia and OpenCL from the Khronos Group, reduce the learning curve and development time for programming GPGPU architectures. OpenCL, a relatively new industry standard for parallel computing makes it possible to write a single program for heterogeneous platforms that is portable across multiple platforms including GPGPUs and multi-core processors with minimal coding modifications.
GPGPU architectures have been successfully used for accelerating many computationally expensive problems including many linear algebra algorithms, which are inherently parallel in nature. Singular Value Decomposition (SVD) is a computationally expensive linear algebra matrix decomposition technique that has many applications including data compression, facial recognition, and solving a system of equations. As the dimensions of the matrix increase, SVD computation becomes increasingly time consuming. Since SVD is a major part of some algorithms such as Eigenfaces (a facial recognition algorithm based on Principle Component Analysis), the overall runtime for these algorithms depends heavily on the execution time of SVD. Hence, to implement efficient applications based on SVD, for example real-time facial recognition, it is desirable to accelerate the SVD algorithm.
In this work, a parallel implementation of Singular Value Decomposition is discussed in detail. It uses many basic linear algebra techniques such as matrix-vector multiplication, vector norms and vector outer products. This work focuses on the implementation techniques, optimization methods (specifically for a GPGPU implementation) and their effect on the overall performance of the algorithm. We present the performance analysis of this algorithm on NVIDIA's Tesla C2050 GPU as compared to the single-threaded serial implementation executed on an Intel 2.66 GHz Q9450 processor. We report speedups up to 20x for the parallel SVD computation. The results discussed in this thesis demonstrate the potential of the computational resources available with GPGPUs.