Teraflop Parallel Computing on a Budget: Applications of GPU Computing in Mechanical Engineering
August 30, 2009
Sponsored By: National Science Foundation
In the first part of the workshop a sequence of four presentations will highlight the impact of GPU computing in a wide spectrum of Mechanical Engineering applications (agent-based modeling, finite element analysis, manufacturing, and computational dynamics). The second part of the workshop builds around a “Getting Started with GPU Computing” presentation and a discussion of more advanced performance tuning aspects related to GPU computing.
Schedule
Part 1: Introduction to GPU Computing
12:30 – 2:00  Dan Negrut, University of Wisconsin, Madison  Getting Started with GPU Computing  abstract presentation 
Part 2: GPU-enabled Research in Mechanical Engineering
2:10 – 2:30  Krishnan Suresh, University of Wisconsin, Madison  GPU-based Algebraic Reduction of Bone Models  abstract presentation 
2:30 – 2:50  Roshan D’Souza, Michigan Tech University, Houghton  Where Video Games Meet Scientific Computations  abstract presentation 
2:50 – 3:10  Sara McMains, University of California, Berkeley  GPUs for Computer Aided Design and Manufacturability Analysis  abstract presentation 
3:10 – 3:30  Dan Negrut, University of Wisconsin, Madison  Large Scale Collision Detection on the GPU  abstract presentation 
Part 3: Advanced GPU Computing Tutorial and Q&A Session
3:40 – 4:40  Sara McMains, University of California, Berkeley  GPU-based Algebraic Reduction of Bone Models  abstract presentation 
4:50 – 5:30  Panel Discussion  Where Video Games Meet Scientific Computations  abstract presentation 
Abstracts
Krishnan Suresh: GPU-based Algebraic Reduction of Bone Models (Vikalp Mishra, Krishnan Suresh)

Background
There is an increasing demand for rapid finite element analysis of bio-artifacts such as osteopathic bones, fractured hip joints, etc. Bio-artifacts are typically represented via a greyscale voxel image that may be obtained via various modalities such as CT scan, MRI, etc. Consequently, a finite element analysis of such artifacts typically proceeds along one of two paths: (1) a direct hex-mesh based analysis where each voxel, or a group of voxels, is replaced by an equivalent finite-element hex, endowed with hex shape functions, or (2) the artifact’s boundary is inferred via various user-assisted algorithms, followed by a standard finite element analysis of the enclosed volume. Each strategy has its strengths and weaknesses. Briefly, the hex method directly exploits the voxel output of various modalities, but can be computationally daunting since hex methods can result in large stiffness matrices. The boundary-based method, on the other hand, is computationally less demanding, but is a lengthier process that relies heavily on manual assistance. Further, both strategies share a common weakness in that they do not directly support lower-dimensional modeling. This is particularly important since bones often exhibit beam-like behavior.

Algebraic Reduction
We shall present an efficient yet easily automatable method for analyzing long and slender bones. The proposed method directly exploits voxel data from CT scanners, and it achieves high computational efficiency through an algebraic reduction process. In the proposed reduction process, a beam bending stiffness matrix is constructed by extracting the appropriate strain energy from the voxel data. This is then projected onto a lower-dimensional space by appealing to standard 1D beam theories, such as Euler-Bernoulli theory. Thus, neither a 3D hex mesh nor the boundary of the bone needs to be constructed. The theoretical validity of the proposed methodology is substantiated through numerical experiments.
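As a rough illustration of the reduction idea, the sketch below collapses one voxel cross-section to a single bending stiffness EI and then applies the Euler-Bernoulli cantilever formula. It is not the authors' algorithm: it assumes grey values have already been mapped to a Young's modulus per voxel, that the cross-section is uniform along the beam, and the function names are hypothetical.

```python
def section_bending_stiffness(slice_moduli, voxel_size):
    """Return EI of one voxel slice about its modulus-weighted centroidal axis.

    slice_moduli[r][c] is the (assumed known) Young's modulus of voxel (r, c);
    use 0.0 for empty voxels. Rows are stacked along the bending direction.
    """
    area = voxel_size ** 2
    total_ea = 0.0      # sum of E * dA
    first_moment = 0.0  # sum of E * y * dA
    for r, row in enumerate(slice_moduli):
        y = (r + 0.5) * voxel_size  # voxel-centre height
        for modulus in row:
            total_ea += modulus * area
            first_moment += modulus * y * area
    if total_ea == 0.0:
        raise ValueError("slice contains no material")
    y_bar = first_moment / total_ea  # neutral-axis height

    ei = 0.0
    for r, row in enumerate(slice_moduli):
        y = (r + 0.5) * voxel_size
        for modulus in row:
            # Parallel-axis term plus the voxel's own second moment (h^4 / 12).
            ei += modulus * area * ((y - y_bar) ** 2 + voxel_size ** 2 / 12.0)
    return ei


def cantilever_tip_deflection(load, length, ei):
    """Euler-Bernoulli tip deflection of a uniform cantilever under a tip load."""
    return load * length ** 3 / (3.0 * ei)
```

For a homogeneous rectangular slice this reproduces the textbook value E·b·h³/12, which is a convenient sanity check before feeding in real voxel data.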

GPU Implementation
The proposed method also lends itself to easy parallelization. We demonstrate an implementation of the proposed algebraic reduction on an NVIDIA general-purpose graphics processing unit (GPU), using the modern C-like CUDA language. We compare the speedup offered by the GPU with an implementation on a high-end CPU workstation.

References
Jorabchi, K., Danczyk, J., and Suresh, K., Shape Optimization of Potentially Slender Structures, submitted to the Journal of Computing and Information Science in Engineering, 2008.
Wang, C. M., Reddy, J. N., and Lee, K. H., Shear Deformable Beams and Plates: Relationship to Classical Solutions, London: Elsevier Science, 2000.
Roshan D’Souza: When Video Games Meet Scientific Computing: Large-Scale Agent-Based Model Simulations on the GPU
Agent-based modeling is a bottom-up technique to simulate many discrete dynamic systems (swarms, networks, etc.). It is a direct computational representation where the behavior of individual entities in the dynamic system is represented using a software construct called an agent. System-level behaviors are obtained through the interaction of a large number of these agents. There are no analytical tools to predict system behaviors, and as such the only way to analyze the system is through simulation. Due to its emergent nature, ensemble properties are dependent on simulation size (agent population). Representative agent populations can range into the tens of millions for certain systems. Current state-of-the-art systems are incapable of efficiently computing such large simulations. In this presentation I describe some of the early work that we have done in enabling large-scale agent-based model simulations on graphics processing units (GPUs). I will discuss representation of agent state data, algorithms for agent communication, agent spawning, conflict resolution, and statistics. We have implemented and tested our techniques on an NVIDIA 8800 GTX GPU. Benchmarks against state-of-the-art toolkits show that performance gains of over 1000x are possible.
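To make the conflict-resolution step concrete, here is a minimal CPU sketch of one synchronous move step in which several agents may want the same cell. The rule used (lowest agent id wins, moves into currently occupied cells are rejected) and the name `resolve_moves` are assumptions for illustration, not the scheme presented in the talk; on a GPU, each loop would typically map to one thread per agent.

```python
def resolve_moves(positions, proposals):
    """Advance all agents by one synchronous step.

    positions[i] is the cell agent i occupies and proposals[i] is the cell it
    wants to move to. Cells can be any hashable value (ints, tuples, ...).
    """
    occupied = set(positions)

    # Phase 1: each free target cell records the lowest agent id that wants it.
    winner = {}
    for agent, target in enumerate(proposals):
        if target in occupied:
            continue  # assumed rule: no moving into a cell occupied this step
        if target not in winner or agent < winner[target]:
            winner[target] = agent

    # Phase 2: each agent moves only if it won its target; losers stay put.
    return [
        proposals[agent] if winner.get(proposals[agent]) == agent else positions[agent]
        for agent in range(len(positions))
    ]
```

Splitting the step into a "claim" phase and a "commit" phase keeps the outcome independent of the order in which agents are processed, which is the property a parallel implementation needs.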
Sara McMains: NURBS Evaluation using the GPU (Adarsh Krishnamurthy, Sara McMains)
We present a unified method for evaluating and displaying Non-Uniform Rational B-Spline (NURBS) surfaces using the Graphics Processing Unit (GPU). NURBS surfaces, the de facto standard in commercial mechanical CAD modeling packages, are currently being tessellated into triangles before being sent to the graphics card for display, since there is no native hardware support for NURBS. Our method uses a unified GPU fragment program to evaluate the surface point coordinates of any arbitrary-degree NURBS patch directly from the control points and knot vectors stored as textures in graphics memory. The display incorporates dynamic Level of Detail (LOD) for real-time interaction at different resolutions of the NURBS surfaces. Different data representations and access patterns are compared for efficiency and the optimized evaluation method is chosen. Our GPU evaluation and rendering speeds are more than 40 times faster than evaluation using the CPU.
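For readers unfamiliar with what "evaluating from control points and knot vectors" involves, the sketch below evaluates a point on a NURBS curve on the CPU using the standard Cox-de Boor recursion. It is a simplification under stated assumptions: a curve rather than a surface patch, a clamped knot vector, and hypothetical function names; it does not reproduce the fragment program described in the abstract.

```python
import math


def bspline_basis(i, p, u, knots):
    """Cox-de Boor recursion: value of the i-th degree-p basis function at u."""
    if p == 0:
        return 1.0 if knots[i] <= u < knots[i + 1] else 0.0
    value = 0.0
    span = knots[i + p] - knots[i]
    if span > 0.0:  # skip terms with zero-length spans (repeated knots)
        value += (u - knots[i]) / span * bspline_basis(i, p - 1, u, knots)
    span = knots[i + p + 1] - knots[i + 1]
    if span > 0.0:
        value += (knots[i + p + 1] - u) / span * bspline_basis(i + 1, p - 1, u, knots)
    return value


def nurbs_curve_point(u, degree, control_points, weights, knots):
    """Evaluate a NURBS curve at parameter u (clamped knot vector assumed)."""
    if u >= knots[-1]:
        # Basis functions use half-open spans; a clamped curve ends exactly
        # at its last control point.
        return tuple(float(c) for c in control_points[-1])
    numerator = [0.0] * len(control_points[0])
    denominator = 0.0
    for i, (point, weight) in enumerate(zip(control_points, weights)):
        rational = bspline_basis(i, degree, u, knots) * weight
        denominator += rational
        for k, coord in enumerate(point):
            numerator[k] += rational * coord
    return tuple(c / denominator for c in numerator)
```

A quick check is the quadratic quarter circle with control points (1, 0), (1, 1), (0, 1), weights 1, √2/2, 1 and knots 0, 0, 0, 1, 1, 1: every evaluated point should lie on the unit circle. A surface patch applies the same basis functions in two parameter directions.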
Dan Negrut: Rigid Body Collision Detection (Hammad Mazhar, Dan Negrut)
This work concentrates on the issue of rigid body collision detection, a critical component of any software package employed to approximate the dynamics of multibody systems with frictional contact. This paper presents a scalable collision detection algorithm designed for massively parallel computing architectures. The approach proposed is implemented on a ubiquitous Graphics Processing Unit (GPU) card and shown to achieve a 180x speedup over state-of-the-art Central Processing Unit (CPU) implementations when handling multimillion-object collision detection. GPUs are composed of many (on the order of hundreds) scalar processors that can simultaneously execute an operation; this strength is leveraged in the proposed algorithm. The approach can detect collisions between five million objects in less than two seconds; with newer GPUs, the capability of detecting collisions between eighty million objects in less than thirty seconds is expected. The proposed methodology is expected to have an impact on a wide range of granular flow dynamics and smoothed particle hydrodynamics applications, e.g., sand, gravel, and fluid simulations, where the number of contacts can reach into the hundreds of millions.
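One common way to make collision detection scale is to bin objects into a uniform grid so that only objects sharing a cell are tested against each other. The CPU sketch below shows that idea for spheres; the abstract does not spell out the algorithm used, so the binning scheme, the `cell_size` parameter, and the name `find_collisions` are illustrative assumptions. In a GPU version, the binning and the per-cell pair tests would each be spread across many threads.

```python
import math
from collections import defaultdict
from itertools import combinations, product


def find_collisions(spheres, cell_size):
    """Return sorted (i, j) index pairs, i < j, of overlapping or touching spheres.

    spheres is a list of (x, y, z, radius) tuples.
    """
    # Broad phase: register each sphere in every grid cell its bounding box touches.
    bins = defaultdict(list)
    for index, (x, y, z, r) in enumerate(spheres):
        lo = [math.floor((c - r) / cell_size) for c in (x, y, z)]
        hi = [math.floor((c + r) / cell_size) for c in (x, y, z)]
        for cell in product(*(range(a, b + 1) for a, b in zip(lo, hi))):
            bins[cell].append(index)

    # Narrow phase: exact sphere-sphere test, only within each cell.
    contacts = set()  # a set removes pairs found in more than one shared cell
    for members in bins.values():
        for i, j in combinations(members, 2):
            xi, yi, zi, ri = spheres[i]
            xj, yj, zj, rj = spheres[j]
            dist_sq = (xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2
            if dist_sq <= (ri + rj) ** 2:
                contacts.add((i, j))
    return sorted(contacts)
```

Choosing `cell_size` close to the typical object diameter keeps the number of cells per object small while still limiting how many candidates share a cell.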
Software
Follow Step 1 and Step 2 below to tap into the GPU computational resources of your computer.
After these two steps, if the graphics card in your computer is CUDA-compatible, you can immediately start running demos on your GPU. If your computer is not CUDA-compatible, you’ll only be able to run CUDA in emulation mode (the CPU will emulate the behavior of the GPU; this is very slow and useful for debugging only). A list of CUDA-enabled products is available here: CUDA GPUs.
Step 1: Install Visual Studio 2008 Express and Windows SDK
In order to compile and run the code samples in the SDK, Microsoft Visual Studio 2008 needs to be installed.
Note: For those who have purchased Visual Studio 2008 Professional Edition, the Windows SDK does NOT need to be installed.
Step 2: Install CUDA
A CUDA-capable GPU is required. A list of CUDA-enabled products is available here: CUDA GPUs
Download and install in this order: the CUDA Driver, the CUDA Toolkit, and the CUDA SDK (links provided below, according to your Windows distribution).
Organizers
Sara McMains
University of California, Berkeley
Roshan D’Souza
Michigan Tech University, Houghton
Krishnan Suresh
University of Wisconsin, Madison
Dan Negrut
University of Wisconsin, Madison