University of Illinois at Urbana-Champaign, National Center for Supercomputing Applications, Urbana, IL
Harvard University, Cambridge, MA
Michigan State University, Institute for Cyber Enabled Research, East Lansing, MI
Pittsburgh Supercomputing Center, Pittsburgh, PA
Pennsylvania State University, State College, PA
Rutgers University, Piscataway, NJ
University of California Los Angeles, Los Angeles, CA
University of Oklahoma, Norman, OK
University of South Carolina, Columbia, SC
University of Tennessee Knoxville, Knoxville, TN
University of Utah, Salt Lake City, UT
Vanderbilt University, Nashville, TN
Washington University in St. Louis, St. Louis, MO
August 13–17, 2012
Studying many current GPU computing applications, we have learned that the limits of an application's scalability are often related to some combination of memory bandwidth saturation, memory contention, imbalanced data distribution, or data structure/algorithm interactions. Successful GPU application developers often reshape their data structures and problem formulations specifically for massive threading, and organize their threads to leverage shared on-chip memory resources for greater impact. We looked for patterns among those transformations, and here present the seven most common and crucial algorithm and data-optimization techniques we discovered. Each can improve the performance of applicable kernels by 2-10X on current processors while improving future scalability.
NOTE: Students are required to provide their own laptops.