Early Evaluation of Directive-Based GPU Programming Models for Productive Exascale Computing
Authors: Seyong Lee, Jeffrey S. Vetter (Georgia Tech and Oak Ridge National Laboratory)
ABSTRACT: Graphics Processing Unit (GPU)-based parallel computer architectures have shown increased popularity as a building block for high performance computing, and possibly for future Exascale computing. However, their programming complexity remains a major hurdle to their widespread adoption. To provide better abstractions for programming GPU architectures, researchers and vendors have proposed several directive-based GPU programming models. These directive-based models provide different levels of abstraction, and require different levels of programming effort to port and optimize applications. Understanding the differences among these new models provides valuable insights into their applicability and performance potential. In this paper, we evaluate existing directive-based models by porting thirteen application kernels from various scientific domains to use CUDA GPUs, which, in turn, allows us to identify important issues in the functionality, scalability, tunability, and debuggability of the existing models. Our evaluation shows that directive-based models can achieve reasonable performance, compared to hand-written GPU codes.
Tuesday, Nov. 13, 1:30-2 p.m.
Room: 355-EF
Efficient Backprojection-Based Synthetic Aperture Radar Computation with Many-Core Processors (Finalist: Best Paper Award)
Authors: Jongsoo Park, Ping Tak Peter Tang, Mikhail Smelyanskiy, Daehyun Kim, Thomas Benson (Georgia Tech)
ABSTRACT: Tackling computationally challenging problems with high efficiency often requires the combination of algorithmic innovation, advanced architecture, and thorough exploitation of parallelism. We demonstrate this synergy through synthetic aperture radar (SAR) via backprojection, an image reconstruction method that can require hundreds of TFLOPS. Computation cost is significantly reduced by our new algorithm of approximate strength reduction; data movement cost is economized by software locality optimizations facilitated by advanced architectural support; parallelism is fully harnessed in various patterns and granularities. We deliver a throughput of over 35 billion backprojections per second per compute node on a Sandy Bridge-based cluster equipped with Intel Knights Corner coprocessors. This corresponds to processing a 3K×3K image within a second using a single node. Our study can be extended to other settings: backprojection is applicable elsewhere, including medical imaging; approximate strength reduction is a general code transformation technique; and many-core processors are emerging as a solution to energy-efficient computing.
Tuesday, Nov. 13, 2:30-3 p.m.
Room: 255-BC
Classifying Soft Error Vulnerabilities in Extreme-Scale Scientific Applications Using a Binary Instrumentation Tool
Authors: Dong Li, Jeffrey Vetter (Georgia Tech and Oak Ridge National Laboratory), Weikuan Yu
ABSTRACT: Extreme-scale scientific applications are at significant risk of being hit by soft errors on future supercomputers. To better understand soft error vulnerabilities in scientific applications, we have built an empirical fault injection and consequence analysis tool – BIFIT – to evaluate how soft errors impact applications. BIFIT is designed with the capability to inject faults at specific targets: execution point and data structure. We apply BIFIT to three scientific applications and investigate their vulnerability to soft errors. We classify each application’s individual data structures in terms of their vulnerabilities, and generalize these classifications. Our study reveals that these scientific applications have a wide range of sensitivities to both the time and the location of a soft error. Yet, we are able to identify relationships between vulnerabilities and classes of data structures. These classifications can be used to apply appropriate resiliency solutions to each data structure within an application.
Wednesday, Nov. 14, 1:30-2 p.m.
Room: 255-EF
Optimizing the Computation of N-Point Correlations on Large-Scale Astronomical Data
Authors: William B. March, Kenneth Czechowski, Marat Dukhan, Thomas Benson, Dongryeol Lee, Richard Vuduc, Edmond Chow, Alexander G. Gray (Georgia Tech), Andrew J. Connolly
ABSTRACT: The n-point correlation functions (npcf) are powerful statistics that are widely used for data analyses in astronomy and other fields. These statistics have played a crucial role in fundamental physical breakthroughs, including the discovery of dark energy. Unfortunately, directly computing the npcf at a single value requires $O(N^n)$ time for $N$ points and values of $n$ of 2, 3, 4, or even larger. Astronomical data sets can contain billions of points, and the next generation of surveys will generate terabytes of data per night. To meet these computational demands, we present a highly-tuned npcf computation code that shows an order-of-magnitude speedup over the current state of the art. This enables a much larger 3-point correlation computation on the galaxy distribution than was previously possible. We show a detailed performance evaluation on many different architectures.
Thursday, Nov. 15, 11-11:30 a.m.
Room: 255-EF
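To make the $O(N^n)$ cost quoted in the n-point correlation abstract above concrete, here is a minimal sketch of the direct (brute-force) computation for n = 2 and n = 3. It is not the authors' tuned code. The function names, the distance bin [r_lo, r_hi), and the sample points are all assumptions chosen for illustration; the only point is that the work grows with the number of n-tuples.

```python
from itertools import combinations
from math import dist


def two_point_count(points, r_lo, r_hi):
    """Count unordered pairs whose separation lies in [r_lo, r_hi).

    Visits every pair once, so the cost is O(N^2).
    """
    count = 0
    for p, q in combinations(points, 2):
        if r_lo <= dist(p, q) < r_hi:
            count += 1
    return count


def three_point_count(points, r_lo, r_hi):
    """Count unordered triples whose three pairwise separations all lie
    in [r_lo, r_hi).

    Visits every triple once, so the cost is O(N^3).
    """
    count = 0
    for p, q, s in combinations(points, 3):
        sides = ((p, q), (q, s), (p, s))
        if all(r_lo <= dist(a, b) < r_hi for a, b in sides):
            count += 1
    return count


# Hypothetical sample data: three nearby points and one distant outlier.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
pairs = two_point_count(points, 0.5, 1.5)
triples = three_point_count(points, 0.5, 1.5)
print(pairs, triples)
```

With billions of points, loops of this shape are infeasible, which is the motivation the abstract gives for a highly tuned implementation.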
Aspen – A Domain Specific Language for Performance Modeling
Authors: Kyle L. Spafford, Jeffrey S. Vetter (Georgia Tech and Oak Ridge National Laboratory)
ABSTRACT: We present a new approach to analytical performance modeling using Aspen, a domain specific language. Aspen (Abstract Scalable Performance Engineering Notation) fills an important gap in existing performance modeling techniques and is designed to enable rapid exploration of new algorithms and architectures. It includes a formal specification of an application’s performance behavior and an abstract machine model. We provide an overview of Aspen’s features and demonstrate how it can be used to express a performance model for a three dimensional Fast Fourier Transform. We then demonstrate the composability and modularity of Aspen by importing and reusing the FFT model in a molecular dynamics model. We have also created a number of tools that allow scientists to balance application and system factors quickly and accurately.
Thursday, Nov. 15, 11:30 a.m. – 12 p.m.
Room: 355-EF
Cray Cascade – A Scalable HPC System Based on a Dragonfly Network
Session Chair: Jeffrey Vetter (Georgia Tech and Oak Ridge National Laboratory)
Authors: Gregory Faanes, Abdulla Bataineh, Duncan Roweth, Tom Court, Edwin Froese, Bob Alverson, Tim Johnson, Joe Kopnick, Michael Higgins, James Reinhard
ABSTRACT: Higher global bandwidth requirements for many applications and lower network cost have motivated the use of the Dragonfly network topology for high performance computing systems. In this paper we present the architecture of the Cray Cascade system, a distributed memory system based on the Dragonfly network topology. We describe the structure of the system, its Dragonfly network, the routing algorithms, and a set of advanced features supporting both mainstream high performance computing applications and emerging global address space programming models. With a combination of performance results from prototype systems and simulation data for large systems, we demonstrate the value of the Dragonfly topology and the benefits obtained through extensive use of adaptive routing.
Thursday, Nov. 15, 3:30-4 p.m.
Room: 255-BC
GRAPE-8 – An Accelerator for Gravitational N-Body Simulation with 20.5GFLOPS/W Performance
Session Chair: Jeffrey Vetter (Georgia Tech and Oak Ridge National Laboratory)
Authors: Junichiro Makino, Hiroshi Daisaka
ABSTRACT: In this paper, we describe the design and performance of the GRAPE-8 accelerator processor for gravitational N-body simulations. It is designed to evaluate gravitational interaction with cutoff between particles. The cutoff function is useful for schemes like TreePM or Particle-Particle Particle-Tree, in which gravitational force is divided into short-range and long-range components. A single GRAPE-8 processor chip integrates 48 pipeline processors. The effective number of floating-point operations per interaction is around 40. Thus the peak performance of a single GRAPE-8 processor chip is 480 Gflops. A GRAPE-8 processor card houses two GRAPE-8 chips and one FPGA chip for PCI-Express interface. The total power consumption of the board is 46W. Thus, theoretical peak performance per watt is 20.5 Gflops/W. The effective performance of the total system, including the host computer, is around 5Gflops/W. This is more than a factor of two higher than the highest number in the current Green500 list.
Thursday, Nov. 15, 4-4:30 p.m.
Room: 255-BC
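The GRAPE-8 abstract above chains several figures together, and the arithmetic can be sketched in a few lines. This is a back-of-the-envelope check using only the numbers quoted in the abstract; the variable names are illustrative, and the assumption that every pipeline contributes equally to the chip peak is ours, not the authors'.

```python
# Figures quoted in the abstract.
pipelines_per_chip = 48
flops_per_interaction = 40      # "around 40" effective operations
chip_peak_gflops = 480.0
chips_per_card = 2
board_power_w = 46.0

# Interaction rate per pipeline implied by the quoted chip peak,
# in billions of interactions per second (assumes equal pipelines).
implied_rate = chip_peak_gflops / (pipelines_per_chip * flops_per_interaction)

# Card-level peak and a naive peak-per-watt estimate.
card_peak = chips_per_card * chip_peak_gflops
naive_gflops_per_w = card_peak / board_power_w

print(f"{implied_rate:.2f} G interactions/s per pipeline")
print(f"{card_peak:.0f} Gflops per card")
print(f"{naive_gflops_per_w:.1f} Gflops/W (naive estimate)")
```

The naive estimate comes out at about 20.9 Gflops/W, close to but not identical with the 20.5 Gflops/W quoted; the abstract does not give enough detail to account for the difference.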
SGI UV2 – A Fused Computation and Data Analysis Machine
Session Chair: Jeffrey Vetter (Georgia Tech and Oak Ridge National Laboratory)
Authors: Gregory M. Thorson, Michael Woodacre
ABSTRACT: UV2 is SGI’s 2nd generation Data Fusion system. UV2 was designed to meet the latest challenges facing users in computation and data analysis. Its unique ability to perform both functions on a single platform enables efficient, easy to manage workflows. This platform has a hybrid infrastructure, leveraging the latest Intel EP processors to provide industry leading computation. Due to its high bandwidth, extremely low latency NumaLink6 interconnect, plus vectorized synchronization and data movement, UV2 provides industry leading data intensive capability. It supports a single operating system (OS) image up to 64TB and 4K threads. Multiple OS images can be deployed on a single NL6 fabric, which has a single flat address space up to 8PB and 256K threads. These capabilities allow for extreme performance on a broad range of programming models and languages including: OpenMP, MPI, UPC, CAF, and SHMEM. The architecture, implementation, and performance are detailed.
Thursday, Nov. 15, 4:30-5 p.m.
Room: 255-BC