Cluster 2009 Tutorials

Attendance
Tutorials are open to all registered attendees on a space-available basis. Attendees who register for the conference by 29 July (the early registration deadline) will be asked to identify any tutorials they wish to attend. Tutorial spaces will be filled on a first-come, first-served basis, up to the capacity of the tutorial venue.

Sessions
There will be a total of eight half-day tutorials, held in two parallel sessions on Monday, 31 August and Friday, 4 September (the day before and the day after the conference technical sessions). A 30-minute coffee break will be available each morning and afternoon in an adjacent common area, approximately midway through each session.

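Monday am tutorials:
Performance Analysis and Optimization with Open|SpeedShop
Developing Scientific Applications Using Eclipse and the Parallel Tools Platform

Monday pm tutorials:
Parallel Distributed-Memory Visualization with ParaView
Programming using the Partitioned Global Address Space (PGAS) Model

Friday am tutorials:
High Performance Computing with CUDA
Parallel Programming Using the Global Arrays Toolkit

Friday pm tutorials:
Designing High-End Computing Systems with InfiniBand and 10-Gigabit Ethernet
Hybrid Parallel Programming and Performance Optimization on a Multi-core, Multi-socket Cluster System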

Room Layout
Tutorial rooms will be set up "classroom" style, with a projection screen and presenter's table at the front, and tables for attendees facing the presenter. Power strips will be available for attendee laptops. Wireless Internet access will be available in the tutorial rooms for all participants.

Presentation Materials
To minimize cost and resource usage, the conference does not plan to produce paper copies of the presentation materials for distribution to attendees. Instead, the conference will post tutorial materials on the conference website about a week before the conference. Attendees are expected to download the materials, either in advance or at the tutorial.

Details of Tutorials

Monday, 31 August 2009, 8:00 - 12:00

Performance Analysis and Optimization with Open|SpeedShop (Jim Galarowicz, Don Maghrak - Krell Institute; Martin Schulz - Lawrence Livermore National Laboratory; David Montoya - Los Alamos National Laboratory; Scott Cranford - Sandia National Laboratories)
(slides)

Performance analysis is an essential step in the development cycle of HPC codes targeting large-scale infrastructures such as those used by many of the attendees of Cluster 2009. In this tutorial we will introduce how programmers can approach this important topic and how they can analyze the performance of their codes using the comprehensive open source tool set Open|SpeedShop. Open|SpeedShop is being developed and made available for a wide range of cluster architectures through a close collaboration between the Krell Institute and DOE/NNSA’s Tri-Labs (Lawrence Livermore, Los Alamos, and Sandia National Laboratories).

In this tutorial we will not only introduce attendees to Open|SpeedShop and its broad functionality, but also focus directly on how they can use Open|SpeedShop’s extensive set of performance experiments to understand, step by step, the performance characteristics of their codes. We will cover both node-local performance (by studying global profiles, stack trace sampling, hardware counters, and I/O properties) and parallel performance (using a combination of tracing experiments and advanced analysis techniques). The latter will cover MPI applications as well as threaded codes.

Developing Scientific Applications Using Eclipse and the Parallel Tools Platform (Beth Tibbitts, Greg Watson - IBM; Jay Alameda - NCSA; Jeff Overbey - University of Illinois at Urbana-Champaign)
(slides)

The Eclipse Parallel Tools Platform (PTP) is an open-source Eclipse Foundation project (http://eclipse.org/ptp) for parallel scientific application development. The application development workbench for the NCSA Blue Waters petascale system is based on Eclipse and PTP.

Eclipse offers the features expected of a commercial-quality integrated development environment: a syntax-highlighting editor, a source-level debugger, revision control, code refactoring, and support for multiple languages, including C, C++, and UPC. PTP extends Eclipse to provide additional tools and support for the development of parallel scientific applications across a wide range of parallel systems. This includes support for Fortran and the ability to access source code and to build, launch, and debug applications on machines that are physically remote from the user’s development environment. PTP also provides runtime system and job monitoring, a scalable parallel debugger, and tools to aid the development of parallel programs.

This tutorial will cover the key features provided by Eclipse and PTP for the development of parallel scientific applications. The tutorial emphasizes C/Fortran MPI applications, but features that support other languages and programming models, such as OpenMP and UPC, will also be described.

Monday, 31 August 2009, 13:00 - 17:00

Parallel Distributed-Memory Visualization with ParaView (Kenneth Moreland - Sandia National Laboratories; David E DeMarle - Kitware, Inc.)
(slides)

ParaView is a powerful open-source turnkey application for analyzing and visualizing large data sets in parallel. ParaView is regularly used by Sandia National Laboratories analysts to visualize simulations run on the Red Storm and ASC Purple supercomputers, and by thousands of other users worldwide. Designed to be configurable, extensible, and scalable, ParaView is built upon the Visualization Toolkit (VTK) to allow rapid deployment of visualization components. ParaView’s client-server architecture allows it to seamlessly transition from computing on a desktop or laptop to interactive processing on clusters of any size. This tutorial presents the architecture of ParaView and the fundamentals of parallel visualization. Attendees will learn the basics of using ParaView for scientific visualization. The tutorial features guidance on compiling, installing, and using ParaView on typical cluster computers.

Programming using the Partitioned Global Address Space (PGAS) Model (Tarek El-Ghazawi - George Washington University; Vijay Saraswat - IBM)
(slides)

The Partitioned Global Address Space (PGAS) programming model provides ease of use through a global shared address space while emphasizing performance through locality awareness. Over the past several years, the PGAS model has been gaining attention. A number of PGAS languages are now ubiquitous, such as UPC, which runs on most high-performance computers. The DARPA HPCS program has also resulted in promising new PGAS languages, such as X10. In this tutorial we will discuss the fundamentals of parallel programming models and will focus on the concepts and issues associated with the PGAS model. We will follow with an in-depth introduction to two PGAS languages, UPC and X10. We will start with basic concepts, syntax, and semantics, and will cover a range of issues from data distribution and locality exploitation to advanced topics such as synchronization, memory consistency, and performance optimizations. Application examples will also be shared.
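The idea of locality awareness within a global address space can be made concrete with a small sketch. It is written in Python rather than in a PGAS language, and it only mimics the block-cyclic layout rule that UPC applies to shared arrays: with block size B and T threads, element i has affinity to thread (i // B) % T. The names `owner`, `local_indices`, `block_size`, and `num_threads` are illustrative assumptions, not part of UPC or X10.

```python
def owner(i, block_size, num_threads):
    """Thread that element i has affinity to under a block-cyclic layout."""
    return (i // block_size) % num_threads


def local_indices(thread, n, block_size, num_threads):
    """Global indices of an n-element array that are local to `thread`."""
    return [i for i in range(n) if owner(i, block_size, num_threads) == thread]


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    n, block_size, num_threads = 12, 2, 3
    for t in range(num_threads):
        print(t, local_indices(t, n, block_size, num_threads))
```

Every thread can name any global index, but a loop that touches only its own `local_indices` avoids remote accesses, which is the kind of locality exploitation listed among the tutorial topics.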

Friday, 4 September 2009, 8:00 - 12:00

High Performance Computing with CUDA (Massimiliano Fatica, Patrick LeGresley - NVIDIA; Jim Phillips - Beckman Institute, University of Illinois at Urbana-Champaign)
(slides)

NVIDIA's CUDA is a general-purpose, scalable parallel programming model for writing highly parallel applications. It provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded manycore GPUs and scales transparently to hundreds of cores. Scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes.

This tutorial will present CUDA and discuss its advanced use in science and engineering domains. After introducing CUDA programming and the execution and memory models, we will motivate the use of CUDA with many brief examples from different HPC domains. We will then discuss cluster deployments and advanced issues, and include real-world case studies from computational biology and computational fluid dynamics.

Parallel Programming Using the Global Arrays Toolkit (Bruce Palmer, Manojkumar Krishnan, Sriram Krishnamoorthy - Pacific Northwest National Laboratory; P Sadayappan - The Ohio State University)
(slides)

This tutorial provides an overview of the Global Arrays (GA) programming toolkit and describes its capabilities, performance, and use in high performance computing applications. GA will also be compared with other global address space models such as UPC and Co-Array Fortran. GA was created to provide programmers with an interface that allows them to distribute data while maintaining a global index space and a programming syntax similar to that of serial programs. The goal of GA is to free programmers from the low-level management of communication and allow them to deal with their problems in the same index space in which they were originally formulated. At the same time, the compatibility of GA with MPI enables programmers to take advantage of existing MPI software when appropriate. The variety of existing GA applications attests to the attractiveness of using higher-level abstractions to write parallel code.
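What "distributing data while maintaining the global index space" means can be sketched in a few lines. The following is a toy Python model, not the GA API: the class `ToyGlobalArray` and its methods `locate`, `get`, and `put` are hypothetical names, the "processes" are plain Python lists, and a simple contiguous block distribution is assumed.

```python
class ToyGlobalArray:
    """1-D array split into contiguous blocks, addressed by global index."""

    def __init__(self, values, num_procs):
        self.n = len(values)
        # Ceiling division so every element lands in some block.
        self.block = max(1, -(-self.n // num_procs))
        # blocks[p] stands in for the memory owned by process p.
        self.blocks = [list(values[p * self.block:(p + 1) * self.block])
                       for p in range(num_procs)]

    def locate(self, i):
        """Translate global index i into (owning process, local offset)."""
        if not 0 <= i < self.n:
            raise IndexError(i)
        return divmod(i, self.block)

    def get(self, lo, hi):
        """Copy the global range [lo, hi), whichever blocks hold it."""
        out = []
        for i in range(lo, hi):
            p, off = self.locate(i)
            out.append(self.blocks[p][off])
        return out

    def put(self, lo, data):
        """Write data starting at global index lo."""
        for k, v in enumerate(data):
            p, off = self.locate(lo + k)
            self.blocks[p][off] = v
```

The caller of `get` and `put` works only with global indices, as in a serial program; the translation to an owner and a local offset, which would involve communication on a real machine, stays inside the class.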

Friday, 4 September 2009, 13:00 - 17:00

Designing High-End Computing Systems with InfiniBand and 10-Gigabit Ethernet (D. K. Panda, M. Koop - The Ohio State University; P. Balaji - Argonne National Laboratory)
(slides)

InfiniBand (IB) and 10-Gigabit Ethernet (10GE) architectures are generating a lot of excitement for building next-generation High Performance Computing (HPC) systems and enterprise datacenters. This tutorial will provide an overview of these emerging interconnect architectures, the features they offer, their current market standing, and their suitability for prime-time HPC. It will start with a brief overview of IB, 10GE, and their architectural features. An overview of the emerging OpenFabrics stack, which encapsulates both IB and 10GE in a unified manner, will be presented. IB and 10GE hardware/software solutions and market trends will be highlighted. Finally, sample performance numbers will be shown for these technologies in different environments, such as MPI, Sockets, Parallel File Systems, Multi-tier Datacenters, and Virtual Machines.

Hybrid Parallel Programming and Performance Optimization on a Multi-core, Multi-socket Cluster System (Byoung-Do Kim, John Cazes - Texas Advanced Computing Center (TACC), University of Texas at Austin)
(slides)

As we enter the multi-core era, application developers bear an ever-increasing responsibility for understanding system configuration and exploiting advances in hardware. There is also growing concern that a wide range of large-scale scientific applications commonly attain only a low fraction of peak performance on modern systems. Ranger, the NSF petascale system at TACC, is based on quad-core AMD processors in quad-socket blades from Sun Microsystems, with a total of 62,976 cores in the system. This multi-core, multi-socket node configuration in a large-scale system is typical of modern HPC systems, and it poses new challenges in effectively designing and developing large-scale applications.

This tutorial presents the basic principles of parallel programming in MPI/OpenMP hybrid mode for users who wish to migrate to large multi-core, multi-socket systems like Ranger. Practical approaches to multi-core optimization for scientific applications will also be presented by introducing NUMA control tools for mapping tasks and memory to specific cores and sockets. Case studies of the NPB-MZ benchmark suite on intra- and inter-node scalability will be demonstrated to give the audience insight into the characteristics of a large-scale application on the system.
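The node geometry described for Ranger, and the idea of mapping tasks to specific cores and sockets, can be illustrated with a short Python sketch. The socket, core, and total-core figures come from the text above; the node count is simply what those figures imply. The function `hybrid_layout` and its fill-cores-in-order placement policy are hypothetical and do not describe the behavior of any particular NUMA control tool.

```python
SOCKETS_PER_NODE = 4      # quad-socket blades (from the text)
CORES_PER_SOCKET = 4      # quad-core processors (from the text)
TOTAL_CORES = 62_976      # system total (from the text)

CORES_PER_NODE = SOCKETS_PER_NODE * CORES_PER_SOCKET
NODES = TOTAL_CORES // CORES_PER_NODE   # node count implied by the figures


def hybrid_layout(tasks_per_node, threads_per_task):
    """Assign each (task, thread) pair on one node to a (socket, core) slot.

    Cores are filled in order, so with 4 tasks x 4 threads all threads of
    a task share one socket and therefore that socket's local memory.
    """
    if tasks_per_node * threads_per_task > CORES_PER_NODE:
        raise ValueError("more threads than cores on the node")
    layout = {}
    for task in range(tasks_per_node):
        for thread in range(threads_per_task):
            core_id = task * threads_per_task + thread
            layout[(task, thread)] = divmod(core_id, CORES_PER_SOCKET)
    return layout
```

With one MPI task per socket and four OpenMP threads per task, `hybrid_layout(4, 4)` keeps each task within a single socket, whereas `hybrid_layout(2, 8)` makes each task span two sockets, so some of its threads would access memory attached to a different socket.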
