Ahmad
Faraj, Florida
State University
STAR-MPI:
Self Tuned Adaptive Routines for MPI Collective Operations
Message
Passing Interface (MPI) collective communication routines are widely used in
parallel applications. In order for a collective communication routine to
achieve high performance for different applications on different platforms, it
must be adaptable to both the system architecture and the application workload.
Current MPI implementations do not support such software adaptability and are
not able to achieve high performance on many platforms. In this paper, we
present STAR-MPI (Self Tuned Adaptive Routines for MPI collective operations),
a set of MPI collective communication routines that are capable of adapting to
system architecture and application workload. For each operation, STAR-MPI
maintains a set of communication algorithms that can potentially be efficient
at different situations. As an application executes, a STAR-MPI routine applies
the Automatic Empirical Optimization of Software (AEOS) technique at run time
to dynamically select the best performing algorithm for the application on the
platform. We describe the techniques used in STAR-MPI, analyze STAR-MPI
overheads, and evaluate the performance of STAR-MPI with applications and
benchmarks. The results of our study indicate that STAR-MPI is robust and
efficient. It is able to find efficient algorithms with reasonable overheads,
and it out-performs traditional MPI implementations to a large degree in many
cases.
Rajesh
Vivekanandham,
Indian Institute of Science
A
Scalable Low Power Issue Queue for Large Instruction Window Out of Order
Processors
Large
instruction windows and issue queues are key to exploiting greater instruction
level parallelism in out-of-order superscalar processors. However, the cycle
time and energy consumption of conventional monolithic issue queues are high.
Previous efforts to reduce cycle time segment the issue queue and pipeline
wakeup. Unfortunately, this results in significant IPC loss. Other proposals
address energy efficiency by avoiding only the unnecessary tag comparisons.
To address
both these issues more efficiently, we propose the Scalable Low power Issue
Queue (SLIQ). SLIQ augments a pipelined issue queue with direct indexing to
mitigate the problem of delayed wakeups while reducing the cycle time. Also,
the SLIQ design naturally leads to significant energy savings by reducing both
the number of tag broadcasts and comparisons required.
A 2 segment
SLIQ incurs an average IPC loss of 0.2% over the entire SPEC CPU2000 suite,
while achieving a 25.2% reduction in issue latency when compared to an ideal
monolithic 128-entry issue queue for an 8-wide superscalar processor. An 8
segment SLIQ improves scalability by reducing the issue latency by 38.3% while
incurring an IPC loss of only 2.3%. Further, the 8 segment SLIQ reduces the
energy consumption and energy-delay product by 48.3% and 67.4% respectively
Greg
Bronevetsky, Cornell
University
Experimental
Evaluation of Application-level Checkpointing for OpenMP Programs
It is
becoming important for long-running scientific applications to tolerate
hardware faults. The most commonly used approach is checkpoint and restart
(CPR) - the state of the computation is saved periodically to disk, and when a
failure occurs, the computation is restarted from the last saved state. One
common way of doing this, called System-level Checkpointing (SLC), requires
modifying the Operating System and the communication libraries to permit the
saving of the state of the entire parallel application. Unfortunately, this
approach has poor portability since a checkpointer for one system rarely works
on a different system. The only portable alternative is Application-level
Checkpointing (ALC), where the programmer manually modifies their program to
enable CPR, a very labor-intensive task.
We are
investigating the use of compiler technology to instrument codes to embed the
ability to tolerate faults into applications themselves, making them
self-checkpointing and self-restarting on any platform.
In~cite{bronevetsky+:asplos04} we described a general approach for
checkpointing shared memory APIs at the application level.
Since~cite{bronevetsky+:asplos04} applied to only a toy feature set common to
most shared memory APIs, this paper shows the practicality of this approach by
extending it to a specific popular shared memory API: OpenMP. We describe the
challenges involved in providing automated ALC for OpenMP applications and
experimentally validate this approach by showing detailed performance results
for our implementation of this technique. Our experiments with the NAS OpenMP
benchmarks~cite{npb-openmp} and the EPCC
microbenchmarks~cite{epcc-microbenchmarks} show generally low overhead on three
different architectures: Linux/IA64, Tru64/Alpha and Solaris/Sparc and
highlight important lessons about the performance characteristics of this
approach.
Jonathan
Weinberg, University
of California, San Diego
User-Guided
Symbiotic Space-Sharing of Real Workloads
Symbiotic
space-sharing is a technique that improves system throughput by executing
parallel applications in combinations and configurations that alleviate
pressure on shared resources. We have shown prototype schedulers that leverage
such techniques to improve throughput by 20% over conventional space-sharing
schedulers when resource bottlenecks are known. Such evaluations have utilized
benchmark workloads and proposed that schedulers be informed of resource
bottlenecks by users at job submission time; in this work, we investigate the
accuracy with which users can actually identify resource bottlenecks in real
applications and the implications of these predictions for symbiotically
space-sharing production workloads. Using a large HPC platform, a
representative application workload, and a sampling of expert users, we show
that user inputs are of value and that for our chosen workload, user-guided
symbiotic scheduling can improve throughput over conventional space-sharing by
15-22%.
Dean
Hildebrand, University
of Michigan
Large
Files, Small Writes, and pNFS
Workload
characterization studies highlight the prevalence of small and sequential data
requests in scientific applications. Parallel file systems excel at large data
transfers but sometimes at the expense of small I/O performance. pNFS is an
NFSv4.1 high-performance enhancement that provides direct storage access to
parallel file systems while preserving NFSv4 operating system and hardware
platform independence. This paper demonstrates that distributed file systems
can increase write throughput to parallel data storesÑregardless of file
sizeÑby overcoming parallel file system inefficiencies. We also show how pNFS
can improve the overall write performance of parallel file systems by using
direct, parallel I/O for large write requests and a distributed file system for
small write requests. We describe our pNFS prototype and present experiments
demonstrating the performance improvements.
Shinji
Sumimoto, Fujitsu
Laboratories
Scalable
Communication Layer for Multi-Dimensional Hyper Crossbar Network Using Multiple
Gigabit Ethernet
This paper
proposes the scalable communication layer for a multi-dimensional hyper
crossbar network using multiple Gigabit Ethernet for the PACS-CS system which
consists of 2560 single-processor nodes and a 16 x 16 x 10 three dimensional
hyper-crossbar network (3D-HXB). To realize a high performance communication
layer using multiple existing Ethernet networks, the host processor usage for
the communication processing must be minimized. To overcome this problem, we
have developed the PM/Ethernet-HXB communication facility. PM/Ethernet-HXB
realizes communication protocol processing without mutual exclusion even for
Zero-copy communication between the communication buffers of nodes. We have
implemented the PM/Ethernet-HXB on SCore cluster system software, and evaluated
its communication and application performance. PM/Ethernet-HXB achieves a
uni-directional communication bandwidth of 1065 MB/s using nine Gigabit
Ethernet links, a unidirectional communication bandwidth of 741 MB/s (98.8% of
the theoretical performance), a bidirectional bandwidth of 1401 MB/s (93.4% of
the theoretical performance) on the 3D-HXB connections (a total of six Ethernet
links). The results of MPI communication bandwidth are a unidirectional bandwidth
of 960 MB/s and a bidirectional bandwidth of 1008 MB/s using eight links. These
results show that PM/Ethernet-HXB realizes a comparative performance using
multiple Gigabit Ethernet networks to dedicated cluster networks such as
InfiniBand. The speedups of IS and CG Class C NAS parallel benchmarks are
scalable up to using four links on eight node cluster, and performance
degradation between 3D-HXB (2 x 2 x 2) and 1-dimensional network (8 x 1) are
small.
Kyle
Rupnow, University
of Wisconsin Madison
Scientific
Applications vs. SPEC-FP: A Comparison of Program Behavior
Many modern
scientific applications execute on massively parallel collections of
microprocessors. Supercomputers such as the Cray XT3 (Red Storm) and Blue
Gene/L support thousands to tens of thousands of processors per parallel job.
However, individual microprocessor performance remains a critical component of
overall performance. Traditional approaches to improve scientific application
performance concentrate on floating-point (FP) instructions; however, our
studies show that in the scientific applications used at Sandia National Labs,
integer instructions constitute a large and critical part of the instruction
mix. Although the SPEC-FP benchmark suite is considered representative of FP workloads,
it has a much smaller proportion of integer computation instructions than the
Sandia scientific applications, with 22.9% as compared to 36.9%. Integer
instructions in Sandia applications also behave differently than in SPEC-FP.
Integer instruction outputs are reused 8.8x to 13.1x more often in SPEC-FP
benchmarks, and integer dataflow in Sandia applications is more complex than in
the SPEC-FP suite. In this work, we examine common dataflow and usage patterns
of integer instructionsÑinformation essential to develop hardware techniques to
accelerate critical scientific applications. We present statistics for SPEC-FP
and Sandia applications, summarizing integer computation usage and the size,
shape and interface (number of inputs/outputs) of dataflow graphs.
Eduardo
Qui–ones, UPC
Selective
Predicate Prediction for Out-of-Order Processors
If-conversion
transforms control dependencies to data dependencies by using a predication
mechanism. It is useful to eliminate hard-to-predict branches and to reduce the
severe performance impact of branch mispredictions. However, the use of
predicated execution in out-of-order processors has to deal with two problems:
there can be multiple definitions for a single destination register at rename
time, and instructions with a false predicated consume unnecessary resources.
Predicting predicates is an effective approach to address both problems.
However, predicting predicates that come from hard-to-predict branches is not
beneficial in general, because this approach reverses the if-conversion
transformation, loosing its potential benefits. In this paper we propose a new
scheme that dynamically selects which predicates are worthy to be predicted,
and which one are more effective in its if-converted form. We show that our approach
significantly outperforms previous proposed schemes. Moreover it performs
within 5% of an ideal scheme with perfect predicate prediction.
Hakan
Zeffer, Uppsala
University
TMA: A
Trap-Based Memory Architecture
The
advances in semiconductor technology have set the shared-memory server trend
towards processors with multiple cores per die and multiple threads per core.
We believe that this technology shift forces a re-evaluation of how to
interconnect multiple such chips to form larger systems.
This paper
argues that by adding support for coherence traps in future chip
multiprocessors, large-scale server systems can be formed at a much lower cost.
This is due to shorter design time, verification and time to market when
compared to its traditional all-hardware counter part. In the proposed
trap-based memory architecture (TMA), software trap handlers are responsible
for obtaining read/write permission, whereas the coherence trap hardware is
responsible for the actual permission check.
In this
paper we evaluate a TMA implementation (called TMA Lite) with a minimal amount
of hardware extensions, all contained within the processor. The proposed
mechanisms for coherence trap processing should not affect the critical path
and have a negligible cost in terms of area and power for most processor
designs.
Our
evaluation is based on detailed full system simulation using out-of-order
processors with one or two dual-threaded cores per die as processing nodes. The
results show that a TMA based distributed shared memory system can on average
perform within 1 percent of a highly optimized hardware based design.
Jaume
Abella, Intel
& UPC
Heterogeneous
Way-Size Cache
Set-associative
cache architectures are commonly used. These caches consist of a number of
ways, each of the same size. We have observed that the different ways have very
different utilization, which motivates the design of caches with heterogeneous
way sizes. This can potentially result in higher performance for the same area,
better capabilities to implement dynamically adaptive schemes, and more
flexibility for choosing the size of the cache. This paper proposes a novel
cache architecture, Heterogeneous Way-Size cache (HWS cache), in which the
different cache ways may have different sizes. HWS caches are shown to
outperform conventional L1 and L2 caches. For instance, a HWS cache can achieve
up to 20% dynamic and leakage energy savings with respect to its conventional
cache counterpart, while the hit ratio is practically the same. We also present
a Dynamically Adaptive version of the HWS cache (DAHWS cache). DAHWS caches are
shown to be more adaptive than conventional architectures. DAHWS caches achieve
higher energy savings and lower miss rates than conventional caches due to
their higher flexibility. For instance, DAHWS caches reduce the active ratio by
66%, 55% and 41% for L1 instruction, L1 data and L2 caches respectively.
Frank
Mueller, North
Carolina State University
Scalable,
Fault-Tolerant Membership for MPI Tasks on HPC Systems
Reliability
is increasingly becoming a challenge for high-performance computing (HPC)
systems with thousands of nodes, such as IBM's Blue Gene/L. A shorter
mean-time-to-failure can be addressed by adding fault tolerance to reconfigure
working nodes to ensure that communication and computation can progress.
However, existing approaches fall short in providing scalability and small
reconfiguration overhead within the fault-tolerant layer.
This paper
contributes a scalable approach to reconfigure the communication infrastructure
after node failures. We propose a decentralized (peer-to-peer) protocol that
maintains a consistent view of active nodes in the presence of faults. Our
protocol shows response times in the order of hundreds of microseconds and
single-digit milliseconds for reconfiguration using MPI over BlueGene/L and TCP
over Gigabit, respectively. The protocol can be adapted to match the network
topology to further increase performance. We also verify experimental results
against a performance model, which demonstrates the scalability of the
approach. Hence, the membership service is suitable for deployment in the
communication layer of MPI runtime systems, and we have integrated an early
version into LAM/MPI.
Martin
Rinard, MIT
Probabilistic
Accuracy Bounds for Fault-Tolerant Computations that Discard Tasks
We present
a new technique for enabling computations to survive software errors and
hardware failures while providing a bound on any resulting output distortion. A
developer using the technique first partitions the computation into tasks. The
execution platform then simply discards any task that encounters a fault and
completes the computation by executing any remaining tasks. This technique can
substantially improve the robustness of the computation in the face of errors
and failures. A potential concern is that discarding tasks may change the
result that the computation produces.
Our
technique randomly samples executions of the program at varying task failure
rates to obtain a quantitative, probabilistic model that characterizes the
distortion of the output as a function of the task failure rates. By providing
probabilistic bounds on the distortion, the model allows users to confidently
accept results produced by executions with failures as long as the distortion
falls within acceptable bounds. This approach may prove to be especially useful
for enabling computations to successfully survive hardware failures in
distributed computing environments.
Our
technique also produces a timing model that characterizes the execution time as
a function of the task failure rates. The combination of the distortion and
timing models quantifies an accuracy/execution time trade-off. It therefore
enables the development of techniques that purposefully fail tasks to reduce
the execution time while keeping the distortion within acceptable bounds.
Juan
Carlos Moure, University Aut—noma of Barcelona
Wide and
Efficient Prediction using the Local Trace Predictor
High
prediction bandwidth enables performance improvements and power reduction techniques.
This paper explores a mechanism to increase prediction width (instructions per
prediction) by predicting instruction traces. A thorough analysis shows that
predicting traces including multiple branches is not significantly less
accurate than predicting single branches. A novel Local Trace Predictor
organization is proposed, which increases prediction width without reducing the
ratio of prediction accuracy versus memory resources with respect to a Basic
Block Predictor. Compared to the previously proposed Next-Trace Predictor, the
Local Trace Predictor reduces memory requirements by codifying trace
predictions, and by limiting the number of traces starting at the same
instruction to 2 or 4. The limit lessens prediction width only slightly, and
does not affect prediction accuracy. The overall result is that the Local Trace
Predictor outperforms the Next-Trace Predictor for sizes higher than 12 Kbytes.
Hu Chen,
Intel China
Research Center Ltd.
MPIPP:
An Automatic Profile-guided Parallel Processes Placement Toolset in SMP
Clusters and Multiclusters
SMPs
clusters and multiclusters are widely used to execute message-passing parallel
applications. The ways to map parallel processes to processors (or cores) could
affect the application performance significantly due to the non-uniform
communicate cost in such systems. It is desired to have a tool to map parallel
processes to processors (or cores) automatically. Although there have been
various efforts to address this issue, the existing solutions either require
intensive user intervention, or do not able to handle the situation of
multiclusters well. In this paper, we propose a profile-guided approach to find
the optimized mapping automatically to minimize the cost of point-to-point
communications for arbitrary message passing applications. The implemented
toolset is called MPIPP( MPI Process Placement toolset), which includes several
components: 1) A tool to get communication profile of MPI applications; 2) A
tool to get the network topology of target clusters and 3) An algorithm to find
optimized mapping, which is especially more effective than existing graph
partition algorithms for multiclusters. We evaluated the performance of our
tool with the NPB benchmarks and three other applications in several clusters.
Experimental results show that the optimized process placement generated by our
tools can achieve significant speedup.
Montse
Farreras, Universitat Politecnica de Catalunya
Scaling
MPI to short-memory MPPs such as BG/L
Scalability
to large number of processes is one of the weaknesses of current MPI
implementations. Standard implementations are able to scale to hundreds of
nodes, but not beyond. The main problem in these implementations is that they
assume some resources (for both data and control-data) will always be available
to receive/process unexpected messages. As we will show, this is not always
true, especially in short-memory machines like the BG/L that has 64K nodes but
each node only has 512Mbytes of memory.
The
objective of this paper is to present one algorithm that improves the
robustness of MPI implementations for short-memory MPPs, taking care of data
and control-data reception, the system will scale up to any number of nodes .
The proposed solution achieves this goal without any observable overhead when
there are no memory problems. Furthermore, in the worst case, when memory
resources are extremely scarce, the overhead will never double the execution
time (and we should never forget that in this extreme situation, traditional
MPI implementations would fail to execute).
Apan
Qasem, Rice
University
Profitable
Loop Fusion and Tiling Using Model-driven Empirical Search
Loop fusion
and tiling are both recognized as effective transformations for improving
memory performance of scientific applications. However, because of their
sensitivity to the underlying cache architecture and their interaction with
each other it is difficult to determine a good heuristic for applying these
transformations profitably across architectures. In this paper, we present a
model-guided empirical tuning strategy for profitable application of loop
fusion and tiling. Our strategy consists of a detailed cost model that
characterizes the interaction between the two transformations at different
levels of the memory hierarchy. The novelty of our approach is in exposing key
architectural parameters within the model for automatic tuning through
empirical search. Preliminary experiments with a set of applications on four
different platforms show that our strategy achieves significant performance
improvement over fully optimized code generated by state-of-the-art commercial
compilers. The time spent in searching for the best parameters is considerably
less than with other search strategies.
Nicolas
Vasilache, INRIA
Futurs - France
Violated
Dependence Amalysis
The
polyhedral model is a powerful framework to reason about high level loop
transformations. Yet the lack of scalable algorithms and tools has deterred
actors from both academia and industry to put this model to practical use.
Indeed, for fundamental complexity reasons, its applicability has long been
limited to simple kernels. Recent developments broke some generally accepted
ideas about these limitations. In particular, new algorithms made it possible
to compute the target code for full SPEC benchmarks while this code generation
step was expected not to be scalable.
Instancewise
array dependence analysis computes a finite, intentional representation of the
(statically unbounded) set of all dynamic dependences. This problem has always
been considered non-scalable and/or an overkill with respect to less expressive
and faster dependence tests. On the contrary, this article presents
experimental evidence of its applicability to full SPEC CPU2000 benchmarks. To
make this possible, we revisit the characterization of data dependences,
considering relations between time dimensions of the transformed space. Beyond
algorithmic benefits, this naturally leads to a novel way of reasoning about
violated dependences across arbitrary transformation sequences. Reasoning about
violated dependences relieves the compiler designer from the cumbersome task of
implementing specific legality checks for each single transformation. It also
allows, in the case of invalid transformations, to precisely determine the
violated dependences that need to be corrected. Identifying these violations
can in turn enable automatic correction schemes to fix an illegal
transformation sequence with minimal changes.
Arun
Kejariwal, University
of California at Irvine
On the
Dissection of Performance Potential of Types of Speculative Thread-Level
Parallelism
Recent
research in thread-level speculation (TLS) has proposed several mechanisms for
optimistic execution of difficult-to-analyze serial codes in parallel. Though
it has been shown that TLS does help achieve higher levels of parallelism,
i.e., beyond what is achievable with techniques proposed for instruction-level
parallelism (ILP), the ÔtrueÕ performance potential of TLS over non-speculative
thread-level parallelism (TLP). In this paper, we evaluate the same. Further,
we dissect the performance potential of TLS into what is achievable via each
types --- control speculation, data dependence speculation and data value
speculation --- of speculation. Assuming an ideal support for speculative
execution, i.e., misspeculation does not incur any overhead, our study shows
that, at the loop-level, TLS has a modest overall performance potential of
7.34% and 11.16% for SPEC CFP2006 and SPEC CINT2006 respectively.
Arun
Kejariwal, University
of California at Irvine
Lightweight
Lock-Free Synchronization Methods for Multithreading
Emergence
of chip multiprocessors has created a need for exploitation of beyond
{DOALL-type thread-level parallelism (TLP). This calls for development of efficient
thread synchronization techniques to exploit TLP in general parallel programs
with dependences.
For this,
several thread synchronization techniques have been proposed in the past.
However, these limit the exploitation of fine-grain TLP due to large run-time
overhead. Furthermore, the existing approaches can potentially result in (i)
deadlocks between the different threads and (ii) non-deterministic run-time
execution behavior as these techniques are oblivious of the underlying memory
model. In this paper, we propose lightweight lock-free thread synchronization
methods to exploit TLP in general parallel programs with dependences. Each
synchronization method intrinsically guarantees the following in a
multithreaded program:
(a)
sequential consistency,
(b)
atomicity of writes to the shared synchronization construct and
(c) absence
of deadlocks.
This
reduces the programming effort considerably, thereby easing the development of
software for multithreaded systems.
For each
method we formally prove that there cannot occur a deadlock between the
different threads. This obviates the cumbersome and time-consuming process of
detecting and eliminating deadlocks from the programmer. Experiments show that
our synchronization methods incur a minimal overhead of 7.16% on an average.
Further, we achieve performance speedups up to 3.39x on kernels extracted from
the industry standard SPEC OMPM 2001 benchmarks, on a dedicated Intel Xeon 2.78
GHz 4-way multiprocessor.
Jeremy
Buhler, Washington
University
Accelerator
Design for Protein Sequence HMM Search
Profile
hidden Markov models (HMMs) are a powerful approach to describing biologically
significant functional units, or motifs, in protein sequences. Databases of
such models are regularly compared to large collections of proteins to
recognize motifs in them. Exponentially increasing rates of genome sequencing
have caused both protein and model databases to explode in size, placing an
ever-increasing computational burden on users of these systems.
Here, we
describe an accelerated search system that exploits parallelism in a number of
ways. First, the application is functionally decomposed into a pipeline, with
distinct compute resources executing each pipeline stage. Second, the first
pipeline stage is deployed on an FPGA-based systolic array, which yields
significant fine-grained parallelism. Third, for some instantiations of the
design, parallel copies of the first pipeline stage are used, further
increasing the level of coarse-grained parallelism.
A naive parallelization
of the first stage computation has serious repercussions for the sensitivity of
the search. We present two remedies to this dilemma and quantify when each is
most effective. Analytic performance models are used to assess the speedup that
can be attained relative to a single-processor software solution. Performance
improvements of 1 to 2 orders of magnitude are predicted.
Adam
Oliner, Stanford
University
Cooperative
Checkpointing: A Robust Approach to Large-Scale Systems Reliability
Cooperative
checkpointing increases the performance and robustness of a system by allowing
checkpoints requested by applications to be dynamically skipped at runtime. A
robust system must be more than merely resilient to failures; it must be
adaptable and flexible in the face of new and evolving challenges. A
simulation-based experimental analysis using both probabilistic and harvested
failure distributions reveals that cooperative checkpointing enables an
application to make progress under a wide variety of failure distributions that
periodic checkpointing lacks the flexibility to handle. Cooperative
checkpointing can be easily implemented on top of existing
application-initiated checkpointing mechanisms and may be used to enhance other
reliability techniques like QoS guarantees and fault-aware job scheduling. The
simulations also support a number of theoretical predictions related to
cooperative checkpointing, including the non-competitiveness of periodic
checkpointing. As high-performance computing systems continue to grow in size
and complexity, the robustness conferred by cooperative checkpointing will be
crucial for reliably running long jobs on inherently unreliable hardware.
Yogish
Sabharwal, IBM
India Research Lab
Scalable
Algorithms for Global Snapshots in Distributed Systems
Existing
algorithms for global snapshots in distributed systems are not scalable when
the underlying topology is complete. In a network with $N$ processors, these
algorithms require O(N) space and O(N) messages per processor. As a result, these
algorithms are not efficient in large systems when the logical topology of the
communication layer such as MPI is complete.
In this
paper, we propose three algorithms for global snapshot: a grid-based, a
tree-based and a centralized algorithm. The grid-based algorithm uses O(N)
space but only O(\sqrt N) messages per processor. The tree- based algorithm
requires only O(1) space and O(\log N \log w) messages per processor where $w$
is the average number of messages in transit per processor. The centralized
algorithm requires only O(1) space and O(\log w) messages per processor. We
also show a matching lower bound for this problem.
Our
algorithms have applications in checkpointing, detecting stable predicates and
implementing synchronizers. We have implemented our algorithms on top of the
MPI library on the BlueGene/L supercomputer. Our experiments confirm that the
proposed algorithms significantly reduce the message and space complexity of a
global snapshot.
Dan
Wallin, Uppsala
university, Sweden
Multigrid
and Gauss-Seidel Smoothers Revisited: Parallelization on Chip Multiprocessors
Efficient
solutions require a match between the algorithm and the underlying
architecture. The new chip-multiprocessors, CMPs (a.k.a. multicore), feature
low intra-chip communication cost and smaller per-thread caches compared to
earlier systems. From an algorithmic point of view this means that data
locality issues become more important than communication overheads. This may
require re-evaluation of many existing algorithms.
We have
investigated parallel implementations of multigrid methods using a new
temporally blocked, naturally ordered, smoother implementation. Compared with
the standard multigrid solution based on the two-color red-black algorithm, we
improve the data locality often as much as ten times while our use of a
fine-grained locking scheme keeps the parallel efficiency high.
While our algorithm initially was
inspired by CMPs, it was surprising to see our OpenMP multigrid implementation
run up to 40 percent faster than the standard red-black algorithm on an 8-way
SMP system. Studying the smoother part of the algorithm in isolation often
shows it performing two iterations at the same time as a single iteration with
an ordinary red-black smoother. Running our smoother on a 32-thread UltraSPARC
T1 (Niagara) CMP demonstrates the communication cost of our algorithm to be low
for such architectures.
Matteo
Monchiero, Politecnico di Milano/UPC
Design
Space Exploration for Multicore Architectures: A Power/Performance/Thermal View
Multicore
architectures are ruling the recent microprocessor design trend. This is due to
different reasons: better performance, thread-level parallelism bounds in
modern applications, ILP diminishing returns, better thermal/power scaling (many
small cores dissipate less than a large and complex one); and, ease and reuse
of design.
This paper
presents a thorough evaluation of multicore architectures. The architecture we
target is composed of a configurable number of cores, a memory hierarchy
consisting of private L1 and L2, and shared bus interconnect. We consider
parallel shared memory applications. We explore the design space related to the
number of cores, L2 cache size and processor complexity, showing the behavior
of the different configurations/applications with respect to performance,
energy consumption and temperature. Design tradeoffs are analyzed, stressing
the interdependency of the metrics and design factors. In particular, we
evaluate several chip floor plans. Their power/thermal characteristics are
analyzed and they show the importance of considering thermal effects at the
architectural level to achieve the best design choice.
Paul
Stodghill, Cornell
University
A
Distributed System Based on Web Services for Computational Science Simulations
In this
paper, we describe the ASP system, a test bed based on Web Services for coupled
multi-physics simulations. The system is organized as a collection of
geographically-distributed software components in which each component provides
a Web Service, and uses standard SOAP-based Web Service protocols to interact
with other components. There are a number of advantages to organizing a system
in this way, which we discuss. We have analyzed the performance of our system
for several applications and a number of problem sizes and have found that the
overhead for using SOAP-based Web Services is small and tends to decrease as
the problem size increases. Our results suggest that potential performance
bottlenecks identified in the literature may not be major issues in practice,
and that a standards-compliant implementation like ours can delivery excellent
scalable performance even on coupled problems, provided Web Services are used
judiciously.
Manolis
Marazakis, FORTH-ICS
Efficient
Remote Block-level I/O over a RDMA-capable NIC
We present
a performance evaluation of remote block-level I/O over an RDMA-capable network
interface card (NIC), currently under development at FORTH-ICS for use in SAN
environments. We focus on offering application programs transparent and
cost-effective access to a data storage utility. We find that the NIC's latency
and throughput characteristics make it attractive as the underlying
interconnect for a storage system.
This paper
outlines the motivation for this work, and then proceeds to describe the
architecture and state-of-development of our prototype, along with performance
measurements. The measurements presented in this paper show that for increasing
I/O request sizes the throughput approaches that of directly-attached storage;
however important areas of overhead remain to be addressed.
James
Balfour, Stanford
University
Design
Tradeoffs for Tiled CMP On-Chip Networks
We present
detailed area and energy models for on-chip interconnection networks and
describe tradeoffs in the design of networks for tiled CMPs. We investigate how
aspects of the network architecture such as topology, channel width, and buffer
size affect the network's performance and contribute to the interconnect
overhead. We simulate the performance of a variety of on-chip networks designed
for a tiled chip multiprocessor implemented in an advanced VLSI process and
report area and energy efficiencies estimated using our models. Our results
demonstrate that increased channel widths significantly improve area and energy
efficiency in on-chip networks. We describe how introducing a second network to
the system can increase performance while improving area and energy efficiency,
and evaluate the tradeoffs of strategies for distributing traffic over the
subnetworks. Drawing on insights from our analysis, we present a concentrated
mesh topology with replicated subnetworks and express channels which provides a
28% improvement in area-delay and a 50% improvement in energy-delay over other
networks evaluated in this study.
Mark
Hampton, MIT
CSAIL
Implementing
Virtual Memory in a Vector Processor with Software Restart Markers
Vector
processing provides many benefits in the domain of high-performance,
energy-efficient computing. However, implementing a precise exception model can
be costly in a vector processor due to the need to commit instructions in
program order. As a result, vector machines typically do not support virtual
memory, as this usually relies on precise exception handling. Lack of virtual
memory support is one of the key factors that has hindered vector processing
from being more widely used in general-purpose computing.
We remove
the in-order commit requirements of precise exceptions by using software
restart markers, which divide the program into idempotent regions of code. When
executing instructions from a single region, the processor can commit results
to architectural state in any order. If an exception occurs, the processor
restarts execution at the beginning of the region. Since the values within a
region do not need to be buffered, we are able to support virtual memory in a
vector processor without incurring significant hardware overhead. Our scheme
also removes the requirement of preserving vector register file contents in the
event of a context switch. We show that using our approach causes an average
performance reduction of less than 4% across a variety of benchmarks.
Joshua
Yi, Freescale
Semiconductor
The
Exigency of Benchmark and Compiler Drift: Designing TomorrowÕs Processors with
YesterdayÕs Tools
Due to the
amount of time required to design a new processor, one set of benchmark
programs may be used during the design phase while another may be the standard
when the design is finally delivered. Using one benchmark suite to design a
processor while using a different, presumably more current, suite to evaluate
its ultimate performance may lead to sub-optimal design decisions if there are
large differences between the characteristics of the two suites and their
respective compilers. We call this change across time ÒdriftÓ. To evaluate the
impact of using yesterdayÕs benchmark and compiler technology to design
tomorrowÕs processors, we compare common benchmarks from the SPEC 95 and SPEC
2000 benchmark suites. Our results yield three key conclusions. First, we show
that the amount of drift, for common programs in successive SPEC benchmark
suites, is significant. In SPEC 2000, the main memory access time is a far more
significant performance bottleneck than in SPEC 95, while less significant SPEC
2000 performance bottlenecks include the L2 cache latency, the L1 I-cache size,
and the number of reorder buffer entries. Second, using two different
statistical techniques, we show that compiler drift is not as significant as
benchmark drift. Third, we show that benchmark and compiler drift can have a
significant impact on the final design decisions. Specifically, we use a
one-parameter-at-a-time optimization algorithm to design two different
year-2000 processors, one optimized for SPEC 95 and the other optimized for
SPEC 2000, using the energy-delay product (EDP) as the optimization criterion.
The results show that using SPEC 95 to design a year-2000 processor results in
an 18.5% larger EDP and a 20.8% higher CPI than using the SPEC 2000 benchmarks
to design the corresponding processor. Finally, we make a few recommendations
to help computer architects minimize the effects of benchmark and compiler
drift.
Steve
Carr, Michigan
Technological University
Feedback-directed
Memory Disambiguation Through Store Distance Analysis
Feedback-directed
optimization has developed into an important tool in designing and building
optimizing compilers. Based upon profiling, memory distance analysis has shown
much promise in predicting data locality and memory dependences, and has seen
use in locality based optimizations and memory disambiguation. In this paper,
we introduce a new form of memory distance, called store distance, which is
defined as the number of store references between a load and the previous store
accessing the same memory location. We apply store distance analysis to the
problem of memory disambiguation in out-of-order issue processors.
By
generating a representative store distance for each load, we can apply a
compiler/micro-architecture cooperative scheme to direct run-time load
speculation. Using store distance, the processor can, in most cases, accurately
determine on which specific store instruction a load depends according to its
store distance annotation. Our experiments show that the proposed method performs
much better than the previous distance-based memory disambiguation scheme, and
yields performance very close to perfect memory disambiguation. The store
distance based scheme also outperforms the store set technique with a
relatively small predicator space and achieves performance comparable to that
of a 16K-entry store set implementation for both floating point and integer
programs.
Daniel
Vanderster, University
of Victoria
Sensitivity
Analysis of Knapsack-based Task Scheduling on the Grid
The knapsack-based
task scheduler has been previously shown to provide Quality of Service to
malleable tasks on computational grids. In this study, we measure the
sensitivity of the knapsack-derived schedules to variations in the prescribed
allocation policies and their corresponding utility functions. In particular,
we explore the effects of varying the strengths of an external user-specified
monetary metric, an intrinsic estimated response time mediated by nearness to
completion time metric, and of varying the shape of a sigmoidal normalizing
utility function. The results of our analyses show that the knapsack strategy
results in schedules that are consistent with the defined allocation policies.
We conclude by indicating the recommended metric weights, that is, those that
produce desirable schedule characteristics.
Sudharshan
Vazhkudai, Oak
Ridge National Lab
Coupling
Prefix Caching and Collective Downloads for Remote Data Access
Scientific
computing user communities construct complex distributed workflows. Data from
these operations is typically archived at mass storage systems or data centers
close to supercomputers or instruments. End-users of these datasets, however,
usually carry out parts of the workflow at their local computers. In such
cases, client-side caching can offer significant gains by hiding wide-area
latency and improving performance.
Scientific
data caches, however, have been traditionally caching entire datasets, which is
not always necessary. In this paper, we propose a novel combination of seed
caching and transparent collective downloads in the context of FreeLoader
collaborative desktop cache. Seed caching allows the bootstrapping of dataset
downloads by caching only a prefix of the dataset, while collective downloads
(like collective I/O in parallel I/O libraries) facilitate efficient patching
of the missing suffix from an external data source. To estimate the optimal
seed size, we further present an analytical model that takes into account both
the initial overhead and the downloading bandwidth. Experimental results (using
multiple scientific data repositories, wide-area data transfer tools, as well
as a real-world scientific dataset access trace) demonstrate that our model can
select an appropriate prefix size and improves the overall cache performance
without hurting the local access rate of cached datasets.
Andreas
Moshovos, Univ.
of Toronto
BranchTap:
Improving Performance with Very Few Checkpoints Through Adaptive Speculation
Control
Checkpoint
prediction and intelligent management have been recently proposed for reducing
the number of coarse-grain checkpoints needed to support high performance
through speculative execution. In this work, we take a closer look at various
checkpoint prediction and management alternatives comparing their performance
and requirements as the scheduler window increases. We also study a few
additional design choices. The key contribution of this work is BranchTap, a
novel checkpoint allocation strategy that temporarily throttles speculation to
reduce recovery cost while allowing speculation to proceed when it is likely to
boost performance. BranchTap dynamically adapts to application behavior. We
demonstrate that for a 512-entry window processor with a FIFO of just four
checkpoints our adaptive speculation control mechanism leads to an average
performance degradation of just 1.03% compared to a processor that has an
infinite number of checkpoints. This represents an improvement of 23.7% over
the underlying prediction-based-only policy which results in an average
performance deterioration of 1.35%. For the same configuration, BranchTap
improves worst case deterioration drops from 4.82% to 3.50%.
Andrew
Lumsdaine, Indiana
University
Accelerating
Sparse Matrix Computations via Data Compression
Sparse
matrix computations are important for many scientific computations, with
matrix-vector multiplication being a fundamental operation for modern iterative
algorithms. For large sparse matrices, the primary performance limitation on
matrix-vector product is memory bandwidth, rather than algorithm performance.
In fact, the wide disparity between memory bandwidth and CPU performance
suggests that one could trade cycles for bandwidth and still improve the time
to compute a matrix-vector product. Accordingly, this paper presents an
approach to improving the performance of matrix-vector product based on
lossless compression of the index information commonly stored in sparse matrix
representations. Two compressed formats, and their multiplication algorithms,
are given, along with experimental results demonstrating their effectiveness.
For an assortment of large sparse matrices, compression ratios and
corresponding speedups of up to 30% are achieved. The efficiency of the
compression algorithm allows its cost to be easily amortized across repeated
matrix-vector products.
Wei
Huang, The Ohio
State University
A Case
for High Performance Computing with Virtual Machines
Virtual
machine (VM) technologies are experiencing a resurgence in both industry and
research communities. VMs offer many desirable features such as security, ease
of management, OS customization, performance isolation, check-pointing, and
migration, which can be very beneficial to the performance and the
manageability of high performance computing (HPC) applications. However, very
few HPC applications are currently running in a virtualized environment due to
the performance overhead of virtualization. Further, using VMs for HPC also
introduces additional challenges such as management and distribution of OS
images.
In this
paper we present a case for HPC with VMs by introducing a framework which
addresses the performance and management overhead associated with VM-based
computing. Two key ideas in our design are: Virtual Machine Monitor (VMM)
bypass I/O and scalable VM image management. VMM-bypass I/O achieves high
communication performance for VMs by exploiting the OS-bypass feature of modern
high speed interconnects such as InfiniBand. Scalable VM image management
significantly reduces the overhead of distributing and managing VMs in large
scale clusters. Our current implementation is based on the Xen VM environment
and InfiniBand, however, many of our ideas are readily applicable to other VM
environments and high speed interconnects.
We carry
out detailed analysis on the performance and management overhead of our
VM-based HPC framework. Our evaluation shows that HPC applications can achieve
almost the same performance as those running in a native, non-virtualized
environment. Therefore, our approach holds promise to bring the benefits of VMs
to HPC applications with very little degradation in performance.
Matthew
Curtis-Maury, College
of William and Mary
Online
Power-Performance Adaptation of Multithreaded Programs using Hardware
Event-Based Prediction
With
high-end systems featuring multicore/multithreaded processors and high
component density, power-aware high-performance multithreading libraries become
a critical element of the system software stack. Online power and performance
adaptation of multithreaded code from within user-level runtime libraries is a
relatively new and unexplored area of research. We present a user-level library
framework for nearly optimal online adaptation of multithreaded codes for
low-power, high-performance execution. Our framework operates by regulating
concurrency and changing the processors/threads configuration as the program
executes. It is innovative in that it uses fast, runtime performance prediction
derived from hardware event-driven profiling, to select thread granularities
that achieve nearly optimal energy-efficiency points. The use of predictors
substantially reduces the runtime cost of granularity control and program
adaptation. Furthermore, our prediction model significantly improves prediction
accuracy compared to other hardware profile-driven performance prediction
models proposed earlier. Our overall framework achieves performance and
$ED^{2}$ (energy-delay-squared) levels which are: i) comparable to or better
than those of oracle-derived offline predictors; ii) significantly better than
those of online predictors using exhaustive or localized linear search. The
complete prediction and adaptation framework is implemented on a real multi-SMT
system with Intel Hyperthreaded processors and embeds adaptation capabilities
in OpenMP programs.
Lieven
Eeckhout, Ghent
University
Accurate
Memory Data Flow Modeling in Statistical Simulation
Microprocessor
design is a very complex and time-consuming activity. One of the primary
reasons is the huge design space that needs to be explored in order to identify
the optimal design given a number of constraints. Simulations are usually used
to explore these huge design spaces, however, they are fairly slow. Several
hundreds of billions of instructions need to be simulated per benchmark; and
this needs to be done for every design point of interest.
Recently,
statistical simulation was proposed to efficiently cull a huge design space.
The basic idea of statistical simulation is to collect a number of important
program characteristics and to generate a synthetic trace from it. Simulating
this synthetic trace is extremely fast as it contains a million instructions
only.
This paper
improves the statistical simulation methodology by proposing accurate memory
data flow models. We model (i) load forwarding, (ii) delayed cache hits, and
(iii) correlation between cache misses based on path info. Our experiments
using the SPEC CPU2000 benchmarks show a substantial improvement upon current
state-of-the-art statistical simulation methods. For example, for our baseline
configuration we reduce the average IPC prediction error from 10.7% to 2.3%. In
addition, we show that performance trends are predicted very accurately, making
statistical simulation enhanced with accurate data flow models a useful tool
for efficient and accurate microprocessor design space explorations.