Journal articles on the topic 'Shared-memory parallel programming'

Consult the top 50 journal articles for your research on the topic 'Shared-memory parallel programming.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles on a wide variety of disciplines and organise your bibliography correctly.

1

Beck, B. "Shared-memory parallel programming in C++." IEEE Software 7, no. 4 (July 1990): 38–48. http://dx.doi.org/10.1109/52.56449.

2

Bonetta, Daniele, Luca Salucci, Stefan Marr, and Walter Binder. "GEMs: shared-memory parallel programming for Node.js." ACM SIGPLAN Notices 51, no. 10 (December 5, 2016): 531–47. http://dx.doi.org/10.1145/3022671.2984039.

3

Deshpande, Ashish, and Martin Schultz. "Efficient Parallel Programming with Linda." Scientific Programming 1, no. 2 (1992): 177–83. http://dx.doi.org/10.1155/1992/829092.

Abstract:
Linda is a coordination language invented by David Gelernter at Yale University which, when combined with a computation language (like C), yields a high-level parallel programming language for MIMD machines. Linda is based on a virtual shared associative memory containing objects called tuples. Skeptics have long claimed that Linda programs could not be efficient on distributed memory architectures. In this paper, we address this claim by discussing C-Linda's performance in solving a particular scientific computing problem, the shallow water equations, and make comparisons with alternatives available on various shared and distributed memory parallel machines.
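
Linda's coordination model rests on a handful of tuple-space operations (out to deposit a tuple, in to withdraw a matching one, rd to read without removing). The sketch below is not C-Linda; it is a hypothetical, minimal C++ analogue of a blocking tuple-space-like store, offered only to make the "virtual shared associative memory" idea concrete. The class and key names are invented for illustration.

```cpp
// Toy analogue of a tuple-space-style shared associative store in standard C++.
// Not C-Linda: real Linda matches typed tuple fields and also offers rd/eval.
// This sketch mimics only a blocking in() and a non-blocking out() on key/value pairs.
#include <condition_variable>
#include <iostream>
#include <map>
#include <mutex>
#include <string>
#include <thread>

class ToyTupleSpace {
    std::multimap<std::string, int> tuples_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void out(const std::string& key, int value) {        // deposit a tuple
        { std::lock_guard<std::mutex> lk(m_); tuples_.emplace(key, value); }
        cv_.notify_all();
    }
    int in(const std::string& key) {                      // withdraw a matching tuple, blocking
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return tuples_.count(key) > 0; });
        auto it = tuples_.find(key);
        int value = it->second;
        tuples_.erase(it);
        return value;
    }
};

int main() {
    ToyTupleSpace ts;
    std::thread producer([&] { ts.out("result", 42); });     // one worker deposits a tuple
    std::cout << "consumer got " << ts.in("result") << "\n"; // another blocks until it appears
    producer.join();
}
```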
4

Quammen, Cory. "Introduction to programming shared-memory and distributed-memory parallel computers." XRDS: Crossroads, The ACM Magazine for Students 8, no. 3 (April 2002): 16–22. http://dx.doi.org/10.1145/567162.567167.

5

Quammen, Cory. "Introduction to programming shared-memory and distributed-memory parallel computers." XRDS: Crossroads, The ACM Magazine for Students 12, no. 1 (October 2005): 2. http://dx.doi.org/10.1145/1144382.1144384.

6

Keane, J. A., A. J. Grant, and M. Q. Xu. "Comparing distributed memory and virtual shared memory parallel programming models." Future Generation Computer Systems 11, no. 2 (March 1995): 233–43. http://dx.doi.org/10.1016/0167-739x(94)00065-m.

7

Redondo, J. L., I. García, and P. M. Ortigosa. "Parallel evolutionary algorithms based on shared memory programming approaches." Journal of Supercomputing 58, no. 2 (December 18, 2009): 270–79. http://dx.doi.org/10.1007/s11227-009-0374-6.

8

Di Martino, Beniamino, Sergio Briguglio, Gregorio Vlad, and Giuliana Fogaccia. "Workload Decomposition Strategies for Shared Memory Parallel Systems with OpenMP." Scientific Programming 9, no. 2-3 (2001): 109–22. http://dx.doi.org/10.1155/2001/891073.

Abstract:
A crucial issue in parallel programming (for both distributed and shared memory architectures) is work decomposition. The work decomposition task can be accomplished without large programming effort by using high-level parallel programming languages, such as OpenMP. However, particular care must still be paid to achieving performance goals. In this paper we introduce and compare two decomposition strategies, in the framework of shared memory systems, as applied to a case-study particle-in-cell application. A number of different implementations of them, based on the OpenMP language, are discussed with regard to time efficiency, memory occupancy, and program restructuring effort.
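
As a minimal sketch of the loop-level decomposition the abstract refers to (not the paper's particle-in-cell code; array names, sizes, and the time step are invented for illustration), the two OpenMP scheduling clauses below distribute a particle-style loop across a shared-memory thread team either in fixed blocks or in chunks handed out on demand:

```cpp
// Two OpenMP work-decomposition choices for a particle-style loop (illustrative only).
// schedule(static) assigns each thread a fixed block of iterations;
// schedule(dynamic, chunk) hands out chunks on demand, which helps when the
// per-particle cost is irregular. Build with: g++ -fopenmp decomp.cpp
#include <vector>
#include <cstdio>

int main() {
    const int n = 1000000;
    std::vector<double> x(n, 1.0), v(n, 0.5);
    const double dt = 1e-3;

    // Strategy 1: static block decomposition.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i)
        x[i] += v[i] * dt;

    // Strategy 2: dynamic decomposition for irregular per-iteration work.
    #pragma omp parallel for schedule(dynamic, 1024)
    for (int i = 0; i < n; ++i)
        v[i] += 0.1 * x[i] * dt;

    std::printf("x[0] = %f, v[0] = %f\n", x[0], v[0]);
}
```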
9

Alaghband, Gita, and Harry F. Jordan. "Overview of the Force Scientific Parallel Language." Scientific Programming 3, no. 1 (1994): 33–47. http://dx.doi.org/10.1155/1994/632497.

Abstract:
The Force parallel programming language designed for large-scale shared-memory multiprocessors is presented. The language provides a number of parallel constructs as extensions to the ordinary Fortran language and is implemented as a two-level macro preprocessor to support portability across shared memory multiprocessors. The global parallelism model on which the Force is based provides a powerful parallel language. The parallel constructs, generic synchronization, and freedom from process management supported by the Force have resulted in structured parallel programs that can be ported to the many multiprocessors on which the Force is implemented. Two new parallel constructs for looping and functional decomposition are discussed. Several programming examples to illustrate some parallel programming approaches using the Force are also presented.
10

Warren, Karen H. "PDDP, A Data Parallel Programming Model." Scientific Programming 5, no. 4 (1996): 319–27. http://dx.doi.org/10.1155/1996/857815.

Abstract:
PDDP, the parallel data distribution preprocessor, is a data parallel programming model for distributed memory parallel computers. PDDP implements High Performance Fortran-compatible data distribution directives and parallelism expressed by the use of Fortran 90 array syntax, the FORALL statement, and the WHERE construct. Distributed data objects belong to a global name space; other data objects are treated as local and replicated on each processor. PDDP allows the user to program in a shared memory style and generates codes that are portable to a variety of parallel machines. For interprocessor communication, PDDP uses the fastest communication primitives on each platform.
11

Lubachevsky, Boris D. "Synchronization barrier and related tools for shared memory parallel programming." International Journal of Parallel Programming 19, no. 3 (June 1990): 225–50. http://dx.doi.org/10.1007/bf01407956.

12

ALDINUCCI, MARCO. "eskimo: EXPERIMENTING WITH SKELETONS IN THE SHARED ADDRESS MODEL." Parallel Processing Letters 13, no. 03 (September 2003): 449–60. http://dx.doi.org/10.1142/s0129626403001410.

Abstract:
We discuss the lack of expressivity in some skeleton-based parallel programming frameworks. The problem is further exacerbated when approaching irregular problems and dealing with dynamic data structures. Shared memory programming has been argued to have substantial ease of programming advantages for this class of problems. We present the eskimo library, which represents an attempt to merge the two programming models by introducing skeletons in a shared memory framework.
13

Pryadko, S. A., A. Yu Troshin, V. D. Kozlov, and A. E. Ivanov. "Parallel programming technologies on computer complexes." Radio industry (Russia) 30, no. 3 (September 8, 2020): 28–33. http://dx.doi.org/10.21778/2413-9599-2020-30-3-28-33.

Abstract:
The article describes various options for speeding up calculations on computer systems. These options are closely related to the architecture of these complexes. The objective of this paper is to provide the information needed to select a suitable way of speeding up the solution of a computational problem. The main features implemented using the following models are described: programming in systems with shared memory, programming in systems with distributed memory, and programming on graphics accelerators (video cards). The basic concept, principles, advantages, and disadvantages of each of the considered programming models are described. All standards for writing programs described in the article can be used on both Linux and Windows operating systems. The required libraries are available and compatible with the C/C++ programming language. The article concludes with recommendations on the use of a particular technology, depending on the type of task to be solved.
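
To make the contrast between the first two models concrete, the hedged sketch below computes a global sum the distributed-memory way with MPI, while a comment shows the single OpenMP annotation the shared-memory model would use for the same loop. Process counts and data sizes are illustrative only.

```cpp
// Distributed-memory model (MPI): each process owns a slice of the data and an
// explicit communication step (MPI_Reduce) combines the partial results.
// In the shared-memory model the same loop would instead be annotated with
//     #pragma omp parallel for reduction(+:sum)
// and all threads would access the one array directly.
// Typical build/run: mpicxx sum.cpp && mpirun -np 4 ./a.out
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_local = 250000;                       // each rank holds its own slice
    std::vector<double> slice(n_local, 1.0);

    double local_sum = 0.0;
    for (double value : slice) local_sum += value;    // purely local computation

    double global_sum = 0.0;                          // explicit communication step
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) std::printf("global sum over %d ranks = %f\n", size, global_sum);
    MPI_Finalize();
}
```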
14

Watanabe, Y., S. Nakamura, and K. Shimizu. "Parallel programming environment with distributed shared memory and application to bioinformatics." Seibutsu Butsuri 41, supplement (2001): S38. http://dx.doi.org/10.2142/biophys.41.s38_3.

15

Ching-Cheng, Lee, and H. A. Fatmi. "Run-time support for parallel functional programming on shared-memory multiprocessors." Journal of Systems and Software 16, no. 1 (September 1991): 69–74. http://dx.doi.org/10.1016/0164-1212(91)90033-3.

16

Brooks III, Eugene D., Brent C. Gorda, and Karen H. Warren. "The Parallel C Preprocessor." Scientific Programming 1, no. 1 (1992): 79–89. http://dx.doi.org/10.1155/1992/708085.

Abstract:
We describe a parallel extension of the C programming language designed for multiprocessors that provide a facility for sharing memory between processors. The programming model was initially developed on conventional shared memory machines with small processor counts such as the Sequent Balance and Alliant FX/8, but has more recently been used on a scalable massively parallel machine, the BBN TC2000. The programming model is split-join rather than fork-join. Concurrency is exploited to use a fixed number of processors more efficiently rather than to exploit more processors as in the fork-join model. Team splitting, a mechanism to split the team of processors executing a code into subteams to handle parallel subtasks, is used to provide an efficient mechanism to exploit nested concurrency. We have found the split-join programming model to have an inherent implementation advantage, compared to the fork-join model, when the number of processors in a machine becomes large.
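
The Parallel C Preprocessor itself is not reproduced here. As a rough OpenMP analogue of the distinction the abstract draws (with invented array names and sizes), the sketch below contrasts the fork-join style, where a parallel region is opened and closed around each loop, with a split-join/SPMD style, where a single long-lived team divides the index range among its members and joins only once:

```cpp
// Fork-join versus split-join, approximated with OpenMP (not PCP itself).
// Build with: g++ -fopenmp teams.cpp
#include <omp.h>
#include <algorithm>
#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    std::vector<double> a(n, 1.0), b(n, 2.0);

    // Fork-join: a team is conceptually created and joined around this one loop.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) a[i] += b[i];

    // Split-join style: one long-lived team; each member carves out its own block.
    #pragma omp parallel
    {
        const int t = omp_get_thread_num();
        const int nt = omp_get_num_threads();
        const int chunk = (n + nt - 1) / nt;
        const int lo = t * chunk;
        const int hi = std::min(n, lo + chunk);
        for (int i = lo; i < hi; ++i) b[i] += a[i];
    }   // the team joins only here

    std::printf("a[0] = %f, b[0] = %f\n", a[0], b[0]);
}
```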
17

Sanchez, Luis Miguel, Javier Fernandez, Rafael Sotomayor, Soledad Escolar, and J. Daniel Garcia. "A Comparative Study and Evaluation of Parallel Programming Models for Shared-Memory Parallel Architectures." New Generation Computing 31, no. 3 (July 2013): 139–61. http://dx.doi.org/10.1007/s00354-013-0301-5.

18

Löff, Júnior, Dalvan Griebler, Gabriele Mencagli, Gabriell Araujo, Massimo Torquati, Marco Danelutto, and Luiz Gustavo Fernandes. "The NAS Parallel Benchmarks for evaluating C++ parallel programming frameworks on shared-memory architectures." Future Generation Computer Systems 125 (December 2021): 743–57. http://dx.doi.org/10.1016/j.future.2021.07.021.

19

Marowka, Ami. "Execution Model of Three Parallel Languages: OpenMP, UPC and CAF." Scientific Programming 13, no. 2 (2005): 127–35. http://dx.doi.org/10.1155/2005/914081.

Abstract:
The aim of this paper is to present a qualitative evaluation of three state-of-the-art parallel languages: OpenMP, Unified Parallel C (UPC) and Co-Array Fortran (CAF). OpenMP and UPC are explicit parallel programming languages based on the ANSI standard. CAF is an implicit programming language. On the one hand, OpenMP is designed for shared-memory architectures and extends the base language by using compiler directives that annotate the original source code. On the other hand, UPC and CAF are designed for distributed shared-memory architectures and extend the base language with new parallel constructs. We deconstruct each language into its basic components, show examples, make a detailed analysis, compare them, and finally draw some conclusions.
20

Wang, Ping, and Xiaoping Wu. "OpenMP Programming for a Global Inverse Model." Scientific Programming 10, no. 3 (2002): 253–61. http://dx.doi.org/10.1155/2002/620712.

Abstract:
The objective of our investigation is to establish robust inverse algorithms to convert GRACE gravity and ICESat altimetry mission data into global current and past surface mass variations. To assess separation of global sources of change and to evaluate spatio-temporal resolution and accuracy statistically from full posterior covariance matrices, a high performance version of a global simultaneous grid inverse algorithm is essential. One means to accomplish this is to implement a general, well-optimized, parallel global model on massively parallel supercomputers. In our present work, an efficient parallel version of a global inverse program has been implemented on the Origin 2000 using the OpenMP programming model. In this paper, porting a sequential global code to a shared-memory computing system is discussed; several efficient strategies to optimize the code are reported; well-optimized scientific libraries are used; detailed parallel implementation of the global model is reported; performance data of the code are analyzed. Scaling performance on a shared-memory system is also discussed. The parallel version software gives good speedup and dramatically reduces total data processing time.
21

Sato, Mitsuhisa, Hiroshi Harada, Atsushi Hasegawa, and Yutaka Ishikawa. "Cluster-Enabled OpenMP: An OpenMP Compiler for the SCASH Software Distributed Shared Memory System." Scientific Programming 9, no. 2-3 (2001): 123–30. http://dx.doi.org/10.1155/2001/605217.

Abstract:
OpenMP is attracting widespread interest because of its easy-to-use parallel programming model for shared memory multiprocessors. We have implemented a "cluster-enabled" OpenMP compiler for a page-based software distributed shared memory system, SCASH, which works on a cluster of PCs. It allows OpenMP programs to run transparently in a distributed memory environment. The compiler transforms OpenMP programs into parallel programs using SCASH so that shared global variables are allocated at run time in the shared address space of SCASH. A set of directives is added to specify data mapping and a loop scheduling method that schedules iterations onto threads associated with the data mapping. Our experimental results show that the data mapping may greatly impact the performance of OpenMP programs in the software distributed shared memory system. The performance of some NAS parallel benchmark programs in OpenMP is improved by using our extended directives.
22

Gerndt, Michael. "High-level programming of massively parallel computers based on shared virtual memory." Parallel Computing 24, no. 3-4 (May 1998): 383–400. http://dx.doi.org/10.1016/s0167-8191(98)00018-0.

23

Benkner, Siegfried, and Thomas Brandes. "Efficient parallel programming on scalable shared memory systems with High Performance Fortran." Concurrency and Computation: Practice and Experience 14, no. 8-9 (2002): 789–803. http://dx.doi.org/10.1002/cpe.649.

24

Yzelman, A. N., R. H. Bisseling, D. Roose, and K. Meerbergen. "MulticoreBSP for C: A High-Performance Library for Shared-Memory Parallel Programming." International Journal of Parallel Programming 42, no. 4 (August 25, 2013): 619–42. http://dx.doi.org/10.1007/s10766-013-0262-9.

25

Tousimojarad, Ashkan, and Wim Vanderbauwhede. "The Glasgow Parallel Reduction Machine: Programming Shared-memory Many-core Systems using Parallel Task Composition." Electronic Proceedings in Theoretical Computer Science 137 (December 8, 2013): 79–94. http://dx.doi.org/10.4204/eptcs.137.7.

26

Silva, Luis M., João Gabriel Silva, and Simon Chapple. "Implementation and Performance of DSMPI." Scientific Programming 6, no. 2 (1997): 201–14. http://dx.doi.org/10.1155/1997/452521.

Abstract:
Distributed shared memory has been recognized as an alternative programming model to exploit the parallelism in distributed memory systems because it provides a higher level of abstraction than simple message passing. DSM combines the simple programming model of shared memory with the scalability of distributed memory machines. This article presents DSMPI, a parallel library that runs atop MPI and provides a DSM abstraction. It provides an easy-to-use programming interface, is fully portable, and supports heterogeneity. For the sake of flexibility, it supports different coherence protocols and models of consistency. We present some performance results taken in a network of workstations and in a Cray T3D, which show that DSMPI can be competitive with MPI for some applications.
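
DSMPI's own interface is not reproduced here. As a hedged illustration of the same idea, shared-memory semantics layered on top of MPI, the sketch below uses the MPI-3 shared-memory window calls available in modern MPI implementations; the buffer size, values, and synchronization choice (a simple fence) are illustrative assumptions.

```cpp
// Shared memory on top of MPI via MPI-3 windows (illustrative; not DSMPI's API).
// Ranks on the same node map a common segment and access it with plain loads/stores.
// Typical build/run: mpicxx dsm.cpp && mpirun -np 4 ./a.out
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm node;                                    // communicator of ranks sharing a node
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node);
    int nrank = 0;
    MPI_Comm_rank(node, &nrank);

    const int n = 1024;
    MPI_Aint bytes = (nrank == 0) ? n * sizeof(double) : 0;  // node rank 0 owns the segment
    double* base = nullptr;
    MPI_Win win;
    MPI_Win_allocate_shared(bytes, sizeof(double), MPI_INFO_NULL, node, &base, &win);

    // Every rank on the node obtains a pointer to rank 0's segment.
    MPI_Aint qsize = 0; int disp = 0; double* shared = nullptr;
    MPI_Win_shared_query(win, 0, &qsize, &disp, &shared);

    MPI_Win_fence(0, win);
    if (nrank == 0)
        for (int i = 0; i < n; ++i) shared[i] = i;    // direct stores into the shared segment
    MPI_Win_fence(0, win);                            // synchronize before the others read

    std::printf("node rank %d sees shared[10] = %f\n", nrank, shared[10]);

    MPI_Win_free(&win);
    MPI_Comm_free(&node);
    MPI_Finalize();
}
```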
27

Magomedov, Sh G., and A. S. Lebedev. "A tool for automatic parallelization of affine programs for systems with shared and distributed memory." Russian Technological Journal 7, no. 5 (October 15, 2019): 7–19. http://dx.doi.org/10.32362/2500-316x-2019-7-5-7-19.

Abstract:
Effective programming of parallel architectures has always been a challenge, and it is especially complicated by their modern diversity. The task of automatically parallelizing program code has been posed since the appearance of the first parallel computers made in Russia (for example, the PS2000). To date, programming languages and technologies have been developed that simplify the work of a programmer (T-System, MC#, Erlang, Go, OpenCL), but they do not make parallelization automatic. The current situation requires the development of effective programming tools for parallel computing systems. Such tools should support the development of parallel programs for systems with shared and distributed memory. The paper deals with the problem of automatic parallelization of affine programs for such systems. Methods for calculating space-time mappings that optimize the locality of the program are discussed. The developed methods are implemented in Haskell within a source-to-source translator that performs automatic parallelization. A comparison is made between the performance of parallel variants of the lu, atax, and syr2k programs obtained using the developed tool and the modern Pluto tool. The experiments were performed on two x86_64 machines connected by an InfiniBand network. OpenMP and MPI were used as parallelization technologies. The performance of the resulting parallel programs indicates the practical applicability of the developed tool for the parallelization of affine programs.
28

Zhang, Xiaodong, and Lin Sun. "Comparative Evaluation and Case Studies of Shared-Memory and Data-Parallel Execution Patterns." Scientific Programming 7, no. 1 (1999): 1–19. http://dx.doi.org/10.1155/1999/468372.

Abstract:
Shared-memory and data-parallel programming models are two important paradigms for scientific applications. Both models provide high-level program abstractions, and simple and uniform views of network structures. The common features of the two models significantly simplify program coding and debugging for scientific applications. However, the underlying execution and overhead patterns are significantly different between the two models due to their programming constraints, and due to different and complex structures of interconnection networks and systems which support the two models. We performed this experimental study to present implications and comparisons of execution patterns on two commercial architectures. We implemented a standard electromagnetic simulation program (EM) and a linear system solver using the shared-memory model on the KSR-1 and the data-parallel model on the CM-5. Our objectives are to examine the execution pattern changes required for an implementation transformation between the two models; to study memory access patterns; to address scalability issues; and to investigate relative costs and advantages/disadvantages of using the two models for scientific computations. Our results indicate that the EM program tends to become computation-intensive in the KSR-1 shared-memory system, and memory-demanding in the CM-5 data-parallel system when the systems and the problems are scaled. The EM program, a highly data-parallel program, performed extremely well, and the linear system solver, a highly control-structured program, suffered significantly in the data-parallel model on the CM-5. Our study provides further evidence that matching execution patterns of algorithms to parallel architectures would achieve better performance.
29

Picano, Silvio, Eugene D. Brooks III, and Joseph E. Hoag. "Assessing Programming Costs of Explicit Memory Localization on a Large Scale Shared Memory Multiprocessor." Scientific Programming 1, no. 1 (1992): 67–78. http://dx.doi.org/10.1155/1992/923069.

Abstract:
We present detailed experimental work involving a commercially available large scale shared memory multiple instruction stream-multiple data stream (MIMD) parallel computer having a software controlled cache coherence mechanism. To make effective use of such an architecture, the programmer is responsible for designing the program's structure to match the underlying multiprocessor's capabilities. We describe the techniques used to exploit our multiprocessor (the BBN TC2000) on a network simulation program, showing the resulting performance gains and the associated programming costs. We show that an efficient implementation relies heavily on the user's ability to explicitly manage the memory system.
30

Bozkus, Zeki, Larry Meadows, Steven Nakamoto, Vincent Schuster, and Mark Young. "PGHPF – An Optimizing High Performance Fortran Compiler for Distributed Memory Machines." Scientific Programming 6, no. 1 (1997): 29–40. http://dx.doi.org/10.1155/1997/705102.

Abstract:
High Performance Fortran (HPF) is the first widely supported, efficient, and portable parallel programming language for shared and distributed memory systems. HPF is realized through a set of directive-based extensions to Fortran 90. It enables application developers and Fortran end-users to write compact, portable, and efficient software that will compile and execute on workstations, shared memory servers, clusters, traditional supercomputers, or massively parallel processors. This article describes a production-quality HPF compiler for a set of parallel machines. Compilation techniques such as data and computation distribution, communication generation, run-time support, and optimization issues are elaborated as the basis for an HPF compiler implementation on distributed memory machines. The performance of this compiler on benchmark programs demonstrates that high efficiency can be achieved executing HPF code on parallel architectures.
31

Ierotheou, C. S., S. P. Johnson, P. F. Leggett, M. Cross, E. W. Evans, H. Jin, M. Frumkin, and J. Yan. "The Semi-Automatic Parallelisation of Scientific Application Codes Using a Computer Aided Parallelisation Toolkit." Scientific Programming 9, no. 2-3 (2001): 163–73. http://dx.doi.org/10.1155/2001/327048.

Abstract:
The shared-memory programming model can be an effective way to achieve parallelism on shared memory parallel computers. Historically however, the lack of a programming standard using directives and the limited scalability have affected its take-up. Recent advances in hardware and software technologies have resulted in improvements to both the performance of parallel programs with compiler directives and the issue of portability with the introduction of OpenMP. In this study, the Computer Aided Parallelisation Toolkit has been extended to automatically generate OpenMP-based parallel programs with nominal user assistance. We categorize the different loop types and show how efficient directives can be placed using the toolkit's in-depth interprocedural analysis. Examples are taken from the NAS parallel benchmarks and a number of real-world application codes. This demonstrates the great potential of using the toolkit to quickly parallelise serial programs as well as the good performance achievable on up to 300 processors for hybrid message passing-directive parallelisations.
32

ROBERGE, VINCENT, MOHAMMED TARBOUCHI, and FRANÇOIS ALLAIRE. "PARALLEL HYBRID METAHEURISTIC ON SHARED MEMORY SYSTEM FOR REAL-TIME UAV PATH PLANNING." International Journal of Computational Intelligence and Applications 13, no. 02 (June 2014): 1450008. http://dx.doi.org/10.1142/s1469026814500084.

Abstract:
In this paper, we present a parallel hybrid metaheuristic that combines the strengths of the particle swarm optimization (PSO) and the genetic algorithm (GA) to produce an improved path-planner algorithm for fixed wing unmanned aerial vehicles (UAVs). The proposed solution uses a multi-objective cost function we developed and generates in real-time feasible and quasi-optimal trajectories in complex 3D environments. Our parallel hybrid algorithm simulates multiple GA populations and PSO swarms in parallel while allowing migration of solutions. This collaboration between the GA and the PSO leads to an algorithm that exhibits the strengths of both optimization methods and produces superior solutions. Moreover, by using the "single-program, multiple-data" parallel programming paradigm, we maximize the use of today's multicore CPU and significantly reduce the execution time of the parallel program compared to a sequential implementation. We observed a quasi-linear speedup of 10.7 times faster on a 12-core shared memory system resulting in an execution time of 5 s which allows in-flight planning. Finally, we show with statistical significance that our parallel hybrid algorithm produces superior trajectories to the parallel GA or the parallel PSO we previously developed.
33

Hoefler, Torsten, James Dinan, Darius Buntinas, Pavan Balaji, Brian Barrett, Ron Brightwell, William Gropp, Vivek Kale, and Rajeev Thakur. "MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared memory." Computing 95, no. 12 (May 19, 2013): 1121–36. http://dx.doi.org/10.1007/s00607-013-0324-2.

34

Kang, Sol Ji, Sang Yeon Lee, and Keon Myung Lee. "Performance Comparison of OpenMP, MPI, and MapReduce in Practical Problems." Advances in Multimedia 2015 (2015): 1–9. http://dx.doi.org/10.1155/2015/575687.

Abstract:
With problem size and complexity increasing, several parallel and distributed programming models and frameworks have been developed to efficiently handle such problems. This paper briefly reviews the parallel computing models and describes three widely recognized parallel programming frameworks: OpenMP, MPI, and MapReduce. OpenMP is the de facto standard for parallel programming on shared memory systems. MPI is the de facto industry standard for distributed memory systems. MapReduce framework has become the de facto standard for large scale data-intensive applications. Qualitative pros and cons of each framework are known, but quantitative performance indexes help get a good picture of which framework to use for the applications. As benchmark problems to compare those frameworks, two problems are chosen: all-pairs-shortest-path problem and data join problem. This paper presents the parallel programs for the problems implemented on the three frameworks, respectively. It shows the experiment results on a cluster of computers. It also discusses which is the right tool for the jobs by analyzing the characteristics and performance of the paradigms.
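
As a hedged sketch of the all-pairs-shortest-path benchmark in its shared-memory (OpenMP) form (not the code from the paper, and with a tiny invented graph), the Floyd-Warshall loop nest can be parallelized over rows for each fixed intermediate vertex k. The MPI and MapReduce variants would instead distribute rows of the distance matrix across processes or map/reduce stages.

```cpp
// Floyd-Warshall all-pairs shortest paths with the row loop shared by OpenMP threads
// (illustrative sketch; graph and sizes are invented). Build with: g++ -fopenmp apsp.cpp
#include <algorithm>
#include <vector>
#include <cstdio>

int main() {
    const int n = 512;
    const double INF = 1e18;
    std::vector<double> d(n * n, INF);               // dense distance matrix, d[i*n + j]
    for (int i = 0; i < n; ++i) d[i * n + i] = 0.0;
    d[0 * n + 1] = 3.0; d[1 * n + 2] = 4.0;          // a few sample edges

    for (int k = 0; k < n; ++k) {
        // For a fixed k, the (i, j) updates are independent, so rows go to the team.
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; ++i) {
            const double dik = d[i * n + k];
            for (int j = 0; j < n; ++j)
                d[i * n + j] = std::min(d[i * n + j], dik + d[k * n + j]);
        }
    }
    std::printf("dist(0, 2) = %f\n", d[0 * n + 2]);  // expected 7.0 for the sample edges
}
```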
35

Aslot, Vishal, and Rudolf Eigenmann. "Quantitative Performance Analysis of the SPEC OMPM2001 Benchmarks." Scientific Programming 11, no. 2 (2003): 105–24. http://dx.doi.org/10.1155/2003/401032.

Abstract:
The state of modern computer systems has evolved to allow easy access to multiprocessor systems by supporting multiple processors on a single physical package. As the multiprocessor hardware evolves, new ways of programming it are also developed. Some inventions may merely be adopting and standardizing the older paradigms. One such evolving standard for programming shared-memory parallel computers is the OpenMP API. The Standard Performance Evaluation Corporation (SPEC) has created a suite of parallel programs called SPEC OMP to compare and evaluate modern shared-memory multiprocessor systems using the OpenMP standard. We have studied these benchmarks in detail to understand their performance on a modern architecture. In this paper, we present detailed measurements of the benchmarks. We organize, summarize, and display our measurements using a Quantitative Model. We present a detailed discussion and derivation of the model. Also, we discuss the important loops in the SPEC OMPM2001 benchmarks and the reasons for less than ideal speedup on our platform.
36

Averbuch, A., E. Gabber, S. Itzikowitz, and B. Shoham. "On the Parallel Elliptic Single/Multigrid Solutions about Aligned and Nonaligned Bodies Using the Virtual Machine for Multiprocessors." Scientific Programming 3, no. 1 (1994): 13–32. http://dx.doi.org/10.1155/1994/895737.

Abstract:
Parallel elliptic single/multigrid solutions around an aligned and nonaligned body are presented and implemented on two multi-user and single-user shared memory multiprocessors (Sequent Symmetry and MOS) and on a distributed memory multiprocessor (a Transputer network). Our parallel implementation uses the Virtual Machine for Multi-Processors (VMMP), a software package that provides a coherent set of services for explicitly parallel application programs running on diverse multiple instruction multiple data (MIMD) multiprocessors, both shared memory and message passing. VMMP is intended to simplify parallel program writing and to promote portable and efficient programming. Furthermore, it ensures high portability of application programs by implementing the same services on all target multiprocessors. The performance of our algorithm is investigated in detail. It is seen to fit well the above architectures when the number of processors is less than the maximal number of grid points along the axes. In general, the efficiency in the nonaligned case is higher than in the aligned case. Alignment overhead is observed to be up to 200% in the shared-memory case and up to 65% in the message-passing case. We have demonstrated that when using VMMP, the portability of the algorithms is straightforward and efficient.
37

Fung, Larry S. K., Mohammad O. Sindi, and Ali H. Dogru. "Multiparadigm Parallel Acceleration for Reservoir Simulation." SPE Journal 19, no. 04 (January 6, 2014): 716–25. http://dx.doi.org/10.2118/163591-pa.

Abstract:
Summary With the advent of the multicore central-processing unit (CPU), today's commodity PC clusters are effectively a collection of interconnected parallel computers, each with multiple multicore CPUs and large shared random access memory (RAM), connected together by means of high-speed networks. Each computer, referred to as a compute node, is a powerful parallel computer on its own. Each compute node can be equipped further with acceleration devices such as the general-purpose graphical processing unit (GPGPU) to further speed up computational-intensive portions of the simulator. Reservoir-simulation methods that can exploit this heterogeneous hardware system can be used to solve very-large-scale reservoir-simulation models and run significantly faster than conventional simulators. Because typical PC clusters are essentially distributed share-memory computers, this suggests that the use of the mixed-paradigm parallelism (distributed-shared memory), such as message-passing interface and open multiprocessing (MPI-OMP), should work well for computational efficiency and memory use. In this work, we compare and contrast the single-paradigm programming models, MPI or OMP, with the mixed paradigm, MPI-OMP, programming model for a class of solver method that is suited for the different modes of parallelism. The results showed that the distributed memory (MPI-only) model has superior multicompute-node scalability, whereas the shared memory (OMP-only) model has superior parallel performance on a single compute node. The mixed MPI-OMP model and OMP-only model are more memory-efficient for the multicore architecture than the MPI-only model because they require less or no halo-cell storage for the subdomains. To exploit the fine-grain shared memory parallelism available on the GPGPU architecture, algorithms should be suited to the single-instruction multiple-data (SIMD) parallelism, and any recursive operations are serialized. In addition, solver methods and data store need to be reworked to coalesce memory access and to avoid shared memory-bank conflicts. Wherever possible, the cost of data transfer through the peripheral component interconnect express (PCIe) bus between the CPU and GPGPU needs to be hidden by means of asynchronous communication. We applied multiparadigm parallelism to accelerate compositional reservoir simulation on a GPGPU-equipped PC cluster. On a dual-CPU-dual-GPGPU compute node, the parallelized solver running on the dual-GPGPU Fermi M2090Q achieved up to 19 times speedup over the serial CPU (1-core) results and up to 3.7 times speedup over the parallel dual-CPU X5675 results in a mixed MPI + OMP paradigm for a 1.728-million-cell compositional model. Parallel performance shows a strong dependency on the subdomain sizes. Parallel CPU solve has a higher performance for smaller domain partitions, whereas GPGPU solve requires large partitions for each chip for good parallel performance. This is related to improved cache efficiency on the CPU for small subdomains and the loading requirement for massive parallelism on the GPGPU. Therefore, for a given model, the multinode parallel performance decreases for the GPGPU relative to the CPU as the model is further subdivided into smaller subdomains to be solved on more compute nodes. To illustrate this, a modified SPE5 (Killough and Kossack 1987) model with various grid dimensions was run to generate comparative results. 
Parallel performances for three field compositional models of various sizes and dimensions are included to further elucidate and contrast CPU-GPGPU single-node and multiple-node performances. A PC cluster with the Tesla M2070Q GPGPU and the 6-core Xeon X5675 Westmere was used to produce the majority of the reported results. Another PC cluster with the Tesla M2090Q GPGPU was available for some cases, and the results are reported for the modified SPE5 (Killough and Kossack 1987) problems for comparison.
38

Radhakrishnan, Hari, Damian W. I. Rouson, Karla Morris, Sameer Shende, and Stavros C. Kassinos. "Using Coarrays to Parallelize Legacy Fortran Applications: Strategy and Case Study." Scientific Programming 2015 (2015): 1–12. http://dx.doi.org/10.1155/2015/904983.

Abstract:
This paper summarizes a strategy for parallelizing a legacy Fortran 77 program using the object-oriented (OO) and coarray features that entered Fortran in the 2003 and 2008 standards, respectively. OO programming (OOP) facilitates the construction of an extensible suite of model-verification and performance tests that drive the development. Coarray parallel programming facilitates a rapid evolution from a serial application to a parallel application capable of running on multicore processors and many-core accelerators in shared and distributed memory. We delineate 17 code modernization steps used to refactor and parallelize the program and study the resulting performance. Our initial studies were done using the Intel Fortran compiler on a 32-core shared memory server. Scaling behavior was very poor, and profile analysis using TAU showed that the bottleneck in the performance was due to our implementation of a collective, sequential summation procedure. We were able to improve the scalability and achieve nearly linear speedup by replacing the sequential summation with a parallel, binary tree algorithm. We also tested the Cray compiler, which provides its own collective summation procedure. Intel provides no collective reductions. With Cray, the program shows linear speedup even in distributed-memory execution. We anticipate similar results with other compilers once they support the new collective procedures proposed for Fortran 2015.
39

Chorley, Martin J., David W. Walker, and Martyn F. Guest. "Hybrid Message-Passing and Shared-Memory Programming in a Molecular Dynamics Application On Multicore Clusters." International Journal of High Performance Computing Applications 23, no. 3 (June 2, 2009): 196–211. http://dx.doi.org/10.1177/1094342009106188.

Abstract:
Hybrid programming, whereby shared-memory and message-passing programming techniques are combined within a single parallel application, has often been discussed as a method for increasing code performance on clusters of symmetric multiprocessors (SMPs). This paper examines whether the hybrid model brings any performance benefits for clusters based on multicore processors. A molecular dynamics application has been parallelized using both MPI and hybrid MPI/OpenMP programming models. The performance of this application has been examined on two high-end multicore clusters using both Infiniband and Gigabit Ethernet interconnects. The hybrid model has been found to perform well on the higher-latency Gigabit Ethernet connection, but offers no performance benefit on low-latency Infiniband interconnects. The changes in performance are attributed to the differing communication profiles of the hybrid and MPI codes.
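
A minimal skeleton of the hybrid structure discussed above (not the molecular dynamics code itself; the per-rank work is a placeholder loop) requests MPI_THREAD_FUNNELED because only the master thread makes MPI calls, and runs an OpenMP team inside each MPI rank:

```cpp
// Hybrid MPI + OpenMP skeleton: message passing between ranks, threads within a rank.
// Typical build/run: mpicxx -fopenmp hybrid.cpp && mpirun -np 2 ./a.out
#include <mpi.h>
#include <omp.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);  // master thread talks to MPI

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_local = 100000;                       // this rank's share of the work
    std::vector<double> energy(n_local, 1.0);

    double local = 0.0;
    #pragma omp parallel for reduction(+ : local)     // shared-memory parallelism within the rank
    for (int i = 0; i < n_local; ++i)
        local += 0.5 * energy[i];

    double total = 0.0;                               // message passing between ranks
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("total = %f using %d ranks x up to %d threads each\n",
                    total, size, omp_get_max_threads());
    MPI_Finalize();
}
```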
40

Mattson, Timothy G. "How Good is OpenMP." Scientific Programming 11, no. 2 (2003): 81–93. http://dx.doi.org/10.1155/2003/124373.

Abstract:
The OpenMP standard defines an Application Programming Interface (API) for shared memory computers. Since its introduction in 1997, it has grown to become one of the most commonly used APIs for parallel programming. But success in the market doesn't necessarily imply successful computer science. Is OpenMP a "good" programming environment? What does it even mean to call a programming environment good? And finally, once we understand how good or bad OpenMP is, what can we do to make it even better? In this paper, we will address these questions.
41

Kim, DaeHwan. "Experiences of the GPU Thread Configuration and Shared Memory." European Journal of Engineering Research and Science 3, no. 7 (July 17, 2018): 12. http://dx.doi.org/10.24018/ejers.2018.3.7.788.

Abstract:
Nowadays, GPU processors are widely used for general-purpose parallel computation applications. In GPU programming, thread and block configuration is one of the most important decisions to be made, as it increases parallelism and hides instruction latency. However, in many cases it is difficult to have sufficient parallelism to hide all the latencies, where the high latencies are often caused by global memory accesses. In order to reduce the number of those accesses, the shared memory, which is located on chip and is much faster than the global memory, is used instead. The performance of the proposed thread configuration is evaluated on the GPU 960 processor. The experimental result shows that the best configuration improves the performance by 7.3 times compared to the worst configuration in the experiment. The experiences with shared memory performance, as compared to that of the global memory, are also discussed.
42

Otto, Steve W. "Parallel Array Classes and Lightweight Sharing Mechanisms." Scientific Programming 2, no. 4 (1993): 203–16. http://dx.doi.org/10.1155/1993/393409.

Abstract:
We discuss a set of parallel array classes, MetaMP, for distributed-memory architectures. The classes are implemented in C++ and interface to the PVM or Intel NX message-passing systems. An array class implements a partitioned array as a set of objects distributed across the nodes – a "collective" object. Object methods hide the low-level message-passing and implement meaningful array operations. These include transparent guard strips (or sharing regions) that support finite-difference stencils, reductions and multibroadcasts for support of pivoting and row operations, and interpolation/contraction operations for support of multigrid algorithms. The concept of guard strips is generalized to an object implementation of lightweight sharing mechanisms for finite element method (FEM) and particle-in-cell (PIC) algorithms. The sharing is accomplished through the mechanism of weak memory coherence and can be efficiently implemented. The price of the efficient implementation is memory usage and the need to explicitly specify the coherence operations. An intriguing feature of this programming model is that it maps well to both distributed-memory and shared-memory architectures.
43

Płażek, Joanna, Krzysztof Banaś, and Jacek Kitowski. "Comparison of Message-Passing and Shared Memory Implementations of the GMRES Method on MIMD Computers." Scientific Programming 9, no. 4 (2001): 195–209. http://dx.doi.org/10.1155/2001/681621.

Abstract:
In this paper we compare different parallel implementations of the same algorithm for solving nonlinear simulation problems on unstructured meshes. In the first implementation, making use of the message-passing programming model and the PVM system, the domain decomposition of the unstructured mesh is implemented, while the second implementation takes advantage of the inherent parallelism of the algorithm by adopting the shared-memory programming model. Both implementations are applied to the preconditioned GMRES method that iteratively solves the system of linear equations. A combined approach, the hybrid programming model suitable for multicomputers with SMP nodes, is introduced. For performance measurements we use compressible fluid flow simulation in which sequences of finite element solutions form time approximations to the Euler equations. The tests are performed on HP SPP1600, HP S2000 and SGI Origin2000 multiprocessors and report wall-clock execution time and speedup for different numbers of processing nodes and for different meshes. Experimentally, the explicit programming model proves to be more efficient than the implicit model by 20-70%, depending on the mesh and the machine.
44

Szustak, Lukasz, and Pawel Bratek. "Performance portable parallel programming of heterogeneous stencils across shared-memory platforms with modern Intel processors." International Journal of High Performance Computing Applications 33, no. 3 (February 24, 2019): 534–53. http://dx.doi.org/10.1177/1094342019828153.

Abstract:
In this work, we take up the challenge of performance portable programming of heterogeneous stencil computations across a wide range of modern shared-memory systems. An important example of such computations is the Multidimensional Positive Definite Advection Transport Algorithm (MPDATA), the second major part of the dynamic core of the EULAG geophysical model. For this aim, we develop a set of parametric optimization techniques and four-step procedure for customization of the MPDATA code. Among these techniques are: islands-of-cores strategy, (3+1)D decomposition, exploiting data parallelism and simultaneous multithreading, data flow synchronization, and vectorization. The proposed adaptation methodology helps us to develop the automatic transformation of the MPDATA code to achieve high sustained scalable performance for all tested ccNUMA platforms with Intel processors of last generations. This means that for a given platform, the sustained performance of the new code is kept at a similar level, independently of the problem size. The highest performance utilization rate of about 41–46% of the theoretical peak, measured for all benchmarks, is provided for any of the two-socket servers based on Skylake-SP (SKL-SP), Broadwell, and Haswell CPU architectures. At the same time, the four-socket server with SKL-SP processors achieves the highest sustained performance of around 1.0–1.1 Tflop/s that corresponds to about 33% of the peak.
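
As a hedged, greatly simplified illustration of two of the ingredients named above, decomposing a stencil across cores and vectorizing the unit-stride direction, the sketch below applies an OpenMP parallel-for to the rows and an OpenMP simd directive to the inner loop of a toy 2-D five-point stencil; MPDATA itself and the (3+1)D decomposition are far more involved.

```cpp
// Toy 2-D five-point stencil: rows decomposed over the thread team, inner loop
// marked for vectorization (illustrative only, not MPDATA). Build: g++ -fopenmp stencil.cpp
#include <vector>
#include <cstdio>

int main() {
    const int nx = 1024, ny = 1024;
    std::vector<double> in(nx * ny, 1.0), out(nx * ny, 0.0);

    #pragma omp parallel for schedule(static)         // data parallelism across cores
    for (int i = 1; i < nx - 1; ++i) {
        #pragma omp simd                              // vectorize the unit-stride direction
        for (int j = 1; j < ny - 1; ++j)
            out[i * ny + j] = 0.25 * (in[(i - 1) * ny + j] + in[(i + 1) * ny + j]
                                    + in[i * ny + j - 1] + in[i * ny + j + 1]);
    }
    std::printf("out(1, 1) = %f\n", out[1 * ny + 1]);
}
```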
45

Czarnul, Paweł, Jerzy Proficz, and Krzysztof Drypczewski. "Survey of Methodologies, Approaches, and Challenges in Parallel Programming Using High-Performance Computing Systems." Scientific Programming 2020 (January 29, 2020): 1–19. http://dx.doi.org/10.1155/2020/4176794.

Abstract:
This paper provides a review of contemporary methodologies and APIs for parallel programming, with representative technologies selected in terms of target system type (shared memory, distributed, and hybrid), communication patterns (one-sided and two-sided), and programming abstraction level. We analyze representatives in terms of many aspects including programming model, languages, supported platforms, license, optimization goals, ease of programming, debugging, deployment, portability, level of parallelism, constructs enabling parallelism and synchronization, features introduced in recent versions indicating trends, support for hybridity in parallel execution, and disadvantages. Such detailed analysis has led us to the identification of trends in high-performance computing and of the challenges to be addressed in the near future. It can help to shape future versions of programming standards, select technologies best matching programmers’ needs, and avoid potential difficulties while using high-performance computing systems.
46

Giraud, L. "Combining Shared and Distributed Memory Programming Models on Clusters of Symmetric Multiprocessors: Some Basic Promising Experiments." International Journal of High Performance Computing Applications 16, no. 4 (November 2002): 425–30. http://dx.doi.org/10.1177/109434200201600405.

Abstract:
This note presents some experiments on different clusters of SMPs, where both distributed and shared memory parallel programming paradigms can be naturally combined. Although the platforms exhibit the same macroscopic memory organization, it appears that their individual overall performance is closely dependent on the ability of their hardware to efficiently exploit the local shared memory within the nodes. In that context, cache blocking strategy appears to be very important not only to get good performance out of each individual processor but mainly good performance out of the overall computing node since sharing memory locally might become a severe bottleneck. On a very simple benchmark, representative of many large simulation codes, we show through numerical experiments that mixing the two programming models enables us to get attractive speed-ups that compete with a pure distributed memory approach. This opens promising perspectives for smoothly moving large industrial codes developed on distributed vector computers with a moderate number of processors on these emerging platforms for intensive scientific computing that are the clusters of SMPs.
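
The cache-blocking point made in this note can be illustrated with a hedged, generic example rather than the benchmark actually used: a tiled matrix product in which the OpenMP team shares the tile loops, so each core repeatedly reuses blocks that fit in its cache instead of streaming whole rows through the node's shared memory. The matrix and tile sizes are assumptions to be tuned per machine.

```cpp
// Cache-blocked (tiled) matrix product with the tile loops shared by OpenMP threads.
// Sizes are illustrative; BS should be tuned to the cache. Build: g++ -fopenmp blocked.cpp
#include <algorithm>
#include <vector>
#include <cstdio>

int main() {
    const int n = 512, BS = 64;                       // BS x BS tiles (tunable assumption)
    std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);

    #pragma omp parallel for collapse(2) schedule(static)
    for (int ii = 0; ii < n; ii += BS)
        for (int jj = 0; jj < n; jj += BS)            // each thread owns whole tiles of C
            for (int kk = 0; kk < n; kk += BS)
                for (int i = ii; i < std::min(ii + BS, n); ++i)
                    for (int k = kk; k < std::min(kk + BS, n); ++k) {
                        const double aik = A[i * n + k];
                        for (int j = jj; j < std::min(jj + BS, n); ++j)
                            C[i * n + j] += aik * B[k * n + j];
                    }

    std::printf("C(0, 0) = %f (expected %d)\n", C[0], n);
}
```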
47

Luo, Ya Di, Jing Li, Jie Xu, Cheng Long Dou, and Yu Pei Jia. "Parallel Program Design and Development of Large-Scale Grid Topology Analysis." Advanced Materials Research 1070-1072 (December 2014): 804–8. http://dx.doi.org/10.4028/www.scientific.net/amr.1070-1072.804.

Abstract:
This paper proposes a parallel computing method for topology analysis based on partitioning grid model data and, building on the smart grid dispatch control system, designs and develops a parallel topology analysis service using the OpenMP shared memory programming model and the C/C++ programming language. According to the layering and zoning features of the smart grid dispatch control system, the method divides the grid model by area and power station. The topology search function is packaged and the model data are processed in parallel by area and power station, which realizes parallel network topology analysis. Test results on an actual grid show that the method has good stability and real-time performance and can meet the topology analysis requirements of online simulation, analysis, and control applications.
48

Mattson, Timothy G. "The Efficiency of Linda for General Purpose Scientific Programming." Scientific Programming 3, no. 1 (1994): 61–71. http://dx.doi.org/10.1155/1994/401086.

Abstract:
Linda (Linda is a registered trademark of Scientific Computing Associates, Inc.) is a programming language for coordinating the execution and interaction of processes. When combined with a language for computation (such as C or Fortran), the resulting hybrid language can be used to write portable programs for parallel and distributed multiple instruction multiple data (MIMD) computers. The Linda programming model is based on operations that read, write, and erase a virtual shared memory. It is easy to use, and lets the programmer code in a very expressive, uncoupled programming style. These benefits, however, are of little value unless Linda programs execute efficiently. The goal of this article is to demonstrate that Linda programs are efficient making Linda an effective general purpose tool for programming MIMD parallel computers. Two arguments for Linda's efficiency are given; the first is based on Linda's implementation and the second on a range of case studies spanning a complete set of parallel algorithm classes.
49

Brown, Christopher, Vladimir Janjic, M. Goli, and J. McCall. "Programming Heterogeneous Parallel Machines Using Refactoring and Monte–Carlo Tree Search." International Journal of Parallel Programming 48, no. 4 (June 10, 2020): 583–602. http://dx.doi.org/10.1007/s10766-020-00665-z.

Abstract:
This paper presents a new technique for introducing and tuning parallelism for heterogeneous shared-memory systems (comprising a mixture of CPUs and GPUs), using a combination of algorithmic skeletons (such as farms and pipelines), Monte–Carlo tree search for deriving mappings of tasks to available hardware resources, and refactoring tool support for applying the patterns and mappings in an easy and effective way. Using our approach, we demonstrate easily obtainable, significant and scalable speedups on a number of case studies showing speedups of up to 41 over the sequential code on a 24-core machine with one GPU. We also demonstrate that the speedups obtained by mappings derived by the MCTS algorithm are within 5–15% of the best-obtained manual parallelisation.
50

SCHMOLLINGER, MARTIN, and MICHAEL KAUFMANN. "DESIGNING PARALLEL ALGORITHMS FOR HIERARCHICAL SMP CLUSTERS." International Journal of Foundations of Computer Science 14, no. 01 (February 2003): 59–78. http://dx.doi.org/10.1142/s0129054103001595.

Abstract:
Clusters of symmetric multiprocessor nodes (SMP clusters) are one of the most important parallel architectures at the moment. The architecture consists of shared-memory nodes with multiple processors and a fast interconnection network between the nodes. New programming models try to exploit this architecture by using threads in the nodes and using message-passing-libraries for inter-node communication. In order to develop efficient algorithms, it is necessary to consider the hybrid nature of the architecture and of the programming models. We present the κNUMA-model and a methodology that build a good base for designing efficient algorithms for SMP clusters. The κNUMA-model is a computational model that extends the bulk-synchronous parallel (BSP) model with the characteristics of SMP clusters and new hybrid programming models. The κNUMA-methodology suggests developing efficient overall algorithms by developing efficient algorithms for each level in the hierarchy. We use the problem of personalized one-to-all-broadcast and the dense matrix-vector-multiplication for the presentation. The theoretical results of the analysis of the dense matrix-vector-multiplication are verified practically. We show results of experiments, made on a Linux-cluster of dual Pentium-III nodes.