Dissertations / Theses: 'Parallel Programming Frameworks'

1

Podobas, Artur. "Performance-driven exploration using Task-based Parallel Programming Frameworks." Licentiate thesis, KTH, Programvaruteknik och Datorsystem, SCS, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-122569.

2

Ali, Akhtar. "Comparative study of parallel programming models for multicore computing." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-94296.

Abstract:

Shared-memory multi-core processor technology has seen drastic development, with faster processors and an increasing number of processors per chip. This new architecture challenges computer programmers to write code that scales over these many cores to exploit the full computational power of these machines. Shared-memory parallel programming paradigms such as OpenMP and Intel Threading Building Blocks (TBB) are two recognized models that offer a higher level of abstraction, shield programmers from low-level details of thread management, and scale computation over all available resources. At the same time, the need for high-performance, power-efficient computing is compelling developers to exploit GPGPU computing due to the GPU's massive computational power and comparatively faster multi-core growth. This trend leads to systems with heterogeneous architectures containing multicore CPUs and one or more programmable accelerators such as programmable GPUs. There exist different programming models to program these architectures, and code written for one architecture is often not portable to another architecture. OpenCL is a relatively new industry standard framework, defined by the Khronos Group, which addresses the portability issue. It offers a portable interface to exploit the computational power of a heterogeneous set of processors such as CPUs, GPUs, DSP processors and other accelerators. In this work, we evaluate the effectiveness of OpenCL for programming multi-core CPUs in a comparative case study with two CPU-specific stable frameworks, OpenMP and Intel TBB, for five benchmark applications, namely matrix multiply, LU decomposition, image convolution, Pi value approximation and image histogram generation. The evaluation includes a performance comparison of the three frameworks and a study of the relative effects of applying compiler optimizations on performance numbers. OpenCL performance on two vendor-dependent platforms, Intel and AMD, is also evaluated. Then the same OpenCL code is ported to a modern GPU, and its code correctness and performance portability are investigated. Finally, the usability experience of coding using the three multi-core frameworks is presented.
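
As a rough illustration of one of the five benchmarks named above, the Pi value approximation can be written as a chunked numerical integration of 4/(1+x^2) over [0, 1]. The sketch below is not taken from the thesis: it uses Python rather than OpenMP, TBB or OpenCL, the function names `partial_pi` and `approximate_pi` are hypothetical, and the midpoint rule is an assumed choice of integration scheme. It shows only how the iteration space is split among workers; threads in CPython are not expected to give a real speedup here.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def partial_pi(start, stop, n):
    """Midpoint-rule contribution of intervals [start, stop) out of n."""
    h = 1.0 / n
    return h * sum(4.0 / (1.0 + ((i + 0.5) * h) ** 2) for i in range(start, stop))


def approximate_pi(n=100_000, workers=4):
    """Split the n intervals into contiguous chunks, one per worker, and add up the partial sums."""
    bounds = [(w * n // workers, (w + 1) * n // workers) for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda b: partial_pi(b[0], b[1], n), bounds)
        return sum(parts)


if __name__ == "__main__":
    print(abs(approximate_pi() - math.pi))
```

The same decomposition (independent partial sums followed by a reduction) is what a parallel-for with a reduction clause would express in the frameworks the thesis compares.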

3

Chavez, Daniel. "Parallelizing Map Projection of Raster Data on Multi-core CPU and GPU Parallel Programming Frameworks." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-190883.

Abstract:

Map projections lie at the core of geographic information systems, and numerous projections are in use today. Reprojection between different map projections is a recurring operation in a geographic information system, and it can be parallelized on multi-core CPUs and GPUs. This master's thesis implements a parallel analytic reprojection algorithm for raster data in C/C++ with the parallel programming frameworks Pthreads, C++11 STL threads, OpenMP, Intel TBB, CUDA and OpenCL. The thesis compares the execution times of the different implementations on small, medium and large raster data sets, where OpenMP had the best speedups of 6, 6.2 and 5.5, respectively. Meanwhile, the GPU implementations were 293 % faster than the fastest CPU implementations; profiling shows that the CPU implementations spend most of their time in trigonometric functions. The results show that the reprojection algorithm is well suited to the GPU, while OpenMP and Intel TBB are the fastest of the CPU frameworks.

4

Sonoda, Eloiza Helena. "OOPS - Object-Oriented Parallel System. Um framework de classes para a programação científica paralela." Universidade de São Paulo, 2006. http://www.teses.usp.br/teses/disponiveis/76/76132/tde-14022007-101855/.

Abstract:

This work describes the design and development of the OOPS (Object-Oriented Parallel System) class framework, a tool that uses object orientation to support the implementation of concurrent scientific programs for parallel execution. OOPS provides high-level abstractions so that the application programmer need not deal directly with many parallel implementation details. For reasons of efficiency and performance, some parallel design aspects, such as data partitioning and distribution, are not completely hidden from the application programmer. To this end, OOPS offers a set of classes that encapsulate techniques commonly used in parallel systems programming. Virtual processors are organized in groups, over which topologies that provide communication between the processors can be constructed; distributed containers have their elements distributed across the processors of a topology, and parallel components operate on these containers. Using the classes supplied by OOPS simplifies the implementation of parallel applications without adding significant overhead. OOPS is thus a thin layer over the message-passing library used for its implementation.

5

Torbey, Sami. "Towards a framework for intuitive programming of cellular automata." Thesis, Kingston, Ont. : [s.n.], 2007. http://hdl.handle.net/1974/929.

6

Hamdan, Mohammad M. "A combinational framework for parallel programming using algorithmic skeletons." Thesis, Heriot-Watt University, 2000. http://hdl.handle.net/10399/567.

7

Moraes, Sergio A. S. "A distributed processing framework with application to graphics." Thesis, University of Sussex, 1994. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.387338.

8

Cuello, Rosandra. "Providing Support for the Movidius Myriad1 Platform in the SkePU Skeleton Programming Framework." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-111844.

Abstract:

The Movidius Myriad1 Platform is a multicore embedded platform primed to offer high performance and power efficiency for computer vision applications in mobile devices. The challenges of programming multicore environments are well known, and skeleton programming offers a high-level programming alternative for parallel computing, intended to hide the complexities of the system from the programmer. The SkePU Skeleton Programming Framework includes backend implementations for CPU and GPU systems, and it can support more platforms through extension of its backend implementations. With this master's thesis project we aim to extend the SkePU Skeleton Programming Framework to provide support for execution on the Movidius Myriad1 embedded platform. Our SkePU backend for Myriad1 consists of a set of macros and functions to compose the different elements of a Myriad1 application, data communication structures to exchange data between the host systems and Myriad1, and a helper script and auxiliary files to generate a Myriad1 application. Evaluation and testing demonstrate that our backend is usable; however, further optimizations are needed to obtain performance good enough for practical use in real-life applications, particularly when it comes to data communication. As part of this project, we have outlined some improvements that could be applied to obtain better overall performance in the future, addressing the issues found with the methods of data communication.
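
The general idea of skeleton programming described above, a fixed parallel pattern parameterized by a user function with the backend chosen separately, can be sketched as follows. This is a hypothetical Python illustration only: the class `MapSkeleton` and its `backend` parameter are invented for this example and are not SkePU's C++ API, and the two backends shown (sequential and thread pool) merely stand in for the CPU, GPU and Myriad1 backends mentioned in the abstract.

```python
from concurrent.futures import ThreadPoolExecutor


class MapSkeleton:
    """A 'map' pattern: apply one user function to every element of a container."""

    def __init__(self, user_function, backend="sequential", workers=4):
        if backend not in ("sequential", "threads"):
            raise ValueError(f"unknown backend: {backend}")
        self.user_function = user_function
        self.backend = backend
        self.workers = workers

    def __call__(self, data):
        # The caller's code is identical for every backend; only this dispatch differs.
        if self.backend == "threads":
            with ThreadPoolExecutor(max_workers=self.workers) as pool:
                return list(pool.map(self.user_function, data))
        return [self.user_function(x) for x in data]


if __name__ == "__main__":
    square = MapSkeleton(lambda x: x * x, backend="threads")
    print(square([1, 2, 3, 4]))
```

Adding support for a new platform in this picture means adding one more branch behind the same call interface, which is the kind of extension the thesis carries out for Myriad1.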

9

Ernstsson, August. "Designing a Modern Skeleton Programming Framework for Parallel and Heterogeneous Systems." Licentiate thesis, Linköpings universitet, Programvara och system, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-170194.

Abstract:

Today's society is increasingly software-driven and dependent on powerful computer technology. Therefore it is important that advancements in the low-level processor hardware are made available for exploitation by a growing number of programmers of differing skill level. However, as we are approaching the end of Moore's law, hardware designers are finding new and increasingly complex ways to increase the accessible processor performance. It is getting more and more difficult to effectively target these processing resources without expert knowledge in parallelization, heterogeneous computation, communication, synchronization, and so on. To ensure that the software side can keep up, advanced programming environments and frameworks are needed to bridge the widening gap between hardware and software. One such example is the pattern-centric skeleton programming model and in particular the SkePU project. The work presented in this thesis first redesigns the SkePU framework based on modern C++ variadic template metaprogramming and state-of-the-art compiler technology. It then explores new ways to improve performance: by providing new patterns, improving the data access locality of existing ones, and using both static and dynamic knowledge about program flow. The work combines novel ideas with practical evaluation of the approach on several applications. The advancements also include the first skeleton API that allows variadic skeletons, new data containers, and finally an approach to make skeleton programming more customizable without compromising universal portability.

Additional research funders: EU H2020 project EXA2PRO (801015); SeRC.

10

Manasievski, Milan. "Asynchronous and parallel programming in .NET framework 4 and 4.5 using C#." Master's thesis, Česká zemědělská univerzita v Praze, 2015. http://www.nusl.cz/ntk/nusl-258694.

Abstract:

In this diploma thesis the author elaborates on asynchronous and parallel programming in the .NET Framework versions 4 and 4.5. The aim of the thesis is to provide better insight into the task-based programming model that Microsoft introduced and to compare different applications in terms of speed and the number of lines of code needed to write them, using simple statistics. Drawing on the literature gathered, the author explains the best ways to achieve parallelism in applications, describes the design patterns used, and provides code snippets that help the reader gain a better overall understanding of the Task Parallel Library and the benefits it offers in comparison with older methods and sequential programming.

11

Hook, Nicola K. "A formal framework in VDM for the specification of parallel discrete event simulation." Thesis, University of East Anglia, 1995. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.296805.

12

Fernandes, Davi Teodoro. "Implementação de framework computacional de paralelização híbrida do Moving Particle Semi-implicit Method para modelagem de fluidos incompressíveis." Universidade de São Paulo, 2013. http://www.teses.usp.br/teses/disponiveis/3/3135/tde-06072014-221307/.

Abstract:

The Numerical Offshore Tank (TPN) is a pioneering laboratory in applied hydrodynamics and the result of a collaboration between the Brazilian oil industry (PETROBRAS S.A.) and the country's major research institutions. Its main purpose is to act as a partner of the offshore and oil industry, contributing to the self-sufficiency of national oil production as a powerful tool for the design and analysis of floating oil and gas production systems. The heart of the TPN is a cluster of SMP computers that is today one of the largest clusters in Brazil for research purposes. One focus of attention at the TPN is the application of the Moving Particle Semi-implicit (MPS) method in exploring solutions to many engineering problems. Because it works without meshes (the traditional Eulerian approach), the method has several applications in the simulation of floating bodies and in applied hydrodynamics. It is currently used for studies of the influence of wave motion on ships and for simulations of phenomena involving fragmentation, free surfaces, large deformations, and fluid dynamics in extreme conditions, as in petroleum prospecting, where physical tests are often difficult and economically unviable. Due to the large number of particles used when simulating complex systems with the MPS method, the available computational resources must be used efficiently to analyze models with a refinement suitable for practical applications. With teraFLOPS available in the TPN cluster network for computational modeling, there is a great need for a highly scalable parallel computational solution that is also easy to maintain and extend. Within this line of research, a solution with these characteristics was developed through the use of modern software engineering techniques.

13

Krommydas, Konstantinos. "Towards Enhancing Performance, Programmability, and Portability in Heterogeneous Computing." Diss., Virginia Tech, 2017. http://hdl.handle.net/10919/77582.

Abstract:

The proliferation of a diverse set of heterogeneous computing platforms, in conjunction with the plethora of programming languages and of optimization techniques for each language on each underlying architecture, impedes widespread adoption of such platforms. This is especially true for novice programmers and the non-technical-savvy masses, who are largely precluded from enjoying the advantages of high-performance computing. Moreover, different groups within the heterogeneous computing community (e.g., hardware architects, tool developers, and programmers) are presented with new challenges with respect to performance, programmability, and portability (or the three P's) of heterogeneous computing. In this work we discuss such challenges and identify benchmarking techniques based on computation and communication patterns as an appropriate means for the systematic evaluation of heterogeneous computing with respect to the three P's. Our proposed approach is based on OpenCL implementations of the Berkeley dwarfs. We use our benchmark suite (OpenDwarfs) in characterizing the performance of state-of-the-art parallel architectures, and as the main component of a methodology (Telescoping Architectures) for identifying trends in future heterogeneous architectures. Furthermore, we employ OpenDwarfs in a multi-faceted study on the gaps between the three P's in the context of the modern heterogeneous computing landscape. Our case study spans a variety of compilers, languages, optimizations, and target architectures, including the CPU, GPU, MIC, and FPGA. Based on our insights, and extending aspects of prior research (e.g., in compilers, programming languages, and auto-tuning), we propose the introduction of grid-based data structures as the basis of programming frameworks and present a prototype unified framework (GLAF) that encompasses a novel visual programming environment with code generation, auto-parallelization, and auto-tuning capabilities. Our results, which span scientific domains, indicate that our holistic approach constitutes a viable alternative towards enhancing the three P's and further democratizing heterogeneous, parallel computing for non-programming-savvy audiences, and especially domain scientists.
Ph. D.

14

Kraemer, Eileen T. "A framework, tools, and methodology for the visualization of parallel and distributed systems." Diss., Georgia Institute of Technology, 1995. http://hdl.handle.net/1853/9214.

15

Schaefer, Linda Ruth. "Analysis of a coordination framework for mapping coarse-grain applications to distributed systems." PDXScholar, 1991. https://pdxscholar.library.pdx.edu/open_access_etds/4270.

Abstract:

A paradigm is presented for the parallelization of coarse-grain engineering and scientific applications. The coordination framework provides structure and an organizational strategy for a parallel solution in a distributed environment. Three categories of primitives which define the coordination framework are presented: structural, transformational, and operational. The prototype of the paradigm presented in this thesis is the first step towards a programming development tool. This tool will allow non-specialist programmers to parallelize existing sequential solutions through the distribution, synchronization and collection of tasks. The distributed-control, multidimensional-pipeline characteristics of the paradigm provide advantages which include load balancing through the use of self-directed workers, a simplified communication scheme ideally suited for infrequent task interaction, a simple programmer interface, and the ability of the programmer to use already existing code. Results for the parallelization of SPICE3C1 in a distributed system of fifteen SUN 3 workstations with one fileserver demonstrate linear speedup with slopes ranging from 0.7 to 0.9. A high-level abstraction of the system is presented in the form of a closed, single-class queuing network model. Using the Mean Value Analysis solution technique from queuing network theory, an expression for total execution time is obtained and is shown to be consistent with the well-known Amdahl's Law. Our expression is in fact a refinement of Amdahl's Law which realistically captures the limitations of the system. We show that the portion of time spent executing serial code which cannot be enhanced by parallelization is a function of N, the number of workers in the system. Experiments reveal the critical nature of the communication scheme and the synchronization of the paradigm. Investigation of the synchronization center indicates that as N increases, visitations to the center increase and degrade system performance. Experimental data provide the information needed to characterize the impact of visitations on the performance of the system. This characterization provides a mechanism for optimizing the speedup of an application. It is shown that the model replicates the system and predicts speedup over an extended range of processors, task count, and task size.
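
The relation to Amdahl's Law mentioned above can be made concrete with a small numerical sketch. The first function is the standard Amdahl speedup for a fixed serial fraction. The second is an assumption made purely for illustration: it lets the serial portion grow linearly with the number of workers N, which is one simple way to model a synchronization center that is visited more often as N increases. The linear form and the names `base_serial` and `per_worker_cost` are hypothetical and are not the expression derived in the thesis.

```python
def amdahl_speedup(n, serial_fraction):
    """Classical Amdahl's Law: the serial fraction is independent of n."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


def refined_speedup(n, base_serial, per_worker_cost):
    """Speedup relative to the sequential program when the serial part depends on n.

    Assumed model (not from the thesis): serial(n) = base_serial + per_worker_cost * n.
    """
    serial = base_serial + per_worker_cost * n
    parallel = 1.0 - base_serial
    return 1.0 / (serial + parallel / n)


if __name__ == "__main__":
    for n in (1, 5, 10, 15):
        print(n, round(amdahl_speedup(n, 0.05), 2), round(refined_speedup(n, 0.05, 0.002), 2))
```

With `per_worker_cost` set to zero the second function reduces to the first; with any positive cost the predicted speedup falls below the classical bound and the gap widens as N grows.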

16

Bangalore, Purushotham Venkataramaiah. "An open framework for developing distributed computing environments for multidisciplinary computational simulations." Diss., Mississippi State : Mississippi State University, 2003. http://library.msstate.edu/etd/show.asp?etd=etd-04082003-112124.

17

Ouali, Abdelkader. "Méthodes hybrides parallèles pour la résolution de problèmes d'optimisation combinatoire : application au clustering sous contraintes." Thesis, Normandie, 2017. http://www.theses.fr/2017NORMC215/document.

Abstract:

Combinatorial optimization problems have become the subject of much scientific research because of their importance in solving academic problems and real problems encountered in engineering and industry. Solving these problems by exact methods is often intractable because of the exorbitant processing time that these methods would require to reach the optimal solution(s). In this thesis, we were interested in the algorithmic context of solving combinatorial problems and in the modeling context of these problems. At the algorithmic level, we explored hybrid methods, which excel in their ability to make exact and approximate methods cooperate in order to rapidly produce high-quality solutions. At the modeling level, we worked on the specification and exact resolution of complex problems in pattern set mining, in particular by studying scaling issues on large databases. On the one hand, we proposed a first parallelization of the DGVNS algorithm, called CPDGVNS, which explores in parallel the different clusters provided by the tree decomposition while sharing the best overall solution in a master-worker model. Two other strategies, called RADGVNS and RSDGVNS, were proposed; they improve the frequency with which intermediate solutions are exchanged between the different processes. Experiments carried out on difficult combinatorial problems show the suitability and effectiveness of our parallel methods. On the other hand, we proposed a hybrid approach combining techniques from both Integer Linear Programming (ILP) and pattern mining. Our approach is complete and takes advantage of the general ILP framework (which provides a high level of flexibility and expressiveness) and of specialized heuristics for data exploration and extraction (to improve computing time). In addition to the general framework for pattern set mining, two problems were studied in particular: conceptual clustering and the tiling problem. The experiments carried out showed the contribution of our proposal in relation to constraint-based approaches and specialized heuristics.

18

Rengasamy, Vasudevan. "A Runtime Framework for Regular and Irregular Message-Driven Parallel Applications on GPU Systems." Thesis, 2014. http://etd.iisc.ac.in/handle/2005/3193.

Abstract:

The effective use of GPUs for accelerating applications depends on a number of factors, including effective asynchronous use of heterogeneous resources, reducing data transfer between CPU and GPU, increasing occupancy of GPU kernels, overlapping data transfers with computations, reducing GPU idling, and kernel optimizations. Overcoming these challenges requires considerable effort on the part of application developers. Most optimization strategies are proposed and tuned specifically for individual applications. Message-driven executions with over-decomposition of tasks constitute an important model for parallel programming and provide multiple benefits, including communication-computation overlap and reduced idling on resources. Charm++ is one such message-driven language, which employs over-decomposition of tasks, computation-communication overlap and a measurement-based load balancer to achieve high CPU utilization. This research has developed an adaptive runtime framework for efficient execution of Charm++ message-driven parallel applications on GPU systems. In the first part of our research, we developed a runtime framework, G-Charm, with the focus primarily on optimizing regular applications. At runtime, G-Charm automatically combines multiple small GPU tasks into a single larger kernel, which reduces the number of kernel invocations while improving CUDA occupancy. G-Charm also enables reuse of existing data in GPU global memory, and performs GPU memory management and dynamic scheduling of tasks across CPU and GPU in order to reduce idle time. To combine the partial results obtained from the computations performed on CPU and GPU, G-Charm allows the user to specify an operator with which the partial results are combined at runtime. We also perform compile-time code generation to reduce programming overhead. For Cholesky factorization, a regular parallel application, G-Charm provides a 14% improvement over a highly tuned implementation. In the second part of our research, we extended our runtime to overcome the challenges presented by irregular applications, such as aperiodic generation of tasks, irregular memory access patterns and varying workloads during application execution. We developed models for deciding the number of tasks that can be combined into a kernel based on the rate of task generation and the GPU occupancy of the tasks. For irregular applications, data reuse results in uncoalesced GPU memory access. We evaluated the effect of altering the global memory access pattern in improving coalesced access. We also developed adaptive methods for hybrid execution on CPU and GPU, wherein we consider the varying workloads while scheduling tasks across the CPU and GPU. We demonstrate that our dynamic strategies result in an 8-38% reduction in execution times for an N-body simulation application and a molecular dynamics application over the corresponding static strategies that are amenable to regular applications.
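
Two of the mechanisms described above, grouping many small tasks into fewer larger launches and merging partial results with a user-specified operator, can be sketched in a few lines. This is a hypothetical Python illustration and not G-Charm or Charm++ code: `batch_tasks`, `run_batch` and `combine` are invented names, the fixed `batch_size` stands in for the models the thesis uses to decide how many tasks to combine, and `run_batch` (a sum of squares) is a placeholder for a real combined kernel.

```python
import operator
from functools import reduce


def batch_tasks(tasks, batch_size):
    """Group small tasks so that each group can be issued as one larger launch."""
    return [tasks[i:i + batch_size] for i in range(0, len(tasks), batch_size)]


def run_batch(batch):
    """Placeholder for one combined kernel: here, a sum of squares over the batch."""
    return sum(x * x for x in batch)


def combine(partials, op=operator.add):
    """Merge partial results (e.g. from CPU and GPU) with a user-chosen binary operator."""
    return reduce(op, partials)


if __name__ == "__main__":
    tasks = list(range(10))
    batches = batch_tasks(tasks, batch_size=4)   # 3 launches instead of 10
    print(len(batches), combine(run_batch(b) for b in batches))
```

Passing a different operator, such as `max` or `operator.mul`, changes how the partial results are merged without touching the batching or execution code.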

19

Rengasamy, Vasudevan. "A Runtime Framework for Regular and Irregular Message-Driven Parallel Applications on GPU Systems." Thesis, 2014. http://hdl.handle.net/2005/3193.

Full text

Abstract:

The effective use of GPUs for accelerating applications depends on a number of factors, including effective asynchronous use of heterogeneous resources, reducing data transfer between CPU and GPU, increasing occupancy of GPU kernels, overlapping data transfers with computations, reducing GPU idling, and kernel optimizations. Overcoming these challenges requires considerable effort on the part of application developers. Optimization strategies are often proposed and tuned specifically for individual applications. Message-driven execution with over-decomposition of tasks constitutes an important model for parallel programming and provides multiple benefits, including communication-computation overlap and reduced idling on resources. Charm++ is one such message-driven language, which employs over-decomposition of tasks, computation-communication overlap and a measurement-based load balancer to achieve high CPU utilization. This research has developed an adaptive runtime framework for efficient execution of Charm++ message-driven parallel applications on GPU systems. In the first part of our research, we have developed a runtime framework, G-Charm, with the focus primarily on optimizing regular applications. At runtime, G-Charm automatically combines multiple small GPU tasks into a single larger kernel, which reduces the number of kernel invocations while improving CUDA occupancy. G-Charm also enables reuse of existing data in GPU global memory, and performs GPU memory management and dynamic scheduling of tasks across CPU and GPU in order to reduce idle time. In order to combine the partial results obtained from the computations performed on CPU and GPU, G-Charm allows the user to specify an operator with which the partial results are combined at runtime. We also perform compile-time code generation to reduce programming overhead. For Cholesky factorization, a regular parallel application, G-Charm provides a 14% improvement over a highly tuned implementation.
In the second part of our research, we extended our runtime to overcome the challenges presented by irregular applications, such as aperiodic generation of tasks, irregular memory access patterns and varying workloads during application execution. We developed models for deciding the number of tasks that can be combined into a kernel based on the rate of task generation and the GPU occupancy of the tasks. For irregular applications, data reuse results in uncoalesced GPU memory access. We evaluated the effect of altering the global memory access pattern on improving coalesced access. We have also developed adaptive methods for hybrid execution on CPU and GPU, wherein we consider the varying workloads while scheduling tasks across the CPU and GPU. We demonstrate that our dynamic strategies result in an 8-38% reduction in execution times for an N-body simulation application and a molecular dynamics application over the corresponding static strategies that are amenable to regular applications.
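
As a rough illustration of the task-combining idea described in this abstract (not G-Charm's actual implementation, which the abstract does not detail), the following Python sketch groups small pending tasks into batches, each standing in for one combined kernel launch, and merges CPU and GPU partial results with a user-supplied operator. All names (`combine_tasks`, `max_batch`, `merge_partials`) are hypothetical.

```python
from functools import reduce
from typing import Callable, Iterable, List, Sequence, TypeVar

T = TypeVar("T")


def combine_tasks(tasks: Sequence[T], max_batch: int) -> List[List[T]]:
    """Group small tasks into batches of at most max_batch items.

    Each batch stands in for one combined kernel launch, so fewer
    launches are needed than with one launch per task.
    """
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    return [list(tasks[i:i + max_batch]) for i in range(0, len(tasks), max_batch)]


def merge_partials(partials: Iterable[T], op: Callable[[T, T], T]) -> T:
    """Fold partial results (e.g. one from CPU, one from GPU) with a user-given operator."""
    return reduce(op, partials)


if __name__ == "__main__":
    batches = combine_tasks(list(range(10)), max_batch=4)
    print(len(batches))  # number of combined launches instead of 10 separate ones
    # Pretend the first batch ran on the CPU and the rest on the GPU.
    cpu_part = sum(batches[0])
    gpu_part = sum(sum(b) for b in batches[1:])
    print(merge_partials([cpu_part, gpu_part], lambda a, b: a + b))
```

In a real runtime the batch size would presumably be chosen from the task generation rate and kernel occupancy, as the abstract indicates; here `max_batch` is simply a fixed, assumed parameter.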

20

Jakadeesan, Gopinatha. "FT-PAS-A framework for pattern specific fault-tolerance in parallel programming." Thesis, 2009. http://spectrum.library.concordia.ca/976369/1/MR63279.pdf.

Abstract:

Fault-tolerance is an important requirement for long-running parallel applications. Many approaches to providing fault-tolerance for parallel systems are discussed in the literature. Most of them exhibit one or more of these shortcomings in delivering fault-tolerance: a non-specific solution (i.e., the fault-tolerance solution is general), no separation-of-concern (i.e., the application developer's involvement in implementing the fault-tolerance is significant), and limitation to a built-in fault-tolerance solution. In this thesis, we propose a different approach to delivering fault-tolerance to parallel programs using a-priori knowledge about their patterns. Our approach is based on the observation that different patterns require different fault-tolerance techniques (specificity). Consequently, we have contributed by classifying patterns into sub-patterns based on fault-tolerance strategies. Moreover, the core functionalities of these fault-tolerance strategies can be abstracted and pre-implemented generically, independent of a specific application. Thus, the pre-packaged solution hides their implementation details from the application developer (separation-of-concern). One such fault-tolerance model is designed and implemented here to demonstrate our idea. The Fault-Tolerant Parallel Architectural Skeleton (FT-PAS) model implements various fault-tolerance protocols targeted at a collection of frequently used patterns in parallel programming. Fault-tolerance protocol extension is another important contribution of this research. The FT-PAS model provides a set of basic building blocks as part of protocol extension in order to build new fault-tolerance protocols as needed for the available patterns. Finally, the uses of the model from the perspective of two user categories (i.e., an application developer and a protocol designer) are illustrated through examples.

21

Gardner, William Bennett. "CSP++ : an object-oriented application framework for software synthesis from CSP specifications." Thesis, 1999. https://dspace.library.uvic.ca//handle/1828/9350.

Abstract:

One of the useful formalisms for designing concurrent systems is the process algebra called CSP, or Communicating Sequential Processes. CSP statements can be used to model a system's control and data flow in an intuitive way, constituting a kind of hierarchical behavioral specification. Furthermore, when coupled with simulation and model-checking tools, these statements can be executed and debugged until the desired behavior has been accurately captured. Certain properties (such as absence of deadlocks) can be proved, to help verify the correctness of the design. To make the verified specifications executable in a practical sense, refinement to a programming language is required. In this work, a new object-oriented application framework is described which realizes the basic elements of CSP (processes, synchronizing events, and communication channels) in natural terms as C++ objects. In addition, a new software tool is provided to customize the framework by translating CSP statements into invocations of the framework elements. CSP specifications, thus re-expressed in C++ and compiled, form the control portion of a system, which can be linked with other software written in C++ that completes the functionality.
Graduate
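
To make the idea of CSP elements as objects more concrete, here is a minimal sketch in Python rather than C++. It is not the CSP++ framework's API; `Channel`, `producer`, `consumer` and `run_pipeline` are hypothetical names, and the channel only approximates CSP's synchronous rendezvous for a single sender and a single receiver.

```python
import queue
import threading


class Channel:
    """One-to-one channel: send() returns only after the receiver took the value."""

    def __init__(self):
        self._q = queue.Queue(maxsize=1)

    def send(self, value):
        self._q.put(value)
        self._q.join()        # block until the receiver acknowledges

    def receive(self):
        value = self._q.get()
        self._q.task_done()   # release the waiting sender
        return value


def producer(out, items):
    """Process that writes each item to its output channel."""
    for item in items:
        out.send(item)
    out.send(None)            # sentinel: no more values


def consumer(inp, results):
    """Process that doubles every value read from its input channel."""
    while (value := inp.receive()) is not None:
        results.append(value * 2)


def run_pipeline(items):
    """Run producer and consumer as two concurrent processes (threads here)."""
    chan, results = Channel(), []
    procs = [
        threading.Thread(target=producer, args=(chan, items)),
        threading.Thread(target=consumer, args=(chan, results)),
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3]))
```

Each process communicates only through the channel, which mirrors the CSP style of composing a system from processes that synchronize on events.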

22

Marques, Hélder de Almeida. "Towards an algorithmic skeleton framework for programming the Intel® Xeon Phi™ processor." Master's thesis, 2014. http://hdl.handle.net/10362/14394.

Abstract:

The Intel® Xeon Phi™ is the first processor based on Intel's MIC (Many Integrated Cores) architecture. It is a co-processor specially tailored for data-parallel computations, whose basic architectural design is similar to that of GPUs (Graphics Processing Units), leveraging many integrated cores of modest computational capability to perform parallel computations. The main novelty of the MIC architecture, relative to GPUs, is its compatibility with the Intel x86 architecture. This enables the use of many of the tools commonly available for the parallel programming of x86-based architectures, which may lead to a gentler learning curve. However, programming the Xeon Phi still entails aspects intrinsic to accelerator-based computing in general, and to the MIC architecture in particular. In this thesis we advocate the use of algorithmic skeletons for programming the Xeon Phi. Algorithmic skeletons abstract the complexity inherent to parallel programming, hiding details such as resource management, parallel decomposition, and inter-execution flow communication, thus removing these concerns from the programmer's mind. In this context, the goal of the thesis is to lay the foundations for the development of a simple but powerful and efficient skeleton framework for programming the Xeon Phi processor. For this purpose we build upon Marrow, an existing framework for the orchestration of OpenCL™ computations in multi-GPU and CPU environments. We extend Marrow to execute both OpenCL and C++ parallel computations on the Xeon Phi. To evaluate the newly developed framework, several well-known benchmarks, such as Saxpy and N-Body, are used both to compare its performance with that of the existing framework when executing on the co-processor and to assess the performance on the Xeon Phi versus a multi-GPU environment.
projects PTDC/EIA-EIA/113613/2009 (Synergy-VM) and PTDC/EEI-CTP/1837/2012 (SwiftComp) for financing the purchase of the Intel® Xeon Phi™

23

Tu, Yi-Hsuan, and 杜依璇. "EcoMap: An Interactive Framework for Parallel Execution of Functional Programming Commands on Wireless Sensor Networks." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/83562514036595020761.

Abstract:

Master's
National Tsing Hua University
Department of Computer Science
98
EcoMap is an execution framework that supports efficient over-the-air interactions with a network of wireless sensor nodes through parallel execution. It provides a command line interface in a full Python-based scripting environment on the host computer. A higher-level class library is provided for the user to access the sensor network, and a set of commands is provided to perform interactive accesses. EcoMap extends the innovative ideas of EcoExec from a single node to a group of nodes by supporting efficient functional programming constructs in terms of map, reduce, and filter primitives while supporting several variants of synchrony and job control options. The interactivity features of EcoMap encourage experimentation during development and help users become familiar with how to use the system, thereby significantly increasing the productivity of WSN developers. Experimental results also show that EcoMap incurs short delays, even when making major firmware changes and interacting with multiple nodes on resource-constrained wireless platforms.
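
The map, reduce, and filter primitives mentioned above can be pictured with a small host-side Python sketch. This is not EcoMap's actual class library, which the abstract does not specify; `SensorNode`, `read_temperature` and `parallel_map` are hypothetical names, and the nodes are simulated locally instead of being reached over the air.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from functools import reduce


@dataclass
class SensorNode:
    """Hypothetical stand-in for a remote wireless sensor node."""
    node_id: int
    temperature: float

    def read_temperature(self) -> float:
        # A real system would issue an over-the-air request here.
        return self.temperature


def parallel_map(fn, nodes):
    """Apply fn to every node concurrently, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        return list(pool.map(fn, nodes))


nodes = [SensorNode(1, 21.5), SensorNode(2, 30.0), SensorNode(3, 26.5)]

# map: query all nodes in parallel
readings = parallel_map(SensorNode.read_temperature, nodes)
# filter: keep only readings above an assumed threshold
hot = list(filter(lambda t: t > 25.0, readings))
# reduce: aggregate the remaining readings into one average
average_hot = reduce(lambda a, b: a + b, hot) / len(hot)
```

Running the per-node calls concurrently is what keeps the total latency close to that of the slowest node rather than the sum over all nodes.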

24

Fraga, António Fernando Crisóstomo. "Parallel Face Detection." Master's thesis, 2020. http://hdl.handle.net/10316/94026.

Abstract:

Integrated Master's dissertation in Electrical and Computer Engineering presented to the Faculty of Sciences and Technology
Face detection is typically used millions of times per day in many different contexts, and the resolution of the images has increased significantly. These high-resolution images can be a very demanding challenge for sequential architectures, since the overall performance of this type of implementation decreases drastically as the number of pixels rises. This thesis describes the implementation of a framework based on the Viola-Jones paper “Rapid Object Detection using a Boosted Cascade of Simple Features” [2] on parallel architectures such as GPUs and low-power GPUs. These emerge as natural candidates for the acceleration that we seek, offering very high computational power and core counts that enable the processing of such large amounts of data in parallel. The thesis also presents the parallelization and optimization of the implementation, using the advantages offered by these architectures to achieve an overall performance boost and speedup on high-resolution images compared with sequential architectures. An analysis of the results shows the successful implementation and the influence that the available GPU resources (power, CUDA cores, etc.) have on the overall GPU speedup as well as on its performance. This parallel face detector implementation obtained a global speedup of up to 33 times on 8k images in comparison with the sequential version.
