
Dissertations / Theses on the topic 'Parallel programming techniques'



Consult the top 28 dissertations / theses for your research on the topic 'Parallel programming techniques.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Pereira, Marcio Machado, 1959-. "Scheduling and serialization techniques for transactional memories." [s.n.], 2015. http://repositorio.unicamp.br/jspui/handle/REPOSIP/275547.

Full text
Abstract:
Advisors: Guido Costa Souza de Araújo, José Nelson Amaral
Doctoral thesis, Universidade Estadual de Campinas, Instituto de Computação
In the last few years, Transactional Memories (TMs) have been shown to be a parallel programming model that can effectively combine performance improvement with ease of programming. Moreover, the recent introduction of TM-based ISA extensions by major microprocessor manufacturers also seems to endorse TM as a programming model for today's parallel applications. One of the central issues in designing Software TM (STM) systems is to identify mechanisms or heuristics that can minimize the contention arising from conflicting transactions. Although a number of mechanisms have been proposed to tackle contention, such techniques have a limited scope, because conflict is avoided by either interrupting or serializing transaction execution, thus considerably impacting performance. This work explores a complementary approach to boost the performance of STM through the use of schedulers. A TM scheduler is a software component that decides when a particular transaction should be executed. Its effectiveness is very sensitive to the accuracy of the metrics used to predict transaction behaviour, particularly in high-contention scenarios. This work proposes a new Dynamic Transaction Scheduler (DTS) to select the transaction to execute next, based on a "reward for success" policy and an improved metric that measures with better precision the amount of effective work performed by a transaction. Hardware TMs (HTMs) are an interesting mechanism for implementing TM because they integrate support for transactions at the lowest, most efficient, architectural level. On the other hand, for some applications, HTMs can have their performance hindered by a lack of scalability and by limitations in cache store capacity. This work presents an extensive performance study of the implementation of HTM in the Haswell generation of Intel x86 processors. It evaluates the strengths and weaknesses of this new architecture by exploring several dimensions in the space of TM application characteristics. This detailed performance study provides insights into the constraints imposed by Intel's Transactional Synchronization Extensions (Intel TSX) and introduces a simple but efficient serialization policy for guaranteeing forward progress on top of Intel's best-effort HTM, which was critical to achieving performance.
Doctorate in Computer Science (Doutor em Ciência da Computação)
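
The serialization policy mentioned in the abstract follows a pattern that is standard on best-effort HTMs. As a minimal sketch (an illustration under assumed details, not the thesis's exact policy), using the Intel TSX RTM intrinsics from immintrin.h: retry the hardware transaction a bounded number of times, then acquire a global fallback lock that running transactions subscribe to, which guarantees forward progress.

    #include <immintrin.h>   // _xbegin/_xend/_xabort (compile with -mrtm)
    #include <atomic>

    std::atomic<bool> fallback_lock{false};  // taken only on the serial path

    template <typename F>
    void run_transaction(F&& critical_section, int max_retries = 3) {
        for (int attempt = 0; attempt < max_retries; ++attempt) {
            unsigned status = _xbegin();
            if (status == _XBEGIN_STARTED) {
                // Subscribe to the fallback lock: abort if a serialized
                // writer is active, so the two paths never run concurrently.
                if (fallback_lock.load(std::memory_order_relaxed))
                    _xabort(0xff);
                critical_section();
                _xend();                      // commit
                return;
            }
            // 'status' encodes the abort cause; a capacity abort, for
            // instance, will not succeed on retry, so a real policy would
            // inspect it instead of blindly retrying.
        }
        // Serialization path: guarantees progress when the best-effort
        // hardware keeps aborting the transaction.
        bool expected = false;
        while (!fallback_lock.compare_exchange_weak(expected, true))
            expected = false;
        critical_section();
        fallback_lock.store(false);
    }
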
2

Hind, Alan. "Parallel simulation techniques for telecommunication network modelling." Thesis, Durham University, 1994. http://etheses.dur.ac.uk/5520/.

Full text
Abstract:
In this thesis, we consider the application of parallel simulation to the performance modelling of telecommunication networks. A largely automated approach was first explored, using a parallelizing compiler to speed up the simulation of simple models of circuit-switched networks. This yielded reasonable results for relatively little effort compared with other approaches. However, more complex simulation models of packet- and cell-based telecommunication networks, requiring the use of discrete event techniques, need an alternative approach. A critical review of parallel discrete event simulation indicated that a distributed model components approach using conservative or optimistic synchronization would be worth exploring. Experiments were therefore conducted using simulation models of queuing networks and Asynchronous Transfer Mode (ATM) networks to explore the potential speed-up possible with this approach. Specifically, it is shown that these techniques can be used successfully to speed up the execution of useful telecommunication network simulations. A detailed investigation demonstrated that conservative synchronization performs very well for applications with good lookahead properties and sufficient message traffic density and, given such properties, will significantly outperform optimistic synchronization. Optimistic synchronization, however, gives reasonable speed-up for models with a wider range of such properties and can be optimized for speed-up and memory usage at run time. Thus, it is confirmed as being more generally applicable, particularly as model development is somewhat easier than for conservative synchronization. This has to be balanced against the more difficult task of developing and debugging an optimistic synchronization kernel and the application models.
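
The conservative synchronization whose sensitivity to lookahead the abstract describes boils down to a simple safety test: a logical process may execute a pending event only if no input channel can still deliver an earlier one. A minimal sketch of that test, Chandy-Misra-Bryant style (illustrative names, not the thesis's simulator):

    #include <queue>
    #include <vector>
    #include <algorithm>
    #include <limits>

    struct Event { double time; int payload; };
    struct CmpEvent {
        bool operator()(const Event& a, const Event& b) const { return a.time > b.time; }
    };

    struct LogicalProcess {
        std::priority_queue<Event, std::vector<Event>, CmpEvent> pending;
        std::vector<double> channel_clock;   // lower time bound per input channel
        double lookahead = 0.0;

        // Largest timestamp this LP may safely execute up to: the minimum
        // bound promised on every input channel.
        double safe_time() const {
            double t = std::numeric_limits<double>::infinity();
            for (double c : channel_clock) t = std::min(t, c);
            return t;
        }

        // Process everything provably in the past; otherwise block (or wait
        // for null messages that raise the channel clocks).
        void advance() {
            while (!pending.empty() && pending.top().time <= safe_time()) {
                Event e = pending.top(); pending.pop();
                // ... execute e; output events are stamped >= e.time + lookahead,
                // which is what keeps the neighbours' channel clocks moving.
            }
        }
    };
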
3

Lu, Kang Hsin. "Modelling of saturated traffic flow using highly parallel systems." Thesis, University of Sheffield, 1996. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.245726.

Full text
4

Papadopoulos, George Angelos. "Parallel implementation of concurrent logic languages using graph rewriting techniques." Thesis, University of East Anglia, 1989. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.329340.

Full text
5

Nautiyal, Sunil Datt. "Parallel computing techniques for investigating three dimensional collapse of a masonry arch." Thesis, University of Cambridge, 1994. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.320031.

Full text
6

Webb, Craig Jonathan. "Parallel computation techniques for virtual acoustics and physical modelling synthesis." Thesis, University of Edinburgh, 2014. http://hdl.handle.net/1842/15779.

Full text
Abstract:
The numerical simulation of large-scale virtual acoustics and physical modelling synthesis is a computationally expensive process. Time stepping methods, such as finite difference time domain (FDTD), can be used to simulate wave behaviour in models of three-dimensional room acoustics and virtual instruments. In the absence of any form of simplifying assumptions, and at high audio sample rates, this can lead to simulations that require many hours of computation on a standard Central Processing Unit (CPU). In recent years the video game industry has driven the development of Graphics Processing Units (GPUs) that are now capable of multi-teraflop performance using highly parallel architectures. Whilst these devices are primarily designed for graphics calculations, they can also be used for general purpose computing. This thesis explores the use of such hardware to accelerate simulations of three-dimensional acoustic wave propagation, and embedded systems that create physical models for the synthesis of sound. Test case simulations of virtual acoustics are used to compare the performance of workstation CPUs to that of Nvidia's Tesla GPU hardware. Using representative multicore CPU benchmarks, such simulations can be accelerated on the order of 5× for single precision and 3× for double precision floating-point arithmetic. Optimisation strategies are examined for maximising GPU performance when using single devices, as well as for multiple device codes that can compute simulations using billions of grid points. This allows the simulation of room models of several thousand cubic metres at audio rates such as 44.1 kHz, all within a usable time scale. The performance of alternative finite difference schemes is explored, as well as strategies for the efficient implementation of boundary conditions. Creating physical models of acoustic instruments requires embedded systems that often rely on sparse linear algebra operations. The performance efficiency of various sparse matrix storage formats is detailed in terms of the fundamental operations that are required to compute complex models, with an optimised storage system achieving substantial performance gains over more generalised formats. An integrated instrument model of the timpani drum is used to demonstrate the performance gains that are possible using the optimisation strategies developed throughout this thesis.
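
The core of such a simulation is a data-parallel stencil update, which is what maps so well onto GPU hardware. A minimal sketch of one time step of the standard second-order 3-D scheme, shown here on the CPU with OpenMP rather than CUDA (grid sizes, the flat indexing helper and the Courant-number parameter are illustrative assumptions):

    #include <vector>
    #include <omp.h>

    // One leapfrog step of the scalar wave equation on an nx x ny x nz grid.
    // lambda2 = (c * dt / dx)^2, the squared Courant number.
    void fdtd_step(std::vector<double>& p_new, const std::vector<double>& p,
                   const std::vector<double>& p_old, int nx, int ny, int nz,
                   double lambda2) {
        auto idx = [=](int x, int y, int z) { return (z * ny + y) * nx + x; };
        #pragma omp parallel for collapse(2)
        for (int z = 1; z < nz - 1; ++z)
            for (int y = 1; y < ny - 1; ++y)
                for (int x = 1; x < nx - 1; ++x) {
                    // Discrete 7-point Laplacian of the pressure field.
                    double lap = p[idx(x-1,y,z)] + p[idx(x+1,y,z)]
                               + p[idx(x,y-1,z)] + p[idx(x,y+1,z)]
                               + p[idx(x,y,z-1)] + p[idx(x,y,z+1)]
                               - 6.0 * p[idx(x,y,z)];
                    p_new[idx(x,y,z)] = 2.0 * p[idx(x,y,z)] - p_old[idx(x,y,z)]
                                      + lambda2 * lap;
                }
    }
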
7

Bayne, Ethan. "Accelerating digital forensic searching through GPGPU parallel processing techniques." Thesis, Abertay University, 2017. https://rke.abertay.ac.uk/en/studentTheses/702de12a-e10b-4daa-8baf-c2c57a501240.

Full text
Abstract:
Background: String searching within a large corpus of data is a critical component of digital forensic (DF) analysis techniques such as file carving. The continuing increase in capacity of consumer storage devices requires similar improvements to the performance of string searching techniques employed by DF tools used to analyse forensic data. As string searching is a trivially parallelisable problem, general purpose graphics processing unit (GPGPU) approaches are a natural fit. To date, only some of the research on GPGPU programming has been transferred to the field of DF, and that work used a closed-source GPGPU framework, the Compute Unified Device Architecture (CUDA). Findings from these earlier studies were that the local storage devices from which forensic data are read present an insurmountable performance bottleneck. Aim: This research hypothesises that modern storage devices no longer present a performance bottleneck to the processing techniques currently used in the field, and proposes that an open-standards GPGPU framework, the Open Computing Language (OpenCL), is better suited to accelerating file carving, with wider compatibility across an array of modern GPGPU hardware. It further hypothesises that a modern multi-string searching algorithm may be better adapted to the requirements of DF investigation. Methods: This research presents a review of existing research and tools used to perform file carving and acknowledges related work within the field. To test the hypotheses, parallel file carving software was created using C# and OpenCL, employing both a traditional string searching algorithm and a modern multi-string searching algorithm to analyse forensic data. A set of case studies demonstrates and evaluates the potential benefits of adopting the various methods of conducting string searching on forensic data. The research concludes with a final case study that evaluates the performance of file carving with the best-proposed string searching solution and compares the result with an existing file carving tool, Foremost. Results: The results establish that the parallelised OpenCL and Parallel Failureless Aho-Corasick (PFAC) solution delivers significantly greater processing improvements from the use of a single GPU, and of multiple GPUs, on modern hardware. In comparison to CPU approaches, the GPGPU processing models were observed to minimise the time required to search for larger numbers of patterns. The results also showed that employing PFAC delivers significant performance increases over the Boyer-Moore (BM) algorithm. The method employed to read data from storage devices was also seen to have a significant effect on the time required to perform string searching and file carving. Conclusions: Empirical testing shows the proposed string searching method to be more efficient than the widely adopted Boyer-Moore algorithms when applied to string searching and file carving. The developed OpenCL GPGPU processing framework was found to be more efficient than its CPU counterparts when searching for larger numbers of patterns within data. This research also refutes claims that file carving is limited solely by the performance of the storage device, and presents compelling evidence that performance is bound by the combination of the storage device's performance and the processing technique employed.
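
The Parallel Failureless Aho-Corasick algorithm adopted above removes the failure links from the classic automaton: one logical thread starts at every byte of the input and simply terminates on a mismatch, which suits the GPU's thousands of lightweight work-items. A minimal serial sketch of the idea (the outer loop stands in for the OpenCL work-items; names are illustrative):

    #include <array>
    #include <string>
    #include <utility>
    #include <vector>

    // A goto-only trie over bytes: no failure transitions at all.
    struct Trie {
        struct Node {
            std::array<int, 256> next;
            int match = -1;                        // pattern id if a pattern ends here
            Node() { next.fill(-1); }
        };
        std::vector<Node> nodes{1};                // node 0 is the root
        void add(const std::string& pat, int id) {
            int s = 0;
            for (unsigned char c : pat) {
                if (nodes[s].next[c] < 0) {
                    nodes[s].next[c] = (int)nodes.size();
                    nodes.emplace_back();
                }
                s = nodes[s].next[c];
            }
            nodes[s].match = id;
        }
    };

    // Each starting offset i is independent: on the GPU, one thread per i.
    std::vector<std::pair<size_t, int>> pfac_search(const Trie& t, const std::string& text) {
        std::vector<std::pair<size_t, int>> hits;  // (offset, pattern id)
        for (size_t i = 0; i < text.size(); ++i) {
            int s = 0;
            for (size_t j = i; j < text.size(); ++j) {
                s = t.nodes[s].next[(unsigned char)text[j]];
                if (s < 0) break;                  // failureless: the thread just dies
                if (t.nodes[s].match >= 0) hits.push_back({i, t.nodes[s].match});
            }
        }
        return hits;
    }
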
8

Srivastava, Rohit Kumar. "Modeling Performance of Tensor Transpose using Regression Techniques." The Ohio State University, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=osu1524080824154753.

Full text
9

Titos, Gil José Rubén. "Hardware Techniques for High-Performance Transactional Memory in Many-Core Chip Multiprocessors." Doctoral thesis, Universidad de Murcia, 2011. http://hdl.handle.net/10803/51473.

Full text
Abstract:
This thesis investigates efficient hardware implementations of transactional memory (HTM) systems in scalable chip multiprocessors (CMPs), identifying aspects that limit performance and proposing techniques that remedy those pathologies. It focuses on hardware mechanisms that provide optimistic concurrency control with guarantees of atomicity and isolation, with the intent of achieving high performance across a variety of workloads at a reasonable cost in design complexity. Both eager and lazy approaches to HTM system design are considered, and important sources of overhead inherent to each policy are addressed. Perhaps the most relevant contribution is ZEBRA, a hybrid-policy, adaptable HTM system that combines the advantages of both eager and lazy approaches in a low-complexity design and adapts its behaviour to the dynamic characteristics of the workload. Furthermore, the thesis investigates the overheads of simpler, fixed-policy HTM designs that leverage a distributed directory-based coherence protocol to detect data races over a scalable interconnect, and develops solutions that address several performance-degrading factors.
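
The eager/lazy distinction at the heart of the thesis is easiest to see in how the two policies manage speculative writes: eager designs write memory in place and keep an undo log for aborts, while lazy designs buffer writes and make them visible only at commit. A minimal word-granularity sketch (illustrative, software-level; the thesis implements such policies in hardware):

    #include <unordered_map>

    // Eager version management: write in place, log the old value once.
    struct EagerTx {
        std::unordered_map<int*, int> undo_log;
        void write(int* addr, int v) { undo_log.emplace(addr, *addr); *addr = v; }
        void abort() {
            for (auto& [a, old] : undo_log) *a = old;  // roll back
            undo_log.clear();
        }
        void commit() { undo_log.clear(); }            // data is already in place
    };

    // Lazy version management: buffer writes, publish them at commit.
    struct LazyTx {
        std::unordered_map<int*, int> write_buf;
        void write(int* addr, int v) { write_buf[addr] = v; }
        int read(int* addr) {                          // reads must see own writes
            auto it = write_buf.find(addr);
            return it != write_buf.end() ? it->second : *addr;
        }
        void abort() { write_buf.clear(); }            // nothing was made visible
        void commit() {
            for (auto& [a, v] : write_buf) *a = v;     // publish atomically in HW
            write_buf.clear();
        }
    };
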
10

Protze, Joachim. "Modular Techniques and Interfaces for Data Race Detection in Multi-Paradigm Parallel Programming." Berlin: epubli, 2021. http://d-nb.info/1239488076/34.

Full text
11

Albassam, Bader. "Enforcing Security Policies On GPU Computing Through The Use Of Aspect-Oriented Programming Techniques." Scholar Commons, 2016. http://scholarcommons.usf.edu/etd/6165.

Full text
Abstract:
This thesis presents a new security policy enforcer designed for securing parallel computation on CUDA GPUs. We show how the very features that make a GPGPU desirable have already been utilized in existing exploits, underscoring the need for security protections on a GPGPU. An aspect weaver was designed for CUDA with the goal of utilizing aspect-oriented programming for security policy enforcement. Empirical testing verified the ability of our aspect weaver to enforce various policies. Furthermore, a performance analysis demonstrated that using this policy enforcer introduces no significant performance impact over manual insertion of policy code. Finally, future research goals are presented through a plan of work. We hope that this thesis will provide long-term research goals to guide the field of GPU security.
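
Aspect-oriented enforcement means the policy code is woven around a join point (here, a kernel launch) instead of being hand-inserted at every call site. A minimal sketch of the weaving idea (the policy, the names and the launch stand-in are illustrative assumptions, not the thesis's weaver output):

    #include <cstddef>
    #include <cstdio>

    // "Advice": a policy checked before the join point, e.g. forbidding
    // out-of-bounds access to a device buffer.
    struct BoundsPolicy {
        std::size_t buf_len;
        bool check(std::size_t first, std::size_t count) const {
            return first + count <= buf_len;
        }
    };

    // The "weaver" output: before-advice runs, then the original join point.
    template <typename Launch>
    void weave_launch(const BoundsPolicy& p, std::size_t first, std::size_t count,
                      Launch&& launch) {
        if (!p.check(first, count)) {          // before-advice: enforce the policy
            std::fprintf(stderr, "policy violation: launch rejected\n");
            return;
        }
        launch(first, count);                  // proceed to the wrapped kernel launch
    }
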
12

Potgieter, Andrew. "A Parallel Multidimensional Weighted Histogram Analysis Method." Thesis, University of Cape Town, 2014. http://pubs.cs.uct.ac.za/archive/00000986/.

Full text
Abstract:
The Weighted Histogram Analysis Method (WHAM) is a technique used to calculate free energy from molecular simulation data. WHAM recombines biased distributions of samples from multiple Umbrella Sampling simulations to yield an estimate of the global unbiased distribution. The WHAM algorithm iterates two coupled, non-linear equations until convergence at an acceptable level of accuracy. The equations have quadratic time complexity for a single reaction coordinate, but this increases exponentially with the number of reaction coordinates under investigation, which makes multidimensional WHAM a computationally expensive procedure. There is potential to use general purpose graphics processing units (GPGPUs) to accelerate the execution of the algorithm. Here we develop and evaluate a multidimensional GPGPU WHAM implementation to investigate the potential speed-up attained over its CPU counterpart. In addition, to avoid the cost of multiple Molecular Dynamics simulations and to validate the implementations, we develop a test system to generate samples analogous to Umbrella Sampling simulations. We observe a maximum problem-size-dependent speed-up of approximately 19× for the GPGPU-optimized WHAM implementation over our single-threaded CPU-optimized version. We find that the WHAM algorithm is amenable to GPU acceleration, which provides the means to study ever more complex molecular systems in reduced time periods.
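
The two coupled WHAM equations alternate between re-estimating the unbiased distribution from all histograms and re-estimating each window's free-energy factor. A minimal serial sketch of that iteration for a single reaction coordinate (data layout and names are illustrative; the GPU version parallelises the loops over bins and windows):

    #include <vector>

    // hist[k][b]: sample counts from window k in bin b.
    // bias[k][b]: exp(-beta * U_k(x_b)), the biasing Boltzmann factor.
    // n_samples[k]: total samples N_k in window k.
    // f[k] stores exp(beta * f_k); prob[b] is the unbiased P(x_b).
    void wham_iterate(const std::vector<std::vector<double>>& hist,
                      const std::vector<std::vector<double>>& bias,
                      const std::vector<double>& n_samples,
                      std::vector<double>& f, std::vector<double>& prob,
                      int iters) {
        const size_t K = hist.size(), B = prob.size();
        for (int it = 0; it < iters; ++it) {
            for (size_t b = 0; b < B; ++b) {     // eq. 1: unbiased distribution
                double num = 0.0, den = 0.0;
                for (size_t k = 0; k < K; ++k) {
                    num += hist[k][b];
                    den += n_samples[k] * bias[k][b] * f[k];
                }
                prob[b] = num / den;
            }
            for (size_t k = 0; k < K; ++k) {     // eq. 2: window free energies
                double z = 0.0;
                for (size_t b = 0; b < B; ++b) z += bias[k][b] * prob[b];
                f[k] = 1.0 / z;                  // exp(beta * f_k) = 1 / sum_b ...
            }
        }
    }
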
13

Tristram, Waide Barrington. "Investigating tools and techniques for improving software performance on multiprocessor computer systems." Thesis, Rhodes University, 2012. http://hdl.handle.net/10962/d1006651.

Full text
Abstract:
The availability of modern commodity multicore processors and multiprocessor computer systems has resulted in the widespread adoption of parallel computers in a variety of environments, ranging from the home to workstation and server environments in particular. Unfortunately, parallel programming is harder and requires more expertise than the traditional sequential programming model. The variety of tools and parallel programming models available to the programmer further complicates the issue. The primary goal of this research was to identify and describe a selection of parallel programming tools and techniques to aid novice parallel programmers in the process of developing efficient parallel C/C++ programs for the Linux platform. This was achieved by highlighting and describing the key concepts and hardware factors that affect parallel programming, providing a brief survey of commonly available software development tools and parallel programming models and libraries, and presenting structured approaches to software performance tuning and parallel programming. Finally, the performance of several parallel programming models and libraries was investigated, along with the programming effort required to implement solutions using the respective models. A quantitative research methodology was applied to the investigation of the performance and programming effort associated with the selected parallel programming models and libraries, which included automatic parallelisation by the compiler, Boost Threads, Cilk Plus, OpenMP, POSIX threads (Pthreads), and Threading Building Blocks (TBB). Additionally, the performance of the GNU C/C++ and Intel C/C++ compilers was examined. The results revealed that the choice of parallel programming model or library depends on the type of problem being solved and that there is no overall best choice for all classes of problem. However, the results also indicate that parallel programming models with higher levels of abstraction require less programming effort and provide similar performance compared to explicit threading models. The principal conclusion was that problem analysis and parallel design are important factors in the selection of the parallel programming model and tools, but that models with higher levels of abstraction, such as OpenMP and Threading Building Blocks, are favoured.
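
The abstraction gap the thesis measures is easy to illustrate: a parallel reduction is a single directive in OpenMP, whereas the same computation with Pthreads requires explicit thread creation, data partitioning, joining and combination of partial sums. A minimal example of the high-abstraction version (illustrative, not code from the thesis):

    #include <cstdio>
    #include <vector>
    #include <omp.h>

    int main() {
        std::vector<double> v(1 << 20, 0.5);
        double sum = 0.0;
        // One directive: the runtime partitions the iterations and
        // combines the per-thread partial sums.
        #pragma omp parallel for reduction(+ : sum)
        for (long i = 0; i < (long)v.size(); ++i)
            sum += v[i];
        std::printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
    }
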
14

Gouin, Florian. "Méthodologie de placement d'algorithmes de traitement d'images sur architecture massivement parallèle." Thesis, Paris Sciences et Lettres (ComUE), 2019. http://www.theses.fr/2019PSLEM075.

Full text
Abstract:
In industry, the race towards higher image-sensor definitions directly increases the amount of data to be processed in the image processing domain. In embedded settings, the same algorithms frequently have the additional constraint of meeting real-time deadlines. The challenge is then to find a solution offering moderate power consumption, sustained computing performance and high memory bandwidth for data delivery. The massively parallel design of GPUs is especially well suited to this kind of task. However, the architecture is complex to handle: it has multiple hierarchical levels of memory and computation, and the accelerator sits inside a globally heterogeneous architecture. Consequently, the code transformations needed to map an algorithm onto a GPU while exploiting the architecture's capabilities are not trivial operations. In this thesis, we developed a methodology for mapping sequential algorithms onto GPUs. It is made up of code analysis phases, mapping criteria verifications, code transformations and a final code generation phase. Part of the defined mapping criteria is designed to ensure the legality of the mapping, by considering GPU hardware specificities, whereas the other part is used to improve runtimes. In addition, we studied the performance of the different GPU memories and the capacity of Nvidia GPUs to efficiently support coarse-grain parallelism; this complementary work is a foundation for adding further criteria to the methodology that maximise the exploitation of GPU resources. Finally, the experimental results show both the functional reliability of the codes mapped onto the GPU and a speedup in the runtime of several industrial image processing applications written in C or C++.
15

Fernandez, Alonso Eduard. "Offloading Techniques to Improve Performance on MPI Applications in NoC-Based MPSoCs." Doctoral thesis, Universitat Autònoma de Barcelona, 2014. http://hdl.handle.net/10803/284889.

Full text
Abstract:
Future embedded Systems-on-Chip (SoCs) will probably be made up of tens or hundreds of heterogeneous Intellectual Property (IP) cores, executing one parallel application or even several applications running in parallel. Such systems are becoming possible thanks to the constant evolution in technology following Moore's law, which leads us to integrate more transistors on a single die, or the same number of transistors on a smaller die. In embedded MPSoC systems, Networks-on-Chip (NoCs) can provide a flexible communication infrastructure in which several components, such as microprocessor cores, MCUs, DSPs, GPUs, memories and other IP components, can be interconnected. In this thesis we first present a complete development process created for developing MPSoCs on reconfigurable clusters, complementing the current SoC development process with additional steps to support parallel programming and software optimization. This work systematically explains the problems of, and solutions for, achieving an FPGA-based MPSoC following our systematic flow, and offers tools and techniques for developing parallel applications for such systems. Additionally, we discuss several programming models for embedded MPSoCs, propose the adoption of MPI for such systems, and show some implementations created in this thesis over shared- and distributed-memory architectures. Finally, we focus on the overhead produced by the MPI library and on finding solutions that minimize this overhead and thereby accelerate application execution, by offloading parts of the software stack to the Network Interface Controller.
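
MPI's appeal for such systems is that the same source runs unchanged on every core, with communication expressed as explicit messages that a NoC can carry. A minimal sketch of the model (illustrative; on the thesis's platform it is parts of the send/receive path below that are candidates for offloading to the Network Interface Controller):

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // which core/IP am I?
        int payload = 42;
        if (rank == 0) {
            // Explicit message from core 0 to core 1 over the interconnect.
            MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("rank 1 received %d\n", payload);
        }
        MPI_Finalize();
    }
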
16

Menezes, Marlim Pereira. "Metodologia para paralelização e otimização de modelos matemáticos e computacionais, utilizando uma nova linguagem de programação." Universidade de São Paulo, 2013. http://www.teses.usp.br/teses/disponiveis/3/3143/tde-16122014-153819/.

Full text
Abstract:
The goal of this research is an efficient methodology for assisting users in transforming mathematical and computational models coded for sequential computers into parallel models optimized to run on modern personal computers, built around CPUs with multiple cores, or hybrid chips (CPU + GPGPU), with or without massively parallel graphics processors (GPGPUs) installed, while preserving the quality of the original results with respect to numerical accuracy but considerably reducing processing time. The emergence of these new hardware architectures in the mid-2000s raised the processing power of personal computers to the level of the mainframe computers of only a few years earlier. This research work presents two methodologies: the first is composed of three parts and the second of two parts. Only the third part of the first methodology depends on hardware technologies.
17

Protze, Joachim. "Modular techniques and interfaces for data race detection in multi-paradigm parallel programming." Advisors: Matthias S. Müller, Jesper L. Träff, Martin Schulz. Aachen: Universitätsbibliothek der RWTH Aachen, 2021. http://d-nb.info/1241683255/34.

Full text
18

Ewen, Stephan. "Programming abstractions, compilation, and execution techniques for massively parallel data analysis." Advisor: Volker Markl. Reviewers: Peter Pepper, Volker Markl, Odej Kao, Michael Carey. Berlin: Technische Universität Berlin, 2015. http://d-nb.info/1070580619/34.

Full text
19

Trevizan, Marcelo Porto. "Processamento paralelo na simulação de campos eletromagnéticos pelo método das diferenças finitas no domínio do tempo - FDTD." Universidade de São Paulo, 2007. http://www.teses.usp.br/teses/disponiveis/3/3142/tde-14052007-173206/.

Full text
Abstract:
Research and projects involving electromagnetics are continuously increasing. For both research and design work, computer simulation of the problems involved is a valuable resource for investigating the behaviour of electromagnetic phenomena in a given situation. There are cases, however, in which the problem becomes computationally large, demanding large amounts of memory and long processing times, because of the geometries involved or the accuracy desired. Parallel computing was developed to work around such issues. One possible implementation of a parallel system is a network of computers and, by employing free software, it can be realised at practically no cost. The present work implements such a parallel system using the FDTD method. During development, special attention was paid to good programming practices, in order to guarantee the program's flexibility, modularity and extensibility. In addition, a mathematical tool was developed to estimate the total processing time of a parallel simulation and to indicate parameter adjustments that make this time as short as possible. The code, the parallel system and the mathematical tool are validated with examples. Finally, a study of a practical application of interest is carried out with the developed tool.
20

Pai, Satish. "Multiplexed pipelining : a cost effective loop transformation technique." PDXScholar, 1992. https://pdxscholar.library.pdx.edu/open_access_etds/4425.

Full text
Abstract:
Parallel processing has gained increasing importance over the last few years. A key aim of parallel processing is to improve the execution times of scientific programs by mapping them to many processors. Loops form an important part of most computational programs and must be processed efficiently to get superior performance in terms of execution times. Important examples of such programs include graphics algorithms, matrix operations (which are used in signal processing and image processing applications), particle simulation, and other scientific applications. Pipelining uses overlapped parallelism to efficiently reduce execution time.
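
Software pipelining overlaps stages of successive loop iterations so that the steady-state loop body keeps all stages busy. A minimal illustration of the generic prologue/kernel/epilogue shape (the thesis's multiplexed pipelining is a specific cost-effective variant; this sketch, with an assumed two-stage split, only shows the basic transformation):

    // Original loop body conceptually split into two stages:
    //   stage 1: t = a[i] * 2.0f   (load + scale)
    //   stage 2: b[i] = t          (store)
    void pipelined(const float* a, float* b, int n) {
        if (n <= 0) return;
        float t = a[0] * 2.0f;            // prologue: stage 1 of iteration 0
        for (int i = 1; i < n; ++i) {
            b[i - 1] = t;                 // kernel: stage 2 of iteration i-1 ...
            t = a[i] * 2.0f;              // ... overlapped with stage 1 of iteration i
        }
        b[n - 1] = t;                     // epilogue: stage 2 of the last iteration
    }
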
21

Santos, Walter Meneghette dos. "Estudo e desenvolvimento de paralelismo de inversores para aplicação fotovoltaica conectados à rede elétrica." Universidade Tecnológica Federal do Paraná, 2013. http://repositorio.utfpr.edu.br/jspui/handle/1/883.

Full text
Abstract:
Photovoltaic systems have spread globally as a clean energy technology that can be used over most of the planet, which makes them very attractive for distributed generation. The key component for harnessing photovoltaic energy in distributed generation is the grid-connected inverter, so the efficiency of this equipment directly influences how much of the energy generated by the photovoltaic panels is used and, consequently, the system's payback time. The seasonal behaviour of energy generation, in which the inverter operates most of the time between 10% and 90% of capacity, especially in systems without tracking, means the inverter cannot be judged only by its full-load efficiency, but rather by the complete efficiency curve over the whole operating range. The proposed method for improving system efficiency at low power is to connect several low-power inverters to the grid in parallel and operate them in a staggered fashion, so that at low power the efficiency is higher than with a single inverter. This work also evaluates the consequences of the parallelism for the total harmonic distortion of the current, as well as the advantages of extended equipment lifetime and of redundancy. Four 300 W full-bridge inverters were implemented, with a switching and sampling frequency of 21.6 kHz, each controlled by a Freescale 56F8014 DSC, together with a monitoring device based on a PIC18F4520 microcontroller. All devices have an isolated UART communication interface using the LIN protocol. The inverters were tested in a continuous power-sharing mode, in which all inverters operate with identical shares of the power, and in a staggered mode, in which inverters come into operation according to the demanded power. The results show a 3.7% improvement in efficiency of the staggered scheme over continuous power sharing, evaluated by the weighted efficiency of the system (IEC 61836).
22

Michael, Gavin Constantine. "Compilation techniques for multicomputers." PhD thesis, 1996. http://hdl.handle.net/1885/117301.

Full text
Abstract:
This thesis considers problems in process and data partitioning when compiling programs for distributed-memory parallel computers (or multicomputers). These partitions may be specified by the user through the use of language constructs, or automatically determined by the compiler. Data and process partitioning techniques are developed for two models of compilation. The first compilation model focusses on the loop nests present in a serial program. Executing the iterations of these loop nests in parallel accounts for a significant amount of the parallelism which can be exploited in these programs. The parallelism is exploited by applying a set of transformations to the loop nests. The iterations of the transformed loop nests are in a form which can be readily distributed amongst the processors of a multicomputer. The manner in which the arrays, referenced within these loop nests, are partitioned between the processors is determined by the distribution of the loop iterations. The second compilation model is based on the data parallel paradigm, in which operations are applied to many different data items collectively. High Performance Fortran is used as an example of this paradigm. Novel collective communication routines are developed, and are applied to provide the communication associated with the data partitions for both compilation models. Furthermore, it is shown that by using these routines the communication associated with partitioning data on a multicomputer is greatly simplified. These routines are developed as part of this thesis. The experimental context for this thesis is the development of a compiler for the Fujitsu AP1000 multicomputer. A prototype compiler is presented. Experimental results for a variety of applications are included.
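
In the first compilation model, once a loop nest has been transformed, its iterations (and the array sections they reference) are distributed across the processors. A minimal sketch of a block partition of the iteration space, owner-computes style (names are illustrative, not the thesis's compiler code):

    #include <algorithm>

    struct Block { long lo, hi; };            // this processor's iterations [lo, hi)

    // Split n_iters iterations across nprocs processors as evenly as possible;
    // processor p executes only its own block and owns the matching array section.
    Block my_block(long n_iters, int p, int nprocs) {
        long base = n_iters / nprocs, rem = n_iters % nprocs;
        long lo = p * base + std::min<long>(p, rem);
        long hi = lo + base + (p < rem ? 1 : 0);
        return {lo, hi};
    }
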
23

Huang, Zhigang. "A recursive algorithm for reliability assessment in water distribution networks with applications of parallel programming techniques." Thesis, 1994. https://figshare.com/articles/thesis/A_recursive_algorithm_for_reliability_assessment_in_water_distribution_networks_with_applications_of_parallel_programming_techniques/13425371.

Full text
Abstract:
The project models the reliability of an urban water distribution network. Reliability is one of the fundamental considerations in the design of urban water distribution networks. The reliability of a network can be modelled by the probability of the connectedness of a stochastic graph. The enumeration of a set of cuts of the graph, and the calculation of the disjoint probability products of the cuts, are two fundamental steps in network reliability assessment. An improved algorithm for the enumeration of all the minimal cutsets of a graph is presented. Based on this, a recursive algorithm for the enumeration of all Buzacott cuts (a particular set of ordered cuts) of a graph has been developed. The final algorithm presented in this thesis incorporates the enumeration of Buzacott cuts and the calculation of the disjoint probability products of the cuts to obtain the network reliability. As a result, it is tightly coupled and very efficient. Experimental results show that this algorithm has a higher efficiency than other reported methods. The parallelism existing in the reliability assessment is investigated. The final algorithm has been implemented in a concurrent computer program, and the effectiveness of parallel programming techniques in reducing the computing time required by the reliability assessment is also discussed.
24

Faraj, Ahmad A. "Automatic empirical techniques for developing efficient MPI collective communication routines." Florida State University, 2006. http://etd.lib.fsu.edu/theses/available/07072006-162046.

Full text
Abstract:
Thesis (Ph. D.)--Florida State University, 2006.
Advisor: Xin Yuan, Florida State University, College of Arts and Sciences, Dept. of Computer Science. Title and description from dissertation home page (viewed Sept. 19, 2006). Document formatted into pages; contains xiii, 162 pages. Includes bibliographical references.
25

Abell, Stephen W. "Parallel acceleration of deadlock detection and avoidance algorithms on GPUs." Thesis, 2013. http://hdl.handle.net/1805/3653.

Full text
Abstract:
Indiana University-Purdue University Indianapolis (IUPUI)
Current mainstream computing systems have become increasingly complex. Most of them have Central Processing Units (CPUs) that invoke multiple threads for their computing tasks. The growing issue with these systems is resource contention, and with resource contention comes the risk of encountering deadlock. Various software and hardware approaches exist that implement deadlock detection/avoidance techniques; however, they lack either the speed or the problem-size capability needed for real-time systems. The research conducted for this thesis aims to resolve issues present in past approaches by converging the two platforms (software and hardware) by means of the Graphics Processing Unit (GPU). Presented in this thesis are two GPU-based deadlock detection algorithms and one GPU-based deadlock avoidance algorithm: (i) GPU-OSDDA, a GPU-based single-unit resource deadlock detection algorithm; (ii) GPU-LMDDA, a GPU-based multi-unit resource deadlock detection algorithm; and (iii) GPU-PBA, a GPU-based deadlock avoidance algorithm. Both GPU-OSDDA and GPU-LMDDA utilize the Resource Allocation Graph (RAG) to represent resource allocation status in the system. However, the RAG is represented using integer-length bit-vectors. This approach brings several advantages: (i) less memory is required for the algorithm matrices, (ii) 32 computations are performed per instruction (in most cases), and (iii) our algorithms can handle large numbers of processes and resources. The deadlock detection algorithms also require minimal interaction with the CPU by implementing matrix storage and algorithm computations on the GPU, thus providing an interactive-service type of behavior. As a result of this approach, both algorithms achieve speedups over two orders of magnitude higher than their serial CPU implementations (3.17-317.42× for GPU-OSDDA and 37.17-812.50× for GPU-LMDDA). Lastly, GPU-PBA is the first parallel deadlock avoidance algorithm implemented on the GPU. While it does not achieve a two-orders-of-magnitude speedup over its CPU implementation, it does provide a platform for future deadlock avoidance research on the GPU.
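
The bit-vector encoding is what lets the GPU algorithms touch 32 edges of the Resource Allocation Graph with one integer instruction. A minimal sketch of the representation and of the row-wise OR used when propagating reachability (sizes and names are illustrative; on the GPU, one thread handles one 32-bit word):

    #include <cstdint>
    #include <vector>

    struct BitMatrix {
        int n;                              // n x n adjacency matrix
        int words;                          // 32-bit words per row
        std::vector<uint32_t> bits;
        explicit BitMatrix(int n_)
            : n(n_), words((n_ + 31) / 32), bits((size_t)n_ * words, 0) {}

        void set(int r, int c)       { bits[(size_t)r * words + c / 32] |= 1u << (c % 32); }
        bool get(int r, int c) const { return bits[(size_t)r * words + c / 32] >> (c % 32) & 1u; }

        // The kernel-style primitive for propagating reachability (e.g. when
        // looking for cycles): dst |= src, 32 edges per word operation.
        void row_or(int dst, int src) {
            for (int w = 0; w < words; ++w)
                bits[(size_t)dst * words + w] |= bits[(size_t)src * words + w];
        }
    };
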
26

Nagarakatte, Santosh G. "Spill Code Minimization And Buffer And Code Size Aware Instruction Scheduling Techniques." Thesis, 2007. http://hdl.handle.net/2005/507.

Full text
Abstract:
Instruction scheduling and software pipelining are important compilation techniques that reorder instructions in a program to exploit instruction level parallelism. They are essential for enhancing instruction level parallelism in architectures such as Very Long Instruction Word (VLIW) and tiled processors. This thesis addresses two important problems in the context of these instruction reordering techniques. The first concerns general purpose applications and architectures, while the second concerns media and graphics applications for tiled and multi-core architectures. The first problem deals with software pipelining, an instruction scheduling technique that overlaps instructions from multiple iterations. Software pipelining increases register pressure, and hence it may be necessary to introduce spill instructions. In this thesis, we model the problem of register allocation with optimal spill code generation and scheduling in software pipelined loops as a 0-1 integer linear program. By minimizing the amount of spill code produced, the formulation ensures that the initiation interval (II) between successive iterations of the loop is not increased unnecessarily. Experimental results show that our formulation performs better than existing heuristics by preventing an increase in the II and also generating less spill code on average for loops extracted from the Perfect Club and SPEC benchmarks. The second major contribution of the thesis deals with code-size-aware scheduling of stream programs. Large-scale synchronous dataflow graphs (SDFs) and StreamIt have emerged as powerful programming models for high performance streaming applications. In these models, a program is represented as a dataflow graph in which each node represents an autonomous filter and the edges represent the channels through which the nodes communicate. In constructing static schedules for programs in these models, it is important to optimize the execution-time buffer requirements of the data channels and the space required to store the encoded schedule. Earlier approaches have either given priority to one of the requirements or proposed ad hoc methods for generating schedules with good trade-offs. In this thesis, we propose a genetic algorithm framework based on non-dominated sorting for generating serial schedules with a good trade-off between code size and buffer requirement. We extend the framework to generate software pipelined schedules for tiled architectures. From our experiments, we observe that the genetic algorithm framework generates schedules with a good trade-off and performs better than earlier approaches.
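
For context, the initiation interval that the formulation is careful not to increase is bounded below by the standard resource and recurrence constraints of modulo scheduling; this is textbook background, not the thesis's ILP itself:

    \[
      \mathrm{MII} = \max\bigl(\mathrm{ResMII},\, \mathrm{RecMII}\bigr), \qquad
      \mathrm{ResMII} = \max_{r} \left\lceil \frac{|\{\text{ops using resource } r\}|}{|\{\text{units of } r\}|} \right\rceil, \qquad
      \mathrm{RecMII} = \max_{\theta \in \text{cycles}} \left\lceil \frac{\mathrm{delay}(\theta)}{\mathrm{distance}(\theta)} \right\rceil
    \]

A spill instruction consumes memory-unit slots and lengthens dependence chains, which is why careless spilling pushes II above this lower bound; the thesis's 0-1 ILP minimizes spill code precisely to keep the achieved II at or near MII.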
27

Iyer, Neeraj. "Machine Vision Assisted In Situ Ichthyoplankton Imaging System." 2013. http://hdl.handle.net/1805/3368.

Full text
Abstract:
Indiana University-Purdue University Indianapolis (IUPUI)
Recently there has been a lot of effort in developing systems for sampling and automatically classifying plankton from the oceans. Existing methods assume the specimens have already been precisely segmented, or aim at analyzing images containing a single specimen (extracting its features and/or recognizing specimens as single in-focus targets in small images). The resolution of the existing systems is limiting. Our goal is to develop automated, very high resolution image sensing of critically important, yet under-sampled, components of the planktonic community by addressing both the physical sensing system (e.g. camera, lighting, depth of field) and the crucial image extraction and recognition routines. The objective of this thesis is to develop a framework that aims at (i) detecting and segmenting all organisms of interest automatically, directly from the raw data, while filtering out noise and out-of-focus instances, (ii) extracting the best features from the images, and (iii) identifying and classifying the plankton species. Our approach focuses on utilizing the full computational power of a multicore system by implementing a parallel programming approach that can process large volumes of high resolution plankton images obtained from our newly designed In Situ Ichthyoplankton Imaging System (ISIIS). We compare some of the widely used segmentation methods, with emphasis on accuracy and speed, to find the one that works best on our data. We design a robust, scalable, fully automated system for high-throughput processing of the ISIIS imagery.
28

Kriske, Jeffery Edward Jr. "A scalable approach to processing adaptive optics optical coherence tomography data from multiple sensors using multiple graphics processing units." Thesis, 2014. http://hdl.handle.net/1805/6458.

Full text
Abstract:
Indiana University-Purdue University Indianapolis (IUPUI)
Adaptive optics-optical coherence tomography (AO-OCT) is a non-invasive method of imaging the human retina in vivo. It can be used to visualize microscopic structures, making it incredibly useful for the early detection and diagnosis of retinal disease. The research group at Indiana University has a novel multi-camera AO-OCT system capable of 1 MHz acquisition rates. Until this point, a method has not existed to process data from such a novel system quickly and accurately enough on a CPU, a GPU, or one that can scale to multiple GPUs automatically in an efficient manner. This is a barrier to using a MHz AO-OCT system in a clinical environment. A novel approach to processing AO-OCT data from the unique multi-camera optics system is tested on multiple graphics processing units (GPUs) in parallel with one, two, and four camera combinations. The design and results demonstrate a scalable, reusable, extensible method of computing AO-OCT output. This approach can either achieve real time results with an AO-OCT system capable of 1 MHz acquisition rates or be scaled to a higher accuracy mode with a fast Fourier transform of 16,384 complex values.
