
Dissertations / Theses on the topic 'Architecture manycore'



Consult the top 38 dissertations / theses for your research on the topic 'Architecture manycore.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Cho, Myong Hyon. "On-chip networks for manycore architecture." Thesis, Massachusetts Institute of Technology, 2013. http://hdl.handle.net/1721.1/84885.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2013.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 109-116).
Over the past decade, increasing the number of cores on a single processor has successfully enabled continued improvements of computer performance. Further scaling these designs to tens and hundreds of cores, however, still presents a number of hard problems, such as scalability, power efficiency and effective programming models. A key component of manycore systems is the on-chip network, which faces increasing efficiency demands as the number of cores grows. In this thesis, we present three techniques for improving the efficiency of on-chip interconnects. First, we present PROM (Path-based, Randomized, Oblivious, and Minimal routing) and BAN (Bandwidth Adaptive Networks), techniques that offer efficient intercore communication for bandwidth-constrained networks. Next, we present ENC (Exclusive Native Context), the first deadlock-free, fine-grained thread migration protocol developed for on-chip networks. ENC demonstrates that a simple and elegant technique in the on-chip network can provide critical functional support for higher-level application and system layers. Finally, we provide a realistic context by sharing our hands-on experience in the physical implementation of the on-chip network for the Execution Migration Machine, an ENC-based 110-core processor fabricated in 45nm ASIC technology. (An illustrative routing sketch follows this entry.)
by Myong Hyon Cho.
Ph.D.
APA, Harvard, Vancouver, ISO, and other styles
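The abstract describes PROM only at a high level; as a hedged illustration of the general idea behind randomized, oblivious, minimal routing on a 2D mesh (not Cho's exact path-weighting scheme), the sketch below picks uniformly at random among the productive directions at each hop. All names are illustrative.

```c
#include <stdlib.h>

typedef enum { DIR_EAST, DIR_WEST, DIR_NORTH, DIR_SOUTH, DIR_LOCAL } dir_t;

/* Pick the next output port for a packet at (x, y) headed to (dx, dy).
 * Only "productive" directions (those that reduce the remaining distance)
 * are considered, so every path is minimal; choosing randomly among them
 * spreads traffic over all minimal paths without using any load or
 * congestion information (hence "oblivious"). */
dir_t route_random_minimal(int x, int y, int dx, int dy)
{
    dir_t candidates[2];
    int n = 0;

    if (dx > x) candidates[n++] = DIR_EAST;
    else if (dx < x) candidates[n++] = DIR_WEST;

    if (dy > y) candidates[n++] = DIR_NORTH;
    else if (dy < y) candidates[n++] = DIR_SOUTH;

    if (n == 0)
        return DIR_LOCAL;          /* packet has arrived at its destination */
    return candidates[rand() % n]; /* random choice among minimal options   */
}
```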
2

Stubbfält, Erik. "Hardware Architecture Impact on Manycore Programming Model." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-441739.

Full text
Abstract:
This work investigates how certain processor architectures can affect the implementation and performance of a parallel programming model. The Ericsson Many-Core Architecture (EMCA) is compared and contrasted to general-purpose multicore processors, highlighting differences in their memory systems and processor cores. A proof-of-concept implementation of the Concurrency Building Blocks (CBB) programming model is developed for x86-64 using MPI. Benchmark tests show how CBB on EMCA handles compute-intensive and memory-intensive scenarios, compared to a high-end x86-64 machine running the proof-of-concept implementation. EMCA shows its strengths in heavy computations while x86-64 performs at its best with high degrees of data reuse. Both systems are able to utilize locality in their memory systems to achieve great performance benefits.
APA, Harvard, Vancouver, ISO, and other styles
3

Dévigne, Clément. "Exécution sécurisée de plusieurs machines virtuelles sur une plateforme Manycore." Thesis, Paris 6, 2017. http://www.theses.fr/2017PA066138/document.

Full text
Abstract:
Manycore architectures, which comprise a large number of cores, are a way to answer the ever-growing amount of digital data to be processed by cloud computing infrastructures. These data, which may belong to companies as well as to private individuals, are sensitive by nature, which is why the isolation problem is paramount. Since the beginning of cloud computing, virtualization techniques have been increasingly used to allow different users to physically share the same hardware resources. This is all the more true for manycore architectures, and it therefore falls partly to the architecture to guarantee the confidentiality and integrity of the data of the software running on the platform. In this thesis we propose a secure virtualization environment for a manycore architecture. Our mechanism relies on hardware components and hypervisor software to isolate several operating systems running on the same architecture. The hypervisor is in charge of allocating resources to the virtualized operating systems, but has no access rights to the resources allocated to these systems; thus, a security flaw in the hypervisor does not endanger the confidentiality or integrity of the virtualized systems' data. Our solution is evaluated using a cycle-accurate virtual prototype and has been implemented in a cache-coherent shared-memory manycore architecture. Our evaluations cover the hardware overhead and the performance degradation induced by our mechanisms. Finally, we analyze the security provided by our solution.
APA, Harvard, Vancouver, ISO, and other styles
4

Azar, Céline. "On the design of a distributed adaptive manycore architecture for embedded systems." Lorient, 2012. http://www.theses.fr/2012LORIS268.

Full text
Abstract:
Chip design challenges have emerged lately at many levels: the increase in the number of cores on a chip at the hardware level, the complexity of parallel programming models at the software level, and the dynamic requirements of current applications. Facing this evolution, this thesis proposes a distributed adaptive manycore architecture named CEDAR (Configurable Embedded Distributed ARchitecture), whose main assets are scalability, flexibility and simplicity. The CEDAR platform is an array of homogeneous, small-footprint RISC processors, each connected to its four nearest neighbors through buffers shared by adjacent cores. There is no global control; it is instead distributed among the cores. Two versions of the platform are designed, along with a simple, user-familiar programming model. The software version, CEDAR-S, is the basic implementation, in which adjacent cores communicate through shared buffers. A co-processor called DMC (Direct Management of Communications) is added in the CEDAR-H version to optimize the routing protocol; the DMCs are interconnected in a 2D mesh. Two novel concepts are proposed to enhance the adaptiveness of CEDAR. First, a distributed, dynamic, bio-inspired routing algorithm handles routing in a non-supervised fashion and is independent of the physical placement of communicating tasks. Second, dynamic distributed task migration responds to system and application requirements. Results show that CEDAR achieves high performance with its optimized routing strategy compared with state-of-the-art networks-on-chip. The migration cost is evaluated and adequate protocols are presented. CEDAR proves to be a promising design concept for future manycores.
APA, Harvard, Vancouver, ISO, and other styles
5

Dévigne, Clément. "Exécution sécurisée de plusieurs machines virtuelles sur une plateforme Manycore." Electronic Thesis or Diss., Paris 6, 2017. http://www.theses.fr/2017PA066138.

Full text
Abstract:
Manycore architectures, which comprise a large number of cores, are a way to answer the ever-growing amount of digital data to be processed by cloud computing infrastructures. These data, which may belong to companies as well as to private individuals, are sensitive by nature, which is why the isolation problem is paramount. Since the beginning of cloud computing, virtualization techniques have been increasingly used to allow different users to physically share the same hardware resources. This is all the more true for manycore architectures, and it therefore falls partly to the architecture to guarantee the confidentiality and integrity of the data of the software running on the platform. In this thesis we propose a secure virtualization environment for a manycore architecture. Our mechanism relies on hardware components and hypervisor software to isolate several operating systems running on the same architecture. The hypervisor is in charge of allocating resources to the virtualized operating systems, but has no access rights to the resources allocated to these systems; thus, a security flaw in the hypervisor does not endanger the confidentiality or integrity of the virtualized systems' data. Our solution is evaluated using a cycle-accurate virtual prototype and has been implemented in a cache-coherent shared-memory manycore architecture. Our evaluations cover the hardware overhead and the performance degradation induced by our mechanisms. Finally, we analyze the security provided by our solution.
APA, Harvard, Vancouver, ISO, and other styles
6

Gallet, Camille. "Étude de transformations et d’optimisations de code parallèle statique ou dynamique pour architecture "many-core"." Thesis, Paris 6, 2016. http://www.theses.fr/2016PA066747/document.

Full text
Abstract:
From their origin in the 1960s to the present day, supercomputers have faced three revolutions: (i) the arrival of transistors to replace triodes, (ii) the appearance of vector computing, and (iii) the organization into clusters. Clusters are currently built from standard processors whose computing power has grown through higher frequencies, more cores per chip and wider computing units (SIMD instruction sets). A recent example combining a large number of cores and wide (512-bit) vector units is the Intel Xeon Phi co-processor. To maximize computing performance on these chips by making the best use of these SIMD instructions, the bodies of loop nests must be reorganized while taking irregular aspects (control flow and data flow) into account. To this end, this thesis proposes to extend the transformation named Deep Jam to extract regularity from irregular code and thereby facilitate vectorization. The document presents our extension and its application to a multi-material hydrodynamics mini-application, HydroMM, showing that a significant performance gain can be obtained on irregular codes. (A simplified illustration of control-flow regularization for SIMD, distinct from Deep Jam itself, follows this entry.)
APA, Harvard, Vancouver, ISO, and other styles
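Deep Jam is a more involved transformation than can be shown here; purely as a hedged illustration of the underlying idea of regularizing irregular control flow so that SIMD units can be used, the sketch below applies classic if-conversion to a branchy loop. It is not taken from the thesis, and the function names are illustrative.

```c
/* Irregular form: the branch inside the loop body prevents straightforward
 * SIMD vectorization because different iterations follow different paths. */
void scale_irregular(float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++) {
        if (b[i] > 0.0f)
            a[i] = a[i] * b[i];
        else
            a[i] = 0.0f;
    }
}

/* Regularized form: the branch is turned into a per-lane predicate
 * (if-conversion), so every iteration executes the same straight-line code
 * and the compiler can emit masked/blended SIMD instructions. */
void scale_regular(float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++) {
        float keep = (b[i] > 0.0f) ? 1.0f : 0.0f; /* per-lane mask */
        a[i] = keep * (a[i] * b[i]);
    }
}
```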
7

Bechara, Charly. "Study and design of a manycore architecture with multithreaded processors for dynamic embedded applications." Phd thesis, Université Paris Sud - Paris XI, 2011. http://tel.archives-ouvertes.fr/tel-00713536.

Full text
Abstract:
Embedded systems are getting more complex and require more intensive processing capabilities. They must be able to adapt to the rapid evolution of high-end embedded applications, which are characterized by computation-intensive workloads (on the order of TOPS: Tera Operations Per Second) and a high level of parallelism. Moreover, since the dynamism of these applications is becoming more significant, computing solutions should be designed accordingly: by exploiting the dynamism efficiently, the load can be balanced between the computing resources, which greatly improves overall performance. To tackle the challenges of these future high-end, massively parallel, dynamic embedded applications, we have designed the AHDAM architecture, which stands for "Asymmetric Homogeneous with Dynamic Allocator Manycore architecture". It can process applications with large data sets by efficiently hiding the processors' stall time using multithreaded processors. Besides, it exploits the parallelism of the applications at multiple levels so that they are accelerated efficiently on dedicated resources, improving overall performance. AHDAM tackles the dynamism of these applications by dynamically balancing the load between its computing resources using a central controller to increase their utilization rate. The AHDAM architecture has been evaluated using a relevant embedded application from the telecommunication domain called "spectrum radio-sensing". With 136 cores running at 500 MHz, AHDAM reaches a peak performance of 196 GOPS and meets the computation requirements of the application.
APA, Harvard, Vancouver, ISO, and other styles
8

Park, Seo Jin. "Analyzing performance and usability of broadcast-based inter-core communication (ATAC) on manycore architecture." Thesis, Massachusetts Institute of Technology, 2013. http://hdl.handle.net/1721.1/85219.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2013.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 55-56).
In this thesis, I analyze the performance and usability benefits of broadcast-based inter-core communication on a manycore architecture. The problem of high communication cost on manycore architectures was tackled by a new architecture which allows efficient broadcasting by leveraging an on-chip optical network. I designed the new architecture and API for the broadcasting feature and implemented them on a multicore simulator called Graphite. I also re-implemented common parallel APIs (barrier and work-stealing) which benefit from the cheap broadcasting and showed their ease of use and superior performance versus existing parallel programming libraries by running well-known benchmarks on the Graphite simulator. (An illustrative broadcast-based barrier sketch follows this entry.)
by Seo Jin Park.
M. Eng.
APA, Harvard, Vancouver, ISO, and other styles
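The thesis's ATAC API is not reproduced in this abstract; the sketch below is only a hedged illustration of why a cheap one-to-all broadcast helps a barrier: the last core to arrive releases all the others with a single message instead of every core polling shared memory. `noc_broadcast` and `noc_receive` are hypothetical primitives, not the actual API.

```c
#include <stdatomic.h>

/* Hypothetical one-to-all on-chip broadcast primitives (illustrative only,
 * not the API developed in the thesis). Delivery is assumed to be queued
 * per receiving core in FIFO order, and a broadcast is delivered to every
 * core except the sender. */
void noc_broadcast(int value);   /* deliver 'value' to every other core */
int  noc_receive(void);          /* block until a broadcast arrives     */

static atomic_int arrived;       /* global arrival ticket counter       */

/* Barrier over 'ncores' cores built on cheap broadcast: instead of all
 * cores spinning on a shared counter, only the last core to arrive sends
 * one broadcast that releases everyone else. The counter is never reset,
 * so successive barrier episodes cannot race on a reset. */
void broadcast_barrier(int ncores)
{
    int ticket = atomic_fetch_add(&arrived, 1) + 1;
    if (ticket % ncores == 0)
        noc_broadcast(ticket);   /* last arrival of this episode: release */
    else
        (void)noc_receive();     /* wait for the release broadcast        */
}
```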
9

Gao, Yang. "Contrôleur de cache générique pour une architecture manycore massivement parallèle à mémoire partagée cohérente." Paris 6, 2011. http://www.theses.fr/2011PA066296.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Karaoui, Mohamed Lamine. "Système de fichiers scalable pour architectures many-cores à faible empreinte énergétique." Thesis, Paris 6, 2016. http://www.theses.fr/2016PA066186/document.

Full text
Abstract:
This thesis studies the problems involved in implementing a scalable UNIX-like file system on a hardware cache-coherent NUMA manycore architecture with a low energy footprint. For this study, we use the TSAR general-purpose manycore architecture and ALMOS, a UNIX-like kernel. From the operating system's point of view, the targeted architecture raises three problems, for which we propose solutions after reviewing existing ones; one problem is specific to TSAR, while the other two are common to existing cache-coherent NUMA manycores. The first problem is supporting a physical memory that is larger than the virtual memory, due to TSAR's extended physical address space, which is 256 times larger than its virtual address space. To solve it, we deeply restructured the kernel, decomposing it into multiple communicating units that interact mainly through message passing. The second problem is the placement strategy for the file system structures across the many memory banks; we implemented a strategy that distributes the data evenly over the different banks. The third problem is the synchronization of concurrent accesses to the file system. Our solution combines several mechanisms; in particular, we designed an efficient lock-free mechanism that synchronizes accesses between several readers and a single writer. Experimental results show that: (1) structuring the kernel into multiple communicating units does not degrade performance and may even improve it; (2) our set of solutions scales better than the NetBSD kernel; (3) the placement strategy that distributes data evenly is the best suited to file systems on manycore architectures. (An illustrative single-writer/multiple-readers sketch follows this entry.)
APA, Harvard, Vancouver, ISO, and other styles
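The abstract does not detail the lock-free reader/writer mechanism designed in the thesis; as a hedged illustration of the general single-writer/multiple-readers idea it alludes to, the sketch below shows the classic sequence-lock pattern. The payload layout is hypothetical, and the mechanism shown is not claimed to be the one from the thesis.

```c
#include <stdatomic.h>

/* Classic sequence-lock pattern: one writer, any number of readers, and no
 * lock held by readers. The counter is even when the data is stable and odd
 * while an update is in progress; a reader retries if the counter was odd or
 * changed around its read. Payload accesses are left plain for clarity; a
 * production version must make them atomic or add the appropriate fences. */
typedef struct {
    atomic_uint seq;
    long        payload[2];      /* hypothetical protected data */
} seqlock_t;

void writer_update(seqlock_t *s, long a, long b)   /* single writer only */
{
    atomic_fetch_add(&s->seq, 1);   /* counter becomes odd: update in progress */
    s->payload[0] = a;
    s->payload[1] = b;
    atomic_fetch_add(&s->seq, 1);   /* counter even again: data stable */
}

void reader_snapshot(seqlock_t *s, long out[2])    /* any number of readers */
{
    unsigned before, after;
    do {
        before = atomic_load(&s->seq);
        out[0] = s->payload[0];
        out[1] = s->payload[1];
        after  = atomic_load(&s->seq);
    } while ((before & 1u) || before != after);    /* retry on overlap */
}
```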
11

Lööw, Andreas. "A Functional-Level Simulator for the Configurable (Many-Core) PRAM-Like REPLICA Architecture." Thesis, Linköpings universitet, Programvara och system, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-79049.

Full text
Abstract:
This master's thesis discusses the design and implementation of a simulator for the REPLICA architecture, a many-core PRAM-like machine. REPLICA provides a programming model that seemingly cannot be provided by mainstream hardware without significant slowdown compared to traditional models. This also implies that it is difficult to simulate REPLICA's programming model on mainstream hardware. Simulator design decisions are described and the resulting simulator is evaluated and compared to existing simulators, where we see that the simulator presented in this thesis is the fastest of them. As seen from the discussion focus in the thesis, most efforts were directed towards simulator execution speed rather than user-facing features.
APA, Harvard, Vancouver, ISO, and other styles
12

Bates, Daniel. "Exploiting tightly-coupled cores." Thesis, University of Cambridge, 2014. https://www.repository.cam.ac.uk/handle/1810/245179.

Full text
Abstract:
As we move steadily through the multicore era, and the number of processing cores on each chip continues to rise, parallel computation becomes increasingly important. However, parallelising an application is often difficult because of dependencies between different regions of code which require cores to communicate. Communication is usually slow compared to computation, and so restricts the opportunities for profitable parallelisation. In this work, I explore the opportunities provided when communication between cores has a very low latency and low energy cost. I observe that there are many different ways in which multiple cores can be used to execute a program, allowing more parallelism to be exploited in more situations, and also providing energy savings in some cases. Individual cores can be made very simple and efficient because they do not need to exploit parallelism internally. The communication patterns between cores can be updated frequently to reflect the parallelism available at the time, allowing better utilisation than specialised hardware which is used infrequently. In this dissertation I introduce Loki: a homogeneous, tiled architecture made up of many simple, tightly-coupled cores. I demonstrate the benefits in both performance and energy consumption which can be achieved with this arrangement and observe that it is also likely to have lower design and validation costs and be easier to optimise. I then determine exactly where the performance bottlenecks of the design are, and where the energy is consumed, and look into some more-advanced optimisations which can make parallelism even more profitable.
APA, Harvard, Vancouver, ISO, and other styles
13

Karaoui, Mohamed Lamine. "Système de fichiers scalable pour architectures many-cores à faible empreinte énergétique." Electronic Thesis or Diss., Paris 6, 2016. http://www.theses.fr/2016PA066186.

Full text
Abstract:
This thesis studies the problems involved in implementing a scalable UNIX-like file system on a hardware cache-coherent NUMA manycore architecture with a low energy footprint. For this study, we use the TSAR general-purpose manycore architecture and ALMOS, a UNIX-like kernel. From the operating system's point of view, the targeted architecture raises three problems, for which we propose solutions after reviewing existing ones; one problem is specific to TSAR, while the other two are common to existing cache-coherent NUMA manycores. The first problem is supporting a physical memory that is larger than the virtual memory, due to TSAR's extended physical address space, which is 256 times larger than its virtual address space. To solve it, we deeply restructured the kernel, decomposing it into multiple communicating units that interact mainly through message passing. The second problem is the placement strategy for the file system structures across the many memory banks; we implemented a strategy that distributes the data evenly over the different banks. The third problem is the synchronization of concurrent accesses to the file system. Our solution combines several mechanisms; in particular, we designed an efficient lock-free mechanism that synchronizes accesses between several readers and a single writer. Experimental results show that: (1) structuring the kernel into multiple communicating units does not degrade performance and may even improve it; (2) our set of solutions scales better than the NetBSD kernel; (3) the placement strategy that distributes data evenly is the best suited to file systems on manycore architectures.
APA, Harvard, Vancouver, ISO, and other styles
14

Alnervik, Erik. "Evaluation of the Configurable Architecture REPLICA with Emulated Shared Memory." Thesis, Linköpings universitet, Programvara och system, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-104313.

Full text
Abstract:
REPLICA is a family of novel scalable chip multiprocessors with a configurable emulated shared memory architecture, whose computation model is based on the PRAM (Parallel Random Access Machine) model. The purpose of this thesis is to evaluate, by benchmarking different types of computation problems on REPLICA, on similar parallel architectures (SB-PRAM and XMT), and on more diverse ones (Xeon X5660 and Tesla M2050), how REPLICA is positioned among existing architectures, both in performance and in programming effort, and to examine whether REPLICA is especially suited to particular kinds of computational problems. By using some of the well-known Berkeley dwarfs, and input from unbiased sources such as The University of Florida Sparse Matrix Collection and the Rodinia benchmark suite, we make sure that the benchmarks measure relevant computation problems. We show that today's parallel architectures have performance issues for applications with irregular memory access patterns, which the REPLICA architecture can solve; for example, REPLICA only needs to be clocked at a few MHz to match both the Xeon X5660 and the Tesla M2050 on the irregular-memory-access benchmark breadth-first search. By comparing the efficiency of REPLICA to a CPU (Xeon X5660), we show that it is easier to program REPLICA efficiently than today's multiprocessors. (An illustrative sketch of such an irregular access pattern follows this entry.)
APA, Harvard, Vancouver, ISO, and other styles
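As a hedged illustration of the irregular access pattern behind the breadth-first-search benchmark mentioned above (not code from the thesis), the sketch below traverses a graph stored in CSR form; the indirect, data-dependent accesses to `adj` and `level` are exactly the kind of scattered memory traffic that penalizes cache-based machines and that an emulated-shared-memory design is meant to tolerate.

```c
#include <stdlib.h>

/* Level-synchronous BFS over a graph in CSR form: row_ptr has n+1 entries,
 * adj holds the concatenated adjacency lists, level[] receives the BFS
 * distance from 'source' (-1 for unreachable vertices). */
void bfs(int n, const int *row_ptr, const int *adj, int source, int *level)
{
    int *queue = malloc((size_t)n * sizeof *queue);
    int head = 0, tail = 0;

    for (int v = 0; v < n; v++) level[v] = -1;
    level[source] = 0;
    queue[tail++] = source;

    while (head < tail) {
        int u = queue[head++];
        for (int e = row_ptr[u]; e < row_ptr[u + 1]; e++) {
            int w = adj[e];               /* indirect, unpredictable read   */
            if (level[w] == -1) {         /* scattered read-modify-write    */
                level[w] = level[u] + 1;
                queue[tail++] = w;
            }
        }
    }
    free(queue);
}
```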
15

Gallet, Camille. "Étude de transformations et d’optimisations de code parallèle statique ou dynamique pour architecture "many-core"." Electronic Thesis or Diss., Paris 6, 2016. http://www.theses.fr/2016PA066747.

Full text
Abstract:
From their origin in the 1960s to the present day, supercomputers have faced three revolutions: (i) the arrival of transistors to replace triodes, (ii) the appearance of vector computing, and (iii) the organization into clusters. Clusters are currently built from standard processors whose computing power has grown through higher frequencies, more cores per chip and wider computing units (SIMD instruction sets). A recent example combining a large number of cores and wide (512-bit) vector units is the Intel Xeon Phi co-processor. To maximize computing performance on these chips by making the best use of these SIMD instructions, the bodies of loop nests must be reorganized while taking irregular aspects (control flow and data flow) into account. To this end, this thesis proposes to extend the transformation named Deep Jam to extract regularity from irregular code and thereby facilitate vectorization. The document presents our extension and its application to a multi-material hydrodynamics mini-application, HydroMM, showing that a significant performance gain can be obtained on irregular codes.
APA, Harvard, Vancouver, ISO, and other styles
16

Savas, Süleyman. "Utilizing Heterogeneity in Manycore Architectures for Streaming Applications." Licentiate thesis, Högskolan i Halmstad, Centrum för forskning om inbyggda system (CERES), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-33792.

Full text
Abstract:
In the last decade, we have seen a transition from single-core to manycore in computer architectures due to performance requirements and limitations in power consumption and heat dissipation. The first manycores had homogeneous architectures consisting of a few identical cores, but the applications executed on these architectures usually consist of several tasks requiring different hardware resources to be executed efficiently. Therefore, we believe that utilizing heterogeneity in manycores will increase the efficiency of the architectures in terms of performance and power consumption. However, developing heterogeneous architectures is more challenging, and the transition from homogeneous to heterogeneous architectures increases the difficulty of efficient software development due to the increased complexity of the architecture. In order to increase the efficiency of hardware and software development, new hardware design methods and software development tools are required. Additionally, there is a lack of knowledge on what kind of heterogeneous manycore design is most efficient for different applications and on the performance of these applications when executed on current commercial manycores. This thesis studies manycore architectures in order to reveal possible uses of heterogeneity in manycores and to facilitate the choice of architecture for software and hardware developers. It defines a taxonomy for manycore architectures based on the levels of heterogeneity they contain and discusses the benefits and drawbacks of these levels. Additionally, it evaluates several applications, a dataflow language (CAL), a source-to-source compilation framework (Cal2Many), and a commercial manycore architecture (Epiphany). The compilation framework takes implementations written in the dataflow language as input and generates code targeting different manycore platforms. Based on these evaluations, the thesis identifies the bottlenecks of the architecture and finally presents a methodology for developing heterogeneous manycore architectures that target specific application domains. Our studies show that using different types of cores in manycore architectures has the potential to increase the performance of streaming applications: adding specialized hardware blocks to a core easily increases performance by 15x for the target application, while the core size increases by 40-50%, which can be optimized further. Other results show that dataflow languages, together with software development tools, decrease software development effort significantly (25-50%) while having a small impact (2-17%) on performance.
HiPEC (High Performance Embedded Computing)
NGES (Towards Next Generation Embedded Systems: Utilizing Parallelism and Reconfigurability)
APA, Harvard, Vancouver, ISO, and other styles
17

Farjallah, Asma. "Etude de l'adéquation des machines Exascale pour les algorithmes implémentant la méthode du Reverse Time Migation." Thesis, Versailles-St Quentin en Yvelines, 2014. http://www.theses.fr/2014VERS0050/document.

Full text
Abstract:
As Exascale systems are expected in the 2018-2020 time frame, performance analysis and characterization of applications for new processor architectures and large-scale systems are important tasks that make it possible to anticipate the changes required to exploit future HPC systems efficiently. This thesis focuses on seismic imaging applications used for modeling complex physical phenomena, in particular the depth-imaging application called Reverse Time Migration (RTM), which is widely used by oil companies for seismic exploration. The first contribution characterizes and models the performance of the computational core of RTM, which is based on finite-difference time-domain (FDTD) computations, highlighting the architectural aspects with the strongest impact on performance, notably caches, bandwidth and prefetching, and resulting in a performance model that predicts the DRAM traffic of FDTD kernels. The second contribution analyzes the impact of heterogeneity and parallelism on FDTD and RTM, targeting Intel's first Xeon Phi co-processor (Knights Corner), an interesting proxy for this study since it features some of the expected characteristics of an Exascale system, concurrency and heterogeneity; both a "native" implementation and a heterogeneous, hybrid "symmetric" implementation are studied. The third contribution extends the performance analysis and modeling to the full RTM application, which adds communications and I/O to the computation: RTM is data intensive and requires storing intermediate values of the computational field, resulting in expensive I/O accesses. The fourth contribution is the measurement and model validation of the hybrid RTM implementation on a large system, Stampede at the Texas Advanced Computing Center (TACC), with scalability tests up to 64 nodes, each containing one 61-core Xeon Phi and two 8-core Sandy Bridge CPUs, for a total of close to 5,000 heterogeneous cores. (A minimal FDTD stencil sketch follows this entry.)
APA, Harvard, Vancouver, ISO, and other styles
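The RTM kernels studied in the thesis are higher-order 3-D stencils; purely as a hedged, minimal illustration of the FDTD pattern they are built on, the sketch below advances a second-order 1-D wave equation by one time step. The parameter `vel2_dt2_h2` (a precomputed velocity²·Δt²/h² factor) and all names are assumptions for the example.

```c
/* One time step of a 1-D second-order acoustic wave propagation kernel, the
 * finite-difference time-domain (FDTD) pattern at the heart of RTM. Each
 * output point needs several neighboring input points but very little
 * arithmetic, so performance is governed by memory bandwidth, caches and
 * prefetching rather than by peak FLOPS. */
void fdtd_step_1d(int n, const float *restrict prev, const float *restrict cur,
                  float *restrict next, const float *restrict vel2_dt2_h2)
{
    for (int i = 1; i < n - 1; i++) {
        float lap = cur[i - 1] - 2.0f * cur[i] + cur[i + 1]; /* Laplacian */
        next[i] = 2.0f * cur[i] - prev[i] + vel2_dt2_h2[i] * lap;
    }
}
```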
18

Amstel, Duco van. "Optimisation de la localité des données sur architectures manycœurs." Thesis, Université Grenoble Alpes (ComUE), 2016. http://www.theses.fr/2016GREAM019/document.

Full text
Abstract:
The continuous evolution of computer architectures has been an important driver of research in code optimization and compiler technology. A trend in this evolution that can be traced back to the advent of modern computers is the growing ratio between the available computational power (IPS, FLOPS, ...) and the corresponding bandwidth between the various levels of the memory hierarchy (registers, cache, DRAM). As a result, reducing the number of memory communications that a given code requires has been an important research topic. A basic principle for such optimizations is improving temporal data locality: grouping all accesses to a given data point as close together in time as possible, so that it is needed only for a short duration and can then be moved to distant memory (DRAM) without further communications. Another architectural evolution has been the advent of the multicore era and, in recent years, the first generations of manycore designs. These architectures have considerably raised the amount of parallelism available to programs and algorithms, but this is again limited by the bandwidth available for communications between cores, bringing issues that were previously the sole preoccupation of distributed computing into the world of compilation and code optimization. In this document we present first work on a new optimization technique, generalized tiling, which offers both an abstract model of data reuse and a large field of potential applications. It finds its source in loop tiling, a well-known technique that has been applied with success to improve data locality in nested loops, for both registers and cache memory. This new flavor of tiling takes a much broader view and is not limited to nested loops. It is built on a new representation, the memory-use graph, which is tightly linked to a new model of memory and communication requirements and which applies to any form of iteratively executed code. Generalized tiling expresses data locality as an optimization problem for which several solutions are proposed, and the abstraction provided by the memory-use graph allows this problem to be solved in different contexts. For the experimental evaluation, we show how the technique can be applied to loops, nested or not, as well as to programs expressed in a dataflow language. Anticipating the use of generalized tiling to distribute computations over the cores of a manycore architecture, we also give some elements for modeling communications and their characteristics on such architectures. Finally, to show the full expressiveness of the memory-use graph and the underlying memory and communication model, we turn to performance debugging and the analysis of execution traces, with the goal of providing feedback on the potential for further data-locality improvement of the evaluated code. Such traces may contain information about memory communications during an execution and show strong similarities with the optimization problem studied earlier. This leads to a short introduction to the algorithmics of directed graphs and to the development of new heuristics for the well-known problem of reachability, as well as for the much less studied problem of convex partitioning. (A classic loop-tiling sketch follows this entry.)
APA, Harvard, Vancouver, ISO, and other styles
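Generalized tiling itself is not reproduced here; as a hedged reminder of the classic loop tiling it generalizes, the sketch below contrasts an untiled and a tiled array transposition, the latter regrouping accesses so that each block of the source array is reused while it is still resident in cache.

```c
#define TILE 64

/* Untiled: the column-wise accesses to b[] stream the whole array through
 * the cache once per row of a[], with almost no reuse of fetched lines. */
void transpose_naive(int n, const double *b, double *a)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[i * n + j] = b[j * n + i];
}

/* Tiled: the same iterations are regrouped into TILE x TILE blocks, so the
 * cache lines of each block of b[] are reused before being evicted - the
 * improvement of temporal data locality that generalized tiling extends
 * beyond the nested-loop case. */
void transpose_tiled(int n, const double *b, double *a)
{
    for (int ii = 0; ii < n; ii += TILE)
        for (int jj = 0; jj < n; jj += TILE)
            for (int i = ii; i < ii + TILE && i < n; i++)
                for (int j = jj; j < jj + TILE && j < n; j++)
                    a[i * n + j] = b[j * n + i];
}
```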
19

Petit, Eric. "Vers un partitionnement automatique d'applications en codelets spéculatifs pour les systèmes hétérogènes à mémoires distribuées." Phd thesis, Université Rennes 1, 2009. http://tel.archives-ouvertes.fr/tel-00445512.

Full text
Abstract:
Faced with the growing costs in development effort, power consumption and silicon area required by further optimizations of single-core architectures, parallelism and specialized coprocessors are making a strong comeback in processor architectures. This approach offers the best trade-off between high computing power and resource utilization. To exploit these architectures efficiently, the code must be partitioned into tasks, called codelets, before distributing them to the different computing units. This partitioning is complex and the solution space is vast, so efficient automation tools for partitioning sequential code are needed. The work presented in this thesis concerns the development of such a partitioning process. The Astex approach is based on speculation: the codelets are built from execution profiles of the application. Speculation enables a large number of optimizations that do not exist or are impossible statically; its construction, dynamic management and possible uses form a broad subject of study. The second contribution of this thesis concerns the use of speculation to optimize communications between processor and coprocessor, focusing in particular on GPGPU, i.e. the use of a graphics processor as a coprocessor for intensive computation.
APA, Harvard, Vancouver, ISO, and other styles
20

Giroudot, Frédéric. "NoC-based Architectures for Real-Time Applications : Performance Analysis and Design Space Exploration." Thesis, Toulouse, INPT, 2019. https://oatao.univ-toulouse.fr/25921/1/Giroudot_Frederic.pdf.

Full text
Abstract:
Les architectures mono-processeur montrent leurs limites en termes de puissance de calcul face aux besoins des systèmes actuels. Bien que les architectures multi-cœurs résolvent partiellement ce problème, elles utilisent en général des bus pour interconnecter les cœurs, et cette solution ne passe pas à l'échelle. Les architectures dites pluri-cœurs ont été proposées pour palier les limitations des processeurs multi-cœurs. Elles peuvent réunir jusqu'à des centaines de cœurs sur une seule puce, organisés en dalles contenant une ou plusieurs entités de calcul. La communication entre les cœurs se fait généralement au moyen d'un réseau sur puce constitué de routeurs reliés les uns aux autres et permettant les échanges de données entre dalles. Cependant, ces architectures posent de nombreux défis, en particulier pour les applications temps-réel. D'une part, la communication via un réseau sur puce provoque des scénarios de blocage entre flux, ce qui complique l'analyse puisqu'il devient difficile de déterminer le pire cas. D'autre part, exécuter de nombreuses applications sur des systèmes sur puce de grande taille comme des architectures pluri-cœurs rend la conception de tels systèmes particulièrement complexe. Premièrement, cela multiplie les possibilités d'implémentation qui respectent les contraintes fonctionnelles, et l'exploration d'architecture résultante est plus longue. Deuxièmement, une fois une architecture matérielle choisie, décider de l'attribution de chaque tâche des applications à exécuter aux différents cœurs est un problème difficile, à tel point que trouver une une solution optimale en un temps raisonnable n'est pas toujours possible. Ainsi, nos premières contributions s'intéressent à cette nécessité de pouvoir calculer des bornes fiables sur le pire cas des latences de transmission des flux de données empruntant des réseaux sur puce dits "wormhole". Nous proposons un modèle analytique, BATA, prenant en compte la taille des mémoires tampon des routeurs et applicable à une configuration de flux de données périodiques générant un paquet à la fois. Nous étendons ensuite le domaine d'applicabilité de BATA pour couvrir un modèle de traffic plus général ainsi que des architectures hétérogènes. Cette nouvelle méthode, appelée G-BATA, est basée sur une structure de graphe pour capturer les interférences possibles entre flux de données. Elle permet également de diminuer le temps de calcul de l'analyse, améliorant la capacité de l'approche à passer à l'échelle. Dans une seconde partie, nous proposons une méthode pour la conception d'applications temps-réel s'exécutant sur des plateformes pluri-cœurs. Cette méthode intègre notre modèle d'analyse G-BATA dans un processus de conception systématique, faisant en outre intervenir un outil de modélisation et de simulation de systèmes reposant sur des concepts d'ingénierie dirigée par les modèles, TTool, et un logiciel pour l'analyse de performance pire-cas des réseaux, WoPANets. Enfin, nous proposons une validation de nos contributions grâce à (a) une série d'expériences sur une plateforme physique et (b) deux études de cas d'applications réelle; le système de contrôle d'un véhicule autonome et une application de décodeur 5G
Monoprocessor architectures have reached their limits in regard to the computing power they offer vs the needs of modern systems. Although multicore architectures partially mitigate this limitation and are commonly used nowadays, they usually rely on intrinsically non-scalable buses to interconnect the cores. The manycore paradigm was proposed to tackle the scalability issue of bus-based multicore processors. It can scale up to hundreds of processing elements (PEs) on a single chip, by organizing them into computing tiles (holding one or several PEs). Intercore communication is usually done using a Network-on-Chip (NoC) that consists of interconnected on-chip routers allowing communication between tiles. However, manycore architectures raise numerous challenges, particularly for real-time applications. First, NoC-based communication tends to generate complex blocking patterns when congestion occurs, which complicates the analysis, since computing accurate worst-case delays becomes difficult. Second, running many applications on large Systems-on-Chip such as manycore architectures makes system design particularly crucial and complex. On one hand, it complicates Design Space Exploration, as it multiplies the implementation alternatives that will guarantee the desired functionalities. On the other hand, once a hardware architecture is chosen, mapping the tasks of all applications on the platform is a hard problem, and finding an optimal solution in a reasonable amount of time is not always possible. Therefore, our first contributions address the need for computing tight worst-case delay bounds in wormhole NoCs. We first propose a buffer-aware worst-case timing analysis (BATA) to derive upper bounds on the worst-case end-to-end delays of constant-bit-rate data flows transmitted over a NoC on a manycore architecture. We then extend BATA to cover a wider range of traffic types, including bursty traffic flows, and heterogeneous architectures. The introduced method is called G-BATA for Graph-based BATA. In addition to covering a wider range of assumptions, G-BATA improves the computation time, thus increasing the scalability of the method. In a second part, we develop a method addressing design and mapping for applications with real-time constraints on manycore platforms. It combines model-based engineering tools (TTool) and simulation with our analytical verification technique (G-BATA) and tools (WoPANets) to provide an efficient design space exploration framework. Finally, we validate our contributions on (a) a series of experiments on a physical platform and (b) two case studies taken from the real world: an autonomous vehicle control application and a 5G signal decoder application.
APA, Harvard, Vancouver, ISO, and other styles
21

Xypolitidis, Benard, and Rudin Shabani. "Architectural Design Space Exploration of Heterogeneous Manycores." Thesis, Högskolan i Halmstad, Akademin för informationsteknologi, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-29528.

Full text
Abstract:
Exploring the benefits of heterogeneous architectures is becoming more desirable due to the migration from single-core to manycore architectural systems. A fast way to explore the heterogeneity is through an architectural design space exploration (ADSE) tool, which gives the designer the option to explore design alternatives before the actual implementation. Heracles Designer is an ADSE tool which allows the user to modify large aspects of the architecture. At present, Heracles Designer is equipped with a single type of processing core, a MIPS CPU. We have extended the Heracles System in order to enable the system to model heterogeneity. Our system is called the Heterogeneous Heracles System (HHS), where a different type of processing core, the OpenRISC CPU, is interfaced into the Heracles System. Test programs are executed on both the MIPS and OpenRISC CPUs, which have provided promising results. In order to provide the designer with the option to modify the system architecture without changing the source code, a GUI named ADSET was created. ADSET provides the designer with the ability to modify the core settings, memory system configuration and network topology configuration. In the HHS the MIPS core can only execute basic instructions, while the OpenRISC can execute more advanced instructions, giving a designer the option to explore the effects of heterogeneity based on the big-little architectural concept. The results of our work provide an infrastructure on how to integrate different types of processing cores into the HHS.
APA, Harvard, Vancouver, ISO, and other styles
22

Prasad, Rohit <1991&gt. "Integrated Programmable-Array accelerator to design heterogeneous ultra-low power manycore architectures." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2022. http://amsdottorato.unibo.it/9983/1/PhD_thesis__20_January_2022_.pdf.

Full text
Abstract:
There is an ever-increasing demand for energy efficiency (EE) in rapidly evolving Internet-of-Things end nodes. This pushes researchers and engineers to develop solutions that provide both Application-Specific Integrated Circuit-like EE and Field-Programmable Gate Array-like flexibility. One such solution is Coarse Grain Reconfigurable Array (CGRA). Over the past decades, CGRAs have evolved and are competing to become mainstream hardware accelerators, especially for accelerating Digital Signal Processing (DSP) applications. Due to the over-specialization of computing architectures, the focus is shifting towards fitting an extensive data representation range into fewer bits, e.g., a 32-bit space can represent a more extensive data range with floating-point (FP) representation than an integer representation. Computation using FP representation requires numerous encodings and leads to complex circuits for the FP operators, decreasing the EE of the entire system. This thesis presents the design of an EE ultra-low-power CGRA with native support for FP computation by leveraging an emerging paradigm of approximate computing called transprecision computing. We also present the contributions in the compilation toolchain and system-level integration of CGRA in a System-on-Chip, to envision the proposed CGRA as an EE hardware accelerator. Finally, an extensive set of experiments using real-world algorithms employed in near-sensor processing applications are performed, and results are compared with state-of-the-art (SoA) architectures. It is empirically shown that our proposed CGRA provides better results w.r.t. SoA architectures in terms of power, performance, and area.
APA, Harvard, Vancouver, ISO, and other styles
23

Rakotoarivelo, Hoby. "Contributions au co-design de noyaux irréguliers sur architectures manycore : cas du remaillage anisotrope multi-échelle en mécanique des fluides numérique." Thesis, Université Paris-Saclay (ComUE), 2018. http://www.theses.fr/2018SACLE012/document.

Full text
Abstract:
La simulation numérique d'écoulements complexes telles que les turbulences ou la propagation d'ondes de choc implique un temps de calcul conséquent pour une précision industrielle acceptable. Pour accélérer ces simulations, deux recours peuvent être combinés : l'adaptation de maillages afin de réduire le nombre de points d'une part, et le parallélisme pour absorber la charge de calcul d'autre part. Néanmoins réaliser un portage efficient des noyaux adaptatifs sur des architectures massivement parallèles n'est pas triviale. Non seulement chaque tâche relative à un voisinage local du domaine doit être propagée, mais le fait de traiter une tâche peut générer d'autres tâches potentiellement conflictuelles. De plus, les tâches en question sont caractérisées par une faible intensité arithmétique ainsi qu'une faible réutilisation de données déjà accédées. Par ailleurs, l'avènement de nouveaux types de processeurs dans le paysage du calcul haute performance implique un certain nombre de contraintes algorithmiques. Dans un contexte de réduction de la consommation électrique, ils sont caractérisés par de multiples cores faiblement cadencés et une hiérarchie mémoire profonde impliquant un coût élevé et asymétrique des accès-mémoire. Ainsi maintenir un rendement optimal des cores implique d'exposer un parallélisme très fin et élevé d'une part, ainsi qu'un fort taux de réutilisation de données en cache d'autre part. Ainsi la vraie question est de savoir comment structurer ces noyaux data-driven et data-intensive de manière à respecter ces contraintes ?Dans ce travail, nous proposons une approche qui concilie les contraintes de localité et de convergence en termes d'erreur et qualité de mailles. Plus qu'une parallélisation, elle s'appuie une re-conception des noyaux guidée par les contraintes hardware en préservant leur précision. Plus précisément, nous proposons des noyaux locality-aware pour l'adaptation anisotrope de variétés différentielles triangulées, ainsi qu'une parallélisation lock-free et massivement multithread de noyaux irréguliers. Bien que complémentaires, ces deux axes proviennent de thèmes de recherche distinctes mêlant informatique et mathématiques appliquées. Ici, nous visons à montrer que nos stratégies proposées sont au niveau de l'état de l'art pour ces deux axes
Numerical simulations of complex flows such as turbulence or shockwave propagation often require a huge computational time to achieve an industrial accuracy level. To speed up these simulations, two alternatives may be combined: mesh adaptation to reduce the number of required points on one hand, and parallel processing to absorb the computation workload on the other hand. However, efficiently porting adaptive kernels onto massively parallel architectures is far from trivial. Indeed, each task related to a local vicinity needs to be propagated, and processing a task may induce new, potentially conflicting tasks. Furthermore, these tasks are characterized by a low arithmetic intensity and a low reuse rate of already cached data. Besides, new kinds of accelerators have arisen in the high-performance computing landscape, involving a number of algorithmic constraints. In a context of reduced electrical power consumption, they are characterized by numerous underclocked cores and a deep memory hierarchy with expensive, asymmetric memory accesses. Therefore, kernels must expose a high degree of concurrency and a high cached-data reuse rate to maintain optimal core efficiency. The real issue is how to structure these data-driven and data-intensive kernels to match these constraints. In this work, we provide an approach which reconciles locality constraints and convergence in terms of mesh error and quality. More than a parallelization, it relies on a redesign of the kernels guided by hardware constraints while preserving accuracy. In fact, we devise a set of locality-aware kernels for anisotropic adaptation of triangulated differential manifolds, as well as a lock-free and massively multithreaded parallelization of irregular kernels. Although complementary, these axes come from distinct research themes mixing computer science and applied mathematics. Here, we aim to show that our schemes are on par with the state of the art for both axes.
APA, Harvard, Vancouver, ISO, and other styles
24

Chang, Tao. "Evaluation of programming models for manycore and / or heterogeneous architectures for Monte Carlo neutron transport codes." Thesis, Institut polytechnique de Paris, 2020. http://www.theses.fr/2020IPPAX099.

Full text
Abstract:
Dans cette thèse nous nous proposons d’évaluer les différents modèles de programmation disponibles pour adresser les architectures de type manycore et/ou hétérogènes dans le cadre des codes de transport Monte Carlo. On considèrera dans un premier temps un cas test d’application simple mais représentatif pour couvrir un éventail assez large de solutions et les comparer en terme de performance, de portabilité de la performance, de facilité de mise en œuvre et de maintenabilité. Les architectures cibles sont les CPU `classique', Intel Xeon Phi et GPU. Les modèles de programmation les plus pertinents seront ensuite mis en place dans un code de transport Monte Carlo
In this thesis we propose to evaluate the different programming models available for addressing manycore and/or heterogeneous architectures within the framework of Monte Carlo transport codes. A simple but representative application test case will first be considered in order to cover a fairly wide range of solutions and compare them in terms of performance, performance portability, ease of implementation and maintainability. The target architectures are `classic' CPUs, the Intel Xeon Phi and GPUs. The most relevant programming models will then be implemented in a Monte Carlo transport code.
APA, Harvard, Vancouver, ISO, and other styles
25

Nguyen, Tri Minh. "Exploring Data Compression and Random-access Reduction to Mitigate the Bandwidth Wall for Manycore Architectures." Thesis, Princeton University, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10936059.

Full text
Abstract:

The performance gap between computer processors and memory bandwidth is severely limiting the throughput of modern and future multi-core and manycore architectures. To handle this growing gap, commercial processors such as the Intel Xeon Phi and NVIDIA or AMD GPUs have needed to use expensive memory solutions like high-bandwidth memory (HBM) and 3D-stacked memory to satisfy the bandwidth demand of the growing core-count over each product generation. Without a scalable solution for the memory bandwidth issue, throughput-oriented computation cannot be improved. This problem is widely known as the bandwidth-wall.

Data compression and random-access reduction are promising approaches to increase bandwidth without raising costs. This thesis makes three specific contributions to the state-of-the-art. First, to reduce cache misses, we propose an on-chip cache compression method that drastically increases compression performance and cache hit rate over prior work. Second, to improve direct compression of off-chip bandwidth and make it more scalable, we propose a novel link compression framework that exploits the on-chip caches themselves as a massive and scalable compression dictionary. Last, to overcome poor random-access performance of nonvolatile memory (NVM) and make it more attractive as a DRAM replacement with crash consistency, we propose a multi-undo logging scheme that seamlessly logs memory writes sequentially to maximize NVM I/O operations per second (IOPS).

As a common principle, this thesis seeks to overcome the bandwidth wall for manycore architectures not through expensive memory technologies but by assessing and exploiting workload behavior, and not through burdening programmers with specialized semantics but by implementing software-transparent architectural improvements.
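The dictionary-style compression alluded to above can be pictured with a toy word-level scheme; this is only an illustrative sketch under assumed parameters (a 16-entry dictionary, one header byte per line, 1-byte indices), not the on-chip mechanisms proposed in the dissertation.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    // Toy dictionary compressor for one 64-byte cache line (8 x 64-bit words).
    // A word matching a dictionary entry is encoded as a 1-byte index; a miss
    // stores the full 8-byte word and learns the new value. Only a sketch of
    // the general idea, not a real hardware design.
    struct ToyCompressor {
        std::array<uint64_t, 16> dict{};   // illustrative 16-entry dictionary
        std::size_t next = 0;

        // Returns the compressed size in bytes for one cache line.
        std::size_t compress(const std::array<uint64_t, 8>& line) {
            std::size_t bytes = 1;                  // 1-byte hit/miss bitmap
            for (uint64_t w : line) {
                bool hit = false;
                for (uint64_t d : dict)
                    if (d == w) { hit = true; break; }
                if (hit) {
                    bytes += 1;                     // index of dictionary entry
                } else {
                    bytes += 8;                     // literal word
                    dict[next] = w;                 // learn the new value
                    next = (next + 1) % dict.size();
                }
            }
            return bytes;
        }
    };

    int main() {
        ToyCompressor c;
        std::array<uint64_t, 8> line{0, 0, 42, 0, 42, 7, 0, 42};
        std::printf("compressed size: %zu bytes (vs 64)\n", c.compress(line));
        std::printf("second pass:     %zu bytes\n", c.compress(line));
    }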

APA, Harvard, Vancouver, ISO, and other styles
26

Dao, Van Toan. "Calcul à haute performance et simulations stochastiques : Etude de la reproductibilité numérique sur architectures multicore et manycore." Thesis, Université Clermont Auvergne (2017-2020), 2017. http://www.theses.fr/2017CLFAC005/document.

Full text
Abstract:
La reproductibilité des expériences numériques sur les systèmes de calcul à haute performance est parfois négligée. De plus, les méthodes numériques employées pour une parallélisation rigoureuse des simulations stochastiques sont souvent méconnues. En effet, les résultats obtenus pour une simulation stochastique utilisant des systèmes de calcul à hautes performances peuvent être différents d’une exécution à l’autre, et ce pour les mêmes paramètres et les même contextes d’exécution du fait de l’impact des nouvelles architectures, des accélérateurs, des compilateurs, des systèmes d’exploitation ou du changement de l’ordre d’exécution en parallèle des opérations en arithmétique flottantes au sein des micro-processeurs. En cas de non répétabilité des expériences numériques, comment mettre au point les applications ? Quel crédit peut-on apporter au logiciel parallèle ainsi développé ? Dans cette thèse, nous faisons une synthèse des causes de non-reproductibilité pour une simulation stochastique parallèle utilisant des systèmes de calcul à haute performance. Contrairement aux travaux habituels du parallélisme, nous ne nous consacrons pas à l’amélioration des performances, mais à l’obtention de résultats numériquement répétables d’une expérience à l’autre. Nous présentons la reproductibilité et ses apports dans la science numérique expérimentale. Nous proposons dans cette thèse quelques contributions, notamment : pour vérifier la reproductibilité et la portabilité des générateurs modernes de nombres pseudo-aléatoires ; pour détecter la corrélation entre flux parallèles issus de générateurs de nombres pseudo-aléatoires ; pour répéter et reproduire les résultats numériques de simulations stochastiques parallèles indépendantes
The reproducibility of numerical experiments on high performance computing systems is sometimes overlooked. Moreover, the numerical methods used for rigorous parallelization of stochastic simulations are often unknown. Indeed, the results obtained for a stochastic simulation using high performance computing systems can be different from run to run with the same parameters and the same execution contexts, due to the impact of new architectures, accelerators, compilers, operating systems, or a change in the order of execution of floating-point arithmetic operations within the microprocessors introduced by parallel optimizations. In the case of non-repeatability of numerical experiments, how can we seriously develop a scientific application? What credit can be given to the parallel software thus developed? In this thesis, we synthesize the main causes of non-reproducibility for a parallel stochastic simulation using high performance computing systems. Unlike the usual parallelism works, we do not focus on improving performance, but on obtaining numerically repeatable results from one experiment to another. We present reproducibility and its contributions to experimental numerical science. Furthermore, we propose several contributions, in particular: to verify the reproducibility and portability of modern pseudo-random number generators, to detect correlation between parallel streams issued from such generators, and to repeat and reproduce the numerical results of independent parallel stochastic simulations.
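A basic ingredient behind run-to-run repeatability, as discussed above, is to give each logical stream its own identified pseudo-random substream and to combine partial results in a fixed order that does not depend on thread scheduling. The sketch below illustrates the idea with standard C++ facilities; the per-stream seeding shown here is deliberately naive (detecting correlation between such streams is precisely one of the thesis contributions), and all names and constants are illustrative.

    #include <cstdio>
    #include <random>
    #include <thread>
    #include <vector>

    // Each logical stream gets a deterministic seed and writes into its own
    // slot, and the final reduction is done in a fixed order, so the estimate
    // does not depend on how the OS schedules the worker threads.
    double monte_carlo_pi(unsigned n_streams, unsigned draws_per_stream) {
        std::vector<double> partial(n_streams, 0.0);
        std::vector<std::thread> workers;
        for (unsigned s = 0; s < n_streams; ++s) {
            workers.emplace_back([s, draws_per_stream, &partial] {
                // Naive stream splitting for illustration only; independence
                // between such streams must be checked in practice.
                std::mt19937_64 gen(1234567ULL + s);
                std::uniform_real_distribution<double> u(0.0, 1.0);
                unsigned inside = 0;
                for (unsigned i = 0; i < draws_per_stream; ++i) {
                    double x = u(gen), y = u(gen);
                    if (x * x + y * y <= 1.0) ++inside;
                }
                partial[s] = 4.0 * inside / draws_per_stream;
            });
        }
        for (auto& t : workers) t.join();
        double sum = 0.0;
        for (double p : partial) sum += p;          // fixed reduction order
        return sum / n_streams;
    }

    int main() {
        std::printf("estimate: %.6f\n", monte_carlo_pi(8, 1000000));
    }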
APA, Harvard, Vancouver, ISO, and other styles
27

Rajamanikkam, Chidhambaranathan. "Understanding Security Threats of Emerging Computing Architectures and Mitigating Performance Bottlenecks of On-Chip Interconnects in Manycore NTC System." DigitalCommons@USU, 2019. https://digitalcommons.usu.edu/etd/7453.

Full text
Abstract:
Emerging computing architectures, such as neuromorphic computing and third-party intellectual property (3PIP) cores, have attracted significant attention in the recent past. Neuromorphic computing introduces an unorthodox non-von Neumann architecture that mimics the abstract behavior of neuron activity in the human brain. Such architectures can execute complex applications, such as image processing and object recognition, more efficiently in terms of performance and energy than traditional microprocessors. However, little attention has been paid to the hardware security aspects of neuromorphic computing at this nascent stage. 3PIP cores, on the other hand, may contain covertly inserted malicious functional behavior that can inflict a range of harms at the system/application levels. This dissertation examines the impact of various threat models that emerge from neuromorphic architectures and 3PIP cores. Near-Threshold Computing (NTC) serves as an energy-efficient paradigm by aggressively operating all computing resources with a supply voltage close to the threshold voltage, at the cost of performance. Therefore, an STC system is scaled to a many-core NTC system to reclaim the lost performance. However, the interconnect in a many-core NTC system poses a significant bottleneck that hinders the performance of the system. This dissertation analyzes the interconnect performance and, further, proposes a novel technique to boost the interconnect performance of many-core NTC systems.
APA, Harvard, Vancouver, ISO, and other styles
28

Stan, Oana. "Placement of tasks under uncertainty on massively multicore architectures." Thesis, Compiègne, 2013. http://www.theses.fr/2013COMP2116/document.

Full text
Abstract:
Ce travail de thèse de doctorat est dédié à l'étude de problèmes d'optimisation combinatoire du domaine des architectures massivement parallèles avec la prise en compte des données incertaines tels que les temps d'exécution. On s'intéresse aux programmes sous contraintes probabilistes dont l'objectif est de trouver la meilleure solution qui soit réalisable avec un niveau de probabilité minimal garanti. Une analyse quantitative des données incertaines à traiter (variables aléatoires dépendantes, multimodales, multidimensionnelles, difficiles à caractériser avec des lois de distribution usuelles), nous a conduit à concevoir une méthode qui est non paramétrique, intitulée "approche binomiale robuste". Elle est valable quelle que soit la loi jointe et s'appuie sur l'optimisation robuste et sur des tests d'hypothèse statistique. On propose ensuite une méthodologie pour adapter des algorithmes de résolution de type approchée pour résoudre des problèmes stochastiques en intégrant l'approche binomiale robuste afin de vérifier la réalisabilité d'une solution. La pertinence pratique de notre démarche est enfin validée à travers deux problèmes issus de la compilation des applications de type flot de données pour les architectures manycore. Le premier problème traite du partitionnement stochastique de réseaux de processus sur un ensemble fixé de nœuds, en prenant en compte la charge de chaque nœud et les incertitudes affectant les poids des processus. Afin de trouver des solutions robustes, un algorithme par construction progressive à démarrages multiples a été proposé ce qui a permis d'évaluer le coût des solution et le gain en robustesse par rapport aux solutions déterministes du même problème. Le deuxième problème consiste à traiter de manière globale le placement et le routage des applications de type flot de données sur une architecture clustérisée. L'objectif est de placer les processus sur les clusters en s'assurant de la réalisabilité du routage des communications entre les tâches. Une heuristique de type GRASP a été conçue pour le cas déterministe, puis adaptée au cas stochastique clustérisé
This PhD thesis is devoted to the study of combinatorial optimization problems related to massively parallel embedded architectures when taking into account uncertain data (e.g. execution time). Our focus is on chance constrained programs with the objective of finding the best solution which is feasible with a preset probability guarantee. A qualitative analysis of the uncertain data we have to treat (dependent random variables, multimodal, multidimensional, difficult to characterize through classical distributions) has led us to design a non-parametric method, the so-called "robust binomial approach", valid whatever the joint distribution and which is based on robust optimization and statistical hypothesis testing. We also propose a methodology for adapting approximate algorithms for solving stochastic problems by integrating the robust binomial approach when verifying solution feasibility. The practical relevance of our approach is validated through two problems arising in the compilation of dataflow applications for manycore platforms. The first problem treats the stochastic partitioning of networks of processes on a fixed set of nodes, by taking into account the load of each node and the uncertainty affecting the weight of the processes. For finding stochastic solutions, a semi-greedy iterative algorithm has been proposed which allowed measuring the robustness and cost of the solutions with regard to those for the deterministic version of the problem. The second problem consists in studying the global placement and routing of dataflow applications on a clusterized architecture. The purpose being to place the processes on clusters such that a feasible routing exists, a GRASP heuristic has been conceived first for the deterministic case and afterwards extended for the chance constrained variant of the problem.
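The chance-constrained feasibility check described above can be illustrated, in a heavily simplified form, by sampling: a candidate solution is accepted only if the number of sampled scenarios in which it respects the capacity constraints exceeds a threshold derived from the target probability level. The threshold below uses a plain normal approximation of a one-sided binomial test and stands in for, but is not, the robust binomial approach of the thesis; the scenario model and constants are assumptions.

    #include <cmath>
    #include <cstdio>
    #include <functional>
    #include <random>

    // Accept a candidate solution if, out of n sampled scenarios, the number
    // of scenarios in which it is feasible reaches the threshold implied by
    // the target probability p0 and a one-sided confidence level (normal
    // approximation of the binomial distribution).
    bool chance_feasible(const std::function<bool(std::mt19937_64&)>& feasible_in_scenario,
                         unsigned n_samples, double p0, double z = 1.645 /* ~95% */) {
        std::mt19937_64 gen(42);
        unsigned ok = 0;
        for (unsigned i = 0; i < n_samples; ++i)
            if (feasible_in_scenario(gen)) ++ok;
        double threshold = n_samples * p0 + z * std::sqrt(n_samples * p0 * (1.0 - p0));
        return ok >= threshold;
    }

    int main() {
        // Illustrative scenario: a task with uncertain load must fit a budget.
        auto scenario = [](std::mt19937_64& g) {
            std::normal_distribution<double> load(80.0, 10.0);  // assumed model
            return load(g) <= 100.0;                            // capacity check
        };
        bool ok = chance_feasible(scenario, 1000, 0.90);
        std::printf("solution %s with probability >= 0.90\n",
                    ok ? "accepted" : "rejected");
    }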
APA, Harvard, Vancouver, ISO, and other styles
29

Dahmani, Safae. "Modèles et protocoles de cohérence de données, décision et optimisation à la compilation pour des architectures massivement parallèles." Thesis, Lorient, 2015. http://www.theses.fr/2015LORIS384/document.

Full text
Abstract:
Le développement des systèmes massivement parallèles de type manycores permet d'obtenir une très grande puissance de calcul à bas coût énergétique. Cependant, l'exploitation des performances de ces architectures dépend de l'efficacité de programmation des applications. Parmi les différents paradigmes de programmation existants, celui à mémoire partagée est caractérisé par une approche intuitive dans laquelle tous les acteurs disposent d'un accès à un espace d'adressage global. Ce modèle repose sur l'efficacité du système à gérer les accès aux données partagées. Le système définit les règles de gestion des synchronisations et de stockage de données qui sont prises en charge par les protocoles de cohérence. Dans le cadre de cette thèse nous avons montré qu'il n'y a pas un unique protocole adapté aux différents contextes d'application et d'exécution. Nous considérons que le choix d'un protocole adapté doit prendre en compte les caractéristiques de l'application ainsi que des objectifs donnés pour une exécution. Nous nous intéressons dans ces travaux de thèse au choix des protocoles de cohérence en vue d'améliorer les performances du système. Nous proposons une plate-forme de compilation pour le choix et le paramétrage d'une combinaison de protocoles de cohérence pour une même application. Cette plate- forme est constituée de plusieurs briques. La principale brique développée dans cette thèse offre un moteur d'optimisation pour la configuration des protocoles de cohérence. Le moteur d'optimisation, inspiré d'une approche évolutionniste multi-objectifs (i.e. Fast Pareto Genetic Algorithm), permet d'instancier les protocoles de cohérence affectés à une application. L'avantage de cette technique est un coût de configuration faible permettant d'adopter une granularité très fine de gestion de la cohérence, qui peut aller jusqu'à associer un protocole par accès. La prise de décision sur les protocoles adaptés à une application est orientée par le mode de performance choisi par l'utilisateur (par exemple, l'économie d'énergie). Le modèle de décision proposé est basé sur la caractérisation des accès aux données partagées selon différentes métriques (par exemple: la fréquence d'accès, les motifs d'accès à la mémoire, etc). Les travaux de thèse traitent également des techniques de gestion de données dans la mémoire sur puce. Nous proposons deux protocoles basés sur le principe de coopération entre les caches répartis du système: Un protocole de glissement des données ainsi qu'un protocole inspiré du modèle physique du masse-ressort
Manycore architectures consist of hundreds to thousands of embedded cores, distributed memories and a dedicated network on a single chip. In this context, and because of the scale of the processor, providing a shared memory system has to rely on efficient hardware and software mechanisms and data consistency protocols. Numerous works explored consistency mechanisms designed for highly parallel architectures. They led to the conclusion that there won't exist one protocol that fits all applications and hardware contexts. In order to deal with consistency issues for this kind of architecture, we propose in this work a multi-protocol compilation toolchain, in which shared data of the application can be managed by different protocols. Protocols are chosen and configured at compile time, following the application behaviour and the targeted architecture specifications. The application behaviour is characterized with a static analysis process that helps to guide the protocols assignment to each data access. The platform offers a protocol library where each protocol is characterized by one or more parameters. The range of possible values of each parameter depends on some constraints mainly related to the targeted platform. The protocols configuration relies on a genetic-based engine that allows instantiating each protocol with appropriate parameter values according to multiple performance objectives. In order to evaluate the quality of each proposed solution, we use different evaluation models. We first use a traffic analytical model which gives some NoC communication statistics but no timing information. Therefore, we propose two cycle-based evaluation models that provide more accurate performance metrics while taking into account contention effects due to the consistency protocols communications. We also propose a cooperative cache consistency protocol improving the cache miss rate by sliding data to less stressed neighbours. An extension of this protocol is proposed in order to dynamically define the sliding radius assigned to each data migration. This extension is based on the mass-spring physical model. Experimental validation of the different contributions uses the sliding-based protocols versus a four-state directory-based protocol.
APA, Harvard, Vancouver, ISO, and other styles
30

Berger, Karl-Eduard. "Placement de graphes de tâches de grande taille sur architectures massivement multicoeurs." Thesis, Université Paris-Saclay (ComUE), 2015. http://www.theses.fr/2015SACLV026/document.

Full text
Abstract:
Ce travail de thèse de doctorat est dédié à l'étude d'un problème de placement de tâches dans le domaine de la compilation d'applications pour des architectures massivement parallèles. Ce problème vient en réponse à certains besoins industriels tels que l'économie d'énergie, la demande de performances pour les applications de type flots de données synchrones. Ce problème de placement doit être résolu dans le respect de trois critères: les algorithmes doivent être capable de traiter des applications de tailles variables, ils doivent répondre aux contraintes de capacités des processeurs et prendre en compte la topologie des architectures cibles. Dans cette thèse, les tâches sont organisées en réseaux de communication, modélisés sous forme de graphes. Pour évaluer la qualité des solutions produites par les algorithmes, les placements obtenus sont comparés avec un placement aléatoire. Cette comparaison sert de métrique d'évaluation des placements des différentes méthodes proposées. Afin de résoudre à ce problème, deux algorithmes de placement de réseaux de tâches de grande taille sur des architectures clusterisées de processeurs de type many-coeurs ont été développés. Ils s'appliquent dans des cas où les poids des tâches et des arêtes sont unitaires. Le premier algorithme, nommé Task-wise Placement, place les tâches une par une en se servant d'une notion d'affinité entre les tâches. Le second, intitulé Subgraph-wise Placement, rassemble les tâches en groupes puis place les groupes de tâches sur les processeurs en se servant d'une relation d'affinité entre les groupes et les tâches déjà affectées. Ces algorithmes ont été testés sur des graphes, représentants des applications, possédant des topologies de types grilles ou de réseaux de portes logiques. Les résultats des placements sont comparés avec un algorithme de placement, présent dans la littérature qui place des graphes de tailles modérée et ce à l'aide de la métrique définie précédemment. Les cas d'application des algorithmes de placement sont ensuite orientés vers des graphes dans lesquels les poids des tâches et des arêtes sont variables similairement aux valeurs qu'on peut retrouver dans des cas industriels. Une heuristique de construction progressive basée sur la théorie des jeux a été développée. Cet algorithme, nommé Regret Based Approach, place les tâches une par une. Le coût de placement de chaque tâche en fonction des autres tâches déjà placées est calculée. La phase de sélection de la tâche se base sur une notion de regret présente dans la théorie des jeux. La tâche qu'on regrettera le plus de ne pas avoir placée est déterminée et placée en priorité. Afin de vérifier la robustesse de l'algorithme, différents types de graphes de tâches (grilles, logic gate networks, series-parallèles, aléatoires, matrices creuses) de tailles variables ont été générés. Les poids des tâches et des arêtes ont été générés aléatoirement en utilisant une loi bimodale paramétrée de manière à obtenir des valeurs similaires à celles des applications industrielles. Les résultats de l'algorithme ont également été comparés avec l'algorithme Task-Wise Placement, qui a été spécialement adapté pour les valeurs non unitaires. Les résultats sont également évalués en utilisant la métrique de placement aléatoire
This Ph.D. thesis is devoted to the study of the mapping problem related to massively parallel embedded architectures. This problem arises from industrial needs like energy savings and performance demands for synchronous dataflow applications. This problem has to be solved considering three criteria: heuristics should be able to deal with applications of various sizes, they must meet the constraints of capacities of processors and they have to take into account the target architecture topologies. In this thesis, tasks are organized in communication networks, modeled as graphs. In order to determine a way of evaluating the efficiency of the developed heuristics, mappings obtained by the heuristics are compared to a random mapping. This comparison is used as an evaluation metric throughout this thesis. The existence of this metric is motivated by the fact that no comparable heuristics could be found in the literature at the time of writing of this thesis. In order to address this problem, two heuristics are proposed. They are able to solve a dataflow process network mapping problem, where a network of communicating tasks is placed into a set of processors with limited resource capacities, while minimizing the overall communication bandwidth between processors. They are applied to task graphs where task and edge weights are unitary. The first heuristic, denoted as Task-wise Placement, places tasks one after another using a notion of task affinities. The second algorithm, named Subgraph-wise Placement, gathers tasks in small groups then places the different groups on processors using a notion of affinities between groups and processors. These algorithms are tested on task graphs with grid or logic gate network topologies. Obtained results are then compared to an algorithm present in the literature that maps task graphs of moderate size on massively parallel architectures. In addition, the random-mapping metric is used in order to evaluate the results of both heuristics. Then, in order to address problems that can be found in industrial cases, application cases are widened to task graphs with task and edge weight values similar to those that can be found in industry. A progressive construction heuristic named Regret Based Approach, based on game theory, is proposed. This heuristic maps tasks one after another. The cost of mapping each task according to the already mapped tasks is computed. The process of task selection is based on a notion of regret, present in game theory. The task with the highest value of regret for not being placed is pointed out and placed in priority. In order to check the strength of the algorithm, many types of task graphs (grids, logic gate networks, series-parallel, random, sparse matrices) of various sizes are generated. Task and edge weights are randomly chosen using a bimodal law parameterized in order to obtain values similar to those of industrial applications. Obtained results are compared to Task-wise Placement, specially adapted for non-unitary values. Moreover, results are evaluated using the metric defined above.
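The regret-based selection rule sketched above can be expressed as: for each unplaced task, compute the cost of its best and second-best placement, and pick the task whose gap between the two (its regret) is largest. The cost matrix, data layout and absence of capacity updates in the following sketch are simplifying assumptions, not the thesis implementation.

    #include <cstddef>
    #include <cstdio>
    #include <limits>
    #include <vector>

    struct Candidate { int task; int best_proc; double regret; };

    // Pick the unplaced task with the largest regret, i.e. the task we would
    // "regret" most not to place now, given a per-(task, processor) cost matrix.
    Candidate select_by_regret(const std::vector<std::vector<double>>& cost,
                               const std::vector<bool>& placed) {
        Candidate best{-1, -1, -1.0};
        for (std::size_t t = 0; t < cost.size(); ++t) {
            if (placed[t]) continue;
            double c1 = std::numeric_limits<double>::infinity();  // best cost
            double c2 = c1;                                        // second best
            int p1 = -1;
            for (std::size_t p = 0; p < cost[t].size(); ++p) {
                if (cost[t][p] < c1) { c2 = c1; c1 = cost[t][p]; p1 = (int)p; }
                else if (cost[t][p] < c2) { c2 = cost[t][p]; }
            }
            double regret = c2 - c1;
            if (regret > best.regret) best = {(int)t, p1, regret};
        }
        return best;
    }

    int main() {
        // 3 tasks x 2 processors, illustrative placement costs.
        std::vector<std::vector<double>> cost = {{1, 9}, {4, 5}, {2, 3}};
        std::vector<bool> placed(3, false);
        while (true) {
            Candidate c = select_by_regret(cost, placed);
            if (c.task < 0) break;
            std::printf("place task %d on processor %d (regret %.1f)\n",
                        c.task, c.best_proc, c.regret);
            placed[c.task] = true;   // costs would normally be updated here
        }
    }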
APA, Harvard, Vancouver, ISO, and other styles
31

Chi-Neng, Wen, and 文啟能. "A Unified and Scalable Manycore Testing/Debugging/Tracing Architecture." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/09623940949115010219.

Full text
Abstract:
Ph.D.
National Chung Cheng University
Graduate Institute of Computer Science and Information Engineering
99 (ROC academic year)
Traditional debug methodologies are limited in their ability to provide debugging support for multicore parallel programming. Synchronization problems or bugs due to race conditions are particularly difficult to detect with software debugging tools. Most traditional debugging approaches rely on globally synchronized signals, but these pose problems in terms of scalability. Moreover, another urgent issue when developing parallel programs on multicore systems is to reproduce the same faulted circumstance after faults are detected. Tracing is a popular solution in current multicore systems; however, it is limited by the on-chip storage and also by the tracing bandwidth. As a result, an intelligent recording scheme followed by a replaying system is the key to future many-core debugging. The first contribution of this work is to propose a novel non-uniform debug architecture (NUDA) based on a ring interconnection scheme and non-uniform memory. Our approach makes debugging both feasible and scalable for many-core processing scenarios. The key idea is to distribute the debugging support structures across a set of hierarchical clusters while avoiding address overlap. This allows the address space to be monitored using non-uniform protocols. Our second contribution is a non-intrusive approach to race detection supported by the NUDA. A non-uniform page-based monitoring cache in each NUDA node is used to watch the access footprints. The union of all the caches can serve as a race detection probe. The third contribution is a non-intrusive run-time assertion mechanism (RunAssert) for parallel program development based on the NUDA. Our approaches are as follows: (a) a current language extension for parallel program debugging, (b) corresponding non-intrusive hardware configuration logic and checking methodologies, and (c) several realistic cases using the extensions mentioned above. In general, the target program can be executed at its original speed without altering the parallel sequences, thereby eliminating the possibility of the Heisenberg effect. Furthermore, this work reuses the test access mechanism (TAM) for the NUDA to contribute a unified manycore testing/debugging/tracing architecture, called TAM-NUDA (Test Access Mechanism based Non-Uniform Debugging Architecture). The TAM bus is switched between testing during test mode and debugging during normal mode. Finally, this work also proposes a new post-processing solution between the record and replay phases to significantly reduce the trace overhead while enabling faster replaying. The key point is that we perform dependence analysis and minimize the event trace statically. Then a deterministic replayer, called Dini, can be directed by the new trace at a faster speed. Three trace analysis strategies are also considered such that a faster parallel replayer on a multicore host can still ensure the correct sequence of the faulted execution. Our experimental results demonstrate that the proposed trace reduction technologies reduce the trace size by 5X to 12.5X while still retaining a fast replay speed. Using the proposed approaches, we show that parallel race bugs can be precisely captured, and that most false-positive alerts can be efficiently eliminated at an average slowdown cost of only 1.4%~3.6%. The net hardware cost is relatively low, so that the NUDA can readily scale to increasingly complex many-core systems.
Our experimental results also demonstrate that the proposed TAM-NUDA method only sacrifices an average of 5.9% of testing time for debug/trace mode reuse. Moreover, this work also realizes state-of-the-art race detectors and trace monitors at a total area overhead of only 1.49%. Finally, experimental results show that the proposed trace reduction technologies reduce the trace size by 5X to 12.5X while still retaining a fast replay speed.
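As a rough software illustration of footprint-based race probing (and not of the NUDA hardware itself), the toy checker below remembers the last writer of each address and flags an access by a different core that is not separated from that write by a synchronization point. It tracks only write-then-access conflicts and ignores locksets, page granularity and all implementation details of the monitoring caches.

    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>

    // Toy race probe: every shared address remembers its last writer and the
    // epoch of that write. An access by a different core in the same epoch
    // (i.e. with no synchronization in between) is reported as a potential race.
    struct RaceProbe {
        struct Entry { int core; uint64_t epoch; };
        std::unordered_map<uint64_t, Entry> last_write;
        uint64_t epoch = 0;

        void sync() { ++epoch; }   // called at barriers / lock releases

        void access(uint64_t addr, int core, bool is_write) {
            auto it = last_write.find(addr);
            if (it != last_write.end() && it->second.core != core &&
                it->second.epoch == epoch) {
                std::printf("potential race on 0x%llx: core %d vs core %d\n",
                            (unsigned long long)addr, it->second.core, core);
            }
            if (is_write) last_write[addr] = {core, epoch};
        }
    };

    int main() {
        RaceProbe probe;
        probe.access(0x1000, /*core=*/0, /*is_write=*/true);
        probe.access(0x1000, /*core=*/1, /*is_write=*/false);  // flagged
        probe.sync();
        probe.access(0x1000, /*core=*/1, /*is_write=*/true);   // after sync: OK
    }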
APA, Harvard, Vancouver, ISO, and other styles
32

"Application-aware Performance Optimization for Software Managed Manycore Architectures." Doctoral diss., 2019. http://hdl.handle.net/2286/R.I.53524.

Full text
Abstract:
abstract: One of the main goals of computer architecture design is to improve performance without much increase in the power consumption. It cannot be achieved by adding increasingly complex intelligent schemes in the hardware, since they will become increasingly less power-efficient. Therefore, parallelism comes up as the solution. In fact, the irrevocable trend of computer design in the near future is still to keep increasing the number of cores while reducing the operating frequency. However, it is not easy to scale the number of cores. One important challenge is that existing cores consume too much power. Another challenge is that the cache-based memory hierarchy poses a serious limitation due to the rapidly increasing demand of area and power for coherence maintenance. In this dissertation, opportunities to resolve the aforementioned issues were explored in two aspects. Firstly, the possibility of removing the hardware cache altogether, and replacing it with scratchpad memory with software management, was explored. Scratchpad memory consumes much less power than caches. However, as data management logic is completely shifted to software, how to reduce the software overhead is challenging. This thesis presents techniques to manage scratchpad memory judiciously by exploiting application semantics and knowledge of data access patterns, thereby enabling optimization of data movement across the memory hierarchy. Experimental results show that the optimization was able to reduce stack data management overhead by 13X, produce better code mapping in more than 80% of the cases, and improve performance by 83% in heap management. Secondly, the possibility of using software branch hinting to replace hardware branch prediction to completely eliminate power consumption on the corresponding hardware components was explored. As the branch predictor is removed from hardware, software logic is responsible for reducing the branch penalty. Techniques to minimize the branch penalty by optimizing branch hint placement were proposed, which can reduce the branch penalty by 35.4% over the state-of-the-art.
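For the branch-hinting part of the abstract, a related compiler-level mechanism can give the flavour of the idea: a source-level hint marks a branch as likely or unlikely so the expected path stays on the fall-through. The dissertation instead optimizes the placement of explicit hint instructions on hardware without a branch predictor; the GCC/Clang-style macros below are only a generic illustration.

    #include <cstdio>

    // Generic GCC/Clang-style source-level branch hints; shown only to
    // illustrate software branch hinting, not the ISA-specific hint
    // instructions whose placement the dissertation optimizes.
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    long sum_positive(const int* a, long n) {
        long s = 0;
        for (long i = 0; i < n; ++i) {
            // Hint: positive elements are assumed to be the common case, so
            // the compiler keeps the addition on the fall-through path.
            if (likely(a[i] > 0))
                s += a[i];
        }
        return s;
    }

    int main() {
        int a[] = {3, 5, -1, 7, 2};
        std::printf("%ld\n", sum_positive(a, 5));
    }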
Dissertation/Thesis
Doctoral Dissertation Computer Science 2019
APA, Harvard, Vancouver, ISO, and other styles
33

"Enabling Multi-threaded Applications on Hybrid Shared Memory Manycore Architectures." Master's thesis, 2014. http://hdl.handle.net/2286/R.I.27434.

Full text
Abstract:
abstract: As the number of cores per chip increases, maintaining cache coherence becomes prohibitive for both power and performance. Non Coherent Cache (NCC) architectures do away with hardware-based cache coherence, but they become difficult to program. Some existing architectures provide a middle ground by providing some shared memory in the hardware. Specifically, the 48-core Intel Single-chip Cloud Computer (SCC) provides some off-chip (DRAM) shared memory and some on-chip (SRAM) shared memory. We call such architectures Hybrid Shared Memory, or HSM, manycore architectures. However, how to efficiently execute multi-threaded programs on HSM architectures is an open problem. To be able to execute a multi-threaded program correctly on HSM architectures, the compiler must: i) identify all the shared data and map it to the shared memory, and ii) map the frequently accessed shared data to the on-chip shared memory. This work presents a source-to-source translator written using CETUS that identifies a conservative superset of all the shared data in a multi-threaded application and maps it to the shared memory such that it enables execution on HSM architectures.
Dissertation/Thesis
Masters Thesis Computer Science 2014
APA, Harvard, Vancouver, ISO, and other styles
34

Baral, Pushpa. "Static Analysis and Dynamic Monitoring of Program Flow on REDEFINE Manycore Processor." Thesis, 2021. https://etd.iisc.ac.in/handle/2005/5577.

Full text
Abstract:
Manycore heterogeneous architectures are becoming the promising choice for high-performance computing applications. Multiple parallel tasks run concurrently across different processor cores sharing the same communication and memory resources, thus providing high performance and throughput. But, with the increase in the number of cores, the complexity of the system also increases. Such systems provide very limited visibility of internal processing behaviour. This poses a challenge for a software developer as it is difficult to monitor the program in execution. Monitoring the program behaviour is required for exploiting the architecture to the maximum, debugging, security, performance analysis, and so on. REDEFINE is one such macro dataflow execution based manycore co-processor designed to accelerate the compute-intensive part of an application. In this work, we propose a static and dynamic monitoring framework in REDEFINE. First, we present a software instrumentation based approach to generate the execution traces with negligible hardware overhead. The generated trace data is processed through an offline analysis tool to get the execution graph. This graph can be utilized to analyse the program behaviour statically. Although this approach is flexible and easy to implement, the offline analysis might result in delayed detection of any unexpected program behaviour. To address this, we then present a run-time monitoring framework. In this approach, a separate hardware monitoring unit runs in parallel to the processor core. The monitor is programmed with the monitoring graph obtained from static analysis of the program binary. It matches the instructions executed with the corresponding information in the monitoring graph and flags any unintended program behaviour. This aids in the identification of any intrusion or attack that alters the control flow integrity of the program executed and further helps in the realisation of a secure computing platform that is a primary requirement today in the area of 5G network processing, autonomous vehicles and avionics applications.
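The run-time matching described above can be pictured with a toy model of graph-based flow monitoring: the monitor holds the set of legal control-flow edges extracted from static analysis and flags any executed transition that is not in that set. The data structures and block numbering below are illustrative; the real monitor is a hardware unit running alongside the core.

    #include <cstdio>
    #include <set>
    #include <utility>

    // Toy flow monitor: legal_edges comes from static analysis of the program
    // binary; step() checks every executed transition against that graph and
    // flags any edge that should not occur.
    struct FlowMonitor {
        std::set<std::pair<int, int>> legal_edges;   // (from_block, to_block)
        int current = 0;                             // entry block

        bool step(int next_block) {
            bool ok = legal_edges.count({current, next_block}) > 0;
            if (!ok)
                std::printf("violation: edge %d -> %d not in monitoring graph\n",
                            current, next_block);
            current = next_block;
            return ok;
        }
    };

    int main() {
        FlowMonitor m;
        m.legal_edges = {{0, 1}, {1, 2}, {1, 3}, {2, 1}, {3, 4}};
        m.step(1);   // legal
        m.step(3);   // legal
        m.step(2);   // flagged: 3 -> 2 is not an edge of the graph
    }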
APA, Harvard, Vancouver, ISO, and other styles
35

"Scratchpad Management in Software Managed Manycore Architectures." Doctoral diss., 2017. http://hdl.handle.net/2286/R.I.46214.

Full text
Abstract:
abstract: Caches have long been used to reduce memory access latency. However, the increased complexity of cache coherence brings significant challenges in processor design as the number of cores increases. While making caches scalable is still an important research problem, some researchers are exploring the possibility of a more power-efficient SRAM called scratchpad memory, or SPM. SPMs occupy significantly less area, and are more energy-efficient per access than caches, and therefore make the design of on-chip memories much simpler. Unlike caches, which fetch data from memories automatically, an SPM requires explicit instructions for data transfers. SPM-only architectures are thus named software managed manycore (SMM) architectures, since the data movements of such architectures rely on software. SMM processors have been widely used in different areas, such as embedded computing, network processing, or even high performance computing. While SMM processors provide a low-power platform, the hardware alone does not guarantee power efficiency if applications on such processors deliver low performance. Efficient software techniques are therefore required. A large body of management techniques for SMM architectures is compiler-directed, as inserting data movement operations by hand forces programmers to trace the flow of data, which can be error-prone and sometimes difficult if not impossible. This thesis develops compiler-directed techniques to manage data transfers for embedded applications on SMMs efficiently. The techniques analyze and find the proper program points and insert data movement instructions accordingly. The techniques manage code, stack and heap data of applications, and reduce execution time by 14%, 52% and 80% respectively compared to their predecessors on typical embedded applications. On top of managing local data, a technique is also developed for shared data in SMM architectures. Experimental results show it achieves more than 2X speedup over the previous technique on average.
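The compiler-inserted data movement described above can be pictured with a simple tiling pattern: before a loop works on a block of a large array, the block is copied into a small local buffer standing in for the scratchpad, and copied back afterwards. The buffer size and the use of memcpy in place of a DMA transfer are illustrative assumptions.

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    constexpr std::size_t SPM_WORDS = 256;   // illustrative scratchpad capacity

    // Process a large array in scratchpad-sized tiles. In a real SMM compiler
    // flow, the two memcpy calls would be DMA transfers inserted automatically
    // around the computation on the tile.
    void scale_in_tiles(float* data, std::size_t n, float factor) {
        float spm[SPM_WORDS];                       // stand-in for the scratchpad
        for (std::size_t base = 0; base < n; base += SPM_WORDS) {
            std::size_t len = std::min(SPM_WORDS, n - base);
            std::memcpy(spm, data + base, len * sizeof(float));   // "DMA in"
            for (std::size_t i = 0; i < len; ++i)
                spm[i] *= factor;                                  // compute on SPM
            std::memcpy(data + base, spm, len * sizeof(float));    // "DMA out"
        }
    }

    int main() {
        std::vector<float> a(1000, 2.0f);
        scale_in_tiles(a.data(), a.size(), 1.5f);
        std::printf("a[999] = %.1f\n", a[999]);
    }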
Dissertation/Thesis
Doctoral Dissertation Computer Science 2017
APA, Harvard, Vancouver, ISO, and other styles
36

"Optimizing Heap Data Management on Software Managed Manycore Architectures." Master's thesis, 2017. http://hdl.handle.net/2286/R.I.45507.

Full text
Abstract:
abstract: Caches pose a serious limitation in scaling many-core architectures since the demand of area and power for maintaining cache coherence increases rapidly with the number of cores. Scratch-Pad Memories (SPMs) provide a cheaper and lower power alternative that can be used to build a more scalable many-core architecture. The trade-off of substituting SPMs for caches is however that the data must be explicitly managed in software. Heap management on SPM poses a major challenge due to the highly dynamic nature of heap data access. Most existing heap management techniques implement a software caching scheme on SPM, emulating the behavior of hardware caches. The state-of-the-art heap management scheme implements a 4-way set-associative software cache on SPM for a single program running with one thread on one core. While the technique works correctly, it suffers from significant performance overhead. This paper presents a series of compiler-based efficient heap management approaches that reduce heap management overhead through several optimization techniques. Experimental results on benchmarks from MiBench (Guthaus et al., 2001) executed on an SMM processor modeled in gem5 (Binkert et al., 2011) demonstrate that our approach (implemented in LLVM v3.8; Lattner and Adve, 2004) can improve execution time by 80% on average compared to the previous state-of-the-art.
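A minimal flavour of the software-caching idea referred to above (much simpler than the 4-way set-associative design it mentions) is a direct-mapped software cache: every heap access goes through a lookup that either hits in a scratchpad-resident block or evicts and refills one. Block size, table size and the use of memcpy in place of DMA are assumptions made for the sketch.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Direct-mapped software cache for heap data: 64 blocks of 64 bytes held in
    // a buffer standing in for the scratchpad, with write-back on eviction.
    struct SoftCache {
        static constexpr std::size_t kBlock = 64, kSets = 64;
        alignas(64) uint8_t spm[kSets * kBlock]{};
        uintptr_t tag[kSets]{};
        bool valid[kSets]{}, dirty[kSets]{};

        // Maps a heap address to a pointer inside the scratchpad buffer,
        // refilling the block on a miss and writing back a dirty victim.
        uint8_t* translate(void* heap_addr, bool will_write) {
            uintptr_t addr = reinterpret_cast<uintptr_t>(heap_addr);
            uintptr_t block = addr / kBlock;
            std::size_t set = block % kSets;
            uint8_t* line = &spm[set * kBlock];
            if (!valid[set] || tag[set] != block) {            // miss
                if (valid[set] && dirty[set])                   // write back victim
                    std::memcpy(reinterpret_cast<void*>(tag[set] * kBlock), line, kBlock);
                std::memcpy(line, reinterpret_cast<void*>(block * kBlock), kBlock);
                tag[set] = block; valid[set] = true; dirty[set] = false;
            }
            if (will_write) dirty[set] = true;
            return line + (addr % kBlock);                      // pointer into SPM
        }
    };

    int main() {
        alignas(64) static int heap_array[1024];   // stands in for heap data
        SoftCache c;
        int v = 7;
        std::memcpy(c.translate(&heap_array[10], true), &v, sizeof v);   // write
        int r = 0;
        std::memcpy(&r, c.translate(&heap_array[10], false), sizeof r);  // read
        std::printf("value = %d\n", r);
    }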
Dissertation/Thesis
Masters Thesis Computer Science 2017
APA, Harvard, Vancouver, ISO, and other styles
37

Yadav, Satyendra Singh. "Development of Wireless Communication Algorithms on Multicore/Manycore Architectures." Thesis, 2018. http://ethesis.nitrkl.ac.in/9426/1/2018_PhD_SSYadav_512EC1014_Development.pdf.

Full text
Abstract:
Wireless communication is one of the most rapidly growing technologies. To support new services demanding high data rates, system design requires sophisticated signal processing at the transmitter and receiver. The signal processing techniques include Fast Fourier Transform (FFT) and Inverse Fast Fourier Transform (IFFT) computation, the Maximal Ratio Receive Combining (MRRC) scheme, Peak-to-Average Power Ratio (PAPR) reduction techniques, resource allocation algorithms, optimization algorithms, detection algorithms and assignment algorithms, to name a few. These signal processing techniques make wireless communication systems computationally complex, leading to performance limitations. In this work we implemented some of the mentioned signal processing techniques in a Graphics Processing Unit (GPU) environment to harness parallel implementation. The huge computational capabilities of GPUs have been used to reduce the execution time of complex systems. Modern wireless communication heavily relies on FFT and IFFT computation. This research work attempts to address the implementation of FFT and IFFT through High Performance Computing (HPC) with GPUs. PAPR is another area which requires high computational complexity; it is also evaluated on the GPU architecture. In an Orthogonal Frequency Division Multiple Access (OFDMA) downlink system, allocation of resources to users at the Base Station (BS) demands a computationally efficient algorithm to fulfill the users' data rate requirements. Hence, this thesis proposes a Resource Allocation and Subcarrier Assignment (RASA) algorithm for multi-core architectures with Open Multiprocessing (OpenMP) support. Further in this thesis, the subcarrier assignment problem using the well-known Hungarian algorithm is analyzed and implemented on the Compute Unified Device Architecture (CUDA). Strong computational advantages are observed in the parallel implementation of the above problems in the area of wireless communication applications. Hence GPU-based HPC can provide a promising solution for complex wireless systems. The findings have been extensively evaluated through computer simulation with a GPU processor.
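Among the kernels listed above, PAPR is the simplest to state: the ratio between the peak instantaneous power and the average power of a block of samples. The small host-side sketch below computes it for one symbol; a GPU implementation would parallelize the max and mean reductions. The sample values are illustrative.

    #include <algorithm>
    #include <cmath>
    #include <complex>
    #include <cstdio>
    #include <vector>

    // PAPR (dB) of one block of complex baseband samples:
    //   PAPR = max_n |x[n]|^2 / mean_n |x[n]|^2
    double papr_db(const std::vector<std::complex<double>>& x) {
        double peak = 0.0, sum = 0.0;
        for (const auto& s : x) {
            double p = std::norm(s);     // |s|^2
            peak = std::max(peak, p);
            sum += p;
        }
        double mean = sum / x.size();
        return 10.0 * std::log10(peak / mean);
    }

    int main() {
        // Illustrative 8-sample symbol; a real OFDM symbol would be the IFFT
        // of the modulated subcarriers.
        std::vector<std::complex<double>> x = {
            {1, 0}, {0.2, 0.1}, {-0.5, 0.4}, {0.1, -0.9},
            {2.0, 0.3}, {-0.3, -0.2}, {0.7, 0.7}, {0.05, -0.1}};
        std::printf("PAPR = %.2f dB\n", papr_db(x));
    }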
APA, Harvard, Vancouver, ISO, and other styles
38

Κολώνιας, Βασίλειος. "Παράλληλοι αλγόριθμοι και εφαρμογές σε πολυπύρηνες μονάδες επεξεργασίας γραφικών." Thesis, 2014. http://hdl.handle.net/10889/8315.

Full text
Abstract:
Στην παρούσα διατριβή παρουσιάζονται παράλληλοι αλγόριθμοι και εφαρμογές σε πολυπύρηνες μονάδες επεξεργασίας γραφικών. Πιο συγκεκριμένα, εξετάζονται οι μέθοδοι σχεδίασης ενός παράλληλου αλγορίθμου για την επίλυση τόσο απλών και κοινών προβλημάτων, όπως η ταξινόμηση, όσο και υπολογιστικά απαιτητικών προβλημάτων, έτσι ώστε να εκμεταλλευτούμε πλήρως την τεράστια υπολογιστική δύναμη που προσφέρουν οι σύγχρονες μονάδες επεξεργασίας γραφικών. Πρώτο πρόβλημα που εξετάστηκε είναι η ταξινόμηση, η οποία είναι ένα από τα πιο συνηθισμένα προβλήματα στην επιστήμη των υπολογιστών. Υπάρχει σαν εσωτερικό πρόβλημα σε πολλές εφαρμογές, επομένως πετυχαίνοντας πιο γρήγορη ταξινόμηση πετυχαίνουμε πιο καλή απόδοση γενικότερα. Στο Κεφάλαιο 3 περιγράφονται όλα τα βήματα σχεδιασμού για την εκτέλεση ενός αλγορίθμου ταξινόμησης για ακεραίους, της count sort, σε μια μονάδα επεξεργασίας γραφικών. Σημαντική επίδραση στην απόδοση είχε η αποφυγή του συγχρονισμού των νημάτων στο τελευταίο βήμα του αλγορίθμου. Στη συνέχεια παρουσιάζονται εφαρμογές παράλληλων αλγορίθμων σε υπολογιστικά απαιτητικά προβλήματα. Στο Κεφάλαιο 4, εξετάζεται το πρόβλημα χρονοπρογραμματισμού εξετάσεων Πανεπιστημίων, το οποίο είναι ένα πρόβλημα συνδυαστικής βελτιστοποίησης. Για την επίλυσή του χρησιμοποιείται ένας υβριδικός εξελικτικός αλγόριθμος, ο οποίος εκτελείται εξ' ολοκλήρου στην μονάδα επεξεργασίας γραφικών. Η τεράστια υπολογιστική δύναμη της GPU και ο παράλληλος προγραμματισμός δίνουν τη δυνατότητα χρήσης μεγάλων πληθυσμών έτσι ώστε να εξερευνήσουμε καλύτερα τον χώρο λύσεων και να πάρουμε καλύτερα ποιοτικά αποτελέσματα. Στο επόμενο κεφάλαιο γίνεται επίλυση του προβλήματος σχεδιασμού κίνησης για υποθαλάσσια οχήματα με βραχίονα. Εξετάζεται το πρόβλημα τόσο του ολικού σχεδιασμού όσο και του τοπικού. Στην πρώτη περίπτωση είναι σημαντική η καλή λύση και η ακρίβεια και ο παράλληλος αλγόριθμος που χρησιμοποιείται για την αναπαράσταση του περιβάλλοντος εργασίας σε μια Bump-επιφάνεια βοηθάει προς αυτή την κατεύθυνση. Στη δεύτερη περίπτωση, το πρόβλημα είναι πρόβλημα πραγματικού χρόνου και μας ενδιαφέρει η ταχύτητα εύρεσης της επόμενης θέσης του οχήματος. Ο παράλληλος προγραμματισμός και η GPU βοηθούν σημαντικά σε αυτό. Τελευταία εφαρμογή που εξετάστηκε είναι η μελέτη ενός συστήματος ημιφθοριωμένων αλκανίων με την μοριακή προσομοίωση Monte Carlo. Η παραλληλοποίηση ενός μέρους, του πιο χρονοβόρου, του αλγορίθμου έδωσε τη δυνατότητα εξέτασης ενός πολύ μεγαλύτερου συστήματος σε αποδεκτό χρόνο. Σε γενικές γραμμές, γίνεται φανερό ότι ο παράλληλος προγραμματισμός και οι σύγχρονες πολυπύρηνες αρχιτεκτονικές, όπως οι μονάδες επεξεργασίας γραφικών, δίνουν νέες δυνατότητες στην αντιμετώπιση καθημερινών προβλημάτων, προβλημάτων πραγματικού χρόνου και προβλημάτων συνδυαστικής βελτιστοποίησης.
In this thesis, parallel algorithms and applications in manycore graphics processing units are presented. More specifically, we examine methods of designing a parallel algorithm for solving both simple and common problems such as sorting, and computationally demanding problems, so as to fully exploit the enormous computing power of modern graphics processing units (GPUs). The first problem considered is sorting, which is one of the most common problems in computer science. It exists as an internal problem in many applications. Therefore, sorting faster results in better performance in general. Chapter 3 describes all design options for the implementation of a sorting algorithm for integers, count sort, on a graphics processing unit. The elimination of thread synchronization in the last step of the algorithm had a significant effect on the performance. Chapter 4 addresses the examination timetabling problem for universities, which is a combinatorial optimization problem. A hybrid evolutionary algorithm, which runs entirely on the GPU, was used to solve the problem. The tremendous computing power of the GPU and parallel programming enable the use of large populations in order to explore the solution space better and obtain results of better quality. In the next chapter, the problem of motion planning for underwater vehicle manipulator systems is examined. In the gross motion planning problem, it is important to achieve a good solution with high accuracy. The parallel algorithm used for the representation of the working environment in a Bump-surface is a step towards this direction. In the local motion planning problem, which is a real-time problem, the time needed to find the next configuration of the vehicle is crucial. Parallel programming and the GPU greatly assist in this online problem. The last application considered is the atomistic Monte Carlo simulation of semifluorinated alkanes. The parallelization of the most time-consuming part of the algorithm enabled the study of a much larger system in an acceptable execution time. In general, it becomes obvious that parallel programming and novel manycore architectures, such as graphics processing units, give new capabilities for solving everyday problems, real-time problems and combinatorial optimization problems.
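Count sort, the kernel discussed in Chapter 3, is easy to state sequentially; the sketch below shows the plain CPU form. On a GPU the histogram is typically built with per-thread atomic increments and the output phase with a parallel prefix sum, and it is around that final step that the thesis avoids global thread synchronization; those device-side details are not shown here.

    #include <cstdio>
    #include <vector>

    // Sequential count sort for non-negative integers bounded by max_value.
    // On a GPU the histogram is usually built with atomic adds and the prefix
    // sum with a parallel scan; this host version only shows the algorithm.
    std::vector<int> count_sort(const std::vector<int>& in, int max_value) {
        std::vector<int> count(max_value + 1, 0);
        for (int v : in) ++count[v];                 // histogram
        std::vector<int> out;
        out.reserve(in.size());
        for (int v = 0; v <= max_value; ++v)         // emit values in order
            out.insert(out.end(), count[v], v);
        return out;
    }

    int main() {
        std::vector<int> a = {5, 1, 4, 1, 0, 5, 2};
        for (int v : count_sort(a, 5)) std::printf("%d ", v);
        std::printf("\n");
    }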
APA, Harvard, Vancouver, ISO, and other styles