
Dissertations / Theses on the topic 'Multi-Core and many-Core'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 41 dissertations / theses for your research on the topic 'Multi-Core and many-Core.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Kanellou, Eleni. "Data structures for current multi-core and future many-core architectures." Thesis, Rennes 1, 2015. http://www.theses.fr/2015REN1S171/document.

Full text
Abstract:
Though a majority of current processor architectures rely on shared, cache-coherent memory, current prototypes that integrate large numbers of cores connected through a message-passing substrate indicate that architectures of the near future may have these characteristics. Both tendencies require that processes execute in parallel, making concurrent programming a necessary tool. The inherent difficulty of reasoning about concurrency, however, may make the new processor architectures hard to program. In order to deal with such issues, we explore approaches for providing ease of programmability. We propose WFR-TM, an approach based on transactional memory (TM), a concurrent programming paradigm that employs transactions in order to synchronize access to shared data. A transaction may either commit, making its updates visible, or abort, discarding its updates. WFR-TM combines desirable characteristics of pessimistic and optimistic TM. In a pessimistic TM, no transaction ever aborts; however, in order to achieve this, existing TM algorithms employ locks to execute update transactions sequentially, decreasing the degree of achieved parallelism. Optimistic TMs execute all transactions concurrently but commit them only if they have encountered no conflict during their execution. WFR-TM provides read-only transactions that are wait-free, without ever executing expensive synchronization operations (such as CAS, LL/SC, etc.) or sacrificing parallelism between update transactions. We further present Dense, a concurrent graph implementation. Graphs are versatile data structures that allow the implementation of a variety of applications. However, multi-process applications that rely on graphs still largely use sequential implementations. We introduce an innovative concurrent graph model that provides addition and removal of any edge of the graph, as well as atomic traversals of a part (or the entirety) of the graph.
Dense achieves wait-freedom by relying on light-weight helping and provides the built-in capability of performing a partial snapshot on a dynamically determined subset of the graph. We finally aim at predicted future architectures. In the interest of code reuse and of a common paradigm, there is recent momentum towards porting software runtime environments, originally intended for shared-memory settings, onto non-cache-coherent machines. The JVM, the runtime environment of the high-productivity language Java, is a notable example. Concurrent data structure implementations are important components of the libraries that such environments incorporate. With the goal of contributing to this effort, we study general techniques for implementing distributed data structures, assuming that they have to run on many-core architectures that offer either partially cache-coherent memory or no cache coherence at all, and we present implementations of stacks, queues, and lists.
APA, Harvard, Vancouver, ISO, and other styles
2

Serpa, Matheus da Silva. "Source code optimizations to reduce multi core and many core performance bottlenecks." Biblioteca Digital de Teses e Dissertações da UFRGS, 2018. http://hdl.handle.net/10183/183139.

Full text
Abstract:
Nowadays, there are several different architectures available not only for industry but also for final consumers. Traditional multi-core processors, GPUs, accelerators such as the Xeon Phi, or even energy-efficiency-driven processors such as the ARM family present very different architectural characteristics. This wide range of characteristics presents a challenge for application developers, who must deal with different instruction sets, memory hierarchies, or even different programming paradigms when programming for these architectures. To optimize an application, it is important to have a deep understanding of how it behaves on different architectures. Related work offers a wide variety of solutions: most of it focuses on improving only memory performance, while other works address load balancing, vectorization, and thread and data mapping, but perform them separately, losing optimization opportunities. In this master's thesis, we propose several optimization techniques to improve the performance of a real-world seismic exploration application provided by Petrobras, a multinational corporation in the petroleum industry. In our experiments, we show that loop interchange is a useful technique to improve the performance of different cache memory levels, improving performance by up to 5.3× and 3.9× on the Intel Broadwell and Intel Knights Landing architectures, respectively. By changing the code to enable vectorization, performance was increased by up to 1.4× and 6.5×. Load balancing improved performance by up to 1.1× on Knights Landing. Thread and data mapping techniques were also evaluated, with a performance improvement of up to 1.6× and 4.4×. We also compared the best version on each architecture and show that we improved the performance of Broadwell by 22.7× and of Knights Landing by 56.7× compared to a naive version, but, in the end, Broadwell was 1.2× faster than Knights Landing.
APA, Harvard, Vancouver, ISO, and other styles
3

Martins, André Luís Del Mestre. "Multi-objective resource management for many-core systems." Pontifícia Universidade Católica do Rio Grande do Sul, 2018. http://tede2.pucrs.br/tede2/handle/tede/8096.

Full text
Abstract:
Many-core systems integrate several cores in a single die to provide high-performance computing for multiple market segments. The newest technology nodes introduce restricted power caps, resulting in the utilization wall (also known as dark silicon): on-chip power dissipation prevents the use of all resources at full performance simultaneously. The workload of many-core systems includes real-time (RT) applications, which bring application throughput as another constraint to meet. Also, dynamic workloads generate valleys and peaks of resource utilization over time. This scenario, complex high-performance systems subject to power and performance constraints, creates the need for multi-objective resource management (RM) able to dynamically adapt the system goals while respecting the constraints. Concerning RT applications, related works apply a design-time analysis of the expected workload to ensure throughput constraints. To overcome this limitation, design-time decisions, this Thesis proposes a hierarchical Runtime Energy Management (REM) for RT applications, the first work to link the execution of RT applications and RM under a power cap without design-time analysis of the application set. REM employs different mapping and DVFS (Dynamic Voltage and Frequency Scaling) heuristics for RT and non-RT tasks to save energy. Besides not considering RT applications, related works do not consider workload variation and propose single-objective RMs. To tackle this second limitation, single-objective RMs, this Thesis presents a hierarchical adaptive multi-objective resource management (MORM) for many-core systems under a power cap. MORM addresses dynamic workloads with peaks and valleys of resource utilization and can dynamically shift its goals to prioritize energy or performance according to the workload behavior. Both RMs (REM and MORM) are multi-objective approaches.
This Thesis employs the Observe-Decide-Act (ODA) paradigm as the design methodology to implement REM and MORM. Observing consists of characterizing the cores and integrating hardware monitors to provide accurate and fast power-related information for an efficient RM. Actuation configures the system actuators at runtime to enable the RMs to follow the multi-objective decisions. Decision corresponds to REM and MORM, which share the Observing and Actuation infrastructure. REM and MORM stand out from related works regarding scalability, comprehensiveness, and accurate power and energy estimation. Concerning REM, evaluations on many-core systems with up to 144 cores show energy savings from 15% to 28% while keeping timing violations below 2.5%. Regarding MORM, results show it can drive applications to dynamically follow distinct objectives. Compared to a state-of-the-art RM targeting performance, MORM speeds up the workload valley by 11.56% and the workload peak by up to 49%.
APA, Harvard, Vancouver, ISO, and other styles
4

Tekić, Jelena. "Оптимизација CFD симулације на групама вишејезгарних хетерогених архитектура." PhD thesis, Univerzitet u Novom Sadu, Prirodno-matematički fakultet u Novom Sadu, 2019. https://www.cris.uns.ac.rs/record.jsf?recordId=110976&source=NDLTD&language=en.

Full text
Abstract:
The case study of this dissertation belongs to the field of parallel programming: the implementation of the CFD (Computational Fluid Dynamics) method on several heterogeneous multi-core devices simultaneously. The thesis presents several algorithms aimed at accelerating CFD simulation on common computers. It has also been shown that the described solution achieves satisfactory performance on HPC devices (Tesla graphics cards). The simulation is built in a micro-service architecture that is portable and flexible and makes it easy to test CFD simulations on common computers.
APA, Harvard, Vancouver, ISO, and other styles
5

Singh, Ajeet. "GePSeA: A General-Purpose Software Acceleration Framework for Lightweight Task Offloading." Thesis, Virginia Tech, 2009. http://hdl.handle.net/10919/34264.

Full text
Abstract:
Hardware-acceleration techniques continue to be used to boost the performance of scientific codes. To do so, software developers identify portions of these codes that are amenable for offloading and map them to hardware accelerators. However, offloading such tasks to specialized hardware accelerators is non-trivial. Furthermore, these accelerators can add significant cost to a computing system.

Consequently, this thesis proposes a framework called GePSeA (General Purpose Software Acceleration Framework), which uses a small fraction of the computational power on multi-core architectures to offload complex application-specific tasks. Specifically, GePSeA provides a lightweight process that acts as a helper agent to the application by executing application-specific tasks asynchronously and efficiently. GePSeA is not meant to replace hardware accelerators but to extend them. GePSeA provides several utilities called core components that offload tasks onto a core, or onto special-purpose hardware when available, in a way that is transparent to the application. Examples of such core components include reliable communication service, distributed lock management, global memory management, dynamic load distribution, and network protocol processing. We then apply the GePSeA framework to two applications, namely mpiBLAST, an open-source computational biology application, and a Reliable Blast UDP (RBUDP)-based file-transfer application. We observe significant speed-ups for both applications.
Master of Science

APA, Harvard, Vancouver, ISO, and other styles
6

Singh, Kunal. "High-Performance Sparse Matrix-Multi Vector Multiplication on Multi-Core Architecture." The Ohio State University, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=osu1524089757826551.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Lo, Moustapha. "Application des architectures many core dans les systèmes embarqués temps réel." Thesis, Université Grenoble Alpes (ComUE), 2019. http://www.theses.fr/2019GREAM002/document.

Full text
Abstract:
Traditional single-core processors are no longer sufficient to meet the growing performance needs of the avionics domain. Multi-core and many-core processors have emerged in recent years, making it possible to integrate several functions and to benefit from the available performance per watt through resource sharing. However, not all multi-core and many-core processors satisfy avionic constraints: we prefer determinism over raw computing power, because certification of such processors depends on mastering that determinism. The aim of this thesis is to evaluate the Kalray many-core processor (MPPA-256) in an avionics context. We chose the maintenance function HMS (Health Monitoring System), which requires both substantial bandwidth and a bounded response time. In addition, this function has natural parallelism: it processes data from sensors that are functionally independent, so their processing can be parallelized across several cores. This study focuses on deploying the existing sequential HMS on a many-core processor, from data acquisition to the computation of the health indicators, with a strong emphasis on the input data flow. Our research led to five main contributions: • Transformation of the existing batch algorithms into incremental ones that can process data as soon as they arrive from the sensors. • Management of the input flow of vibration samples up to the computation of the health indicators: the availability of raw vibration data in the internal cluster, the moment at which they are consumed, and the estimation of the workload. • Minimally intrusive timing measurements taken directly on the MPPA-256 by adding timestamps to the data flow. • A software architecture, based on a three-stage pipeline, that respects real-time constraints even in the worst case. • Illustration of the limits of the existing function: our experiments showed that contextual parameters of the helicopter, such as the rotor speed, must be correlated with the health indicators to reduce false alarms.
APA, Harvard, Vancouver, ISO, and other styles
8

Lukarski, Dimitar [Verfasser]. "Parallel Sparse Linear Algebra for Multi-core and Many-core Platforms : Parallel Solvers and Preconditioners / Dimitar Lukarski." Karlsruhe : KIT-Bibliothek, 2012. http://d-nb.info/1020663480/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Júnior, Manoel Baptista da Silva. "Portabilidade com eficiência de trechos da dinâmica do modelo BRAMS entre arquiteturas multi-core e many-core." Instituto Nacional de Pesquisas Espaciais (INPE), 2015. http://urlib.net/sid.inpe.br/mtc-m21b/2015/04.28.19.21.

Full text
Abstract:
The continuous growth of spatial and temporal resolutions in current meteorological models demands increasing processing power. The prompt execution of these models requires supercomputers with hundreds or thousands of nodes. Currently, these models run in the operational environment of CPTEC on a supercomputer whose nodes have CPUs with tens of cores (multi-core). Newer supercomputer generations have nodes with CPUs coupled to processing accelerators, typically graphics cards (GPGPUs), containing hundreds of cores (many-core). Rewriting the model codes so that they use such nodes efficiently, with or without graphics cards (portable code), is a challenge. The OpenMP programming interface has been the established standard for decades for efficiently exploiting multi-core architectures. A new programming interface, OpenACC, was recently proposed to exploit many-core architectures. The two interfaces are similar, since both are based on parallelization directives for the concurrent execution of threads. This work shows the feasibility of writing a single code embedding both interfaces that achieves acceptable efficiency when executed on a node with a multi-core architecture or on a node with a many-core architecture. The code chosen as a case study is the advection of scalars, a part of the dynamics of the regional meteorological model BRAMS (Brazilian Regional Atmospheric Modeling System).
APA, Harvard, Vancouver, ISO, and other styles
10

Thucanakkenpalayam, Sundararajan Karthik. "Energy efficient cache architectures for single, multi and many core processors." Thesis, University of Edinburgh, 2013. http://hdl.handle.net/1842/9916.

Full text
Abstract:
With each technology generation we get more transistors per chip. Whilst processor frequencies have increased over the past few decades, memory speeds have not kept pace. Therefore, more and more transistors are devoted to on-chip caches to reduce latency to data and help achieve high performance. On-chip caches consume a significant fraction of the processor energy budget but need to deliver high performance. Therefore cache resources should be optimized to meet the requirements of the running applications. Fixed configuration caches are designed to deliver low average memory access times across a wide range of potential applications. However, this can lead to excessive energy consumption for applications that do not require the full capacity or associativity of the cache at all times. Furthermore, in systems where the clock period is constrained by the access times of level-1 caches, the clock frequency for all applications is effectively limited by the cache requirements of the most demanding phase within the most demanding application. This motivates the need for dynamic adaptation of cache configurations in order to optimize performance while minimizing energy consumption, on a per-application basis. First, this thesis proposes an energy-efficient cache architecture for a single core system, along with a run-time support framework for dynamic adaptation of cache size and associativity through the use of machine learning. The machine learning model, which is trained offline, profiles the application’s cache usage and then reconfigures the cache according to the program’s requirement. The proposed cache architecture has, on average, 18% better energy-delay product than the prior state-of-the-art cache architectures proposed in the literature. Next, this thesis proposes cooperative partitioning, an energy-efficient cache partitioning scheme for multi-core systems that share the Last Level Cache (LLC), with a core to LLC cache way ratio of 1:4. 
The proposed cache partitioning scheme uses small auxiliary tags to capture each core’s cache requirements, and partitions the LLC accordingly. The partitioning uses a way-aligned scheme that helps reduce both dynamic and static energy. This scheme, on average, offers 70% and 30% reductions in dynamic and static energy respectively, while maintaining high performance on par with state-of-the-art cache partitioning schemes. Finally, when the number of Last Level Cache (LLC) ways is equal to or less than the number of cores in many-core systems, cooperative partitioning cannot be used to partition the LLC. This thesis therefore proposes a region-aware cache partitioning scheme as an energy-efficient approach for many-core systems that share the LLC, with core to LLC way ratios of 1:2 and 1:1. The proposed partitioning, on average, offers 68% and 33% reductions in dynamic and static energy respectively, while again maintaining high performance on par with state-of-the-art LLC cache management techniques.
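The utility-based way-partitioning idea that such schemes build on can be sketched as a greedy allocator over per-core miss curves. This is a generic illustrative policy with made-up numbers, not the exact algorithm of the thesis:

```python
def partition_ways(miss_curves, total_ways):
    """Greedily assign LLC ways to cores by marginal miss reduction.

    miss_curves[c][w] = misses observed for core c when it owns w ways
    (as captured, e.g., by auxiliary tags). Returns a per-core way count.
    Illustrative utility-based partitioning, not the thesis's scheme.
    """
    n = len(miss_curves)
    alloc = [1] * n                     # every core keeps at least one way
    for _ in range(total_ways - n):
        # give the next way to the core with the largest marginal benefit
        best = max(range(n),
                   key=lambda c: miss_curves[c][alloc[c]] - miss_curves[c][alloc[c] + 1])
        alloc[best] += 1
    return alloc
```

With one cache-hungry core and one cache-insensitive core, the hungry core receives most of the ways while every core keeps a minimum share.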
APA, Harvard, Vancouver, ISO, and other styles
11

Baskaran, Muthu Manikandan. "Compile-time and Run-time Optimizations for Enhancing Locality and Parallelism on Multi-core and Many-core Systems." The Ohio State University, 2009. http://rave.ohiolink.edu/etdc/view?acc_num=osu1253557044.

Full text
APA, Harvard, Vancouver, ISO, and other styles
12

Zhang, Jing. "Transforming and Optimizing Irregular Applications for Parallel Architectures." Diss., Virginia Tech, 2018. http://hdl.handle.net/10919/82069.

Full text
Abstract:
Parallel architectures, including multi-core processors, many-core processors, and multi-node systems, have become commonplace, as it is no longer feasible to improve single-core performance through increasing its operating clock frequency. Furthermore, to keep up with the exponentially growing desire for more and more computational power, the number of cores/nodes in parallel architectures has continued to dramatically increase. On the other hand, many applications in well-established and emerging fields, such as bioinformatics, social network analysis, and graph processing, exhibit increasing irregularities in memory access, control flow, and communication patterns. While multiple techniques have been introduced into modern parallel architectures to tolerate these irregularities, many irregular applications still execute poorly on current parallel architectures, as their irregularities exceed the capabilities of these techniques. Therefore, it is critical to resolve irregularities in applications for parallel architectures. However, this is a very challenging task, as the irregularities are dynamic, and hence, unknown until runtime. To optimize irregular applications, many approaches have been proposed to improve data locality and reduce irregularities through computational and data transformations. However, there are two major drawbacks in these existing approaches that prevent them from achieving optimal performance. First, these approaches use local optimizations that exploit data locality and regularity locally within a loop or kernel. However, in many applications, there is hidden locality across loops or kernels. Second, these approaches use "one-size-fits-all" methods that treat all irregular patterns equally and resolve them with a single method. However, many irregular applications have complex irregularities, which are mixtures of different types of irregularities and need differentiated optimizations. 
To overcome these two drawbacks, we propose a general methodology that includes a taxonomy of irregularities to help us analyze the irregular patterns in an application, and a set of adaptive transformations to reorder data and computation based on the characteristics of the application and architecture. By extending our adaptive data-reordering transformation on a single node, we propose a data-partitioning framework to resolve the load imbalance problem of irregular applications on multi-node systems. Unlike existing frameworks, which use "one-size-fits-all" methods to partition the input data by a single property, our framework provides a set of operations to transform the input data by multiple properties and generates the desired data-partitioning codes by composing these operations into a workflow.
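The operation-composition idea behind such a data-partitioning framework can be illustrated with a toy workflow. The operation names and record layout below are invented for the sketch; they only show how transformations by multiple properties compose:

```python
def sort_by(prop):
    """Operation: reorder records by one property (e.g. vertex degree)."""
    return lambda items: sorted(items, key=lambda x: x[prop])

def chunk(n_parts):
    """Operation: split an ordered list into n_parts balanced partitions."""
    def op(items):
        k, r = divmod(len(items), n_parts)
        parts, i = [], 0
        for p in range(n_parts):
            size = k + (1 if p < r else 0)
            parts.append(items[i:i + size])
            i += size
        return parts
    return op

def compose(*ops):
    """Chain operations into a partitioning workflow."""
    def workflow(data):
        for op in ops:
            data = op(data)
        return data
    return workflow
```

A workflow such as `compose(sort_by('deg'), chunk(2))` first orders records by degree and then splits them into two balanced partitions, and further property-based operations could be inserted into the same chain.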
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
13

Gibson, Michael John. "Genetic programming and cellular automata for fast flood modelling on multi-core CPU and many-core GPU computers." Thesis, University of Exeter, 2015. http://hdl.handle.net/10871/20364.

Full text
Abstract:
Many complex systems in nature are governed by simple local interactions, although a number are also described by global interactions. For example, within the field of hydraulics the Navier-Stokes equations describe free-surface water flow, through the global preservation of water volume, momentum and energy. However, solving such partial differential equations (PDEs) is computationally expensive when applied to large 2D flow problems. An alternative, which reduces the computational complexity, is to use a local derivative to approximate the PDEs, such as finite difference methods, or Cellular Automata (CA). The high-speed processing of such simulations is important to modern scientific investigation, especially within urban flood modelling, as urban expansion continues to increase the number of impervious areas that need to be modelled. Large numbers of model runs or large spatial or temporal resolution simulations are required in order to investigate, for example, climate change, early warning systems, and sewer design optimisation. The recent introduction of the Graphics Processor Unit (GPU) as a general purpose computing device (General Purpose Graphical Processor Unit, GPGPU) allows this hardware to be used for the accelerated processing of such locally driven simulations. A novel CA transformation for use with GPUs is proposed here to make maximum use of the GPU hardware. CA models are defined by the local state transition rules, which are used in every cell in parallel, and provide an excellent platform for a comparative study of possible alternative state transition rules. Writing local state transition rules for CA systems is a difficult task for humans due to the number and complexity of possible interactions, and is known as the ‘inverse problem’ for CA. Therefore, the use of Genetic Programming (GP) algorithms for the automatic development of state transition rules from example data is also investigated in this thesis. 
GP is investigated as it is capable of searching the intractably large space of possible state transition rules and producing near optimal solutions. However, such population-based optimisation algorithms are limited by the cost of many repeated evaluations of the fitness function, which in this case requires the comparison of a CA simulation to given target data. Therefore, the use of GPGPU hardware for the accelerated learning of local rules is also developed. Speed-up factors of up to 50 times over serial Central Processing Unit (CPU) processing are achieved on simple CA, and of 5-10 times over the fully parallel CPU for the learning of urban flood modelling rules. Furthermore, it is shown that GP can generate rules which perform competitively when compared with human-formulated rules. This is achieved with generalisation to unseen terrains using similar input conditions and different spatial/temporal resolutions in this important application domain.
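A minimal example of the kind of local state transition rule a CA flood model applies in every cell: a toy 1-D rule with an invented flow fraction, not one of the human-formulated or GP-learned rules of the thesis:

```python
def ca_step(depth, terrain, frac=0.5):
    """One synchronous CA update on a 1-D grid: each cell pushes a fraction
    of the water-surface head difference toward lower neighbours.

    depth[i] and terrain[i] are water depth and ground elevation at cell i.
    Flows are computed from the pre-step state, so with several lower
    neighbours a cell could over-drain; the min() guard only caps each
    individual flow. Purely illustrative.
    """
    n = len(depth)
    new = depth[:]
    for i in range(n):
        for j in (i - 1, i + 1):
            if 0 <= j < n:
                head_i = terrain[i] + depth[i]
                head_j = terrain[j] + depth[j]
                if head_i > head_j:
                    flow = min(depth[i], frac * (head_i - head_j) / 2)
                    new[i] -= flow
                    new[j] += flow
    return new
```

Because every transfer is subtracted from one cell and added to another, total water volume is conserved, which is exactly the kind of property a fitness function can check when learning such rules.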
APA, Harvard, Vancouver, ISO, and other styles
14

Olsen, Daniel. "PERFORMANCE-AWARE RESOURCE MANAGEMENT OF MULTI-THREADED APPLICATIONS FOR MANY-CORE SYSTEMS." OpenSIUC, 2016. https://opensiuc.lib.siu.edu/theses/1975.

Full text
Abstract:
Future integrated systems will contain billions of transistors, composing tens to hundreds of IP cores. Modern computing platforms take advantage of this manufacturing technology advancement and are moving from Multi-Processor Systems-on-Chip (MPSoC) towards many-core architectures employing large numbers of processing cores. These hardware changes are also driven by application changes: the main characteristics of modern applications are increased parallelism and the need for data storage and transfer. Resource management is a key technology for the successful use of such many-core platforms, since thread-to-core mapping can deal with the run-time dynamics of applications and platforms. Efficient resource management thus enables efficient usage of the platform resources, maximizing platform utilization while minimizing interconnection network communication load and energy budget. In this thesis, we present a performance-aware resource management scheme for many-core architectures. In particular, the developed framework takes parallel applications as input and profiles them. Based on that profile information, a thread-to-core mapping algorithm finds (i) the appropriate number of threads that this application should have in order to maximize the utilization of the system and (ii) the best mapping for maximizing the performance of the application under the selected number of threads. To validate the proposed algorithm, we used and extended Sniper, a state-of-the-art many-core simulator. Finally, we developed a discrete-event simulator on top of Sniper in order to test and validate multiple scenarios faster. The results show that the proposed methodology achieves an average gain of 23% compared to a performance-oriented mapping, and each application completes its workload 18% faster on average.
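The thread-count selection step (i) can be sketched as a simple policy over a profiled speedup table; the marginal-gain threshold below is an invented illustrative value, not the criterion used in the thesis:

```python
def choose_thread_count(speedup, min_gain=0.05):
    """Pick a thread count from profile data.

    speedup[n] = profiled speedup of the application with n threads.
    Stop adding threads once the marginal gain of one more thread falls
    below min_gain, so the remaining cores stay free for other
    applications. Illustrative stand-in for the profiling-driven choice.
    """
    n = 1
    while (n + 1) in speedup and speedup[n + 1] / speedup[n] - 1 >= min_gain:
        n += 1
    return n
```

For a profile that saturates at three threads, the fourth thread's 2% gain is rejected and three threads are kept.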
APA, Harvard, Vancouver, ISO, and other styles
15

Martinez, Arroyo Gabriel Ernesto. "Cu2cl: a Cuda-To-Opencl Translator for Multi- and Many-Core Architectures." Thesis, Virginia Tech, 2011. http://hdl.handle.net/10919/34233.

Full text
Abstract:
The use of graphics processing units (GPUs) in high-performance parallel computing continues to steadily become more prevalent, often as part of a heterogeneous system. For years, CUDA has been the de facto programming environment for nearly all general-purpose GPU (GPGPU) applications. In spite of this, the framework is available only on NVIDIA GPUs, traditionally requiring reimplementation in other frameworks in order to utilize additional multi- or many-core devices. On the other hand, OpenCL provides an open and vendor-neutral programming environment and run-time system. With implementations available for CPUs, GPUs, and other types of accelerators, OpenCL therefore holds the promise of a ‘write once, run anywhere’ ecosystem for heterogeneous computing. Given the many similarities between CUDA and OpenCL, manually porting a CUDA application to OpenCL is almost straightforward, albeit tedious and error-prone. In response to this issue, we created CU2CL, an automated CUDA-to-OpenCL source-to-source translator that possesses a novel design and clever reuse of the Clang compiler framework. Currently, the CU2CL translator covers the primary constructs found in the CUDA Runtime API, and we have successfully translated several applications from the CUDA SDK and Rodinia benchmark suite. CU2CL’s translation times are reasonable, allowing for many applications to be translated at once. The number of manual changes required after executing our translator on CUDA source is minimal, with some compiling and working with no changes at all. The performance of our automatically translated applications via CU2CL is on par with their manually ported counterparts.
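To illustrate the flavour of such source-to-source rewriting: the toy mapper below substitutes a handful of well-known CUDA constructs with their OpenCL counterparts. CU2CL itself operates on the Clang AST, not on regular expressions, and real translation is context-sensitive (e.g. `cudaMemcpy` maps to a read or write buffer call depending on the copy direction):

```python
import re

# A few CUDA -> OpenCL correspondences, purely for illustration.
MAPPING = {
    r'\bcudaMalloc\b': 'clCreateBuffer',
    r'\bcudaMemcpy\b': 'clEnqueueWriteBuffer',   # direction-dependent in practice
    r'\bcudaFree\b': 'clReleaseMemObject',
    r'\b__global__\b': '__kernel',
    r'\bthreadIdx\.x\b': 'get_local_id(0)',
}

def translate(src):
    """Apply each textual substitution to the CUDA source string."""
    for pat, repl in MAPPING.items():
        src = re.sub(pat, repl, src)
    return src
```

Even this crude sketch shows why the manual port is "almost straightforward": most constructs have direct counterparts, and the hard part is the contextual bookkeeping a compiler-based translator automates.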
Master of Science
APA, Harvard, Vancouver, ISO, and other styles
16

Méndez, Real Maria. "Spatial Isolation against Logical Cache-based Side-Channel Attacks in Many-Core Architectures." Thesis, Lorient, 2017. http://www.theses.fr/2017LORIS454/document.

Full text
Abstract:
Technological evolution and ever-increasing application performance demands have made many-core architectures the new trend in processor design. These architectures are composed of a large number of processing resources (hundreds or more) providing massive parallelism and high performance. Indeed, many-core architectures allow a large number of applications, coming from different sources and with different levels of sensitivity and trust, to be executed in parallel while sharing physical resources such as computation, memory and the communication infrastructure. However, this resource sharing introduces important security vulnerabilities. In particular, sensitive applications sharing cache memory with potentially malicious applications are vulnerable to logical cache-based side-channel attacks. These attacks allow an unprivileged application to access sensitive information manipulated by other applications despite partitioning methods such as memory protection and virtualization. While considerable effort has gone into countering these attacks on multi-core architectures, the resulting countermeasures were not designed for recently emerged many-core architectures and need to be evaluated and/or revisited in order to be practical for these new technologies. In this thesis work, we propose to enhance the operating system services with security-aware application deployment and resource allocation mechanisms in order to protect sensitive applications against cache-based attacks. Different application deployment strategies allowing spatial isolation are proposed and compared in terms of several performance indicators. Our proposal is evaluated through virtual prototyping based on SystemC and Open Virtual Platforms (OVP) technology.
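The spatial-isolation idea can be sketched as a greedy cluster allocator that never co-locates sensitive and untrusted applications on the same cluster, and hence never lets them share that cluster's caches. This is an illustrative policy with invented inputs, not one of the deployment strategies evaluated in the thesis:

```python
def deploy(apps, clusters):
    """Map applications to clusters with spatial isolation.

    apps: list of (name, sensitive: bool); clusters: list of core capacities.
    A cluster only ever hosts one trust domain ('sensitive' or 'untrusted'),
    so a malicious application can never probe a victim's cache lines.
    Applications of the same trust domain may share a cluster.
    """
    placement = {}
    kind = [None] * len(clusters)       # trust domain currently on each cluster
    free = list(clusters)
    for name, sensitive in apps:
        want = 'sensitive' if sensitive else 'untrusted'
        for c in range(len(clusters)):
            if free[c] > 0 and kind[c] in (None, want):
                placement[name] = c
                kind[c] = want
                free[c] -= 1
                break
        else:
            raise RuntimeError("no isolated cluster available for " + name)
    return placement
```

The cost of such a policy is the performance indicator the thesis compares strategies on: reserving whole clusters per trust domain trades some utilization for isolation.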
APA, Harvard, Vancouver, ISO, and other styles
17

Yang, Simei. "Run-Time Management for Energy Efficiency of Cluster-Based Multi/Many Core Systems." Thesis, Nantes, 2020. http://www.theses.fr/2020NANT4004.

Full text
Abstract:
Cluster-based multi/many-core platforms represent promising solutions to deliver high computing performance and energy efficiency in modern embedded systems. These platforms often support per-cluster Dynamic Voltage/Frequency Scaling (DVFS), allowing different clusters to change their own v/f levels independently. The increasing application complexity and application dynamism on such platforms raise the need for run-time management. This dissertation focuses on the run-time management of applications on cluster-based multi/many-core systems to improve energy efficiency. Towards this purpose, this dissertation presents different management strategies that estimate the mutual influence between application mapping and cluster v/f configurations to respectively achieve local optimization within a cluster and global optimization in the overall system. The proposed management strategies can achieve near-optimal management solutions with less strategy complexity compared to state-of-the-art strategies. In addition, this dissertation presents a new modelling and simulation approach that allows the evaluation of run-time management strategies in multi/many-core systems to guarantee that system constraints are fully met. The proposed simulation approach is validated using an industrial modelling and simulation framework.
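A minimal per-cluster DVFS policy in this spirit picks, for each cluster, the lowest voltage/frequency level that still meets that cluster's load; the level table and loads below are invented for illustration, not the strategies of the dissertation:

```python
def pick_vf(levels, cluster_load):
    """Per-cluster v/f selection.

    levels: ascending list of (freq_GHz, power_W) operating points shared
    by all clusters; cluster_load: required compute rate (GHz) per cluster.
    Each cluster independently takes the cheapest level meeting its load,
    saturating at the highest level when demand exceeds capacity.
    """
    choice = []
    for load in cluster_load:
        for f, p in levels:
            if f >= load:
                choice.append((f, p))
                break
        else:
            choice.append(levels[-1])   # demand exceeds the top level
    return choice
```

The mutual influence the dissertation models arises because the mapping determines `cluster_load`, while the chosen levels in turn change which mapping is most energy-efficient.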
APA, Harvard, Vancouver, ISO, and other styles
18

Binotto, Alécio Pedro Delazari. "A dynamic scheduling runtime and tuning system for heterogeneous multi and many-core desktop platforms." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2011. http://hdl.handle.net/10183/34768.

Full text
Abstract:
A modern personal computer can now be considered a one-node heterogeneous cluster that simultaneously processes several applications’ tasks. It can be composed of asymmetric Processing Units (PUs), like the multi-core Central Processing Unit (CPU), the many-core Graphics Processing Units (GPUs) - which have become one of the main co-processors contributing to high performance computing - and other PUs. This way, a powerful heterogeneous execution platform is built on a desktop for data-intensive calculations. In the perspective of this thesis, to improve application performance and exploit such heterogeneity, the distribution of the workload over the PUs plays a key role in such systems. This issue presents challenges since the execution cost of a task on a PU is non-deterministic and can be affected by a number of parameters not known a priori, like the problem domain size and the precision of the solution, among others. Within this scope, this doctoral research introduces a context-aware runtime and performance tuning system based on a compromise between reducing the execution time of the applications - through appropriate dynamic scheduling of high-level tasks - and the cost of computing such scheduling on a platform composed of a CPU and GPUs. This approach combines a model for a first scheduling, based on an off-line task performance profile benchmark, with a runtime model that keeps track of the tasks’ real execution times and efficiently schedules new instances of the high-level tasks dynamically over the CPU/GPU execution platform. For that, a set of heuristics to schedule tasks over one CPU and one GPU is proposed, together with a generic and efficient scheduling strategy that considers several processing units. The proposed approach is applied in a case study using a CPU-GPU execution platform to compute iterative solvers for Systems of Linear Equations, using a stencil code specially designed to exploit the characteristics of modern GPUs. 
The solution uses the number of unknowns as the main parameter for the assignment decision. By scheduling tasks to both the CPU and the GPU, a performance gain of 21.77% is achieved in comparison to the static assignment of all tasks to the GPU (as done by current programming models, such as OpenCL and CUDA for Nvidia), with a scheduling error of only 0.25% compared to exhaustive search.
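The unknowns-based assignment decision can be sketched as a simple threshold rule; the threshold value below is invented, standing in for the profiled break-even point at which a system becomes large enough to amortize GPU dispatch and transfer costs:

```python
def assign(task_unknowns, threshold=10000):
    """Assign each solver task to 'cpu' or 'gpu' by its number of unknowns.

    Small linear systems run on the CPU (GPU setup cost dominates); large
    ones go to the GPU. The threshold is a hypothetical profiled value,
    not a figure from the thesis.
    """
    return ['gpu' if n >= threshold else 'cpu' for n in task_unknowns]
```

A runtime model as described above would keep refining such a threshold from the measured execution times of completed tasks instead of fixing it off-line.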
APA, Harvard, Vancouver, ISO, and other styles
19

Khizakanchery, Natarajan Surya Narayanan. "Modeling performance of serial and parallel sections of multi-threaded programs in many-core era." Thesis, Rennes 1, 2015. http://www.theses.fr/2015REN1S015/document.

Full text
Abstract:
This thesis work is done in the general context of the ERC-funded Defying Amdahl's Law (DAL) project, which aims at exploring the micro-architectural techniques that will enable high performance on future many-core processors. The project envisions that despite huge future investments in the development of parallel applications and in porting them to parallel architectures, most applications will still exhibit a significant amount of sequential code sections and, hence, we should still focus on improving the performance of the serial sections of the application. In this thesis, the research work primarily focuses on studying the difference between parallel and serial sections of existing multi-threaded (MT) programs and exploring the design space with respect to the processor core requirements of the serial and parallel sections in future many-cores, with the area-performance tradeoff as a primary goal.
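The motivation can be made concrete with Amdahl's law, which bounds the speedup of a program whose fraction s is inherently serial: speedup(n) = 1 / (s + (1 - s)/n). Even with unbounded cores, a 10% serial fraction caps the speedup at 10x, which is why the serial sections remain worth optimizing:

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Classic Amdahl's law: achievable speedup on n_cores when
    serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)
```

With `serial_fraction=0.1`, a billion cores still yield a speedup of barely under 10.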
APA, Harvard, Vancouver, ISO, and other styles
20

Laville, Guillaume. "Exécution efficace de systèmes Multi-Agents sur GPU." Thesis, Besançon, 2014. http://www.theses.fr/2014BESA2016/document.

Full text
Abstract:
These last years have seen the emergence of parallelism in many fields of computer science. This is explained by the stagnation of the frequency of execution units at the hardware level and by the increasing usage of parallel platforms at the software level. A form of parallelism is present in multi-agent systems, which facilitate the description of complex systems as a collection of interacting entities. If the similarity between this software parallelism and this logical parallelism seems obvious, the parallelization process remains difficult in this case because of the numerous dependencies encountered in many multi-agent systems. In this thesis, we propose a common solution to facilitate the adaptation of these models to a parallel platform such as GPUs. Our library, MCMAS, provides two programming interfaces to facilitate this adaptation: a low-level layer providing direct access to OpenCL, MCM, and a high-level set of plugins not requiring any GPU-related knowledge. We study the usage of this library on three existing multi-agent models: predator-prey, MIOR and Collembola. To prove the interest of the approach, we present a performance study for each model and an analysis of the various factors contributing to an efficient execution on GPUs. We finally conclude with an overview of the work and results presented in this thesis and suggest future directions to enhance our solution.
APA, Harvard, Vancouver, ISO, and other styles
21

Gupta, Vishakha. "Coordinated system level resource management for heterogeneous many-core platforms." Diss., Georgia Institute of Technology, 2011. http://hdl.handle.net/1853/42750.

Full text
Abstract:
A challenge posed by future computer architectures is the efficient exploitation of their many and sometimes heterogeneous computational cores. This challenge is exacerbated by the multiple facilities for data movement and sharing across cores resident on such platforms. To answer the question of how systems software should treat heterogeneous resources, this dissertation describes an approach that (1) creates a common manageable pool for all the resources present in the platform, and then (2) provides virtual machines (VMs) with multiple `personalities', flexibly mapped to and efficiently run on the heterogeneous underlying hardware. A VM's personality is its execution context on the different types of available processing resources usable by the VM. We provide mechanisms for making such platforms manageable and evaluate coordinated scheduling policies for mapping different VM personalities on heterogeneous hardware. Towards that end, this dissertation contributes technologies that include (1) restructuring hypervisor and system functions to create high performance environments that enable flexibility of execution and data sharing, (2) scheduling and other resource management infrastructure for supporting diverse application needs and heterogeneous platform characteristics, and (3) hypervisor level policies to permit efficient and coordinated resource usage and sharing. Experimental evaluations on multiple heterogeneous platforms, like one comprised of x86-based cores with attached NVIDIA accelerators and others with asymmetric elements on chip, demonstrate the utility of the approach and its ability to efficiently host diverse applications and resource management methods.
APA, Harvard, Vancouver, ISO, and other styles
22

Wang, Boqian. "High-Performance Network-on-Chip Design for Many-Core Processors." Licentiate thesis, KTH, Elektronik och inbyggda system, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-283517.

Full text
Abstract:
With the development of on-chip manufacturing technologies and the requirements of high-performance computing, the core count is growing quickly in Chip Multi/Many-core Processors (CMPs) and Multiprocessor Systems-on-Chip (MPSoCs) to support larger-scale parallel execution. The Network-on-Chip (NoC) has become the de facto solution for CMPs and MPSoCs in addressing the communication challenge. In this thesis, we tackle a few key problems facing high-performance NoC design. For general-purpose CMPs, we take a full-system perspective to design a high-performance NoC for multi-threaded programs. By examining cache coherence in the whole-system scenario, we present a smart communication service called Advance Virtual Channel Reservation (AVCR) that provides a highway to target packets, which can greatly reduce their contention delay in the NoC. AVCR takes advantage of the fact that we can know or predict the destination of some packets ahead of their arrival at the Network Interface (NI). Exploiting the time interval before a packet is ready, AVCR establishes an end-to-end highway from the source NI to the destination NI. This highway is built up by reserving Virtual Channel (VC) resources ahead of the target packet's transmission and offering priority service to flits in the reserved VC in the wormhole router, which avoids the target packets' VC allocation and switch arbitration delays. We also propose an admission control method for NoC with a centralized Artificial Neural Network (ANN) admission controller, which can improve system performance by predicting the most appropriate injection rate of each node from network performance information. In the online control process, a data preprocessing unit is applied to simplify the ANN architecture and make the prediction results more accurate. Based on the preprocessed information, the ANN predictor determines the control strategy and broadcasts it to each node, where admission control is applied.
For application-specific MPSoCs, we focus on developing a high-performance NoC and NI compatible with the common AMBA AXI4 interconnect protocol. To offer the possibility of utilizing AXI4-based processors and peripherals in the on-chip-network-based system, we propose a whole-system architecture solution that makes the AXI4 protocol compatible with the NoC-based communication interconnect in the many-core system. Because possible out-of-order transmission in the NoC interconnect conflicts with the ordering requirements specified by the AXI4 protocol, we first focus on the design of the transaction ordering units, realizing a high-performance, low-cost solution to the ordering requirements. The microarchitectures and functionalities of the transaction ordering units are described and explained in detail for ease of implementation. We then focus on the NI and on Quality of Service (QoS) support in the NoC. In our design, the NI makes the NoC architecture independent from the AXI4 protocol via message-format conversion between the AXI4 signal format and the packet format, offering high flexibility to the NoC design. The NoC-based communication architecture is designed to support multiple high-performance QoS schemes. The NoC system contains Time Division Multiplexing (TDM) and VC subnetworks to apply multiple QoS schemes to AXI4 signals with different QoS tags, and the NI is responsible for traffic distribution between the two subnetworks. In addition, a QoS inheritance mechanism is applied in the slave-side NI to support QoS during packets' round-trip transfer in the NoC.
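The abstract describes transaction ordering units that reconcile out-of-order NoC delivery with AXI4's in-order response requirement. As a rough illustration only (not the thesis microarchitecture; the class and method names are invented), such a unit can be modeled as a reorder buffer that holds early arrivals until their turn:

```python
import heapq

class OrderingUnit:
    # Toy model of a transaction ordering unit: the NoC may deliver
    # responses out of order, but AXI4 requires responses (per ID) in
    # issue order, so early arrivals are buffered until their turn.
    def __init__(self):
        self.next_seq = 0      # next sequence number to issue
        self.expected = 0      # next sequence number to release
        self.pending = []      # min-heap of early arrivals

    def issue(self):
        seq = self.next_seq
        self.next_seq += 1
        return seq

    def arrive(self, seq, payload):
        # Buffer the arrival, then release everything that is now
        # contiguous with the in-order release point.
        heapq.heappush(self.pending, (seq, payload))
        released = []
        while self.pending and self.pending[0][0] == self.expected:
            released.append(heapq.heappop(self.pending)[1])
            self.expected += 1
        return released

u = OrderingUnit()
ids = [u.issue() for _ in range(3)]       # issue transactions 0, 1, 2
assert u.arrive(1, "B") == []             # arrived early: buffered
assert u.arrive(0, "A") == ["A", "B"]     # in-order release resumes
assert u.arrive(2, "C") == ["C"]
```

A hardware unit would additionally bound the buffer depth and stall issue when it fills; the sketch only shows the reordering logic itself.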

APA, Harvard, Vancouver, ISO, and other styles
23

Dickman, Thomas J. "Event List Organization and Management on the Nodes of a Many-Core Beowulf Cluster." University of Cincinnati / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1378196499.

Full text
APA, Harvard, Vancouver, ISO, and other styles
24

Roussel, Adrien. "Parallélisation sur un moteur exécutif à base de tâches des méthodes itératives pour la résolution de systèmes linéaires creux sur architecture multi et many coeurs : application aux méthodes de types décomposition de domaines multi-niveaux." Thesis, Université Grenoble Alpes (ComUE), 2018. http://www.theses.fr/2018GREAM010/document.

Full text
Abstract:
Numerical methods in reservoir engineering simulations lead to the resolution of large, unstructured, sparse linear systems. The performance of the iterative methods employed in the simulator to solve these systems is crucial in order to consider many more scenarios. In this work, we present a way to implement efficient parallel iterative methods on top of a task-based runtime system. This simplifies the development of the methods while keeping control over parallelism management. We propose a linear algebra API that expresses task dependencies implicitly: the semantics are sequential while the parallelism is implicit. We have extended the HARTS runtime system to monitor executions in order to better exploit NUMA architectures. Moreover, we implement a scheduling policy that exploits data locality for task placement. We have extended the API for KNL many-core systems, taking into account the various memory banks available. This work has led to the optimization of the SpMV kernel, one of the most time-consuming operations in iterative methods. This work has been evaluated on iterative methods, and particularly on a domain decomposition method. We demonstrate that the API enables good performance on both multi-core and many-core architectures.
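The abstract's central idea, a linear-algebra API whose semantics are sequential while the parallelism is implicit, can be sketched as follows. This toy dependency tracker (all names are invented, not the thesis API) infers RAW/WAR/WAW edges from the read and write sets each task declares, while submission order stays sequential:

```python
# Toy model of a task-based runtime with implicit dependencies:
# tasks are submitted in sequential program order, and edges are
# inferred from the data each task reads and writes (RAW/WAR/WAW).
class TaskGraph:
    def __init__(self):
        self.tasks = []            # (name, fn)
        self.edges = []            # (producer_index, consumer_index)
        self.last_writer = {}      # data item -> task index
        self.readers = {}          # data item -> [task indices]

    def submit(self, name, fn, reads=(), writes=()):
        i = len(self.tasks)
        self.tasks.append((name, fn))
        for d in reads:            # RAW: depend on the last writer
            if d in self.last_writer:
                self.edges.append((self.last_writer[d], i))
        for d in writes:           # WAR/WAW: depend on readers and last writer
            for r in self.readers.get(d, []):
                self.edges.append((r, i))
            if d in self.last_writer:
                self.edges.append((self.last_writer[d], i))
            self.last_writer[d] = i
            self.readers[d] = []
        for d in reads:
            self.readers.setdefault(d, []).append(i)
        return i

    def run(self):
        # Sequential submission order is always a valid schedule; a
        # real runtime would run independent tasks in parallel instead.
        for _, fn in self.tasks:
            fn()

# Usage: y = A*x then alpha = dot(y, y), as in an iterative solver step.
state = {"x": 2.0, "y": 0.0, "alpha": 0.0}
g = TaskGraph()
def spmv(): state["y"] = 3.0 * state["x"]
def dot():  state["alpha"] = state["y"] * state["y"]
g.submit("spmv", spmv, reads=["x"], writes=["y"])
g.submit("dot",  dot,  reads=["y"], writes=["alpha"])
g.run()
```

The inferred edge (spmv, dot) is exactly what a runtime needs to know that the two tasks cannot overlap, without the programmer ever writing a synchronization primitive.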
APA, Harvard, Vancouver, ISO, and other styles
25

Vargas, Vallejo Vanessa Carolina. "Approche logicielle pour améliorer la fiabilité d’applications parallèles implémentées dans des processeurs multi-cœur et many-cœur." Thesis, Université Grenoble Alpes (ComUE), 2017. http://www.theses.fr/2017GREAT042/document.

Full text
Abstract:
The large computing capacity, great flexibility, low power consumption, intrinsic redundancy and high performance provided by multi/many-core processors make them ideal for overcoming the new challenges in computing systems. However, the degree of scale integration of these devices increases their sensitivity to the effects of natural radiation. Consequently, manufacturers and industrial and university partners are working together to improve their characteristics, which would allow their use in critical embedded systems. In this context, the work done throughout this thesis aims at evaluating the impact of SEEs on parallel applications running on multi-core and many-core processors, and at proposing a software approach, called N-MoRePar, to improve system reliability. The methodology used for the evaluation was based on multiple case studies. The different scenarios implemented consider a wide range of system configurations in terms of multi-processing mode, programming model, memory model, and resources used. For the experimentation, two COTS devices were selected: the Freescale PowerPC P2041 quad-core built in 45nm SOI technology, and the KALRAY MPPA-256 many-core processor built in 28nm CMOS technology. The case studies were evaluated through fault injection and neutron radiation. The obtained results serve as useful guidelines for developers choosing the most reliable system configuration according to their requirements. Furthermore, the evaluation results of the proposed N-MoRePar fault-tolerant approach, based on redundancy and partitioning criteria, support the use of COTS multi/many-core processors in highly dependable systems.
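N-MoRePar is described only through its redundancy and partitioning criteria; the underlying redundancy mechanism (N replicas of a computation, resolved by majority vote) can be illustrated with this generic sketch. The helper names are invented, and the replicas stand in for copies pinned to disjoint core partitions:

```python
from collections import Counter

def run_replicated(fn, arg, n=3):
    # Run n replicas of the same computation; on real hardware each
    # replica would run on a disjoint partition of cores, so a single
    # upset corrupts at most one replica's result.
    return [fn(arg) for _ in range(n)]

def majority_vote(results):
    # Accept the value returned by a strict majority of replicas;
    # a tie with no majority is a detected-but-uncorrectable error.
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority: detected but uncorrectable error")
    return value

# A transient fault flipping one replica's output is out-voted:
faulty = [42, 42, 41]        # replica 2 corrupted by an SEU
assert majority_vote(faulty) == 42
assert majority_vote(run_replicated(lambda v: v * v, 7)) == 49
```

The cost model is the usual one for N-modular redundancy: N times the compute resources buy masking of any single-replica error.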
APA, Harvard, Vancouver, ISO, and other styles
26

Brière, Alexandre. "Modélisation système d'une architecture d'interconnexion RF reconfigurable pour les many-cœurs." Thesis, Paris 6, 2017. http://www.theses.fr/2017PA066296/document.

Full text
Abstract:
The growing number of cores in a single chip goes along with an increase in communications. The variety of applications running on the chip causes spatial and temporal heterogeneity of communications. To address these issues, we present in this thesis a dynamically reconfigurable interconnect based on Radio Frequency (RF) for intra-chip communications. The use of RF increases the bandwidth while minimizing the latency. Dynamic reconfiguration of the interconnect makes it possible to handle the heterogeneity of communications. We present the rationale for choosing RF over optics and 3D, the detailed architecture of the network and of the chip implementing it, and the evaluation of its feasibility and performance. During the evaluation phase we were able to show that for a CMP of 1,024 tiles, our solution allows a performance gain of 13%. One advantage of this RF interconnect is the ability to broadcast without additional cost compared to point-to-point communications, opening new perspectives in terms of cache-coherence management.
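The broadcast advantage claimed in the abstract can be illustrated with a back-of-the-envelope hop-count comparison; the mesh model below is a textbook approximation, not data from the thesis:

```python
import math

def mesh_broadcast_hops(n_tiles):
    # On a k x k mesh with dimension-ordered routing, the farthest
    # tile is 2*(k-1) hops from a corner source, so even an ideal
    # broadcast tree needs at least that many hop traversals to
    # reach everyone; a naive broadcast is k*k - 1 unicasts.
    k = math.isqrt(n_tiles)
    return 2 * (k - 1)

def rf_broadcast_steps(n_tiles):
    # A shared RF medium reaches every listening tile in a single
    # transmission, independent of the tile count.
    return 1

# For the 1,024-tile CMP considered in the thesis (a 32 x 32 mesh):
assert mesh_broadcast_hops(1024) == 62
assert rf_broadcast_steps(1024) == 1
```

The gap widens with core count, which is why broadcast-heavy protocols such as cache-coherence invalidation are the natural beneficiaries cited in the abstract.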
APA, Harvard, Vancouver, ISO, and other styles
27

Müller, Thomas [Verfasser]. "Techniques for adapting Industrial Simulation Software for Power Devices and Networks to Multi- and Many-Core Architectures / Thomas Müller." München : Verlag Dr. Hut, 2014. http://d-nb.info/1052375227/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
28

Hofmann, Johannes [Verfasser]. "A First-Principles Approach to Performance, Power, and Energy Models for Contemporary Multi- and Many-Core Processors / Johannes Hofmann." München : Verlag Dr. Hut, 2019. http://d-nb.info/1198542780/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Ho, Minh Quan. "Optimisation de transfert de données pour les processeurs pluri-coeurs, appliqué à l'algèbre linéaire et aux calculs sur stencils." Thesis, Université Grenoble Alpes (ComUE), 2018. http://www.theses.fr/2018GREAM042/document.

Full text
Abstract:
The upcoming Exascale target in High Performance Computing (HPC) and disruptive achievements in artificial intelligence are driving the emergence of alternative, non-conventional many-core architectures, with the energy efficiency typical of embedded systems, that provide the same software ecosystem as classic HPC platforms. A key enabler of energy-efficient computing on many-core architectures is the exploitation of data locality, specifically the use of scratchpad memories in combination with DMA engines in order to overlap computation and communication. Such a software paradigm raises considerable programming challenges for both the vendor and the application developer. In this thesis, we tackle the memory-transfer and performance issues, as well as the programming challenges of memory- and compute-intensive HPC applications, on the Kalray MPPA many-core architecture. With the first, memory-bound use case of the lattice Boltzmann method (LBM), we provide generic and fundamental techniques for decomposing three-dimensional iterative stencil problems onto clustered many-core processors fitted with scratchpad memories and DMA engines. The developed DMA-based streaming and overlapping algorithm delivers a 33% performance gain over the default cache-based implementation. High-dimensional stencil computation suffers from a serious I/O bottleneck and limited on-chip memory space. We developed a new in-place LBM propagation algorithm, which halves the memory footprint and yields 1.5 times higher performance-per-byte efficiency than the state-of-the-art out-of-place algorithm. On the compute-intensive side, with dense linear algebra computations, we build an optimized matrix multiplication benchmark based on the exploitation of scratchpad memory and efficient asynchronous DMA communication. These techniques are then extended to a DMA module of the BLIS framework, which allows us to instantiate an optimized and portable level-3 BLAS numerical library on any DMA-based architecture in less than 100 lines of code. We achieve 75% of peak performance on the MPPA processor with the matrix multiplication operation (GEMM) from the standard BLAS library, without having to write thousands of lines of laboriously optimized code for the same result.
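The DMA-based streaming and overlapping algorithm is not detailed in the abstract; the classic double-buffering pattern it builds on (fetch tile i+1 while computing on tile i) can be modeled as follows. Names are invented, and the synchronous fetch() stands in for an asynchronous DMA engine, so only the buffer rotation is modeled, not the actual overlap:

```python
def stream_compute(tiles, compute, fetch):
    # Double buffering: while the compute kernel works on the tile
    # held in one scratchpad buffer, the "DMA" fetches the next tile
    # into the other buffer, so transfer latency hides behind compute.
    results = []
    buffers = [None, None]
    buffers[0] = fetch(tiles[0])           # prime the pipeline
    for i in range(len(tiles)):
        cur = buffers[i % 2]
        if i + 1 < len(tiles):             # start the "next" transfer
            buffers[(i + 1) % 2] = fetch(tiles[i + 1])
        results.append(compute(cur))       # compute overlaps transfer
    return results

# Toy use: each "tile" is a list of cells; compute sums the tile.
tiles = [[1, 2], [3, 4], [5, 6]]
out = stream_compute(tiles, compute=sum, fetch=list)
assert out == [3, 7, 11]
```

On real hardware the fetch would be an asynchronous DMA put/get with a completion wait before `compute(cur)`; the two scratchpad buffers are the price paid for hiding the transfer.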
APA, Harvard, Vancouver, ISO, and other styles
30

Ramos, Vargas Pablo Francisco. "Evaluation de la sensibilité face aux SEE et méthodologie pour la prédiction de taux d’erreurs d’applications implémentées dans des processeurs Multi-cœur et Many-cœur." Thesis, Université Grenoble Alpes (ComUE), 2017. http://www.theses.fr/2017GREAT022/document.

Full text
Abstract:
The present thesis aims at evaluating the SEE static and dynamic sensitivity of three different COTS multi-core and many-core processors. The first one is the Freescale P2041 multi-core processor, manufactured in 45nm SOI technology, which implements ECC and parity in its cache memories. The second one is the Kalray MPPA-256 many-core processor, manufactured in 28nm TSMC CMOS technology, which integrates 16 compute clusters of 16 processor cores each and implements ECC in its static memories and parity in its cache memories. The third one is the Adapteva Epiphany E16G301 microprocessor, manufactured in a 65nm CMOS process, which integrates 16 processor cores and does not implement protection mechanisms. The evaluation was accomplished through radiation experiments with 14 MeV neutrons in particle accelerators, to emulate a harsh radiation environment, and through fault injection in cache memories, shared memories and processor registers, to simulate the consequences of SEUs on program execution. A deep analysis of the observed errors was carried out to identify vulnerabilities in the protection mechanisms. Critical zones such as address tags and general-purpose registers were affected during the radiation experiments. In addition, the Code Emulating Upset (CEU) approach developed at TIMA Laboratory was extended to multi-core and many-core processors for predicting the application error rate by combining the results issued from fault-injection campaigns with those coming from radiation experiments.
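The CEU-style combination of the two campaigns is commonly presented as a product: the radiation test supplies the raw upset rate (static cross-section times particle flux), and the fault-injection campaign supplies the probability that one upset actually corrupts the application. The sketch below assumes that multiplicative form and uses made-up numbers, not results from the thesis:

```python
def predicted_error_rate(static_cross_section_cm2, flux_n_cm2_s,
                         injected_faults, observed_app_errors):
    # Assumed CEU-style prediction: raw upset rate from the radiation
    # test, scaled by the injection-derived application error
    # probability, yields the dynamic application error rate.
    upset_rate = static_cross_section_cm2 * flux_n_cm2_s   # upsets / s
    error_probability = observed_app_errors / injected_faults
    return upset_rate * error_probability                  # app errors / s

# Illustrative, made-up numbers: 1e-9 cm^2 static cross-section,
# accelerated flux of 1e5 n/cm^2/s, 10,000 injections, 250 app errors.
rate = predicted_error_rate(1e-9, 1e5, 10_000, 250)
assert abs(rate - 2.5e-6) < 1e-12
```

The appeal of the split is practical: the expensive beam time measures only the device's raw sensitivity, while the application-dependent factor is obtained cheaply by simulation-based fault injection.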
APA, Harvard, Vancouver, ISO, and other styles
31

Sarrazin, Guillaume. "Simulation fonctionnelle native pour des systèmes many-cœurs." Thesis, Université Grenoble Alpes (ComUE), 2016. http://www.theses.fr/2016GREAM015/document.

Full text
Abstract:
The number of transistors in one chip is increasing, following Moore's conjecture that the number of transistors per chip doubles every two years. Current systems are so complex that chip design and specific software development for one chip take too much time, even if software development is done in parallel with the design of the hardware architecture, often because of system-integration issues. To help reduce this time, the general solution consists of using virtual platforms to reproduce the behavior of the target chip. The simulation speed of these platforms is a major issue, especially for many-core systems in which the number of programmable cores is very high. In this thesis we focus on native simulation, whose principle is to compile source code directly for the host architecture to allow very fast simulation, at the cost of requiring "equivalent" features on the target and host cores. However, some target-core-specific features can be missing from the host core. Hardware-Assisted Virtualization (HAV) is used to ease native simulation, but it reinforces the dependency of the target-chip simulation on the host core's capabilities. In this context, we propose a solution to simulate the target core's specific functional features in HAV-based native simulation. Among target-core features, the floating-point unit is an important element that is often neglected in native simulation, leading to potential functional differences between target and host computation results. We restrict our study to the compiled simulation technique and propose a methodology for accurately simulating floating-point computations while still keeping a good simulation speed. Finally, native simulation has a scalability issue: time-decoupling problems generate unnecessary code simulation during synchronization protocols between threads executed on the target cores, leading to an important decrease in simulation speed when the number of cores grows. We address this problem and propose solutions that allow better scalability for native simulation.
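The floating-point issue mentioned in the abstract can be made concrete: naively running a binary32 target's arithmetic with the host's binary64 operations produces different results, so a faithful native simulator must round back to single precision after each target operation. A minimal illustration of the discrepancy (not the thesis's methodology):

```python
import struct

def to_f32(x):
    # Round a Python float (binary64) to the nearest binary32 value,
    # as a target core with a single-precision FPU would produce.
    return struct.unpack("f", struct.pack("f", x))[0]

def f32_add(a, b):
    # Emulated target addition: round inputs and result to binary32.
    return to_f32(to_f32(a) + to_f32(b))

# Host-native double precision and emulated single precision disagree:
a, b = 1.0, 1e-9
host = a + b                  # binary64: the tiny addend survives
target = f32_add(a, b)        # binary32: it is rounded away
assert host != 1.0
assert target == 1.0
```

Any iterative numerical code amplifies such per-operation differences, which is why a natively simulated target can diverge functionally from real silicon if the FPU is not emulated.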
APA, Harvard, Vancouver, ISO, and other styles
32

Müller, Thomas [Verfasser], Arndt [Akademischer Betreuer] Bode, Hans-Joachim [Akademischer Betreuer] Bungartz, and Carsten [Akademischer Betreuer] Trinitis. "Techniques for adapting Industrial Simulation Software for Power Devices and Networks to Multi- and Many-Core Architectures / Thomas Müller. Gutachter: Hans-Joachim Bungartz ; Arndt Bode ; Carsten Trinitis. Betreuer: Arndt Bode." München : Universitätsbibliothek der TU München, 2014. http://d-nb.info/1051078245/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Hashmi, Jahanzeb Maqbool. "Designing High Performance Shared-Address-Space and Adaptive Communication Middlewares for Next-Generation HPC Systems." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1588038721555713.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Ye, Fan. "Nouveaux algorithmes numériques pour l’utilisation efficace des architectures multi-cœurs et hétérogènes." Thesis, Lille 1, 2015. http://www.theses.fr/2015LIL10169/document.

Full text
Abstract:
This study is motivated by real computational needs in reactor physics. Our objective is to design parallel algorithms, including efficient linear algebra kernels and parallel numerical methods. In a shared-memory many-core environment such as the Intel Many Integrated Core (MIC) system, efficient parallelization of algorithms is obtained through fine-grained task parallelism and data parallelism. For task scheduling, two main strategies, work-sharing and work-stealing, were studied. For generality and reusability, we use standard parallel programming interfaces such as OpenMP, Cilk/Cilk+, and TBB. To vectorize tasks, the available tools include Cilk+ array notation, SIMD pragmas, and intrinsic functions. We evaluated these techniques and propose an efficient dense matrix-vector multiplication kernel. To handle a more complex situation, we propose a hybrid MPI/OpenMP model for implementing a sparse matrix-vector multiplication kernel. We also designed a performance model to characterize performance on MICs and thus guide optimization. Regarding the solution of linear systems, we propose a scalable parallel solver derived from Monte Carlo methods. This method exhibits abundant parallelism, which suits many-core architectures well. To address some of the fundamental bottlenecks of this solver, we propose a task-based execution model that completely resolves them.
This study is driven by real computational needs coming from different fields of reactor physics, such as neutronics or thermal hydraulics, where the eigenvalue problem and the resolution of linear systems are the key challenges that consume substantial computing resources. In this context, our objective is to design and improve parallel computing techniques, including proposing efficient linear algebra kernels and parallel numerical methods. In a shared-memory environment such as the Intel Many Integrated Core (MIC) system, the parallelization of an algorithm is achieved in terms of fine-grained task parallelism and data parallelism. For scheduling the tasks, two main policies, work-sharing and work-stealing, were studied. For the purpose of generality and reusability, we use common parallel programming interfaces, such as OpenMP, Cilk/Cilk+, and TBB. For vectorizing tasks, the available tools include Cilk+ array notation, SIMD pragmas, and intrinsic functions. We evaluated these techniques and propose an efficient dense matrix-vector multiplication kernel. In order to tackle a more complicated situation, we propose to use a hybrid MPI/OpenMP model for implementing sparse matrix-vector multiplication. We also designed a performance model for characterizing performance issues on MIC and guiding the optimization. As for solving the linear system, we derived a scalable parallel solver from the Monte Carlo method. Such a method exhibits inherently abundant parallelism, which is a good fit for many-core architectures. To address some of the fundamental bottlenecks of this solver, we propose a task-based execution model that completely resolves them.
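The Monte Carlo linear solver mentioned above can be sketched in a few lines (an illustrative Neumann-series random-walk estimator, not the thesis's implementation; the matrix and walk parameters below are invented). To solve x = H x + b with the spectral radius of H below 1, each random walk accumulates b-terms weighted by products of H entries; averaging many independent walks estimates one solution component, and the independence of the walks is the source of the abundant parallelism.

```python
import random

def mc_component(H, b, i, walks=20000, max_len=50, seed=0):
    """Estimate component i of the solution of x = H x + b by random walks.

    Each step jumps to a uniformly chosen column j and multiplies the walk
    weight by n * H[k][j]; the running sum is an unbiased estimate of the
    Neumann series sum_m (H^m b)_i, truncated at max_len steps.
    """
    rng = random.Random(seed)
    n = len(b)
    total = 0.0
    for _ in range(walks):
        k, w, est = i, 1.0, b[i]
        for _ in range(max_len):
            j = rng.randrange(n)
            w *= n * H[k][j]
            if w == 0.0:
                break  # the walk died on a zero entry
            k = j
            est += w * b[k]
        total += est
    return total / walks
```

For the circulant example `H = [[0.1, 0.2, 0], [0, 0.1, 0.2], [0.2, 0, 0.1]]` with `b = [1, 1, 1]`, every component of the exact solution is 1/0.7 ≈ 1.4286, and the estimator converges there as the number of walks grows.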
APA, Harvard, Vancouver, ISO, and other styles
35

Perret, Quentin. "Exécution prédictible sur processeurs pluri-coeurs." Thesis, Toulouse, ISAE, 2017. http://www.theses.fr/2017ESAE0007/document.

Full text
Abstract:
In this thesis, we study how well the distributed architecture of many-core processors matches the needs of designers of avionic real-time systems. We first propose a detailed analysis of a commercial off-the-shelf (COTS) processor, the KALRAY MPPA®-256, and identify some of its shared resources as the bottlenecks limiting both performance and predictability when several applications execute concurrently. To limit the impact of these resources on WCETs, we formally define an execution model that temporally isolates concurrent applications. It is implemented in a hypervisor that provides each application with an isolated execution environment and enforces the expected behaviors at run-time. On this basis, we formalize the notion of a partition as the association of an application with a budget of hardware resources. In our approach, applications executing within a partition are guaranteed to be temporally isolated from other applications. Thus, given an application and its associated budget, we propose to use constraint programming to automatically verify whether the resources allocated to the application are sufficient for it to execute satisfactorily. At the same time, when a budget is indeed valid, our approach provides a complete schedule and placement of the application on the subset of processor resources allocated to its partition.
In this thesis, we study the suitability of the distributed architecture of many-core processors for the design of highly constrained real-time systems, as is the case in avionics. We first propose a thorough analysis of an existing COTS processor, namely the KALRAY MPPA®-256, and we identify some of its shared resources as paths of interference when shared among several applications. We provide an execution model to restrict access to these resources in order to mitigate their impact on WCETs and to temporally isolate co-running applications. We describe in detail how such an execution model can be implemented with a hypervisor, which practically provides the expected property of temporal isolation at run-time. Based on this, we formalize a notion of partition, which represents the association of an application with a resource budget. In our approach, an application placed in a partition is guaranteed to be temporally isolated from applications placed in other partitions. Then, assuming that applications and resource budgets are given, we propose to use constraint programming in order to verify automatically whether the amount of resources requested by a budget is sufficient to meet all of the application's constraints. Simultaneously, when a budget is valid, our approach computes a schedule of the application on the subset of the processor's resources allocated to it.
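The budget-validation idea can be illustrated with a toy schedulability check (a greedy list scheduler, far simpler than the thesis's constraint-programming formulation; the task set and numbers below are invented): given task durations, precedence constraints, a core budget, and a deadline, it answers whether the tasks demonstrably fit.

```python
def budget_ok(durations, deps, n_cores, deadline):
    """Greedy list scheduling on n_cores; tasks must be listed in a
    topological order of `deps`.  Returns (fits, makespan, start_times).
    A greedy schedule that fits proves the budget sufficient; a greedy
    failure is only heuristic, unlike an exact CP answer."""
    core_free = [0.0] * n_cores   # earliest time each core is idle
    finish, start = {}, {}
    for t, d in durations.items():
        ready = max((finish[p] for p in deps.get(t, [])), default=0.0)
        c = min(range(n_cores), key=lambda i: core_free[i])
        start[t] = max(ready, core_free[c])
        finish[t] = start[t] + d
        core_free[c] = finish[t]
    makespan = max(finish.values())
    return makespan <= deadline, makespan, start
```

With tasks `{'a': 2, 'b': 3, 'c': 2}` where `c` depends on `a` and `b`, a two-core budget meets a deadline of 5 (makespan 5), while a one-core budget needs 7 and is rejected.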
APA, Harvard, Vancouver, ISO, and other styles
36

Porada, Katarzyna. "Contribution à la parallélisation automatique : un modèle de processeur à beaucoup de coeurs parallélisant." Thesis, Perpignan, 2017. http://www.theses.fr/2017PERP0063.

Full text
Abstract:
Since the first computers, we have been in search of faster, more powerful, higher-performing machines. Having exhausted the gains from frequency scaling, manufacturers turned to multi-cores. The current computational model rests on OS threads, exploited through various languages offering parallel constructs. However, multithreaded programming remains a delicate art, because parallel computation split into threads suffers from a major flaw: it is non-deterministic. Yet deterministic parallel computation is possible, provided the thread model is replaced by a model based on the partial order of dependencies. In this thesis, we propose an alternative architectural model that exploits the instruction-level parallelism (ILP) present in programs. We propose numerous techniques to eliminate most architectural dependencies and thereby obtain an ILP that grows with the size of the execution. The ILP reached this way is sufficient to feed several thousand cores. With the serializing architectural dependencies removed, ILP can be exploited far better than in current architectures. RTL-level VHDL code for the architecture was developed to measure its advantages. Synthesis results for a processor ranging from 2 to 64 cores show that the speed of the proposed hardware remains constant and that its area varies linearly with the number of cores. This proves that the proposed interconnect model is scalable.
The pursuit of faster and more powerful machines started with the first computers. After exhausting the increase of the frequency, manufacturers turned to another solution and started to introduce multiple cores on a chip. The computational model is today based on OS threads, exploited through different languages offering parallel constructions. However, parallel programming remains an art, because thread management by the operating system is not deterministic. Nonetheless, it is possible to compute in a parallel, deterministic way if we replace the thread model by a model built on the partial order of dependencies. In this thesis, we present an alternative architectural model exploiting the Instruction Level Parallelism (ILP) naturally present in applications. We propose many techniques to remove most of the architectural dependencies, which leads to an ILP that increases with the execution length. The ILP reached this way is enough to feed thousands of cores. Eliminating the architectural dependencies that serialize the run allows the ILP to be exploited better than in current microarchitectures. VHDL code at the RTL level has been implemented to measure the benefits of our design. The results of the synthesis of a processor ranging from 2 to 64 cores are reported. They show that the speed of the proposed hardware remains constant and that its area grows linearly with the number of cores: our interconnect solution is scalable.
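The ILP the thesis measures can be approximated on a dependency graph with a short sketch (a hypothetical example with unit instruction latencies): ILP is the instruction count divided by the length of the longest dependency chain, so removing serializing dependencies grows the available parallelism without lengthening the critical path.

```python
def ilp(deps):
    """deps maps each instruction to the list of instructions it depends on.
    With unit latencies, ILP = |instructions| / critical-path length."""
    depth = {}
    def d(i):
        if i not in depth:
            depth[i] = 1 + max((d(p) for p in deps[i]), default=0)
        return depth[i]
    return len(deps) / max(d(i) for i in deps)
```

A fully serial chain of four instructions yields an ILP of 1.0; four independent instructions yield 4.0; mixed graphs fall in between.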
APA, Harvard, Vancouver, ISO, and other styles
37

Gallet, Camille. "Étude de transformations et d’optimisations de code parallèle statique ou dynamique pour architecture "many-core"." Thesis, Paris 6, 2016. http://www.theses.fr/2016PA066747/document.

Full text
Abstract:
The evolution of supercomputers, from their origins in the 1960s to the present day, has gone through three revolutions: (i) the arrival of transistors to replace triodes, (ii) the appearance of vector computing, and (iii) organization into clusters. The latter are currently composed of standard processors whose computing power has grown through higher frequencies, the multiplication of cores on the chip, and wider computing units (SIMD instruction sets). A recent example combining a large number of cores and wide (512-bit) vector units is the Intel Xeon Phi co-processor. To maximize computing performance on these chips by best exploiting the SIMD instructions, the bodies of loop nests must be reorganized while taking irregular aspects (control flow and data flow) into account. To this end, this thesis proposes to extend the transformation named Deep Jam to extract regularity from irregular code and thus ease vectorization. This document presents our extension and its application to a multi-material hydrodynamics mini-application, HydroMM. This work shows that a significant performance gain can be obtained on irregular codes.
From the 1960s to the present, the evolution of supercomputers has faced three revolutions: (i) the arrival of transistors to replace triodes, (ii) the appearance of vector computing, and (iii) clusters. The latter currently consist of standard processors that have benefited from increased computing power via higher frequencies, the proliferation of cores on the chip, and the expansion of computing units (SIMD instruction sets). A recent example involving a large number of cores and wide (512-bit) vector units is the Intel Xeon Phi co-processor. To maximize computing performance on these chips by better exploiting these SIMD instructions, it is necessary to reorganize the bodies of loop nests, taking irregular aspects (control flow and data flow) into account. To this end, this thesis proposes to extend the transformation named Deep Jam to extract regularity from irregular code and facilitate vectorization. This thesis presents our extension and its application to a multi-material hydrodynamics mini-application, HydroMM. These studies show that it is possible to achieve a significant performance gain on irregular codes.
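The kind of regularity extraction that motivates Deep Jam can be hinted at with a toy transformation (illustrative only, not the actual Deep Jam algorithm; the functions below are made up): iterations of an irregular loop are grouped by branch outcome so that each group executes one uniform, vectorizable kernel, and results are scattered back in original order.

```python
def regularize(data, predicate, f_true, f_false):
    """Split iterations by branch outcome so each group runs a uniform
    (hence SIMD-friendly) kernel, then restore the original order."""
    idx_t = [i for i, x in enumerate(data) if predicate(x)]
    idx_f = [i for i, x in enumerate(data) if not predicate(x)]
    out = [None] * len(data)
    for i in idx_t:          # uniform kernel 1: no branch inside the loop
        out[i] = f_true(data[i])
    for i in idx_f:          # uniform kernel 2
        out[i] = f_false(data[i])
    return out
```

Each inner loop now has a single control-flow path, which is the property a vectorizer needs; the cost is the index bookkeeping, which must be amortized by the SIMD speedup.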
APA, Harvard, Vancouver, ISO, and other styles
38

Rihani, Hamza. "Analyse temporelle des systèmes temps-réels sur architectures pluri-coeurs." Thesis, Université Grenoble Alpes (ComUE), 2017. http://www.theses.fr/2017GREAM074/document.

Full text
Abstract:
Predictability is an important aspect of safety-critical real-time systems. Guaranteeing the functionality of these systems requires taking timing constraints into account. Traditional single-core architectures are no longer sufficient to meet the growing performance needs of these systems. New multi-core architectures are designed to offer more performance, but they introduce other challenges. In this thesis, we focus on the problem of access to shared resources in a multi-core environment. The first part of this work proposes an approach that models programs with Satisfiability Modulo Theories (SMT) formulas. An SMT solver is used to find an execution path that maximizes the execution time. As the shared resource, we consider a bus using a Time Division Multiple Access (TDMA) policy. We explain how the semantics of the analyzed program and the shared bus can be modeled in SMT. Experimental results show better precision compared to simple, pessimistic approaches. In the second part, we propose a response-time analysis of synchronous data-flow programs executing on a many-core processor. Our approach computes the set of release dates and response times while respecting the dependency constraints between tasks. This work is applied to the industrial Kalray MPPA-256 many-core processor. We propose a mathematical model of the bus arbiter implemented on this processor. Moreover, the bus interference analysis is refined by taking into account: (i) the response times and release dates of concurrent tasks, (ii) the execution model, (iii) memory banks, and (iv) the pipelining of memory accesses. The experimental evaluation is carried out on randomly generated examples and on a flight-controller case study.
Predictability is of paramount importance in real-time and safety-critical systems, where non-functional properties (such as the timing behavior) have a high impact on the system's correctness. As many safety-critical systems have a growing performance demand, classical architectures, such as single-cores, are not sufficient anymore. One increasingly popular solution is the use of multi-core systems, even in the real-time domain. Recent many-core architectures, such as the Kalray MPPA, were designed to take advantage of the performance benefits of a multi-core architecture while offering certain predictability. It is still hard, however, to predict the execution time due to interferences on shared resources (e.g., bus, memory, etc.). To tackle this challenge, Time Division Multiple Access (TDMA) buses are often advocated. In the first part of this thesis, we are interested in the timing analysis of accesses to shared resources in such environments. Our approach uses Satisfiability Modulo Theories (SMT) to encode the semantics and the execution time of the analyzed program. To estimate the delays of shared resource accesses, we propose an SMT model of a shared TDMA bus. An SMT solver is used to find a solution that corresponds to the execution path with the maximal execution time. Using examples, we show how the worst-case execution time estimation is enhanced by combining the semantics and the shared bus analysis in SMT. In the second part, we introduce a response time analysis technique for synchronous data-flow programs. These are mapped to multiple parallel dependent tasks running on a compute cluster of the Kalray MPPA-256 many-core processor. The analysis we devise computes a set of response times and release dates that respect the constraints in the task dependency graph. We derive a mathematical model of the multi-level bus arbitration policy used by the MPPA.
Further, we refine the analysis to account for (i) release dates and response times of co-runners, (ii) task execution models, (iii) use of memory banks, and (iv) pipelining of memory accesses. Further improvements to the precision of the analysis were achieved by considering, in the interference analysis, only accesses that block the emitting core. Our experimental evaluation focuses on randomly generated benchmarks and an avionics case study.
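The TDMA bus behind the first part can be captured by a tiny timing model (a generic sketch, not the thesis's SMT encoding; the slot layout below is invented): the delay an access suffers is the distance from the current position in the TDMA round to the start of the requesting core's next slot.

```python
def tdma_wait(t, my_slots, slot_len, n_slots):
    """Time a core must wait at instant t before its next TDMA slot begins
    (0 if t already falls inside one of its own slots)."""
    period = slot_len * n_slots
    pos = t % period
    best = None
    for s in my_slots:
        start, end = s * slot_len, (s + 1) * slot_len
        if start <= pos < end:
            return 0                      # already in our slot
        wait = (start - pos) % period     # distance to next occurrence
        best = wait if best is None else min(best, wait)
    return best
```

With one 10-cycle slot (index 1) in a 4-slot round, the worst-case wait is 30 cycles, reached just as the core's slot closes; an SMT encoding of the program can use exactly this function of the access instant to bound each access delay.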
APA, Harvard, Vancouver, ISO, and other styles
39

Zhao, Jia. "On-chip monitoring infrastructures and strategies for multi-core and many-core systems." 2012. https://scholarworks.umass.edu/dissertations/AAI3518401.

Full text
Abstract:
On-chip sensors are widely used in processors to closely monitor system temperature, performance, and supply power fluctuation, among other environmental conditions. As the number of cores integrated in a single die increases, the total number of on-chip sensors increases correspondingly. Information from these sensors needs to be collected and processed efficiently and effectively at run-time to achieve high performance and low power consumption at the system level. In this dissertation, dedicated infrastructures for sensor data collection, sensor measurement calibration, and the use of sensor information to overcome thermal emergencies, voltage droops, and soft errors are examined. These problems are addressed in both multi-core and many-core environments. This dissertation research first shows that a dedicated on-chip monitoring infrastructure (monitor network-on-chip, MNoC) can achieve better performance than a bus in interconnecting on-chip sensors in multi-core systems. Our experimental results show that this dedicated on-chip network can provide consistently low latency for sensor data packets without affecting application on-chip network traffic. The necessity of a dedicated infrastructure for monitoring is then addressed in a many-core environment. A two-level hierarchical network-on-chip (NoC), which allows for efficient sensor data collection in many-cores, is introduced. This design is evaluated using benchmark-driven simulations for a three-dimensional many-core system. The use of a two-level NoC is shown to provide an average 65% improvement in sensor data latency versus a flat sensor data NoC structure for a 256-core system. As the number of on-chip sensors increases, the accuracy of their measurements becomes increasingly important, since it directly affects processor performance and reliability.
For example, on-chip thermal sensors are used for monitoring system temperature, and their measurements may affect the system frequency and operating voltage. A new approach is introduced in this dissertation, which models imprecise thermal sensor measurements using probability distributions based on device parameters. Thermal measurements that are determined to be imprecise can be excluded from thermal management strategies. The collection of on-chip sensor measurements is facilitated by dedicated on-chip monitoring infrastructures. Experiments show that a sensor operating outside a desired precision can be identified with a detection rate of 87% and an average false alarm rate below 6%, at a 90% confidence level. The introduction of dedicated infrastructures for on-chip monitoring opens the door to more advanced run-time system control strategies. In this dissertation, run-time voltage droop compensation and soft error protection in multi-cores are targeted. High voltage droops in modern processors may cause serious reliability problems. A voltage droop compensation method considering ambient temperature changes is proposed to address this issue. In the proposed method, different reduced frequencies are used at different temperatures. A voltage droop signature sharing method in multi-core systems is proposed for early detection and remediation of high voltage droops. These two methods are combined and implemented in an 8-core system, and a performance benefit of 5% on average is achieved. Soft errors are caused by alpha particles and cosmic rays, among other sources, and can be detected by processor component redundancy. An approach to selectively enable redundancy to combat soft errors is also proposed in this dissertation. Both dual modular redundancy (DMR) and chip-level redundant threading (CRT) are used for adaptive redundancy protection. Power and energy savings of over 8% are achieved by both methods compared to conventional methods.
A multi-core architecture vulnerability factor (AVF) is also calculated for a multi-core environment, using the MNoC infrastructure.
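A crude stand-in for the sensor-imprecision detection described above (an invented thresholding scheme; the dissertation instead uses probability models built from device parameters): flag sensors whose reading sits implausibly far from the ensemble statistics, so that suspect measurements can be excluded from thermal management.

```python
from statistics import mean, stdev

def suspect_sensors(readings, k=3.0):
    """Return indices of sensors whose reading deviates from the ensemble
    mean by more than k standard deviations."""
    mu, sigma = mean(readings), stdev(readings)
    if sigma == 0:
        return []  # all sensors agree; nothing to flag
    return [i for i, r in enumerate(readings) if abs(r - mu) > k * sigma]
```

For readings `[50, 51, 49, 50, 90]` with `k=1.5`, only the outlying fifth sensor is flagged; a tight group of readings yields no flags. Real calibration would additionally model each sensor's own error distribution rather than a single ensemble threshold.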
APA, Harvard, Vancouver, ISO, and other styles
40

Zhang, Tiansheng. "Resource and thermal management in 3D-stacked multi-/many-core systems." Thesis, 2017. https://hdl.handle.net/2144/20837.

Full text
Abstract:
Continuous semiconductor technology scaling and the rapid increase in computational needs have stimulated the emergence of multi-/many-core processors. While up to hundreds of cores can be placed on a single chip, the performance capacity of the cores cannot be fully exploited due to high latencies of interconnects and memory, high power consumption, and low manufacturing yield in traditional (2D) chips. 3D stacking is an emerging technology that aims to overcome these limitations of 2D designs by stacking processor dies over each other and using through-silicon-vias (TSVs) for on-chip communication, and thus, provides a large amount of on-chip resources and shortens communication latency. These benefits, however, are limited by challenges in high power densities and temperatures. 3D stacking also enables integrating heterogeneous technologies into a single chip. One example of heterogeneous integration is building many-core systems with silicon-photonic network-on-chip (PNoC), which reduces on-chip communication latency significantly and provides higher bandwidth compared to electrical links. However, silicon-photonic links are vulnerable to on-chip thermal and process variations. These variations can be countered by actively tuning the temperatures of optical devices through micro-heaters, but at the cost of substantial power overhead. This thesis claims that unearthing the energy efficiency potential of 3D-stacked systems requires intelligent and application-aware resource management. Specifically, the thesis improves energy efficiency of 3D-stacked systems via three major components of computing systems: cache, memory, and on-chip communication. We analyze characteristics of workloads in computation, memory usage, and communication, and present techniques that leverage these characteristics for energy-efficient computing. 
This thesis introduces 3D cache resource pooling, a cache design that allows for flexible heterogeneity in cache configuration across a 3D-stacked system and improves cache utilization and system energy efficiency. We also demonstrate the impact of resource pooling on a real prototype 3D system with scratchpad memory. At the main memory level, we claim that utilizing heterogeneous memory modules and memory object level management significantly helps with energy efficiency. This thesis proposes a memory management scheme at a finer granularity: memory object level, and a page allocation policy to leverage the heterogeneity of available memory modules and cater to the diverse memory requirements of workloads. On the on-chip communication side, we introduce an approach to limit the power overhead of PNoC in (3D) many-core systems through cross-layer thermal management. Our proposed thermally-aware workload allocation policies coupled with an adaptive thermal tuning policy minimize the required thermal tuning power for PNoC, and in this way, help broader integration of PNoC. The thesis also introduces techniques in placement and floorplanning of optical devices to reduce optical loss and, thus, laser source power consumption.
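One of the thermally-aware allocation policies can be caricatured in a few lines (a deliberately simple greedy pairing, not the dissertation's actual policy; the power and temperature numbers below are invented): mapping the hottest workloads onto the coolest cores flattens the on-chip thermal profile, which in turn reduces how much the micro-heaters must tune the silicon-photonic devices.

```python
def thermal_aware_assign(job_powers, core_base_temps):
    """Greedily pair the highest-power jobs with the coolest cores.
    Returns a dict mapping job index -> core index."""
    jobs = sorted(range(len(job_powers)), key=lambda j: -job_powers[j])
    cores = sorted(range(len(core_base_temps)), key=lambda c: core_base_temps[c])
    return {j: c for j, c in zip(jobs, cores)}
```

With job powers `[10, 2, 6]` and core base temperatures `[40, 30, 35]`, the hottest job lands on the coolest core; under a simple linear power-to-temperature model this lowers the peak temperature relative to an identity mapping.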
APA, Harvard, Vancouver, ISO, and other styles
41

"Proceedings of the 4th Many-core Applications Research Community (MARC) Symposium." Universität Potsdam, 2012. http://opus.kobv.de/ubp/volltexte/2012/5789/.

Full text
Abstract:
In continuation of a successful series of events, the 4th Many-core Applications Research Community (MARC) symposium took place at the HPI in Potsdam on December 8th and 9th 2011. Over 60 researchers from different fields presented their work on many-core hardware architectures, their programming models, and the resulting research questions for the upcoming generation of heterogeneous parallel systems.
APA, Harvard, Vancouver, ISO, and other styles
