Dissertations / Theses on the topic 'Multi-Core and many-Core'
Consult the top 41 dissertations / theses for your research on the topic 'Multi-Core and many-Core.'
Kanellou, Eleni. "Data structures for current multi-core and future many-core architectures." Thesis, Rennes 1, 2015. http://www.theses.fr/2015REN1S171/document.
Though a majority of current processor architectures rely on shared, cache-coherent memory, current prototypes that integrate large numbers of cores, connected through a message-passing substrate, indicate that architectures of the near future may have these characteristics. Either of those tendencies requires that processes execute in parallel, making concurrent programming a necessary tool. The inherent difficulty of reasoning about concurrency, however, may make the new processor architectures hard to program. To deal with issues such as this, we explore approaches for providing ease of programmability. We propose WFR-TM, an approach based on transactional memory (TM), a concurrent programming paradigm that employs transactions to synchronize access to shared data. A transaction may either commit, making its updates visible, or abort, discarding its updates. WFR-TM combines desirable characteristics of pessimistic and optimistic TM. In a pessimistic TM, no transaction ever aborts; to achieve that, however, existing TM algorithms employ locks to execute update transactions sequentially, decreasing the degree of achieved parallelism. Optimistic TMs execute all transactions concurrently but commit them only if they have encountered no conflict during their execution. WFR-TM provides read-only transactions that are wait-free, without ever executing expensive synchronization operations (like CAS, LL/SC, etc.) or sacrificing the parallelism between update transactions. We further present Dense, a concurrent graph implementation. Graphs are versatile data structures that allow the implementation of a variety of applications. However, multi-process applications that rely on graphs still largely use a sequential implementation. We introduce an innovative concurrent graph model that provides addition and removal of any edge of the graph, as well as atomic traversals of a part (or the entirety) of the graph. Dense achieves wait-freedom by relying on light-weight helping and provides the inbuilt capability of performing a partial snapshot on a dynamically determined subset of the graph. We finally aim at predicted future architectures. In the interest of code reuse and of a common paradigm, there is recent momentum towards porting software runtime environments, originally intended for shared-memory settings, onto non-cache-coherent machines. JVM, the runtime environment of the high-productivity language Java, is a notable example. Concurrent data structure implementations are important components of the libraries that environments like these incorporate. With the goal of contributing to this effort, we study general techniques for implementing distributed data structures, assuming they have to run on many-core architectures that offer either partially cache-coherent memory or no cache coherence at all, and present implementations of stacks, queues, and lists.
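The commit-or-abort behaviour of an optimistic TM described in the abstract above can be illustrated with a minimal, single-threaded sketch. This is not WFR-TM and not the thesis's algorithm; the class names `VersionedStore` and `Transaction` are hypothetical, and a real TM would also have to make the commit step itself atomic.

```python
class VersionedStore:
    """Shared data in which every key carries a version counter (hypothetical)."""

    def __init__(self):
        self._data = {}  # key -> (value, version)

    def read(self, key):
        # Unknown keys read as (None, version 0).
        return self._data.get(key, (None, 0))

    def commit(self, read_versions, writes):
        """Apply the writes only if nothing the transaction read has changed."""
        for key, seen_version in read_versions.items():
            if self.read(key)[1] != seen_version:
                return False  # conflict: abort, the updates are discarded
        for key, value in writes.items():
            _, version = self.read(key)
            self._data[key] = (value, version + 1)
        return True  # commit: the updates become visible


class Transaction:
    """Buffers writes locally and remembers the version of every key it read."""

    def __init__(self, store):
        self.store = store
        self.read_versions = {}
        self.writes = {}

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        value, version = self.store.read(key)
        self.read_versions.setdefault(key, version)
        return value

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        return self.store.commit(self.read_versions, self.writes)


store = VersionedStore()
t1, t2 = Transaction(store), Transaction(store)
t1.write("x", (t1.read("x") or 0) + 1)
t2.write("x", (t2.read("x") or 0) + 1)
first = t1.commit()   # nothing changed since t1 read "x": commits
second = t2.commit()  # "x" changed after t2 read it: aborts
```

Both transactions run against the same snapshot, so only the first one to validate can commit; the second would have to be retried, which is the cost a pessimistic TM avoids by serialising update transactions.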
Serpa, Matheus da Silva. "Source code optimizations to reduce multi core and many core performance bottlenecks." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2018. http://hdl.handle.net/10183/183139.
Nowadays, several different architectures are available not only to industry but also to final consumers. Traditional multi-core processors, GPUs, accelerators such as the Xeon Phi, or even energy efficiency-driven processors such as the ARM family present very different architectural characteristics. This wide range of characteristics presents a challenge for application developers, who must deal with different instruction sets, memory hierarchies, or even different programming paradigms when programming for these architectures. To optimize an application, it is important to have a deep understanding of how it behaves on different architectures. Related work offers a wide variety of solutions. Most of them focus on improving only memory performance. Others focus on load balancing, vectorization, and thread and data mapping, but perform them separately, losing optimization opportunities. In this master's thesis, we propose several optimization techniques to improve the performance of a real-world seismic exploration application provided by Petrobras, a multinational corporation in the petroleum industry. In our experiments, we show that loop interchange is a useful technique to improve the performance of different cache memory levels, improving the performance by up to 5.3× and 3.9× on the Intel Broadwell and Intel Knights Landing architectures, respectively. By changing the code to enable vectorization, performance was increased by up to 1.4× and 6.5×. Load balancing improved the performance by up to 1.1× on Knights Landing. Thread and data mapping techniques were also evaluated, with a performance improvement of up to 1.6× and 4.4×. We also compared the best version of each architecture and showed that we were able to improve the performance of Broadwell by 22.7× and Knights Landing by 56.7× compared to a naive version, but, in the end, Broadwell was 1.2× faster than Knights Landing.
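Loop interchange, which the abstract above credits with the cache improvements, can be shown on a toy example. The sketch below is an illustration only, not code from the thesis; the function names are hypothetical, and in pure Python the cache effect is muted, so the point is the transformation itself: the two loops are swapped so that the inner loop walks along a row, which is the contiguous direction in a row-major layout.

```python
def column_sums_strided(matrix):
    """Inner loop walks down a column: each access jumps to a different row."""
    rows, cols = len(matrix), len(matrix[0])
    sums = [0] * cols
    for j in range(cols):
        for i in range(rows):
            sums[j] += matrix[i][j]
    return sums


def column_sums_interchanged(matrix):
    """Same computation after loop interchange: inner loop walks along a row."""
    rows, cols = len(matrix), len(matrix[0])
    sums = [0] * cols
    for i in range(rows):
        row = matrix[i]  # one row is traversed contiguously
        for j in range(cols):
            sums[j] += row[j]
    return sums


# 3 x 4 test matrix with entries 0..11, stored row by row.
matrix = [[i * 4 + j for j in range(4)] for i in range(3)]
strided = column_sums_strided(matrix)
interchanged = column_sums_interchanged(matrix)
```

The interchange is legal here because each `sums[j]` still receives its terms in the same order, so both versions return identical results.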
Martins, André Luís Del Mestre. "Multi-objective resource management for many-core systems." Pontifícia Universidade Católica do Rio Grande do Sul, 2018. http://tede2.pucrs.br/tede2/handle/tede/8096.
Many-core systems integrate several cores in a single die to provide high-performance computing in multiple market segments. The newest technology nodes introduce restrictive power caps that result in the utilization wall (also known as dark silicon), i.e., on-chip power dissipation prevents the use of all resources at full performance simultaneously. The workload of many-core systems includes real-time (RT) applications, which add application throughput as another constraint to meet. Also, dynamic workloads generate valleys and peaks of resource utilization over time. This scenario, complex high-performance systems subject to power and performance constraints, creates the need for multi-objective resource management (RM) able to dynamically adapt the system goals while respecting the constraints. Concerning RT applications, related works apply a design-time analysis of the expected workload to ensure throughput constraints. To address this limitation, design-time decisions, this Thesis proposes a hierarchical Runtime Energy Management (REM) for RT applications as the first work to link the execution of RT applications and RM under a power cap without design-time analysis of the application set. REM employs different mapping and DVFS (Dynamic Voltage and Frequency Scaling) heuristics for RT and non-RT tasks to save energy. Besides not considering RT applications, related works do not consider workload variation and propose single-objective RMs. To tackle this second limitation, single-objective RMs, this Thesis presents a hierarchical adaptive multi-objective resource management (MORM) for many-core systems under a power cap. MORM addresses dynamic workloads with peaks and valleys of resource utilization. MORM can dynamically shift the goals to prioritize energy or performance according to the workload behavior. Both RMs (REM and MORM) are multi-objective approaches. This Thesis employs the Observe-Decide-Act (ODA) paradigm as the design methodology to implement REM and MORM. The Observing consists of characterizing the cores and integrating hardware monitors to provide accurate and fast power-related information for an efficient RM. The Actuation configures the system actuators at runtime to enable the RMs to follow the multi-objective decisions. The Decision corresponds to REM and MORM, which share the Observing and Actuation infrastructure. REM and MORM stand out from related works regarding scalability, comprehensiveness, and accurate power and energy estimation. Concerning REM, evaluations on many-core systems with up to 144 cores show energy savings from 15% to 28% while keeping timing violations below 2.5%. Regarding MORM, results show it can drive applications to dynamically follow distinct objectives. Compared to a state-of-the-art RM targeting performance, MORM speeds up the workload valley by 11.56% and the workload peak by up to 49%.
Jelena, Tekić. "Оптимизација CFD симулације на групама вишејезгарних хетерогених архитектура." Phd thesis, Univerzitet u Novom Sadu, Prirodno-matematički fakultet u Novom Sadu, 2019. https://www.cris.uns.ac.rs/record.jsf?recordId=110976&source=NDLTD&language=en.
The subject of this dissertation belongs to the field of parallel programming: the implementation of the CFD (Computational Fluid Dynamics) method on several heterogeneous multi-core devices simultaneously. The dissertation presents several algorithms aimed at accelerating CFD simulation on personal computers. It also shows that the described solution achieves satisfactory performance on HPC devices (Tesla graphics cards). The simulation is built in a micro-service architecture that is portable and flexible and makes it easy to test CFD simulations on personal computers.
Singh, Ajeet. "GePSeA: A General-Purpose Software Acceleration Framework for Lightweight Task Offloading." Thesis, Virginia Tech, 2009. http://hdl.handle.net/10919/34264.
Consequently, this thesis proposes a framework called GePSeA (General Purpose Software Acceleration Framework), which uses a small fraction of the computational power on multi-core architectures to offload complex application-specific tasks. Specifically, GePSeA provides a lightweight process that acts as a helper agent to the application by executing application-specific tasks asynchronously and efficiently. GePSeA is not meant to replace hardware accelerators but to extend them. GePSeA provides several utilities called core components that offload tasks onto the core or to special-purpose hardware, when available, in a way that is transparent to the application. Examples of such core components include reliable communication service, distributed lock management, global memory management, dynamic load distribution, and network protocol processing. We then apply the GePSeA framework to two applications, namely mpiBLAST, an open-source computational biology application, and a Reliable Blast UDP (RBUDP) based file transfer application. We observe significant speed-up for both applications.
Master of Science
Singh, Kunal. "High-Performance Sparse Matrix-Multi Vector Multiplication on Multi-Core Architecture." The Ohio State University, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=osu1524089757826551.
Lo, Moustapha. "Application des architectures many core dans les systèmes embarqués temps réel." Thesis, Université Grenoble Alpes (ComUE), 2019. http://www.theses.fr/2019GREAM002/document.
Traditional single-core processors are no longer sufficient to meet the growing performance needs of the avionics domain. Multi-core and many-core processors have emerged in recent years to integrate several functions through resource sharing. However, not all multi-core and many-core processors satisfy avionic constraints. We prefer determinism over computing power, because the certification of such processors depends on mastering determinism.
The aim of this thesis is to evaluate the many-core processor MPPA-256 from Kalray in an avionic context. We choose the maintenance function HMS (Health Monitoring System), which requires significant bandwidth and a response time guarantee. This function also has parallelism properties: it processes data from sensors that are functionally independent, so their processing can be parallelized over several cores. This study focuses on deploying the existing sequential HMS on a many-core processor, from data acquisition to the computation of the health indicators, with a strong emphasis on the input flow.
Our research led to five main contributions:
• Transformation of the existing global algorithms into real-time ones that can process data as soon as they are available.
• Management of the input flow of vibration samples from the sensors to the computation of the health indicators, the availability of raw vibration data in the internal cluster, when they are consumed, and finally the workload estimation.
• Implementation of lightweight timing measurements directly on the MPPA-256 by adding timestamps to the data flow.
• A software architecture that respects real-time constraints even in the worst cases. The software architecture is based on three pipeline stages.
• Illustration of the limits of the existing function: our experiments have shown that contextual parameters of the helicopter, such as the rotor speed, must be correlated with the health indicators to reduce false alarms.
Lukarski, Dimitar [Verfasser]. "Parallel Sparse Linear Algebra for Multi-core and Many-core Platforms : Parallel Solvers and Preconditioners / Dimitar Lukarski." Karlsruhe : KIT-Bibliothek, 2012. http://d-nb.info/1020663480/34.
Júnior, Manoel Baptista da Silva. "Portabilidade com eficiência de trechos da dinâmica do modelo BRAMS entre arquiteturas multi-core e many-core." Instituto Nacional de Pesquisas Espaciais (INPE), 2015. http://urlib.net/sid.inpe.br/mtc-m21b/2015/04.28.19.21.
The continuous growth of spatial and temporal resolutions in current meteorological models demands increasing processing power. The prompt execution of these models requires the use of supercomputers with hundreds or thousands of nodes. Currently, these models are executed in the operational environment of CPTEC on a supercomputer composed of nodes with CPUs with tens of cores (multi-core). Newer supercomputer generations have nodes with CPUs coupled to processing accelerators, typically graphics cards (GPGPUs), containing hundreds of cores (many-core). Rewriting the model codes to use such nodes efficiently, with or without graphics cards (portable code), represents a challenge. The OpenMP programming interface, proposed decades ago, is a standard for efficiently exploiting multi-core architectures. A newer programming interface, OpenACC, has been proposed for many-core architectures. These two programming interfaces are similar, since they are based on parallelization directives for the concurrent execution of threads. This work shows the feasibility of writing a single code that embeds both interfaces and presents acceptable efficiency when executed on nodes with multi-core or many-core architecture. The code chosen as a case study is the advection of scalars, a part of the dynamics of the regional meteorological model BRAMS (Brazilian Regional Atmospheric Modeling System).
Thucanakkenpalayam, Sundararajan Karthik. "Energy efficient cache architectures for single, multi and many core processors." Thesis, University of Edinburgh, 2013. http://hdl.handle.net/1842/9916.
Baskaran, Muthu Manikandan. "Compile-time and Run-time Optimizations for Enhancing Locality and Parallelism on Multi-core and Many-core Systems." The Ohio State University, 2009. http://rave.ohiolink.edu/etdc/view?acc_num=osu1253557044.
Zhang, Jing. "Transforming and Optimizing Irregular Applications for Parallel Architectures." Diss., Virginia Tech, 2018. http://hdl.handle.net/10919/82069.
Ph. D.
Gibson, Michael John. "Genetic programming and cellular automata for fast flood modelling on multi-core CPU and many-core GPU computers." Thesis, University of Exeter, 2015. http://hdl.handle.net/10871/20364.
Olsen, Daniel. "PERFORMANCE-AWARE RESOURCE MANAGEMENT OF MULTI-THREADED APPLICATIONS FOR MANY-CORE SYSTEMS." OpenSIUC, 2016. https://opensiuc.lib.siu.edu/theses/1975.
Martinez, Arroyo Gabriel Ernesto. "Cu2cl: a Cuda-To-Opencl Translator for Multi- and Many-Core Architectures." Thesis, Virginia Tech, 2011. http://hdl.handle.net/10919/34233.
Master of Science
Méndez, Real Maria. "Spatial Isolation against Logical Cache-based Side-Channel Attacks in Many-Core Architectures." Thesis, Lorient, 2017. http://www.theses.fr/2017LORIS454/document.
Technological evolution and the ever-increasing demand for application performance have made many-core architectures the necessary new trend in processor design. These architectures are composed of a large number of processing resources (hundreds or more), providing massive parallelism and high performance. Indeed, many-core architectures allow a large number of applications coming from different sources, with different levels of sensitivity and trust, to be executed in parallel, sharing physical resources such as computation, memory, and communication infrastructure. However, this resource sharing introduces important security vulnerabilities. In particular, sensitive applications sharing cache memory with potentially malicious applications are vulnerable to logical cache-based side-channel attacks. These attacks allow an unprivileged application to access sensitive information manipulated by other applications despite partitioning methods such as memory protection and virtualization. While much effort has gone into countering these attacks on multi-core architectures, the resulting countermeasures were not designed for recently emerged many-core architectures and need to be evaluated and/or revisited to be practical for these new technologies. In this thesis work, we propose to enhance the operating system services with security-aware application deployment and resource allocation mechanisms in order to protect sensitive applications against cache-based attacks. Different application deployment strategies allowing spatial isolation are proposed and compared in terms of several performance indicators. Our proposal is evaluated through virtual prototyping based on SystemC and Open Virtual Platforms (OVP) technology.
Yang, Simei. "Run-Time Management for Energy Efficiency of Cluster-Based Multi/Many Core Systems." Thesis, Nantes, 2020. http://www.theses.fr/2020NANT4004.
Cluster-based multi/many-core platforms represent promising solutions to deliver high computing performance and energy efficiency in modern embedded systems. These platforms often support per-cluster Dynamic Voltage/Frequency Scaling (DVFS), allowing different clusters to change their own v/f levels independently. The increasing application complexity and application dynamism on such platforms raise the need for run-time management. This dissertation focuses on the run-time management of applications on cluster-based multi/many-core systems to improve energy efficiency. To this end, this dissertation presents different management strategies that estimate the mutual influence between application mapping and cluster v/f configurations to achieve, respectively, local optimization within a cluster and global optimization in the overall system. The proposed management strategies can achieve near-optimal management solutions with less strategy complexity compared to state-of-the-art strategies. In addition, this dissertation presents a new modelling and simulation approach that allows the evaluation of run-time management strategies in multi/many-core systems to guarantee that system constraints are fully met. The proposed simulation approach is validated using an industrial modelling and simulation framework.
Binotto, Alécio Pedro Delazari. "A dynamic scheduling runtime and tuning system for heterogeneous multi and many-core desktop platforms." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2011. http://hdl.handle.net/10183/34768.
A modern personal computer can now be considered a one-node heterogeneous cluster that simultaneously processes the tasks of several applications. It can be composed of asymmetric Processing Units (PUs), such as the multi-core Central Processing Unit (CPU), the many-core Graphics Processing Units (GPUs), which have become one of the main co-processors contributing to high performance computing, and other PUs. In this way, a powerful heterogeneous execution platform is built on a desktop for data-intensive calculations. From the perspective of this thesis, workload distribution over the PUs plays a key role in improving application performance and exploiting such heterogeneity. This issue presents challenges, since the execution cost of a task on a PU is non-deterministic and can be affected by a number of parameters not known a priori, such as the problem size and the precision of the solution, among others. Within this scope, this doctoral research introduces a context-aware runtime and performance tuning system based on a compromise between reducing the execution time of the applications, through appropriate dynamic scheduling of high-level tasks, and the cost of computing such scheduling, applied on a platform composed of CPU and GPUs. This approach combines a model for a first scheduling, based on an off-line task performance profile benchmark, with a runtime model that keeps track of the tasks' real execution time and efficiently schedules new instances of the high-level tasks dynamically over the CPU/GPU execution platform. To this end, a set of heuristics is proposed to schedule tasks over one CPU and one GPU, together with a generic and efficient scheduling strategy that considers several processing units. The proposed approach is applied in a case study using a CPU-GPU execution platform for computing iterative solvers for Systems of Linear Equations, using a stencil code specially designed to exploit the characteristics of modern GPUs. The solution uses the number of unknowns as the main parameter for the assignment decision. By scheduling tasks to the CPU and to the GPU, a performance gain of 21.77% is achieved in comparison to the static assignment of all tasks to the GPU (which is done by current programming models, such as OpenCL and CUDA for Nvidia), with a scheduling error of only 0.25% compared to exhaustive search.
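The two-phase idea in the abstract above (an off-line profile for the first assignment, then correction from measured execution times, with the number of unknowns as the deciding parameter) can be sketched as follows. This is an illustration under stated assumptions, not the thesis's heuristics: the class name `AdaptiveScheduler`, the linear cost model (fixed overhead plus a cost per unknown), the exponential smoothing, and every number in the profile are hypothetical.

```python
class AdaptiveScheduler:
    """Picks the processing unit with the lowest estimated execution time."""

    def __init__(self, profile, alpha=0.5):
        # profile: device -> (fixed overhead in s, cost per unknown in s),
        # assumed to come from an off-line benchmark.
        self.overhead = {dev: cost[0] for dev, cost in profile.items()}
        self.per_unknown = {dev: cost[1] for dev, cost in profile.items()}
        self.alpha = alpha  # weight given to the newest measurement

    def estimate(self, device, unknowns):
        return self.overhead[device] + self.per_unknown[device] * unknowns

    def assign(self, unknowns):
        # The number of unknowns is the main parameter of the decision.
        return min(self.overhead, key=lambda dev: self.estimate(dev, unknowns))

    def record(self, device, unknowns, seconds):
        # Runtime phase: blend the measured cost into the stored estimate.
        observed = (seconds - self.overhead[device]) / unknowns
        old = self.per_unknown[device]
        self.per_unknown[device] = (1 - self.alpha) * old + self.alpha * observed


# Hypothetical profile: the GPU pays a fixed transfer overhead but is
# cheaper per unknown, so it should win only on large systems.
scheduler = AdaptiveScheduler({"cpu": (0.0, 2e-6), "gpu": (1e-3, 1e-7)})
small_choice = scheduler.assign(100)
large_choice = scheduler.assign(1_000_000)

# A slow measured GPU run (for example, a contended device) shifts the estimate.
scheduler.record("gpu", 1_000_000, 5.0)
after_feedback = scheduler.assign(1_000_000)
```

With this profile, small systems go to the CPU and large ones to the GPU until the recorded slow run pushes the GPU estimate above the CPU's, which is the kind of correction a purely static assignment cannot make.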
Khizakanchery, Natarajan Surya Narayanan. "Modeling performance of serial and parallel sections of multi-threaded programs in many-core era." Thesis, Rennes 1, 2015. http://www.theses.fr/2015REN1S015/document.
This thesis work is done in the general context of the ERC-funded Defying Amdahl's Law (DAL) project, which aims at exploring the micro-architectural techniques that will enable high performance on future many-core processors. The project envisions that, despite future huge investments in developing parallel applications and porting them to parallel architectures, most applications will still exhibit a significant amount of sequential code, and hence we should still focus on improving the performance of the serial sections of the application. In this thesis, the research work primarily focuses on studying the difference between parallel and serial sections of existing multi-threaded (MT) programs and on exploring the design space with respect to the processor core requirements for the serial and parallel sections in future many-core processors, with the area-performance tradeoff as a primary goal.
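The motivation behind the project's name can be made concrete with Amdahl's law itself: if a fraction p of a program's execution can be parallelized over n cores, the speedup is 1 / ((1 - p) + p / n). The short sketch below evaluates that formula; the function name is hypothetical and the fractions are illustrative values, not measurements from the thesis.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Even with 95% parallel code, the serial 5% caps the speedup below 20x,
# however many cores are added.
speedup_64 = amdahl_speedup(0.95, 64)
speedup_huge = amdahl_speedup(0.95, 10**9)
```

This bound is why the abstract argues that serial sections deserve attention even on many-core processors.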
Laville, Guillaume. "Exécution efficace de systèmes Multi-Agents sur GPU." Thesis, Besançon, 2014. http://www.theses.fr/2014BESA2016/document.
Recent years have seen the emergence of parallelism in many fields of computer science. This is explained by the stagnation of the frequency of execution units at the hardware level and by the increasing usage of parallel platforms at the software level. A form of parallelism is present in multi-agent systems, which facilitate the description of complex systems as a collection of interacting entities. Although the similarity between this software parallelism and this logical parallelism seems obvious, the parallelization process remains difficult in this case because of the numerous dependencies encountered in many multi-agent systems. In this thesis, we propose a common solution to facilitate the adaptation of these models to a parallel platform such as GPUs. Our library, MCMAS, provides access to two programming interfaces to facilitate this adaptation: a low-level layer providing direct access to OpenCL, MCM, and a high-level set of plugins not requiring any GPU-related knowledge. We study the usage of this library on three existing multi-agent models: predator-prey, MIOR, and Collembola. To prove the interest of the approach, we present a performance study for each model and an analysis of the various factors contributing to an efficient execution on GPUs. We conclude with an overview of the work and results presented in the report and suggest future directions to enhance our solution.
Gupta, Vishakha. "Coordinated system level resource management for heterogeneous many-core platforms." Diss., Georgia Institute of Technology, 2011. http://hdl.handle.net/1853/42750.
Wang, Boqian. "High-Performance Network-on-Chip Design for Many-Core Processors." Licentiate thesis, KTH, Elektronik och inbyggda system, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-283517.
With the development of on-chip manufacturing technology and the demands of high-performance computing, the core count is growing quickly in Chip Multi/Many-core Processors (CMPs) and Multiprocessor Systems-on-Chip (MPSoCs) to support greater parallelism. Network-on-Chip (NoC) has become the de facto solution for CMPs and MPSoCs to meet the communication challenge. In this thesis, we address several important problems in high-performance NoC design.
For general-purpose CMPs, we take a full-system perspective to design high-performance NoCs for multi-threaded programs. By exploring cache coherence under the full-system scenario, we present a smart communication service, AVCR (Advance Virtual Channel Reservation), which provides a highway for target packets and can greatly reduce their delay in the NoC. AVCR exploits the fact that the destination of some packets can be known or predicted before they arrive at the network interface (NI). By using the time interval before a packet is ready, AVCR establishes an end-to-end highway from the source NI to the destination NI. This highway is built by reserving virtual channel (VC) resources before the target packet is transmitted and by offering priority service to flits in the reserved VC in the wormhole router. In addition, we propose an admission control method for the NoC with a centralized Artificial Neural Network (ANN) admission controller, which can improve system performance by predicting the most appropriate injection rate of each node from network performance information. In the online control process, a data preprocessing unit is used to simplify the ANN architecture and make the prediction results more accurate. Based on the preprocessed information, the ANN predictor determines the control strategy and sends it to each node, where the admission control is applied.
For application-specific MPSoCs, we focus on developing high-performance NoC and NI designs compatible with the common AMBA AXI4 protocol. To make it possible to use AXI4-based processors and peripherals in the on-chip network system, we propose a complete system architecture solution that makes the AXI4 protocol compatible with NoC-based communication in the multi-core system. Because out-of-order transmission in the NoC conflicts with the ordering requirements specified in the AXI4 protocol, we focus first on the design of the transaction ordering units, to realize a high-performance, low-cost solution to the ordering requirements. We then focus on the NI and on Quality of Service (QoS) support in the NoC. In our design, the NI makes the NoC architecture independent of the AXI4 protocol through message format conversion between the AXI4 signal format and the packet format, which gives the NoC design high flexibility. The NoC-based communication architecture is designed to support multiple QoS schemes with high performance. The NoC system contains Time-Division Multiplexing (TDM) and VC subnetworks to apply multiple QoS schemes to AXI4 signals with different QoS tags, and the NI is responsible for distributing traffic between the two subnetworks. In addition, a QoS inheritance mechanism is applied in the slave-side NI to support QoS during a packet's round-trip transfer in the NoC.
Dickman, Thomas J. "Event List Organization and Management on the Nodes of a Many-Core Beowulf Cluster." University of Cincinnati / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1378196499.
Roussel, Adrien. "Parallélisation sur un moteur exécutif à base de tâches des méthodes itératives pour la résolution de systèmes linéaires creux sur architecture multi et many coeurs : application aux méthodes de types décomposition de domaines multi-niveaux." Thesis, Université Grenoble Alpes (ComUE), 2018. http://www.theses.fr/2018GREAM010/document.
Numerical methods in reservoir engineering simulations require solving large, sparse, unstructured linear systems. The performance of the iterative methods used in the simulator to solve these systems is crucial for considering many more scenarios. In this work, we present a way to implement efficient parallel iterative methods on top of a task-based runtime system. This simplifies the development of methods while retaining control over parallelism management. We propose a linear algebra API that expresses task dependencies implicitly: the semantics are sequential while the parallelism is implicit. We have extended the HARTS runtime system to monitor executions in order to better exploit NUMA architectures. Moreover, we implement a scheduling policy that exploits data locality for task placement. We have extended the API for KNL many-core systems, taking into account the various memory banks available. This work has led to the optimization of the SpMV kernel, one of the most time-consuming operations in iterative methods. This work has been evaluated on iterative methods, particularly on one method from domain decomposition. We thereby demonstrate that the API achieves good performance on both multi-core and many-core architectures.
Vargas, Vallejo Vanessa Carolina. "Approche logicielle pour améliorer la fiabilité d’applications parallèles implémentées dans des processeurs multi-cœur et many-cœur." Thesis, Université Grenoble Alpes (ComUE), 2017. http://www.theses.fr/2017GREAT042/document.
The large computing capacity, great flexibility, low power consumption, intrinsic redundancy and high performance provided by multi/many-core processors make them ideal for meeting the new challenges in computing systems. However, the degree of integration of these devices increases their sensitivity to the effects of natural radiation. Consequently, manufacturers and industrial and university partners are working together to improve the characteristics that allow their use in critical embedded systems. In this context, the work done throughout this thesis aims at evaluating the impact of SEEs on parallel applications running on multi-core and many-core processors, and at proposing a software approach to improve system reliability. The methodology used for evaluation was based on multiple case studies. The scenarios implemented cover a wide range of system configurations in terms of multi-processing mode, programming model, memory model, and resources used. For the experiments, two COTS devices were selected: the Freescale PowerPC P2041 quad-core, built in 45nm SOI technology, and the KALRAY MPPA-256 many-core processor, built in 28nm CMOS technology. The case studies were evaluated through fault injection and neutron radiation. The results serve as useful guidelines to developers for choosing the most reliable system configuration according to their requirements. Furthermore, the evaluation results of the proposed N-MoRePar fault-tolerant approach, based on redundancy and partitioning criteria, support the use of COTS multi/many-core processors in high-dependability systems.
Brière, Alexandre. "Modélisation système d'une architecture d'interconnexion RF reconfigurable pour les many-cœurs." Thesis, Paris 6, 2017. http://www.theses.fr/2017PA066296/document.
The growing number of cores in a single chip goes along with an increase in communications. The variety of applications running on the chip causes spatial and temporal heterogeneity of communications. To address these issues, we present in this thesis a dynamically reconfigurable interconnect based on Radio Frequency (RF) for intra-chip communications. The use of RF increases the bandwidth while minimizing the latency. Dynamic reconfiguration of the interconnect makes it possible to handle the heterogeneity of communications. We present the rationale for choosing RF over optics and 3D, the detailed architecture of the network and the chip implementing it, and the evaluation of its feasibility and its performance. During the evaluation phase we were able to show that, for a CMP of 1,024 tiles, our solution allowed a performance gain of 13%. One advantage of this RF interconnect is the ability to broadcast without additional cost compared to point-to-point communications, opening new perspectives in terms of cache coherence.
Müller, Thomas [Verfasser]. "Techniques for adapting Industrial Simulation Software for Power Devices and Networks to Multi- and Many-Core Architectures / Thomas Müller." München : Verlag Dr. Hut, 2014. http://d-nb.info/1052375227/34.
Hofmann, Johannes [Verfasser]. "A First-Principles Approach to Performance, Power, and Energy Models for Contemporary Multi- and Many-Core Processors / Johannes Hofmann." München : Verlag Dr. Hut, 2019. http://d-nb.info/1198542780/34.
Ho, Minh Quan. "Optimisation de transfert de données pour les processeurs pluri-coeurs, appliqué à l'algèbre linéaire et aux calculs sur stencils." Thesis, Université Grenoble Alpes (ComUE), 2018. http://www.theses.fr/2018GREAM042/document.
The upcoming Exascale target in High Performance Computing (HPC) and disruptive achievements in artificial intelligence are giving rise to alternative, non-conventional many-core architectures that offer the energy efficiency typical of embedded systems while providing the same software ecosystem as classic HPC platforms. A key enabler of energy-efficient computing on many-core architectures is the exploitation of data locality, specifically the use of scratchpad memories in combination with DMA engines in order to overlap computation and communication. This software paradigm raises considerable programming challenges for both the vendor and the application developer. In this thesis, we tackle the memory transfer and performance issues, as well as the programming challenges, of memory- and compute-intensive HPC applications on the Kalray MPPA many-core architecture. With the first, memory-bound use case of the lattice Boltzmann method (LBM), we provide generic and fundamental techniques for decomposing three-dimensional iterative stencil problems onto clustered many-core processors fitted with scratchpad memories and DMA engines. The developed DMA-based streaming and overlapping algorithm delivers a 33% performance gain over the default cache-based implementation. High-dimensional stencil computation suffers from a serious I/O bottleneck and limited on-chip memory space. We developed a new in-place LBM propagation algorithm, which halves the memory footprint and yields 1.5 times higher performance-per-byte efficiency than the state-of-the-art out-of-place algorithm. On the compute-intensive side, with dense linear algebra computations, we build an optimized matrix multiplication benchmark based on the exploitation of scratchpad memory and efficient asynchronous DMA communication.
These techniques are then extended to a DMA module of the BLIS framework, which allows us to instantiate an optimized and portable level-3 BLAS numerical library on any DMA-based architecture in less than 100 lines of code. We achieve 75% of peak performance on the MPPA processor with the matrix multiplication operation (GEMM) from the standard BLAS library, without having to write thousands of lines of laboriously optimized code for the same result.
Ramos, Vargas Pablo Francisco. "Evaluation de la sensibilité face aux SEE et méthodologie pour la prédiction de taux d’erreurs d’applications implémentées dans des processeurs Multi-cœur et Many-cœur." Thesis, Université Grenoble Alpes (ComUE), 2017. http://www.theses.fr/2017GREAT022/document.
The present thesis aims at evaluating the static and dynamic SEE sensitivity of three different COTS multi-core and many-core processors. The first is the Freescale P2041 multi-core processor, manufactured in 45nm SOI technology, which implements ECC and parity in its cache memories. The second is the Kalray MPPA-256 many-core processor, manufactured in 28nm TSMC CMOS technology, which integrates 16 compute clusters, each with 16 processor cores, and implements ECC in its static memories and parity in its cache memories. The third is the Adapteva Epiphany E16G301 microprocessor, manufactured in a 65nm CMOS process, which integrates 16 processor cores and does not implement protection mechanisms. The evaluation was carried out through radiation experiments with 14 MeV neutrons in particle accelerators to emulate a harsh radiation environment, and by fault injection in cache memories, shared memories or processor registers to simulate the consequences of SEUs on the execution of the program. A deep analysis of the observed errors was carried out to identify vulnerabilities in the protection mechanisms. Critical zones such as address tags and general-purpose registers were affected during the radiation experiments. In addition, the Code Emulating Upset (CEU) approach, developed at TIMA Laboratory, was extended to multi-core and many-core processors to predict the application error rate by combining the results of fault injection campaigns with those of radiation experiments.
Sarrazin, Guillaume. "Simulation fonctionnelle native pour des systèmes many-cœurs." Thesis, Université Grenoble Alpes (ComUE), 2016. http://www.theses.fr/2016GREAM015/document.
The number of transistors in one chip is increasing following Moore's conjecture, which says that the number of transistors per chip doubles every two years. Current systems are so complex that chip design and chip-specific software development take too much time, even if software development is done in parallel with the design of the hardware architecture, often because of system integration issues. To help reduce this time, the general solution is to use virtual platforms to reproduce the behavior of the target chip. The simulation speed of these platforms is a major issue, especially for many-core systems, in which the number of programmable cores is very high. We focus in this thesis on native simulation. Its principle is to compile source code directly for the host architecture to allow very fast simulation, at the cost of requiring "equivalent" features on the target and host cores. However, some features specific to the target core can be missing in the host core. Hardware Assisted Virtualization (HAV) is used to ease native simulation, but it reinforces the dependency of the target chip simulation on the host core capabilities. In this context, we propose a solution to simulate the target core's specific functional features with HAV-based native simulation. Among target core features, the floating point unit is an important element that is neglected in native simulation, leading to potential functional differences between target and host computation results. We restrict our study to the compiled simulation technique and propose a methodology that accurately simulates floating point computations while still keeping a good simulation speed. Finally, native simulation has a scalability issue. Time decoupling problems generate unnecessary code simulation during synchronisation protocols between threads executed on the target cores, leading to a significant decrease in simulation speed when the number of cores grows.
We address this problem and propose solutions to improve the scalability of native simulation.
Müller, Thomas [Verfasser], Arndt [Akademischer Betreuer] Bode, Hans-Joachim [Akademischer Betreuer] Bungartz, and Carsten [Akademischer Betreuer] Trinitis. "Techniques for adapting Industrial Simulation Software for Power Devices and Networks to Multi- and Many-Core Architectures / Thomas Müller. Gutachter: Hans-Joachim Bungartz ; Arndt Bode ; Carsten Trinitis. Betreuer: Arndt Bode." München : Universitätsbibliothek der TU München, 2014. http://d-nb.info/1051078245/34.
Hashmi, Jahanzeb Maqbool. "Designing High Performance Shared-Address-Space and Adaptive Communication Middlewares for Next-Generation HPC Systems." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1588038721555713.
Ye, Fan. "Nouveaux algorithmes numériques pour l’utilisation efficace des architectures multi-cœurs et hétérogènes." Thesis, Lille 1, 2015. http://www.theses.fr/2015LIL10169/document.
This study is driven by the real computational needs of different fields of reactor physics, such as neutronics or thermal hydraulics, where the eigenvalue problem and the solution of linear systems are the key challenges that consume substantial computing resources. In this context, our objective is to design and improve parallel computing techniques, including proposing efficient linear algebra kernels and parallel numerical methods. In a shared-memory environment such as the Intel Many Integrated Core (MIC) system, the parallelization of an algorithm is achieved in terms of fine-grained task parallelism and data parallelism. For scheduling the tasks, two main policies, work-sharing and work-stealing, were studied. For the sake of generality and reusability, we use common parallel programming interfaces, such as OpenMP, Cilk/Cilk+, and TBB. For vectorizing the tasks, the available tools include Cilk+ array notation, SIMD pragmas, and intrinsic functions. We evaluated these techniques and propose an efficient dense matrix-vector multiplication kernel. To tackle a more complicated situation, we propose using a hybrid MPI/OpenMP model for implementing sparse matrix-vector multiplication. We also designed a performance model for characterizing performance issues on MIC and guiding the optimization. For solving the linear system, we derived a scalable parallel solver from the Monte Carlo method. This method exhibits inherently abundant parallelism, which is a good fit for many-core architectures. To address some of the fundamental bottlenecks of this solver, we propose a task-based execution model that completely fixes the problems.
Perret, Quentin. "Exécution prédictible sur processeurs pluri-coeurs." Thesis, Toulouse, ISAE, 2017. http://www.theses.fr/2017ESAE0007/document.
In this thesis, we study the suitability of the distributed architecture of many-core processors for the design of highly constrained real-time systems, as is the case in avionics. We first propose a thorough analysis of an existing COTS processor, namely the KALRAY MPPA®-256, and we identify some of its shared resources as paths of interference when shared among several applications. We provide an execution model that restricts access to these resources in order to mitigate their impact on WCETs and to temporally isolate co-running applications. We describe in detail how such an execution model can be implemented with a hypervisor that provides the expected property of temporal isolation at run-time. Based on this, we formalize a notion of partition, which represents the association of an application with a resource budget. In our approach, an application placed in a partition is guaranteed to be temporally isolated from applications placed in other partitions. Then, assuming that applications and resource budgets are given, we propose to use constraint programming to verify automatically whether the amount of resources requested by a budget is sufficient to meet all of the application’s constraints. Simultaneously, when a budget is valid, our approach computes a schedule of the application on the subset of the processor’s resources allocated to it.
Porada, Katarzyna. "Contribution à la parallélisation automatique : un modèle de processeur à beaucoup de coeurs parallélisant." Thesis, Perpignan, 2017. http://www.theses.fr/2017PERP0063.
The pursuit of faster and more powerful machines started with the first computers. After exhausting increases in clock frequency, manufacturers turned to another solution and started to introduce multiple cores on a chip. The computational model today is based on OS threads, exploited through different languages offering parallel constructs. However, parallel programming remains an art because thread management by the operating system is not deterministic. Nonetheless, it is possible to compute in a parallel, deterministic way if we replace the thread model with a model built on the partial order of dependencies. In this thesis, we present an alternative architectural model exploiting the Instruction Level Parallelism (ILP) naturally present in applications. We propose many techniques to remove most of the architectural dependencies, which leads to an ILP that increases with the execution length. The ILP reached this way is enough to feed thousands of cores. Eliminating the architectural dependencies that serialize the run makes it possible to exploit ILP better than in current microarchitectures. VHDL code at the RTL level has been implemented to measure the benefits of our design. The results of the synthesis of a processor ranging from 2 to 64 cores are reported. They show that the speed of the proposed hardware stays constant and the area grows linearly with the number of cores: our interconnect solution is scalable.
Gallet, Camille. "Étude de transformations et d’optimisations de code parallèle statique ou dynamique pour architecture "many-core"." Thesis, Paris 6, 2016. http://www.theses.fr/2016PA066747/document.
From the 1960s to the present, supercomputers have undergone three revolutions: (i) the arrival of transistors to replace triodes, (ii) the appearance of vector computing, and (iii) clusters. Clusters currently consist of standard processors that have gained computing power through increased frequency, the proliferation of cores on the chip and the widening of computing units (SIMD instruction sets). A recent example combining a large number of cores and wide (512-bit) vector units is the Intel Xeon Phi coprocessor. To maximize computing performance on these chips by better exploiting these SIMD instructions, it is necessary to reorganize the bodies of loop nests, taking into account irregular aspects (control flow and data flow). To this end, this thesis proposes to extend the transformation named Deep Jam to extract the regularity of an irregular code and facilitate vectorization. This thesis presents our extension and its application to a multi-material hydrodynamics mini-application, HydroMM. These studies show that it is possible to achieve a significant performance gain on irregular codes.
Rihani, Hamza. "Analyse temporelle des systèmes temps-réels sur architectures pluri-coeurs." Thesis, Université Grenoble Alpes (ComUE), 2017. http://www.theses.fr/2017GREAM074/document.
Predictability is of paramount importance in real-time and safety-critical systems, where non-functional properties, such as timing behavior, have a high impact on the system's correctness. As many safety-critical systems have a growing performance demand, classical architectures, such as single-cores, are no longer sufficient. One increasingly popular solution is the use of multi-core systems, even in the real-time domain. Recent many-core architectures, such as the Kalray MPPA, were designed to take advantage of the performance benefits of a multi-core architecture while offering a certain predictability. It is still hard, however, to predict the execution time due to interference on shared resources (e.g., bus, memory, etc.). To tackle this challenge, Time Division Multiple Access (TDMA) buses are often advocated. In the first part of this thesis, we are interested in the timing analysis of accesses to shared resources in such environments. Our approach uses Satisfiability Modulo Theories (SMT) to encode the semantics and the execution time of the analyzed program. To estimate the delays of shared resource accesses, we propose an SMT model of a shared TDMA bus. An SMT solver is used to find a solution that corresponds to the execution path with the maximal execution time. Using examples, we show how the worst-case execution time estimation is enhanced by combining the semantics and the shared bus analysis in SMT. In the second part, we introduce a response time analysis technique for Synchronous Data Flow programs. These are mapped to multiple parallel dependent tasks running on a compute cluster of the Kalray MPPA-256 many-core processor. The analysis we devise computes a set of response times and release dates that respect the constraints in the task dependency graph. We derive a mathematical model of the multi-level bus arbitration policy used by the MPPA.
Further, we refine the analysis to account for (i) release dates and response times of co-runners, (ii) task execution models, (iii) use of memory banks, and (iv) memory access pipelining. Further improvements to the precision of the analysis were achieved by considering only accesses that block the emitting core in the interference analysis. Our experimental evaluation focuses on randomly generated benchmarks and an avionics case study.
Zhao, Jia. "On-chip monitoring infrastructures and strategies for multi-core and many-core systems." 2012. https://scholarworks.umass.edu/dissertations/AAI3518401.
Zhang, Tiansheng. "Resource and thermal management in 3D-stacked multi-/many-core systems." Thesis, 2017. https://hdl.handle.net/2144/20837.
"Proceedings of the 4th Many-core Applications Research Community (MARC) Symposium." Universität Potsdam, 2012. http://opus.kobv.de/ubp/volltexte/2012/5789/.