Dissertations / Theses on the topic 'Architecture manycore'
Cho, Myong Hyon. "On-chip networks for manycore architecture." Thesis, Massachusetts Institute of Technology, 2013. http://hdl.handle.net/1721.1/84885.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 109-116).
Over the past decade, increasing the number of cores on a single processor has successfully enabled continued improvements of computer performance. Further scaling these designs to tens and hundreds of cores, however, still presents a number of hard problems, such as scalability, power efficiency and effective programming models. A key component of manycore systems is the on-chip network, which faces increasing efficiency demands as the number of cores grows. In this thesis, we present three techniques for improving the efficiency of on-chip interconnects. First, we present PROM (Path-based, Randomized, Oblivious, and Minimal routing) and BAN (Bandwidth Adaptive Networks), techniques that offer efficient intercore communication for bandwidth-constrained networks. Next, we present ENC (Exclusive Native Context), the first deadlock-free, fine-grained thread migration protocol developed for on-chip networks. ENC demonstrates that a simple and elegant technique in the on-chip network can provide critical functional support for higher-level application and system layers. Finally, we provide a realistic context by sharing our hands-on experience in the physical implementation of the on-chip network for the Execution Migration Machine, an ENC-based 110-core processor fabricated in 45nm ASIC technology.
by Myong Hyon Cho.
Ph.D.
Stubbfält, Erik. "Hardware Architecture Impact on Manycore Programming Model." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-441739.
Dévigne, Clément. "Exécution sécurisée de plusieurs machines virtuelles sur une plateforme Manycore." Thesis, Paris 6, 2017. http://www.theses.fr/2017PA066138/document.
Manycore architectures, which comprise a large number of cores, are one answer to the ever-growing demand for digital data processing, especially in the context of cloud computing infrastructures. These data, which can belong to companies as well as private individuals, are sensitive by nature, which is why isolation is a paramount concern. Since the beginning of cloud computing, virtualization techniques have been increasingly used to allow different users to physically share the same hardware resources. This is all the more true for manycore architectures, and it falls partly to the architecture to guarantee that data integrity and confidentiality are preserved for the software it executes. We propose in this thesis a secure virtualization environment for a manycore architecture. Our mechanism relies on hardware components and hypervisor software to isolate several operating systems running on the same architecture. The hypervisor is in charge of allocating resources to the virtualized operating systems, but does not have the right to access the resources allocated to these systems. Thus, a security flaw in the hypervisor does not imperil the data confidentiality and integrity of the virtualized systems. Our solution is evaluated on a cycle-accurate virtual prototype and has been implemented in a coherent shared-memory manycore architecture. Our evaluations target the hardware and performance overheads added by our mechanisms. Finally, we analyze the security provided by our solution.
Azar, Céline. "On the design of a distributed adaptive manycore architecture for embedded systems." Lorient, 2012. http://www.theses.fr/2012LORIS268.
Chip design challenges have recently emerged at many levels: the increasing number of cores at the hardware level, the complexity of parallel programming models at the software level, and the dynamic requirements of current applications. In response to this evolution, this PhD thesis aims to design a distributed adaptive manycore architecture, named CEDAR (Configurable Embedded Distributed ARchitecture), whose main assets are scalability, flexibility and simplicity. The CEDAR platform is an array of homogeneous, small-footprint RISC processors, each connected to its four nearest neighbors. There is no global control; instead, control is distributed among the cores. Two versions of the platform are designed, along with a user-familiar programming model. A software version, CEDAR-S, is the basic implementation, in which adjacent cores are connected to each other via shared buffers. A co-processor called DMC (Direct Management of Communications) is added in the CEDAR-H version to optimize the routing protocol. The DMCs are interconnected in a mesh fashion. Two novel concepts are proposed to enhance the adaptiveness of CEDAR. First, a distributed dynamic routing strategy, based on a bio-inspired algorithm, handles routing in an unsupervised fashion and is independent of the physical placement of communicating tasks. The second concept is dynamic distributed task migration in response to several system and application requirements. Results show that CEDAR achieves high performance with its optimized routing strategy, compared to state-of-the-art networks. The migration cost is evaluated and adequate protocols are presented. CEDAR is shown to be a promising design concept for future manycores.
Dévigne, Clément. "Exécution sécurisée de plusieurs machines virtuelles sur une plateforme Manycore." Electronic Thesis or Diss., Paris 6, 2017. http://www.theses.fr/2017PA066138.
Gallet, Camille. "Étude de transformations et d’optimisations de code parallèle statique ou dynamique pour architecture "many-core"." Thesis, Paris 6, 2016. http://www.theses.fr/2016PA066747/document.
From the 1960s to the present, supercomputers have undergone three revolutions: (i) the arrival of transistors to replace triodes, (ii) the appearance of vector computation, and (iii) clusters. Clusters currently consist of standard processors that have gained computing power through higher clock frequencies, the proliferation of cores on the chip, and wider computing units (SIMD instruction sets). A recent example involving a large number of cores and wide (512-bit) vector units is the Intel Xeon Phi coprocessor. To maximize computing performance on these chips by better exploiting SIMD instructions, it is necessary to reorganize the bodies of loop nests, taking into account irregular aspects (control flow and data flow). To this end, this thesis proposes to extend the transformation named Deep Jam to extract regularity from irregular code and facilitate vectorization. This thesis presents our extension and its application to a multi-material hydrodynamics mini-application, HydroMM. These studies show that it is possible to achieve a significant performance gain on irregular codes.
Bechara, Charly. "Study and design of a manycore architecture with multithreaded processors for dynamic embedded applications." Phd thesis, Université Paris Sud - Paris XI, 2011. http://tel.archives-ouvertes.fr/tel-00713536.
Park, Seo Jin. "Analyzing performance and usability of broadcast-based inter-core communication (ATAC) on manycore architecture." Thesis, Massachusetts Institute of Technology, 2013. http://hdl.handle.net/1721.1/85219.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 55-56).
In this thesis, I analyze the performance and usability benefits of broadcast-based inter-core communication on manycore architecture. The problem of high communication cost on manycore architecture was tackled by a new architecture which allows efficient broadcasting by leveraging an on-chip optical network. I designed the new architecture and API for the new broadcasting feature and implemented them on a multicore simulator called Graphite. I also re-implemented common parallel APIs (barrier and work-stealing) which benefit from the cheap broadcasting and showed their ease of use and superior performance versus existing parallel programming libraries by running well-known benchmarks on the Graphite simulator.
by Seo Jin Park.
M. Eng.
Gao, Yang. "Contrôleur de cache générique pour une architecture manycore massivement parallèle à mémoire partagée cohérente." Paris 6, 2011. http://www.theses.fr/2011PA066296.
Karaoui, Mohamed Lamine. "Système de fichiers scalable pour architectures many-cores à faible empreinte énergétique." Thesis, Paris 6, 2016. http://www.theses.fr/2016PA066186/document.
In this thesis we study the problems of implementing a UNIX-like scalable file system on a hardware cache-coherent NUMA manycore architecture. To this end, we use the TSAR manycore architecture and ALMOS, a UNIX-like operating system.
The TSAR architecture presents, from the operating system's point of view, three problems, for which we offer a set of solutions. One of these problems is specific to the TSAR architecture, while the others are common to existing coherent NUMA manycores.
The first problem concerns the support of a physical memory that is larger than the virtual memory. This is due to the extended physical address space of TSAR, which is 256 times bigger than the virtual address space. To resolve this problem, we modified the structure of the kernel to decompose it into multiple communicating units.
The second problem is the placement strategy to be used for the file system structures. To solve this problem, we implemented a strategy that evenly distributes the data across the different memory banks.
The third problem is the synchronization of concurrent accesses to the file system. Our solution to this problem uses multiple mechanisms. In particular, it uses an efficient lock-free mechanism that we designed, which synchronizes accesses between several readers and a single writer.
Experimental results show that: (1) structuring the kernel into multiple units does not degrade performance and may even improve it; (2) our set of solutions achieves performance that scales better than NetBSD; (3) the placement strategy that distributes the data evenly is the best suited to manycore architectures.
Lööw, Andreas. "A Functional-Level Simulator for the Configurable (Many-Core) PRAM-Like REPLICA Architecture." Thesis, Linköpings universitet, Programvara och system, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-79049.
Bates, Daniel. "Exploiting tightly-coupled cores." Thesis, University of Cambridge, 2014. https://www.repository.cam.ac.uk/handle/1810/245179.
Karaoui, Mohamed Lamine. "Système de fichiers scalable pour architectures many-cores à faible empreinte énergétique." Electronic Thesis or Diss., Paris 6, 2016. http://www.theses.fr/2016PA066186.
Alnervik, Erik. "Evaluation of the Configurable Architecture REPLICA with Emulated Shared Memory." Thesis, Linköpings universitet, Programvara och system, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-104313.
REPLICA is a family of configurable multiprocessors that realize the PRAM model by means of an emulated shared memory. The purpose of this thesis is to evaluate how REPLICA compares with other existing architectures by benchmarking various computational problems on REPLICA, on similar parallel architectures (SB-PRAM and XMT), and on less similar ones (Xeon X5660 and Tesla M2050). The evaluation covers both performance and how easy the architecture is to program efficiently, and also tries to determine whether REPLICA is particularly well suited to certain types of computational problems. By using well-known Berkeley dwarfs applications and unbiased input data from, among others, The University of Florida Sparse Matrix Collection and the Rodinia benchmark suite, we ensure that relevant computational problems are executed and measured. We show that today's parallel architectures have performance problems for applications with irregular memory access patterns, to which the REPLICA architecture may be a solution. For example, REPLICA only needs to be clocked at a few MHz to match both the Xeon X5660 and the Tesla M2050 on the breadth-first search algorithm, which suffers from precisely this kind of irregular memory access. By comparing the efficiency of REPLICA with that of a CPU (Xeon X5660), we show that it is easier to program REPLICA efficiently than today's multiprocessors.
Gallet, Camille. "Étude de transformations et d’optimisations de code parallèle statique ou dynamique pour architecture "many-core"." Electronic Thesis or Diss., Paris 6, 2016. http://www.theses.fr/2016PA066747.
Savas, Süleyman. "Utilizing Heterogeneity in Manycore Architectures for Streaming Applications." Licentiate thesis, Högskolan i Halmstad, Centrum för forskning om inbyggda system (CERES), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-33792.
HiPEC (High Performance Embedded Computing)
NGES (Towards Next Generation Embedded Systems: Utilizing Parallelism and Reconfigurability)
Farjallah, Asma. "Etude de l'adéquation des machines Exascale pour les algorithmes implémentant la méthode du Reverse Time Migation." Thesis, Versailles-St Quentin en Yvelines, 2014. http://www.theses.fr/2014VERS0050/document.
As we are expecting Exascale systems for the 2018-2020 time frame, performance analysis and characterization of applications for new processor architectures and large-scale systems are important tasks that make it possible to anticipate the changes required to efficiently exploit future HPC systems. This thesis focuses on seismic imaging applications used for modeling complex physical phenomena, in particular the depth imaging application called Reverse Time Migration (RTM). My first contribution consists in characterizing and modeling the performance of the computational core of RTM, which is based on finite-difference time-domain (FDTD) computations. I identify and explore the major tuning parameters influencing performance and the interaction between the architecture and the application. The second contribution is an analysis to identify the challenges for a hybrid and heterogeneous implementation of FDTD for manycore architectures. We target Intel's first Xeon Phi co-processor, the Knights Corner. This architecture is an interesting proxy for our study since it contains some of the expected features of an Exascale system: concurrency and heterogeneity. My third contribution is an extension of the performance analysis and modeling to the full RTM. This adds communications and I/O to the computation part. RTM is a data-intensive application and requires the storage of intermediate values of the computational field, resulting in expensive I/O accesses. My fourth contribution is the final measurement and model validation of my hybrid RTM implementation on a large system. This has been done on Stampede, a machine of the Texas Advanced Computing Center (TACC), which allows us to test the scalability up to 64 nodes, each containing one 61-core Xeon Phi and two 8-core CPUs, for a total close to 5000 heterogeneous cores.
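To make the finite-difference time-domain kernel mentioned in this abstract concrete, here is a minimal sketch of one leapfrog time step for the 1D wave equation. It is an illustration only and not the thesis's RTM code: the function name `fdtd_step`, the fixed zero boundaries, and the second-order stencil are all assumptions chosen for brevity.

```python
def fdtd_step(u_prev, u_curr, courant2):
    """Advance a 1D wave field by one time step (illustrative sketch).

    u_prev, u_curr: field values at the two previous time levels.
    courant2: (c * dt / dx) ** 2, the squared Courant number.
    """
    n = len(u_curr)
    u_next = [0.0] * n  # boundary points stay at zero (assumed fixed ends)
    for i in range(1, n - 1):
        # Second-order centered approximation of the spatial second derivative.
        laplacian = u_curr[i - 1] - 2.0 * u_curr[i] + u_curr[i + 1]
        # Leapfrog update: each point reads only itself and its two neighbors.
        u_next[i] = 2.0 * u_curr[i] - u_prev[i] + courant2 * laplacian
    return u_next


pulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
print(fdtd_step(pulse, pulse, 0.25))
```

Each output point depends on a small neighborhood of the previous field, so the kernel does little arithmetic per byte loaded. That is one reason memory bandwidth and tuning parameters such as blocking tend to dominate its performance.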
Amstel, Duco van. "Optimisation de la localité des données sur architectures manycœurs." Thesis, Université Grenoble Alpes (ComUE), 2016. http://www.theses.fr/2016GREAM019/document.
The continuous evolution of computer architectures has been an important driver of research in code optimization and compiler technologies. A trend in this evolution that can be traced back over decades is the growing ratio between the available computational power (IPS, FLOPS, ...) and the corresponding bandwidth between the various levels of the memory hierarchy (registers, cache, DRAM). As a result, reducing the amount of memory communication that a given code requires has been an important topic in compiler research. A basic principle for such optimizations is the improvement of temporal data locality: grouping all references to a single data point as close together as possible so that it is only required for a short duration and can be quickly moved to distant memory (DRAM) without any further memory communications.
Yet another architectural evolution has been the advent of the multicore era and, in the most recent years, the first generation of manycore designs. These architectures have considerably raised the bar for the amount of parallelism that is available to programs and algorithms, but this is again limited by the available bandwidth for communications between the cores. This brings some issues that previously were the sole preoccupation of distributed computing to the world of compiling and code optimization techniques.
In this document we present a first dive into a new optimization technique which promises to offer both a high-level model for data reuse and a large field of potential applications, a technique which we refer to as generalized tiling. It finds its source in the already well-known loop tiling technique, which has been applied with success to improve data locality for both register and cache memory in the case of nested loops. This new "flavor" of tiling has a much broader perspective and is not limited to the case of nested loops. It is built on a new representation, the memory-use graph, which is tightly linked to a new model for both memory usage and communication requirements and which can be used for all forms of iterative code.
Generalized tiling expresses data locality as an optimization problem for which multiple solutions are proposed. With the abstraction introduced by the memory-use graph it is possible to solve this optimization problem in different environments. For experimental evaluations we show how this new technique can be applied in the context of loops, nested or not, as well as for computer programs expressed in a dataflow language. Anticipating the use of generalized tiling to distribute computations over the cores of a manycore architecture, we also provide some insight into the methods that can be used to model communications and their characteristics on such architectures.
As a final point, and in order to show the full expressiveness of the memory-use graph and, even more, the underlying memory usage and communication model, we turn towards the topic of performance debugging and the analysis of execution traces. Our goal is to provide feedback on the evaluated code and its potential for further improvement of data locality. Such traces may contain information about memory communications during an execution and show strong similarities with the previously studied optimization problem. This brings us to a short introduction to the algorithmics of directed graphs and the formulation of some new heuristics for the well-studied topic of reachability and the much less known problem of convex partitioning.
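As a point of reference for the classical loop tiling that this abstract takes as its starting point, the following sketch tiles a matrix multiplication. It illustrates ordinary loop tiling only, not the generalized tiling or the memory-use graph of the thesis; the function names and the default tile size are assumptions made for the example.

```python
def matmul_naive(a, b, n):
    """Reference triple loop: c = a * b for n x n lists of lists."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile=4):
    """Same computation, iterated tile by tile to improve temporal locality."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):          # tile origin along rows of c
        for jj in range(0, n, tile):      # tile origin along columns of c
            for kk in range(0, n, tile):  # tile origin along the reduction axis
                # Work inside one tile touches only small blocks of a, b and c,
                # so those blocks can stay in fast memory while they are reused.
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        acc = c[i][j]
                        for k in range(kk, min(kk + tile, n)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c
```

The tiled version performs the same additions in the same order for each output element, so it returns the same result as the naive loop; only the traversal order across elements changes. In pure Python the locality benefit is not measurable, so the sketch shows the loop structure rather than a speedup.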
Petit, Eric. "Vers un partitionnement automatique d'applications en codelets spéculatifs pour les systèmes hétérogènes à mémoires distribuées." Phd thesis, Université Rennes 1, 2009. http://tel.archives-ouvertes.fr/tel-00445512.
Giroudot, Frédéric. "NoC-based Architectures for Real-Time Applications : Performance Analysis and Design Space Exploration." Thesis, Toulouse, INPT, 2019. https://oatao.univ-toulouse.fr/25921/1/Giroudot_Frederic.pdf.
Monoprocessor architectures have reached their limits with regard to the computing power they offer versus the needs of modern systems. Although multicore architectures partially mitigate this limitation and are commonly used nowadays, they usually rely on intrinsically non-scalable buses to interconnect the cores. The manycore paradigm was proposed to tackle the scalability issue of bus-based multicore processors. It can scale up to hundreds of processing elements (PEs) on a single chip by organizing them into computing tiles (holding one or several PEs). Intercore communication is usually done using a Network-on-Chip (NoC) that consists of interconnected on-chip routers allowing communication between tiles. However, manycore architectures raise numerous challenges, particularly for real-time applications. First, NoC-based communication tends to generate complex blocking patterns when congestion occurs, which complicates the analysis, since computing accurate worst-case delays becomes difficult. Second, running many applications on large Systems-on-Chip such as manycore architectures makes system design particularly crucial and complex. On one hand, it complicates Design Space Exploration, as it multiplies the implementation alternatives that will guarantee the desired functionalities. On the other hand, once a hardware architecture is chosen, mapping the tasks of all applications on the platform is a hard problem, and finding an optimal solution in a reasonable amount of time is not always possible. Therefore, our first contributions address the need for computing tight worst-case delay bounds in wormhole NoCs. We first propose a buffer-aware worst-case timing analysis (BATA) to derive upper bounds on the worst-case end-to-end delays of constant-bit-rate data flows transmitted over a NoC on a manycore architecture. We then extend BATA to cover a wider range of traffic types, including bursty traffic flows, and heterogeneous architectures. The introduced method is called G-BATA, for Graph-based BATA. In addition to covering a wider range of assumptions, G-BATA improves the computation time and thus increases the scalability of the method. In a second part, we develop a method addressing design and mapping for applications with real-time constraints on manycore platforms. It combines model-based engineering tools (TTool) and simulation with our analytical verification technique (G-BATA) and tools (WoPANets) to provide an efficient design space exploration framework. Finally, we validate our contributions on (a) a series of experiments on a physical platform and (b) two case studies taken from the real world: an autonomous vehicle control application and a 5G signal decoder application.
Xypolitidis, Benard, and Rudin Shabani. "Architectural Design Space Exploration of Heterogeneous Manycores." Thesis, Högskolan i Halmstad, Akademin för informationsteknologi, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-29528.
Prasad, Rohit. "Integrated Programmable-Array accelerator to design heterogeneous ultra-low power manycore architectures." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2022. http://amsdottorato.unibo.it/9983/1/PhD_thesis__20_January_2022_.pdf.
Rakotoarivelo, Hoby. "Contributions au co-design de noyaux irréguliers sur architectures manycore : cas du remaillage anisotrope multi-échelle en mécanique des fluides numérique." Thesis, Université Paris-Saclay (ComUE), 2018. http://www.theses.fr/2018SACLE012/document.
Numerical simulations of complex flows such as turbulence or shockwave propagation often require a huge computational time to achieve an industrial level of accuracy. To speed up these simulations, two approaches may be combined: mesh adaptation, to reduce the number of required points, and parallel processing, to absorb the computational workload. However, efficiently porting adaptive kernels to massively parallel architectures is far from trivial. Indeed, each task related to a local vicinity needs to be propagated, and it may in turn induce new conflicting tasks. Furthermore, these tasks are characterized by a low arithmetic intensity and a low reuse rate of already cached data. Besides, new kinds of accelerators have arisen in the high performance computing landscape, bringing a number of algorithmic constraints. In a context of reducing electrical power consumption, they are characterized by numerous underclocked cores and a deep memory hierarchy involving expensive asymmetric memory accesses. Therefore, kernels must expose a high degree of concurrency and a high reuse rate of cached data to maintain optimal core efficiency. The real issue is how to structure these data-driven and data-intensive kernels to match these constraints.
In this work, we provide an approach which reconciles locality constraints with convergence in terms of mesh error and quality. More than a parallelization, it relies on a redesign of kernels guided by hardware constraints while preserving accuracy. In fact, we devise a set of locality-aware kernels for anisotropic adaptation of triangulated differential manifolds, as well as a lock-free and massively multithreaded parallelization of irregular kernels. Although complementary, those axes come from distinct research themes mixing computer science and applied mathematics. Here, we aim to show that our devised schemes are as efficient as the state of the art for both axes.
Chang, Tao. "Evaluation of programming models for manycore and / or heterogeneous architectures for Monte Carlo neutron transport codes." Thesis, Institut polytechnique de Paris, 2020. http://www.theses.fr/2020IPPAX099.
In this thesis we propose to evaluate the different programming models available for addressing manycore and / or heterogeneous architectures within the framework of Monte Carlo transport codes. A simple but representative application test case will be considered in order to cover a fairly wide range of solutions and compare them in terms of performance, performance portability, ease of implementation and maintainability. The target architectures are 'classic' CPUs, Intel Xeon Phi and GPUs. The most relevant programming models will then be set up in a Monte Carlo transport code.
Nguyen, Tri Minh. "Exploring Data Compression and Random-access Reduction to Mitigate the Bandwidth Wall for Manycore Architectures." Thesis, Princeton University, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10936059.
The performance gap between computer processors and memory bandwidth is severely limiting the throughput of modern and future multi-core and manycore architectures. To handle this growing gap, commercial processors such as the Intel Xeon Phi and NVIDIA or AMD GPUs have needed to use expensive memory solutions like high-bandwidth memory (HBM) and 3D-stacked memory to satisfy the bandwidth demand of the growing core-count over each product generation. Without a scalable solution for the memory bandwidth issue, throughput-oriented computation cannot be improved. This problem is widely known as the bandwidth-wall.
Data compression and random-access reduction are promising approaches to increase bandwidth without raising costs. This thesis makes three specific contributions to the state-of-the-art. First, to reduce cache misses, we propose an on-chip cache compression method that drastically increases compression performance and cache hit rate over prior work. Second, to improve direct compression of off-chip bandwidth and make it more scalable, we propose a novel link compression framework that exploits the on-chip caches themselves as a massive and scalable compression dictionary. Last, to overcome poor random-access performance of nonvolatile memory (NVM) and make it more attractive as a DRAM replacement with crash consistency, we propose a multi-undo logging scheme that seamlessly logs memory writes sequentially to maximize NVM I/O operations per second (IOPS).
As a common principle, this thesis seeks to overcome the bandwidth wall for manycore architectures not through expensive memory technologies but by assessing and exploiting workload behavior, and not through burdening programmers with specialized semantics but by implementing software-transparent architectural improvements.
Dao, Van Toan. "Calcul à haute performance et simulations stochastiques : Etude de la reproductibiité numérique sur architectures multicore et manycore." Thesis, Université Clermont Auvergne (2017-2020), 2017. http://www.theses.fr/2017CLFAC005/document.
The reproducibility of numerical experiments on high performance computing systems is sometimes overlooked. Moreover, the numerical methods used for rigorous parallelization of stochastic simulations are often unknown. Indeed, the results obtained for a stochastic simulation using high performance computing systems can differ from run to run with the same parameters and the same execution contexts, due to the impact of new architectures, accelerators, compilers, operating systems or a change in the order of execution of floating-point arithmetic operations within the microprocessors introduced by parallelizing optimizations. In the case of non-repeatability of numerical experiments, how can we seriously develop a scientific application? What credit can be given to the parallel software thus developed? In this thesis, we synthesize the main causes of non-reproducibility for a parallel stochastic simulation using high performance computing systems. Unlike most work on parallelism, we do not focus on improving performance, but on obtaining numerically repeatable results from one experiment to another. We present reproducibility and its contributions to the science of experimental and numerical computing. Furthermore, we propose some contributions, in particular: to verify the reproducibility and portability of top modern pseudo-random number generators, to detect the correlation between parallel streams issued from such generators, to repeat and reproduce the numerical results of independent parallel stochastic simulations
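One cause named in this abstract, a change in the order of floating-point operations, can be illustrated with a short sketch. It is not taken from the thesis: the helper name `sum_in_order` and the sample values are assumptions chosen for illustration, and the order-independence of `math.fsum` is assumed for typical IEEE 754 platforms.

```python
import math

def sum_in_order(values, order):
    """Add values left to right in the given index order (hypothetical helper)."""
    total = 0.0
    for i in order:
        total += values[i]
    return total

values = [0.1, 0.2, 0.3]

# Two schedules of the same reduction, as two parallel runs might produce.
forward = sum_in_order(values, [0, 1, 2])   # (0.1 + 0.2) + 0.3
backward = sum_in_order(values, [2, 1, 0])  # (0.3 + 0.2) + 0.1

# Floating-point addition is not associative, so the two results differ.
differs = forward != backward

# math.fsum tracks exact partial sums, so its result does not depend on order.
stable = math.fsum(values) == math.fsum(reversed(values))
```

A reduction whose result does not depend on evaluation order is one way to obtain repeatable output; whether the thesis uses this particular technique is not stated in the abstract.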
Rajamanikkam, Chidhambaranathan. "Understanding Security Threats of Emerging Computing Architectures and Mitigating Performance Bottlenecks of On-Chip Interconnects in Manycore NTC System." DigitalCommons@USU, 2019. https://digitalcommons.usu.edu/etd/7453.
Stan, Oana. "Placement of tasks under uncertainty on massively multicore architectures." Thesis, Compiègne, 2013. http://www.theses.fr/2013COMP2116/document.
This PhD thesis is devoted to the study of combinatorial optimization problems related to massively parallel embedded architectures when taking into account uncertain data (e.g. execution time). Our focus is on chance constrained programs with the objective of finding the best solution which is feasible with a preset probability guarantee. A qualitative analysis of the uncertain data we have to treat (dependent random variables, multimodal, multidimensional, difficult to characterize through classical distributions) has led us to design a non-parametric method, the so-called "robust binomial approach", valid whatever the joint distribution and which is based on robust optimization and statistical hypothesis testing. We also propose a methodology for adapting approximate algorithms for solving stochastic problems by integrating the robust binomial approach when verifying solution feasibility. The practical relevance of our approach is validated through two problems arising in the compilation of dataflow applications for manycore platforms. The first problem treats the stochastic partitioning of networks of processes on a fixed set of nodes, by taking into account the load of each node and the uncertainty affecting the weight of the processes. For finding stochastic solutions, a semi-greedy iterative algorithm has been proposed which allowed measuring the robustness and cost of the solutions with regard to those for the deterministic version of the problem. The second problem consists in studying the global placement and routing of dataflow applications on a clusterized architecture. The purpose being to place the processes on clusters such that a feasible routing exists, a GRASP heuristic has been conceived first for the deterministic case and afterwards extended for the chance constrained variant of the problem
Dahmani, Safae. "Modèles et protocoles de cohérence de données, décision et optimisation à la compilation pour des architectures massivement parallèles." Thesis, Lorient, 2015. http://www.theses.fr/2015LORIS384/document.
Manycore architectures consist of hundreds to thousands of embedded cores, distributed memories and a dedicated network on a single chip. In this context, and because of the scale of the processor, providing a shared memory system has to rely on efficient hardware and software mechanisms and data consistency protocols. Numerous works have explored consistency mechanisms designed for highly parallel architectures. They lead to the conclusion that no single protocol fits all applications and hardware contexts. In order to deal with consistency issues for this kind of architecture, we propose in this work a multi-protocol compilation toolchain, in which shared data of the application can be managed by different protocols. Protocols are chosen and configured at compile time, following the application behaviour and the targeted architecture specifications. The application behaviour is characterized with a static analysis process that helps to guide the assignment of protocols to each data access. The platform offers a protocol library where each protocol is characterized by one or more parameters. The range of possible values of each parameter depends on some constraints mainly related to the targeted platform. The protocol configuration relies on a genetic-based engine that instantiates each protocol with appropriate parameter values according to multiple performance objectives. In order to evaluate the quality of each proposed solution, we use different evaluation models. We first use a traffic analytical model which gives some NoC communication statistics but no timing information. Therefore, we propose two cycle-based evaluation models that provide more accurate performance metrics while taking into account the contention effect due to the consistency protocol communications. We also propose a cooperative cache consistency protocol improving the cache miss rate by sliding data to less stressed neighbours.
An extension of this protocol is proposed in order to dynamically define the sliding radius assigned to each data migration. This extension is based on the mass-spring physical model. The experimental validation of the different contributions compares the sliding-based protocols with a four-state directory-based protocol
Berger, Karl-Eduard. "Placement de graphes de tâches de grande taille sur architectures massivement multicoeurs." Thesis, Université Paris-Saclay (ComUE), 2015. http://www.theses.fr/2015SACLV026/document.
This Ph.D. thesis is devoted to the study of the mapping problem related to massively parallel embedded architectures. This problem arises from industrial needs such as energy savings and performance demands for synchronous dataflow applications. This problem has to be solved considering three criteria: heuristics should be able to deal with applications of various sizes, they must meet the capacity constraints of the processors, and they have to take into account the target architecture topologies. In this thesis, tasks are organized in communication networks, modeled as graphs. In order to evaluate the efficiency of the developed heuristics, the mappings obtained by the heuristics are compared to a random mapping. This comparison is used as an evaluation metric throughout this thesis. The existence of this metric is motivated by the fact that no comparative heuristics can be found in the literature at the time of writing of this thesis. In order to address this problem, two heuristics are proposed. They are able to solve a dataflow process network mapping problem, where a network of communicating tasks is placed onto a set of processors with limited resource capacities, while minimizing the overall communication bandwidth between processors. They are applied to task graphs in which all task and edge weights are set to one. The first heuristic, denoted as Task-wise Placement, places tasks one after another using a notion of task affinities. The second algorithm, named Subgraph-wise Placement, gathers tasks in small groups and then places the different groups on processors using a notion of affinities between groups and processors. These algorithms are tested on task graphs with grid or logic-gate network topologies. The results obtained are then compared to an algorithm present in the literature. This algorithm maps task graphs of moderate size on massively parallel architectures.
In addition, the random-based mapping metric is used in order to evaluate the results of both heuristics. Then, to address problems that can be found in industrial cases, the application cases are widened to task graphs with task and edge weights similar to those found in industry. A progressive construction heuristic named Regret Based Approach, based on game theory, is proposed. This heuristic maps tasks one after another. The costs of mapping tasks according to already mapped tasks are computed. The process of task selection is based on a notion of regret, present in game theory. The task with the highest regret for not being placed is identified and placed first. In order to check the strength of the algorithm, many types of task graphs (grids, logic-gate networks, series-parallel, random, sparse matrices) of various sizes are generated. Task and edge weights are randomly drawn from a bimodal distribution parameterized to yield values similar to those of industrial applications. The results obtained are compared to the Task-wise Placement, specially adapted for non-unitary values. Moreover, results are evaluated using the metric defined above
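The regret-driven selection described in this abstract can be sketched as follows. This is a generic regret heuristic written under stated assumptions, not the thesis's Regret Based Approach itself: the function name `regret_placement`, the cost definition (communication weight to already placed neighbours on other processors), a single uniform processor capacity, and the definition of regret as the gap between the two cheapest feasible processors are all illustrative choices.

```python
def regret_placement(task_weight, edges, capacity, n_procs):
    """Greedy regret-based mapping sketch (all names are hypothetical).

    task_weight: dict task -> resource demand
    edges: dict (task_a, task_b) -> communication weight
    capacity: resource capacity shared by every processor
    n_procs: number of processors
    Returns dict task -> processor index.
    """
    # Build a symmetric adjacency map from the edge list.
    neighbours = {t: {} for t in task_weight}
    for (a, b), w in edges.items():
        neighbours[a][b] = w
        neighbours[b][a] = w

    load = [0] * n_procs
    placement = {}

    def cost(task, proc):
        # Communication paid to already placed neighbours on other processors.
        return sum(w for n, w in neighbours[task].items()
                   if n in placement and placement[n] != proc)

    unplaced = set(task_weight)
    while unplaced:
        best = None  # (regret, task, processor)
        for task in sorted(unplaced):
            feasible = [p for p in range(n_procs)
                        if load[p] + task_weight[task] <= capacity]
            if not feasible:
                raise ValueError(f"no processor can host task {task!r}")
            costs = sorted((cost(task, p), p) for p in feasible)
            # Regret: extra cost of falling back to the second-best processor.
            # A task with a single feasible processor gets infinite regret.
            regret = costs[1][0] - costs[0][0] if len(costs) > 1 else float("inf")
            if best is None or regret > best[0]:
                best = (regret, task, costs[0][1])
        _, task, proc = best
        placement[task] = proc
        load[proc] += task_weight[task]
        unplaced.remove(task)
    return placement


# Two tightly coupled pairs joined by one light edge, two processors of size 2.
weights = {"a": 1, "b": 1, "c": 1, "d": 1}
links = {("a", "b"): 10, ("c", "d"): 10, ("b", "c"): 1}
placement = regret_placement(weights, links, capacity=2, n_procs=2)
```

In this small example each heavily communicating pair ends up on one processor, and only the light edge crosses between processors.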
Chi-Neng, Wen, and 文啟能. "A Unified and Scalable Manycore Testing/Debugging/Tracing Architecture." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/09623940949115010219.
National Chung Cheng University
Graduate Institute of Computer Science and Information Engineering
99
Traditional debug methodologies are limited in their ability to provide debugging support for multicore parallel programming. Synchronization problems or bugs due to race conditions are particularly difficult to detect with software debugging tools. Most traditional debugging approaches rely on globally synchronized signals, but these pose problems in terms of scalability. Moreover, another urgent issue when developing parallel programs on multicore systems is reproducing the same faulted circumstance after faults occur. Tracing is a popular solution in current multicore systems; however, it is limited by the on-chip storage and also by the tracing bandwidth. As a result, an intelligent recording scheme followed by a replaying system is key to future many-core debugging problems. The first contribution of this work is to propose a novel non-uniform debug architecture (NUDA) based on a ring interconnection schema and non-uniform memory. Our approach makes debugging both feasible and scalable for many-core processing scenarios. The key idea is to distribute the debugging support structures across a set of hierarchical clusters while avoiding address overlap. This allows the address space to be monitored using non-uniform protocols. Our second contribution is a non-intrusive approach to race detection supported by the NUDA. A non-uniform page-based monitoring cache in each NUDA node is used to watch the access footprints. The union of all the caches can serve as a race detection probe. The third contribution is a non-intrusive run-time assertion mechanism (RunAssert) for parallel program development based on NUDA. Our approaches are as follows: (a) a current language extension for parallel program debugging, (b) corresponding non-intrusive hardware configuration logic and checking methodologies, and (c) several real-world cases using the extensions mentioned above.
In general, the target program can be executed at its original speed without altering the parallel sequences, thereby eliminating the possibility of the Heisenberg effect. Furthermore, this work reuses the test access mechanism (TAM) for NUDA to contribute a unified manycore testing/debugging/tracing architecture, called TAM-NUDA (Test Access Mechanism based Non-Uniform Debugging Architecture). The TAM bus is switched between testing during test mode and debugging during normal mode. Finally, this work also proposes a new post-processing solution between the record and replay phases to significantly reduce the trace overhead, while enabling faster replaying. The key point is that we perform dependence analysis and minimize the event trace at static time. Then a deterministic replayer, called Dini, can be directed by the new trace at a faster speed. Three trace analysis strategies are also considered in this paper so that a faster parallel replayer on a multicore host can ensure the correct sequence of the faulted execution. Our experimental results demonstrate that the proposed trace reduction technologies reduce the trace size by 5X to 12.5X while still maintaining a fast replay speed. Using the proposed approaches, we show that parallel race bugs can be precisely captured, and that most false-positive alerts can be efficiently eliminated at an average slowdown cost of only 1.4% to 3.6%. The net hardware cost is relatively low, so that the NUDA can readily scale to increasingly complex many-core systems. Our experimental results also demonstrate that the proposed TAM-NUDA method sacrifices only 5.9% of testing time on average for reuse in debug/trace modes. Moreover, this work also realizes state-of-the-art race detectors and trace monitors at a total area overhead of only 1.49%.
"Application-aware Performance Optimization for Software Managed Manycore Architectures." Doctoral diss., 2019. http://hdl.handle.net/2286/R.I.53524.
Dissertation/Thesis
Doctoral Dissertation Computer Science 2019
"Enabling Multi-threaded Applications on Hybrid Shared Memory Manycore Architectures." Master's thesis, 2014. http://hdl.handle.net/2286/R.I.27434.
Dissertation/Thesis
Masters Thesis Computer Science 2014
Baral, Pushpa. "Static Analysis and Dynamic Monitoring of Program Flow on REDEFINE Manycore Processor." Thesis, 2021. https://etd.iisc.ac.in/handle/2005/5577.
"Scratchpad Management in Software Managed Manycore Architectures." Doctoral diss., 2017. http://hdl.handle.net/2286/R.I.46214.
Dissertation/Thesis
Doctoral Dissertation Computer Science 2017
"Optimizing Heap Data Management on Software Managed Manycore Architectures." Master's thesis, 2017. http://hdl.handle.net/2286/R.I.45507.
Dissertation/Thesis
Masters Thesis Computer Science 2017
Yadav, Satyendra Singh. "Development of Wireless Communication Algorithms on Multicore/Manycore Architectures." Thesis, 2018. http://ethesis.nitrkl.ac.in/9426/1/2018_PhD_SSYadav_512EC1014_Development.pdf.
Κολώνιας, Βασίλειος. "Παράλληλοι αλγόριθμοι και εφαρμογές σε πολυπύρηνες μονάδες επεξεργασίας γραφικών." Thesis, 2014. http://hdl.handle.net/10889/8315.
In this thesis, parallel algorithms and applications in manycore graphics processing units are presented. More specifically, we examine methods of designing a parallel algorithm for solving both simple and common problems such as sorting, and computationally demanding problems, so as to fully exploit the enormous computing power of modern graphics processing units (GPUs). The first problem considered is sorting, which is one of the most common problems in computer science. It exists as an internal problem in many applications. Therefore, faster sorting results in better performance in general. Chapter 3 describes all design options for the implementation of a sorting algorithm for integers, count sort, on a graphics processing unit. The elimination of thread synchronization in the last step of the algorithm had a significant effect on the performance. Chapter 4 addresses the examination timetabling problem for universities, which is a combinatorial optimization problem. A hybrid evolutionary algorithm, which runs entirely on the GPU, was used to solve the problem. The tremendous computing power of the GPU and parallel programming enable the use of large populations in order to explore the solution space better and obtain better quality results. In the next chapter, the problem of motion planning for underwater vehicle manipulator systems is examined. In the gross motion planning problem, it is important to achieve a good solution with high accuracy. The parallel algorithm used for the representation of the working environment in a Bump-surface is a step in this direction. In the local motion planning problem, which is a real-time problem, the time needed to find the next configuration of the vehicle is crucial. Parallel programming and the GPU greatly assist in this online problem. The last application considered is the atomistic Monte Carlo simulation of semifluorinated alkanes.
The parallelization of the most time-consuming part of the algorithm enabled the study of a much larger system in an acceptable execution time. In general, it becomes obvious that parallel programming and novel manycore architectures, such as graphics processing units, give new capabilities for solving everyday problems, real-time problems and combinatorial optimization problems.
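For readers unfamiliar with count sort, the integer sorting algorithm discussed in Chapter 3 of the thesis above, a sequential reference version is sketched here. It is not the GPU implementation from the thesis; the function name `count_sort` and the `max_value` parameter are assumptions, and the three phases (histogram, prefix sum, scatter) are simply the usual structure of the algorithm.

```python
def count_sort(values, max_value):
    """Sort non-negative integers in [0, max_value] by counting occurrences."""
    # Histogram step: count how often each key appears.
    counts = [0] * (max_value + 1)
    for v in values:
        counts[v] += 1

    # Exclusive prefix sum: starts[k] is the output index of the first key k.
    starts = [0] * (max_value + 1)
    running = 0
    for key, c in enumerate(counts):
        starts[key] = running
        running += c

    # Scatter step: write each value to the next free slot for its key.
    out = [0] * len(values)
    for v in values:
        out[starts[v]] = v
        starts[v] += 1
    return out


sorted_values = count_sort([3, 1, 2, 3, 0], max_value=3)
```

The running time is linear in the number of values plus the key range, which makes the algorithm attractive when the keys are small integers.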