Dissertations / Theses on the topic 'Memory hierarchy'

Consult the top 50 dissertations / theses for your research on the topic 'Memory hierarchy.'

1. Gan, Yee Ling. "Redesigning the memory hierarchy for memory-safe programming languages." Thesis, Massachusetts Institute of Technology, 2018. http://hdl.handle.net/1721.1/119765.

Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 69-75).
We present Hotpads, a new memory hierarchy designed from the ground up for modern, memory-safe languages like Java, Go, and Rust. Memory-safe languages hide the memory layout from the programmer. This prevents memory corruption bugs, improves programmability, and enables automatic memory management. Hotpads extends the same insight to the memory hierarchy: it hides the memory layout from software and enables hardware to take control over it, dispensing with the conventional flat address space abstraction. This avoids the need for associative caches and virtual memory. Instead, Hotpads moves objects across a hierarchy of directly-addressed memories. It rewrites pointers to avoid most associative lookups, provides hardware support for memory allocation, and unifies hierarchical garbage collection and data placement. As a result, Hotpads improves memory performance and efficiency substantially, and unlocks many new optimizations. This thesis contributes important optimizations for Hotpads and a comprehensive evaluation of Hotpads against prior work.
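The core mechanism is easiest to see in software terms. The sketch below is a loose C++ analogy, not the hardware design: it mimics what Hotpads does in hardware when it copies an object into a faster directly-addressed memory and rewrites the pointer to it, so later dereferences go straight to the new location with no associative lookup. All names here are illustrative.

```cpp
#include <cstdint>
#include <cstring>

// A small directly-addressed "pad" standing in for a fast on-chip memory.
alignas(64) static std::uint8_t pad[4096];
static std::size_t pad_top = 0;

struct Obj { int fields[8]; };

// Move an object into the pad and rewrite the caller's pointer, the way
// Hotpads rewrites pointers when it moves objects up the hierarchy.
// After this call, *ref dereferences the pad directly.
void promote(Obj*& ref) {
    Obj* moved = reinterpret_cast<Obj*>(pad + pad_top);
    pad_top += sizeof(Obj);
    std::memcpy(moved, ref, sizeof(Obj));
    ref = moved;                    // pointer rewritten: no lookup needed
}

int main() {
    Obj heap_obj{};
    Obj* p = &heap_obj;
    promote(p);                     // p now points into the fast pad
    p->fields[0] = 42;              // direct access, no associative search
}
```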

2. Dublish, Saumay Kumar. "Managing the memory hierarchy in GPUs." Thesis, University of Edinburgh, 2018. http://hdl.handle.net/1842/31205.

Abstract:
Pervasive use of GPUs across multiple disciplines is a result of continuous adaptation of the GPU architectures to address the needs of upcoming application domains. One such vital improvement is the introduction of the on-chip cache hierarchy, used primarily to filter the high bandwidth demand to the off-chip memory. However, in contrast to traditional CPUs, the cache hierarchy in GPUs is presented with significantly different challenges such as cache thrashing and bandwidth bottlenecks, arising due to small caches and high levels of memory traffic. These challenges lead to severe congestion across the memory hierarchy, resulting in high memory access latencies. In memory-intensive applications, such high memory access latencies often get exposed and can no longer be hidden through multithreading, and therefore adversely impact system performance. In this thesis, we address the inefficiencies across the memory hierarchy in GPUs that lead to such high levels of congestion. We identify three major factors contributing to poor memory system performance: first, disproportionate and insufficient bandwidth resources in the cache hierarchy; second, poor cache management policies; and third, high levels of multithreading. In order to revitalize the memory hierarchy by addressing the above limitations, we propose a three-pronged approach. First, we characterize the bandwidth bottlenecks present across the memory hierarchy in GPUs and identify the architectural parameters that are most critical in alleviating congestion. Subsequently, we explore the architectural design space to mitigate the bandwidth bottlenecks in a cost-effective manner. Second, we identify significant inter-core reuse in GPUs, presenting an opportunity to reuse data among the L1s. We exploit this reuse by connecting the L1 caches with a lightweight ring network to facilitate inter-core communication of shared data. We show that this technique reduces traffic to the L2 cache, freeing up the bandwidth for other accesses. Third, we present Poise, a machine learning approach to mitigate cache thrashing and bandwidth bottlenecks by altering the levels of multi-threading. Poise comprises a supervised learning model that is trained offline on a set of profiled kernels to make good warp scheduling decisions. Subsequently, a hardware inference engine is used to predict good warp scheduling decisions at runtime using the model learned during training. In summary, we address the problem of bandwidth bottlenecks across the memory hierarchy in GPUs by exploring how to best scale, supplement and utilize the existing bandwidth resources. These techniques provide an effective and comprehensive methodology to mitigate the bandwidth bottlenecks in the GPU memory hierarchy.

3. Taylor, Nathan Bryan. "Cachekata: memory hierarchy optimization via dynamic binary translation." Thesis, University of British Columbia, 2013. http://hdl.handle.net/2429/44335.

Abstract:
As hardware parallelism continues to increase, CPU caches can no longer be considered a transparent, hardware-level performance optimization. Adverse cache impact on performance is entirely workload-dependent and may depend on runtime factors. The operating system must begin to treat CPU caches like any other shared hardware resource to effectively support workloads on parallel hardware. We present a binary translation system called Cachekata that provides a byte-granular memory remapping facility within the OS in an efficient manner. Cachekata is incorporated into a larger system, Plastic, which diagnoses and corrects instances of false sharing occurring within running applications. Our implementation is able to achieve a 3-6x performance improvement on known examples of false sharing in parallel benchmarks.
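For readers unfamiliar with the phenomenon Plastic targets, the following minimal C++ sketch (illustrative only, not code from the thesis) shows a classic false-sharing pattern and the padding fix that byte-granular remapping automates transparently at runtime:

```cpp
#include <atomic>
#include <thread>

struct Shared {                      // both counters share one 64-byte line
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

struct Padded {                      // one counter per line: no false sharing
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename T>
void run(T& s) {
    std::thread t1([&] { for (int i = 0; i < 10000000; ++i) s.a++; });
    std::thread t2([&] { for (int i = 0; i < 10000000; ++i) s.b++; });
    t1.join();
    t2.join();
}

int main() {
    Shared s;                        // threads ping-pong the line between caches
    Padded p;
    run(s);
    run(p);                          // typically several times faster
}
```

On multicore hardware the padded variant typically runs several times faster, the same order of improvement (3-6x) that the thesis reports for its transparent fix.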

4. Kim, Jinwoo. "Memory hierarchy management through off-line computational learning." Diss., Georgia Institute of Technology, 2003. http://hdl.handle.net/1853/8194.

5. Greenwald, Benjamin Eliot. "A technique for compilation to exposed memory hierarchy." Thesis, Massachusetts Institute of Technology, 1999. http://hdl.handle.net/1721.1/87163.

Abstract:
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, June 2000.
"September 1999."
Includes bibliographical references (p. 55-56).

6. Dimić, Vladimir. "Runtime-assisted optimizations in the on-chip memory hierarchy." Doctoral thesis, Universitat Politècnica de Catalunya, 2020. http://hdl.handle.net/10803/670363.

Abstract:
Following Moore's Law, the number of transistors on chip has been increasing exponentially, which has led to increasingly complex modern processors. As a result, efficient programming of such systems has become more difficult. Many programming models have been developed to address this issue. Of particular interest are task-based programming models that employ simple annotations to define parallel work in an application. The information available at the level of the runtime systems associated with these programming models offers great potential for improving hardware design. Moreover, due to technological limitations, Moore's Law is predicted to eventually come to an end, so novel paradigms are necessary to maintain the current performance improvement trends. The main goal of this thesis is to exploit the knowledge about a parallel application available at the runtime system level to improve the design of the on-chip memory hierarchy. Coupling the runtime system and the microprocessor enables a better hardware design without hurting programmability. The first contribution is a set of insertion policies for shared last-level caches that exploit information about tasks and task data dependencies. The intuition behind this proposal is the observation that parallel threads exhibit different memory access patterns; even within the same thread, accesses to different variables often follow distinct patterns. The proposed policies insert cache lines into different logical positions depending on the dependency type and task type to which the corresponding memory request belongs. The second proposal optimizes the execution of reductions, defined as a programming pattern that combines input data to form the resulting reduction variable. This is achieved with a runtime-assisted technique for performing reductions in the processor's cache hierarchy. The goal is a universally applicable solution regardless of the reduction variable's type, size and access pattern. On the software level, the programming model is extended to let a programmer specify the reduction variables for tasks, as well as the desired cache level where a certain reduction should be performed. The source-to-source compiler and the runtime system are extended to translate and forward this information to the underlying hardware. On the hardware level, private and shared caches are equipped with functional units and the accompanying logic to perform reductions at the cache level. This design avoids unnecessary data movement to the core and back, as the data is operated on where it resides. The third contribution is a runtime-assisted prioritization scheme for memory requests inside the on-chip memory hierarchy. The proposal is based on the notion of a critical path in the context of parallel codes and the known fact that accelerating critical tasks reduces the execution time of the whole application. In this work, task criticality is tracked at the level of the task type, which enables simple annotation by the programmer. Critical tasks are accelerated by prioritizing the corresponding memory requests in the microprocessor.
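As an illustration of the reduction pattern the second contribution targets, here is a minimal sketch using standard OpenMP task reductions (compile with -fopenmp on a compiler with OpenMP 5.0 support). The thesis extends an OmpSs-like task model with an additional clause naming the cache level where the reduction is performed; that clause has no plain-OpenMP equivalent and is not shown here.

```cpp
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> data(1 << 20, 1.0);
    double sum = 0.0;

    #pragma omp parallel
    #pragma omp single
    #pragma omp taskgroup task_reduction(+: sum)
    {
        for (std::size_t i = 0; i < data.size(); i += 4096) {
            // Each task contributes a partial sum; the runtime combines
            // the per-task copies when the taskgroup completes.
            #pragma omp task in_reduction(+: sum) firstprivate(i)
            for (std::size_t j = i; j < i + 4096; ++j)
                sum += data[j];
        }
    }
    std::printf("sum = %.0f\n", sum);   // prints 1048576
}
```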

7. Low, Douglas Wai Kok. "Network processor memory hierarchy designs for IP packet classification." Thesis, UW restricted, 2005. http://hdl.handle.net/1773/6973.

8. Vitkovskiy, Arseniy. "Memory hierarchy and data communication in heterogeneous reconfigurable SoCs." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2008. http://amsdottorato.unibo.it/1127/1/Tesi_Vitkovskiy_Arseniy.pdf.

Abstract:
The miniaturization race in the hardware industry, aimed at continuously increasing transistor density on a die, no longer brings corresponding application performance improvements. One of the most promising alternatives is to exploit the heterogeneous nature of common applications in hardware. Supported by reconfigurable computation, which has already proved its efficiency in accelerating data-intensive applications, this concept promises a breakthrough in contemporary technology development. Memory organization in such heterogeneous reconfigurable architectures becomes very critical. Two primary aspects introduce a sophisticated trade-off. On the one hand, a memory subsystem should provide a well-organized distributed data structure and guarantee the required data bandwidth. On the other hand, it should hide the heterogeneous hardware structure from the end user in order to support feasible high-level programmability of the system. This thesis explores heterogeneous reconfigurable hardware architectures and presents possible solutions to cope with the problem of memory organization and data structure. Using the MORPHEUS heterogeneous platform as an example, the discussion follows the complete design cycle, from decision making and justification to hardware realization. Particular emphasis is placed on methods to support high system performance, meet application requirements, and provide a user-friendly programmer interface. As a result, the research introduces a complete heterogeneous platform enhanced with a hierarchical memory organization, which accomplishes its task by separating computation from communication, providing the reconfigurable engines with computation and configuration data, and unifying heterogeneous computational devices through local storage buffers. It is distinguished from related solutions by its distributed data-flow organization, specifically engineered mechanisms to operate on data in local domains, a particular communication infrastructure based on a Network-on-Chip, and thorough methods to prevent computation and communication stalls. In addition, a novel technique to accelerate memory access was developed and implemented.

9. Senni, Sophiane. "Exploration of non-volatile magnetic memory for processor architecture." Thesis, Montpellier, 2015. http://www.theses.fr/2015MONTS264/document.

Abstract:
With the downscaling of the complementary metal-oxide semiconductor (CMOS) technology, designing dense and energy-efficient systems-on-chip (SoC) is becoming a real challenge. Concerning density, reducing the CMOS transistor size faces up to manufacturing constraints while the cost increases exponentially. Regarding energy, a significant increase of the power density and dissipation obstructs further improvement in performance. This issue is mainly due to the growth of the leakage current of the CMOS transistors, which leads to an increase of the static energy consumption. Observing current SoCs, more and more area is occupied by embedded volatile memories, such as static random access memory (SRAM) and dynamic random access memory (DRAM). As a result, a significant proportion of total power is spent in memory systems. In the past two decades, alternative memory technologies have emerged with attractive characteristics to mitigate the aforementioned issues. Among these technologies, magnetic random access memory (MRAM) is a promising candidate as it simultaneously combines high density and very low static power consumption while its performance is competitive compared to SRAM and DRAM. Moreover, MRAM is non-volatile. This capability, if present in embedded memories, has the potential to add new features to SoCs to enhance energy efficiency and reliability. In this thesis, an area, performance and energy exploration of embedding the MRAM technology in the memory hierarchy of a processor architecture is investigated. A first fine-grain exploration was made at the cache level for multi-core architectures. A second study evaluated the possibility of designing a non-volatile processor integrating MRAM at the register level. Within the context of the internet of things, new features and the benefits brought by the non-volatility were investigated.

10. Ramirez, Jon. "Analysis of compute cluster nodes with varying memory hierarchy distributions." ProQuest Dissertations and Theses @ UTEP, 2009. http://0-proquest.umi.com.lib.utep.edu/login?COPT=REJTPTU0YmImSU5UPTAmVkVSPTI=&clientId=2515.

11. Ghosh, Mrinmoy. "Microarchitectural techniques to reduce energy consumption in the memory hierarchy." Diss., Atlanta, Ga.: Georgia Institute of Technology, 2009. http://hdl.handle.net/1853/28265.

Abstract:
Thesis (M. S.)--Electrical and Computer Engineering, Georgia Institute of Technology, 2009.
Committee Chair: Lee, Hsien-Hsin S.; Committee Member: Chatterjee, Abhijit; Committee Member: Mukhopadhyay, Saibal; Committee Member: Pande, Santosh; Committee Member: Yalamanchili, Sudhakar.

12. Kjelso, Morten. "A quantitative evaluation of data compression in the memory hierarchy." Thesis, Loughborough University, 1997. https://dspace.lboro.ac.uk/2134/10596.

Abstract:
This thesis explores the use of lossless data compression in the memory hierarchy of contemporary computer systems. Data compression may realise performance benefits by increasing the capacity of a level in the memory hierarchy and by improving the bandwidth between two levels in the memory hierarchy. Lossless data compression is already widely used in parts of the memory hierarchy. However, most of these applications are characterised by targeting inexpensive and relatively low-performance devices such as magnetic disk and tape devices. The consequence is that the benefits of data compression are not realised to their full potential. This research aims to understand how the benefits of data compression can be realised for levels of the memory hierarchy which have a greater impact on system performance and system cost. This thesis presents a review of data compression in the memory hierarchy and argues that main memory compression has the greatest potential to improve system performance. The review also identifies three key issues relating to the use of data compression in the memory hierarchy. Quantitative investigations are presented to address these issues for main memory data compression. The first investigation is into memory data, and shows that memory data from a range of Unix applications typically compresses to half its original size. The second investigation develops three memory compression architectures, taking into account the results of the previous investigation. Furthermore, the management of compressed data is addressed, and management methods are developed which achieve storage efficiencies in excess of 90% and typically complete allocation and deallocation operations with only a few memory accesses. The experimental work then culminates in a performance investigation. This shows that when memory resources are stretched, hardware-based memory compression can improve system performance by up to an order of magnitude. Furthermore, software-based memory compression can improve system performance by up to a factor of 2. Finally, the performance models and quantitative results contained in this thesis enable us to identify under what conditions memory compression offers performance benefits. This may help designers incorporate memory compression into future computer systems.
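As a toy illustration of why memory contents compress so well (the finding above that Unix application memory typically halves in size), the sketch below counts the cost of a simple zero-aware encoding on a sparse page. It is illustrative only, not the compression algorithm evaluated in the thesis.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Encode each 32-bit word as: a zero word -> 1 tag bit; any other word ->
// 1 tag bit + the 32 data bits. Returns the encoded size in bits.
std::size_t compressed_bits(const std::vector<std::uint32_t>& words) {
    std::size_t bits = 0;
    for (std::uint32_t w : words)
        bits += (w == 0) ? 1 : 33;
    return bits;
}

int main() {
    std::vector<std::uint32_t> page(1024, 0);       // one 4 KiB page of words
    for (std::size_t i = 0; i < page.size(); i += 4)
        page[i] = static_cast<std::uint32_t>(i);    // sparse non-zero data
    std::size_t raw = page.size() * 32;
    std::printf("ratio: %.2f\n",                    // ~0.28 for this page
                double(compressed_bits(page)) / double(raw));
}
```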

13. Azarkhish, Erfan. "Memory Hierarchy Design for Next Generation Scalable Many-core Platforms." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2016. http://amsdottorato.unibo.it/7255/1/erfan-thesis-final.pdf.

Abstract:
Performance and energy consumption in modern computing platforms are largely dominated by the memory hierarchy. The increasing computational power of multiprocessors and accelerators, and the emergence of data-intensive workloads (e.g. large-scale graph traversal and scientific algorithms) requiring fast transfer of large volumes of data, are two main trends which intensify this problem by putting even higher pressure on the memory hierarchy. This increasing gap between computation speed and data transfer speed is commonly referred to as the “memory wall” problem. With the emergence of heterogeneous Three-Dimensional (3D) integration based on through-silicon vias (TSVs), this situation has started to improve in recent years. On one hand, it is now possible to improve memory access bandwidth and/or latency by either stacking memories directly on top of processors or through abstracted memory interfaces such as Micron’s Hybrid Memory Cube (HMC). On the other hand, near-memory computation has become worth revisiting due to the cost-effective integration of logic and memory in 3D stacks. These two directions bring about several interesting opportunities including performance improvement, energy and cost reduction, product miniaturization, and modular design for improved time to market. In this research, we study the effectiveness of 3D integration technology and the optimization opportunities it can provide in the different layers of the memory hierarchy in cluster-based many-core platforms, ranging from intra-cluster L1 to inter-cluster L2 scratchpad memories (SPMs), as well as the main memory. In addition, by moving a part of the computation to where the data resides, in the 3D-stacked memory context, we demonstrate further energy and performance improvement opportunities.

14. de Souza Ferreira, Tharso. "Improving Memory Hierarchy Performance on MapReduce Frameworks for Multi-Core Architectures." Doctoral thesis, Universitat Autònoma de Barcelona, 2013. http://hdl.handle.net/10803/129468.

Abstract:
The need to analyze large data sets from many different application fields has fostered the use of simplified programming models like MapReduce. Its current popularity is justified by being a useful abstraction to express data-parallel processing and by effectively hiding synchronization, fault tolerance and load balancing management details from the application developer. MapReduce frameworks have also been ported to multi-core and shared-memory computer systems. These frameworks dedicate a different CPU core to each map or reduce task so they execute concurrently. Also, the Map and Reduce phases share a common data structure where the main computations are applied. In this work we describe some limitations of current multi-core MapReduce frameworks. First, we describe the relevance of the data structure used to keep all input and intermediate data in memory. Current multi-core MapReduce frameworks are designed to keep all intermediate data in memory, and when executing applications with large data inputs, the available memory becomes too small to store all the framework's intermediate data, causing a severe performance loss. We propose a memory management subsystem that allows the intermediate data structures to process an unlimited amount of data through a disk-spilling mechanism. We have also implemented a way to manage concurrent disk access by all threads participating in the computation. Finally, we have studied the effective use of the memory hierarchy by the data structures of the MapReduce frameworks, and we propose a new implementation that applies partial MapReduce tasks to the input data set. The objective is to make better use of the cache and to eliminate references to data blocks that are no longer in use. Our proposal significantly reduces main memory usage and improves overall performance through increased cache utilization.
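The disk-spilling idea can be pictured with a short sketch. The class below uses assumed interfaces, not the thesis's framework: intermediate key-value pairs are combined in memory and flushed to a sorted spill file once a memory budget is exceeded, so input size is no longer limited by RAM.

```cpp
#include <fstream>
#include <map>
#include <string>

// Intermediate key-value store with a fixed memory budget. A real
// framework would merge the sorted spill runs before the Reduce phase.
class SpillingStore {
    std::map<std::string, long> mem_;   // in-memory intermediate data (sorted)
    std::size_t budget_;
    std::size_t used_ = 0;
    int spills_ = 0;
public:
    explicit SpillingStore(std::size_t budget) : budget_(budget) {}

    void emit(const std::string& key, long value) {
        used_ += key.size() + sizeof(long);
        mem_[key] += value;             // combine in place, like a combiner
        if (used_ > budget_)
            spill();
    }

    void spill() {                      // write one sorted run to disk
        std::ofstream out("spill-" + std::to_string(spills_++) + ".tmp");
        for (const auto& kv : mem_)
            out << kv.first << '\t' << kv.second << '\n';
        mem_.clear();
        used_ = 0;
    }
};

int main() {
    SpillingStore store(1 << 20);       // 1 MiB budget
    store.emit("word", 1);              // map-side emits go through emit()
    store.spill();                      // final flush before merging runs
}
```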

15. Agarwal, Khushbu. "A partition based approach to approximate tree mining: a memory hierarchy perspective." The Ohio State University, 2007. http://rave.ohiolink.edu/etdc/view?acc_num=osu1196284256.

16. Mehta, Nishant K. "A Hierarchy Navigation Framework: Supporting Scalable Interactive Exploration over Large Databases." Thesis, Worcester Polytechnic Institute, 2004. http://www.wpi.edu/Pubs/ETD/Available/etd-0827104-114148/.

17. Xiang, Ping. "Analyzing Instruction Based Cache Replacement Policies." Master's thesis, University of Central Florida, 2010. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/2589.

Abstract:
The increasing speed gap between microprocessors and off-chip DRAM makes last-level caches (LLCs) a critical component for computer performance. Multi-core processors aggravate the problem since multiple processor cores compete for the LLC. As a result, LLCs typically consume a significant amount of the die area, and effective utilization of LLCs is mandatory for both performance and power efficiency. We present a novel replacement policy for LLCs. The fundamental observation is to view LLCs as a shared resource among multiple address streams, with each stream being generated by a static memory access instruction. The management of LLCs in both single-core and multi-core processors can then be modeled as a competition among multiple instructions. In our proposed scheme, we prioritize those instructions based on the number of LLC accesses and reuses, and only allow cache lines having high instruction priorities to replace those of low priorities. The hardware support for our proposed replacement policy is lightweight. Our experimental results based on a set of SPEC 2006 benchmarks show that it achieves significant performance improvement upon the least-recently-used (LRU) replacement policy for benchmarks with high numbers of LLC misses. To handle LRU-friendly workloads, the set sampling technique is adopted to retain the benefits of the LRU replacement policy.
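The policy can be pictured with a short sketch. The C++ fragment below uses illustrative data structures (not the thesis's hardware design): a priority is derived from per-instruction access and reuse counts, and an incoming line may evict only lines filled by lower-priority instructions.

```cpp
#include <array>
#include <cstdint>
#include <unordered_map>

struct Line {
    std::uint64_t tag = 0;
    std::uint64_t fill_pc = 0;   // PC of the instruction that filled the line
    bool valid = false;
};

class InstrPriorityPolicy {
    std::unordered_map<std::uint64_t, std::uint32_t> accesses_, reuses_;
public:
    void record(std::uint64_t pc, bool hit) {   // called on every LLC access
        ++accesses_[pc];
        if (hit) ++reuses_[pc];
    }

    // Instructions with many accesses and reuses rank high
    // (the weighting is illustrative).
    std::uint32_t priority(std::uint64_t pc) const {
        auto a = accesses_.find(pc);
        auto r = reuses_.find(pc);
        std::uint32_t acc = (a == accesses_.end()) ? 0 : a->second;
        std::uint32_t reu = (r == reuses_.end()) ? 0 : r->second;
        return acc + 4 * reu;
    }

    // Pick the lowest-priority victim ranking below the incoming line's
    // instruction; return -1 to bypass the insertion entirely.
    int pick_victim(const std::array<Line, 16>& set, std::uint64_t incoming_pc) {
        int victim = -1;
        std::uint32_t lowest = priority(incoming_pc);
        for (int w = 0; w < 16; ++w) {
            if (!set[w].valid) return w;          // free way available
            std::uint32_t p = priority(set[w].fill_pc);
            if (p < lowest) { lowest = p; victim = w; }
        }
        return victim;
    }
};
```

The set sampling mentioned in the abstract would dedicate a few sample sets to plain LRU and fall back to it on workloads where LRU wins.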

18. Huang, Cheng-Chieh. "Optimizing cache utilization in modern cache hierarchies." Thesis, University of Edinburgh, 2016. http://hdl.handle.net/1842/19571.

Abstract:
Memory wall is one of the major performance bottlenecks in modern computer systems. SRAM caches have been used to successfully bridge the performance gap between the processor and the memory. However, an SRAM cache's latency grows with its size, so simply increasing the size of caches could have a negative impact on performance. To solve this problem, modern processors employ multiple levels of caches, each of a different size, forming the so-called memory hierarchy. Upon a miss, the processor looks up the data from the highest level (L1 cache) to the lowest level (main memory). Such a design can effectively reduce the negative performance impact of simply using a large cache. However, because SRAM has lower storage density compared to other volatile storage, the size of an SRAM cache is restricted by the available on-chip area. With modern applications requiring more and more memory, researchers are continuing to look at techniques for increasing the effective cache capacity. In general, researchers are approaching this problem from two angles: maximizing the utilization of current SRAM caches, or exploiting new technology to support larger capacity in cache hierarchies. The first part of this thesis focuses on how to maximize the utilization of the existing SRAM cache. In our first work, we observe that not all words belonging to a cache block are accessed around the same time. In fact, a subset of words is consistently accessed sooner than others. We call this subset the critical words. In our study, we found that these critical words can be predicted using the access footprint. Based on this observation, we propose the critical-words-only cache (co-cache). Unlike a conventional cache, which stores all words that belong to a block, the co-cache only stores the words that we predict to be critical. In this work, we convert an L2 cache to a co-cache and use the L1's access footprint information to predict critical words. Our experiments show the co-cache can outperform a conventional L2 cache on workloads whose working-set sizes are greater than the L2 cache size. To handle workloads whose working-set sizes fit in the conventional L2, we propose the adaptive co-cache (aco-cache), which allows the co-cache to be configured back into a conventional cache. The second part of this thesis focuses on how to efficiently enable a large-capacity on-chip cache. In the near future, 3D stacking technology will allow us to stack one or multiple DRAM chips onto the processor. The total size of these chips is expected to be on the order of hundreds of megabytes or even a few gigabytes. Recent works have proposed to use this space as an on-chip DRAM cache. However, the tags of the DRAM cache create a classic space/time trade-off. On the one hand, we would like the latency of a tag access to be small, as it contributes to both hit and miss latencies; accordingly, we would like to store these tags in a faster medium such as SRAM. On the other hand, with hundreds of megabytes of die-stacked DRAM cache, the space overhead of the tags would be huge. For example, it would cost around 12 MB of SRAM space to store all the tags of a 256 MB DRAM cache (if we used conventional 64 B blocks). Clearly this is too large, considering that some current chip multiprocessors have an L3 that is smaller. Prior works have proposed to store these tags along with the data in the stacked DRAM array (tags-in-DRAM). However, this scheme increases the access latency of the DRAM cache.
To optimize access latency in the DRAM cache, we propose the aggressive tag cache (ATCache). Similar to a conventional cache, the ATCache caches recently accessed tags to exploit temporal locality; it exploits spatial locality by prefetching tags from nearby cache sets. In addition, we address the high miss latency and cache pollution caused by excessive prefetching. To reduce this overhead, we propose cost-effective prefetching, a combination of dynamic prefetching-granularity tuning and hit-prefetching, to throttle the number of sets prefetched. Our proposed ATCache (which consumes 0.4% of the overall tag size) can satisfy over 60% of DRAM cache tag accesses on average. The last proposed work in this thesis is a DRAM-Cache-Aware (DCA) DRAM controller. In this work, we first address the challenge of scheduling requests in the DRAM cache. While many recent DRAM cache works have built their techniques on a tags-in-DRAM scheme, storing these tags in the DRAM array increases the complexity of a DRAM cache request: in contrast to a conventional request to DRAM main memory, a request to the DRAM cache now translates into multiple DRAM cache accesses (tag and data). We address the challenge of how to schedule these DRAM cache accesses. We start by exploring whether or not a conventional DRAM controller will work well in this scenario. We introduce two potential designs and study their limitations. From this study, we derive a set of design principles that an ideal DRAM cache controller must satisfy. We then propose a DRAM-cache-aware (DCA) DRAM controller that is based on these design principles. Our experimental results show that DCA can outperform the baseline by over 14%.
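The ATCache idea can be sketched in a few lines. The fragment below uses assumed software structures purely for illustration: a small SRAM-side map caches recently used DRAM-cache tags, and an install also prefetches the tags of neighbouring sets.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

struct TagEntry {
    std::uint64_t tag;   // which block currently occupies the DRAM-cache set
    bool dirty;
};

class ATCache {
    std::unordered_map<std::uint64_t, TagEntry> sram_;  // set index -> tag
public:
    // SRAM hit: no tag access to the stacked DRAM is needed.
    std::optional<TagEntry> lookup(std::uint64_t set) const {
        auto it = sram_.find(set);
        if (it == sram_.end()) return std::nullopt;
        return it->second;
    }

    // On an SRAM miss the tag is fetched from the DRAM array (tags-in-DRAM)
    // and installed, together with the tags of a few neighbouring sets to
    // exploit spatial locality; the prefetch granularity is tuned at runtime.
    void install(std::uint64_t set, TagEntry fetched, int prefetch_sets) {
        sram_[set] = fetched;
        for (int d = 1; d <= prefetch_sets; ++d)
            // emplace does not overwrite tags already cached for those sets
            sram_.emplace(set + d, TagEntry{0 /* fetched from DRAM */, false});
    }
};
```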

19. Seong, Nak Hee. "A reliable, secure phase-change memory as a main memory." Diss., Georgia Institute of Technology, 2012. http://hdl.handle.net/1853/50123.

Abstract:
The main objective of this research is to provide an efficient and reliable method for using multi-level cell (MLC) phase-change memory (PCM) as a main memory. As DRAM scaling approaches the physical limit, alternative memory technologies are being explored for future computing systems. Among them, PCM is the most mature, with announced commercial products for NOR flash replacement. Its fast access latency and scalability have led researchers to investigate PCM as a feasible candidate for DRAM replacement. Moreover, the multi-level potential of PCM cells can enhance scalability by increasing the number of bits stored in a cell. However, the two major challenges for adopting MLC PCM are the limited write endurance and the resistance drift issue. To alleviate the negative impact of the limited write endurance, this thesis first introduces a secure wear-leveling scheme called Security Refresh. The thesis argues that a PCM design not only has to consider normal wear-out under normal application behavior; most importantly, it must take the worst-case scenario into account, with the presence of malicious exploits and a compromised OS, to address the durability and security issues simultaneously. Security Refresh avoids information leaks by constantly migrating the physical locations of data inside the PCM, obfuscating the actual data placement from users and system software. In addition to the secure wear-leveling scheme, this thesis also proposes SAFER, a hardware-efficient multi-bit stuck-at-fault error recovery scheme which can function in conjunction with existing wear-leveling techniques. The limited write endurance leads to wear-out-related permanent failures, and furthermore, technology scaling increases the variation in cell lifetime, resulting in early failures of many cells. SAFER exploits the key attribute that a failed cell with a stuck-at value is still readable, making it possible to continue using the failed cell to store data, thereby reducing the hardware overhead of error recovery. Another approach that this thesis proposes to address the lower write endurance is a hybrid phase-change memory architecture that can dynamically classify, detect, and isolate frequent writes from accessing the phase-change memory. This architecture employs a small SRAM-based Isolation Cache with a detection mechanism based on a multi-dimensional Bloom filter and a binary classifier. The techniques are orthogonal to and can be combined with other wear-out management schemes to obtain a synergistic result. Lastly, this thesis quantitatively studies the current art for MLC PCM in dealing with the resistance drift problem and shows that previous techniques such as scrubbing or error correction schemes are incapable of providing a sufficient level of reliability. The thesis then proposes tri-level-cell (3LC) PCM and demonstrates that 3LC PCM can be a viable solution to achieve the soft error rate of DRAM and the performance of single-level-cell PCM.
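The core of Security Refresh is region-level address randomization under a random key that changes every refresh round. The sketch below is simplified (the real scheme remaps incrementally, swapping pairs of lines as a refresh pointer sweeps the region, rather than all at once) and its names are illustrative; it shows only the key-XOR mapping idea.

```cpp
#include <cstdint>
#include <random>

// One remapping region. to_physical() permutes line addresses inside the
// region by XORing the in-region bits with a random key, so software (or
// an attacker) cannot tell which physical line a given address hits.
class RegionRemap {
    std::uint64_t key_;
public:
    explicit RegionRemap(std::mt19937_64& rng) : key_(rng()) {}

    std::uint64_t to_physical(std::uint64_t line_addr,
                              std::uint64_t region_mask) const {
        // keep the region-selection bits, permute the in-region bits
        return (line_addr & ~region_mask) | ((line_addr ^ key_) & region_mask);
    }

    void rekey(std::mt19937_64& rng) { key_ = rng(); }  // new round, new map
};

int main() {
    std::mt19937_64 rng(42);
    RegionRemap region(rng);
    // 64 KiB region of 64 B lines -> 10 in-region line-address bits
    std::uint64_t mask = (1u << 10) - 1;
    std::uint64_t phys = region.to_physical(0x1234, mask);
    (void)phys;
    region.rekey(rng);   // each refresh round re-keys, spreading hot lines
}
```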

20. Sampaio, Felipe Martin. "Energy-efficient memory hierarchy for motion and disparity estimation in multiview video coding." Biblioteca Digital de Teses e Dissertações da UFRGS, 2013. http://hdl.handle.net/10183/71292.

Abstract:
This Master's thesis proposes a memory hierarchy for Motion and Disparity Estimation (ME/DE) centered on the encoding references, called Reference-Centered Data Reuse (RCDR), focusing on energy reduction in Multiview Video Coding (MVC). In MVC encoders, the ME/DE represents more than 98% of the overall energy consumption. Moreover, up to 90% of the overall ME/DE energy is related to memory issues, and only 10% to effective computation. Two items are of concern: (1) off-chip memory communication to fetch the reference samples (45%) and (2) the on-chip memory that stores the search window samples and sends them to the ME/DE processing core (45%). The main goal of this work is to jointly minimize the on-chip and off-chip energy consumption in order to reduce the overall energy related to ME/DE in MVC. The memory hierarchy is composed of an on-chip video memory (which stores the entire search window), an on-chip memory gating control, and a partial results compressor. A search control unit is also proposed to exploit the search behavior to achieve further energy reduction. This work also adds a low-complexity reference frame compressor to the memory hierarchy. The experimental results show that the proposed system accomplishes the goal of jointly minimizing the on-chip and off-chip energies. RCDR provides off-chip energy savings of up to 68% when compared to the state-of-the-art MB-centered approach. The partial results compressor is able to reduce by 52% the off-chip memory communication needed to handle the partial results that RCDR introduces. When compared to techniques that do not access the entire search window, the proposed RCDR also achieves the best results in off-chip energy consumption due to its regular access pattern, which allows many DDR burst reads (30% less off-chip energy consumption). Besides, the reference frame compressor improves the off-chip memory communication savings by 2.6x, with negligible losses in MVC encoding performance. The on-chip video memory size required for RCDR is up to 74% smaller than that of MB-centered Level C approaches. On top of that, the power-gating control saves 82% of the leakage energy. The dynamic energy is addressed by the candidate merging technique, with savings of more than 65%. Given the joint off-chip communication and on-chip storage energy savings, the proposed memory hierarchy system is able to meet the MVC constraints for ME/DE processing.

21. Khan, Muneeb. "Optimizing Performance in Highly Utilized Multicores with Intelligent Prefetching." Doctoral thesis, Uppsala universitet, Datorarkitektur och datorkommunikation, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-272095.

Abstract:
Modern processors apply sophisticated techniques, such as deep cache hierarchies and hardware prefetching, to increase performance. Such complex hardware structures have helped improve performance in general; however, their full potential is not realized, as software often utilizes the memory hierarchy inefficiently. Performance can be improved further by ensuring careful interaction between software and hardware. Performance typically improves by increasing cache utilization and by conserving DRAM bandwidth, i.e., retaining more useful data in the caches and lowering data requests to the DRAM. One way to achieve this is to conserve space across the cache hierarchy and increase the opportunity for temporal reuse of cached data. Similarly, conserving DRAM bandwidth is essential for performance in highly utilized multicores, as it can easily become a critical resource. When multiple cores are active and the per-core share of DRAM bandwidth shrinks, its efficient utilization plays an important role in improving overall performance. Together, the cache hierarchy and the DRAM bandwidth play a significant role in defining the overall performance of multicores. Based on deep insight from memory behavior modeling of software, this thesis explores five software-only methods to analyze and increase performance in multicores. The underlying philosophy that drives these techniques is to increase cache utilization and conserve DRAM bandwidth by 1) making data prefetching more accurate, and 2) lowering the miss rate in the cache hierarchy, either by preserving useful data longer through cache-bypassing of less useful data, or via code size compaction using compiler options. First, we show how microarchitecture-independent memory access profiles can be used to analyze the instruction cache performance of software. We use this information in a compiler pass to recompile application phases (with large instruction cache miss rates) for smaller code size, in an effort to improve the application's instruction cache behavior. Second, we demonstrate how a resource-efficient software prefetching method can be combined with hardware prefetching to improve performance in multicores when running software that exhibits irregular memory access patterns. Third, we show that hardware prefetching on high-performance commodity multicores is sub-optimal, and demonstrate how a resource-efficient software-only prefetching method can perform better in fully utilized multicores. Fourth, we present an adaptive prefetching approach that dynamically combines software and hardware prefetching in a runtime system to improve performance in highly utilized multicores. Finally, in the fifth work we develop a method to predict per-core prefetching configurations that deliver near-optimal overall multicore performance. These software techniques enable us to tap greater performance in multicores (up to 50%) without requiring more processing resources.
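The second and third contributions hinge on issuing prefetch instructions from software for access patterns that hardware stride prefetchers miss. The following is a minimal sketch of the general idea (not the thesis's actual method) in C, using the GCC/Clang `__builtin_prefetch` intrinsic on an indirect access; the prefetch distance `PF_DIST` is an assumed tuning parameter that real systems derive from memory latency and loop cost:

```c
#include <stddef.h>

/* Sum b[idx[i]] for i in [0, n). The indirection defeats most hardware
 * stride prefetchers, so we issue software prefetches PF_DIST iterations
 * ahead of the current one. */
#define PF_DIST 16

double sum_indirect(const double *b, const size_t *idx, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n) {
            /* rw = 0 (read), locality = 0 (no temporal reuse expected) */
            __builtin_prefetch(&b[idx[i + PF_DIST]], 0, 0);
        }
        sum += b[idx[i]];
    }
    return sum;
}
```

Because the index array itself is accessed sequentially, `idx[i + PF_DIST]` is usually already cached, so the inserted prefetch only costs one extra instruction per iteration.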
APA, Harvard, Vancouver, ISO, and other styles
25

Lodde, Mario. "Smart Memory and Network-On-Chip Design for High-Performance Shared-Memory Chip Multiprocessors." Doctoral thesis, Universitat Politècnica de València, 2014. http://hdl.handle.net/10251/35325.

Full text
Abstract:
The cache hierarchy and the network-on-chip (NoC) are two key components of chip multiprocessors (CMPs). Most of the NoC traffic is due to messages that the caches exchange as dictated by the coherence protocol. The amount of traffic, the ratio of short to long messages, and the traffic pattern in general vary depending on the cache geometry and the coherence protocol. The NoC architecture and the cache hierarchy are in fact tightly coupled, and these two components should be designed and evaluated jointly to study how varying one affects the performance of the other. Moreover, each component should be tailored to the requirements and opportunities of the other, and vice versa. Typically, different message classes are sent over different virtual networks or over NoCs with different bandwidth, separating long and short messages. However, messages can also be classified by the kind of information they carry: some, such as data requests, need fields to store information (block address, request type, etc.); others, such as acknowledgment messages (ACKs), carry no information except the destination node ID: they only convey temporal information, in the sense that receiving an ACK indicates that the source node has received the message being acknowledged and completed all the operations required by the coherence protocol. This second message class does not need much bandwidth: latency is far more important, since the destination node is typically blocked waiting for these messages. This thesis develops a dedicated network to transmit this second message class; the network is very simple and fast, and delivers ACKs within a few clock cycles. By reducing the latency and the NoC traffic due to ACKs, it becomes possible to: (i) speed up the invalidation phase of write operations in a system using a directory-based coherence protocol; (ii) improve the performance of a broadcast-based coherence protocol, reaching performance comparable to that of a directory protocol but without the area cost of storing the directory; and (iii) efficiently implement dynamic mapping of blocks to the last-level caches, with the goal of placing blocks as close as possible to the cores that use them. The final objective is a co-design of the NoC and the cache hierarchy that minimizes the scalability problems of coherence protocols. As an overarching goal, the aim is a CMP with dynamic allocation of cache and network resources, such that these resources can be partitioned efficiently and independently in order to assign different partitions to different applications in a virtualized environment.
Lodde, M. (2014). Smart Memory and Network-On-Chip Design for High-Performance Shared-Memory Chip Multiprocessors [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/35325
TESIS
APA, Harvard, Vancouver, ISO, and other styles
26

Bonatto, Alexsandro Cristóvão. "Controle adaptativo para acesso à memória compartilhada em sistemas em chip." Biblioteca Digital de Teses e Dissertações da UFRGS, 2014. http://hdl.handle.net/10183/109193.

Full text
Abstract:
The number of Processing Elements (PEs) contained in a System-on-Chip (SoC) follows the growth of the number of transistors per chip. In some applications, such as multimedia, a SoC composed of multiple PEs implements algorithms that handle large volumes of data and justify the use of an external memory of large capacity. External memory accesses are shared by multiple PEs, which poses challenges that demand special attention because they constitute the performance bottleneck and a relevant factor in power consumption. When the PEs are microprocessors, this issue becomes even more evident, as the rate of increase in microprocessor speed exceeds that of DRAM speed: both grow exponentially, but the exponent of the microprocessors is larger. This effect is called the "memory wall" and means that the processing bottleneck is related to the speed of data access. In this scenario, new access control strategies are needed to improve processing performance. Heterogeneous platforms for multimedia processing are composed of several PEs, and concurrent accesses to non-contiguous regions of a DRAM reduce bandwidth and increase data access latency, degrading processing performance. This thesis shows that significant improvements in computational efficiency can be obtained with a memory-centric design flow, i.e., one oriented to the functional aspects of the DRAM, through a memory subsystem with adaptive management of the memory channel shared among multiple clients. The thesis presents the architecture of a predictable memory controller that evaluates, at runtime, the worst-case execution of the transactions requested by the clients; a delay-based model is used to predict the worst cases for the set of clients. The memory subsystem centralizes data communication and manages the accesses of the several PEs in the system, so that communication is served according to the needs of each application. Three main contributions are presented: (1) a memory-centric design method for integrated systems, which orients the design toward the functional aspects of the shared memory; (2) a delay-based model to estimate the worst-case behavior of the system, in terms of response times and the minimum bandwidth allocated per client; and (3) an adaptive arbiter for managing accesses to the external memory with guarantees on transaction execution deadlines.
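The delay-based worst-case reasoning can be illustrated with a textbook bound: under time-division multiplexing (TDM) among n clients, a newly arrived request may have to wait for every other client's slot before its own is served. The sketch below shows only this generic bound, not the adaptive model proposed in the thesis:

```c
/* Back-of-envelope worst-case response time under TDM arbitration:
 * (n_clients - 1) foreign slots of waiting plus one service slot.
 * Times are in memory controller clock cycles. */
unsigned long wcrt_tdm_cycles(unsigned n_clients, unsigned long slot_cycles)
{
    return (unsigned long)n_clients * slot_cycles;
}
```

An adaptive arbiter such as the one proposed here tightens this pessimistic bound at runtime by reassigning slots that idle clients do not use, while still guaranteeing each client its minimum bandwidth share.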
APA, Harvard, Vancouver, ISO, and other styles
27

Jiménez, Castells Marta. "Multilevel tiling for non-rectangular interation spaces." Doctoral thesis, Universitat Politècnica de Catalunya, 1999. http://hdl.handle.net/10803/6007.

Full text
Abstract:
The main motivation of this thesis is to develop new compilation techniques that address the poor performance of complex numerical codes consisting of loop nests that define non-rectangular iteration spaces. Specifically, we focus on the loop tiling transformation (also known as blocking), and our purpose is to improve loop tiling when dealing with complex numerical codes. Our goal is to achieve, via the loop tiling transformation, the same or better performance as hand-optimized vendor-supplied numerical libraries.

We will observe that the main reason why current commercial compilers perform poorly when dealing with this type of codes is that they do not apply tiling for the register level. Instead, to enhance locality at this level and to improve ILP, they use/combine other transformations that do not exploit the register level as well as loop tiling. Tiling for the register level has not generally been considered because, in complex numerical codes, it is far from being trivial due to the irregular nature of the iteration space. Our first contribution in this thesis will be a general compiler algorithm to perform tiling at the register level that handles arbitrary iteration space shapes and not only simple rectangular shapes.

Our method includes a very simple heuristic to make the tile decisions for the register level. At first sight, register tiling should be performed so that whichever loop carries the most temporal reuse is not tiled. This way, register reuse is maximized and the number of load/store instructions executed is minimized. However, we will show that, for complex loop nests, if we only consider reuse directions and do not take into account the iteration space shape, the tiled loop nest can suffer performance degradation. Our second contribution will be a proposal of a very simple heuristic to determine the tiling parameters for the register level, that considers not only temporal reuse, but also the iteration space shape. Moreover, the heuristic is simple enough to be suitable for automatic implementation by compilers.

However, to be able to achieve performance similar to hand-optimized codes, it is not enough to tile only for the register level. With today's architectures having complex memory hierarchies and multiple processors, it is quite common that the compiler has to perform tiling at four or more levels (parallelism, L2-cache, L1-cache and registers) in order to achieve high performance. Therefore, on today's architectures it is crucial to have an efficient algorithm that can perform multilevel tiling at multiple levels of the memory hierarchy. Moreover, as we will see in this thesis, multilevel tiling should always include the register level, as this is the memory hierarchy level that yields the most performance when properly tiled.

When multilevel tiling includes the register level, it is critical to compute exact loop bounds and to avoid the generation of redundant bounds. The reason is that the complexity and the amount of code generated by our register tiling technique both depend polynomially on the number of loop bounds. However, to date, the drawback of generating exact loop bounds and eliminating redundant bounds has been that all techniques known were extremely expensive in terms of compilation time and, thus, difficult to integrate in a production compiler. Our third contribution in this thesis will be a new implementation of multilevel tiling that computes exact loop bounds at a much lower complexity than traditional techniques. In fact, we will show that the complexity of our implementation is proportional to the complexity of performing a loop permutation in the original loop nest (before tiling), while traditional techniques have much larger complexities. Moreover, our implementation generates less redundant bounds in the multilevel tiled code and allows removing the remaining redundant bounds at a lower cost. Overall, the efficiency of our implementation makes it possible to integrate multilevel tiling including the register level in a production compiler without having to worry about compilation time.

The last part of this thesis is dedicated to studying the performance of multilevel tiling. We will discuss the effects of tiling for different memory levels and present quantitative data comparing the benefits of tiling only for the register level, tiling only for the cache level and tiling for both levels simultaneously. Finally, we will compare automatically-optimized codes against hand-optimized vendor-supplied numerical libraries, on two different architectures (ALPHA 21164 and MIPS R10000), to conclude that compiler technology can make it possible for complex numerical codes to achieve the same performance as hand-optimized codes on modern microprocessors.
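To make the transformation concrete, the following C sketch tiles a matrix product for one cache level and for the register level; it assumes a rectangular iteration space with dimensions divisible by the tile sizes, which is precisely the simplification the thesis removes for non-rectangular spaces:

```c
/* C = C + A*B for n x n row-major matrices, tiled for one cache level
 * (TB x TB blocks) and for registers (2 x 2 accumulators). For brevity,
 * n is assumed to be a multiple of TB and TB a multiple of 2; general
 * iteration spaces require the exact boundary loop bounds the thesis
 * shows how to generate cheaply. */
#define TB 64

void matmul_tiled(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += TB)
    for (int kk = 0; kk < n; kk += TB)
    for (int jj = 0; jj < n; jj += TB)
        for (int i = ii; i < ii + TB; i += 2)
        for (int j = jj; j < jj + TB; j += 2) {
            /* register tile: four accumulators live in registers */
            double c00 = C[i*n + j],     c01 = C[i*n + j + 1];
            double c10 = C[(i+1)*n + j], c11 = C[(i+1)*n + j + 1];
            for (int k = kk; k < kk + TB; k++) {
                double a0 = A[i*n + k], a1 = A[(i+1)*n + k];
                double b0 = B[k*n + j], b1 = B[k*n + j + 1];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            C[i*n + j] = c00;     C[i*n + j + 1] = c01;
            C[(i+1)*n + j] = c10; C[(i+1)*n + j + 1] = c11;
        }
}
```

The innermost k loop touches only four accumulators, two rows of A, and two columns of B, which is what keeps the values in registers and minimizes load/store traffic.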
APA, Harvard, Vancouver, ISO, and other styles
28

SOHONI, SOHUM. "IMPROVING L2 CACHE PERFORMANCE THROUGH STREAM-DIRECTED OPTIMIZATIONS." University of Cincinnati / OhioLINK, 2004. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1092932892.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Chen, Feng. "On Performance Optimization and System Design of Flash Memory based Solid State Drives in the Storage Hierarchy." The Ohio State University, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=osu1280537880.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Molka, Daniel. "Performance Analysis of Complex Shared Memory Systems." Doctoral thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2017. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-221729.

Full text
Abstract:
Systems for high performance computing are getting increasingly complex. On the one hand, the number of processors is increasing. On the other hand, the individual processors are getting more and more powerful. In recent years, the latter is to a large extent achieved by increasing the number of cores per processor. Unfortunately, scientific applications often fail to fully utilize the available computational performance. Therefore, performance analysis tools that help to localize and fix performance problems are indispensable. Large scale systems for high performance computing typically consist of multiple compute nodes that are connected via network. Performance analysis tools that analyze performance problems arising from the use of multiple nodes are readily available. However, the increasing number of cores per processor that can be observed within the last decade represents a major change in the node architecture. Therefore, this work concentrates on the analysis of the node performance. The goal of this thesis is to improve the understanding of the achieved application performance on existing hardware. It can be observed that the scaling of parallel applications on multi-core processors differs significantly from the scaling on multiple processors. Therefore, the properties of shared resources in contemporary multi-core processors as well as remote accesses in multi-processor systems are investigated and their respective impact on the application performance is analyzed. As a first step, a comprehensive suite of highly optimized micro-benchmarks is developed. These benchmarks are able to determine the performance of memory accesses depending on the location and coherence state of the data. They are used to perform an in-depth analysis of the characteristics of memory accesses in contemporary multi-processor systems, which identifies potential bottlenecks. However, in order to localize performance problems, it also has to be determined to what extent the application performance is limited by certain resources. Therefore, a methodology to derive metrics for the utilization of individual components in the memory hierarchy as well as waiting times caused by memory accesses is developed in the second step. The approach is based on hardware performance counters, which record the number of certain hardware events. The developed micro-benchmarks are used to selectively stress individual components, which can be used to identify the events that provide a reasonable assessment of the utilization of the respective component and the amount of time that is spent waiting for memory accesses to complete. Finally, the knowledge gained from this process is used to implement a visualization of memory related performance issues in existing performance analysis tools. The results of the micro-benchmarks reveal that the increasing number of cores per processor and the usage of multiple processors per node lead to complex systems with vastly different performance characteristics of memory accesses depending on the location of the accessed data. Furthermore, it can be observed that the aggregated throughput of shared resources in multi-core processors does not necessarily scale linearly with the number of cores that access them concurrently, which limits the scalability of parallel applications. It is shown that the proposed methodology for the identification of meaningful hardware performance counters yields useful metrics for the localization of memory related performance limitations.
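A common building block of such micro-benchmark suites is pointer chasing, in which every load depends on the previous one, so the time per step approximates the access latency of whichever memory level holds the working set. A simplified sketch (not the thesis code) follows:

```c
#include <stdlib.h>
#include <time.h>

/* Measure the average latency of one dependent load over a working set
 * of n_elems pointers. Varying n_elems sweeps the working set across the
 * L1/L2/L3/DRAM capacity boundaries. */
double chase_latency_ns(size_t n_elems, size_t steps)
{
    size_t *next = malloc(n_elems * sizeof *next);
    if (!next) return -1.0;
    for (size_t i = 0; i < n_elems; i++) next[i] = i;
    /* Sattolo's algorithm: shuffle into one random cycle, which defeats
     * hardware prefetching and guarantees every slot is visited. */
    for (size_t i = n_elems - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
    volatile size_t p = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < steps; s++) p = next[p];  /* dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    free(next);
    return ns / (double)steps;
}
```

Extending such a kernel to place the data in another core's cache, or in another processor's memory, in a given coherence state is what allows location- and state-dependent latencies to be measured.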
APA, Harvard, Vancouver, ISO, and other styles
31

Davari, Mahdad. "Advances Towards Data-Race-Free Cache Coherence Through Data Classification." Doctoral thesis, Uppsala universitet, Avdelningen för datorteknik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-320595.

Full text
Abstract:
Providing a consistent view of the shared memory based on precise and well-defined semantics—memory consistency model—has been an enabling factor in the widespread acceptance and commercial success of shared-memory architectures. Moreover, cache coherence protocols have been employed by the hardware to remove from the programmers the burden of dealing with the memory inconsistency that emerges in the presence of the private caches. The principle behind all such cache coherence protocols is to guarantee that consistent values are read from the private caches at all times. In its most stringent form, a cache coherence protocol eagerly enforces two invariants before each data modification: i) no other core has a copy of the data in its private caches, and ii) all other cores know where to receive the consistent data should they need the data later. Nevertheless, by partly transferring the responsibility for maintaining those invariants to the programmers, commercial multicores have adopted weaker memory consistency models, namely the Total Store Order (TSO), in order to optimize the performance for more common cases. Moreover, memory models with more relaxed invariants have been proposed based on the observation that more and more software is written in compliance with the Data-Race-Free (DRF) semantics. The semantics of DRF software can be leveraged by the hardware to infer when data in the private caches might be inconsistent. As a result, hardware ignores the inconsistent data and retrieves the consistent data from the shared memory. DRF semantics therefore removes from the hardware the burden of eagerly enforcing the strong consistency invariants before each data modification. Instead, consistency is guaranteed only when needed. This results in manifold optimizations, such as reducing the energy consumption and improving the performance and scalability. The efficiency of detecting and discarding the inconsistent data is an important factor affecting the efficiency of such coherence protocols. For instance, discarding the consistent data does not affect the correctness, but results in performance loss and increased energy consumption. In this thesis we show how data classification can be leveraged as an effective tool to simplify the cache coherence based on the DRF semantics. In particular, we introduce simple but efficient hardware-based private/shared data classification techniques that can be used to efficiently detect the inconsistent data, thus enabling low-overhead and scalable cache coherence solutions based on the DRF semantics.
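The flavor of such a classification can be conveyed with a minimal page-granularity table: the first core to touch a page owns it as private, and any access by another core promotes it to shared. This sketch is an illustrative simplification, not the thesis's hardware mechanism:

```c
/* Page-granularity private/shared data classification. Coherence actions
 * (e.g., self-invalidation under DRF semantics) can then be restricted to
 * pages classified as SHARED, leaving PRIVATE data untouched. */
enum cls { UNTOUCHED, PRIVATE, SHARED };

struct page_entry { enum cls cls; int owner; };

enum cls classify_access(struct page_entry *e, int core)
{
    switch (e->cls) {
    case UNTOUCHED: e->cls = PRIVATE; e->owner = core; break;
    case PRIVATE:   if (e->owner != core) e->cls = SHARED; break;
    case SHARED:    break;  /* shared is a sticky state in this sketch */
    }
    return e->cls;
}
```

The efficiency question the abstract raises maps directly onto this table: a page misclassified as shared is still correct but pays unnecessary invalidations, while finer-grained or adaptive classification recovers that lost performance.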
APA, Harvard, Vancouver, ISO, and other styles
32

Lima, Pilla Laércio. "Topology-aware load balancing for performance portability over parallel high performance systems." Thesis, Grenoble, 2014. http://www.theses.fr/2014GRENM028.

Full text
Abstract:
This thesis presents our research on providing performance portability and scalability to complex scientific applications running on hierarchical multicore parallel platforms. Performance portability is attained when low core idleness is achieved while mapping a given application to different platforms; it can be affected by performance problems such as load imbalance and costly communications, and by overheads introduced by the task mapping algorithm itself. Load imbalance results from irregular and dynamic load behaviors, where the amount of work to be processed varies depending on the task and on the step of the simulation. Costly communications are caused by a task distribution that does not take into account the different communication times present in a hierarchical platform, including nonuniform and asymmetric communication costs at the memory and network levels. Lastly, task mapping overheads come from the execution time of the task mapping algorithm trying to mitigate load imbalance and costly communications, and from the migration of tasks. Our approach to performance portability is based on the hypothesis that precise machine topology information can help task mapping algorithms in their decisions. In this context, we propose a generic machine topology model of parallel platforms composed of one or more multicore compute nodes. It includes profiled latencies and bandwidths at the memory and network levels, and highlights asymmetries and nonuniformity at both levels. This information is employed by our three proposed topology-aware load balancing algorithms, named NucoLB, HwTopoLB, and HierarchicalLB, together with application information gathered at runtime. NucoLB focuses on the nonuniform aspects of parallel platforms, HwTopoLB considers the whole hierarchy in its decisions, and HierarchicalLB combines these algorithms hierarchically to reduce the task mapping overhead. These algorithms seek to mitigate load imbalance and costly communications while averting task migration overheads. Experimental results with the proposed load balancers on platforms composed of one or more multicore compute nodes showed performance improvements over state-of-the-art load balancing algorithms: NucoLB presented improvements of up to 19% on one compute node; HwTopoLB achieved performance improvements of 19% on average; and HierarchicalLB outperformed HwTopoLB by 22% on average on parallel platforms with ten or more compute nodes. These results were achieved by equalizing work among the available resources, reducing the communication costs experienced by applications, and keeping load balancing overheads low. In this sense, our load balancing algorithms provide performance portability to scientific applications while remaining independent of the application and of the system architecture.
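The core idea of topology-aware balancing can be sketched as a greedy placement that scores each core by a weighted sum of its resulting load and the communication cost implied by a profiled latency matrix. This illustrates the general idea only, not any of the three algorithms above; `lat`, `alpha`, and the fixed core count are assumptions of the sketch:

```c
#define NCORES 16

/* Profiled core-to-core latencies, as produced by a topology model
 * of the kind described above (assumed to be filled in elsewhere). */
extern double lat[NCORES][NCORES];

/* Place one task: pick the core minimizing load + alpha * comm cost,
 * where comm[d] is the communication volume to tasks already on core d. */
int place_task(double load[NCORES], double task_load,
               const double comm[NCORES], double alpha)
{
    int best = 0;
    double best_cost = 1e300;
    for (int c = 0; c < NCORES; c++) {
        double cost = load[c] + task_load;       /* load term */
        for (int d = 0; d < NCORES; d++)         /* communication term */
            cost += alpha * comm[d] * lat[c][d];
        if (cost < best_cost) { best_cost = cost; best = c; }
    }
    load[best] += task_load;
    return best;
}
```

Processing tasks in decreasing order of load with such a scoring function is the usual way to trade off idleness against communication; the thesis's algorithms refine both the cost model and when rebalancing is triggered.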
APA, Harvard, Vancouver, ISO, and other styles
33

Silvestri, Francesco. "Oblivious Computations on Memory and Network Hierarchies." Doctoral thesis, Università degli studi di Padova, 2009. http://hdl.handle.net/11577/3426414.

Full text
Abstract:
The hierarchical organization of the memory and communication systems and the availability of numerous processing units play an important role in the performance of algorithms. Their actual exploitation is made hard by the different configurations they may assume. It is crucial, for economic and portability reasons, that algorithms adapt to a wide spectrum of executing platforms, possibly in an automatic fashion. Adaptivity can be achieved through either aware algorithms, which make explicit use of suitable architectural parameters, or oblivious algorithms, whose sequence of operations is independent of the characteristics of the underlying architecture. Oblivious algorithms are more desirable in contexts where architectural parameters are unknown and hard to estimate, and, in some cases, they still exhibit optimal performance across different architectures. This thesis focuses on the study of oblivious algorithms pursuing two main objectives: the investigation of the potentialities and intrinsic limitations of oblivious computations in comparison with aware ones, and the introduction of the notion of oblivious algorithm in the parallel setting. We study various aspects concerning the execution of cache-oblivious algorithms for rational permutations, an important class of permutations including matrix transposition and the bit-reversal of a vector. We provide a lower bound, which is also valid for cache-aware algorithms, on the work complexity required by the execution of a rational permutation with an optimal cache complexity. Then, we develop a cache-oblivious algorithm performing any rational permutation, which exhibits optimal work and cache complexities under the tall-cache assumption. We then show that for certain families of rational permutations (including transposition and bit-reversal) no cache-oblivious algorithm can exhibit optimal cache complexity for all values of the cache parameters, while there exists a cache-aware algorithm with this property. We introduce the network-oblivious framework for the study of oblivious algorithms in the parallel setting. This framework explores the design of bulk-synchronous parallel algorithms that, without resorting to parameters for tuning the performance on the target platform, can execute efficiently on parallel machines with different degrees of parallelism and bandwidth/latency characteristics. We illustrate the framework by providing optimal network-oblivious algorithms for a few key problems (matrix multiplication and transposition, discrete Fourier transform, and sorting) and by presenting an impossibility result on the execution of network-oblivious algorithms for matrix transposition, similar in spirit to the one provided for rational permutations. Finally, we present a number of network-oblivious algorithms, with optimal communication and computation complexities, for solving a wide class of computations encompassed by the Gaussian Elimination Paradigm, including all-pairs shortest paths, Gaussian elimination without pivoting, and matrix multiplication.
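The cache-oblivious style can be illustrated with the classic recursive matrix transposition, in which no cache parameter appears anywhere in the code; the base-case size below is only a constant that limits recursion overhead, not a machine parameter:

```c
/* Cache-oblivious out-of-place transposition, B = A^T, for an r x c
 * row-major matrix A (a standard textbook algorithm, shown only to
 * illustrate the oblivious style). Recursively splitting the larger
 * dimension keeps subproblems roughly square until they fit in cache,
 * whatever the cache size happens to be. */
#define BASE 16

static void transpose_rec(const double *A, double *B,
                          int r0, int r1, int c0, int c1,
                          int rows, int cols)
{
    if (r1 - r0 <= BASE && c1 - c0 <= BASE) {
        for (int i = r0; i < r1; i++)
            for (int j = c0; j < c1; j++)
                B[j * rows + i] = A[i * cols + j];
    } else if (r1 - r0 >= c1 - c0) {
        int rm = (r0 + r1) / 2;
        transpose_rec(A, B, r0, rm, c0, c1, rows, cols);
        transpose_rec(A, B, rm, r1, c0, c1, rows, cols);
    } else {
        int cm = (c0 + c1) / 2;
        transpose_rec(A, B, r0, r1, c0, cm, rows, cols);
        transpose_rec(A, B, r0, r1, cm, c1, rows, cols);
    }
}
```

Calling `transpose_rec(A, B, 0, r, 0, c, r, c)` transposes the whole matrix; the impossibility results above concern exactly when such parameter-free code can, or provably cannot, match the cache complexity of a cache-aware version.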
APA, Harvard, Vancouver, ISO, and other styles
34

Candel, Margaix Francisco. "Efficient L2 Cache Management to Boost GPGPU Performance." Doctoral thesis, Universitat Politècnica de València, 2019. http://hdl.handle.net/10251/125477.

Full text
Abstract:
In recent years, the growing need for computing capacity has become a challenge that has led the industry to look for alternatives to conventional out-of-order superscalar processors, with the goal of increasing computing power while achieving higher energy efficiency. GPU architectures, which just a decade ago were used exclusively to accelerate computer graphics, have for several years been one of the most widely adopted alternatives to reach this goal. A particular characteristic of GPUs is their high main-memory bandwidth, which allows a large number of threads to execute very efficiently. This feature, together with their high computational power for floating-point operations, has given rise to the GPGPU computing paradigm, in which GPU architectures perform general-purpose computations. These characteristics make GPUs especially appropriate for executing massively parallel applications that have traditionally been run on conventional high-performance processors. The work in this thesis aims to improve the performance of GPUs when executing GPGPU applications. To this end, as a first step, a characterization study is carried out that identifies the most important features of GPGPU applications with respect to the memory hierarchy and its impact on performance. For this purpose, a detailed cycle-accurate simulator modeling the architecture of a recent GPU is used. The study reveals that some critical components of the GPU memory hierarchy must be modeled in more detail to obtain accurate results: the achieved performance can vary by up to a factor of 3× depending on how these critical components are modeled. For this reason, as a second step before developing the proposal, the work focuses on determining which components of the GPU memory hierarchy must be modeled in more detail to increase the accuracy of the simulator results, and on improving the existing simulator models of these components. Moreover, a validation study compares the results obtained with the improved models against those of a real commercial GPU; the implemented improvements reduce the deviation of the simulator results from the real results by about 96%. Finally, once simulation accuracy is improved, this thesis proposes a novel approach, called FRC (Fetch and Replacement Cache), which greatly improves GPU computational power by enhancing main-memory-level parallelism. The proposal increases the number of parallel accesses to main memory by accelerating the management of the fetch and replacement actions associated with cache accesses that miss. The FRC approach is based on a small auxiliary cache structure that efficiently unclogs the memory subsystem, improving GPU performance by 118% on average over the studied baseline and reducing the energy consumption of the memory hierarchy by 57%.
Candel Margaix, F. (2019). Efficient L2 Cache Management to Boost GPGPU Performance [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/125477
TESIS
APA, Harvard, Vancouver, ISO, and other styles
35

Rabbah, Rodric Michel. "Design Space Exploration and Optimization of Embedded Memory Systems." Diss., Georgia Institute of Technology, 2006. http://hdl.handle.net/1853/11605.

Full text
Abstract:
Recent years have witnessed the emergence of microprocessors that are embedded within a plethora of devices used in everyday life. Embedded architectures are customized through a meticulous and time consuming design process to satisfy stringent constraints with respect to performance, area, power, and cost. In embedded systems, the cost of the memory hierarchy limits its ability to play as central a role. This is due to stringent constraints that fundamentally limit the physical size and complexity of the memory system. Ultimately, application developers and system engineers are charged with the heavy burden of reducing the memory requirements of an application. This thesis offers the intriguing possibility that compilers can play a significant role in the automatic design space exploration and optimization of embedded memory systems. This insight is founded upon a new analytical model and novel compiler optimizations that are specifically designed to increase the synergy between the processor and the memory system. The analytical models serve to characterize intrinsic program properties, quantify the impact of compiler optimizations on the memory systems, and provide deep insight into the trade-offs that affect memory system design.
APA, Harvard, Vancouver, ISO, and other styles
36

Jacquelin, Mathias. "Memory-aware algorithms : from multicores to large scale platforms." Phd thesis, Ecole normale supérieure de lyon - ENS LYON, 2011. http://tel.archives-ouvertes.fr/tel-00662525.

Full text
Abstract:
This thesis focuses on memory-aware algorithms tailored for hierarchical memory architectures, found for instance within multicore processors. We first study the matrix product on multicore architectures. We model such a processor, derive lower bounds on the communication volume, introduce three ad hoc algorithms, and experimentally assess their performance. We then target a more complex operation: the QR factorization of tall matrices. We revisit existing algorithms to better exploit the parallelism of multicore processors; we study the critical paths of many algorithms, prove some of them to be asymptotically optimal, and assess their performance. In the next study, we focus on scheduling streaming applications onto a heterogeneous multicore platform, the QS 22. We introduce a model of the platform and use steady-state scheduling techniques so as to maximize the throughput. We present a mixed integer programming approach that computes an optimal solution, and propose simpler heuristics. We then focus on minimizing the amount of memory required for tree-shaped workflows, targeting a classical two-level memory system in which I/O operations represent transfers from one memory level to the other. We propose a new exact algorithm, and show that there exist trees where postorder traversals are arbitrarily bad. We then study the problem of minimizing the I/O volume for a given memory size, show that it is NP-hard, and provide a set of heuristics. Finally, we compare archival policies for BLUE WATERS: we introduce two archival policies, adapt the well-known RAIT strategy, provide a model of the tape storage platform, and use it to assess the performance of the three policies through simulation.
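The tree-workflow question can be made concrete with a toy model: each task reads its children's outputs and produces one output, and the peak memory of a fixed postorder traversal is computed recursively. This is an illustrative, assumption-laden variant of the models studied in this line of work, not the thesis's exact cost model:

```c
/* Peak memory of a postorder traversal of a task tree, assuming each
 * task's inputs (its children's outputs) stay resident until the task
 * completes and produces its own output of size `out`. */
struct task {
    long out;              /* size of this task's output           */
    int n_children;
    struct task **child;   /* children, visited in the given order */
};

long peak_mem(const struct task *t)
{
    long held = 0;   /* outputs of already-processed children       */
    long peak = 0;   /* largest footprint seen so far in subtree    */
    for (int i = 0; i < t->n_children; i++) {
        long sub = peak_mem(t->child[i]);
        if (held + sub > peak) peak = held + sub;
        held += t->child[i]->out;
    }
    /* executing t: all children outputs plus t's own output live   */
    if (held + t->out > peak) peak = held + t->out;
    return peak;
}
```

Even in this toy model, the child visiting order changes `peak`, which is the intuition behind the result that postorder traversals, although natural, can be arbitrarily far from the optimal traversal.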
APA, Harvard, Vancouver, ISO, and other styles
37

Pottier, Loïc. "Co-scheduling for large-scale applications : memory and resilience." Thesis, Lyon, 2018. http://www.theses.fr/2018LYSEN039/document.

Full text
Abstract:
This thesis explores co-scheduling problems in the context of large-scale applications, with two main focuses: the memory side, in particular the cache memory, and the resilience side. With the recent advent of many-core architectures such as chip multiprocessors (CMPs), the number of processing units keeps increasing, and the benefits of co-scheduling techniques in this context have been demonstrated in numerous studies. The main idea behind co-scheduling is to execute applications concurrently rather than in sequence in order to improve the global throughput of the platform. But sharing resources often generates interference: with a rising number of processing units accessing the same last-level cache, interference among co-scheduled applications becomes critical. In addition, with this increasing number of processors the probability of a failure increases too, so resilience aspects must be taken into account, especially for co-scheduling, because failure-prone resources may be shared between applications. On the memory side, we focus on interference in the last-level cache; one solution used to significantly reduce it is cache partitioning. Through a theoretical model, extensive simulations, and experiments on an existing platform, we demonstrate the usefulness and importance of co-scheduling when our cache partitioning strategies are deployed. We also investigate the same problem on a real cache-partitioned chip multiprocessor, using the Cache Allocation Technology recently provided by Intel. Still on the memory side, we then study how to model and schedule task graphs on new many-core architectures such as the Knights Landing architecture. These architectures offer a new level in the memory hierarchy through an on-package high-bandwidth memory. Current approaches usually do not take this new memory level into account, and new scheduling algorithms and data partitioning schemes are needed to take advantage of this deep memory hierarchy. On the resilience side, we explore the impact of failures on co-scheduling performance. The co-scheduling approach has been demonstrated in a fault-free context, but large-scale computer systems are confronted with frequent failures, and resilience techniques must be employed for large applications to execute efficiently; indeed, failures may create severe imbalance between applications and significantly degrade performance. We aim at minimizing the expected completion time of a set of co-scheduled applications in a failure-prone context by redistributing processors. We study the complexity of the problem with a theoretical model, design heuristics, and perform a comprehensive set of simulations with a fault simulator, which demonstrates the effectiveness of the proposed heuristics.
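As a rough illustration of cache partitioning among co-scheduled applications, the Python sketch below greedily hands out LLC ways based on assumed, profiled miss-rate curves; the curves and the greedy rule are illustrative stand-ins, not the strategies evaluated in the thesis:

MISS = {                       # hypothetical miss counts per number of ways granted
    "A": [100, 60, 45, 40, 39, 39, 39, 39, 39],
    "B": [100, 90, 70, 40, 20, 15, 14, 14, 14],
}
TOTAL_WAYS = 8

def partition(miss, total_ways):
    ways = {app: 1 for app in miss}          # every app gets at least one way
    for _ in range(total_ways - len(miss)):
        # give the next way to the app with the largest marginal miss reduction
        best = max(miss, key=lambda a: miss[a][ways[a]-1] - miss[a][ways[a]])
        ways[best] += 1
    return ways

print(partition(MISS, TOTAL_WAYS))   # e.g. {'A': 3, 'B': 5}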
APA, Harvard, Vancouver, ISO, and other styles
38

Yang, Weishuai. "Scalable and effective clustering, scheduling and monitoring of self-organizing grids." Diss., Online access via UMI:, 2008.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
39

Laga, Arezki. "Optimisation des performance des logiciels de traitement de données sur les périphériques de stockage SSD." Thesis, Brest, 2018. http://www.theses.fr/2018BRES0087/document.

Full text
Abstract:
Today's vertiginous growth in data volumes puts pressure on storage infrastructures and data processing software such as DataBase Management Systems (DBMS). New technologies have emerged to cope with these large data volumes. In this thesis we consider the new secondary memory technologies, in particular flash-based SSD (Solid State Drive) storage devices. SSDs offer performance up to 10 times higher than magnetic storage devices; however, they also exhibit a new performance model, which calls for I/O cost optimization in data processing and management algorithms. We propose an I/O cost model on SSDs for data processing algorithms; this model mainly considers the data volume, the allocated memory space, the SSD I/O performance, and the data distribution. We also propose a new external sorting algorithm, MONTRES, which is optimized to reduce the I/O cost when the volume of data to sort is several times the size of main memory. We finally propose a data prefetching mechanism, Lynx, which uses a machine learning technique to predict and anticipate future reads from secondary memory.
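A toy example of what an I/O cost model with SSD-style read/write asymmetry can look like is sketched below for a classic external merge sort; the cost constants and the algorithm are generic illustrations, not MONTRES or the thesis's calibrated model:

import math

C_READ, C_WRITE = 1.0, 3.0   # relative cost per page read / write (hypothetical)

def external_merge_sort_io(n_pages, mem_pages):
    # run formation pass + merge passes of a classic multiway merge sort
    runs = math.ceil(n_pages / mem_pages)
    passes = 1 + math.ceil(math.log(runs, mem_pages - 1)) if runs > 1 else 1
    reads = writes = n_pages * passes     # every pass reads and rewrites the data
    return reads * C_READ + writes * C_WRITE

print(external_merge_sort_io(n_pages=100_000, mem_pages=1_000))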
APA, Harvard, Vancouver, ISO, and other styles
40

Watson, Myles G. "Does the Halting Necessary for Hardware Trace Collection Inordinately Perturb the Results?" Diss., CLICK HERE for online access, 2004. http://contentdm.lib.byu.edu/ETD/image/etd594.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Péneau, Pierre-Yves. "Intégration de technologies de mémoires non volatiles émergentes dans la hiérarchie de caches pour améliorer l'efficacité énergétique." Thesis, Montpellier, 2018. http://www.theses.fr/2018MONTS108/document.

Full text
Abstract:
Today, intensive efforts to design energy-efficient and high-performance systems-on-chip (SoCs) are underway. The slowdown of Moore's law at the beginning of the 21st century pushed designers to increase the number of cores per processor to keep improving performance. As a result, the silicon area occupied by cache memories has increased, and ever smaller technology nodes have also increased the leakage current of CMOS transistors. The energy consumption of memories therefore represents an increasingly important share of the overall consumption of chips. To reduce this consumption, new memory technologies have emerged over the past decade: non-volatile memories (NVM). These memories have the particularity of a very low leakage current compared to conventional CMOS technologies, so their use in an architecture would reduce the overall consumption of the cache hierarchy. However, these technologies suffer from higher access latencies than SRAM, higher access energy costs, and limited lifetime, and their integration into SoCs requires a continuous research effort. This thesis aims to evaluate the impact of a change of technology in the cache hierarchy. More specifically, it focuses on the last-level cache (LLC), and the non-volatile technology considered is STT-MRAM. Our work adopts an architectural point of view in which modifying the technology itself is not considered; instead, we try to integrate the distinct characteristics of STT-MRAM at the architectural level when designing the memory hierarchy. A first study sets up an architectural exploration framework for systems containing emerging memories. A second study on architectural optimizations at the LLC level identifies opportunities for integrating STT-MRAM. The goal is to improve energy efficiency while mitigating the access penalties due to the high latency of this technology.
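The leakage-versus-write-cost tradeoff described above can be captured by a first-order energy model; the sketch below uses made-up parameter values purely for illustration, not the thesis's measured figures:

def llc_energy(leak_mw, e_read_nj, e_write_nj, reads, writes, seconds):
    static = leak_mw * 1e-3 * seconds                          # joules of leakage
    dynamic = (reads * e_read_nj + writes * e_write_nj) * 1e-9 # joules of accesses
    return static + dynamic

sram = llc_energy(leak_mw=500, e_read_nj=0.5, e_write_nj=0.5,
                  reads=1e8, writes=5e7, seconds=1.0)
sttmram = llc_energy(leak_mw=25, e_read_nj=0.4, e_write_nj=2.5,
                     reads=1e8, writes=5e7, seconds=1.0)
print(f"SRAM {sram:.3f} J vs STT-MRAM {sttmram:.3f} J")   # writes cost more, leakage far less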
APA, Harvard, Vancouver, ISO, and other styles
42

Puche, Lara José. "Novel Cache Hierarchies with Photonic Interconnects for Chip Multiprocessors." Doctoral thesis, Universitat Politècnica de València, 2021. http://hdl.handle.net/10251/165254.

Full text
Abstract:
Current multicores face the challenge of sharing resources among the different processor cores. Two main shared resources act as major performance bottlenecks in current designs: the off-chip main memory bandwidth and the last-level cache. Additionally, as the core count grows, the network-on-chip is also becoming a potential performance bottleneck, since traditional designs may face scalability issues in the near future. Memory hierarchies communicated through fast interconnects are implemented in almost every current design, as they reduce the number of off-chip accesses and the overall latency. Main memory, caches, and interconnection resources, together with other widely used techniques like prefetching, help alleviate the huge memory access latencies and limit the impact of the core-memory speed gap. However, sharing these resources brings several concerns, one of the most challenging being the management of inter-application interference. Since almost every running application needs to access main memory, all of them are exposed to interference from co-runners on their way to the memory controller. For this reason, making efficient use of the available cache space, together with achieving fast and scalable interconnects, is critical to sustain performance in current and future designs. This dissertation analyzes and addresses the most important shortcomings of two major shared resources: the last-level cache (LLC) and the network-on-chip (NoC). First, we study the scalability of both electrical and optical NoCs for future multicores and many-cores. To perform this study, we model optical interconnects in a cycle-accurate multicore simulation framework; a proper model is required, since important performance deviations may otherwise be observed in the evaluation results. The study reveals that, as the core count grows, the effect of distance on end-to-end latency can negatively impact processor performance. In contrast, the study also shows that silicon nanophotonics is a viable solution to the mentioned latency problems. This dissertation is also motivated by important design concerns related to current memory hierarchies, like the oversizing of private cache space, data replication overheads, and lack of flexibility regarding the sharing of cache structures. These issues, which can be overcome in high-performance processors by virtue of huge LLCs, can compromise performance in low-power processors. To address them, we propose a more efficient cache hierarchy organization that leverages optical interconnects. The proposed architecture is conceived as an optically interconnected two-level cache hierarchy composed of multiple cache modules that can be dynamically turned on and off independently. Experimental results show that, compared to conventional designs, static energy consumption is improved by up to 60% while achieving similar performance. Finally, we extend the proposal to support both sequential and parallel applications. This extension is required because the proposal adapts to the dynamic cache space needs of the running applications, and the behavior of multithreaded applications widely differs from that of single-threaded programs. In addition, coherence management is addressed, which is challenging since in the proposed approach each cache module can be assigned to any core at a given time. For parallel applications, the evaluation shows that the proposal achieves up to 78% static energy savings.
In summary, this thesis tackles major challenges originated by the sharing of on-chip caches and communication resources in current multicores, and proposes new cache hierarchy organizations leveraging optical interconnects to address them. The proposed organizations reduce both static and dynamic energy consumption compared to conventional approaches while achieving similar performance, which results in better energy efficiency.
Puche Lara, J. (2021). Novel Cache Hierarchies with Photonic Interconnects for Chip Multiprocessors [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/165254
APA, Harvard, Vancouver, ISO, and other styles
43

Milani, Emanuele. "Efficient Execution of Sequential Instructions Streams by Physical Machines." Doctoral thesis, Università degli studi di Padova, 2014. http://hdl.handle.net/11577/3424604.

Full text
Abstract:
Any computational model which relies on a physical system is likely to be subject to the fact that information density and speed have intrinsic, ultimate limits. The RAM model, and in particular its underlying assumption that memory accesses can be carried out in time independent of memory size, is not physically implementable. This work has developed in the field of limiting-technology machines, in which it is somewhat provocatively assumed that technology has reached these physical limits. The ultimate goal is to tackle the problem of the intrinsic latencies of physical systems by encouraging scalable organizations for processors and memories. An algorithmic study is presented, depicting the implementation of high-concurrency programs for SP and SPE, sequential machine models able to compute direct-flow programs in optimal time. Then, a novel pipelined, hierarchical memory organization is presented, with optimal latency and bandwidth for a physical system. In order to take full advantage of the memory capabilities and exploit the available instruction-level parallelism of the code to be executed, a novel processor model is developed, with particular care put into devising an efficient information flow within the processor itself. Both designs are extremely scalable, as they are based on fixed-capacity, fixed-size nodes connected as a multidimensional array. Performance analysis of the resulting machine design has led to the discovery that latencies internal to the processor can be the dominating source of complexity in instruction flow execution, adding to the effects of processor-memory interaction. A characterization of instruction flows is then developed, based on the topology induced by instruction dependences.
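The premise that memory access time cannot be size-independent in a physical system can be made concrete with a small sketch; the unit delay is arbitrary, and the scaling exponents simply follow from assuming bounded information density and signal speed:

def access_latency(bits, dim=3, unit_delay=1.0):
    # a memory of S bits occupies space proportional to S, so worst-case signal
    # travel grows like S**(1/2) on a planar chip or S**(1/3) in 3D, not O(1)
    return unit_delay * bits ** (1.0 / dim)

for size in (2**10, 2**20, 2**30):
    print(f"{size:>12} bits: 2D {access_latency(size, 2):9.1f}  3D {access_latency(size, 3):7.1f}")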
APA, Harvard, Vancouver, ISO, and other styles
44

Muppalaneni, Nitin. "Adaptive Hierarchial RAID." Thesis, Indian Institute of Science, 1998. http://hdl.handle.net/2005/50.

Full text
Abstract:
Redundant Arrays of Inexpensive Disks, or RAID, is a popular method of improving the reliability and performance of disk storage. Of the various levels of RAID, the mirrored (RAID1) and rotating parity (RAID5) configurations have become most popular. Mirrored RAID1 provides the best overall performance and is easier to configure, but has a 100 percent storage overhead for the redundancy. Rotating parity RAID5, on the other hand, is quite inexpensive for the redundancy it provides and shows impressive performance for reads and full-stripe writes in normal mode, but its small-write performance is poor due to the read-modify-write cycle involved, and performance drops drastically when one of the disks fails and the system enters degraded mode. RAID5 is also relatively difficult to configure. Typical non-scientific system disk access patterns exhibit very high locality of reference. This thesis presents the design and implementation of an adaptive hierarchical RAID array that exploits this high locality: frequently accessed data is migrated towards the top of the hierarchy and less frequently accessed data is moved down the hierarchy, so the array adaptively rearranges itself to match the access patterns. Previous work on adaptive hierarchical RAID, such as HP AutoRAID, has explored one part of the design space, namely configurable storage at the SCSI level with no interaction with higher-level layers like the volume manager. This thesis explores a different design point, one centered at the volume manager layer. This is important also because, with Fibre Channel disks and SCSI-3, Storage Area Networks (SAN) no longer need a conventional controller, but rather a modified version of a controller that is much closer to a volume manager.
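A minimal sketch of the promotion/demotion idea behind an adaptive hierarchical RAID follows; the threshold, capacities, and policy are hypothetical simplifications, not the design implemented in the thesis:

from collections import Counter

RAID1_CAPACITY = 2            # blocks that fit in the fast mirrored tier
access_count = Counter()
raid1, raid5 = set(), {"b1", "b2", "b3", "b4"}

def read_block(block):
    access_count[block] += 1
    if block in raid5 and access_count[block] >= 3:   # promote hot block
        if len(raid1) >= RAID1_CAPACITY:              # demote the coldest first
            coldest = min(raid1, key=lambda b: access_count[b])
            raid1.discard(coldest)
            raid5.add(coldest)
        raid5.discard(block)
        raid1.add(block)

for b in ["b1", "b1", "b1", "b2", "b2", "b2", "b3", "b3", "b3"]:
    read_block(b)
print("RAID1 tier:", raid1, "RAID5 tier:", raid5)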
APA, Harvard, Vancouver, ISO, and other styles
45

Bottero, Margherita. "Slave trades, credit records and strategic reasoning : four essays in microeconomics." Doctoral thesis, Handelshögskolan i Stockholm, Institutionen för Nationalekonomi, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:hhs:diva-1281.

Full text
Abstract:
This thesis consists of four independent chapters, in which well-known economic theories are employed to investigate, and better understand, data and facts from the real world. Although on fairly distant topics, each paper is an example of how economics, and more precisely microeconomics, offers a rigorous and effective framework to reason about what happens around us. In this sense, my dissertation fully represents what I have learnt in these five years. The first paper addresses the experimental behavior of subjects that interact with each other, non-cooperatively, in a laboratory setup. The experimental evidence is found to be at odds with the predictions of classical game theory, and I explore whether a model of bounded rationality can instead succeed in explaining the data. The second paper looks at another type of data, historical rather than experimental. Together with Björn Wallace, we raise doubts, methodological and interpretational, regarding the validity of a recent finding that documents a sizeable effect of Africa's past slave trades on current economic performance. The last two papers investigate the phenomenon of limited records, understood as the limited availability of past public data about a transacting partner. The former is a survey, written jointly with Giancarlo Spagnolo, in which we discuss the literatures that have independently studied whether limited records may actually prompt beneficial reputation effects. We argue that what is known about this type of informational arrangement is little and scattered, which is problematic given the large number of real-life situations featuring limited records. These conclusions prepare the ground for the last paper of the dissertation, which presents a model of limited credit records. The model aims at providing a framework for evaluating the current privacy provisions in the credit market, which mandate the removal of information about borrowers' past performance from public registers after a finite number of years.
Diss. Stockholm : Handelshögskolan i Stockholm, 2011
APA, Harvard, Vancouver, ISO, and other styles
46

Damasceno, Alexandro Lima. "O impacto da hierarquia de memória sobre a arquitetura IPNoSys." Universidade Federal Rural do Semi-Árido, 2016. http://bdtd.ufersa.edu.br:80/tede/handle/tede/654.

Full text
Abstract:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
Over the years, as technology has advanced, the search for improvements in the performance of computer systems has been constant. Computer systems have evolved both in processing capacity and in the complexity of the implemented architectures. In such systems the use of memories is crucial, since they are responsible for storing the data to be processed. In an ideal world, memories would have unlimited storage capacity, instantaneous data access, and an extremely low cost per bit; in real systems they do not exhibit these characteristics, as storage capacity, speed, and cost per bit grow in proportion to one another. One technique used to balance these factors and improve the performance of computer systems is the memory hierarchy. In the scenario of new technologies and proposals for new processor organizations, a model that has been adopted by computer system designers is the use of MPSoCs (multiprocessor systems-on-chip), which offer higher energy and computational efficiency. In this scenario with many processing elements, the use of networks-on-chip (NoCs) is more efficient than buses. A NoC consists of a set of routers and interconnected channels forming a switched network; the cores are connected to the network terminals and communication occurs through the exchange of packets. NoCs have traditionally been designed exclusively for communication in SoCs. However, one unconventional architecture project chose to integrate processing and communication in a NoC: the IPNoSys (Integrated Processing NoC System) architecture, an unconventional processor that uses networks-on-chip and implements processing and routing units to handle and process instructions. It takes advantage of the characteristics of NoCs, such as scalability and parallel communication, to efficiently execute programs that exploit thread-level parallelism. Currently, the IPNoSys architecture has four memories physically distributed at the corners of the network which together form a unified address space, each memory module being associated with an access unit in charge of managing it. Given this organization, this work develops a new memory hierarchy system for IPNoSys and investigates its impact on performance and on the programming model.
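As an illustration of how a unified address space can be spread over four corner memory modules, the sketch below uses high-order interleaving so that each access unit owns one quadrant of the addresses; this mapping is a hypothetical example, not necessarily the one IPNoSys uses:

MODULES = ["NW", "NE", "SW", "SE"]      # access units at the mesh corners
ADDR_BITS, MODULE_BITS = 16, 2

def route(addr):
    module = MODULES[addr >> (ADDR_BITS - MODULE_BITS)]   # top 2 bits pick a corner
    offset = addr & ((1 << (ADDR_BITS - MODULE_BITS)) - 1)
    return module, offset

print(route(0x0123))   # ('NW', 0x123)
print(route(0xC123))   # ('SE', 0x123)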
APA, Harvard, Vancouver, ISO, and other styles
47

Jiménez, González Daniel. "Algoritmos de ordenación conscientes de la arquitectura y las características de los datos." Doctoral thesis, Universitat Politècnica de Catalunya, 2004. http://hdl.handle.net/10803/5982.

Full text
Abstract:
In this thesis we analyze and propose sequential and parallel sorting algorithms that exploit the memory hierarchy of the computer used and/or reduce data communication. The objectives are thus not different from those of other works; the way we achieve them, however, is different: we do so by making the proposed sorting algorithms conscious of the computer architecture and of the characteristics of the data set we want to sort.

We focus on data sets that fit in main memory but not in cache.

The algorithms we present adapt, at run time, to the data characteristics (duplicates, skew, etc.) in order to avoid any loss of performance depending on those characteristics. This is done by partitioning the data set using Mutant Reverse Sorting, a partitioning technique proposed in this thesis.

Mutant Reverse Sorting dynamically adapts to the data characteristics and the computer architecture. It analyzes a sample of the data set to choose the fastest way to partition it, selecting between Reverse Sorting and Counting Split depending on the data distribution. Both of these techniques are also proposed in this thesis.

The analysis of the partitioning techniques and the proposed sorting algorithms is performed on an IBM computer based on p630 nodes with Power4 processors and on an SGI O2000 with R10K processors. For both computers we present models of the algorithms' behavior and compare them with real executions.

With all this, we obtain the fastest sequential and parallel sorting algorithms for the data characteristics and computers used, thanks to the fact that these algorithms adapt to the computer and the data characteristics better than the other algorithms analyzed.

On one hand, the proposed sequential algorithm, SKC-Radix sort, achieves performance improvements of more than 25% compared to the best algorithms found in the literature; moreover, the larger the number of duplicates or the data skew, the larger the improvement achieved by SKC-Radix sort. On the other hand, the proposed parallel algorithm, PSKC-Radix sort, sorts up to 4 times as many keys as Load Balanced Radix sort in the same amount of time. Load Balanced Radix sort was the fastest in-memory parallel sorting algorithm for the kind of data sets we focus on until the publication of our work.
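The sample-driven selection idea behind Mutant Reverse Sorting can be sketched as follows; the decision rule here (distinct-value ratio in a random sample) is a made-up stand-in for the actual test that chooses between Reverse Sorting and Counting Split:

import random

def choose_partitioner(data, sample_size=1024):
    sample = random.sample(data, min(sample_size, len(data)))
    distinct_ratio = len(set(sample)) / len(sample)
    # many duplicates or heavy skew -> counting-based split; else reverse sorting
    return "counting_split" if distinct_ratio < 0.5 else "reverse_sorting"

uniform = list(range(100_000))
skewed = [random.choice(range(10)) for _ in range(100_000)]
print(choose_partitioner(uniform), choose_partitioner(skewed))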
APA, Harvard, Vancouver, ISO, and other styles
48

Alba, de la Torre Celia. "Wh-questions in Catalan sign language." Doctoral thesis, Universitat Pompeu Fabra, 2016. http://hdl.handle.net/10803/397751.

Full text
Abstract:
This dissertation offers a characterization and an analysis of wh-questions in Catalan Sign Language, which show the particularity of canonically placing wh-expressions in sentence-final position. This feature, specific to sign languages, has been difficult to handle within traditional models, which have often considered wh-movement to be universally leftward and have often assumed that syntactic structure encodes information about the linear order of linguistic elements. The dissertation argues that syntactic hierarchy and linear order are two different objects with limited impact on one another, and that the latter depends mainly on the mechanisms of linguistic processing and, specifically, on working memory. In that sense, the hypothesis that the difference in the placement of wh-elements between sign languages and spoken languages is due to differences in working memory is put forward. To explore it, the results of two experiments with Deaf and hearing participants are discussed.
APA, Harvard, Vancouver, ISO, and other styles
49

Kaeslin, Alain E. "Performance Optimisation of Discrete-Event Simulation Software on Multi-Core Computers." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-191132.

Full text
Abstract:
SIMLOX is a discrete-event simulation software developed by Systecon AB for analysing logistic support solution scenarios. To cope with ever larger problems, SIMLOX's simulation engine was recently enhanced with a parallel execution mechanism in order to take advantage of multi-core processors. However, this extension did not result in the desired reduction in runtime for all simulation scenarios, even though the parallelisation strategy applied had promised linear speedup. Therefore, an in-depth analysis of the limiting scalability bottlenecks became necessary and has been carried out in this project. Through the use of a low-overhead profiler and microarchitecture analysis, the root causes were identified: atomic operations causing a high communication overhead, poor locality leading to translation lookaside buffer thrashing, and hot spots that consume significant amounts of CPU time. Subsequently, appropriate optimisations to overcome the limiting factors were implemented: eliminating the expensive operations, handling heap memory more efficiently through the use of a scalable memory allocator, and using data structures that make better use of caches. Experimental evaluation using real-world test cases demonstrated a speedup of at least 6.75x on an eight-core processor, with most cases even achieving a speedup of more than 7.2x. The various optimisations implemented further helped to lower run times for sequential execution by 1.5x or more. It can be concluded that achieving nearly linear speedup on a multi-core processor is possible in practice for discrete-event simulation.
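One of the optimisation classes named above, eliminating expensive atomic operations, can be illustrated by replacing a shared counter with per-worker counters that are merged once at the end; this is a generic Python sketch, not SIMLOX code:

from concurrent.futures import ThreadPoolExecutor

N_WORKERS, EVENTS_PER_WORKER = 8, 100_000

def worker(_):
    local_count = 0                    # thread-private: no contended shared state
    for _ in range(EVENTS_PER_WORKER):
        local_count += 1               # stands in for per-event statistics updates
    return local_count                 # merged once, after the run

with ThreadPoolExecutor(N_WORKERS) as pool:
    total = sum(pool.map(worker, range(N_WORKERS)))
print(total)                           # 800000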
APA, Harvard, Vancouver, ISO, and other styles
50

Cargnini, Luís Vitório. "Applications des technologies mémoires MRAM appliquées aux processeurs embarqués." Thesis, Montpellier 2, 2013. http://www.theses.fr/2013MON20091/document.

Full text
Abstract:
With the advent of submicron manufacturing flows below 45 nm, the semiconductor industry began to face new challenges in order to keep evolving according to Moore's law. Regarding the widespread adoption of embedded systems, a major constraint has become the power consumption of the IC. Moreover, memory technologies such as SRAM, the current standard integrated memory technology for the memory hierarchy, or flash for non-volatile storage, face extremely intricate constraints to keep yielding memory arrays at technology nodes below 45 nm. Notably, non-volatile memories have so far not been adopted into the memory hierarchy, due to their density and, like flash, the necessity of multi-voltage operation. This thesis works within these constraints and provides some answers. It presents methods, and results extracted from these methods, to support our goal of delineating a roadmap for adopting a new memory technology: non-volatile, low-power, low-leakage, SEU/MEU-resistant, scalable, with performance similar to SRAM, and with an areal density 4 to 8 times that of an SRAM cell, without the need for multi-voltage domains as in flash. This memory is MRAM (magnetic memory), according to the ITRS a candidate to replace SRAM in the near future. Instead of storing charge, MRAM stores the magnetic orientation provided by the spin-torque orientation of the free-layer alloy in the MTJ (Magnetic Tunnel Junction). Spin is a quantum state of matter which, in some metallic materials, can have its orientation or torque switched by applying a current polarized in the direction of the desired field orientation. Once the magnetic field orientation is set, it is possible to measure it using a sense amplifier and a current flowing through the MTJ, the memory cell element of MRAM, by observing the resistance variation: the higher the resistance, the lower the passing current, and the sense amplifier identifies a logic zero; with lower resistance, it senses a logic one. The information is therefore not a stored charge but a magnetic field orientation, which is why it is not affected by SEUs or MEUs caused by high-energy particles; nor can voltage variations alter the memory cell content by trapping charges in a floating gate. Regarding MRAM, this thesis addresses the following aspects of MRAM applied to the memory hierarchy: it describes the current state of the art in MRAM design and its use in the memory hierarchy; it provides an overview of a mechanism to mitigate the write latency of MRAM at the cache level (the composite memory bank principle); it analyzes the power characteristics of an MRAM-based system at the L1 and L2 caches, using a dedicated evaluation flow; it proposes a methodology to infer system power consumption and performance; and finally, building on the memory bank analysis, it describes a composite memory bank, a simple way to generate a memory bank with some compromise in power but latency equivalent to SRAM, which maintains similar performance.
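The read principle described above can be summarized in a few lines: a sense amplifier compares the MTJ resistance against a reference to decode the stored bit. The resistance values below are illustrative, not device measurements; the zero/one convention follows the abstract:

R_LOW, R_HIGH = 2_000.0, 4_000.0      # ohms: the two MTJ states (hypothetical values)
R_REF = (R_LOW + R_HIGH) / 2          # reference the sense amplifier compares against

def sense(mtj_resistance_ohms):
    # high resistance -> low read current -> logic 0; low resistance -> logic 1
    return 1 if mtj_resistance_ohms < R_REF else 0

print(sense(2_050.0), sense(3_900.0))   # 1 0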
APA, Harvard, Vancouver, ISO, and other styles