Dissertations / Theses on the topic 'Memory hierarchy'
Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles
Consult the top 50 dissertations / theses for your research on the topic 'Memory hierarchy.'
Next to every source in the list of references, there is an 'Add to bibliography' button. Press on it, and we will generate automatically the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the academic publication as pdf and read online its abstract whenever available in the metadata.
Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.
Gan, Yee Ling. "Redesigning the memory hierarchy for memory-safe programming languages." Thesis, Massachusetts Institute of Technology, 2018. http://hdl.handle.net/1721.1/119765.
Full textThis electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 69-75).
We present Hotpads, a new memory hierarchy designed from the ground up for modern, memory-safe languages like Java, Go, and Rust. Memory-safe languages hide the memory layout from the programmer. This prevents memory corruption bugs, improves programmability, and enables automatic memory management. Hotpads extends the same insight to the memory hierarchy: it hides the memory layout from software and enables hardware to take control over it, dispensing with the conventional flat address space abstraction. This avoids the need for associative caches and virtual memory. Instead, Hotpads moves objects across a hierarchy of directly-addressed memories. It rewrites pointers to avoid most associative lookups, provides hardware support for memory allocation, and unifies hierarchical garbage collection and data placement. As a result, Hotpads improves memory performance and efficiency substantially, and unlocks many new optimizations. This thesis contributes important optimizations for Hotpads and a comprehensive evaluation of Hotpads against prior work.
by Yee Ling Gan.
M. Eng.
Dublish, Saumay Kumar. "Managing the memory hierarchy in GPUs." Thesis, University of Edinburgh, 2018. http://hdl.handle.net/1842/31205.
Full textTaylor, Nathan Bryan. "Cachekata : memory hierarchy optimization via dynamic binary translation." Thesis, University of British Columbia, 2013. http://hdl.handle.net/2429/44335.
Full textKim, Jinwoo. "Memory hierarchy management through off-line computational learning." Diss., Georgia Institute of Technology, 2003. http://hdl.handle.net/1853/8194.
Full textGreenwald, Benjamin Eliot 1975. "A technique for compilation to exposed memory hierarchy." Thesis, Massachusetts Institute of Technology, 1999. http://hdl.handle.net/1721.1/87163.
Full text"September 1999."
Includes bibliographical references (p. 55-56).
by Benjamin Eliot Greenwald.
S.M.
Dimić, Vladimir. "Runtime-assisted optimizations in the on-chip memory hierarchy." Doctoral thesis, Universitat Politècnica de Catalunya, 2020. http://hdl.handle.net/10803/670363.
Full textSiguiendo la ley de Moore, el número de transistores en los chips ha crecido exponencialmente, lo que ha comportado una mayor complejidad en los procesadores modernos y, como resultado, de la dificultad de la programación eficiente de estos sistemas. Se han desarrollado muchos modelos de programación para resolver este problema; un ejemplo particular son los modelos de programación basados en tareas, que emplean anotaciones sencillas para definir los Trabajos paralelos de una aplicación. La información de que disponen los sistemas en tiempo de ejecución (runtime systems) asociada con estos modelos de programación ofrece un enorme potencial para la mejora del diseño del hardware. Por otro lado, las limitaciones tecnológicas hacen que la ley de Moore pueda dejar de cumplirse próximamente, por lo que se necesitan paradigmas nuevos para mantener las tendencias actuales de mejora de rendimiento. El objetivo principal de esta tesis es aprovechar el conocimiento de las aplicaciones paral·leles de que dispone el runtime system para mejorar el diseño de la jerarquía de memoria del chip. El acoplamiento del runtime system junto con el microprocesador permite realizar mejores diseños hardware sin afectar Negativamente en la programabilidad de dichos sistemas. La primera contribución de esta tesis consiste en un conjunto de políticas de inserción para las memorias caché compartidas de último nivel que aprovecha la información de las tareas y las dependencias de datos entre estas. La intuición tras esta propuesta se basa en la observación de que los hilos de ejecución paralelos muestran distintos patrones de acceso a memoria e, incluso dentro del mismo hilo, los accesos a diferentes variables a menudo siguen patrones distintos. Las políticas que se proponen insertan líneas de caché en posiciones lógicas diferentes en función de los tipos de dependencia y tarea a los que corresponde la petición de memoria. La segunda propuesta optimiza la ejecución de las reducciones, que se definen como un patrón de programación que combina datos de entrada para conseguir la variable de reducción como resultado. Esto se consigue mediante una técnica asistida por el runtime system para la realización de reducciones en la jerarquía de la caché del procesador, con el objetivo de ser una solución aplicable de forma universal sin depender del tipo de la variable de la reducción, su tamaño o el patrón de acceso. A nivel de software, el modelo de programación se extiende para que el programador especifique las variables de reducción de las tareas, así como el nivel de caché escogido para que se realice una determinada reducción. El compilador fuente a Fuente (compilador source-to-source) y el runtime ssytem se modifican para que traduzcan y pasen esta información al hardware subyacente, evitando así movimientos de datos innecesarios hacia y desde el núcleo del procesador, al realizarse la operación donde se encuentran los datos de la misma. La tercera contribución proporciona un esquema de priorización asistido por el runtime system para peticiones de memoria dentro de la jerarquía de memoria del chip. La propuesta se basa en la noción de camino crítico en el contexto de los códigos paralelos y en el hecho conocido de que acelerar tareas críticas reduce el tiempo de ejecución de la aplicación completa. En el contexto de este trabajo, la criticidad de las tareas se considera a nivel del tipo de tarea ya que permite que el programador las indique mediante anotaciones sencillas. La aceleración de las tareas críticas se consigue priorizando las correspondientes peticiones de memoria en el microprocesador.
Seguint la llei de Moore, el nombre de transistors que contenen els xips ha patit un creixement exponencial, fet que ha provocat un augment de la complexitat dels processadors moderns i, per tant, de la dificultat de la programació eficient d’aquests sistemes. Per intentar solucionar-ho, s’han desenvolupat diversos models de programació; un exemple particular en són els models basats en tasques, que fan servir anotacions senzilles per definir treballs paral·lels dins d’una aplicació. La informació que hi ha al nivell dels sistemes en temps d’execució (runtime systems) associada amb aquests models de programació ofereix un gran potencial a l’hora de millorar el disseny del maquinari. D’altra banda, les limitacions tecnològiques fan que la llei de Moore pugui deixar de complir-se properament, per la qual cosa calen nous paradigmes per mantenir les tendències actuals en la millora de rendiment. L’objectiu principal d’aquesta tesi és aprofitar els coneixements que el runtime System té d’una aplicació paral·lela per millorar el disseny de la jerarquia de memòria dins el xip. L’acoblament del runtime system i el microprocessador permet millorar el disseny del maquinari sense malmetre la programabilitat d’aquests sistemes. La primera contribució d’aquesta tesi consisteix en un conjunt de polítiques d’inserció a les memòries cau (cache memories) compartides d’últim nivell que aprofita informació sobre tasques i les dependències de dades entre aquestes. La intuïció que hi ha al darrere d’aquesta proposta es basa en el fet que els fils d’execució paral·lels mostren diferents patrons d’accés a la memòria; fins i tot dins el mateix fil, els accessos a variables diferents sovint segueixen patrons diferents. Les polítiques que s’hi proposen insereixen línies de la memòria cau a diferents ubicacions lògiques en funció dels tipus de dependència i de tasca als quals correspon la petició de memòria. La segona proposta optimitza l’execució de les reduccions, que es defineixen com un patró de programació que combina dades d’entrada per aconseguir la variable de reducció com a resultat. Això s’aconsegueix mitjançant una tècnica assistida pel runtime system per dur a terme reduccions en la jerarquia de la memòria cau del processador, amb l’objectiu que la proposta sigui aplicable de manera universal, sense dependre del tipus de la variable a la qual es realitza la reducció, la seva mida o el patró d’accés. A nivell de programari, es realitza una extensió del model de programació per facilitar que el programador especifiqui les variables de les reduccions que usaran les tasques, així com el nivell de memòria cau desitjat on s’hauria de realitzar una certa reducció. El compilador font a font (compilador source-to-source) i el runtime system s’amplien per traduir i passar aquesta informació al maquinari subjacent. A nivell de maquinari, les memòries cau privades i compartides s’equipen amb unitats funcionals i la lògica corresponent per poder dur a terme les reduccions a la pròpia memòria cau, evitant així moviments de dades innecessaris entre el nucli del processador i la jerarquia de memòria. La tercera contribució proporciona un esquema de priorització assistit pel runtime System per peticions de memòria dins de la jerarquia de memòria del xip. La proposta es basa en la noció de camí crític en el context dels codis paral·lels i en el fet conegut que l’acceleració de les tasques que formen part del camí crític redueix el temps d’execució de l’aplicació sencera. En el context d’aquest treball, la criticitat de les tasques s’observa al nivell del seu tipus ja que permet que el programador les indiqui mitjançant anotacions senzilles. L’acceleració de les tasques crítiques s’aconsegueix prioritzant les corresponents peticions de memòria dins el microprocessador.
Low, Douglas Wai Kok. "Network processor memory hierarchy designs for IP packet classification /." Thesis, Connect to this title online; UW restricted, 2005. http://hdl.handle.net/1773/6973.
Full textVitkovskiy, Arseniy <1979>. "Memory hierarchy and data communication in heterogeneous reconfigurable SoCs." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2008. http://amsdottorato.unibo.it/1127/1/Tesi_Vitkovskiy_Arseniy.pdf.
Full textVitkovskiy, Arseniy <1979>. "Memory hierarchy and data communication in heterogeneous reconfigurable SoCs." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2008. http://amsdottorato.unibo.it/1127/.
Full textSenni, Sophiane. "Exploration of non-volatile magnetic memory for processor architecture." Thesis, Montpellier, 2015. http://www.theses.fr/2015MONTS264/document.
Full textWith the downscaling of the complementary metal-oxide semiconductor (CMOS) technology,designing dense and energy-efficient systems-on-chip (SoC) is becoming a realchallenge. Concerning the density, reducing the CMOS transistor size faces up to manufacturingconstraints while the cost increases exponentially. Regarding the energy, a significantincrease of the power density and dissipation obstructs further improvement inperformance. This issue is mainly due to the growth of the leakage current of the CMOStransistors, which leads to an increase of the static energy consumption. Observing currentSoCs, more and more area is occupied by embedded volatile memories, such as staticrandom access memory (SRAM) and dynamic random access memory (DRAM). As a result,a significant proportion of total power is spent into memory systems. In the past twodecades, alternative memory technologies have emerged with attractive characteristics tomitigate the aforementioned issues. Among these technologies, magnetic random accessmemory (MRAM) is a promising candidate as it combines simultaneously high densityand very low static power consumption while its performance is competitive comparedto SRAM and DRAM. Moreover, MRAM is non-volatile. This capability, if present inembedded memories, has the potential to add new features to SoCs to enhance energyefficiency and reliability. In this thesis, an area, performance and energy exploration ofembedding the MRAM technology in the memory hierarchy of a processor architectureis investigated. A first fine-grain exploration was made at cache level for multi-core architectures.A second study evaluated the possibility to design a non-volatile processorintegrating MRAM at register level. Within the context of internet of things, new featuresand the benefits brought by the non-volatility were investigated
Ramirez, Jon. "Analysis of compute cluster nodes with varying memory hierarchy distributions." To access this resource online via ProQuest Dissertations and Theses @ UTEP, 2009. http://0-proquest.umi.com.lib.utep.edu/login?COPT=REJTPTU0YmImSU5UPTAmVkVSPTI=&clientId=2515.
Full textGhosh, Mrinmoy. "Microarchitectural techniques to reduce energy consumption in the memory hierarchy." Diss., Atlanta, Ga. : Georgia Institute of Technology, 2009. http://hdl.handle.net/1853/28265.
Full textCommittee Chair: Lee, Hsien-Hsin S.; Committee Member: Cahtterjee,Abhijit; Committee Member: Mukhopadhyay, Saibal; Committee Member: Pande, Santosh; Committee Member: Yalamanchili, Sudhakar.
Kjelso, Morten. "A quantitative evaluation of data compression in the memory hierarchy." Thesis, Loughborough University, 1997. https://dspace.lboro.ac.uk/2134/10596.
Full textAzarkhish, Erfan <1985>. "Memory Hierarchy Design for Next Generation Scalable Many-core Platforms." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2016. http://amsdottorato.unibo.it/7255/1/erfan-thesis-final.pdf.
Full textAzarkhish, Erfan <1985>. "Memory Hierarchy Design for Next Generation Scalable Many-core Platforms." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2016. http://amsdottorato.unibo.it/7255/.
Full textde, Souza Ferreira Tharso. "Improving Memory Hierarchy Performance on MapReduce Frameworks for Multi-Core Architectures." Doctoral thesis, Universitat Autònoma de Barcelona, 2013. http://hdl.handle.net/10803/129468.
Full textThe need of analyzing large data sets from many different application fields has fostered the use of simplified programming models like MapReduce. Its current popularity is justified by being a useful abstraction to express data parallel processing and also by effectively hiding synchronization, fault tolerance and load balancing management details from the application developer. MapReduce frameworks have also been ported to multi-core and shared memory computer systems. These frameworks propose to dedicate a different computing CPU core for each map or reduce task to execute them concurrently. Also, Map and Reduce phases share a common data structure where main computations are applied. In this work we describe some limitations of current multi-core MapReduce frameworks. First, we describe the relevance of the data structure used to keep all input and intermediate data in memory. Current multi-core MapReduce frameworks are designed to keep all intermediate data in memory. When executing applications with large data input, the available memory becomes too small to store all framework intermediate data and there is a severe performance loss. We propose a memory management subsystem to allow intermediate data structures the processing of an unlimited amount of data by the use of a disk spilling mechanism. Also, we have implemented a way to manage concurrent access to disk of all threads participating in the computation. Finally, we have studied the effective use of the memory hierarchy by the data structures of the MapReduce frameworks and proposed a new implementation of partial MapReduce tasks to the input data set. The objective is to make a better use of the cache and to eliminate references to data blocks that are no longer in use. Our proposal was able to significantly reduce the main memory usage and improves the overall performance with the increasing of cache memory usage.
Agarwal, Khushbu. "A partition based approach to approximate tree mining a memory hierarchy perspective /." Columbus, Ohio : Ohio State University, 2008. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=osu1196284256.
Full textAgarwal, Khushbu. "A partition based approach to approximate tree mining : a memory hierarchy perspective." The Ohio State University, 2007. http://rave.ohiolink.edu/etdc/view?acc_num=osu1196284256.
Full textMehta, Nishant K. "A Hierarchy Navigation Framework: Supporting Scalable Interactive Exploration over Large Databases." Link to electronic thesis, 2004. http://www.wpi.edu/Pubs/ETD/Available/etd-0827104-114148/.
Full textXiang, Ping. "ANALYZING INSTRUCTTION BASED CACHE REPLACEMENT POLICIES." Master's thesis, University of Central Florida, 2010. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/2589.
Full textM.S.
School of Electrical Engineering and Computer Science
Engineering and Computer Science
Computer Engineering MSCpE
Huang, Cheng-Chieh. "Optimizing cache utilization in modern cache hierarchies." Thesis, University of Edinburgh, 2016. http://hdl.handle.net/1842/19571.
Full textSeong, Nak Hee. "A reliable, secure phase-change memory as a main memory." Diss., Georgia Institute of Technology, 2012. http://hdl.handle.net/1853/50123.
Full textSampaio, Felipe Martin. "Energy-efficient memory hierarchy for motion and disparity estimation in multiview video coding." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2013. http://hdl.handle.net/10183/71292.
Full textThis Master Thesis proposes a memory hierarchy for the Motion and Disparity Estimation (ME/DE) centered on the encoding references, called Reference-Centered Data Reuse (RCDR), focusing on energy reduction in the Multiview Video Coding (MVC). In the MVC encoders the ME/DE represents more than 98% of the overall energy consumption. Moreover, in the overall ME/DE energy, up to 90% is related to the memory issues, and only 10% is related to effective computation. The two items to be concerned with: (1) off-chip memory communication to fetch the reference samples (45%) and (2) on-chip memory to keep stored the search window samples and to send them to the ME/DE processing core (45%). The main goal of this work is to jointly minimize the on-chip and off-chip energy consumption in order to reduce the overall energy related to the ME/DE on MVC. The memory hierarchy is composed of an onchip video memory (which stores the entire search window), an on-chip memory gating control, and a partial results compressor. A search control unit is also proposed to exploit the search behavior to achieve further energy reduction. This work also aggregates to the memory hierarchy a low-complexity reference frame compressor. The experimental results proved that the proposed system accomplished the goal of the work of jointly minimizing the on-chip and off-chip energies. The RCDR provides off-chip energy savings of up to 68% when compared to state-of-the-art. the traditional MBcentered approach. The partial results compressor is able to reduce by 52% the off-chip memory communication to handle this RCDR penalty. When compared to techniques that do not access the entire search window, the proposed RCDR also achieve the best results in off-chip energy consumption due to the regular access pattern that allows lots of DDR burst reads (30% less off-chip energy consumption). Besides, the reference frame compressor is capable to improve by 2.6x the off-chip memory communication savings, along with negligible losses on MVC encoding performance. The on-chip video memory size required for the RCDR is up to 74% smaller than the MB-centered Level C approaches. On top of that, the power-gating control is capable to save 82% of leakage energy. The dynamic energy is treated due to the candidate merging technique, with savings of more than 65%. Due to the jointly off-chip communication and on-chip storage energy savings, the proposed memory hierarchy system is able to meet the MVC constraints for the ME/DE processing.
Khan, Muneeb. "Optimizing Performance in Highly Utilized Multicores with Intelligent Prefetching." Doctoral thesis, Uppsala universitet, Datorarkitektur och datorkommunikation, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-272095.
Full textLodde, Mario. "Smart Memory and Network-On-Chip Design for High-Performance Shared-Memory Chip Multiprocessors." Doctoral thesis, Universitat Politècnica de València, 2014. http://hdl.handle.net/10251/35325.
Full textLodde, M. (2014). Smart Memory and Network-On-Chip Design for High-Performance Shared-Memory Chip Multiprocessors [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/35325
TESIS
Bonatto, Alexsandro Cristóvão. "Controle adaptativo para acesso à memória compartilhada em sistemas em chip." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2014. http://hdl.handle.net/10183/109193.
Full textThe number of Processing Elements (PE) contained in a System-on-Chip (SoC) follows the growth of the number of transistors per chip. A SoC composed of multiple PEs, in some ap- plications such as multimedia, implements algorithms that handle large volumes of data and justify the use of an external memory with large capacity. External memory accesses are shared by multiple PEs adding challenges that may have special attention because they constitute the bottleneck for performance and relevant factor for power consumption. In the case where the PEs are microprocessors, this issue becomes even more evident as the rate of increase of speed of microprocessors exceeds the rate of increase in speed of DRAM. This effect is called “mem- ory wall” and represents that the bottleneck processing is related to the speed of data access. In this scenario, new access control strategies are needed to improve processing performance. Heterogeneous platforms for multimedia processing are formed by several PEs. The concur- rent accesses to DRAM reduce bandwidth and increase latency access to data, degrading the processing performance. This thesis shows that significant improvements in computational effi- ciency can be obtained using a design methodology oriented to the functional aspects of DRAM through a memory subsystem with adaptive management. It is presented the data communica- tion architecture for integration of PEs system based on an analytical model to reduce latency and guarantee Quality of Service (QoS). The memory subsystem is organized as a hierarchy of memories, with a proposed integration of PEs oriented centered in the main memory. The memory subsystem centralized data communication and manages the access of several PEs sys- tem so that communication is served according to the needs of each application. This thesis proposes three major contributions: 1) a methodology for design integrated systems based on the memory-centric design approach, 2) an analytical model based on delays used to evaluate the worst-case performance of the memory subsystem, 3) an arbiter for adaptive management of accesses to the external memory with guaranteed execution times of transactions.
Jiménez, Castells Marta. "Multilevel tiling for non-rectangular interation spaces." Doctoral thesis, Universitat Politècnica de Catalunya, 1999. http://hdl.handle.net/10803/6007.
Full textEn la tesis se muestra que la razón principal por la que los compiladores comerciales actuales consiguen bajos rendimiento en este tipo de aplicaciones es que no son capaces de aplicar loop tiling a nivel de registros. En su lugar, para mejorar la localidad de los datos y el ILP, los compiladores actuales usan y combinan otras transformaciones que no explotan el nivel de registros tan bien como loop tiling. Previamente no se ha considerado aplicar loop tiling a nivel de registro porque en códigos numéricos complejos no es trivial debido a la naturaleza irregular de los espacios de iteraciones. La primera contribución de esta tesis es un algoritmo general de loop tiling a nivel de registros que es aplicable a cualquier tipo de espacio de iteraciones y no sólo a los espacios rectangulares.
Nuestro método incluye una heurística muy sencilla para decidir los parámetros de los cortes a nivel de registros. A primera vista parece que loop tiling a nivel de registros (a partir de ahora, register tiling) se tiene que aplicar de tal manera que el bucle que ofrece más reuso temporal de los datos no debe de ser partido. De esta manera maximizamos la reutilización de los registros y minimizamos el número total de load/stores ejecutados. Sin embargo, mostraremos que en espacios de iteraciones no rectangulares, si solamente tenemos en cuenta las direcciones del reuso y no la forma del espacio de iteraciones, los códigos pueden sufrir una degradación en rendimiento. Nuestra segunda contribución es la propuesta de una heurística muy sencilla que determina los parámetros del tiling a nivel de registros considerando no sólo el reuso temporal sino también la forma del espacio de iteraciones. Además, la heurística es suficientemente sencilla para que pueda ser implementada en un compilador comercial.
Sin embargo, para conseguir rendimientos similares que códigos optimizados a mano, no es suficiente con aplicar loop tiling a nivel de registros. Con las arquitecturas de hoy en día que disponen de jerarquías de memoria complejas y múltiples procesadores, es necesario que el compilador aplique loop tiling en cuatro o más niveles (paralelismo, cache L2, cache L1 y registros) para conseguir altos rendimientos. Por lo tanto, en las arquitecturas actuales es crucial tener un algoritmo eficiente para aplicar loop tiling en varios niveles de la jerarquía de memoria (tiling multinivel). Además, como mostramos en esta tesis, la transformación de tiling multinivel siempre tendrá que incluir el nivel de registro porque este es el nivel de la jerarquía de memoria que ofrece mayor rendimiento cuando es tratado correctamente.
Cuando tiling multinivel incluye el nivel de registros, es necesario que los límites de los bucles sean exactos y que no haya límites redundantes. La razón es que la complejidad y la cantidad de código que se genera con nuestra técnica de register tiling depende polinómicamente del número de límites de los bucles.
Sin embargo, hasta ahora, el problema de calcular límites exactos y eliminar límites redundantes es que todas las técnicas conocidas son muy caras en términos de tiempo de compilación y, por ello, difícil de integrar en un compilador comercial. La tercera contribución de esta tesis es una nueva implementación de tiling multinivel que calcula límites exactos y es mucho menos costosa que técnicas tradicionales. Mostraremos que la complejidad de nuestra implementación es proporcional a la complejidad de aplicar una permutación de bucles en el código original (antes de aplicar loop tiling), mientras que las técnicas tradicionales tienen complejidades más altas. Además, nuestra implementación genera menos límites redundantes y permite eliminar los límites redundantes que quedan a menor coste. En conjunto, la eficiencia de nuestra implementación hace posible que pueda ser implementada dentro de un compilador comercial sin tener que preocuparse por los tiempos de compilación.
La última parte de esta tesis está dedicada al estudio del rendimiento de tiling multinivel. Se muestran los efectos de tiling en los diferentes niveles de memoria y presentamos datos que comparan los beneficios de tiling a nivel de registros, tiling a nivel de cache y tiling a los dos niveles, cache y registros, simultáneamente. Finalmente, comparamos el rendimiento de códigos optimizados automáticamente con códigos optimizados manualmente (librerías numéricas que ofrecen los fabricantes) sobre dos arquitecturas diferentes (ALPHA 21164 and MIPS R10000) para concluir que actualmente la tecnología de los compiladores hace posible que códigos numéricos complejos consigan el mismo rendimiento que códigos optimizados manualmente.
The main motivation of this thesis is to develop new compilation techniques that address the lack of performance of complex numerical codes consisting of loop nests defining non-rectangular iteration spaces. Specifically, we focus on the loop tiling transformation (also known as blocking) and our purpose is the improvement of the loop tiling transformation when dealing with complex numerical codes. Our goal is to achieve via the loop tiling transformation the same or better performance as hand-optimized vendor-supplied numerical libraries.
We will observe that the main reason why current commercial compilers perform poorly when dealing with this type of codes is that they do not apply tiling for the register level. Instead, to enhance locality at this level and to improve ILP, they use/combine other transformations that do not exploit the register level as well as loop tiling. Tiling for the register level has not generally been considered because, in complex numerical codes, it is far from being trivial due to the irregular nature of the iteration space. Our first contribution in this thesis will be a general compiler algorithm to perform tiling at the register level that handles arbitrary iteration space shapes and not only simple rectangular shapes.
Our method includes a very simple heuristic to make the tile decisions for the register level. At first sight, register tiling should be performed so that whichever loop carries the most temporal reuse is not tiled. This way, register reuse is maximized and the number of load/store instructions executed is minimized. However, we will show that, for complex loop nests, if we only consider reuse directions and do not take into account the iteration space shape, the tiled loop nest can suffer performance degradation. Our second contribution will be a proposal of a very simple heuristic to determine the tiling parameters for the register level, that considers not only temporal reuse, but also the iteration space shape. Moreover, the heuristic is simple enough to be suitable for automatic implementation by compilers.
However, to be able to achieve similar performance to hand-optimized codes, it is not enough by tiling only for the register level. With today's architectures having complex memory hierarchies and multiple processors, it is quite common that the compiler has to perform tiling at four or more levels (parallelism, L2-cache, L1-cache and registers) in order to achieve high performance. Therefore, in today's architectures it is crucial to have an efficient algorithm that can perform multilevel tiling at multiple levels of the memory hierarchy. Moreover, as we will see in this thesis, multilevel tiling should always include the register level, as this is the memory hierarchy level that yields most performance when properly tiled.
When multilevel tiling includes the register level, it is critical to compute exact loop bounds and to avoid the generation of redundant bounds. The reason is that the complexity and the amount of code generated by our register tiling technique both depend polynomially on the number of loop bounds. However, to date, the drawback of generating exact loop bounds and eliminating redundant bounds has been that all techniques known were extremely expensive in terms of compilation time and, thus, difficult to integrate in a production compiler. Our third contribution in this thesis will be a new implementation of multilevel tiling that computes exact loop bounds at a much lower complexity than traditional techniques. In fact, we will show that the complexity of our implementation is proportional to the complexity of performing a loop permutation in the original loop nest (before tiling), while traditional techniques have much larger complexities. Moreover, our implementation generates less redundant bounds in the multilevel tiled code and allows removing the remaining redundant bounds at a lower cost. Overall, the efficiency of our implementation makes it possible to integrate multilevel tiling including the register level in a production compiler without having to worry about compilation time.
The last part of this thesis is dedicated to studying the performance of multilevel tiling. We will discuss the effects of tiling for different memory levels and present quantitative data comparing the benefits of tiling only for the register level, tiling only for the cache level and tiling for both levels simultaneously. Finally, we will compare automatically-optimized codes against hand-optimized vendor-supplied numerical libraries, on two different architectures (ALPHA 21164 and MIPS R10000), to conclude that compiler technology can make it possible for complex numerical codes to achieve the same performance as hand-optimized codes on modern microprocessors.
SOHONI, SOHUM. "IMPROVING L2 CACHE PERFORMANCE THROUGH STREAM-DIRECTED OPTIMIZATIONS." University of Cincinnati / OhioLINK, 2004. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1092932892.
Full textChen, Feng. "On Performance Optimization and System Design of Flash Memory based Solid State Drives in the Storage Hierarchy." The Ohio State University, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=osu1280537880.
Full textMolka, Daniel. "Performance Analysis of Complex Shared Memory Systems." Doctoral thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2017. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-221729.
Full textDavari, Mahdad. "Advances Towards Data-Race-Free Cache Coherence Through Data Classification." Doctoral thesis, Uppsala universitet, Avdelningen för datorteknik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-320595.
Full textLima, Pilla Laércio. "Topology-aware load balancing for performance portability over parallel high performance systems." Thesis, Grenoble, 2014. http://www.theses.fr/2014GRENM028.
Full textThis thesis presents our research to provide performance portability and scalability to complex scientific applications running over hierarchical multicore parallel platforms. Performance portability is said to be attained when a low core idleness is achieved while mapping a given application to different platforms, and can be affected by performance problems such as load imbalance and costly communications, and overheads coming from the task mapping algorithm. Load imbalance is a result of irregular and dynamic load behaviors, where the amount of work to be processed varies depending on the task and the step of the simulation. Meanwhile, costly communications are caused by a task distribution that does not take into account the different communication times present in a hierarchical platform. This includes nonuniform and asymmetric communication costs at memory and network levels. Lastly, task mapping overheads come from the execution time of the task mapping algorithm trying to mitigate load imbalance and costly communications, and from the migration of tasks.Our approach to achieve the goal of performance portability is based on the hypothesis that precise machine topology information can help task mapping algorithms in their decisions. In this context, we proposed a generic machine topology model of parallel platforms composed of one or more multicore compute nodes. It includes profiled latencies and bandwidths at memory and network levels, and highlights asymmetries and nonuniformity at both levels. This information is employed by our three proposed topology-aware load balancing algorithms, named NucoLB, HwTopoLB, and HierarchicalLB. Besides topology information, these algorithms also employ application information gathered during runtime. NucoLB focuses on the nonuniform aspects of parallel platforms, while HwTopoLB considers the whole hierarchy in its decisions, and HierarchicalLB combines these algorithms hierarchically to reduce its task mapping overhead. These algorithms seek to mitigate load imbalance and costly communications while averting task migration overheads.Experimental results with the proposed load balancers over different platform composed of one or more multicore compute nodes showed performance improvements over state of the art load balancing algorithms: NucoLB presented improvements of up to 19% on one compute node; HwTopoLB experienced performance improvements of 19% on average; and HierarchicalLB outperformed HwTopoLB by 22% on average on parallel platforms with ten or more compute nodes. These results were achieved by equalizing work among the available resources, reducing the communication costs experienced by applications, and by keeping load balancing overheads low. In this sense, our load balancing algorithms provide performance portability to scientific applications while being independent from application and system architecture
Silvestri, Francesco. "Oblivious Computations on Memory and Network Hierarchies." Doctoral thesis, Università degli studi di Padova, 2009. http://hdl.handle.net/11577/3426414.
Full textL'organizzazione gerarchica dei sistemi di memoria e di comunicazione e la disponibilità di molte unità di calcolo influiscono notevolmente sulle prestazioni di un algoritmo. Il loro efficiente utilizzo è limitato dalle differenti configurazioni che possono assumere. È cruciale, per motivi economici e di portabilità, che gli algoritmi si adattino allo spettro delle piattaforme esistenti e che la procedura di adattamento sia il più possibile automatizzata. L'adattività può essere raggiunta sia tramite algoritmi aware, che utilizzano esplicitamente opportuni parametri architetturali, sia tramite algoritmi oblivious, la cui sequenza di operazioni è indipendente dalle caratteristiche dell'architettura sottostante. Gli algoritmi oblivious esibiscono spesso prestazioni ottime su diverse architetture e sono attrattivi soprattutto nei contesti in cui i parametri architetturali sono difficili da stimare o non sono noti. Questa tesi si focalizza sullo studio degli algoritmi oblivious con due obiettivi principali: l'indagine delle potenzialità e limitazioni delle computazioni oblivious e l'introduzione del concetto di algoritmo oblivious in un sistema parallelo. Inizialmente, vengono affrontate varie problematiche legate all'esecuzione di algoritmi cache-oblivious per permutazioni razionali, le quali rappresentano un'importante classe di permutazioni che include la trasposizione di matrici e il bit-reversal di vettori. Si dimostra un lower bound, valido anche per algoritmi cache-aware, sul numero di operazioni macchina necessarie per eseguire una permutazione razionale assumendo un numero ottimo di accessi alla memoria. Quindi, si descrive un algoritmo cache-oblivious che esegue qualsiasi permutazione razionale e che richiede un numero ottimo di operazioni macchina e di accessi alla memoria, assumendo l'ipotesi di tall-cache. Infine, si dimostra che per certe famiglie di permutazioni razionali (tra cui la trasposizione e il bit-reversal) non può esistere un algoritmo cache-oblivious che richieda un numero ottimo di accessi alla memoria per tutti i valori dei parametri della cache, mentre esiste un algoritmo cache-aware con tali caratteristiche. Nella tesi viene poi introdotto il framework network-oblivious per lo studio di algoritmi oblivious in un sistema parallelo. Il framework esplora lo sviluppo di algoritmi paralleli di tipo bulk-synchronous che, senza usare parametri dipendenti dalla macchina, hanno prestazioni efficienti su macchine parallele con differenti gradi di parallelismo e valori di banda/latenza. Vengono inoltre forniti algoritmi network-oblivious per alcuni problemi chiave (moltiplicazione e trasposizione di matrici, trasformata discreta di Fourier, ordinamento) e viene presentato un risultato di impossibilità sull'esecuzione di algoritmi network-oblivious per la trasposizione di matrici che ricorda quello ottenuto per le permutazioni razionali. Infine, per mostrare ulteriormente le potenzialità del framework, vengono presentati algoritmi network-oblivious ottimi per eseguire un'ampia classe di computazioni risolvibili tramite il paradigma di programmazione ad eliminazione gaussiana, tra cui il calcolo dei cammini minimi in un grafo, l'eliminazione gaussiana senza pivoting e la moltiplicazione di matrici.
Candel, Margaix Francisco. "Efficient L2 Cache Management to Boost GPGPU Performance." Doctoral thesis, Universitat Politècnica de València, 2019. http://hdl.handle.net/10251/125477.
Full text[CAT] En els últims anys, la creixent necessitat de capacitat de còmput ha suposat un repte que ha portat a la indústria a buscar arquitectures alternatives als processadors superescalars amb execució fora d'ordre convencionals, amb l'objectiu d'incrementar la potència de còmput alhora que s'aconsegueix una major eficiència energètica. Les arquitectures GPU, les quals fins fa només una dècada es dedicaven exclusivament a l'acceleració dels gràfics en els computadors, han sigut una de les alternatives més utilitzades durant alguns anys per a aconseguir l'esmentat objectiu. Una de les característiques particulars de les GPU és el seu elevat ample de banda per a accedir a memòria principal, la qual cosa permet executar un gran nombre de fils de forma molt eficient. Aquesta característica, així com la seua elevada potència computacional executant operacions de coma flotant, ha originat l'aparició del paradigma de computació anomenat GPGPU computing, paradigma on les GPU realitzen còmput de propòsit general. Les citades característiques converteixen a les GPU en dispositius especialment apropiats per a l'execució d'aplicacions massivament paral·leles que tradicionalment s'havien executat en processadors convencionals d'altes prestacions. El treball desenvolupat en aquesta tesi persegueix ajudar a millorar les prestacions de les GPU en l'execució de les aplicacions GPGPU. A aquest efecte, com a primer pas, es realitza un estudi de caracterització on s'identifiquen les característiques més importants d'aquestes aplicacions des del punt de vista de la jerarquia de memòria i el seu impacte en les prestacions. Per a això s'utilitza un simulador detallat cicle a cicle on es modela l'arquitectura d'una GPU recent. L'estudi revela que és necessari modelar de forma més detallada alguns components crítics de la jerarquia de memòria de les GPU per a obtindre resultats precisos. Els resultats obtinguts mostren que les prestacions aconseguides poden variar fins i tot en un factor de 3× depenent de com es modelen aquests components crítics. Per aquest motiu, com a segon pas abans d'elaborar la proposta de millora, el treball se centra en determinar quins components de la jerarquia de memòria de la GPU necessiten modelar-se amb major detall per a millorar la precisió dels resultats del simulador i en millorar els models existents d'aquests components. A més, es realitza un estudi de validació que compara els resultats obtinguts amb els models millorats contra els d'una GPU comercial real. Les millores implementades redueixen la desviació dels resultats del simulador sobre els resultats reals al voltant d'un 96%. Finalment, una vegada millorada la precisió del simulador, en aquesta tesi es presenta una proposta innovadora, denominada FRC (sigles en anglés de Fetch and Replacement Cache), que millora en gran manera la potència computacional de la GPU, gràcies a que augmenta el paral·lelisme en l'accés a memòria principal. La proposta incrementa el nombre d'accessos en paral·lel a memòria principal mitjançant l'acceleració de la gestió de les accions de recerca i reemplaçament relacionades amb els accessos que fallen en la cache. La proposta FRC es basa en una xicoteta estructura cache auxiliar que descongestiona el subsistema de memòria eficientment, augmentant les prestacions de la GPU fins a un 118% de mitjana respecte al sistema base. A més, també redueix, al voltant d'un 57%, el consum energètic de la jerarquia de memòria.
[EN] In recent years, the growing need for computing capacity has become a challenge that has led the industry to look for alternative architectures to conventional out-of-order superscalar processors, with the goal of enabling an increase of computing power while achieving higher energy efficiency. GPU architectures, which just a decade ago were applied to accelerate computer graphics exclusively, have been one of the most employed alternatives for several years to reach the mentioned goal. A particular characteristic of GPUs is their high main memory bandwidth, which allows executing a large number of threads in a very efficient way. This feature, as well as their high computational power regarding floating-point operations, have caused the emergence of the GPGPU computing paradigm, where GPU architectures perform general purpose computations. The aforementioned characteristics make GPU devices very appropriate for the execution of massively parallel applications that have been traditionally executed in conventional high-performance processors. The work performed in this thesis aims to help improve the performance of GPUs in the execution of GPGPU applications. To this end, as a first step, a characterization study is carried out. In this study, the most important features of GPGPU applications, with respect to the memory hierarchy and its impact on performance, are identified. For this purpose, a detailed cycle-accurate simulator is used to model the architecture of a recent GPU. The study reveals that it is necessary to model with more detail some critical components of the GPU memory hierarchy in order to obtain accurate results. In addition, it shows that the achieved benefits can vary up to a factor of 3× depending on how these critical components are modeled. Due to this reason, as a second step before realizing a novel proposal, the work in this thesis focuses on determining which components of the GPU memory hierarchy must be modeled with more detail to increase the accuracy of simulator results and improving the existing simulator models of these components. Moreover, a validation study is performed comparing the results obtained with the improved GPU models against those from a real commercial GPU. The implemented simulator improvements reduce the deviation of the results obtained with the simulator from results obtained with the real GPU by about 96%. Finally, once simulation accuracy is increased, this thesis proposes a novel approach, called FRC (Fetch and Replacement Cache), which highly improves the GPU computational power by enhancing main memory-level parallelism. The proposal increases the number of parallel accesses to main memory by accelerating the management of fetch and replacement actions corresponding to those cache accesses that miss in the cache. The FRC approach is based on a small auxiliary cache structure that efficiently unclogs the memory subsystem, enhancing the GPU performance up to 118% on average compared to the studied baseline. In addition, the FRC approach reduces the energy consumption of the memory hierarchy by a 57%.
Candel Margaix, F. (2019). Efficient L2 Cache Management to Boost GPGPU Performance [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/125477
TESIS
Rabbah, Rodric Michel. "Design Space Exploration and Optimization of Embedded Memory Systems." Diss., Georgia Institute of Technology, 2006. http://hdl.handle.net/1853/11605.
Full textJacquelin, Mathias. "Memory-aware algorithms : from multicores to large scale platforms." Phd thesis, Ecole normale supérieure de lyon - ENS LYON, 2011. http://tel.archives-ouvertes.fr/tel-00662525.
Full textPottier, Loïc. "Co-scheduling for large-scale applications : memory and resilience." Thesis, Lyon, 2018. http://www.theses.fr/2018LYSEN039/document.
Full textThis thesis explores co-scheduling problems in the context of large-scale applications with two main focus: the memory side, in particular the cache memory and the resilience side.With the recent advent of many-core architectures such as chip multiprocessors (CMP), the number of processing units is increasing.In this context, the benefits of co-scheduling techniques have been demonstrated. Recall that, the main idea behind co-scheduling is to execute applications concurrently rather than in sequence in order to improve the global throughput of the platform.But sharing resources often generates interferences.With the arising number of processing units accessing to the same last-level cache, those interferences among co-scheduled applications becomes critical.In addition, with that increasing number of processors the probability of a failure increases too.Resiliency aspects must be taking into account, specially for co-scheduling because failure-prone resources might be shared between applications.On the memory side, we focus on the interferences in the last-level cache, one solution used to reduce these interferences is the cache partitioning.Extensive simulations demonstrate the usefulness of co-scheduling when our efficient cache partitioning strategies are deployed.We also investigate the same problem on a real cache partitioned chip multiprocessors, using the Cache Allocation Technology recently provided by Intel.In a second time, still on the memory side, we study how to model and schedule task graphs on the new many-core architectures, such as Knights Landing architecture.These architectures offer a new level in the memory hierarchy through a new on-packagehigh-bandwidth memory. Current approaches usually do not take intoaccount this new memory level, however new scheduling algorithms anddata partitioning schemes are needed to take advantage of this deepmemory hierarchy.On the resilience, we explore the impact on failures on co-scheduling performance.The co-scheduling approach has been demonstrated in a fault-free context, but large-scale computer systems are confronted by frequent failures, and resilience techniques must be employed for large applications to execute efficiently. Indeed, failures may create severe imbalance between applications, and significantly degrade performance.We aim at minimizing the expected completion time of a set of co-scheduled applications in a failure-prone context by redistributing processors
Yang, Weishuai. "Scalable and effective clustering, scheduling and monitoring of self-organizing grids." Diss., Online access via UMI:, 2008.
Find full textLaga, Arezki. "Optimisation des performance des logiciels de traitement de données sur les périphériques de stockage SSD." Thesis, Brest, 2018. http://www.theses.fr/2018BRES0087/document.
Full textThe growing volume of data poses a real challenge to data processing software like DBMS (DataBase Management Systems) and data storage infrastructure. New technologies have emerged in order to face the data volume challenges. We considered in this thesis the emerging new external memories like flash memory-based storage devices named SSD (Solid State Drive).SSD storage devices offer a performance gain compared to the traditional magnetic devices.However, SSD devices offer a new performance model that involves 10 cost optimization for data processing and management algorithms.We proposed in this thesis an 10 cost model to evaluate the data processing algorithms. This model considers mainly the SSD 10 performance and the data distribution.We also proposed a new external sorting algorithm: MONTRES. This algorithm includes optimizations to reduce the 10 cost when the volume of data is greater than the allocated memory space by an order of magnitude. We proposed finally a data prefetching mechanism: Lynx. This one makes use of a machine learning technique to predict and to anticipate future access to the external memory
Watson, Myles G. "Does the Halting Necessary for Hardware Trace Collection Inordinately Perturb the Results?" Diss., CLICK HERE for online access, 2004. http://contentdm.lib.byu.edu/ETD/image/etd594.pdf.
Full textPéneau, Pierre-Yves. "Intégration de technologies de mémoires non volatiles émergentes dans la hiérarchie de caches pour améliorer l'efficacité énergétique." Thesis, Montpellier, 2018. http://www.theses.fr/2018MONTS108/document.
Full textToday, intensive efforts to design energy-efficient and high-performance systems-on-chip (SoCs) are underway. Moore’s end in the early 20 th century pushed designers to increase the number of core per processor to continue to improve the performance. As a result, the silicon area occupied by cache memories has increased. The ever smaller technology node also increased the leakage current of CMOS transistors. Thus, the energy consumption of memories represents an increasingly important part in the overall consumption of chips.To reduce this energy consumption, new memory technologies have emerged overthe past decade : non-volatile memories (NVM). These memories have the particularity of having a very low leakage current compared to conventional CMOS technologies. In fact, their use in an architecture would reduce the overall consumption of the cache hierarchy. However, these technologies sufferfrom higher access latencies than SRAM, higher access energy costs and limitedlifetime. Their integration into SoCs requires a continuous research effort.This thesis work aims to evaluate the impact of a change in technology in the cache hierarchy. More specifically, we are interested in the Last-Level Cache(LLC) and we consider the STT-MRAM technology. Our work adopts an architectural point of view in which a modification of the technology is not retained. Then,we try to integrate the different characteristics of the STT-MRAM atarchitectural level when designing the memory hierarchy. A first study set upan architectural exploration framework for systems containing emerging memories. A second study on architectural optimizations at LLC was conducted toidentify opportunities for the integration of STT-MRAM. The goal is to improve energy efficiency while reducing access penalties due to the high latency ofthis technology
Puche, Lara José. "Novel Cache Hierarchies with Photonic Interconnects for Chip Multiprocessors." Doctoral thesis, Universitat Politècnica de València, 2021. http://hdl.handle.net/10251/165254.
Full text[CA] Els processadors multinucli actuals compten amb recursos compartits entre els diferents nuclis. Dos d'aquests recursos compartits, la memòria d’últim nivell i l'ample de banda de memòria principal, poden convertir-se en colls d'ampolla per al rendiment. A mes, amb el creixement del nombre de nuclis que implementen els dissenys mes recents, la xarxa dins del xip també es converteix en un coll d'ampolla que pot afectar negativament el rendiment, ja que les xarxes tradicionals poden trobar limitacions a la seva escalabilitat en el futur proper. Pràcticament la totalitat dels dissenys actuals implementen jerarquies de memòria que es comuniquen mitjançant rapides xarxes d’interconnexió. Aquesta organització es eficaç ates que permet reduir el nombre d'accessos que es realitzen a memòria principal i la latència mitjana d’accés a memòria. Les caches, la xarxa d’interconnexió i la memòria principal, conjuntament amb altres tècniques conegudes com la prebúsqueda, permeten reduir les enormes latències d’accés a memòria principal, limitant així l'impacte negatiu ocasionat per la diferencia de rendiment existent entre els nuclis de còmput i la memòria. No obstant això, compartir els recursos esmentats és font de diversos problemes i reptes, sent un dels principals la gestió de la interferència entre aplicacions. Fer un us eficient de la jerarquia de memòria i les caches, així com comptar amb una xarxa d’interconnexió apropiada, es necessari per sostenir el creixement del rendiment en els dissenys tant actuals com futurs. Aquesta tesi analitza i estudia els principals problemes i inconvenients observats en aquests dos recursos: la memòria cache d’últim nivell i la xarxa dins del xip. En primer lloc, s'estudia l'escalabilitat de les xarxes tradicionals dins del xip amb topologia de malla, així com aquesta es pot veure compromesa en propers dissenys que compten amb major nombre de nuclis. Els resultats d'aquest estudi mostren que, a major nombre de nuclis, l'impacte negatiu de la distància entre nuclis en la latència pot afectar seriosament al rendiment del processador. Com a solució' a aquest problema, en aquesta tesi proposem una xarxa d’interconnexió' òptica modelada en un entorn de simulació detallat, que suposa una solució viable als problemes d'escalabilitat observats en els dissenys tradicionals. A continuació, aquesta tesi dedica un esforç important a identificar i proposar solucions als principals problemes de disseny de les jerarquies de memòria actuals com son, per exemple, el sobredimensionat de l'espai de memòria cache privat, l’existència de repliques de dades i la rigidesa i incapacitat d’adaptació' de les estructures de memòria cache. Encara que ben coneguts, aquests problemes i els seus efectes adversos en el rendiment poden ser evitats en processadors d'alt rendiment gracies a l'enorme capacitat de la memòria cache d’últim nivell que aquest tipus de processadors típicament implementen. No obstant això, en processadors de baix consum, no hi ha la possibilitat de comptar amb aquestes capacitats, i fer un us eficient de l'espai disponible es torna crític per mantenir el rendiment. Com a solució a aquests problemes en processadors de baix consum, proposem una nova organització de jerarquia de dos nivells de memòria cache que utilitza una xarxa d’interconnexió òptica. Els resultats obtinguts mostren que, comparat amb dissenys convencionals, el consum d'energia estàtica en l'arquitectura proposada és un 60% menor, malgrat que els resultats de rendiment presenten valors similars. Per últim, hem estes l'arquitectura proposada per donar suport tant a aplicacions paral·leles com seqüencials. Els resultats obtinguts amb aquesta nova arquitectura mostren un estalvi de fins al 78 % d'energia estàtica en l’execució d'aplicacions paral·leles.
[EN] Current multicores face the challenge of sharing resources among the different processor cores. Two main shared resources act as major performance bottlenecks in current designs: the off-chip main memory bandwidth and the last level cache. Additionally, as the core count grows, the network on-chip is also becoming a potential performance bottleneck, since traditional designs may find scalability issues in the near future. Memory hierarchies communicated through fast interconnects are implemented in almost every current design as they reduce the number of off-chip accesses and the overall latency, respectively. Main memory, caches, and interconnection resources, together with other widely-used techniques like prefetching, help alleviate the huge memory access latencies and limit the impact of the core-memory speed gap. However, sharing these resources brings several concerns, being one of the most challenging the management of the inter-application interference. Since almost every running application needs to access to main memory, all of them are exposed to interference from other co-runners in their way to the memory controller. For this reason, making an efficient use of the available cache space, together with achieving fast and scalable interconnects, is critical to sustain the performance in current and future designs. This dissertation analyzes and addresses the most important shortcomings of two major shared resources: the Last Level Cache (LLC) and the Network on Chip (NoC). First, we study the scalability of both electrical and optical NoCs for future multicoresand many-cores. To perform this study, we model optical interconnects in a cycle-accurate multicore simulation framework. A proper model is required; otherwise, important performance deviations may be observed otherwise in the evaluation results. The study reveals that, as the core count grows, the effect of distance on the end-to-end latency can negatively impact on the processor performance. In contrast, the study also shows that silicon nanophotonics are a viable solution to solve the mentioned latency problems. This dissertation is also motivated by important design concerns related to current memory hierarchies, like the oversizing of private cache space, data replication overheads, and lack of flexibility regarding sharing of cache structures. These issues, which can be overcome in high performance processors by virtue of huge LLCs, can compromise performance in low power processors. To address these issues we propose a more efficient cache hierarchy organization that leverages optical interconnects. The proposed architecture is conceived as an optically interconnected two-level cache hierarchy composed of multiple cache modules that can be dynamically turned on and off independently. Experimental results show that, compared to conventional designs, static energy consumption is improved by up to 60% while achieving similar performance results. Finally, we extend the proposal to support both sequential and parallel applications. This extension is required since the proposal adapts to the dynamic cache space needs of the running applications, and multithreaded applications's behaviors widely differ from those of single threaded programs. In addition, coherence management is also addressed, which is challenging since each cache module can be assigned to any core at a given time in the proposed approach. For parallel applications, the evaluation shows that the proposal achieves up to 78% static energy savings. In summary, this thesis tackles major challenges originated by the sharing of on-chip caches and communication resources in current multicores, and proposes new cache hierarchy organizations leveraging optical interconnects to address them. The proposed organizations reduce both static and dynamic energy consumption compared to conventional approaches while achieving similar performance; which results in better energy efficiency.
Puche Lara, J. (2021). Novel Cache Hierarchies with Photonic Interconnects for Chip Multiprocessors [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/165254
TESIS
Milani, Emanuele. "Efficient Execution of Sequential Instructions Streams by Physical Machines." Doctoral thesis, Università degli studi di Padova, 2014. http://hdl.handle.net/11577/3424604.
Full textQualsiasi modello computazionale basato su un sistema fisico e' verosimilmente soggetto al fatto che densita' e velocita' di propagazione dell’informazione sono intrinsecamente limitati. Per questo motivo, il modello RAM, in particolare per il presupposto che il costo di un accesso in memoria sia indipendente dalla taglia della stessa, non `e implementabile su sistemi fisici. Questo lavoro si inserisce nel contesto delle limiting technology machine, modelli computazionali in cui si ipotizza provocatoriamente di aver raggiunto con la tecnologia di fabbricazione i limiti fisici di densita' e velocita' dell’informazione. Questo, allo scopo di affrontare il problema delle latenze intrinseche a ogni sistema fisico evidenziando organizzazioni scalabili per processori e memorie. Viene quindi presentato uno studio algoritmico, che illustra l’implementazione di programmi a elevata concorrenza per SP ed SPE, modelli di macchine sequenziali in grado di eseguire programmi direct-flow in tempo ottimale. Successivamente, viene introdotta una innovativa organizzazione di memoria gerarchica e pipelined, con latenza e banda ottimali per un sistema fisico. Allo scopo di sfruttarne appieno le caratteristiche, e trarre vantaggio dall’eventuale instruction level parallelism del codice da eseguire, viene sviluppato un innovativo modello di processore. Particolare attenzione e' rivolta all’implementazione di un efficiente flusso di informazione all’interno del processore stesso. Entrambe le organizzazioni sono estremamente scalabili, in quanto basate su un insieme di nodi a taglia e capacita fisse, connessi con una topologia ad array multidimensionale. Lo studio delle prestazioni computazionali della macchina risultante ha evidenziato come le latenze interne al processore possono diventare la principale componente della complessita temporale per l’esecuzione di un flusso di istruzioni, che va ad aggiungersi all’effetto dell’interazione tra processore e memoria. Viene pertanto sviluppata una caratterizzazione dei flussi di istruzioni, basata sulla topologia indotta dalle dipendenze tra istruzioni.
Muppalaneni, Nitin. "Adaptive Hierarchial RAID." Thesis, Indian Institute of Science, 1998. http://hdl.handle.net/2005/50.
Full textBottero, Margherita. "Slave trades, credit records and strategic reasoning : four essays in microeconomics." Doctoral thesis, Handelshögskolan i Stockholm, Institutionen för Nationalekonomi, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:hhs:diva-1281.
Full textDiss. Stockholm : Handelshögskolan i Stockholm, 2011
Damasceno, Alexandro Lima. "O impacto da hierarquia de memória sobre a arquitetura IPNoSys." Universidade Federal Rural do Semi-Árido, 2016. http://bdtd.ufersa.edu.br:80/tede/handle/tede/654.
Full textApproved for entry into archive by Vanessa Christiane (referencia@ufersa.edu.br) on 2017-04-13T14:42:00Z (GMT) No. of bitstreams: 1 AlexandroLD_DISSERT.pdf: 4478017 bytes, checksum: b25b015c0ae937a3ba2f2718697a3977 (MD5)
Approved for entry into archive by Vanessa Christiane (referencia@ufersa.edu.br) on 2017-04-13T15:00:20Z (GMT) No. of bitstreams: 1 AlexandroLD_DISSERT.pdf: 4478017 bytes, checksum: b25b015c0ae937a3ba2f2718697a3977 (MD5)
Made available in DSpace on 2017-04-13T15:07:49Z (GMT). No. of bitstreams: 1 AlexandroLD_DISSERT.pdf: 4478017 bytes, checksum: b25b015c0ae937a3ba2f2718697a3977 (MD5) Previous issue date: 2016-07-27
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
Over the years, with the as technology advances, the search for improvements in the performance of computer systems is notable. The computer systems have evolved in both processing capacity and complexity of the implemented architectures. In such systems it is crucial to use memories since they are responsible for storing data to be processed. Considering an ideal environment, the memories should have a unlimited storage capacity, instant data access and the extremely low cost per bit. But in real systems the memories do not exhibit these characteristics. Storage capacity, speed and cost per bit are factors that increase in proportion to each other. One technique that is used to balance these factors and improve the performance of computer systems is the memory hierarchy. In the scenario of new technologies and proposals for new organizations of processors, a model that has been adopted by designers of computer systems is the use of MPSoCs (multiprocessor systems on chip), which has a higher energy and computational e ciency. In this scenario with many processing elements, networks using on-chip (NoC - networks-on-chip) is more e cient use of the buses. An NoC consists of a set of routers and interconnected channels forming a switched network. The cores are connected to network terminals and communication occurs through the exchange of packets. These NoCs have traditionally been exclusively designed for communication SoCs. However, a project of an unconventional architecture decided to integrate processing and communication in an NoC. This architecture is known for IPNoSys. The IPNoSys (Integrated Processing NoC System) architecture is an unconventional processor that uses networks on chip and implements processing units and routing to handle and process instructions. It takes advantage of the characteristics of NoC, such as scalability and parallel communication, for implement e ectively runs programs that exploit parallelism-level threads. Currently, IPNoSys architecture has four memory physically distributed at the corners of the network, but represent a unified addressing. Each memory module is associated with an access unit in charge of managing it. Given the current organization of IPNoSys memories, this work proposes to develop a new memory hierarchy system for IPNoSys and investigate the possible impact on performance and the programming model
Aolongo dos anos,coma ascensão das tecnologias, a busca por melhorias no desempenho dos sistemas computacionais é algo notável. Os sistemas computacionais evoluíram tanto em capacidade de processamento como em complexidade das arquiteturas implementadas. Nesses sistemas é crucial a utilização de memórias uma vez que elas são responsáveis pelo armazenamento de dados que serão processados. Considerando um ambiente ideal, as memórias deveriam ter uma capacidade de armazenamento ilimitado, o acesso de dados imediato e o custo por bit extremamente baixo. Porém nos sistemas reais as memórias não apresentam essas características. Capacidade de armazenamento, velocidade e custo por bit são fatores que crescem proporcionalmente entre si. Uma técnica que é utilizada para balancear esses fatores e melhorar o desempenho dos sistemas computacionais é a hierarquia de memória. No cenário de novas tecnologias e propostas de novas organizações de processadores, um modelo que vem sendo adotada pelos projetistas de sistemas computacionais é o uso de MPSoCs (sistemas multiprocessados integrados em chip), que apresenta uma maior eficiência energética e computacional. Nesse cenário com muitos elementos de processamento, a utilização de redes em chip (NoC - networks-on-chip) se mostra mais eficiente que o uso de barramentos. Uma NoC consiste em um conjunto de roteadores e canais interligados formando uma rede chaveada. Os núcleos são conectados aos terminais da rede e a comunicação ocorre pela troca de pacotes. Essas NoCs foram tradicionalmente projetadas exclusivamente para a comunicação em SoCs. Entretanto, um projeto de uma arquitetura não convencional resolveu integrar processamento e comunicação em uma NoC. Essa arquitetura é conhecida por IPNoSys. A arquitetura IPNoSys (Integrated Processing NoC System) é um processador não convencional que utiliza redes em chip e implementa unidades de processamento e roteamento para tratar e processar instruções. Aproveita as características das NoCs, como escalabilidade e comunicação paralela, para implementar de maneira eficiente execuções de programas que exploram paralelismo em nível de threads. Atualmente, a arquitetura IPNoSys possui quatro memórias fisicamente distribuidas nos cantos da rede, mas que representam um endereçamento unificado. Cada módulo de memória é associado a uma unidade de acesso que se encarregam de gerenciá-la. Diante da atual organização de memórias da IPNoSys, esse trabalho desenvolveu um novo sistema de hierarquia de memórias para o IPNoSys e investigou os possíveis impactos sobre o desempenho e o modelo de programação
2017-04-10
Jiménez, González Daniel. "Algoritmos de ordenación conscientes de la arquitectura y las características de los datos." Doctoral thesis, Universitat Politècnica de Catalunya, 2004. http://hdl.handle.net/10803/5982.
Full textEn esta tesis los conseguimos haciendo que los algoritmos de ordenación propuestos sean conscientes de la arquitectura del computador y las características de los datos que queremos ordenar.
Los conjuntos de datos que consideramos son conjuntos que caben en memoria principal, pero no en memoria cache.
Los algoritmos presentados se adaptan, en tiempo de ejecución, a las características de los datos (duplicados, con sesgo, etc.) para evitar pérdidas de rendimiento dependiendo de estas características. Para ello, estos algoritmos realizan un particionado de los datos, utilizando una técnica que llamamos Mutant Reverse Sorting y que presentamos en esta tesis.
Mutant Reverse Sorting se adapta dinámicamente a las características de los datos y del computador. Esta técnica analiza una muestra del conjunto de datos a ordenar para seleccionar la forma más rápida de particionar los datos. Esta técnica elige entre Reverse Sorting y Counting Split en función de la distribución de los datos. Estas técnicas también son propuestas en esta tesis.
El análisis de estas técnicas, junto con los algoritmos de ordenación presentados, se realiza en un computador IBM basado en módulos p630 con procesadores Power4 y en un computador SGI O2000 con procesadores R10K. En el análisis realizado para ambos computadores se presentan modelos de comportamiento que se comparan con ejecuciones reales.
Con todo ello, conseguimos los algoritmos de ordenación secuencial y paralelo más rápidos para las características de los datos y los computadores utilizados. Esto es gracias a que estos algoritmos se adaptan a los computadores y las características de los datos mejor que el resto de algoritmos analizados.
Así, por un lado, el algoritmo secuencial propuesto, SKC-Radix sort, consigue unas mejoras de rendimiento de más de 25% comparado con los mejores algoritmos que encontramos en la literatura. Es más, cuanto mayor es el número de duplicados o el sesgo de los datos, mayor es la mejora alcanzada por el SKC-Radix sort.
Por otro lado, el algoritmo paralelo presentado, PSKC-Radix sort, es capaz de ordenar hasta 4 veces más datos que Load Balanced Radix sort en la misma cantidad de tiempo. Load Balanced Radix sort era el algoritmo más rápido de ordenación en memoria y en paralelo para los conjunto de datos que ordenamos hasta la publicación de nuestros trabajos.
In this thesis we analyze and propose parallel and sequential sorting algorithms that exploit the memory hierarchy of the computer used and/or reduce the data communication. So, the objectives of this thesis are not different from the objectives of other works. However, the way to achieve those objectives is different.
We achieve those objectives by being conscious of the computer architecture and the data characteristics of the data set we want to sort.
We have focused on the data sets that fit in main memory but not in cache.
So, the sorting algorithms that we present take into account the data characteristics (duplicates, skewed data, etc.) in order to avoid any lose of performance in execution time depending on those characteristics.
That is done by partitioning the data set using Mutant Reverse Sorting, which is a partition technique that we propose in this thesis.
Mutant Reverse Sorting dynamically adapts to the data characteristics and the computer architecture. This technique analyzes a set of samples of the data set to choose the best way to partition this data set. This chooses between Reverse Sorting and Counting Split, which are selected depending on the data distribution. Those techniques are also proposed in this thesis.
The analysis of the partitioning techniques and the sorting algorithms proposed is done in an IBM computer based on p630 nodes with Power4 processors and in a SGI O2000 with R10K processors. Also, we present models of the behavior of the algorithms for those machines.
The sequential and parallel algorithms proposed are the fastest for the computer architectures and data set characteristics tested. That is so because those algorithms adapt to the computer architecture and data characteristics better than others.
On one hand, the sequential sorting algorithm presented, SKC-Radix sort, achieves performance improvements of more than 25% compared to other sequential algorithms found in the literature. Indeed, the larger the number of duplicates or data skew, the larger the improvement achieved by SKC-Radix sort.
On the other hand, the parallel sorting algorithm proposed, PSKC-Radix sort, sorts 4 times the number of keys that Load Balanced Radix can sort in the same amount of time. Load Balanced Radix sort was the fastest in-memory parallel sorting algorithm for the kind of data set we focus on.
Alba, de la Torre Celia. "Wh-questions in Catalan sign language." Doctoral thesis, Universitat Pompeu Fabra, 2016. http://hdl.handle.net/10803/397751.
Full textS'ofereix una caracterització i una anàlisi de les preguntes-que en Llengua de Signes Catalana, que presenten la particularitat d'ubicar preferentment les expressions-qu al final de l'oració. Aquesta característica, pròpia de les llengües de signes, ha estat difícil de tractar des de models tradicionals, que sovint han considerat que el moviment-qu és universalment cap a l'esquerra i que sovint han assumit que l'estructura sintàctica codifica informació respecte de l'ordre lineal dels elements lingüístics. Es proposa que la jerarquia sintàctica i l'ordre lineal són dos objectes diferents i amb un impacte limitat l'un sobre l'altre i que el segon depèn principalment de mecanismes de processament lingüístic i, específicament, de la Memòria de Treball. En aquest sentit, s'hipotetitza que la diferència en la ubicació dels elements-qu entre llengües de signes i llengües orals respon a diferències en la Memòria de Treball. Per a explorar aquesta hipòtesi, s'exposen els resultats de dos experiments amb participants Sords i oients.
Kaeslin, Alain E. "Performance Optimisation of Discrete-Event Simulation Software on Multi-Core Computers." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-191132.
Full textSIMLOX är en kommersiell mjukvara utvecklad av Systecon AB, vars huvudsakliga funktion är en händelsestyrd simuleringskärna för analys av underhållslösningar för komplexa tekniska system. För hantering av stora problem så används parallellexekvering för simuleringen, vilket i teorin borde ge en nästan linjär skalning med antal trådar. Prestandaförbättringen som observerats i praktiken var dock ytterst begränsad, varför en ordentlig analys av skalbarheten har gjorts i detta projekt. Genom användandet av ett profileringsverktyg med liten overhead och mikroarkitektur-analys, så kunde orsakerna hittas: atomiska operationer som skapar mycket overhead för kommunikation, dålig lokalitet ger fragmentering vid översättning till fysiska adresser och dåligt utnyttjande av TLB-cachen, och vissa flaskhalsar som kräver mycket CPU-kraft. Därefter implementerades och testade optimeringar för att undvika de identifierade problem. Testade lösningar inkluderar eliminering av dyra operationer, ökad effektivitet i minneshantering genom skalbara minneshanteringsalgoritmer och implementation av datastrukturer som ger bättre lokalitet och därmed bättre användande av cache-strukturen. Verifiering på verkliga testfall visade på uppsnabbningar på åtminstone 6.75 gånger på en processor med 8 kärnor. De flesta fall visade på en uppsnabbning med en faktor större än 7.2. Optimeringarna gav även en uppsnabbning med en faktor på åtminstone 1.5 vid sekventiell exekvering i en tråd. Slutsatsen är därmed att det är möjligt att uppnå nästan linjär skalning med antalet kärnor för denna typ av händelsestyrd simulering.
Cargnini, Luís Vitório. "Applications des technologies mémoires MRAM appliquées aux processeurs embarqués." Thesis, Montpellier 2, 2013. http://www.theses.fr/2013MON20091/document.
Full textThe Semiconductors Industry with the advent of submicronic manufacturing flows below 45 nm began to face new challenges to keep evolving according with the Moore's Law. Regarding the widespread adoption of embedded systems one major constraint became power consumption of IC. Also, memory technologies like the current standard of integrated memory technology for memory hierarchy, the SRAM, or the FLASH for non-volatile storage have extreme intricate constraints to be able to yield memory arrays at technological nodes below 45nm. One important is up until now Non-Volatile Memory weren't adopted into the memory hierarchy, due to its density and like flash the necessity of multi-voltage operation. These theses has by objective work into these constraints and provide some answers. Into the thesis will be presented methods and results extracted from this methods to corroborate our goal of delineate a roadmap to adopt a new memory technology, non-volatile, low-power, low-leakage, SEU/MEU-resistant, scalable and with similar performance as the current SRAM, physically equivalent to SRAM, or even better with a area density between 4 to 8 times the area of a SRAM cell, without the necessity of multi-voltage domain like FLASH. This memory is the MRAM (Magnetic Memory), according with the ITRS one candidate to replace SRAM in the near future. MRAM instead of storing charge, they store the magnetic orientation provided by the spin-torque orientation of the free-layer alloy in the MTJ (Magnetic Tunnel Junction). Spin is a quantical state of matter, that in some metallic materials can have it orientation or its torque switched applying a polarized current in the sense of the field orientation desired. Once the magnetic field orientation is set, using a sense amplifier, and a current flow through the MTJ, the memory cell element of MRAM, it is possible to measure the orientation given the resistance variation, higher the resistance lower the passing current, the sense will identify a logic zero, lower the resistance the SA will sense a one logic. So the information is not a charge stored, instead it is a magnetic field orientation, reason why it is not affected by SEU or MEU caused due to high energy particles. Also it is not due to voltages variations to change the memory cell content, trapping charges in a floating gate. Regarding the MRAM, this thesis has by objective address the following aspects: MRAM applied to memory Hierarchy: - By describing the current state of the art in MRAM design and use into memory hierarchy; - by providing an overview of a mechanism to mitigate the latency of writing into MRAM at the cache level (Principle to composite memory bank); - By analyzing power characteristics of a system based on MRAM on CACHE L1 and L2, using a dedicated evaluation flow- by proposing a methodology to infer a system power consumption, and performances.- and for last based into the memory banks analysing a Composite Memory Bank, a simple description on how to generate a memory bank, with some compromise in power, but equivalent latency to the SRAM, that keeps similar performance