Dissertations / Theses: 'Computer architecture. Cache memory'

1

Gieske, Edmund Joseph. "Critical Words Cache Memory." University of Cincinnati / OhioLINK, 2008. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1208368190.

Full text

APA, Harvard, Vancouver, ISO, and other styles

2

Sorenson, Elizabeth S. "Cache characterization and performance studies using locality surfaces /." Diss., CLICK HERE for online access, 2005. http://contentdm.lib.byu.edu/ETD/image/etd950.pdf.

Full text

APA, Harvard, Vancouver, ISO, and other styles

3

Kim, Donglok. "Extended data cache prefetching using a reference prediction table /." Thesis, Connect to this title online; UW restricted, 1997. http://hdl.handle.net/1773/6127.

Full text

APA, Harvard, Vancouver, ISO, and other styles

4

Bani, Ruchi Rastogi Mohanty Saraju. "A new N-way reconfigurable data cache architecture for embedded systems." [Denton, Tex.] : University of North Texas, 2009. http://digital.library.unt.edu/ark:/67531/metadc12079.

Full text

APA, Harvard, Vancouver, ISO, and other styles

5

Bani, Ruchi Rastogi. "A New N-way Reconfigurable Data Cache Architecture for Embedded Systems." Thesis, University of North Texas, 2009. https://digital.library.unt.edu/ark:/67531/metadc12079/.

Full text

Abstract:

Performance and power consumption are most important issues while designing embedded systems. Several studies have shown that cache memory consumes about 50% of the total power in these systems. Thus, the architecture of the cache governs both performance and power usage of embedded systems. A new N-way reconfigurable data cache is proposed especially for embedded systems. This thesis explores the issues and design considerations involved in designing a reconfigurable cache. The proposed reconfigurable data cache architecture can be configured as direct-mapped, two-way, or four-way set associative using a mode selector. The module has been designed and simulated in Xilinx ISE 9.1i and ModelSim SE 6.3e using the Verilog hardware description language.

APA, Harvard, Vancouver, ISO, and other styles

6

Beg, Azam Muhammad. "Improving instruction fetch rate with code pattern cache for superscalar architecture." Diss., Mississippi State : Mississippi State University, 2005. http://library.msstate.edu/etd/show.asp?etd=etd-06202005-103032.

Full text

APA, Harvard, Vancouver, ISO, and other styles

7

SOHONI, SOHUM. "IMPROVING L2 CACHE PERFORMANCE THROUGH STREAM-DIRECTED OPTIMIZATIONS." University of Cincinnati / OhioLINK, 2004. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1092932892.

Full text

APA, Harvard, Vancouver, ISO, and other styles

8

Janapsatya, Andhi Computer Science &amp Engineering Faculty of Engineering UNSW. "Optimization of instruction memory for embedded systems." Awarded by:University of New South Wales. School of Computer Science and Engineering, 2005. http://handle.unsw.edu.au/1959.4/24210.

Full text

Abstract:

This thesis presents methodologies for improving system performance and energy consumption by optimizing the memory hierarchy performance. The processor-memory performance gap is a well-known problem that is predicted to get worse, as the performance gap between processor and memory is widening. The author describes a method to estimate the best L1 cache configuration for a given application. In addition, three methods are presented to improve the performance and reduce energy in embedded systems by optimizing the instruction memory. Performance estimation is an important procedure to assess the performance of the system and to assess the effectiveness of any applied optimizations. A cache memory performance estimation methodology is presented in this thesis. The methodology is designed to quickly and accurately estimate the performance of multiple cache memory configurations. Experimental results showed that the methodology is on average 45 times faster compared to a widely used tool (Dinero IV). The first optimization method is a software-only method, called code placement, was implemented to improve the performance of instruction cache memory. The method involves careful placement of code within memory to ensure high cache hit rate when code is brought into the cache memory. Code placement methodology aims to improve cache hit rates to improve cache memory performance. Experimental results show that by applying the code placement method, a reduction in cache miss rate by up to 71%, and energy consumption reduction of up to 63% are observed when compared to application without code placement. The second method involves a novel architecture for utilizing scratchpad memory. The scratchpad memory is designed as a replacement of the instruction cache memory. Hardware modification was designed to allow data to be written into the scratchpad memory during program execution, allowing dynamic control of the scratchpad memory content. Scratchpad memory has a faster memory access time and a lower energy consumption per access compared to cache memory; the usage of scratchpad memory aims to improve performance and lower energy consumption of systems compared to system with cache memory. Experimental results show an average energy reduction of 26.59% and an average performance improvement of 25.63% when compared to a system with cache memory. The third is an application profiling method using statistical information to identify application???s hot-spots. Application profiling is important for identifying section in the application where performance degradation might occur and/or where maximum performance gain can be obtained through optimization. The method was applied and tested on the scratchpad based system described in this thesis. Experimental results show the effectiveness of the analysis method in reducing energy and improving performance when compared to previous method for utilizing the scratchpad memory based system (average performance improvement of 23.6% and average energy reduction of 27.1% are observed).

APA, Harvard, Vancouver, ISO, and other styles

9

Zhang, Xiushan. "L2 cache replacement based on inter-access time per access count prediction." Diss., Online access via UMI:, 2009.

Find full text

APA, Harvard, Vancouver, ISO, and other styles

10

Elver, Marco Iskender. "Memory consistency directed cache coherence protocols for scalable multiprocessors." Thesis, University of Edinburgh, 2016. http://hdl.handle.net/1842/22073.

Full text

Abstract:

The memory consistency model, which formally specifies the behavior of the memory system, is used by programmers to reason about parallel programs. From a hardware design perspective, weaker consistency models permit various optimizations in a multiprocessor system: this thesis focuses on designing and optimizing the cache coherence protocol for a given target memory consistency model. Traditional directory coherence protocols are designed to be compatible with the strictest memory consistency model, sequential consistency (SC). When they are used for chip multiprocessors (CMPs) that provide more relaxed memory consistency models, such protocols turn out to be unnecessarily strict. Usually, this comes at the cost of scalability, in terms of per-core storage due to sharer tracking, which poses a problem with increasing number of cores in today’s CMPs, most of which no longer are sequentially consistent. The recent convergence towards programming language based relaxed memory consistency models has sparked renewed interest in lazy cache coherence protocols. These protocols exploit synchronization information by enforcing coherence only at synchronization boundaries via self-invalidation. As a result, such protocols do not require sharer tracking which benefits scalability. On the downside, such protocols are only readily applicable to a restricted set of consistency models, such as Release Consistency (RC), which expose synchronization information explicitly. In particular, existing architectures with stricter consistency models (such as x86) cannot readily make use of lazy coherence protocols without either: adapting the protocol to satisfy the stricter consistency model; or changing the architecture’s consistency model to (a variant of) RC, typically at the expense of backward compatibility. The first part of this thesis explores both these options, with a focus on a practical approach satisfying backward compatibility. Because of the wide adoption of Total Store Order (TSO) and its variants in x86 and SPARC processors, and existing parallel programs written for these architectures, we first propose TSO-CC, a lazy cache coherence protocol for the TSO memory consistency model. TSO-CC does not track sharers and instead relies on self-invalidation and detection of potential acquires (in the absence of explicit synchronization) using per cache line timestamps to efficiently and lazily satisfy the TSO memory consistency model. Our results show that TSO-CC achieves, on average, performance comparable to a MESI directory protocol, while TSO-CC’s storage overhead per cache line scales logarithmically with increasing core count. Next, we propose an approach for the x86-64 architecture, which is a compromise between retaining the original consistency model and using a more storage efficient lazy coherence protocol. First, we propose a mechanism to convey synchronization information via a simple ISA extension, while retaining backward compatibility with legacy codes and older microarchitectures. Second, we propose RC3 (based on TSOCC), a scalable cache coherence protocol for RCtso, the resulting memory consistency model. RC3 does not track sharers and relies on self-invalidation on acquires. To satisfy RCtso efficiently, the protocol reduces self-invalidations transitively using per-L1 timestamps only. RC3 outperforms a conventional lazy RC protocol by 12%, achieving performance comparable to a MESI directory protocol for RC optimized programs. RC3’s storage overhead per cache line scales logarithmically with increasing core count and reduces on-chip coherence storage overheads by 45% compared to TSO-CC. Finally, it is imperative that hardware adheres to the promised memory consistency model. Indeed, consistency directed coherence protocols cannot use conventional coherence definitions (e.g. SWMR) to be verified against, and few existing verification methodologies apply. Furthermore, as the full consistency model is used as a specification, their interaction with other components (e.g. pipeline) of a system must not be neglected in the verification process. Therefore, verifying a system with such protocols in the context of interacting components is even more important than before. One common way to do this is via executing tests, where specific threads of instruction sequences are generated and their executions are checked for adherence to the consistency model. It would be extremely beneficial to execute such tests under simulation, i.e. when the functional design implementation of the hardware is being prototyped. Most prior verification methodologies, however, target post-silicon environments, which when used for simulation-based memory consistency verification would be too slow. We propose McVerSi, a test generation framework for fast memory consistency verification of a full-system design implementation under simulation. Our primary contribution is a Genetic Programming (GP) based approach to memory consistency test generation, which relies on a novel crossover function that prioritizes memory operations contributing to non-determinism, thereby increasing the probability of uncovering memory consistency bugs. To guide tests towards exercising as much logic as possible, the simulator’s reported coverage is used as the fitness function. Furthermore, we increase test throughput by making the test workload simulation-aware. We evaluate our proposed framework using the Gem5 cycle accurate simulator in full-system mode with Ruby (with configurations that use Gem5’s MESI protocol, and our proposed TSO-CC together with an out-of-order pipeline). We discover 2 new bugs in the MESI protocol due to the faulty interaction of the pipeline and the cache coherence protocol, highlighting that even conventional protocols should be verified rigorously in the context of a full-system. Crucially, these bugs would not have been discovered through individual verification of the pipeline or the coherence protocol. We study 11 bugs in total. Our GP-based test generation approach finds all bugs consistently, therefore providing much higher guarantees compared to alternative approaches (pseudo-random test generation and litmus tests).

APA, Harvard, Vancouver, ISO, and other styles

11

Nayyar, Raman. "Performance Analysis of a Hierarchical, Cache-Coherent, Shared Memory Based, Multi-processor System." PDXScholar, 1993. https://pdxscholar.library.pdx.edu/open_access_etds/4695.

Full text

Abstract:

We have conducted a performance analysis of a large scale multiprocessor system based on shared buses organized in a hierarchical fashion and employing an easy to implement snoopy cache protocol. · This arrangement, named TREEBUS [ 5], presents a logical extension path for multiprocessor systems based on a single shared bus whose scalability is limited by the available system bus bandwidth [26]. The multiple, independent, hierarchical buses overcome the bus bandwidth limitation and the architecture can scale to relatively large sizes. We have developed an easy to use, reasonably accurate and computationally efficient analytic model for analyzing the performance of the memory hierarchy. Our analysis presents a balanced view by incorporating cost and size of the memory subsystem, two parameters which can significantly impact the feasibility of this architecture. The results indicate that the TREEBUS can deliver high performance for a maximum of about 512 processors using available technology. For larger sizes, the problem is not the limited system bus bandwidth but the unmanageable size of the main memory and a deteriorating cost/performance ratio.

APA, Harvard, Vancouver, ISO, and other styles

12

Almeida, Henrique Dante de 1982. "Implementação de cache no projeto ArchC." [s.n.], 2012. http://repositorio.unicamp.br/jspui/handle/REPOSIP/275701.

Full text

Abstract:

Orientadores: Paulo Cesar Centoducatte, Rodolfo Jardim de Azevedo
Dissertação (mestrado) - Universidade Estadual de Campinas, Instituto de Computação
Made available in DSpace on 2018-08-20T15:21:59Z (GMT). No. of bitstreams: 1 Almeida_HenriqueDantede_M.pdf: 506967 bytes, checksum: ca41d5af5008feeb442f3b9d9322af51 (MD5) Previous issue date: 2012
Resumo: O projeto ArchC visa criar uma linguagem de descrição de arquiteturas, com o objetivo de se construir simuladores e toolchains de arquiteturas computacionais completas. O objetivo deste trabalho é dotar ArchC com capacidade para gerar simuladores de caches. Para tanto foi realizado um estudo detalhado das caches (tipos, organizações, configurações etc) e do funcionamento e do código do ArchC. O resultado foi a descrição de uma coleção de caches parametrizáveis que podem ser adicionadas 'as arquiteturas descritas em ArchC. A implementação das caches é modular, possuindo código isolado para a memória de armazenamento da cache e políticas de operação. A corretude da cache foi verificada utilizando uma sequ¿encia de simulações de diversas configurações de cache e com comparações com o simulador dinero. A cache resultante apresentou um overhead, no tempo de simulaçao, que varia entre 10% e 60%, quando comparada a um simulador sem cache
Abstract: The ArchC project aims to create an architecture description language, with the goal of building complete computer architecture simulators and toolchains. The goal of this project is to add support in ArchC for simulating caches. To achieve this, a detailed study about caches (types, organization, configuration etc) and about the ArchC code was done. The result was a collection of parameterized caches that may be included on the architectures described with ArchC. The cache implementation is modular, having isolated code for the storage and operation policies. Implementation correctness was verified using a set of many cache configurations and with comparisons with the results from dinero simulator. The resulting cache showed an overhead varying between 10% and 60%, when compared to a simulator without caches
Mestrado
Ciência da Computação
Mestre em Ciência da Computação

APA, Harvard, Vancouver, ISO, and other styles

13

Baek, Seungcheol. "High-performance memory system architectures using data compression." Diss., Georgia Institute of Technology, 2014. http://hdl.handle.net/1853/51863.

Full text

Abstract:

The Chip Multi-Processor (CMP) paradigm has cemented itself as the archetypal philosophy of future microprocessor design. Rapidly diminishing technology feature sizes have enabled the integration of ever-increasing numbers of processing cores on a single chip die. This abundance of processing power has magnified the venerable processor-memory performance gap, which is known as the ”memory wall”. To bridge this performance gap, a high-performing memory structure is needed. An attractive solution to overcoming this processor-memory performance gap is using compression in the memory hierarchy. In this thesis, to use compression techniques more efficiently, compressed cacheline size information is studied, and size-aware cache management techniques and hot-cacheline prediction for dynamic early decompression technique are proposed. Also, the proposed works in this thesis attempt to mitigate the limitations of phase change memory (PCM) such as low write performance and limited long-term endurance. One promising solution is the deployment of hybridized memory architectures that fuse dynamic random access memory (DRAM) and PCM, to combine the best attributes of each technology by using the DRAM as an off-chip cache. A dual-phase compression technique is proposed for high-performing DRAM/PCM hybrid environments and a multi-faceted wear-leveling technique is proposed for the long-term endurance of compressed PCM. This thesis also includes a new compression-based hybrid multi-level cell (MLC)/single-level cell (SLC) PCM management technique that aims to combine the performance edge of SLCs with the higher capacity of MLCs in a hybrid environment.

APA, Harvard, Vancouver, ISO, and other styles

14

Lodde, Mario. "Smart Memory and Network-On-Chip Design for High-Performance Shared-Memory Chip Multiprocessors." Doctoral thesis, Universitat Politècnica de València, 2014. http://hdl.handle.net/10251/35325.

Full text

Abstract:

La jerarquía de caches y la red en el chip (NoC) son dos componentes clave de los chip multiprocesadores (CMPs). La mayoría del trafico en la NoC se debe a mensajes que las caches envían según lo que establece el protocolo de coherencia. La cantidad de trafico, el porcentaje de mensajes cortos y largos y el patrón de trafico en general varían dependiendo de la geometría de las caches y del protocolo de coherencia. La arquitectura de la NoC y la jerarquía de caches están de hecho firmemente acopladas, y estos dos componentes deben ser diseñados y evaluados conjuntamente para estudiar como el variar uno afecta a las prestaciones del otro. Además, cada componente debe ajustarse a los requisitos y a las oportunidades del otro, y al revés. Normalmente diferentes clases de mensajes se envían por diferentes redes virtuales o por NoCs con diferente ancho de banda, separando mensajes largos y cortos. Sin embargo, otra clasificación de los mensajes se puede hacer dependiendo del tipo de información que proveen: algunos mensajes, como las peticiones de datos, necesitan campos para almacenar información (dirección del bloque, tipo de petición, etc.); otros, como los mensajes de reconocimiento (ACK), no proporcionan ninguna información excepto por el ID del nodo destino: solo proveen una información de tipo temporal, en el sentido que la recepción de un ACK indica que el nodo fuente ha recibido el mensaje al que está contestando con el ACK y completado todas las operaciones determinadas por el protocolo de coherencia. Esta segunda clase de mensaje no necesita de mucho ancho de banda: la latencia es mucho mas importante, dado que el nodo destino esta típicamente bloqueado esperando la recepción de ellos. En este trabajo de tesis se desarrolla una red dedicada para trasmitir la segunda clase de mensajes; la red es muy sencilla y rápida, y permite la entrega de los ACKs con una latencia de pocos ciclos de reloj. Reduciendo la latencia y el trafico en la NoC debido a los ACKs, es posible: -acelerar la fase de invalidación en fase de escritura en un sistema que usa un protocolo de coherencia basado en directorios -mejorar las prestaciones de un protocolo de coerencia basado en broadcast, hasta llegar a prestaciones comparables con las de un protocolo de directorios pero sin el coste de área debido a la necesidad de almacenar el directorio -implementar un mapeado dinámico de bloques a las caches de ultimo nivel de forma eficiente, con el objetivo de acercar cuanto al máximo los bloques a los cores que los utilizan El objetivo final es obtener un co-diseño de NoC y jerarquía de caches que minimice los problemas de escalabilidad de los protocolos de coherencia. Como gran objetivo final, se pretende la implementación de un CMP con ubicación dinámica de los recursos de cache y red, tal que estos recursos se puedan particionar de forma eficiente e independiente para asignar diferentes particiones a diferentes aplicaciones en un entorno virtualizado.
Lodde, M. (2014). Smart Memory and Network-On-Chip Design for High-Performance Shared-Memory Chip Multiprocessors [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/35325
TESIS

APA, Harvard, Vancouver, ISO, and other styles

15

Kiriwas, Anton. "Directory-based Cache Coherence in SMTp Machines without Memory Overhead using Sparse Directories." Honors in the Major Thesis, University of Central Florida, 2004. http://digital.library.ucf.edu/cdm/ref/collection/ETH/id/714.

Full text

Abstract:

This item is only available in print in the UCF Libraries. If this is your Honors Thesis, you can help us make it available online for use by researchers around the world by following the instructions on the distribution consent form at http://library.ucf
Bachelors
Engineering and Computer Science
Computer Science

APA, Harvard, Vancouver, ISO, and other styles

16

Ghosh, Mrinmoy. "Microarchitectural techniques to reduce energy consumption in the memory hierarchy." Diss., Atlanta, Ga. : Georgia Institute of Technology, 2009. http://hdl.handle.net/1853/28265.

Full text

Abstract:

Thesis (M. S.)--Electrical and Computer Engineering, Georgia Institute of Technology, 2009.
Committee Chair: Lee, Hsien-Hsin S.; Committee Member: Cahtterjee,Abhijit; Committee Member: Mukhopadhyay, Saibal; Committee Member: Pande, Santosh; Committee Member: Yalamanchili, Sudhakar.

APA, Harvard, Vancouver, ISO, and other styles

17

Ballapuram, Chinnakrishnan S. "Semantics-oriented low power architecture." Diss., Atlanta, Ga. : Georgia Institute of Technology, 2008. http://hdl.handle.net/1853/22567.

Full text

Abstract:

Thesis (Ph. D.)--Electrical and Computer Engineering, Georgia Institute of Technology, 2008.
Committee Chair: Hsien-Hsin Sean Lee; Committee Member: Abhijit Chatterjee; Committee Member: Bernard Kippelen; Committee Member: Gabriel H. Loh; Committee Member: SungKyu Lim.

APA, Harvard, Vancouver, ISO, and other styles

18

Choudhary, Dhruv. "Micro-scheduling and its interaction with cache partitioning." Thesis, Georgia Institute of Technology, 2011. http://hdl.handle.net/1853/41167.

Full text

Abstract:

The thesis explores the sources of energy inefficiency in asymmetric multi- core architectures where energy efficiency is measured by the energy-delay squared product. The insights gathered from this study drive the development of optimized thread scheduling and coordinated cache management strategies in an important class of asymmetric shared memory architectures. The proposed techniques are founded on well known mathematical optimization techniques yet are lightweight enough to be implemented in practical systems.

APA, Harvard, Vancouver, ISO, and other styles

19

Goldstein, Felipe Portavales. "Um modelo de memória transacional para arquiteturas heterogêneas baseado em software Cache." [s.n.], 2010. http://repositorio.unicamp.br/jspui/handle/REPOSIP/275780.

Full text

Abstract:

Orientador: Rodolfo Jardim de Azevedo
Dissertação (mestrado) - Universidade Estadual de Campinas, Instituto de Matemática, Estatística e Computação Científica
Made available in DSpace on 2018-08-17T02:02:14Z (GMT). No. of bitstreams: 1 Goldstein_FelipePortavales_M.pdf: 2303926 bytes, checksum: c44512059a990654552904a0f94d74f2 (MD5) Previous issue date: 2010
Resumo: A adoção de processadores com múltiplos núcleos pela indústria, levou à necessidade de novas técnicas para facilitar a programação de software paralelo. A técnica chamada memórias transacionais é uma das mais promissoras. Esta técnica é capaz de executar tarefas concorrentemente de forma otimista, o que permite um bom desempenho. Outra vantagem é que a sua utilização é muito mais simples comparada com a técnica clássica de exclusão mútua. Neste trabalho é proposto o primeiro modelo de memória transacional para arquiteturas híbridas, neste caso a arquitetura alvo é o processador Cell BE. O processador Cell BE é especialmente complexo por causa das dificuldades que a arquitetura deste processador impõe ao programador quando se necessita acessar a memória global compartilhada. O modelo proposto age como uma camada entre o programa e a memória principal, permitindo um acesso transparente aos dados, garantindo coerência e realizando o controle de concorrência de forma automática. O modelo proposto utiliza Software Cache combinado com a memória transacional para facilitar o acesso à memória externa a partir dos SPEs. Ele foi implementado e testado utilizando 8 aplicativos benchmark diferentes, mostrando sua viabilidade para casos de uso reais. Foi feita uma análise detalhada de cada parte da arquitetura proposta com relação ao impacto no desempenho geral do sistema. Este modelo foi capaz de obter um desempenho até duas vezes superior à implementação utilizando um mutex global. As vantagens da utilização se concentram principalmente na facilidade de uso, garantias de coerência e por evitar alguns tipos de bugs que seriam comuns em uma implementação com mutex, como por exemplo dead-locks. Este trabalho obteve o prêmio de melhor artigo no SBAC-PAD 2008
Abstract: The adoption of multi-core processors by the industry has pushed towards the development of new techniques to simplify programming parallel software. The technique called transactional memories is one of the most promising. This technique is able to execute multiple tasks concurrently in an optimistic way to achieve a better performance. Another advantage is that the usage of this technique is simpler than the classic mutual exclusion. This work proposes the first transactional memory model for hybrid architectures, in this case the target architecture is the Cell BE processor. The Cell BE is specially complex because of the dificulties when acessing the main shared memory from one of the SPEs. The proposed model acts as a layer between the program running and the main shared memory, allowing transparent access to the data, guaranteeing coherency and automatic concurrency control. The proposed model uses a Software Cache combined with a transactional memory to facilitate the acess to the main memory from the SPEs. This model was implemented and tested using 8 benchmark applications, showing its feasability in real use cases. A detailed analysis of its internal parts has been made to show the impact of each part in the overal system performance. The model was able to achieve a performance up to two times better than a similar implementation using a global mutex. The advantages of this model rely on its usability, coherency guaranty and because it is able to avoid concurrency programming bugs such as dead-lock, which are common in a mutex implementation. This work won the best paper award at SBAC-PAD 2008
Mestrado
Arquitetura de Computadores
Mestre em Ciência da Computação

APA, Harvard, Vancouver, ISO, and other styles

20

Guo, Ruirui. "High performance cache architectures for IP routing : replacement, compaction and sampling schemes." Online access for everyone, 2007. http://www.dissertations.wsu.edu/Dissertations/Summer2007/r_guo_072307.pdf.

Full text

APA, Harvard, Vancouver, ISO, and other styles

21

Qin, Xiaohan. "On the use and performance of communication primitives in software controlled cache-coherent cluster architectures /." Thesis, Connect to this title online; UW restricted, 1997. http://hdl.handle.net/1773/6925.

Full text

APA, Harvard, Vancouver, ISO, and other styles

22

Alves, Marco Antonio Zanata. "Avaliação do compartilhamento das memórias cache no desempenho de arquiteturas multi-core." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2009. http://hdl.handle.net/10183/16129.

Full text

Abstract:

No atual contexto de inovações em multi-core, em que as novas tecnologias de integração estão fornecendo um número crescente de transistores por chip, o estudo de técnicas de aumento de vazão de dados é de suma importância para os atuais e futuros processadores multi-core e many-core. Com a contínua demanda por desempenho computacional, as memórias cache vêm sendo largamente adotadas nos diversos tipos de projetos arquiteturais de computadores. Os atuais processadores disponíveis no mercado apontam na direção do uso de memórias cache L2 compartilhadas. No entanto, ainda não está claro quais os ganhos e custos inerentes desses modelos de compartilhamento da memória cache. Assim, nota-se a importância de estudos que abordem os diversos aspectos do compartilhamento de memória cache em processadores com múltiplos núcleos. Portanto, essa dissertação visa avaliar diferentes compartilhamentos de memória cache, modelando e aplicando cargas de trabalho sobre as diferentes organizações, a fim de obter resultados significativos sobre o desempenho e a influência do compartilhamento da memória cache em processadores multi-core. Para isso, foram avaliados diversos compartilhamentos de memória cache, utilizando técnicas tradicionais de aumento de desempenho, como aumento da associatividade, maior tamanho de linha, maior tamanho de memória cache e também aumento no número de níveis de memória cache, investigando a correlação entre essas arquiteturas de memória cache e os diversos tipos de aplicações da carga de trabalho. Os resultados mostram a importância da integração entre os projetos de arquitetura de memória cache e o projeto físico da memória, a fim de obter o melhor equilíbrio entre tempo de acesso à memória cache e redução de faltas de dados. Nota-se nos resultados, dentro do espaço de projeto avaliado, que devido às limitações físicas e de desempenho, as organizações 1Core/L2 e 2Cores/L2, com tamanho total igual a 32 MB (bancos de 2 MB compartilhados), tamanho de linha igual a 128 bytes, representam uma boa escolha de implementação física em sistemas de propósito geral, obtendo um bom desempenho em todas aplicações avaliadas sem grandes sobrecustos de ocupação de área e consumo de energia. Além disso, como conclusão desta dissertação, mostra-se que, para as atuais e futuras tecnologias de integração, as tradicionais técnicas de ganho de desempenho obtidas com modificações na memória cache, como aumento do tamanho das memórias, incremento da associatividade, maiores tamanhos da linha, etc. não devem apresentar ganhos reais de desempenho caso o acréscimo de latência gerado por essas técnicas não seja reduzido, a fim de equilibrar entre a redução na taxa de faltas de dados e o tempo de acesso aos dados.
In the current context of innovations in multi-core processors, where the new integration technologies are providing an increasing number of transistors inside chip, the study of techniques for increasing data throughput has great importance for the current and future multi-core and many-core processors. With the continuous demand for performance, the cache memories have been widely adopted in various types of architectural designs of computers. Nowadays, processors on the market point out for the use of shared L2 cache memory. However, it is not clear the gains and costs of these shared cache memory models. Thus, studies that address different aspects of shared cache memory have great importance in context of multi-core processors. Therefore, this dissertation aims to evaluate different shared cache memory, modeling and applying workloads on different organizations in order to obtain significant results from the performance and the influence of the shared cache memory multi-core processors. Thus, several types of shared cache memory were evaluated using traditional techniques to increase performance, such as increasing the associativity, larger line size, larger cache memory and also the increase on the cache memory hierarchy, investigating the correlation between the cache memory architecture and the workload applications. The results show the importance of integration between cache memory architecture project and memory physical design in order to obtain the best trade-off between cache memory access time and cache misses. According to the results, within evaluations, due to physical limitations and performance, organizations 1Core/L2 and 2Cores/L2 with total cache size equal to 32MB, using banks of 2 MB, line size equal to 128 bytes, represent a good choice for physical implementation in general purpose systems, obtaining a good performance in all evaluated applications without major extra costs of area occupation and power consumption. Furthermore, as a conclusion in this dissertation is shown that, for current and future integration technologies, traditional techniques for performance gain obtained with changes in the cache memory such as, increase of the memory size, increasing the associativity, larger line sizes etc.. should not lead to real performance gains if the additional latency generated by these techniques was not treated, in order to balance between the reduction of cache miss rate and the data access time.

APA, Harvard, Vancouver, ISO, and other styles

23

Zeng, Hui. "Managing datapath resources in an out-of-order processor for performance and energy efficiency." Diss., Online access via UMI:, 2009.

Find full text

APA, Harvard, Vancouver, ISO, and other styles

24

Davari, Mahdad. "Advances Towards Data-Race-Free Cache Coherence Through Data Classification." Doctoral thesis, Uppsala universitet, Avdelningen för datorteknik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-320595.

Full text

Abstract:

Providing a consistent view of the shared memory based on precise and well-defined semantics—memory consistency model—has been an enabling factor in the widespread acceptance and commercial success of shared-memory architectures. Moreover, cache coherence protocols have been employed by the hardware to remove from the programmers the burden of dealing with the memory inconsistency that emerges in the presence of the private caches. The principle behind all such cache coherence protocols is to guarantee that consistent values are read from the private caches at all times. In its most stringent form, a cache coherence protocol eagerly enforces two invariants before each data modification: i) no other core has a copy of the data in its private caches, and ii) all other cores know where to receive the consistent data should they need the data later. Nevertheless, by partly transferring the responsibility for maintaining those invariants to the programmers, commercial multicores have adopted weaker memory consistency models, namely the Total Store Order (TSO), in order to optimize the performance for more common cases. Moreover, memory models with more relaxed invariants have been proposed based on the observation that more and more software is written in compliance with the Data-Race-Free (DRF) semantics. The semantics of DRF software can be leveraged by the hardware to infer when data in the private caches might be inconsistent. As a result, hardware ignores the inconsistent data and retrieves the consistent data from the shared memory. DRF semantics therefore removes from the hardware the burden of eagerly enforcing the strong consistency invariants before each data modification. Instead, consistency is guaranteed only when needed. This results in manifold optimizations, such as reducing the energy consumption and improving the performance and scalability. The efficiency of detecting and discarding the inconsistent data is an important factor affecting the efficiency of such coherence protocols. For instance, discarding the consistent data does not affect the correctness, but results in performance loss and increased energy consumption. In this thesis we show how data classification can be leveraged as an effective tool to simplify the cache coherence based on the DRF semantics. In particular, we introduce simple but efficient hardware-based private/shared data classification techniques that can be used to efficiently detect the inconsistent data, thus enabling low-overhead and scalable cache coherence solutions based on the DRF semantics.

APA, Harvard, Vancouver, ISO, and other styles

25

Kong, Jingfei. "ARCHITECTURAL SUPPORT FOR IMPROVING COMPUTER SECURITY." Doctoral diss., University of Central Florida, 2010. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/2610.

Full text

Abstract:

Computer security and privacy are becoming extremely important nowadays. The task of protecting computer systems from malicious attacks and potential subsequent catastrophic losses is, however, challenged by the ever increasing complexity and size of modern hardware and software design. We propose several methods to improve computer security and privacy from architectural point of view. They provide strong protection as well as performance efficiency. In our first approach, we propose a new dynamic information flow method to protect systems from popular software attacks such as buffer overflow and format string attacks. In our second approach, we propose to deploy encryption schemes to protect the privacy of an emerging non-volatile main memory technology Â phase change memory (PCM). The negative impact of the encryption schemes on PCM lifetime is evaluated and new methods including a new encryption counter scheme and an efficient error correct code (ECC) management are proposed to improve PCM lifetime. In our third approach, we deconstruct two previously proposed secure cache designs against software data-cache-based side channel attacks and demonstrate their weaknesses. We propose three hardware-software integrated approaches as secure protections against those data cache attacks. Also we propose to apply them to protect instruction caches from similar threats. Furthermore, we propose a simple change to the update policy of Branch Target Buffer (BTB) to defend against BTB attacks. Our experiments show that our proposed schemes are both security effective and performance efficient.
Ph.D.
School of Electrical Engineering and Computer Science
Engineering and Computer Science
Computer Science PhD

APA, Harvard, Vancouver, ISO, and other styles

26

Rabbah, Rodric Michel. "Design Space Exploration and Optimization of Embedded Memory Systems." Diss., Georgia Institute of Technology, 2006. http://hdl.handle.net/1853/11605.

Full text

Abstract:

Recent years have witnessed the emergence of microprocessors that are embedded within a plethora of devices used in everyday life. Embedded architectures are customized through a meticulous and time consuming design process to satisfy stringent constraints with respect to performance, area, power, and cost. In embedded systems, the cost of the memory hierarchy limits its ability to play as central a role. This is due to stringent constraints that fundamentally limit the physical size and complexity of the memory system. Ultimately, application developers and system engineers are charged with the heavy burden of reducing the memory requirements of an application. This thesis offers the intriguing possibility that compilers can play a significant role in the automatic design space exploration and optimization of embedded memory systems. This insight is founded upon a new analytical model and novel compiler optimizations that are specifically designed to increase the synergy between the processor and the memory system. The analytical models serve to characterize intrinsic program properties, quantify the impact of compiler optimizations on the memory systems, and provide deep insight into the trade-offs that affect memory system design.

APA, Harvard, Vancouver, ISO, and other styles

27

Puche, Lara José. "Novel Cache Hierarchies with Photonic Interconnects for Chip Multiprocessors." Doctoral thesis, Universitat Politècnica de València, 2021. http://hdl.handle.net/10251/165254.

Full text

Abstract:

[ES] Los procesadores multinúcleo actuales cuentan con recursos compartidos entre los diferentes núcleos. Dos de estos recursos compartidos, la cache de último nivel y el ancho de banda de memoria principal, pueden convertirse en cuellos de botella para el rendimiento. Además, con el crecimiento del número de núcleos que implementan los diseños más recientes, la red dentro del chip también se convierte en un cuello de botella que puede afectar negativamente al rendimiento, ya que las redes tradicionales pueden encontrar limitaciones a su escalabilidad en el futuro cercano. Prácticamente la totalidad de los diseños actuales implementan jerarquías de memoria que se comunican mediante rápidas redes de interconexión. Esta organización es eficaz dado que permite reducir el número de accesos que se realizan a memoria principal y la latencia media de acceso a memoria. Las caches, la red de interconexión y la memoria principal, conjuntamente con otras técnicas conocidas como la prebúsqueda, permiten reducir las enormes latencias de acceso a memoria principal, limitando así el impacto negativo ocasionado por la diferencia de rendimiento existente entre los núcleos de cómputo y la memoria. Sin embargo, compartir los recursos mencionados es fuente de diferentes problemas y retos, siendo uno de los principales el manejo de la interferencia entre aplicaciones. Hacer un uso eficiente de la jerarquía de memoria y las caches, así como contar con una red de interconexión apropiada, es necesario para sostener el crecimiento del rendimiento en los diseños tanto actuales como futuros. Esta tesis analiza y estudia los principales problemas e inconvenientes observados en estos dos recursos: la cache de último nivel y la red dentro del chip. En primer lugar, se estudia la escalabilidad de las tradicionales redes dentro del chip con topología de malla, así como esta puede verse comprometida en próximos diseños que cuenten con mayor número de núcleos. Los resultados de este estudio muestran que, a mayor número de núcleos, el impacto negativo de la distancia entre núcleos en la latencia puede afectar seriamente al rendimiento del procesador. Como solución a este problema, en esta tesis proponemos una de red de interconexión óptica modelada en un entorno de simulación detallado, que supone una solución viable a los problemas de escalabilidad observados en los diseños tradicionales. A continuación, esta tesis dedica un esfuerzo importante a identificar y proponer soluciones a los principales problemas de diseño de las jerarquías de memoria actuales como son, por ejemplo, el sobredimensionado del espacio de cache privado, la existencia de réplicas de datos y rigidez e incapacidad de adaptación de las estructuras de cache. Aunque bien conocidos, estos problemas y sus efectos adversos en el rendimiento pueden ser evitados en procesadores de alto rendimiento gracias a la enorme capacidad de la cache de último nivel que este tipo de procesadores típicamente implementan. Sin embargo, en procesadores de bajo consumo, no existe la posibilidad de contar con tales capacidades y hacer un uso eficiente del espacio disponible es crítico para mantener el rendimiento. Como solución a estos problemas en procesadores de bajo consumo, proponemos una novedosa organización de jerarquía de dos niveles cache que utiliza una red de interconexión óptica. Los resultados obtenidos muestran que, comparado con diseños convencionales, el consumo de energía estática en la arquitectura propuesta es un 60% menor, pese a que los resultados de rendimiento presentan valores similares. Por último, hemos extendido la arquitectura propuesta para dar soporte tanto a aplicaciones paralelas como secuenciales. Los resultados obtenidos con la esta nueva arquitectura muestran un ahorro de hasta el 78 % de energía estática en la ejecución de aplicaciones paralelas.
[CA] Els processadors multinucli actuals compten amb recursos compartits entre els diferents nuclis. Dos d'aquests recursos compartits, la memòria d’últim nivell i l'ample de banda de memòria principal, poden convertir-se en colls d'ampolla per al rendiment. A mes, amb el creixement del nombre de nuclis que implementen els dissenys mes recents, la xarxa dins del xip també es converteix en un coll d'ampolla que pot afectar negativament el rendiment, ja que les xarxes tradicionals poden trobar limitacions a la seva escalabilitat en el futur proper. Pràcticament la totalitat dels dissenys actuals implementen jerarquies de memòria que es comuniquen mitjançant rapides xarxes d’interconnexió. Aquesta organització es eficaç ates que permet reduir el nombre d'accessos que es realitzen a memòria principal i la latència mitjana d’accés a memòria. Les caches, la xarxa d’interconnexió i la memòria principal, conjuntament amb altres tècniques conegudes com la prebúsqueda, permeten reduir les enormes latències d’accés a memòria principal, limitant així l'impacte negatiu ocasionat per la diferencia de rendiment existent entre els nuclis de còmput i la memòria. No obstant això, compartir els recursos esmentats és font de diversos problemes i reptes, sent un dels principals la gestió de la interferència entre aplicacions. Fer un us eficient de la jerarquia de memòria i les caches, així com comptar amb una xarxa d’interconnexió apropiada, es necessari per sostenir el creixement del rendiment en els dissenys tant actuals com futurs. Aquesta tesi analitza i estudia els principals problemes i inconvenients observats en aquests dos recursos: la memòria cache d’últim nivell i la xarxa dins del xip. En primer lloc, s'estudia l'escalabilitat de les xarxes tradicionals dins del xip amb topologia de malla, així com aquesta es pot veure compromesa en propers dissenys que compten amb major nombre de nuclis. Els resultats d'aquest estudi mostren que, a major nombre de nuclis, l'impacte negatiu de la distància entre nuclis en la latència pot afectar seriosament al rendiment del processador. Com a solució' a aquest problema, en aquesta tesi proposem una xarxa d’interconnexió' òptica modelada en un entorn de simulació detallat, que suposa una solució viable als problemes d'escalabilitat observats en els dissenys tradicionals. A continuació, aquesta tesi dedica un esforç important a identificar i proposar solucions als principals problemes de disseny de les jerarquies de memòria actuals com son, per exemple, el sobredimensionat de l'espai de memòria cache privat, l’existència de repliques de dades i la rigidesa i incapacitat d’adaptació' de les estructures de memòria cache. Encara que ben coneguts, aquests problemes i els seus efectes adversos en el rendiment poden ser evitats en processadors d'alt rendiment gracies a l'enorme capacitat de la memòria cache d’últim nivell que aquest tipus de processadors típicament implementen. No obstant això, en processadors de baix consum, no hi ha la possibilitat de comptar amb aquestes capacitats, i fer un us eficient de l'espai disponible es torna crític per mantenir el rendiment. Com a solució a aquests problemes en processadors de baix consum, proposem una nova organització de jerarquia de dos nivells de memòria cache que utilitza una xarxa d’interconnexió òptica. Els resultats obtinguts mostren que, comparat amb dissenys convencionals, el consum d'energia estàtica en l'arquitectura proposada és un 60% menor, malgrat que els resultats de rendiment presenten valors similars. Per últim, hem estes l'arquitectura proposada per donar suport tant a aplicacions paral·leles com seqüencials. Els resultats obtinguts amb aquesta nova arquitectura mostren un estalvi de fins al 78 % d'energia estàtica en l’execució d'aplicacions paral·leles.
[EN] Current multicores face the challenge of sharing resources among the different processor cores. Two main shared resources act as major performance bottlenecks in current designs: the off-chip main memory bandwidth and the last level cache. Additionally, as the core count grows, the network on-chip is also becoming a potential performance bottleneck, since traditional designs may find scalability issues in the near future. Memory hierarchies communicated through fast interconnects are implemented in almost every current design as they reduce the number of off-chip accesses and the overall latency, respectively. Main memory, caches, and interconnection resources, together with other widely-used techniques like prefetching, help alleviate the huge memory access latencies and limit the impact of the core-memory speed gap. However, sharing these resources brings several concerns, being one of the most challenging the management of the inter-application interference. Since almost every running application needs to access to main memory, all of them are exposed to interference from other co-runners in their way to the memory controller. For this reason, making an efficient use of the available cache space, together with achieving fast and scalable interconnects, is critical to sustain the performance in current and future designs. This dissertation analyzes and addresses the most important shortcomings of two major shared resources: the Last Level Cache (LLC) and the Network on Chip (NoC). First, we study the scalability of both electrical and optical NoCs for future multicoresand many-cores. To perform this study, we model optical interconnects in a cycle-accurate multicore simulation framework. A proper model is required; otherwise, important performance deviations may be observed otherwise in the evaluation results. The study reveals that, as the core count grows, the effect of distance on the end-to-end latency can negatively impact on the processor performance. In contrast, the study also shows that silicon nanophotonics are a viable solution to solve the mentioned latency problems. This dissertation is also motivated by important design concerns related to current memory hierarchies, like the oversizing of private cache space, data replication overheads, and lack of flexibility regarding sharing of cache structures. These issues, which can be overcome in high performance processors by virtue of huge LLCs, can compromise performance in low power processors. To address these issues we propose a more efficient cache hierarchy organization that leverages optical interconnects. The proposed architecture is conceived as an optically interconnected two-level cache hierarchy composed of multiple cache modules that can be dynamically turned on and off independently. Experimental results show that, compared to conventional designs, static energy consumption is improved by up to 60% while achieving similar performance results. Finally, we extend the proposal to support both sequential and parallel applications. This extension is required since the proposal adapts to the dynamic cache space needs of the running applications, and multithreaded applications's behaviors widely differ from those of single threaded programs. In addition, coherence management is also addressed, which is challenging since each cache module can be assigned to any core at a given time in the proposed approach. For parallel applications, the evaluation shows that the proposal achieves up to 78% static energy savings. In summary, this thesis tackles major challenges originated by the sharing of on-chip caches and communication resources in current multicores, and proposes new cache hierarchy organizations leveraging optical interconnects to address them. The proposed organizations reduce both static and dynamic energy consumption compared to conventional approaches while achieving similar performance; which results in better energy efficiency.
Puche Lara, J. (2021). Novel Cache Hierarchies with Photonic Interconnects for Chip Multiprocessors [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/165254
TESIS

APA, Harvard, Vancouver, ISO, and other styles

28

Giordano, Omar. "Design and Implementation of an Architecture-aware In-memory Key- Value Store." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-291213.

Full text

Abstract:

Key-Value Stores (KVSs) are a type of non-relational databases whose data is represented as a key-value pair and are often used to represent cache and session data storage. Among them, Memcached is one of the most popular ones, as it is widely used in various Internet services such as social networks and streaming platforms. Given the continuous and increasingly rapid growth of networked devices that use these services, the commodity hardware on which the databases are based must process packets faster to meet the needs of the market. However, in recent years, the performance improvements characterising the new hardware has become thinner and thinner. From here, as the purchase of new products is no longer synonymous with significant performance improvements, companies need to exploit the full potential of the hardware already in their possession, consequently postponing the purchase of more recent hardware. One of the latest ideas for increasing the performance of commodity hardware is the use of slice-aware memory management. This technique exploits the Last Level of Cache (LLC) by making sure that the individual cores take data from memory locations that are mapped to their respective cache portions (i.e., LLC slices). This thesis focuses on the realisation of a KVS prototype—based on Intel Haswell micro-architecture—built on top of the Data Plane Development Kit (DPDK), and to which the principles of slice-aware memory management are applied. To test its performance, given the non-existence of a DPDKbased traffic generator that supports the Memcached protocol, an additional prototype of a traffic generator that supports these features has also been developed. The performances were measured using two distinct machines: one for the traffic generator and one for the KVS. First, the “regular” KVS prototype was tested, then, to see the actual benefits, the slice-aware one. Both KVS prototypeswere subjected to two types of traffic: (i) uniformtraffic where the keys are always different from each other, and (ii) skewed traffic, where keys are repeated and some keys are more likely to be repeated than others. The experiments show that, in real-world scenario (i.e., characterised by skewed key distributions), the employment of a slice-aware memory management technique in a KVS can slightly improve the end-to-end latency (i.e.,~2%). Additionally, such technique highly impacts the look-up time required by the CPU to find the key and the corresponding value in the database, decreasing the mean time by ~22.5%, and improving the 99th percentile by ~62.7%.
Key-Value Stores (KVSs) är en typ av icke-relationsdatabaser vars data representeras som ett nyckel-värdepar och används ofta för att representera lagring av cache och session. Bland dem är Memcached en av de mest populära, eftersom den används ofta i olika internettjänster som sociala nätverk och strömmande plattformar. Med tanke på den kontinuerliga och allt snabbare tillväxten av nätverksenheter som använder dessa tjänster måste den råvaruhårdvara som databaserna bygger på bearbeta paket snabbare för att möta marknadens behov. Under de senaste åren har dock prestandaförbättringarna som kännetecknar den nya hårdvaran blivit tunnare och tunnare. Härifrån, eftersom inköp av nya produkter inte längre är synonymt med betydande prestandaförbättringar, måste företagen utnyttja den fulla potentialen för hårdvaran som redan finns i deras besittning, vilket skjuter upp köpet av nyare hårdvara. En av de senaste idéerna för att öka prestanda för råvaruhårdvara är användningen av skivmedveten minneshantering. Denna teknik utnyttjar den Sista Nivån av Cache (SNC) genom att se till att de enskilda kärnorna tar data från minnesplatser som är mappade till deras respektive cachepartier (dvs. SNCskivor). Denna avhandling fokuserar på förverkligandet av en KVS-prototyp— baserad på Intel Haswell mikroarkitektur—byggd ovanpå Data Plane Development Kit (DPDK), och på vilken principerna för skivmedveten minneshantering tillämpas. För att testa dess prestanda, med tanke på att det inte finns en DPDK-baserad trafikgenerator som stöder Memcachedprotokollet, har en ytterligare prototyp av en trafikgenerator som stöder dessa funktioner också utvecklats. Föreställningarna mättes med två olika maskiner: en för trafikgeneratorn och en för KVS. Först testades den “vanliga” KVSprototypen, för att se de faktiska fördelarna, den skivmedvetna. Båda KVSprototyperna utsattes för två typer av trafik: (i) enhetlig trafik där nycklarna alltid skiljer sig från varandra och (ii) sned trafik, där nycklar upprepas och vissa nycklar är mer benägna att upprepas än andra. Experimenten visar att i verkliga scenarier (dvs. kännetecknas av snedställda nyckelfördelningar) kan användningen av en skivmedveten minneshanteringsteknik i en KVS förbättra förbättringen från slut till slut (dvs. ~2%). Dessutom påverkar sådan teknik i hög grad uppslagstiden som krävs av CPU: n för att hitta nyckeln och motsvarande värde i databasen, vilket minskar medeltiden med ~22, 5% och förbättrar 99th percentilen med ~62, 7%.

APA, Harvard, Vancouver, ISO, and other styles

29

Zeffer, Håkan. "Towards Low-Complexity Scalable Shared-Memory Architectures." Doctoral thesis, Uppsala University, Department of Information Technology, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-7135.

Full text

Abstract:

Plentiful research has addressed low-complexity software-based shared-memory systems since the idea was first introduced more than two decades ago. However, software-coherent systems have not been very successful in the commercial marketplace. We believe there are two main reasons for this: lack of performance and/or lack of binary compatibility.

This thesis studies multiple aspects of how to design future binary-compatible high-performance scalable shared-memory servers while keeping the hardware complexity at a minimum. It starts with a software-based distributed shared-memory system relying on no specific hardware support and gradually moves towards architectures with simple hardware support.

The evaluation is made in a modern chip-multiprocessor environment with both high-performance compute workloads and commercial applications. It shows that implementing the coherence-violation detection in hardware while solving the interchip coherence in software allows for high-performing binary-compatible systems with very low hardware complexity. Our second-generation hardware-software hybrid performs on par with, and often better than, traditional hardware-only designs.

Based on our results, we conclude that it is not only possible to design simple systems while maintaining performance and the binary-compatibility envelope, it is often possible to get better performance than in traditional and more complex designs.

We also explore two new techniques for evaluating a new shared-memory design throughout this work: adjustable simulation fidelity and statistical multiprocessor cache modeling.

APA, Harvard, Vancouver, ISO, and other styles

30

Tang, Ho-Shing. "POPCA : optimizing segment caching for peer-to-peer on-demand streaming /." View abstract or full-text, 2008. http://library.ust.hk/cgi/db/thesis.pl?CSED%202008%20TANG.

Full text

APA, Harvard, Vancouver, ISO, and other styles

31

Eberhart, Paul S. "A Compiler Target Model for Line Associative Registers." UKnowledge, 2019. https://uknowledge.uky.edu/ece_etds/138.

Full text

Abstract:

LARs (Line Associative Registers) are very wide tagged registers, used for both register-wide SWAR (SIMD Within a Register )operations and scalar operations on arbitrary fields. LARs include a large data field, type tags, source addresses, and a dirty bit, which allow them to not only replace both caches and registers in the conventional memory hierarchy, but improve on both their functions. This thesis details a LAR-based architecture, and describes the design of a compiler which can generate code for a LAR-based design. In particular, type conversion, alignment, and register allocation are discussed in detail.

APA, Harvard, Vancouver, ISO, and other styles

32

Radovic, Zoran. "Software Techniques for Distributed Shared Memory." Doctoral thesis, Uppsala University, Department of Information Technology, 2005. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-6058.

Full text

Abstract:

In large multiprocessors, the access to shared memory is often nonuniform, and may vary as much as ten times for some distributed shared-memory architectures (DSMs). This dissertation identifies another important nonuniform property of DSM systems: nonuniform communication architecture, NUCA. High-end hardware-coherent machines built from large nodes, or from chip multiprocessors, are typical NUCA systems, since they have a lower penalty for reading recently written data from a neighbor's cache than from a remote cache. This dissertation identifies node affinity as an important property for scalable general-purpose locks. Several software-based hierarchical lock implementations exploiting NUCAs are presented and evaluated. NUCA-aware locks are shown to be almost twice as efficient for contended critical sections compared to traditional lock implementations.

The shared-memory “illusion”' provided by some large DSM systems may be implemented using either hardware, software or a combination thereof. A software-based implementation can enable cheap cluster hardware to be used, but typically suffers from poor and unpredictable performance characteristics.

This dissertation advocates a new software-hardware trade-off design point based on a new combination of techniques. The two low-level techniques, fine-grain deterministic coherence and synchronous protocol execution, as well as profile-guided protocol flexibility, are evaluated in isolation as well as in a combined setting using all-software implementations. Finally, a minimum of hardware trap support is suggested to further improve the performance of coherence protocols across cluster nodes. It is shown that all these techniques combined could result in a fairly stable performance on par with hardware-based coherence.

APA, Harvard, Vancouver, ISO, and other styles

33

Chen, Weisong. "A pervasive information framework based on semantic routing and cooperative caching." Click to view the E-thesis via HKUTO, 2004. http://sunzi.lib.hku.hk/hkuto/record/B31065326.

Full text

APA, Harvard, Vancouver, ISO, and other styles

34

Chen, Weisong, and 陳偉松. "A pervasive information framework based on semantic routing and cooperative caching." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2004. http://hub.hku.hk/bib/B31065326.

Full text

APA, Harvard, Vancouver, ISO, and other styles

35

Batistella, Rafael Fernandes. "PBIW : um esquema de codificação baseado em padrões de instrução." [s.n.], 2008. http://repositorio.unicamp.br/jspui/handle/REPOSIP/276093.

Full text

Abstract:

Orientador: Rodolfo Jardim de Azevedo
Dissertação (mestrado) - Universidade Estadual de Campinas, Instituto de Computação
Made available in DSpace on 2018-08-11T00:49:37Z (GMT). No. of bitstreams: 1 Batistella_RafaelFernandes_M.pdf: 3411156 bytes, checksum: 7e6b46824189243405a180e949db65c6 (MD5) Previous issue date: 2008
Resumo: Trabalhos não muito recentes já mostravam que o aumento de velocidade nas memórias DRAM não acompanha o aumento de velocidade dos processadores. Mesmo assim, pesquisadores na área de arquitetura de computadores continuam buscando novas abordagens para aumentar o desempenho dos processadores. Dentro do objetivo de minimizar essa diferença de velocidade entre memória e processador, este trabalho apresenta um novo esquema de codificação baseado em instruções codificadas e padrões de instruções ¿ PBIW (Pattern Based Instruction Word). Uma instrução codificada não contém redundância de dados e é armazenada em uma I-cache. Os padrões de instrução, de forma diferente, são armazenados em uma nova cache, chamada Pattern cache (P-cache) e são utilizados pelo circuito decodificador na preparação da instrução que será repassada aos estágios de execução. Esta técnica se mostrou uma boa alternativa para estilos arquiteturais conhecidos como arquiteturas VLIW e EPIC. Foi realizado um estudo de caso da técnica PBIW sobre uma arquitetura de alto desempenho chamada de 2D-VLIW. O desempenho da técnica de codificação foi avaliado através de experimentos com programas dos benchmarks MediaBench, SPECint e SPECfp. Os experimentos estáticos avaliaram a eficiência da codificação PBIW no aspecto de redução de código. Nestes experimentos foram alcançadas reduções no tamanho dos programas de até 81% sobre programas codificados com a estratégia de codifica¸c¿ao 2D-VLIW e reduções de até 46% quando comparados á programas utilizando o modelo de codificação EPIC. Experimentos dinâmicos mostraram que a codificação PBIW também é capaz que gerar ganhos com relação ao tempo de execução dos programas. Quando comparada à codificação 2D-VLIW, o speedup alcançado foi de at'e 96% e quando comparada à EPIC, foi de até 69%
Abstract: Past works has shown that the increase of DRAM memory speed is not the same of processor speed. Even though, computer architecture researchers keep searching for new approaches to enhance the processor performance. In order to minimize this difference between the processor and memory speed, this work presents a new encoding technique based on encoded instructions and instruction patterns - PBIW (Pattern Based Instruction Word). An encoded instruction contains no redundancy of data and it is stored into an I-cache. The instruction patterns, on the other hand, are stored into a new cache, named Pattern cache (P-cache) and are used by the decoder circuit to build the instruction to be executed in the execution stages. This technique has shown a suitable alternative to well-known architectural styles such as VLIW and EPIC architectures. A case study of this technique was carried out in a high performance architecture called 2D-VLIW. The performance of the encoding technique has been evaluated through trace-driven experiments with MediaBench, SPECint and SPECfp programs. The static experiments have evaluated the PBIW code reduction efficiency. In these experiments, PBIW encoding has achieved up to 81% code reduction over 2D-VLIW encoded programs and up to 46% code reduction over EPIC encoded programs. Dynamic experiments have shown that PBIW encoding can also improve the processor performance. When compared to 2D-VLIW encoding, the speedup was up to 96% while compared to EPIC, the speedup was up to 69%
Mestrado
Arquitetura de Computadores
Mestre em Ciência da Computação

APA, Harvard, Vancouver, ISO, and other styles

36

Zlatohlávková, Lucie. "Návrh a implementace prostředků pro zvýšení výkonu procesoru." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2007. http://www.nusl.cz/ntk/nusl-412764.

Full text

Abstract:

This masters thesis is focused on the issue of processor architecture. The ground of this project is a design of a simple processor, which is enriched by modern components in processor architecture such as pipelining, cache memory and branch prediction. The processor has been made in VHDL programming language and was simulated in ModelSim simulation tool.

APA, Harvard, Vancouver, ISO, and other styles

37

Cruz, Eduardo Henrique Molina da. "Dynamic detection of the communication pattern in shared memory environments for thread mapping." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2012. http://hdl.handle.net/10183/49758.

Full text

Abstract:

As threads de aplicações paralelas cooperam a fim de cumprir suas tarefas, dessa forma, comunicação é realizada entre elas. A latência de comunicação entre os núcleos em arquiteturas multiprocessadas diferem dependendo da hierarquia de memória e das interconexões. Com o aumento do número de núcleos por chip e número de threads por núcleo, esta diferença entre as latências de comunicação está aumentando. Portanto, é importante mapear as threads de aplicações paralelas levando em conta a comunicação entre elas. Em aplicações paralelas baseadas no paradigma de memória compartilhada, a comunicação é implícita e ocorre através de acessos à variáveis compartilhadas, o que torna difícil a descoberta do padrão de comunicação entre as threads. Mecanismos tradicionais usam simulação para monitorar os acessos à memória realizados pela aplicação, requerendo modificações no código fonte e aumentando drasticamente a sobrecarga. Nesta dissertação de mestrado, são introduzidos dois mecanismos inovadores com uma baixa sobrecarga para se detectar o padrão de comunicação entre threads. O primeiro mecanismo faz uso de informações sobre linhas compartilhadas de caches providas por protocolos de coerência de cache. O segundo mecanismo utiliza a Translation Lookaside Buffer (TLB) para detectar quais páginas de memória cada núcleo está acessando. Ambos os mecanismos dependem totalmente do hardware, o que torna o mapeamento de threads transparente aos programadores e permite que ele seja realizado dinamicamente pelo sistema operacional. Além disto, nenhuma tarefa de alta sobrecarga, como simulação, é requerida. As propostas foram avaliadas com o NAS Parallel Benchmarks (NPB), obtendo representações precisas dos padrões de comunicação. Mapeamentos para as threads foram gerados utilizando os padrões de comunicação descobertos e um algoritmo de mapeamento. O problema do mapeamento é NP-Difícil. Portanto, de forma a se atingir uma complexidade polinomial, o algoritmo empregado é heurístico, baseado no algoritmo de emparelhamento de grafos de Edmonds. Executando as aplicações com o mapeamento resultou em um ganho de desempenho de até 15; 3%. O número de faltas na cache, invalidações em linhas de cache e transações de espionagem foram reduzidos em até 31; 9%, 41% e 65; 4%, respectivamente.
The threads of parallel applications cooperate in order to fulfill their tasks, thereby communication is performed among themselves. The communication latency between the cores in a multiprocessor architecture differs depending on the memory hierarchy and the interconnections. With the increase in the number of cores per chip and the number of threads per core, this difference between the communication latencies is increasing. Therefore, it is important to map the threads of parallel applications taking into account the communication between them. In parallel applications based on the shared memory paradigm, the communication is implicit and occurs through accesses to shared variables, which makes difficult to detect the communication pattern between the threads. Traditional approaches use simulation to monitor the memory accesses performed by the application, requiring modifications to the source code and drastically increasing the overhead. In this master thesis, we introduce two novel light-weight mechanisms to find the communication pattern of threads. The first mechanism makes use of the information about shared cache lines provided by cache coherence protocols. The second mechanism makes use of the Translation Lookaside Buffer (TLB) to detect which memory pages each core is accessing. Both our mechanisms rely entirely on hardware features, which makes the thread mapping transparent to the programmer and allows it to be performed dynamically by the operating system. Moreover, no time consuming task, such as simulation, is required. We evaluated our mechanisms with the NAS Parallel Benchmarks (NPB) and obtained accurate representations of the communication patterns. We generated thread mappings from the detected communication patterns using a mapping algorithm. Mapping is a NP-Hard problem. Therefore, in order to achieve a polynomial complexity, we designed a heuristic method based on the Edmonds graph matching algorithm. Running the applications with these mappings resulted in performance improvements of up to 15.3% compared to the original scheduler of the operating system. The number of cache misses, cache line invalidations and snoop transactions were reduced by up to 31.9%, 41% and 65.4%, respectively.

APA, Harvard, Vancouver, ISO, and other styles

38

Pan, Xiang. "Designing Future Low-Power and Secure Processors with Non-Volatile Memory." The Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu1492631536670669.

Full text

APA, Harvard, Vancouver, ISO, and other styles

39

Nordén, Markus. "Parallel PDE solvers on cc-NUMA systems /." Uppsala : Univ. : Dept. of Information Technology, Univ, 2004.

Find full text

APA, Harvard, Vancouver, ISO, and other styles

40

Tiosso, Fernando. "Utilização de objetos de aprendizagem para melhoria da qualidade do ensino de hierarquia de memória." Universidade de São Paulo, 2015. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-13082015-142049/.

Full text

Abstract:

O ensino e a aprendizagem do tema hierarquia de memória não são tarefas simples, pois muitos assuntos que são abordados em teoria podem desmotivar a aprendizagem em virtude de sua complexidade. Este projeto de mestrado apresenta a transformação do módulo de memória cache da ferramenta Amnesia em um objeto de aprendizagem, que visa facilitar a construção do conhecimento através da simulação da estrutura e da funcionalidade da hierarquia de memória na arquitetura von Neumann de uma maneira mais prática e didática. Este processo permitiu que funcionalidades existentes na ferramenta fossem readequadas e novas funcionalidades desenvolvidas. Aliado a isso, planos de aula e questionários de avaliação e usabilidade também foram concebidos, validados e implementados junto à elaboração de um tutorial para descrever o funcionamento do novo objeto. Os estudos experimentais realizados analisaram dois aspectos: o primeiro, se o objeto de aprendizagem melhorou, de fato, a aprendizagem dos alunos no assunto memória cache; o segundo, a opinião dos alunos em relação à utilização do objeto. Após a análise e avaliação dos resultados obtidos nos experimentos, foi possível demonstrar uma evolução na aprendizagem quando se fez o uso do objeto, além de se perceber que a motivação dos alunos em utilizar outros objetos de aprendizagem aumentou.
The teaching and learning of memory hierarchy are not simple tasks, because many subjects that are covered in theory may demotivate learning because of its complexity. This Master\'s thesis presents the process of transformation of the cache memory module of Amnesia tool in a learning object, aiming to facilitate the construction of knowledge by simulating the structure and functionality of memory hierarchy of von Neumann architecture in a more practice and didactic way. This process allowed existing features in the tool to be adequate and new features developed. In addition, lesson plans and questionnaires of assessment and usability have also been designed, validated and implemented and a tutorial to describe the operation of the new object was developed. Experimental studies have examined two aspects: the first, if the learning object improved, in fact, the students\' learning in the subject cache memory; the second, students\' opinions regarding the use of the object. After the analysis and evaluation of the results obtained in the experiments, was possible show an evolution in learning when it made the use of the object, and also to perceive that students\' motivation to use other learning objects increased.

APA, Harvard, Vancouver, ISO, and other styles

41

Sandberg, Andreas. "Understanding Multicore Performance : Efficient Memory System Modeling and Simulation." Doctoral thesis, Uppsala universitet, Avdelningen för datorteknik, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-220652.

Full text

Abstract:

To increase performance, modern processors employ complex techniques such as out-of-order pipelines and deep cache hierarchies. While the increasing complexity has paid off in performance, it has become harder to accurately predict the effects of hardware/software optimizations in such systems. Traditional microarchitectural simulators typically execute code 10 000×–100 000× slower than native execution, which leads to three problems: First, high simulation overhead makes it hard to use microarchitectural simulators for tasks such as software optimizations where rapid turn-around is required. Second, when multiple cores share the memory system, the resulting performance is sensitive to how memory accesses from the different cores interleave. This requires that applications are simulated multiple times with different interleaving to estimate their performance distribution, which is rarely feasible with today's simulators. Third, the high overhead limits the size of the applications that can be studied. This is usually solved by only simulating a relatively small number of instructions near the start of an application, with the risk of reporting unrepresentative results. In this thesis we demonstrate three strategies to accurately model multicore processors without the overhead of traditional simulation. First, we show how microarchitecture-independent memory access profiles can be used to drive automatic cache optimizations and to qualitatively classify an application's last-level cache behavior. Second, we demonstrate how high-level performance profiles, that can be measured on existing hardware, can be used to model the behavior of a shared cache. Unlike previous models, we predict the effective amount of cache available to each application and the resulting performance distribution due to different interleaving without requiring a processor model. Third, in order to model future systems, we build an efficient sampling simulator. By using native execution to fast-forward between samples, we reach new samples much faster than a single sample can be simulated. This enables us to simulate multiple samples in parallel, resulting in almost linear scalability and a maximum simulation rate close to native execution.
CoDeR-MP
UPMARC

APA, Harvard, Vancouver, ISO, and other styles

42

Lesage, Benjamin. "Architecture multi-coeurs et temps d'exécution au pire cas." Phd thesis, Université Rennes 1, 2013. http://tel.archives-ouvertes.fr/tel-00870971.

Full text

Abstract:

Les tâches critiques en systèmes temps-réel sont soumises à des contraintes temporelles et de correction. La validation d'un tel système repose sur l'estimation du comportement temporel au pire cas de ses tâches. Le partage de ressources, inhérent aux architectures multi-cœurs, entrave le calcul de ces estimations. Le comportement temporel d'une tâche dépend de ses rivales du fait de l'arbitrage de l'accès aux ressources ou de modifications concurrentes de leur état. Cette étude vise à l'estimation de la contribution temporelle de la hiérarchie mémoire au pire temps d'exécution de tâches critiques. Les méthodes existantes, pour caches d'instructions, sont étendues afin de supporter caches de données privés et partagés, et permettre l'analyse de hiérarchies mémoires riches. Le court-circuitage de cache est ensuite utilisé pour réduire la pression sur les caches partagés. Nous proposons à cette fin différentes heuristiques basées sur la capture de la réutilisation de blocs de cache entre différents accès mémoire. Notre seconde proposition est la politique de partitionnement Preti qui permet l'allocation d'un espace sans conflits à une tâche. Preti favorise aussi les performances de tâches non critiques concurrentes aux temps-réel dans les systèmes de criticité hybride.

APA, Harvard, Vancouver, ISO, and other styles

43

Guo, Lei. "Insights into access patterns of Internet media systems measurements, analysis, and system design /." Columbus, Ohio : Ohio State University, 2008. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=osu1198696679.

Full text

APA, Harvard, Vancouver, ISO, and other styles

44

Brewer, Jeffery Ramon. "Reconfigurable Cache Memory." OpenSIUC, 2009. https://opensiuc.lib.siu.edu/theses/48.

Full text

Abstract:

AN ABSTRACT OF THE THESIS OF Jeffery R. Brewer, for the Master degree in Electrical Computer Engineer, presented on May 22, 2009 at Southern Illinois University Carbondale. TITLE: Reconfigurable Cache Memory MAJOR PROFESSOR: Dr. Nazeih Botros As chip designers continue to push the performance of microprocessors to higher levels the energy demand grows. The increase need for integrated chips that provide energy savings without degrading performance is paramount. The cache memory is typically over fifty percent of the size of today's microprocessor chip, and consumes a significant percentage of the total power. Therefore, by designing a reconfigurable cache that's able to dynamically adjust to a smaller cache size without encountering a significant degrade in performance, we are able to realize power conservation. Tournament caching is a reconfigurable method that tracks the current performance of the cache and compares it to possible smaller or larger cache size [1] . The results in this thesis shows that reconfigurable cache memory implemented with a configuration mechanism like Tournament caching would take advantage of associativity and cache size while providing energy conservation. i

APA, Harvard, Vancouver, ISO, and other styles

45

Van, Vleet Taylor. "Dynamic cache-line sizes /." Thesis, Connect to this title online; UW restricted, 2000. http://hdl.handle.net/1773/6899.

Full text

APA, Harvard, Vancouver, ISO, and other styles

46

Sehat, Kamiar. "Evaluation of caches and cache coherency." Thesis, University of Cambridge, 1992. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.335240.

Full text

APA, Harvard, Vancouver, ISO, and other styles

47

He, Bingsheng. "Cache-oblivious query processing /." View abstract or full-text, 2008. http://library.ust.hk/cgi/db/thesis.pl?CSED%202008%20HE.

Full text

APA, Harvard, Vancouver, ISO, and other styles

48

Romer, Theodore H. "Using virtual memory to improve cache and TLB performance /." Thesis, Connect to this title online; UW restricted, 1998. http://hdl.handle.net/1773/6913.

Full text

APA, Harvard, Vancouver, ISO, and other styles

49

Pattabiraman, Aishwariya. "Heterogeneous Cache Architecture in Network-on-Chips." University of Cincinnati / OhioLINK, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1321371508.

Full text

APA, Harvard, Vancouver, ISO, and other styles

50

Ammari, Rami J. "A study for reducing conflict misses in data cache." Master's thesis, Mississippi State : Mississippi State University, 2004. http://library.msstate.edu/etd/show.asp?etd=etd-04032004-211908.

Full text

APA, Harvard, Vancouver, ISO, and other styles

Dissertations / Theses on the topic 'Computer architecture. Cache memory'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles