
Dissertations / Theses on the topic 'Cache memory – Design'


Consult the top 37 dissertations / theses for your research on the topic 'Cache memory – Design.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Gieske, Edmund Joseph. "Critical Words Cache Memory." University of Cincinnati / OhioLINK, 2008. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1208368190.

2

Jakšić, Zoran. "Cache memory design in the FinFET era." Doctoral thesis, Universitat Politècnica de Catalunya, 2015. http://hdl.handle.net/10803/316394.

Abstract:
The major problem in future technology scaling is variation in process parameters, interpreted as imperfections in the fabrication process. Moreover, devices are more sensitive to environmental changes in temperature and supply voltage, as well as to ageing. All of these influences manifest in integrated circuits as increased power consumption, reduced maximum operating frequency and an increased number of failures. These effects have been partially overcome with the introduction of FinFET technology, which has solved the variability caused by Random Dopant Fluctuations. However, in the next ten years the channel length is projected to shrink to 10nm, where the variability generated by Line Edge Roughness will dominate and its effect on threshold voltage variations will become critical. Embedded memories, with their cells as the basic building unit, are the most prone to these effects because of their small dimensions. Memories should therefore be designed with particular care in order to make further technology scaling possible. This thesis explores upcoming 10nm FinFETs and the issues they raise for cache memory design, and presents original techniques at different levels of design abstraction for mitigating the effects of process and environmental variability. First, an original method for simulating the variability of Tri-Gate FinFETs is presented, using a conventional HSPICE simulation environment and BSIM-CMG model cards. With that in place, a thorough characterisation of traditional SRAM cell circuits (6T and 8T) is performed, and the possibility of using Independent-Gate FinFETs to increase cell stability is explored. Gain cells have recently appeared as an attractive alternative for cache memory design; this thesis explores the idea through a detailed circuit analysis of a dynamic 3T gain cell in 10nm FinFETs. Finally, the thesis presents a micro-architectural optimisation of a high-speed cache implemented with 3T gain cells, showing how cache coherence states can be used to reduce the memory's refresh energy as well as its ageing.
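The abstract only hints at how coherence states can cut refresh energy in a dynamic gain-cell cache. The sketch below is a minimal illustration of that general idea, not the thesis's actual mechanism: it assumes MESI-like states and a software model, and simply skips refreshing lines whose contents are invalid or safely duplicated at a lower level.

```python
# Hypothetical illustration: a refresh pass over a dynamic (3T gain cell)
# cache that consults MESI-like coherence states. States, policy, and the
# drop-instead-of-refresh rule are assumptions made for this sketch.

class Line:
    def __init__(self, state="I"):
        self.state = state  # one of "M", "E", "S", "I"

def refresh_pass(cache):
    """Refresh only lines that hold the sole up-to-date copy."""
    refreshed = 0
    for line in cache:
        if line.state == "M":
            refreshed += 1      # modified data exists nowhere else: refresh
        elif line.state in ("E", "S"):
            line.state = "I"    # a clean copy can be refetched: drop instead
        # "I" lines hold no valid data, so they never pay refresh energy
    return refreshed

cache = [Line(s) for s in "MSIEMISS"]
print(f"refreshed {refresh_pass(cache)} of {len(cache)} lines")
```

Under this policy only the Modified lines pay refresh energy, which is the flavour of saving the abstract attributes to exploiting coherence state.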
3

Pendyala, Ragini. "Cache memory design with embedded LRU replacement policy /." Available to subscribers only, 2006. http://proquest.umi.com/pqdweb?did=1240704191&sid=10&Fmt=2&clientId=1509&RQT=309&VName=PQD.

4

Rasquinha, Mitchelle. "An energy efficient cache design using spin torque transfer (STT) RAM." Thesis, Georgia Institute of Technology, 2011. http://hdl.handle.net/1853/42715.

Abstract:
The advent of many-core architectures has coincided with the energy- and power-limited design of modern processors. Projections for main memory clearly show a widening of the processor-memory gap. Increasing cache capacity to help reduce this gap will lead to increased energy and area usage and, given the small growth in die size, impede the performance scaling that has accompanied Moore's Law to date. Among the dominant sources of energy consumption is the on-chip memory hierarchy, specifically the L2 cache and the Last Level Cache (LLC). This work explores the use of a novel non-volatile memory technology, Spin Torque Transfer RAM (STT-RAM), for the design of the L2/LLC caches. While STT-RAM is a promising memory technology, it has some limitations, particularly in terms of write energy and write latency. The main objective of this thesis is to use a novel cell design for a non-volatile 1T1MTJ cell and demonstrate its use at the L2 and LLC cache levels, with architectural optimizations to maximize energy reduction. The proposed cache hierarchy dissipates significantly less energy (both leakage and dynamic) and uses less area in comparison to conventional SRAM-based cache designs.
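For a feel of why the write asymmetry matters, here is a hedged back-of-the-envelope comparison; every number below (per-access energies, leakage, access mix) is invented for the illustration and does not come from the thesis.

```python
# Toy energy comparison of an SRAM LLC vs. an STT-RAM LLC. All parameters
# are placeholder values for illustration only, not measured results.

sram = {"read_pj": 100, "write_pj": 110, "leak_mw": 500}  # leaky, symmetric
stt  = {"read_pj": 120, "write_pj": 900, "leak_mw": 50}   # low leakage, costly writes

def energy_mj(tech, reads, writes, seconds):
    dynamic_mj = (reads * tech["read_pj"] + writes * tech["write_pj"]) * 1e-9
    static_mj = tech["leak_mw"] * seconds  # mW * s = mJ
    return dynamic_mj + static_mj

reads, writes, secs = 2e9, 2e8, 1.0  # a read-heavy LLC access mix
for name, tech in (("SRAM", sram), ("STT-RAM", stt)):
    print(f"{name:7s}: {energy_mj(tech, reads, writes, secs):6.0f} mJ")
```

With a read-heavy mix, the leakage savings dominate the write-energy penalty; pushing that trade-off further is what the cell design and architectural optimizations in the thesis target.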
5

Karlsson, Martin. "Cache memory design trade-offs for current and emerging workloads." Licentiate thesis, Uppsala universitet, Avdelningen för datorteknik, 2003. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-86156.

Abstract:
The memory system is the key to performance in contemporary computer systems. When designing a new memory system, architectural decisions are often arbitrated based on their expected performance effect. It is therefore very important to make performance estimates based on workloads that accurately reflect the future use of the system. This thesis presents the first memory system characterization study of Java-based middleware, an emerging workload likely to be an important design consideration for next generation processors and servers. Manufacturing technology has reached a point where it is now possible to fit multiple full-scale processors and integrate board-level features on a chip. The raised competition for chip resources has increased the need to design more effective caches without trading off area or power. Two common ways to improve cache performance are to increase the size or the associativity of the cache; both approaches come at a high cost in chip area as well as power. This thesis presents two new cache organizations, each aimed at more efficient use of either power or area. First, the Elbow cache is presented, which is shown to be a power-efficient alternative to highly set-associative caches. Second, a selective cache allocation algorithm, RASCAL, is presented that significantly reduces the miss ratio at a limited cost in area.
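The abstract does not describe RASCAL's allocation criterion, so the sketch below only conveys the general shape of a selective allocation policy: a block must prove some reuse before it earns a cache slot. The miss-count predictor and its threshold are assumptions for the example.

```python
# A generic selective cache-allocation sketch (not RASCAL's actual rule):
# a block is allocated only once it has missed REUSE_THRESHOLD times, so
# streaming data that is touched once never displaces useful blocks.

from collections import OrderedDict, defaultdict

REUSE_THRESHOLD = 2

class SelectiveCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()          # LRU order of cached blocks
        self.miss_count = defaultdict(int)  # predictor: misses per block

    def access(self, block):
        if block in self.lines:
            self.lines.move_to_end(block)   # LRU update on a hit
            return "hit"
        self.miss_count[block] += 1
        if self.miss_count[block] >= REUSE_THRESHOLD:  # bypass cold blocks
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)         # evict LRU block
            self.lines[block] = True
        return "miss"

cache = SelectiveCache(capacity=2)
trace = [1, 2, 3, 1, 1, 2, 2, 9, 1, 2]   # block 9 is touched once: never allocated
print([cache.access(b) for b in trace])
```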
6

Lodde, Mario. "Smart Memory and Network-On-Chip Design for High-Performance Shared-Memory Chip Multiprocessors." Doctoral thesis, Universitat Politècnica de València, 2014. http://hdl.handle.net/10251/35325.

Abstract:
The cache hierarchy and the network-on-chip (NoC) are two key components of chip multiprocessors (CMPs). Most of the traffic in the NoC is due to messages that the caches send as dictated by the coherence protocol. The amount of traffic, the fraction of short and long messages and the overall traffic pattern vary with the geometry of the caches and the coherence protocol. The NoC architecture and the cache hierarchy are in fact tightly coupled, and these two components should be designed and evaluated jointly to study how varying one affects the performance of the other. Moreover, each component should be tuned to the requirements and opportunities of the other, and vice versa. Different classes of messages are usually sent through different virtual networks or through NoCs of different bandwidth, separating long and short messages. However, messages can also be classified by the kind of information they carry: some, such as data requests, need fields to store information (block address, request type, etc.); others, such as acknowledgement messages (ACKs), carry no information except the destination node ID. They only convey timing information, in the sense that receiving an ACK indicates that the source node has received the message being acknowledged and completed all the operations required by the coherence protocol. This second class of messages does not need much bandwidth: latency is far more important, since the destination node is typically blocked waiting for them. This thesis develops a dedicated network to transmit this second class of messages; the network is very simple and fast, and delivers ACKs within a few clock cycles. By reducing ACK latency and ACK traffic on the NoC, it becomes possible to: (i) speed up the invalidation phase of write operations in a system using a directory-based coherence protocol; (ii) improve the performance of a broadcast-based coherence protocol to a level comparable with a directory protocol, but without the area cost of storing the directory; and (iii) efficiently implement dynamic mapping of blocks to the last-level caches, with the goal of bringing blocks as close as possible to the cores that use them. The final objective is a co-design of the NoC and the cache hierarchy that minimises the scalability problems of coherence protocols. The ultimate goal is a CMP with dynamic allocation of cache and network resources, such that these resources can be partitioned efficiently and independently in order to assign different partitions to different applications in a virtualised environment.
Lodde, M. (2014). Smart Memory and Network-On-Chip Design for High-Performance Shared-Memory Chip Multiprocessors [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/35325
7

Bond, Paul Joseph. "Design and analysis of reconfigurable and adaptive cache structures." Diss., Georgia Institute of Technology, 1995. http://hdl.handle.net/1853/14983.

8

Chandran, Pravin Chander. "Design of ALU and Cache memory for an 8 bit microprocessor." Connect to this title online, 2007. http://etd.lib.clemson.edu/documents/1202498822/.

9

Tan, Yudong. "Cache design and timing analysis for preemptive multi-tasking real-time uniprocessor systems." Diss., Available online, Georgia Institute of Technology, 2005, 2005. http://etd.gatech.edu/theses/available/etd-04132005-212947/unrestricted/yudong%5Ftan%5F200505%5Fphd.pdf.

Abstract:
Thesis (Ph. D.)--Electrical and Computer Engineering, Georgia Institute of Technology, 2005.
Schimmel, David, Committee Member ; Meliopoulos, A. P. Sakis, Committee Member ; Mooney, Vincent, Committee Chair ; Prvulovic, Milos, Committee Member ; Yalamanchili, Sudhakar, Committee Member. Includes bibliographical references.
10

Rabbah, Rodric Michel. "Design Space Exploration and Optimization of Embedded Memory Systems." Diss., Georgia Institute of Technology, 2006. http://hdl.handle.net/1853/11605.

Abstract:
Recent years have witnessed the emergence of microprocessors that are embedded within a plethora of devices used in everyday life. Embedded architectures are customized through a meticulous and time-consuming design process to satisfy stringent constraints with respect to performance, area, power, and cost. In embedded systems, the cost of the memory hierarchy limits its ability to play as central a role as it does in general-purpose systems, due to stringent constraints that fundamentally limit the physical size and complexity of the memory system. Ultimately, application developers and system engineers are charged with the heavy burden of reducing the memory requirements of an application. This thesis offers the intriguing possibility that compilers can play a significant role in the automatic design space exploration and optimization of embedded memory systems. This insight is founded upon a new analytical model and novel compiler optimizations that are specifically designed to increase the synergy between the processor and the memory system. The analytical models serve to characterize intrinsic program properties, quantify the impact of compiler optimizations on the memory system, and provide deep insight into the trade-offs that affect memory system design.
11

Chae, Youngsu. "Algorithms, protocols and services for scalable multimedia streaming." Diss., Georgia Institute of Technology, 2002. http://hdl.handle.net/1853/8148.

12

Akgul, Bilge Ebru Saglam. "The System-on-a-Chip Lock Cache." Diss., Georgia Institute of Technology, 2004. http://hdl.handle.net/1853/5253.

Abstract:
In this dissertation, we implement efficient lock-based synchronization with a novel, high-performance, simple and scalable hardware technique and associated software for a target shared-memory multiprocessor System-on-a-Chip (SoC). The custom hardware part of our solution is provided in the form of an intellectual property (IP) hardware unit which we call the SoC Lock Cache (SoCLC). SoCLC provides effective lock hand-off by reducing on-chip memory traffic and improving performance in terms of lock latency, lock delay and bandwidth consumption. The proposed solution is independent of the memory hierarchy, cache protocol and processor architectures used in the SoC, which enables easily applicable implementations of the SoCLC (e.g., as reconfigurable or partially/fully custom logic), and which distinguishes SoCLC from previous approaches. Furthermore, the SoCLC mechanism has been extended to support priority inheritance with an immediate priority ceiling protocol (IPCP) implemented in hardware, which enhances the hard real-time performance of the system. Our experimental results on a four-processor SoC indicate that SoCLC can achieve up to 37% overall speedup over spin-lock and up to 48% overall speedup over MCS for a microbenchmark with false sharing. The priority inheritance implemented as part of the SoCLC hardware, on the other hand, achieves a 1.43X speedup in the overall execution time of a robot application when compared to the priority inheritance implementation under the Atalanta real-time operating system. Furthermore, it has been shown that with the IPCP mechanism integrated into the SoCLC, all of the tasks of the robot application could meet their deadlines (e.g., a high-priority task with a 250us worst-case response time completed its execution in 93us with SoCLC, whereas the same task missed its deadline, completing in 283us, without SoCLC). Therefore, with IPCP support, our solution can provide better real-time guarantees for real-time systems. To automate SoCLC design, we have also developed an SoCLC-generator tool, PARLAK, that generates user-specified configurations of a custom SoCLC. We used PARLAK to generate SoCLCs from a version for two processors with 32 lock variables, occupying 2,520 gates, up to a version for fourteen processors with 256 lock variables, occupying 78,240 gates.
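As a rough software model of what a lock cache does, the hedged sketch below keeps a per-lock held bit plus a queue of waiting cores, so hand-off happens without spinning on shared memory. The interface and the FIFO hand-off order are assumptions for the sketch; the real SoCLC is a hardware IP block and additionally supports priority inheritance.

```python
# Software model of a lock cache in the spirit of SoCLC: a small table of
# lock bits and per-lock waiter queues replaces spin-loops on shared memory.
# The API and FIFO hand-off below are assumed, not SoCLC's actual interface.

from collections import deque

class LockCache:
    def __init__(self, n_locks):
        self.held = [False] * n_locks
        self.waiters = [deque() for _ in range(n_locks)]

    def acquire(self, lock_id, core):
        if not self.held[lock_id]:
            self.held[lock_id] = True
            return True            # granted immediately
        self.waiters[lock_id].append(core)
        return False               # core blocks; no memory-bus spinning

    def release(self, lock_id):
        if self.waiters[lock_id]:
            return self.waiters[lock_id].popleft()  # direct lock hand-off
        self.held[lock_id] = False
        return None

lc = LockCache(n_locks=4)
print(lc.acquire(0, core=0))   # True: core 0 gets the lock
print(lc.acquire(0, core=1))   # False: core 1 is queued
print(lc.release(0))           # 1: lock handed straight to core 1
```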
13

Carlson, Ingvar. "Design and Evaluation of High Density 5T SRAM Cache for Advanced Microprocessors." Thesis, Linköping University, Department of Electrical Engineering, 2004. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-2286.

Abstract:
This thesis presents a five-transistor SRAM intended for the advanced microprocessor cache market. The goal is to reduce the area of the cache memory array while maintaining competitive performance. Various existing technologies are briefly discussed with their strengths and weaknesses. The design metrics for the five-transistor cell are discussed in detail, and performance and stability are evaluated. Finally, a comparison is made between a 128Kb memory in an existing six-transistor technology and the proposed technology. The comparisons include area, performance and stability of the memories. It is shown that the area of the memory array can be reduced by 23% while maintaining comparable performance. The new cell also has 43% lower total leakage current. As a trade-off for these advantages some of the stability margin is lost, but the cell is still stable in all process corners. The performance and stability have been validated through post-layout simulations using Cadence Spectre.
14

Fazli, Yeknami Ali. "Design and Evaluation of A Low-Voltage, Process-Variation-Tolerant SRAM Cache in 90nm CMOS Technology." Thesis, Linköping University, Department of Electrical Engineering, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-12260.

Abstract:
This thesis presents a novel six-transistor SRAM intended for advanced microprocessor cache applications. The objectives are to reduce power consumption by scaling the supply voltage and to design an SRAM that is fully process-variation-tolerant, utilizing separate read and write access ports as well as exploiting asymmetry. A traditional six-transistor SRAM is designed and its strengths and weaknesses are discussed in detail. Afterwards, a new SRAM technology developed in the Division of Electronic Devices, Linköping University, is proposed and its capabilities and drawbacks are examined in depth. Subsequently, the impact of mismatch and process variation on both the standard 6T and the proposed asymmetric 6T SRAM cells is investigated. Finally, the cells are compared regarding voltage scalability, stability, and tolerance to variations in process parameters. It is shown that the new cell functions at 430mV while maintaining an acceptable SNM margin in all process corners, and that the proposed SRAM is fully process-variation-tolerant. Additionally, a dual-Vt asymmetric 6T cell is introduced with a wide SNM margin, comparable to that of the conventional 6T cell, that is capable of functioning at 580mV.
15

Giordano, Omar. "Design and Implementation of an Architecture-aware In-memory Key- Value Store." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-291213.

Abstract:
Key-Value Stores (KVSs) are a type of non-relational database whose data is represented as key-value pairs; they are often used for cache and session data storage. Among them, Memcached is one of the most popular, as it is widely used in various Internet services such as social networks and streaming platforms. Given the continuous and increasingly rapid growth of networked devices that use these services, the commodity hardware on which the databases run must process packets faster to meet the needs of the market. In recent years, however, the performance improvements of new hardware have become thinner and thinner. Since the purchase of new products is no longer synonymous with significant performance improvements, companies need to exploit the full potential of the hardware already in their possession, postponing the purchase of more recent hardware. One of the latest ideas for increasing the performance of commodity hardware is slice-aware memory management. This technique exploits the Last Level Cache (LLC) by making sure that individual cores take data from memory locations that are mapped to their respective cache portions (i.e., LLC slices). This thesis focuses on the realisation of a KVS prototype, based on the Intel Haswell micro-architecture and built on top of the Data Plane Development Kit (DPDK), to which the principles of slice-aware memory management are applied. To test its performance, given the non-existence of a DPDK-based traffic generator that supports the Memcached protocol, an additional prototype of a traffic generator that supports these features was also developed. Performance was measured using two distinct machines: one for the traffic generator and one for the KVS. First the 'regular' KVS prototype was tested; then, to see the actual benefits, the slice-aware one. Both KVS prototypes were subjected to two types of traffic: (i) uniform traffic, where the keys are always different from each other, and (ii) skewed traffic, where keys are repeated and some keys are more likely to be repeated than others. The experiments show that, in a real-world scenario (i.e., characterised by skewed key distributions), the use of a slice-aware memory management technique in a KVS can slightly improve end-to-end latency (by ~2%). Additionally, the technique strongly affects the look-up time required by the CPU to find the key and the corresponding value in the database, decreasing the mean time by ~22.5% and improving the 99th percentile by ~62.7%.
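The slice-hash used by Intel LLCs is undocumented, so the sketch below uses a stand-in hash merely to illustrate the slice-aware idea from the abstract: a core draws its buffers only from addresses that map to its own LLC slice.

```python
# Sketch of slice-aware memory management: each core allocates buffers only
# from addresses that map to its own LLC slice. slice_of() is a hypothetical
# stand-in (XOR-fold of address bits) for Intel's undocumented slice hash,
# so this shows the partitioning idea, not the real mapping.

N_SLICES = 8
CACHE_LINE = 64

def slice_of(addr: int) -> int:
    folded = (addr >> 6) ^ (addr >> 14) ^ (addr >> 22)  # assumed hash
    return folded % N_SLICES

def slice_local_pool(base: int, size: int, my_slice: int):
    """Yield cache-line-aligned addresses in [base, base+size) on my_slice."""
    for addr in range(base, base + size, CACHE_LINE):
        if slice_of(addr) == my_slice:
            yield addr

pool = list(slice_local_pool(base=0x10000, size=1 << 20, my_slice=3))
print(f"{len(pool)} lines of the 1 MiB region map to slice 3")
```

The design point is that a lookup served from the core's local slice avoids crossing the on-die interconnect, which is where the reported look-up-time reductions come from.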
16

Kofuji, Jussara Marândola. "Método otimizado de arquitetura de coerência de cache baseado em sistemas embarcados multinúcleos." Universidade de São Paulo, 2011. http://www.teses.usp.br/teses/disponiveis/3/3142/tde-03042012-082623/.

Abstract:
This thesis presents an optimized cache-coherence architecture method specialized for embedded systems. Its main contributions are a proposal for a shared-memory CMP architecture oriented to memory access patterns together with a hybrid cache-coherence protocol, and the specification of a new hardware component, called the pattern table, which is validated through a formal representation and a first implementation of its structure. Based on this table, a message-transaction model for the hybrid protocol was developed that distinguishes classical from speculative messages. The final contribution is an analytical model of the effective performance cost of the hybrid protocol.
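The abstract does not specify the pattern table's fields, so the following sketch is only one plausible reading of the idea: a table that learns per-sharer stride patterns and lets the protocol mark messages for blocks that fit a pattern as speculative, falling back to classical messages otherwise.

```python
# Hedged sketch of a "pattern table" for a hybrid coherence protocol. The
# fields and the stride-based matching rule are assumptions for this sketch,
# not the thesis's actual table layout.

class PatternTable:
    def __init__(self):
        self.last_addr = {}    # sharer id -> last block address seen
        self.stride = {}       # sharer id -> last observed stride

    def classify(self, sharer, addr):
        """Return 'speculative' if addr continues the sharer's stride."""
        kind = "classical"
        if sharer in self.stride and \
           addr == self.last_addr[sharer] + self.stride[sharer]:
            kind = "speculative"   # pattern matched: send ahead of demand
        if sharer in self.last_addr:
            self.stride[sharer] = addr - self.last_addr[sharer]
        self.last_addr[sharer] = addr
        return kind

pt = PatternTable()
for a in (0x100, 0x140, 0x180, 0x300):
    print(hex(a), pt.classify(sharer=0, addr=a))
```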
17

Pan, Xiang. "Designing Future Low-Power and Secure Processors with Non-Volatile Memory." The Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu1492631536670669.

18

Guo, Lei. "Insights into access patterns of Internet media systems measurements, analysis, and system design /." Columbus, Ohio : Ohio State University, 2008. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=osu1198696679.

19

Liang, Shuang. "Algorithms Designs and Implementations for Page Allocation in SSD Firmware and SSD Caching in Storage Systems." The Ohio State University, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=osu1268420517.

20

Kwon, Woo Cheol. "Co-design of on-chip caches and networks for scalable shared-memory many-core CMPs." Thesis, Massachusetts Institute of Technology, 2018. http://hdl.handle.net/1721.1/118084.

Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 169-180).
Chip Multi-Processors (CMPs) have become mainstream in recent years, providing increased parallelism as core counts scale. While a tiled CMP is widely accepted to be a scalable architecture for the many-core era, on-chip cache organization and coherence are far from solved problems. As the on-chip interconnect directly influences the latency and bandwidth of the on-chip cache, a scalable interconnect is an essential part of on-chip cache design. On the other hand, the optimal interconnect design is determined by the traffic patterns it must handle. Thus, on-chip cache organization is inherently intertwined with on-chip interconnect design, and vice versa. This dissertation aims to motivate the need for re-organization of on-chip caches to leverage the advancement of on-chip network technology and harness the full potential of future many-core CMPs. Conversely, we argue that the on-chip network should also be designed to support the specific functionalities required by the on-chip cache. We propose co-design techniques that offer significant improvements in on-chip cache performance, and thus provide scalable CMP cache solutions for future many-core CMPs. The dissertation starts with the problem of remote on-chip cache access latency. Prior locality-aware approaches fundamentally attempt to keep data as close as possible to the requesting cores. In this dissertation, we challenge this design approach by introducing a new cache organization that leverages a co-designed on-chip network allowing multi-hop single-cycle traversals. Next, the dissertation moves to cache coherence request ordering. Without built-in ordering capability within the interconnect, cache coherence protocols have to rely on external ordering points. This dissertation proposes a scalable ordered Network-on-Chip which supports ordering of requests for snoopy cache coherence. Lastly, we describe the development of a 36-core research prototype chip to demonstrate that the proposed Network-on-Chip enables shared-memory CMPs to be readily scalable to many-core platforms.
by Woo Cheol Kwon.
Ph. D.
21

Kong, Jingfei. "ARCHITECTURAL SUPPORT FOR IMPROVING COMPUTER SECURITY." Doctoral diss., University of Central Florida, 2010. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/2610.

Abstract:
Computer security and privacy are becoming extremely important nowadays. The task of protecting computer systems from malicious attacks and potentially catastrophic losses is, however, challenged by the ever-increasing complexity and size of modern hardware and software designs. We propose several methods to improve computer security and privacy from an architectural point of view; they provide strong protection as well as performance efficiency. In our first approach, we propose a new dynamic information flow method to protect systems from popular software attacks such as buffer overflow and format string attacks. In our second approach, we propose to deploy encryption schemes to protect the privacy of an emerging non-volatile main memory technology, phase change memory (PCM). The negative impact of the encryption schemes on PCM lifetime is evaluated, and new methods, including a new encryption counter scheme and an efficient error correction code (ECC) management, are proposed to improve PCM lifetime. In our third approach, we deconstruct two previously proposed secure cache designs against software data-cache-based side channel attacks and demonstrate their weaknesses. We propose three hardware-software integrated approaches as secure protections against those data cache attacks, and we also propose applying them to protect instruction caches from similar threats. Furthermore, we propose a simple change to the update policy of the Branch Target Buffer (BTB) to defend against BTB attacks. Our experiments show that our proposed schemes are both effective for security and efficient in performance.
Ph.D., School of Electrical Engineering and Computer Science, Computer Science PhD program.
22

Agarwal, Vikas. "Scalable primary cache memory architectures." Thesis, 2004. http://hdl.handle.net/2152/1862.

23

"Real-time cache design." Chinese University of Hong Kong, 1996. http://library.cuhk.edu.hk/record=b5888780.

Abstract:
by Hon-Kai, Cheung.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1996.
Includes bibliographical references (leaves 102-105).
Contents:
Abstract
Acknowledgement
1 Introduction
1.1 Overview
1.2 Scheduling In Real-time Systems
1.3 Cache Memories
1.4 Outline Of The Dissertation
2 Related Work
2.1 Introduction
2.2 Predictable Cache Designs
2.2.1 Locking Cache Lines Design
2.2.2 Partially Dynamic And Static Cache Partition Allocation Design
2.2.3 SMART (Strategic Memory Allocation for Real Time) Cache Design
2.3 Prefetching
2.3.1 Introduction
2.3.2 Hardware Support Prefetching
2.3.3 Software Assisted Prefetching
2.3.4 Partial Cache Hit
2.3.5 Cache Pollution Problems
2.4 Cache Line Replacement Policies
2.5 Main Memory Update Policies
2.6 Summaries
3 Problems And Motivations
3.1 Introduction
3.2 Problems
3.2.1 Modern Cache Architecture Is Inappropriate For Real-time Systems
3.2.2 Intertask Interference: The Effects Of Preemption
3.2.3 Intratask Interference: Cache Line Collision
3.3 Motivations
3.3.1 Improvement Of The Cache Performance In Real-time Systems
3.3.2 Hiding Of Preemption Effects
3.4 Conclusions
4 Proposed Real-Time Cache Design
4.1 Introduction
4.2 Concepts Definition
4.2.1 Tasks Definition
4.2.2 Cache Performance Values
4.3 Issues Related To Proposed Real-Time Cache Design
4.3.1 A Task Serving Policy
4.3.2 Number Of Private And Shared Cache Partitions
4.3.3 Controlling The Cache Partitions: Cache Partition Table And Process Info Table
4.3.4 Re-organization Of Task-Owned Cache Partition(s)
4.3.5 Handling The Bus Bandwidth: Memory Requests Queue (MRQ)
4.3.6 How To Address The Cache Models
4.3.7 Data Coherence Problems For Partitioned And Non-partitioned Cache Models
4.4 Mechanism For Proposed Real-Time Cache Design
4.4.1 Basic Operation Of Proposed Real-Time Cache Design
4.4.2 Assumptions And Rules
4.4.3 First Round Dynamic Cache Partition Re-allocation
4.4.4 Later Round Dynamic Cache Partition Re-allocation
5 Simulation Environments
5.1 Proposed Architectural Model
5.2 Working Environment For Proposed Real-time Cache Models
5.2.1 Cost Model
5.2.2 System Model
5.2.3 Fair Comparison Between The Unified Cache And The Separate Caches
5.2.4 Operations Within The Preemption
5.3 Benchmark Programs
5.3.1 The NASA7 Benchmark
5.3.2 The SU2COR Benchmark
5.3.3 The TOMCATV Benchmark
5.3.4 The WAVE5 Benchmark
5.3.5 The COMPRESS Benchmark
5.3.6 The ESPRESSO Benchmark
5.4 Simulation Parameters
6 Analysis Of Simulations
6.1 Introduction
6.2 Trace Files Statistics
6.3 Interpretation Of Partial Cache Hit
6.4 The Effects Of Cache Size
6.4.1 Performances Of Model 1, Model 2, Model 3 And Model 4
6.5 The Effects Of Cache Partition Size
6.5.1 Performance Of Model 3
6.5.2 Performance Of Model 1
6.6 The Effects Of Line Size
6.6.1 Performance Of Model 1, Model 2, Model 3 And Model 4
6.7 The Effects Of Set Associativity
6.7.1 Performance Of Model 1, Model 2, Model 3 And Model 4
6.8 The Effects Of The Best-expected Cache Performance
6.8.1 Performance Of Model 1
6.8.2 Performance Of Model 3
6.9 The Effects Of The Standard-expected Cache Performance
6.9.1 Performance Of Model 1
6.9.2 Performance Of Model 3
6.10 The Effects Of Cycle Execution Time/Cycle Deadline Period
6.10.1 Performances Of Model 1, Model 2, Model 3 And Model 4
7 Conclusions And Future Work
7.1 Conclusions
7.1.1 Unified Cache Model Is More Suitable In Real-time Systems
7.1.2 Comments On Aperiodic Tasks
7.2 Future Work
24

Lin, Ya-Ching, and 林雅清. "Design of Efficient Cache Memory Systems for Multimedia Applications." Thesis, 1999. http://ndltd.ncl.edu.tw/handle/52464734104099086424.

Abstract:
Master's thesis, National Chung Cheng University, Graduate Institute of Computer Science and Information Engineering, ROC academic year 87 (1998-99).
As computers have become more and more popular, multimedia applications have become increasingly common and important. Thanks to the rapid improvement in CPU performance, we can now enjoy smooth, real-time audio/video entertainment on a computer. However, if the volume of multimedia data is too large, the application may be unable to run smoothly because of memory latency. If memory access time cannot be improved, memory latency will remain the bottleneck of execution time no matter how fast the CPU is. Based on the program and data characteristics of multimedia applications, this thesis describes how to use an RCT (Reference Contribution Table) to keep data with little reuse from being cached. To hide memory latency for multimedia applications, a Packed Cache, which stores prefetched and packed data, is also discussed: multiple expected data blocks can be prefetched into the Packed Cache, which not only improves the miss rate of the normal cache but also hides a portion of the memory latency.
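The thesis's exact Packed Cache organization is not given in the abstract; the sketch below only illustrates the stated idea of prefetching several expected blocks of a media stream and packing them into one buffer entry. The packing degree and the stride handling are assumptions for the example.

```python
# Hedged sketch of "packed prefetching" for streaming media data: on a miss
# to a strided stream, several upcoming blocks are fetched and packed into a
# single buffer entry, hiding part of the memory latency. PACK_DEGREE and
# the entry format are assumed, not the thesis's design.

PACK_DEGREE = 4  # assumed number of blocks packed per prefetch

class PackedPrefetchBuffer:
    def __init__(self):
        self.entries = {}   # start block -> list of packed blocks

    def miss(self, block, stride):
        packed = [block + i * stride for i in range(1, PACK_DEGREE + 1)]
        self.entries[block] = packed          # one entry holds them all
        return packed

    def lookup(self, block):
        return any(block in packed for packed in self.entries.values())

buf = PackedPrefetchBuffer()
buf.miss(block=100, stride=2)                 # prefetches 102, 104, 106, 108
print(buf.lookup(104), buf.lookup(105))       # True False
```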
25

"Design of disk cache for high performance computing." Chinese University of Hong Kong, 1995. http://library.cuhk.edu.hk/record=b5888561.

Abstract:
by Vincent, Kwan Chi Wai.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1995.
Includes bibliographical references (leaves 123-127).
Contents:
Abstract
Acknowledgement
List of Tables
List of Figures
1 Introduction
1.1 I/O System
1.2 Disk Cache
1.3 Dissertation Outline
2 Related Work
2.1 Prefetching
2.2 Cache Partitioning
2.2.1 Hardware Assisted Mechanism
2.2.2 Software Assisted Mechanism
2.3 Replacement Policy
2.4 Caching Write Operation
2.5 Others
2.6 Summary
3 Methodology and Models
3.1 Performance Measurement
3.1.1 Partial Hit
3.1.2 Time Model
3.2 Terminology
3.2.1 Transfer Block
3.2.2 Multiple-sector Request
3.2.3 Dynamic Block, Heading Sectors and Content Sectors
3.2.4 Heading Reuse and Non-heading Reuse
3.3 New Models
3.3.1 Unified Cache with Always Prefetch
3.3.2 Partitioned Cache: Branch Target Cache and Prefetch Buffer
3.3.3 BTC + PB with Alternative Storing Sector Technique
3.3.4 BTC + PB with ASST Applying to Dynamic Block
3.3.5 BTC + PB with Storing Enough Head Technique
3.4 Impact of Block Size
4 Trace Driven Simulation
4.1 Simulation Environment
4.2 Two Kinds Of Disk
4.3 Control Models
4.3.1 Model 1: No Cache
4.3.2 Model 2: Unified Cache without Prefetch
4.3.3 Model 3: Unified Cache with Prefetch on Miss
4.4 Two Comparison Standards
4.5 Trace Properties
5 Performance Evaluation of Common Disk
5.1 The Effect Of Cache Size
5.1.1 Trends of Absolute Reduction in Time
5.1.2 Trends of Relative Reduction in Time
5.2 The Effect Of Block Size
5.2.1 Trends of Absolute Reduction in Time
5.2.2 Trends of Relative Reduction in Time
5.3 The Effect Of Set Associativity
5.3.1 Trends of Absolute Reduction in Time
5.4 The Effect Of Start-up Time C1
5.4.1 Trends of Absolute Reduction in Time
5.4.2 Trends of Relative Reduction in Time
5.5 The Effect Of Transfer Time C2
5.5.1 Trends of Absolute Reduction in Time
5.5.2 Trends of Relative Reduction in Time
5.5.3 Impact of C2=0.5 on Cache Size
5.5.4 Impact of C2=0.5 on Block Size
5.6 The Effect Of Prefetch Buffer Size
5.7 Others
5.7.1 In The Case of Very Small Cache with Large Block Size
5.7.2 Comparing Performance of Model 6 and Model 7
5.8 Conclusion
5.8.1 The Number of Actual Sectors Transferred between Disk and Cache
5.8.2 The Efficiency of Our Models on Common Disk
6 Performance Evaluation of High Performance Disk
6.1 Difference Between Common Disk And High Performance Disk
6.2 The Effect Of Cache Size
6.2.1 Trends of Absolute Reduction in Time
6.2.2 Trends of Relative Reduction in Time
6.3 The Effect Of Block Size
6.3.1 Trends of Absolute Reduction in Time
6.3.2 Trends of Relative Reduction in Time
6.4 The Effect Of Start-up Time C1
6.4.1 Trends of Relative Reduction in Time
6.5 The Effect Of Transfer Time C2
6.5.1 Trends of Relative Reduction in Time
6.5.2 Impact of C2=0.5 on Cache Size
6.5.3 Impact of C2=0.5 on Block Size
6.6 Conclusion
7 Conclusions and Future Work
7.1 Conclusions
7.2 Future Work
Bibliography
26

HUNG, CHI-CHAO, and 洪啟超. "Efficient Cache Bypassing and Adaptive Threads Controlling of Cache Memory for High Performance GPU Design." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/gy2437.

27

Ting, Kuo-Chang, and 丁國章. "A Range-Based Cache Design For Shared-Memory Multiprocessors System." Thesis, 1995. http://ndltd.ncl.edu.tw/handle/40544672647805802592.

Abstract:
Master's thesis, National Chung Cheng University, Graduate Institute of Computer Science and Information Engineering, ROC academic year 83 (1994-95).
An efficient memory system is crucial to the performance of a shared-memory multiprocessor system. A private cache for each processor is generally used in shared-memory multiprocessors in order to achieve good performance; however, the introduction of private caches gives rise to cache coherence problems. In multilevel hierarchical caches, the multi-level inclusion (MLI) property was proposed to solve the memory block invalidation problem, but the higher the tree level, the more serious the associativity constraint becomes. We propose a range-based cache to overcome the associativity overflow problem of the MLI cache. In the range-based cache, a start variation tag and an end variation tag indicate the range of block addresses owned by the processor, and the vector field of each cache directory entry records which processors own data blocks in this range. To solve the empty-entry problem, we add a counter to each entry of every set to indicate the number of active blocks maintained in that entry; if the counter is zero, we know the entry is empty and may clear all of its fields. We use Cachemire, a program-driven simulation tool, to evaluate the proposed scheme. The simulation results show that our range-based cache is better than the MLI cache: with only a small cache (e.g., 8K), the range-based cache can outperform a large (e.g., 64K) MLI cache. We conclude that the range-based cache is more cost-effective than the MLI cache.
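The abstract describes the entry format concretely enough for a small sketch: start and end variation tags bound a block-address range, a bit vector records owning processors, and a per-entry counter lets an empty entry be detected and cleared. Field widths and the exact operations are still assumptions.

```python
# Sketch of a range-based directory entry following the abstract: variation
# tags bound the covered block-address range, a sharer bit vector records
# owning processors, and a counter tracks active blocks so that an entry
# with count zero can be cleared. Widths and operations are assumed.

class RangeEntry:
    def __init__(self, start, end):
        self.start, self.end = start, end  # start/end variation tags
        self.sharers = 0                   # bit vector of processors
        self.count = 0                     # active blocks in this range

    def covers(self, block):
        return self.start <= block <= self.end

    def add(self, block, proc):
        assert self.covers(block)
        self.sharers |= 1 << proc
        self.count += 1

    def remove(self, block):
        self.count -= 1
        if self.count == 0:                # empty entry: safe to clear
            self.sharers = 0

e = RangeEntry(start=0x1000, end=0x1FFF)
e.add(0x1040, proc=2)
e.add(0x1080, proc=5)
print(bin(e.sharers), e.count)   # 0b100100 2
```

One range entry thus summarizes many blocks, which is how the scheme sidesteps the associativity overflow that plagues MLI directories.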
28

Mohammad, Baker Shehadah. "Cache design for low power and yield enhancement." 2008. http://hdl.handle.net/2152/17884.

Abstract:
One of the major limiters of computer systems and systems-on-chip (SoC) designs is accessing main memory, which is typically two orders of magnitude slower than the processor. To bridge this gap, modern processors already devote more than half of their on-chip transistors to the last-level cache. Caches have a negative impact on area, power, and yield. The goal of this research is to design caches that operate at lower voltages while enhancing yield. Our strategy is to improve the static noise margin (SNM) and the writability of the conventional six-transistor SRAM cell by reducing the effect of parametric variations on the cell. This is done using a novel circuit that reduces the voltage swing on the word line during read operations and reduces the memory supply voltage during write operations. The proposed circuit increases the SRAM's SNM and write margin using a single voltage supply and has minimal impact on chip area, complexity, and timing. A test chip with an 8-kilobyte SRAM block manufactured in 45-nm technology is used to verify the practicality of the contribution and demonstrate the effectiveness of the new circuit's implementation. Cache organization is one of the most important factors affecting cache design complexity, performance, area, and power. The main architectural choice for caches is whether to implement the tag array using a standard SRAM or using a content addressable memory (CAM). The choice made has far-reaching consequences on several aspects of the cache design, and in particular on power consumption. Our contribution in this area is an in-depth study of the complex trade-offs in area, timing, power, and design complexity between an SRAM-based tag and a CAM-based one. Our results indicate that an SRAM-based tag design often provides a better overall design point and is superior with respect to energy, especially for interleaved multi-threading processors. Being able to test and screen chips is a key factor in achieving high yield. Most industry-standard CAD tools used to analyze fault coverage and generate test vectors require gate-level models. However, since caches are typically designed using a transistor-level flow, there is a need for an abstraction step to generate the gate models, which must be equivalent to the actual design (transistor level). The third contribution of this research is a framework to verify that the gate-level representation of custom designs is equivalent to the transistor-level design.
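The SRAM-versus-CAM tag trade-off can be miniaturized as a behavioural sketch (sizes arbitrary, energy not modelled): the SRAM tag reads and compares just the indexed set's few tags, while the CAM tag searches all stored tags associatively on every access.

```python
# Behavioural sketch of the two tag organizations compared in the abstract.
# The point is what each organization must search on a lookup; sizes are
# arbitrary and energy is not modelled.

N_SETS, N_WAYS = 4, 2

def split(addr):
    return addr % N_SETS, addr // N_SETS       # (set index, tag)

def sram_tag_lookup(tag_array, addr):
    s, tag = split(addr)
    # Only the N_WAYS tags of one set are read and compared.
    return any(t == tag for t in tag_array[s])

def cam_tag_lookup(cam_entries, addr):
    # Every stored entry is matched at once: a fast fully-associative
    # search, but all match lines toggle, which is CAM's main energy cost.
    return addr in cam_entries

tags = [[None] * N_WAYS for _ in range(N_SETS)]
s, t = split(0x2A)
tags[s][0] = t
print(sram_tag_lookup(tags, 0x2A), cam_tag_lookup({0x2A}, 0x2A))  # True True
```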
29

MA, JING-HUA, and 馬靖華. "A design of memory management unit and cache controller for the MARS system." Thesis, 1992. http://ndltd.ncl.edu.tw/handle/97824129942546114382.

30

Huh, Jaehyuk. "Hardware techniques to reduce communication costs in multiprocessors." Thesis, 2006. http://hdl.handle.net/2152/2533.

31

"An Analysis of the Memory Bottleneck and Cache Performance of Most Apparent Distortion Image Quality Assessment Algorithm on GPU." Master's thesis, 2016. http://hdl.handle.net/2286/R.I.41229.

Abstract:
As digital images are transmitted over the network or stored on disk, image processing is applied as part of the standard pipeline for efficient storage and bandwidth. This causes some amount of distortion or artifacts in the image, which creates a need for quality assessment. Subjective image quality assessment is expensive, time consuming and influenced by the subject's perception; hence, there is a need to develop mathematical models capable of predicting the quality evaluation. With the advent of the information era and an exponential growth in image/video generation and consumption, automated quality assessment has become mandatory to assess the degradation. The last few decades have seen research on automated image quality assessment (IQA) algorithms gaining prominence. However, the focus has been on achieving better prediction accuracy, not on improving computational performance; as a result, existing serial implementations require a lot of time to process a single frame. In the last 5 years, research on general-purpose graphics processing unit (GPGPU) based image quality assessment (IQA) algorithm implementation has shown promising results for single images. Still, the implementations are not efficient enough for deployment in real-world applications, especially for live videos at high resolution. Hence, this thesis proposes that microarchitecture-conscious coding on a graphics processing unit (GPU), combined with a detailed understanding of the image quality assessment (IQA) algorithm, can yield non-trivial speedups without compromising quality prediction accuracy. This document focuses on the microarchitectural analysis of the most apparent distortion (MAD) algorithm. The results are analyzed in depth and one of the major bottlenecks is identified. With the knowledge of the underlying microarchitecture, the implementation is restructured, thereby resolving the bottleneck and improving the performance.
Master's Thesis, Computer Science, 2016.
32

Dai, Zefu. "Application-driven Memory System Design on FPGAs." Thesis, 2013. http://hdl.handle.net/1807/43538.

Abstract:
Moore's Law has helped Field Programmable Gate Arrays (FPGAs) scale continuously in speed, capacity and energy efficiency, allowing the integration of ever-larger systems into a single FPGA chip. This brings challenges to the productivity of developers in leveraging the sea of FPGA resources. Higher level of design abstractions and programming models are needed to improve the design productivity, which in turn require memory architectural supports on FPGAs. While previous efforts focus on computation-centric applications, we take a bandwidth-centric approach in designing memory systems. In particular, we investigate the scheduling, buffered switching and searching problems, which are common to a wide range of FPGA applications. Despite that the bandwidth problem has been extensively studied for general-purpose computing and application specific integrated circuit (ASIC) designs, the proposed techniques are often not applicable to FPGAs. In order to achieve optimized design implementations, designers need to take into consideration both the underlying FPGA physical characteristics as well as the requirements from applications. We therefore extract design requirements from four driving applications for the selected problems, and address them by exploiting the physical architectures and available resources of FPGAs. Towards solving the selected problems, we manage to advance state-of-the-art with a scheduling algorithm, a switch organization and a cache analytical model. These lead to performance improvements, resource savings and feasibilities of new approaches for well-known problems.
33

Shastri, Vijnan. "Caching Strategies And Design Issues In CD-ROM Based Multimedia Storage." Thesis, 1997. http://etd.iisc.ernet.in/handle/2005/1805.

34

Dwarakanath, Nagendra Gulur. "Multi-Core Memory System Design : Developing and using Analytical Models for Performance Evaluation and Enhancements." Thesis, 2015. http://etd.iisc.ernet.in/2005/3935.

Abstract:
Memory system design is increasingly influencing modern multi-core architectures from both performance and power perspectives. Both main memory latency and bandwidth have im-proved at a rate that is slower than the increase in processor core count and speed. Off-chip memory, primarily built from DRAM, has received significant attention in terms of architecture and design for higher performance. These performance improvement techniques include sophisticated memory access scheduling, use of multiple memory controllers, mitigating the impact of DRAM refresh cycles, and so on. At the same time, new non-volatile memory technologies have become increasingly viable in terms of performance and energy. These alternative technologies offer different performance characteristics as compared to traditional DRAM. With the advent of 3D stacking, on-chip memory in the form of 3D stacked DRAM has opened up avenues for addressing the bandwidth and latency limitations of off-chip memory. Stacked DRAM is expected to offer abundant capacity — 100s of MBs to a few GBs — at higher bandwidth and lower latency. Researchers have proposed to use this capacity as an extension to main memory, or as a large last-level DRAM cache. When leveraged as a cache, stacked DRAM provides opportunities and challenges for improving cache hit rate, access latency, and off-chip bandwidth. Thus, designing off-chip and on-chip memory systems for multi-core architectures is complex, compounded by the myriad architectural, design and technological choices, combined with the characteristics of application workloads. Applications have inherent spatial local-ity and access parallelism that influence the memory system response in terms of latency and bandwidth. In this thesis, we construct an analytical model of the off-chip main memory system to comprehend this diverse space and to study the impact of memory system parameters and work-load characteristics from latency and bandwidth perspectives. Our model, called ANATOMY, uses a queuing network formulation of the memory system parameterized with workload characteristics to obtain a closed form solution for the average miss penalty experienced by the last-level cache. We validate the model across a wide variety of memory configurations on four-core, eight-core and sixteen-core architectures. ANATOMY is able to predict memory latency with average errors of 8.1%, 4.1%and 9.7%over quad-core, eight-core and sixteen-core configurations respectively. Further, ANATOMY identifie better performing design points accurately thereby allowing architects and designers to explore the more promising design points in greater detail. We demonstrate the extensibility and applicability of our model by exploring a variety of memory design choices such as the impact of clock speed, benefit of multiple memory controllers, the role of banks and channel width, and so on. We also demonstrate ANATOMY’s ability to capture architectural elements such as memory scheduling mechanisms and impact of DRAM refresh cycles. In all of these studies, ANATOMY provides insight into sources of memory performance bottlenecks and is able to quantitatively predict the benefit of redressing them. An insight from the model suggests that the provisioning of multiple small row-buffers in each DRAM bank achieves better performance than the traditional one (large) row-buffer per bank design. 
Multiple row-buffers also enable newer performance improvement opportunities such as intra-bank parallelism between data transfers and row activations, and smart row-buffer allocation schemes based on workload demand. Our evaluation (both using the analytical model and detailed cycle-accurate simulation) shows that the proposed DRAM re-organization achieves significant speed-up as well as energy reduction. Next we examine the role of on-chip stacked DRAM caches in improving performance by reducing the load on off-chip main memory. We extend ANATOMY to cover DRAM caches. ANATOMY-Cache takes into account all the key parameters and design issues governing DRAM cache organization, namely where the cache metadata is stored and accessed, the role of cache block size and set associativity, and the impact of block size on row-buffer hit rate and off-chip bandwidth. Yet the model is kept simple and provides a closed-form solution for the average miss penalty experienced by the last-level SRAM cache. ANATOMY-Cache is validated against detailed architecture simulations and shown to have latency estimation errors of 10.7% and 8.8% on average in quad-core and eight-core configurations respectively. An interesting insight from the model suggests that under high load, it is better to bypass the congested DRAM cache and leverage the available idle main memory bandwidth. We use this insight to propose a refresh reduction mechanism that virtually eliminates refresh overhead in DRAM caches. We implement a low-overhead hardware mechanism to record accesses to recent DRAM cache pages and refresh only these pages. Older cache pages are considered invalid and serviced from the (idle) main memory. This technique achieves an average refresh reduction of 90% with resulting memory energy savings of 9% and an overall performance improvement of 3.7%. Finally, we propose a new DRAM cache organization that achieves higher cache hit rate, lower latency and lower off-chip bandwidth demand. Called the Bi-Modal Cache, our cache organization brings three independent improvements together: (i) it enables parallel tag and data accesses, (ii) it eliminates a large fraction of tag accesses entirely through a novel way locator, and (iii) it improves cache space utilization by organizing the cache sets as a combination of some big blocks (512B) and some small blocks (64B). The Bi-Modal Cache reduces hit latency through the way locator and parallel tag and data accesses. It improves hit rate by leveraging the cache capacity efficiently: blocks with low spatial reuse are allocated in the cache at 64B granularity, thereby reducing both wasted off-chip bandwidth and cache internal fragmentation. Increased cache hit rate leads to a reduction in off-chip bandwidth demand. Through detailed simulations, we demonstrate that the Bi-Modal Cache achieves overall performance improvements of 10.8%, 13.8% and 14.0% in quad-core, eight-core and sixteen-core workloads respectively over an aggressive baseline.
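The queuing-network formulation mentioned in this abstract lends itself to a compact illustration. The toy model below estimates an average DRAM miss penalty from a row-buffer hit rate and per-bank M/M/1 queues; every parameter name, default value, and the formula structure are assumptions made for illustration here, not the actual ANATOMY equations.

```python
# Illustrative sketch only: a toy M/M/1-style estimate of average DRAM
# access latency. Parameter names and the formula structure are assumptions
# for illustration; they are not the ANATOMY model from the thesis.

def avg_memory_latency(arrival_rate, row_hit_rate, t_cas=15.0, t_rcd=15.0,
                       t_rp=15.0, num_banks=8):
    """Estimate average miss penalty (ns) seen by the last-level cache.

    arrival_rate : memory requests per ns reaching DRAM
    row_hit_rate : fraction of requests hitting an open row-buffer
    """
    # Service time: a row-buffer hit needs only a column access (CAS);
    # a miss additionally pays precharge (RP) and row activation (RCD).
    t_hit = t_cas
    t_miss = t_rp + t_rcd + t_cas
    service_time = row_hit_rate * t_hit + (1.0 - row_hit_rate) * t_miss

    # Requests spread across banks; each bank is modeled as an M/M/1 queue.
    per_bank_rate = arrival_rate / num_banks
    utilization = per_bank_rate * service_time
    if utilization >= 1.0:
        raise ValueError("bank queue is saturated")
    # M/M/1 mean sojourn time: W = S / (1 - rho)
    return service_time / (1.0 - utilization)

print(avg_memory_latency(arrival_rate=0.2, row_hit_rate=0.6))  # ~83 ns
```

Even in this toy form, the model exposes the trade-off the abstract highlights: raising the row-buffer hit rate or adding banks lowers both the service time and the queuing delay.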
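Similarly, the Bi-Modal Cache's mixed-granularity sets and way locator can be sketched at a high level. The fragment below is a hypothetical rendering of a set lookup, not the thesis design; the structure sizes and field names are invented for illustration.

```python
# Illustrative sketch of a bi-modal set lookup: each set holds a few big
# (512 B) ways and several small (64 B) ways, and a "way locator" table
# short-circuits most tag probes. All sizes and names are assumptions.

BIG, SMALL = 512, 64

class BiModalSet:
    def __init__(self, big_ways=2, small_ways=8):
        self.big = [None] * big_ways      # tags of 512 B blocks
        self.small = [None] * small_ways  # tags of 64 B blocks
        self.locator = {}                 # tag -> ("big"/"small", way index)

    def lookup(self, tag):
        # Fast path: the way locator names the way directly, so the tag
        # and data arrays can be read in parallel on a predicted hit.
        if tag in self.locator:
            return self.locator[tag]
        # Slow path: probe all ways (the tag access cannot be elided).
        for i, t in enumerate(self.big):
            if t == tag:
                self.locator[tag] = ("big", i)
                return self.locator[tag]
        for i, t in enumerate(self.small):
            if t == tag:
                self.locator[tag] = ("small", i)
                return self.locator[tag]
        return None  # miss: allocate at 512 B or 64 B based on spatial reuse
```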
APA, Harvard, Vancouver, ISO, and other styles
35

(9179468), Timothy A. Pritchett. "Workload Driven Designs for Cost-Effective Non-Volatile Memory Hierarchies." Thesis, 2020.

Find full text
Abstract:
Compared to traditional hard-disk drives (HDDs), non-volatile (NV) memory technologies offer significant performance advantages on the one hand, but incur significant cost and asymmetric write-performance on the other. A common strategy for managing such cost and performance differentials is to use hierarchies in which a small, but intensely accessed, working set is staged in the NV storage (selective caching). However, when this working set includes write-heavy data, the low write-lifetime of NV storage necessitates significant over-provisioning to maintain required lifespans (e.g., storage lifespan must match or exceed a 3-year server lifespan). One may think that employing DRAM-based write-buffers can filter the writes that trickle through to the NV storage and thus alleviate the write-pressure felt there. Unfortunately, selective caches, when used with common recency-based or frequency-based replacement, have access patterns that require large write buffers (e.g., 100 MB+ relative to a 12 GB cache) to filter writes adequately. Further, these large DRAM write-buffers also require backup power to ensure the durability of disk writes. More sophisticated replacement policies that combine recency and frequency can reduce the size of the DRAM buffer (while preserving write-filtering), but are so computationally expensive that they can limit the I/O rate, especially on simple controllers (e.g., RAID controllers).
My first contribution is the design and implementation of WriteGuard, a self-tuning sieving write-buffer algorithm that filters writes as effectively as the highly effective (but computationally expensive) algorithms while requiring lightweight computation comparable to a simple LRU-based write-buffer. While WriteGuard reduces the capacity needed for DRAM buffering (to approx. 64 MB), it does not eliminate the need for DRAM buffers (and the corresponding power backup).
For my second thrust, I identify two specific application characteristics: (1) the vast majority of the write-buffer's contents is composed of write-dominant blocks, and (2) the vast majority of blocks in the write-buffer are overwritten within a period of 28 hours. I show that these characteristics help enable a high-density, optimized STT-MRAM as a replacement for DRAM, which enables durable write-buffers (thus eliminating the cost of power backup for the write-buffer). My optimized STT-MRAM-based write-buffer achieves higher density by (a) trading off superfluous durability, exploiting characteristic (2), and (b) de-optimizing the read performance of STT-MRAM, leveraging characteristic (1). Together, these techniques increase the density of STT-MRAM by 20% with low or no impact on write-buffer performance.
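The sieving idea behind WriteGuard can be illustrated with a generic second-write admission filter: a block enters the DRAM buffer only once it proves write-hot, so one-touch writes bypass straight to NV storage. The sketch below is not the WriteGuard algorithm itself; the ghost-list mechanism, the single-repeat threshold, and the eviction policy are all assumptions chosen for illustration.

```python
# Illustrative sketch of sieve-style write filtering in front of NV storage.
# A generic second-write admission filter, not the actual WriteGuard design.

from collections import OrderedDict

class SievingWriteBuffer:
    def __init__(self, capacity_blocks, ghost_capacity):
        self.buf = OrderedDict()    # admitted write-hot blocks (LRU order)
        self.ghost = OrderedDict()  # recently seen, not-yet-admitted blocks
        self.capacity = capacity_blocks
        self.ghost_capacity = ghost_capacity

    def write(self, block, flush_to_nv):
        if block in self.buf:           # coalesce: overwrite absorbed in DRAM
            self.buf.move_to_end(block)
            return
        if block in self.ghost:         # second write: admit into the buffer
            del self.ghost[block]
            if len(self.buf) >= self.capacity:
                victim, _ = self.buf.popitem(last=False)
                flush_to_nv(victim)     # evicted dirty block goes to NV store
            self.buf[block] = True
            return
        # First write: remember the block in the ghost list, write through.
        self.ghost[block] = True
        if len(self.ghost) > self.ghost_capacity:
            self.ghost.popitem(last=False)
        flush_to_nv(block)
```

Under this scheme a stream of one-time writes never occupies DRAM capacity, while repeatedly overwritten blocks are coalesced in the buffer, which is the filtering behavior the abstract describes.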
APA, Harvard, Vancouver, ISO, and other styles
36

Chang, Da-Wei (張大緯). "A Design and Implementation of Memory Caches in World Wide Web Servers." Thesis, 1997. http://ndltd.ncl.edu.tw/handle/80244474560521073343.

Full text
Abstract:
Master's thesis
National Chiao Tung University
Department of Computer and Information Science
1997
With the popularity of the World Wide Web (WWW), web traffic has become the fastest growing component of all Internet traffic. However, such large traffic volumes cause the document retrieval latency perceived by web users to increase. Many researchers have noticed this problem and made efforts to improve WWW latency. The latency can be reduced in two ways: reducing network delay and improving web server throughput. Our research aims at improving web server throughput by keeping a memory cache in the web server's address space. In this thesis, we focus on the design and implementation of such a memory cache and propose a novel web cache management policy. The experimental results show three things. First, our memory cache is beneficial: by keeping a cache whose size is only 1.8% of the total document size, throughput improves by 16.9%. Second, our cache management policy is suitable for current web traffic. Third, with the increasing popularity of multimedia files, it is very likely that a single file is larger than the total size of the memory cache; under this condition, our policy outperforms others currently used on the WWW.
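The abstract does not spell out the management policy, but its observation about files larger than the cache suggests size-aware admission. A minimal sketch of such a policy follows, with the admission threshold a pure assumption for illustration rather than the thesis's actual rule.

```python
# Illustrative sketch of a size-aware in-process document cache: documents
# are cached whole under LRU, but a file larger than a set fraction of the
# cache is never admitted, so one huge multimedia file cannot flush the
# whole cache. The threshold is an assumption, not the thesis's policy.

from collections import OrderedDict

class DocumentCache:
    def __init__(self, capacity_bytes, max_doc_fraction=0.25):
        self.docs = OrderedDict()   # path -> content, kept in LRU order
        self.capacity = capacity_bytes
        self.used = 0
        self.max_doc = int(capacity_bytes * max_doc_fraction)

    def get(self, path, read_from_disk):
        if path in self.docs:                   # hit: served from memory
            self.docs.move_to_end(path)
            return self.docs[path]
        content = read_from_disk(path)
        if len(content) <= self.max_doc:        # admit only modest files
            while self.used + len(content) > self.capacity:
                _, old = self.docs.popitem(last=False)  # evict LRU document
                self.used -= len(old)
            self.docs[path] = content
            self.used += len(content)
        return content
```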
APA, Harvard, Vancouver, ISO, and other styles
37

Ting-Jyun Lin (林霆鈞). "Distributed In-Memory Caches for NoSQL Persistent Stores: Design Considerations and Performance Impacts." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/hx6fsd.

Full text
Abstract:
Master's thesis
National Cheng Kung University
Department of Computer Science and Information Engineering
2013
NoSQL key-value persistent data stores, designed to accommodate potentially large volumes of data growing rapidly from a variety of sources, are emerging. Examples of such key-value stores include Google Bigtable/Spanner, Hadoop HBase, Cassandra, MongoDB, Couchbase, etc. Because typical key-value stores manipulate data objects stored on disk, accesses are inefficient. We present in this paper a caching model that stores data objects temporarily in memories distributed over a number of cache servers, and we reveal a number of design issues involved in developing such a caching model. These issues concern consistency, scalability and availability. Specifically, our study notes that developing a cache cluster requires an in-depth understanding of the backend persistent key-value store. We then present a caching model that assumes Hadoop HBase serves as our backend. In addition, we illustrate how range search, one of the major operations offered by most NoSQL key-value stores, is performed in our caching model together with the HBase backend. Computer simulations demonstrate our performance results.
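A cache tier of the kind described can be sketched as a cache-aside client that hash-partitions keys across cache servers and falls back to the persistent store on a miss. The sketch below is generic; the hashing scheme and the backend_get/backend_put callables are placeholders standing in for whatever HBase client library a deployment actually uses, and none of this is the thesis's code.

```python
# Illustrative sketch of a cache-aside client for a distributed in-memory
# cache tier in front of a key-value store such as HBase. Hashing scheme
# and backend callables are assumptions made for illustration.

import hashlib

class CacheTierClient:
    def __init__(self, cache_servers, backend_get, backend_put):
        self.servers = cache_servers      # list of dict-like cache servers
        self.backend_get = backend_get    # placeholder for a store read
        self.backend_put = backend_put    # placeholder for a store write

    def _server_for(self, key):
        # Hash-partition the key space across the cache servers.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.servers[h % len(self.servers)]

    def get(self, key):
        server = self._server_for(key)
        if key in server:                 # cache hit: served from memory
            return server[key]
        value = self.backend_get(key)     # miss: fall back to the store
        server[key] = value               # populate the cache for next time
        return value

    def put(self, key, value):
        # Write-through keeps the cache tier consistent with the backend.
        self.backend_put(key, value)
        self._server_for(key)[key] = value
```

Note that hash partitioning alone does not support the range search discussed in the abstract; serving range queries from such a tier requires an order-preserving partitioning or coordination with the backend, which is part of what makes the design considerations non-trivial.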
APA, Harvard, Vancouver, ISO, and other styles
