Academic literature on the topic 'Cache hierarchy'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, conference papers, and other scholarly sources on the topic 'Cache hierarchy.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Journal articles on the topic "Cache hierarchy"

1. Yavits, Leonid, Amir Morad, and Ran Ginosar. "Cache Hierarchy Optimization." IEEE Computer Architecture Letters 13, no. 2 (July 29, 2014): 69–72. http://dx.doi.org/10.1109/l-ca.2013.18.
2. Zhao, Huatao, Xiao Luo, Chen Zhu, Takahiro Watanabe, and Tianbo Zhu. "Behavior-aware cache hierarchy optimization for low-power multi-core embedded systems." Modern Physics Letters B 31, no. 19-21 (July 27, 2017): 1740067. http://dx.doi.org/10.1142/s021798491740067x.

Abstract:
In modern embedded systems, the increasing number of cores requires efficient cache hierarchies to ensure data throughput, but such cache hierarchies are restricted by their growing size and interfering accesses, which lead to both performance degradation and wasted energy. In this paper, we first propose a behavior-aware cache hierarchy (BACH) which can optimally allocate multi-level cache resources to many cores, greatly improving the efficiency of the cache hierarchy and resulting in low energy consumption. BACH takes full advantage of the explored application behaviors and runtime cache resource demands as the basis for cache allocation, so that the cache hierarchy can be optimally configured to meet the runtime demand. BACH was implemented on the gem5 simulator. The experimental results show that the energy consumption of a three-level cache hierarchy can be reduced by between 5.29% and 27.94% compared with other key approaches, while the performance of the multi-core system even improves slightly once hardware overhead is accounted for.
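The allocation step at the heart of this abstract can be sketched in a few lines. The toy below is our own loose simplification, not the BACH algorithm: the per-core demand counters and the apportionment rule are assumptions. It splits the ways of a shared cache level across cores in proportion to their measured demand.

```python
# Illustrative sketch only: demand-proportional way allocation for a
# shared cache level. Names and policy are our own simplification of
# the general behavior-aware idea, not the BACH algorithm itself.

def allocate_ways(demands: list[int], total_ways: int) -> list[int]:
    """Give each core at least one way, then split the rest in
    proportion to its observed demand counter."""
    n = len(demands)
    assert total_ways >= n, "need at least one way per core"
    ways = [1] * n
    remaining = total_ways - n
    total_demand = sum(demands) or 1
    # Largest-remainder apportionment of the remaining ways.
    shares = [remaining * d / total_demand for d in demands]
    for i in range(n):
        ways[i] += int(shares[i])
    leftovers = remaining - sum(int(s) for s in shares)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]),
                   reverse=True)
    for i in order[:leftovers]:
        ways[i] += 1
    return ways

print(allocate_ways(demands=[120, 40, 40], total_ways=16))  # [9, 4, 3]
```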
3. Tabak, Daniel. "Cache and Memory Hierarchy Design." ACM SIGARCH Computer Architecture News 23, no. 3 (June 1995): 28. http://dx.doi.org/10.1145/203618.564957.
4. Franaszek, P. A., L. A. Lastras-Montano, S. R. Kunkel, and A. C. Sawdey. "Victim management in a cache hierarchy." IBM Journal of Research and Development 50, no. 4.5 (July 2006): 507–23. http://dx.doi.org/10.1147/rd.504.0507.
5. Garashchenko, A. V., and L. G. Gagarina. "An Approach to the Formation of Test Sequences Based on the Graph Model of the Cache Memory Hierarchy." Proceedings of Universities. ELECTRONICS 25, no. 6 (December 2020): 548–57. http://dx.doi.org/10.24151/1561-5405-2020-25-6-548-557.

Abstract:
Due to its large state space, verification of the cache memory hierarchy in a modern SoC requires a huge number of complex tests, which is the main problem for functional verification. To cover the entire state space, a graph model of the cache memory hierarchy, together with methods for generating test sequences from this model, has been proposed. The vertices of the graph model are the sets of states (tags, values, etc.) of each hierarchy level, and the edges are the transitions between states (read and write instructions). A graph model describing all states of the cache memory hierarchy has been developed; each edge in the graph is a separate check sequence. Non-deterministic situations, such as the choice of a channel (port) in a multichannel cache memory, cannot be resolved at the level of the graph model, since the choice of channel depends on many factors not considered within the model framework; it has therefore been proposed to create a separate subgraph instance for each channel. In verifying the multiport cache memory hierarchy of a core under development with the new vector VLIW DSP architecture, the described approach revealed several architectural and functional errors. The approach can be used to test other processor cores and their blocks.
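The edge-coverage idea translates directly into code. Below is a toy sketch of our own, with invented states and instructions rather than the paper's model: every edge becomes one test, prefixed by a shortest setup sequence from the reset state.

```python
# Toy sketch of edge-coverage test generation over a cache-state graph.
# States and transitions are invented; the real model tracks tags,
# values, etc. per hierarchy level.
from collections import defaultdict, deque

# A vertex is a (L1 state, L2 state) pair; an edge is the instruction
# that moves the hierarchy between two such states.
EDGES = [
    (("I", "I"), "read",  ("S", "S")),   # cold miss fills both levels
    (("S", "S"), "write", ("M", "S")),   # write hit dirties the L1 copy
    (("M", "S"), "evict", ("I", "M")),   # dirty eviction writes back to L2
    (("I", "M"), "read",  ("S", "M")),   # refill from L2 only
]

GRAPH = defaultdict(list)
for src, instr, dst in EDGES:
    GRAPH[src].append((instr, dst))

def setup_path(start, target):
    """BFS for the shortest instruction sequence driving start -> target."""
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, seq = queue.popleft()
        if state == target:
            return seq
        for instr, nxt in GRAPH[state]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, seq + [instr]))
    return None

def edge_cover_tests(reset=("I", "I")):
    """One check sequence per edge: reach the source state, fire the edge."""
    for src, instr, _ in EDGES:
        yield setup_path(reset, src) + [instr]

for test in edge_cover_tests():
    print(test)
```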
6. Ding, Wei, Yuanrui Zhang, Mahmut Kandemir, and Seung Woo Son. "Compiler-Directed File Layout Optimization for Hierarchical Storage Systems." Scientific Programming 21, no. 3-4 (2013): 65–78. http://dx.doi.org/10.1155/2013/167581.

Abstract:
The file layout of array data is a critical factor that affects the behavior of storage caches, and it has so far received little attention in the context of hierarchical storage systems. The main contribution of this paper is a compiler-driven file layout optimization scheme for hierarchical storage caches. This approach, fully automated within an optimizing compiler, analyzes a multi-threaded application code and determines a file layout for each disk-resident array referenced by the code, such that the performance of the target storage cache hierarchy is maximized. We tested our approach using 16 I/O-intensive application programs and compared its performance against two previously proposed approaches under different cache space management schemes. Our experimental results show that the proposed approach improves the execution time of these parallel applications by 23.7% on average.
7. Carazo, Pablo, Rubén Apolloni, Fernando Castro, Daniel Chaver, Luis Pinuel, and Francisco Tirado. "Reducing Cache Hierarchy Energy Consumption by Predicting Forwarding and Disabling Associative Sets." Journal of Circuits, Systems and Computers 21, no. 07 (November 2012): 1250057. http://dx.doi.org/10.1142/s0218126612500570.
Abstract:
The first-level data cache in modern processors has become a major consumer of energy due to its increasing size and high-frequency access rate. In order to reduce this high energy consumption, we propose in this paper a straightforward filtering technique based on a highly accurate forwarding predictor. Specifically, a simple structure predicts whether a load instruction will obtain its corresponding data via forwarding from the load-store structure (thus avoiding the data cache access) or whether it will be provided by the data cache. This mechanism reduces the data cache energy consumption by an average of 21.5% with a negligible performance penalty of less than 0.1%. Furthermore, in this paper we also target the static energy consumption of the cache by disabling a portion of the sets of the L2 associative cache. Overall, when merging both proposals, the combined L1 and L2 total energy consumption is reduced by an average of 29.2% with a performance penalty of just 0.25%.
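As a rough illustration of the filtering technique described here, the sketch below implements a PC-indexed table of 2-bit saturating counters; the table size, counter width, and training rule are our assumptions, not the paper's parameters.

```python
# Our own toy sketch of a PC-indexed forwarding predictor; the paper's
# exact structure and parameters may differ.
TABLE_SIZE = 1024
counters = [0] * TABLE_SIZE   # 2-bit counters, 0..3; >=2 predicts forwarding

def index(pc: int) -> int:
    return (pc >> 2) % TABLE_SIZE

def predict_forwarding(pc: int) -> bool:
    return counters[index(pc)] >= 2

def train(pc: int, did_forward: bool) -> None:
    i = index(pc)
    counters[i] = min(3, counters[i] + 1) if did_forward else max(0, counters[i] - 1)

def read_data_cache(addr: int) -> int:
    return 0  # stand-in for a real cache model

def execute_load(pc: int, addr: int, store_queue: dict) -> int:
    """If forwarding is predicted, probe only the store queue; the data
    cache access (and its energy) is skipped on a correct prediction."""
    if predict_forwarding(pc) and addr in store_queue:
        train(pc, True)
        return store_queue[addr]      # forwarded from an older in-flight store
    train(pc, addr in store_queue)
    return read_data_cache(addr)      # fall back to the D-cache
```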
8. Feliu, Josue, Salvador Petit, Julio Sahuquillo, and Jose Duato. "Cache-Hierarchy Contention-Aware Scheduling in CMPs." IEEE Transactions on Parallel and Distributed Systems 25, no. 3 (March 2014): 581–90. http://dx.doi.org/10.1109/tpds.2013.61.
9. Zahran, Mohamed M. "On cache memory hierarchy for Chip-Multiprocessor." ACM SIGARCH Computer Architecture News 31, no. 1 (March 2003): 39–48. http://dx.doi.org/10.1145/773365.773370.
10. Yan, Mengjia, Bhargava Gopireddy, Thomas Shull, and Josep Torrellas. "Secure Hierarchy-Aware Cache Replacement Policy (SHARP)." ACM SIGARCH Computer Architecture News 45, no. 2 (September 14, 2017): 347–60. http://dx.doi.org/10.1145/3140659.3080222.

Dissertations / Theses on the topic "Cache hierarchy"

1. Huang, Cheng-Chieh. "Optimizing cache utilization in modern cache hierarchies." Thesis, University of Edinburgh, 2016. http://hdl.handle.net/1842/19571.

Abstract:
The memory wall is one of the major performance bottlenecks in modern computer systems. SRAM caches have been used to successfully bridge the performance gap between the processor and the memory. However, an SRAM cache's latency grows with its size, so simply increasing the size of caches can have a negative impact on performance. To solve this problem, modern processors employ multiple levels of caches, each of a different size, forming the so-called memory hierarchy. Upon a miss, the processor looks up the data from the highest level (the L1 cache) down to the lowest level (main memory). Such a design effectively reduces the negative performance impact of simply using a large cache. However, because SRAM has lower storage density than other volatile storage, the size of an SRAM cache is restricted by the available on-chip area. With modern applications requiring more and more memory, researchers continue to look at techniques for increasing the effective cache capacity. In general, researchers approach this problem from two angles: maximizing the utilization of current SRAM caches, or exploiting new technology to support larger capacity in cache hierarchies. The first part of this thesis focuses on how to maximize the utilization of existing SRAM caches. In our first work, we observe that not all words belonging to a cache block are accessed around the same time; in fact, a subset of words is consistently accessed sooner than others. We call this subset the critical words, and in our study we found that these critical words can be predicted from the access footprint. Based on this observation, we propose the critical-words-only cache (co-cache). Unlike a conventional cache, which stores all the words belonging to a block, the co-cache only stores the words that we predict to be critical. In this work, we convert an L2 cache into a co-cache and use the L1's access footprint information to predict critical words. Our experiments show that the co-cache can outperform a conventional L2 cache on workloads whose working-set sizes are greater than the L2 cache size. To handle workloads whose working-set sizes fit in the conventional L2, we propose the adaptive co-cache (acocache), which allows the co-cache to be configured back into a conventional cache. The second part of this thesis focuses on how to efficiently enable a large-capacity on-chip cache. In the near future, 3D stacking technology will allow one or more DRAM chips to be stacked onto the processor, with a total size on the order of hundreds of megabytes or even a few gigabytes. Recent works have proposed using this space as an on-chip DRAM cache. However, the tags of the DRAM cache create a classic space/time trade-off. On the one hand, we would like the latency of a tag access to be small, as it contributes to both hit and miss latencies; accordingly, we would like to store these tags in a faster medium such as SRAM. On the other hand, with hundreds of megabytes of die-stacked DRAM cache, the space overhead of the tags would be huge. For example, it would cost around 12 MB of SRAM to store all the tags of a 256 MB DRAM cache (with conventional 64 B blocks). Clearly this is too large, considering that some current chip multiprocessors have an L3 that is smaller. Prior works have proposed storing these tags along with the data in the stacked DRAM array (tags-in-DRAM); however, this scheme increases the access latency of the DRAM cache.
To optimize access latency in the DRAM cache, we propose the aggressive tag cache (ATCache). Similar to a conventional cache, the ATCache caches recently accessed tags to exploit temporal locality; it exploits spatial locality by prefetching tags from nearby cache sets. In addition, we address the high miss latency and cache pollution caused by excessive prefetching. To reduce this overhead, we propose cost-effective prefetching, a combination of dynamic prefetching-granularity tuning and hit-prefetching, to throttle the number of sets prefetched. Our proposed ATCache (which consumes 0.4% of the overall tag size) can satisfy over 60% of DRAM cache tag accesses on average. The last work proposed in this thesis is a DRAM-Cache-Aware (DCA) DRAM controller. In this work, we first address the challenge of scheduling requests in the DRAM cache. While many recent DRAM cache works build their techniques on a tags-in-DRAM scheme, storing these tags in the DRAM array increases the complexity of a DRAM cache request: in contrast to a conventional request to DRAM main memory, a request to the DRAM cache translates into multiple DRAM cache accesses (tag and data). In this work, we address the challenge of scheduling these DRAM cache accesses. We start by exploring whether a conventional DRAM controller works well in this scenario; we introduce two potential designs and study their limitations. From this study, we derive a set of design principles that an ideal DRAM cache controller must satisfy, and we then propose a DRAM-cache-aware (DCA) DRAM controller based on these principles. Our experimental results show that DCA can outperform the baseline by over 14%.
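The 12 MB tag-storage figure quoted in the abstract above follows from simple arithmetic; the quick check below uses bit-width assumptions of our own (48-bit physical addresses, roughly 24 bits per tag entry including state), which the thesis may set differently.

```python
# Back-of-the-envelope check of the tag-storage figure quoted above.
cache_size  = 256 * 2**20   # 256 MB DRAM cache
block_size  = 64            # conventional 64 B blocks
num_blocks  = cache_size // block_size               # 4 Mi blocks

phys_bits   = 48
offset_bits = 6                                      # log2(64)
index_bits  = 22                                     # log2(4 Mi), direct-mapped
tag_bits    = phys_bits - offset_bits - index_bits   # 20 bits
entry_bits  = tag_bits + 4                           # + valid/dirty/state ~ 24 bits

tag_store = num_blocks * entry_bits // 8
print(f"tag store ~ {tag_store / 2**20:.0f} MB")           # ~12 MB
print(f"ATCache at 0.4% ~ {0.004 * tag_store / 2**10:.0f} KB")
```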
2. Settle, M. W. Alexander. "An adaptive chip multiprocessor cache hierarchy." Connect to online resource, 2007. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3256380.
3. Kurian, George. "Locality-aware cache hierarchy management for multicore processors." Ph.D. thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 2015. http://hdl.handle.net/1721.1/97806.

Abstract:
Next-generation multicore processors and applications will operate on massive data with significant sharing. A major challenge in their implementation is the storage requirement for tracking the sharers of data: the bit overhead of such storage scales quadratically with the number of cores in conventional directory-based cache coherence protocols. Another major challenge is limited cache capacity and the data movement incurred by conventional cache hierarchy organizations when dealing with massive data scales. These two factors adversely impact memory access latency and energy consumption. This thesis proposes scalable, efficient mechanisms that improve effective cache capacity (i.e., by improving utilization) and reduce data movement by exploiting locality and controlling replication. First, a limited directory-based protocol, ACKwise, is proposed to track the sharers of data in a cost-effective manner. ACKwise leverages broadcasts to implement scalable cache coherence; broadcast support can be implemented in a 2-D mesh network by making simple changes to its routing policy, without requiring any additional virtual channels. Second, a locality-aware replication scheme that better manages the private caches is proposed. This scheme controls replication based on data reuse information and seamlessly adapts between private and logically shared caching of on-chip data at the fine granularity of cache lines. A low-overhead runtime profiling capability to measure the locality of each cache line is built into hardware, and private caching is only allowed for data blocks with high spatio-temporal locality. Third, a timestamp-based memory ordering validation scheme is proposed that makes the locality-aware private cache replication scheme implementable in processors with out-of-order memory accesses that employ popular memory consistency models. This method does not rely on cache coherence messages to detect speculation violations, and hence is applicable to the locality-aware protocol. The timestamp mechanism is efficient due to the observation that consistency violations occur only due to conflicting accesses that have temporal proximity (i.e., within a few cycles of each other), so timestamps need to be stored only for a small time window. Fourth, a locality-aware last-level cache (LLC) replication scheme that better manages the LLC is proposed. This scheme adapts replication at runtime based on fine-grained cache line reuse information and thereby balances data locality and off-chip miss rate for optimized execution. Finally, all the above schemes are combined into a cache hierarchy replication scheme that provides optimal data locality and miss rates at all levels of the cache hierarchy. The design of this scheme is motivated by the experimental observation that locality-aware private cache and LLC replication each yield varying performance improvements across benchmarks. These techniques enable optimal use of the on-chip cache capacity and provide low-latency, low-energy memory access, while retaining the convenience of shared memory and preserving the memory consistency model. On a 64-core multicore processor with out-of-order cores, locality-aware cache hierarchy replication improves completion time by 15% and energy by 22% over a state-of-the-art baseline, while incurring a storage overhead of 30.7 KB per core (i.e., 10% of the aggregate cache capacity of each core).
4. Valls Mompó, Joan Josep. "Improving Energy and Area Scalability of the Cache Hierarchy in CMPs." Doctoral thesis, Universitat Politècnica de València, 2017. http://hdl.handle.net/10251/79551.
Abstract:
As core counts increase with each chip multiprocessor generation, CMPs must improve their scalability in performance, area, and energy consumption to meet the demands of larger core counts. Directory-based protocols constitute the most scalable alternative. A conventional directory, however, suffers from an inefficient use of storage and energy. First, the large, non-scalable sharer vectors consume unnecessary area and leakage power, especially considering that most of the blocks tracked in a directory are cached by a single core. Second, although increasing directory size and associativity could boost system performance by reducing coverage misses, it would come at the expense of area and energy consumption. This thesis focuses on and exploits the important differences in behavior between private and shared blocks from the directory's point of view; these differences call for a separate management of the two types of blocks in the directory. First, we propose the PS-Directory, a two-level directory cache that keeps the small number of frequently accessed shared entries in a small and fast first-level cache, the Shared Directory Cache, and uses a larger and slower second-level Private Directory Cache to track the large number of private blocks. Experimental results show that, compared to a conventional directory, the PS-Directory improves performance while also reducing silicon area and energy consumption. In this thesis we also show that the shared/private ratio of directory entries varies across applications and across different execution phases within an application, which encourages us to propose the Dynamic Way Partitioning (DWP) Directory. The DWP-Directory reduces the number of ways with storage for shared blocks, and this storage can be powered on or off at run time according to the dynamic requirements of the applications, following a repartitioning algorithm. Results show performance similar to a traditional directory with high associativity and area requirements similar to recent state-of-the-art schemes, while achieving notable static and dynamic power savings. This dissertation also deals with the power scalability issues found in processor caches. A significant fraction of the total power budget is consumed by on-chip caches, which are usually deployed with a high degree of associativity (even L1 caches are being implemented with eight ways) to enhance system performance. On a cache access, each way in the corresponding set is accessed in parallel, which is costly in terms of energy. This thesis presents the PS-Cache architecture, an energy-efficient cache design that reduces the number of accessed ways without hurting performance. The PS-Cache takes advantage of the private-shared knowledge of the referenced block to reduce energy by accessing only those ways that hold the kind of block being looked up. Results show significant dynamic power savings. Finally, we propose an energy-efficient architectural design that can be applied to any kind of set-associative cache memory, not only processor caches. The proposed approach, called the Tag Filter (TF) Architecture, filters the ways accessed in the target cache set, so that only a few ways are searched in the tag and data arrays. This allows the approach to reduce the dynamic energy consumption of caches without hurting their access time. For this purpose, the proposed architecture holds the X least significant bits of each tag in a small auxiliary X-bit-wide array; these bits are used to filter out the ways whose least significant tag bits do not match those of the looked-up block. Experimental results show that this filtering mechanism brings the energy consumption of set-associative caches close to that of direct-mapped ones, and that the proposals presented in this thesis offer a good trade-off among the three major design axes: performance, area, and energy.
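A minimal model of the Tag Filter lookup, as we read it from the abstract (X = 4 and the data layout are illustrative choices of ours, not values from the thesis):

```python
# Our own minimal model of the Tag Filter (TF) idea: keep the X least
# significant tag bits per way in a small array and probe the full tag
# and data arrays only for ways whose low bits match.
X = 4
LOW_MASK = (1 << X) - 1

class TFSet:
    def __init__(self, ways: int):
        self.tags = [None] * ways            # full tags (tag array)
        self.low  = [0] * ways               # X-bit filter array

    def fill(self, way: int, tag: int) -> None:
        self.tags[way] = tag
        self.low[way] = tag & LOW_MASK

    def lookup(self, tag: int) -> tuple[int, bool]:
        """Return (number of ways actually probed, hit?)."""
        candidates = [w for w in range(len(self.tags))
                      if self.low[w] == tag & LOW_MASK]
        hit = any(self.tags[w] == tag for w in candidates)
        return len(candidates), hit

s = TFSet(ways=8)
for w, t in enumerate(range(0x10, 0x18)):
    s.fill(w, t)
print(s.lookup(0x13))   # only the way whose low 4 bits match is probed: (1, True)
```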
5. Xiang, Ping. "Analyzing Instruction Based Cache Replacement Policies." M.S. thesis, School of Electrical Engineering and Computer Science, University of Central Florida, 2010. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/2589.

Abstract:
The increasing speed gap between microprocessors and off-chip DRAM makes last-level caches (LLCs) a critical component for computer performance. Multi-core processors aggravate the problem, since multiple processor cores compete for the LLC. As a result, LLCs typically consume a significant amount of the die area, and effective utilization of LLCs is mandatory for both performance and power efficiency. We present a novel replacement policy for LLCs. The fundamental observation is to view the LLC as a resource shared among multiple address streams, with each stream being generated by a static memory access instruction. The management of LLCs in both single-core and multi-core processors can then be modeled as a competition among multiple instructions. In our proposed scheme, we prioritize those instructions based on their numbers of LLC accesses and reuses, and only allow cache lines with high instruction priorities to replace those of low priority. The hardware support for our proposed replacement policy is lightweight. Our experimental results, based on a set of SPEC 2006 benchmarks, show that it achieves significant performance improvement over the least-recently-used (LRU) replacement policy for benchmarks with high numbers of LLC misses. To handle LRU-friendly workloads, the set sampling technique is adopted to retain the benefits of the LRU replacement policy.
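The priority rule described here is easy to sketch. The toy model below is our own simplification (the thesis derives priorities from LLC access and reuse statistics rather than a bare reuse counter): a line may only evict a victim of equal or lower priority, otherwise it bypasses the cache.

```python
# Our own simplified model of instruction-priority replacement.
from collections import defaultdict

reuses = defaultdict(int)        # per-PC reuse counter (toy priority source)

def priority(pc: int) -> int:
    return reuses[pc]

class PrioritySet:
    def __init__(self, ways: int):
        self.lines = [None] * ways       # each line is (tag, pc) or None

    def access(self, tag: int, pc: int) -> bool:
        for line in self.lines:
            if line and line[0] == tag:
                reuses[pc] += 1          # reuse observed for this PC
                return True              # hit
        self._insert(tag, pc)
        return False                     # miss

    def _insert(self, tag: int, pc: int) -> None:
        for w, line in enumerate(self.lines):
            if line is None:
                self.lines[w] = (tag, pc)
                return
        # Evict the lowest-priority victim, but only if the incoming
        # instruction's priority is at least as high; otherwise bypass.
        victim = min(range(len(self.lines)),
                     key=lambda w: priority(self.lines[w][1]))
        if priority(pc) >= priority(self.lines[victim][1]):
            self.lines[victim] = (tag, pc)

s = PrioritySet(ways=2)
s.access(0xA, pc=1); s.access(0xA, pc=1)   # PC 1 gains priority via reuse
```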
6. Ibrahim, Mohamed Assem Abd ElMohsen. "Rethinking Cache Hierarchy and Interconnect Design for Next-Generation GPUs." W&M ScholarWorks, 2020. https://scholarworks.wm.edu/etd/1627047836.

Abstract:
To match the increasing computational demands of GPGPU applications and to improve peak compute throughput, the core counts in GPUs have been increasing with every generation. However, the famous memory wall is a major performance determinant in GPUs: in most cases, peak throughput is ultimately dictated by memory bandwidth. Therefore, to serve the memory demands of thousands of concurrently executing threads, GPUs are equipped with several sources of bandwidth, such as on-chip private/shared caching resources and off-chip high-bandwidth memories. However, the existing sources of bandwidth are often not sufficient for achieving optimal GPU performance, so it is important to conserve and improve memory bandwidth utilization. To achieve this goal, this dissertation focuses on improving on-chip cache bandwidth by managing cache line (data) replication across L1 caches via rethinking the cache hierarchy and the interconnect design. Such data replication stems from the private nature of the L1 caches and from inter-core locality: each GPU core can independently request and store a given cache line (in its local L1 cache) while being oblivious to the previous requests of other cores. This dissertation treats inter-core locality (i.e., data replication) as a double-edged sword and proposes the following. First, it shows that efficient inter-core communication can exploit data replication across the L1 caches to unlock an additional potential source of on-chip bandwidth, which we call remote-core bandwidth. We propose to efficiently coordinate the data movement across GPU cores to exploit this remote-core bandwidth by investigating: a) which data is replicated across cores, b) which cores have the replicated data, and c) how to fetch the replicated data as soon as possible. Second, the dissertation shows that if data replication is eliminated (or reduced), the L1 caches can effectively cache more data, leading to higher hit rates and more on-chip bandwidth. We propose a shared L1 cache organization, which restricts each core to caching only a unique slice of the address range, eliminating data replication. We develop lightweight mechanisms to: a) reduce the inter-core communication overheads, and b) identify applications that prefer the private L1 organization, and execute them accordingly. Third, to improve the performance, area, and energy efficiency of the shared L1 organization, the dissertation proposes the DC-L1 (DeCoupled-L1) cache, an L1 cache separated from the GPU core. We show how the decoupled nature of the DC-L1 caches provides an opportunity to aggregate the L1 caches and enables low-overhead, efficient data placement designs. These optimizations reduce data replication across the L1s and increase their bandwidth utilization. Altogether, this dissertation develops several innovative techniques to improve the efficiency of the GPU on-chip memory system, which are necessary to address the memory wall problem. Future work will explore other designs and techniques to improve on-chip bandwidth utilization by considering other bandwidth sources (e.g., the scratchpad and the L2 cache).
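The shared-L1 slicing idea reduces to a home-core hash at cache-line granularity. Below is a sketch under our own assumptions (the line size, core count, and modulo hash are illustrative; any balanced line-granularity hash would do):

```python
# Our own sketch of the address-slicing rule behind a shared-L1
# organization: every cache line maps to exactly one core's L1 slice,
# so no line is ever replicated across L1s.
LINE_SIZE = 128          # bytes per cache line (illustrative)
NUM_CORES = 16

def home_core(addr: int) -> int:
    """Core whose L1 slice owns this line; other cores must ask it."""
    return (addr // LINE_SIZE) % NUM_CORES

def local_l1_access(addr: int) -> str:
    return f"L1[{home_core(addr)}] <- {addr:#x}"

def remote_l1_request(owner: int, addr: int) -> str:
    return f"core {owner} serves {addr:#x}"

def load(core: int, addr: int) -> str:
    owner = home_core(addr)
    if owner == core:
        return local_l1_access(addr)          # hit in the core's own slice
    return remote_l1_request(owner, addr)     # inter-core request

print(load(core=0, addr=0x1000))   # owned by core 0: local access
print(load(core=0, addr=0x1080))   # owned by core 1: remote request
```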
7. Dublish, Saumay Kumar. "Managing the memory hierarchy in GPUs." Thesis, University of Edinburgh, 2018. http://hdl.handle.net/1842/31205.

Abstract:
Pervasive use of GPUs across multiple disciplines is a result of the continuous adaptation of GPU architectures to the needs of upcoming application domains. One such vital improvement is the introduction of the on-chip cache hierarchy, used primarily to filter the high bandwidth demand to off-chip memory. However, in contrast to traditional CPUs, the cache hierarchy in GPUs faces significantly different challenges, such as cache thrashing and bandwidth bottlenecks, arising from small caches and high levels of memory traffic. These challenges lead to severe congestion across the memory hierarchy, resulting in high memory access latencies. In memory-intensive applications, such high memory access latencies often get exposed and can no longer be hidden through multithreading, and therefore adversely impact system performance. In this thesis, we address the inefficiencies across the memory hierarchy in GPUs that lead to such high levels of congestion. We identify three major factors contributing to poor memory system performance: first, disproportionate and insufficient bandwidth resources in the cache hierarchy; second, poor cache management policies; and third, high levels of multithreading. To revitalize the memory hierarchy by addressing the above limitations, we propose a three-pronged approach. First, we characterize the bandwidth bottlenecks present across the memory hierarchy in GPUs and identify the architectural parameters that are most critical in alleviating congestion; subsequently, we explore the architectural design space to mitigate the bandwidth bottlenecks in a cost-effective manner. Second, we identify significant inter-core reuse in GPUs, presenting an opportunity to reuse data among the L1s. We exploit this reuse by connecting the L1 caches with a lightweight ring network to facilitate inter-core communication of shared data, and show that this technique reduces traffic to the L2 cache, freeing up bandwidth for other accesses. Third, we present Poise, a machine learning approach to mitigating cache thrashing and bandwidth bottlenecks by altering the level of multithreading. Poise comprises a supervised learning model, trained offline on a set of profiled kernels to make good warp scheduling decisions, and a hardware inference engine that predicts good warp scheduling decisions at runtime using the model learned during training. In summary, we address the problem of bandwidth bottlenecks across the GPU memory hierarchy by exploring how best to scale, supplement, and utilize the existing bandwidth resources. These techniques provide an effective and comprehensive methodology for mitigating the bandwidth bottlenecks in the GPU memory hierarchy.
8. Sohoni, Sohum. "Improving L2 Cache Performance through Stream-Directed Optimizations." University of Cincinnati / OhioLINK, 2004. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1092932892.
9. Delgado, Nuno Miguel de Brito. "A system's approach to cache hierarchy-aware decomposition of data-parallel computations." Master's thesis, Faculdade de Ciências e Tecnologia, 2014. http://hdl.handle.net/10362/13014.

Abstract:
Dissertation submitted to obtain the degree of Master in Computer Engineering.
The architecture of today's processors is very complex, comprising several computational cores and an intricate hierarchy of cache memories. The latter, in particular, differ considerably between the many processors currently available on the market, resulting in a wide variety of configurations. Application development is typically oblivious to this complexity and diversity, taking into consideration only the number of available execution cores. This oblivion prevents such applications from fully harnessing the computing power available in these architectures. The problem has been recognized by the community, which has proposed languages and models to express and tune applications according to the underlying machine's hierarchy. These, however, lack the desired level of abstraction, forcing the programmer to have deep knowledge of computer architecture and parallel programming in order to ensure performance portability across a wide range of architectures. Given these limitations, the goal of this thesis is to delegate these hierarchy-aware optimizations to the runtime system. Accordingly, the programmer's responsibilities are confined to the definition of procedures for decomposing an application's domain into an arbitrary number of partitions; the programmer has only to reason about the application's data representation and manipulation. We prototyped our proposal on top of a Java parallel programming framework and evaluated it from a performance perspective against cache-neglectful domain decompositions. The results demonstrate that our optimizations deliver significant speedups over decomposition strategies based solely on the number of execution cores, without requiring the programmer to reason about the machine's hardware. These facts allow us to conclude that it is possible to obtain performance gains by transferring hierarchy-aware optimization concerns to the runtime system.
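The kind of hierarchy-aware decomposition evaluated in this dissertation can be illustrated with a size calculation. The sketch below is ours (the cache size and the fit-in-L1 policy are assumptions, not the thesis's runtime system): it picks the partition count so that each chunk fits a target cache level rather than matching the core count alone.

```python
# Our own toy illustration of cache-hierarchy-aware decomposition:
# split a data-parallel domain so each chunk fits a chosen cache level,
# instead of producing exactly one chunk per core.
L1_DATA_BYTES = 32 * 1024   # assumed target cache level; a runtime
                            # system would query this from the machine

def partitions(num_elems: int, elem_bytes: int, num_cores: int,
               cache_bytes: int = L1_DATA_BYTES) -> int:
    """At least one partition per core, and each partition's working
    set small enough to fit in the target cache level."""
    fit = -(-num_elems * elem_bytes // cache_bytes)   # ceiling division
    return max(num_cores, fit)

# 8 Mi doubles on 8 cores: a core-count-only split gives 8 chunks of
# 8 MB each (cache-thrashing); the cache-aware split gives 2048 chunks
# of 32 KB, which are then scheduled over the 8 cores.
print(partitions(num_elems=8 * 2**20, elem_bytes=8, num_cores=8))  # 2048
```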

Books on the topic "Cache hierarchy"

1. Przybylski, Steven A. Cache and Memory Hierarchy Design: A Performance-Directed Approach. San Mateo, Calif.: Morgan Kaufmann Publishers, 1990.
2. Park, Won-Ho. Tagfilter: A Power-Aware Tag Hierarchy for High-Level Caches. Ottawa: National Library of Canada, 2003.
3. Cache and Memory Hierarchy Design. Elsevier, 1990. http://dx.doi.org/10.1016/c2009-0-27582-9.

Book chapters on the topic "Cache hierarchy"

1. Candel, Francisco, Salvador Petit, Alejandro Valero, and Julio Sahuquillo. "Improving GPU Cache Hierarchy Performance with a Fetch and Replacement Cache." In Euro-Par 2018: Parallel Processing, 235–48. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-319-96983-1_17.
2. Liu, Rui-fang, Change-sheng Xie, Zhi-hu Tan, and Qing Yang. "A New Hierarchy Cache Scheme Using RAM and Pagefile." In Advances in Computer Systems Architecture, 515–26. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004. http://dx.doi.org/10.1007/978-3-540-30102-8_43.
3. Wang, Weixun, Prabhat Mishra, and Sanjay Ranka. "Energy Optimization of Cache Hierarchy in Multicore Real-Time Systems." In Dynamic Reconfiguration in Real-Time Systems, 63–84. New York, NY: Springer New York, 2012. http://dx.doi.org/10.1007/978-1-4614-0278-7_4.
4. Machanick, Philip, and Zunaid Patel. "L1 Cache and TLB Enhancements to the RAMpage Memory Hierarchy." In Advances in Computer Systems Architecture, 305–19. Berlin, Heidelberg: Springer Berlin Heidelberg, 2003. http://dx.doi.org/10.1007/978-3-540-39864-6_25.
5. Krietemeyer, Michael, Daniel Versick, and Djamshid Tavangarian. "A Mathematical Model for the Transitional Region Between Cache Hierarchy Levels." In Innovative Internet Community Systems, 178–88. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006. http://dx.doi.org/10.1007/11553762_18.
6. Ma, Zhe, Trevor Carlson, Wim Heirman, and Lieven Eeckhout. "Evaluating Application Vulnerability to Soft Errors in Multi-level Cache Hierarchy." In Euro-Par 2011: Parallel Processing Workshops, 272–81. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012. http://dx.doi.org/10.1007/978-3-642-29740-3_31.
7. Li, Hai, Zhenyu Sun, Xiuyuan Bi, Weng-Fai Wong, Xiaochun Zhu, and Wenqing Wu. "STT-RAM Cache Hierarchy Design and Exploration with Emerging Magnetic Devices." In Emerging Memory Technologies, 169–99. New York, NY: Springer New York, 2013. http://dx.doi.org/10.1007/978-1-4419-9551-3_7.
8. Silva-Filho, A. G., F. R. Cordeiro, R. E. Sant'Anna, and M. E. Lima. "Heuristic for Two-Level Cache Hierarchy Exploration Considering Energy Consumption and Performance." In Lecture Notes in Computer Science, 75–83. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006. http://dx.doi.org/10.1007/11847083_8.
9. Chen, Naikuo, Zhilou Yu, and Ruidong Zhao. "A Hybrid Memory Hierarchy to Improve Cache Reliability with Non-volatile STT-RAM." In Lecture Notes in Computer Science, 459–68. Cham: Springer International Publishing, 2017. http://dx.doi.org/10.1007/978-3-319-52015-5_47.
10. Novac, O., St Vari-Kakas, Mihaela Novac, Ecaterina Vladu, and Liliana Indrie. "Dependability Aspects Regarding the Cache Level of a Memory Hierarchy Using Hamming Codes." In Innovations in Computing Sciences and Software Engineering, 567–70. Dordrecht: Springer Netherlands, 2010. http://dx.doi.org/10.1007/978-90-481-9112-3_98.

Conference papers on the topic "Cache hierarchy"

1. Yavits, Leonid, Amir Morad, and Ran Ginosar. "3D cache hierarchy optimization." In 2013 IEEE International 3D Systems Integration Conference (3DIC). IEEE, 2013. http://dx.doi.org/10.1109/3dic.2013.6702346.
2. Van Laer, Anouk, William Wang, and Chris Emmons. "Inefficiencies in the Cache Hierarchy." In MEMSYS '15: International Symposium on Memory Systems. New York, NY, USA: ACM, 2015. http://dx.doi.org/10.1145/2818950.2818980.
3. Sivaramakrishnan, Ram, and Sumti Jairath. "Next generation SPARC processor cache hierarchy." In 2014 IEEE Hot Chips 26 Symposium (HCS). IEEE, 2014. http://dx.doi.org/10.1109/hotchips.2014.7478828.
4. Khairy, Mahmoud, Mohamed Zahran, and Amr G. Wassal. "Efficient utilization of GPGPU cache hierarchy." In GPGPU-8: General-Purpose Processing with Graphics Processing Units 8. New York, NY, USA: ACM, 2015. http://dx.doi.org/10.1145/2716282.2716291.
5. Yan, Mengjia, Bhargava Gopireddy, Thomas Shull, and Josep Torrellas. "Secure Hierarchy-Aware Cache Replacement Policy (SHARP)." In ISCA '17: The 44th Annual International Symposium on Computer Architecture. New York, NY, USA: ACM, 2017. http://dx.doi.org/10.1145/3079856.3080222.
6. Kenyon, Samantha, Sonia Lopez Alarcon, and Julio Sahuquillo. "Impact of Partitioning Cache Schemes on the Cache Hierarchy of SMT Processors." In 2015 IEEE 17th International Conference on High-Performance Computing and Communications; 2015 IEEE 7th International Symposium on Cyberspace Safety and Security; and 2015 IEEE 12th International Conference on Embedded Software and Systems. IEEE, 2015. http://dx.doi.org/10.1109/hpcc-css-icess.2015.127.
7. Gordon-Ross, Ann, Jeremy Lau, and Brad Calder. "Phase-based cache reconfiguration for a highly-configurable two-level cache hierarchy." In Proceedings of the 18th ACM Great Lakes Symposium on VLSI. New York, New York, USA: ACM Press, 2008. http://dx.doi.org/10.1145/1366110.1366200.
8. Park, Jongsoo, Richard M. Yoo, Daya S. Khudia, Christopher J. Hughes, and Daehyun Kim. "Location-aware cache management for many-core processors with deep cache hierarchy." In SC13: International Conference for High Performance Computing, Networking, Storage and Analysis. New York, NY, USA: ACM, 2013. http://dx.doi.org/10.1145/2503210.2503224.
9. Gupta, Vishal, Vinod Ganesan, and Biswabandan Panda. "Seclusive Cache Hierarchy for Mitigating Cross-Core Cache and Coherence Directory Attacks." In 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2021. http://dx.doi.org/10.23919/date51398.2021.9474168.
10. Srikantaiah, Shekhar, Emre Kultursay, Tao Zhang, Mahmut Kandemir, Mary Jane Irwin, and Yuan Xie. "MorphCache: A Reconfigurable Adaptive Multi-level Cache Hierarchy." In 2011 IEEE 17th International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2011. http://dx.doi.org/10.1109/hpca.2011.5749732.